LHCOPN operation phoneconf 2010-10-05

Participation

Sites represented:

  • CH-CERN (John Shade, Edoardo Martelli)
  • DE-KIT (Aurelie Reymund)
  • FR-CCIN2P3 (Guillaume Cessieux)
  • NDGF (Denis Walberg)
  • NL-T1 (Hanno Pet)
  • TW-ASGC (Wen-Shui Chen)
  • US-T1-BNL (John Bigrow)

Absent:

  • CA-TRIUMF
  • UK-T1-RAL
  • US-FNAL-CMS

Apologies:

  • IT-INFN-CNAF (Stefano Zani)
  • ES-PIC (Fernando Lopez)

John noted that RAL has been absent for at least the last three teleconferences, and wondered why they have such difficulty joining.

Operations Overview

During the time window [2010-07-01, 2010-09-30]:

  • 55 tickets: 25 ML2, 18 IL2 (43 tickets for L2 = 78%!), 7 ML3, 2 IL3, 2 info, 1 test
  • 48 tickets (87%) are about connectivity issues
  • Only 8 tickets (14%) seem really service impacting ('Loss of service')
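The figures above can be re-derived from the per-category counts. The following is a minimal sketch, not part of the original minutes: the helper name `whole_percent` is hypothetical, and truncation to a whole percent is an assumption, chosen because it reproduces all three quoted percentages.

```python
# Ticket counts by category, copied from the summary above.
tickets = {"ML2": 25, "IL2": 18, "ML3": 7, "IL3": 2, "info": 2, "test": 1}

total = sum(tickets.values())
l2_total = tickets["ML2"] + tickets["IL2"]


def whole_percent(count: int, out_of: int) -> int:
    """Share of `count` in `out_of`, truncated to a whole percent (assumed convention)."""
    return 100 * count // out_of


print(f"total tickets: {total}")
print(f"L2 share: {whole_percent(l2_total, total)}%")
print(f"connectivity share: {whole_percent(48, total)}%")
print(f"service-impacting share: {whole_percent(8, total)}%")
```

The counts 48 and 8 are taken directly from the bullets above; the minutes do not say which ticket categories they draw from.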

Distribution of ticket assignments was as follows:

ownership.png

No tickets for UK-T1-RAL and US-T1-BNL, is that ok?

Long-standing open issues: [LHCOPN dashboard here: https://gus.fzk.de/pages/all_lhcopn.php ]

Sites were asked to carefully check the aforementioned tickets before the upcoming LHCOPN meeting on Thursday.

Operations KPIs

Events matching per T1s:

matching.png

Regular backup tests:

2010 report is here: https://twiki.cern.ch/twiki/bin/view/LHCOPN/LhcopnBackupTestsResults2010

  • Missing entries for CA-TRIUMF, CH-CERN, DE-KIT, IT-INFN-CNAF, NDGF, NL-T1, TW-ASGC, US-FNAL-CMS, US-T1-BNL
  • I.e. only three sites (ES-PIC, FR-CCIN2P3 and UK-T1-RAL) have reported something.
  • Reboot of CERN's router is a good use case

Site Reports

CA-TRIUMF

CH-CERN

  • ACLs issue: In some cases a software bug prevents the hardware memory of the routers from being correctly updated when an ACL is changed. This affected a prefix recently added by NDGF, which was unable to communicate with RAL and BNL. CERN is validating a new software release and will deploy it soon. Two maintenances will be called. Detecting this took a lot of time and required the use of 10G sniffers.
  • RAL MDM prefix: the prefix of the perfSONAR MDM at RAL has not yet been received. Working with RAL to sort this out. RAL should (must) open a ticket to have this tracked.

DE-KIT

  • 21 GGUS tickets related to DE-KIT:
    • 19 DE-KIT impacted
    • 1 indirectly concerned (#59997: CERN-Milan down, backup via KIT ok)
    • 1 error (#61582)
  • details:
    • ML2: 10 + 1 (wrong ticket category for #62238)
    • ML3: 3
    • IL2: 4
      • + #61582: error, can be ignored
      • + #59997: KIT not directly impacted
    • IL3: 1 (#62381)

  • the maintenance tickets for the link DE-KIT -> NL-T1 should be opened by DE-KIT (responsible for the link)
  • a lot of flaps (< 1 minute) seen on several links during this period (July-September) -> tickets were not opened for all of them (night, weekend, short duration...)

ES-PIC

  • A few events since last meeting

  • 3 not correlated to any GGUS by the system:
    • 1268 ES-PIC 2010-08-28 23:10:01 2010-08-29 01:37:01 No info provided by RedIRIS
    • 1262 ES-PIC 2010-08-23 10:00:01 2010-08-23 20:50:01 No info provided by RedIRIS
    • 1235 ES-PIC 2010-07-25 23:00:01 2010-07-26 09:52:01 [IRIS-NOC #10047] Geant fiber cut. No GGUS opened

FR-CCIN2P3

  • A single service-impacting scheduled event: major network upgrade at FR-CCIN2P3 on 2010-09-21

Events > 1 minute:

  • CERN-IN2P3-LHCOPN-001
    • 11 small flaps (< 4 min) in July, no reason found
    • 2010-09-14 03:47:31 - 18:53:05 Link very unstable, #RENATER-2013370, replaced faulty card on an optical device in Geneva
    • 2010-09-21 19:04:20 - 19:26:38 #GGUS-61949, network upgrade at FR-CCIN2P3
  • GRIDKA-IN2P3-LHCOPN-001
    • 2010-08-26 11:59:38 - 2010-08-27 22:46:53 #GGUS-61584, fibre cut between Lyon and Dijon
    • 2010-09-21 19:04:20 - 19:26:49 #GGUS-61949, network upgrade at FR-CCIN2P3
    • 2010-09-30 00:44:03 - 01:05:43, no reason found

AOB:

  • Working with NL-T1 on routing policies; these are very unclear, so it is unknown whether they are OK for everyone
  • Traceroute seems to be filtered at some hop in DE-KIT, making it difficult to troubleshoot routing problems. Hanno reported that US-FNAL-CMS (Vyto) had fixed a similar issue, and Aurelie agreed to contact them for details.

IT-INFN-CNAF

NDGF

Events pulled from NORDUnet's ticket system:

Unscheduled events:

Start: 09.07.2010 05:03 UTC End: 21.07.2010 14:53 UTC NORDUNETTICKET-568 GGUS 60092 Backup connection. The outage was caused by faulty hardware in the SARA router; a workaround is now in place and the link has been stable since then.

Start: 13.07.2010 12:53 UTC End: 13.07.2010 12:54 UTC NORDUNETTICKET-574 Primary connection The link between Copenhagen and Geneva flapped three times. The flaps could not be confirmed by GEANT. It remains unclear what caused the flaps.

Start: 26.07.2010 09:58 UTC End: 26.07.2010 09:59 UTC NORDUNETTICKET-604 Primary connection The link between NDGF and CERN flapped once. GEANT was informed, but their monitoring tools didn't detect the flap. The circuit has been stable since the incident.

Start: 11.08.2010 14:36 UTC End: 11.08.2010 14:37 UTC NORDUNETTICKET-636 GGUS 61122 Primary connection. The outage was due to a hardware failure in GEANT's network. No more flaps have been noticed.

Start: 16.08.2010 19:41 UTC End: 16.08.2010 19:57 UTC Start: 18.08.2010 18:27 UTC End: 18.08.2010 18:53 UTC NORDUNETTICKET-644 GGUS 61281 GEANT 7766, 7862 Primary connection Emergency maintenance which was not announced to the community by mistake.

Start: 20.08.2010 08:26 UTC End: 21.08.2010 01:14 UTC NORDUNETTICKET-650 GGUS n/a Backup connection. A fiber break near Hamburg caused input power loss in multiple transponders. The fiber has been restored by Netherlight's carrier.

Start: 25.08.2010 07:06 UTC End: 25.08.2010 07:32 UTC NORDUNETTICKET-657 GGUS 61540 GEANT 7997 Primary connection. There were a few flaps on the link between NDGF and CERN. GEANT noticed them as well, but couldn't find the cause.

Start: 25.09.2010 14:24 UTC End: 26.09.2010 10:32 UTC NORDUNETTICKET-716 NORDUNETTICKET-718 GGUS n/a Both primary and backup connection. A major fiber cut on the stretch between Lehnsann and Neindorf caused the outage. All the services are up and running after fiber repair. Note: NORDUnet will try to accommodate redundancy by making the necessary installations next time we are in Hamburg.
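Outage durations can be read off the Start/End stamps in the entries above. The following is a minimal sketch, assuming the `DD.MM.YYYY HH:MM UTC` layout used in this list; the names `PAIR`, `FMT` and `outage_durations` are hypothetical and not part of any NORDUnet tooling.

```python
import re
from datetime import datetime, timedelta

# One "Start: ... UTC End: ... UTC" pair, in the layout used in the entries above.
PAIR = re.compile(
    r"Start: (\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}) UTC "
    r"End: (\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}) UTC"
)
FMT = "%d.%m.%Y %H:%M"


def outage_durations(entry: str) -> list:
    """Return one timedelta per Start/End pair found in a single entry."""
    # Both stamps are UTC, so subtracting naive datetimes is safe here.
    return [
        datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
        for start, end in PAIR.findall(entry)
    ]


# The NORDUNETTICKET-644 entry carries two Start/End pairs.
entry = (
    "Start: 16.08.2010 19:41 UTC End: 16.08.2010 19:57 UTC "
    "Start: 18.08.2010 18:27 UTC End: 18.08.2010 18:53 UTC NORDUNETTICKET-644"
)
for duration in outage_durations(entry):
    print(duration)
```

Returning a list rather than a single value keeps entries that report several windows under one ticket intact.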

Scheduled events:

Start: 27.09.2010 22:46 UTC End: 27.09.2010 23:29 UTC NORDUNETTICKET-685 GGUS 62444 GEANT 8318 Our supplier performed a planned maintenance on our link between Denmark and Germany.

The event affecting both the primary and secondary paths had only a performance impact, as the backup through the internet worked fine.

NL-T1

From 1-7-2010 until 30-9-2010 we achieved the following availability for the NL-T1 LHCOPN circuits:

  • asgc-sara.r0.ams.asgc.net 99.981%
  • dk-ndgf.nordu.net 96.830%
  • gw-sara-local.lhc-tier1.triumf.ca 99.817%
  • l513-c-rftec-2-be3.cern.ch 99.753%
  • r-inet-gis-i-b1-1-sara.gridka.de 99.064%
  • vlan2613-2.r-s-starlight-fnal.fnal.gov 99.285%
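To put these percentages in perspective, they can be converted to approximate downtime. The following is a minimal sketch, assuming the reporting window runs from 1-7-2010 through 30-9-2010 inclusive and that availability is measured against wall-clock time; the minutes do not state how the figures were computed, and `downtime_hours` is a hypothetical helper.

```python
from datetime import date

# Availability per circuit in percent, copied from the list above.
availability = {
    "asgc-sara.r0.ams.asgc.net": 99.981,
    "dk-ndgf.nordu.net": 96.830,
    "gw-sara-local.lhc-tier1.triumf.ca": 99.817,
    "l513-c-rftec-2-be3.cern.ch": 99.753,
    "r-inet-gis-i-b1-1-sara.gridka.de": 99.064,
    "vlan2613-2.r-s-starlight-fnal.fnal.gov": 99.285,
}

# Assumption: the window covers 2010-07-01 through 2010-09-30 inclusive.
window_hours = (date(2010, 10, 1) - date(2010, 7, 1)).days * 24


def downtime_hours(percent_available: float) -> float:
    """Hours of unavailability implied by an availability percentage."""
    return (100.0 - percent_available) / 100.0 * window_hours


# List the circuits from least to most available.
for circuit, pct in sorted(availability.items(), key=lambda item: item[1]):
    print(f"{circuit:42s} {pct:7.3f}%  ~{downtime_hours(pct):5.1f} h down")
```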

For all GGUS tickets assigned to NL-T1 between 1-7-2010 and 30-9-2010 this is the report based on the CSV export:

  • 17 GGUS tickets
  • IL2, 4
  • IL3, 2
  • Info, 1
  • ML2, 8
  • ML3, 2
  • Closed 13 (on 30-9-2010)

Link related problems:

  • CERN-SARA-LHCOPN-001 (3x ML2, 1x IL2)
  • SARA-TRIUMF-LHCOPN-001 (2x ML3, 1x IL2)
  • GRIDKA-SARA-LHCOPN-001 (2x ML2, 2x IL2)
  • NDGF-SARA-LHCOPN-001 (1x ML3, 1x info)

One major IL2 on NDGF-SARA-LHCOPN-001 took us six days to diagnose, together with NDGF, NORDUnet and SURFnet:

The problem was caused by the interface card of our Juniper router.
After six days (20-7-2010) we put a workaround in place, after which the link was stable again.
At the end of August we replaced the card in our router. This solved the issue definitively and made the workaround obsolete. We forgot to update the change management database for this (I have done this today).

Hanno Pet -- NL-T1 / SARA 5-10-2010

Some complex issues were very hard to troubleshoot: each component was fine individually, but the combination did not work.

TW-ASGC

A long BGP flap occurred in September, but no clear cause was identified. After checking, Wenshui reported by e-mail that it was not service affecting.

UK-T1-RAL

US-FNAL-CMS

US-T1-BNL

The performance issue with IT-INFN-CNAF has not yet reached any kind of conclusion. A test box should be set up to move forward. (N.B.: this is not an LHCOPN issue.)

AOB

  • The layout of the page about routing policies has changed: https://twiki.cern.ch/twiki/bin/view/LHCOPN/RoutingPolicies , please have a look and fill it in for your site. In two weeks, tickets will be opened for sites that have not filled it in.
  • Is there anything in particular to report about Operations at the next LHCOPN meeting: http://indico.cern.ch/conferenceDisplay.py?confId=102716 ?
    • At the next LHCOPN meeting, WLCG will put several topics on the table
      • They have no network support unit in GGUS (Not LHCOPN related)
      • We often fail to update tickets regularly, giving the impression that nothing is being done
      • Responsibilities are unclear for some network issues (not LHCOPN related)
      • The separate helpdesk we have in GGUS is isolating us from WLCG and also makes interaction with supporters in the standard GGUS more difficult. Is this justified?
  • Are there LHCOPN issues raised by WLCG which have no related ticket in the LHCOPN helpdesk?
    • It seems not
  • UK-T1-RAL does not seem very responsive or active enough in the LHCOPN TTS; to be discussed with Nick at the next LHCOPN meeting

Next Ops Phoneconf

The next Teleconference was scheduled for Tuesday, January 11th, 2011 - 16:30 Geneva time.

Topic revision: r13 - 2010-11-24 - GuillaumeCessieux
 