LHCOPN operation phoneconf 2010-10-05
Participation
Sites represented:
- CH-CERN (John Shade, Edoardo Martelli)
- DE-KIT (Aurelie Reymund)
- FR-CCIN2P3 (Guillaume Cessieux)
- NDGF (Denis Walberg)
- NL-T1 (Hanno Pet)
- TW-ASGC (Wen-Shui Chen)
- US-T1-BNL (John Bigrow)
Absent:
- CA-TRIUMF
- UK-T1-RAL
- US-FNAL-CMS
Apologies:
- IT-INFN-CNAF (Stefano Zani)
- ES-PIC (Fernando Lopez)
John noted that RAL has been absent for at least the last three teleconferences, and wondered why they have such difficulty joining.
Operations Overview
During the time window [2010-07-01, 2010-09-30]:
- 55 tickets: 25 ML2, 18 IL2 (43 tickets for L2 = 78%!), 7 ML3, 2 IL3, 2 info, 1 test
- 48 tickets (87%) are about connectivity issues
- Only 8 tickets (14%) seem really service impacting ('Loss of service')
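The shares quoted above can be re-derived from the raw counts. The following is only a minimal sketch: the dictionary and the `share` helper are illustrative names, not part of any LHCOPN tooling, and the counts are copied by hand from the list above.

```python
# Ticket counts per category, copied from the summary above.
counts = {"ML2": 25, "IL2": 18, "ML3": 7, "IL3": 2, "info": 2, "test": 1}


def share(part: int, total: int) -> float:
    """Return part/total expressed as a percentage."""
    return 100.0 * part / total


total = sum(counts.values())          # all tickets in the time window
l2 = counts["ML2"] + counts["IL2"]    # maintenance + incident tickets at L2

print(f"total tickets: {total}")
print(f"L2 share: {share(l2, total):.1f}%")
print(f"connectivity share: {share(48, total):.1f}%")
print(f"service-impacting share: {share(8, total):.1f}%")
```

The percentages given in the list above correspond to these values truncated to whole percent.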
Distribution of ticket assignments was as follows:
No tickets for UK-T1-RAL and US-T1-BNL, is that ok?
Long-standing open issues: [LHCOPN dashboard here:
https://gus.fzk.de/pages/all_lhcopn.php ]
Sites were asked to carefully check the aforementioned tickets before the upcoming LHCOPN meeting on Thursday.
Operations KPIs
Events matching per T1s:
Regular backup tests:
2010 report is here:
https://twiki.cern.ch/twiki/bin/view/LHCOPN/LhcopnBackupTestsResults2010
- Missing entries for CA-TRIUMF, CH-CERN, DE-KIT, IT-INFN-CNAF, NDGF, NL-T1, TW-ASGC, US-FNAL-CMS, US-T1-BNL
- I.e. only three sites (ES-PIC, FR-CCIN2P3 and UK-T1-RAL) have reported something.
- Reboot of CERN's router is a good use case
Site Reports
CA-TRIUMF
CH-CERN
- ACLs issue: In some cases a software bug prevents the hardware memory of the routers from being correctly updated when an ACL is changed. This affected a prefix recently added by NDGF, which was unable to communicate with RAL and BNL. CERN is validating a new software release and will deploy it soon. Two maintenances will be called. A lot of time was needed to detect this, and it required the use of 10G sniffers.
- RAL MDM prefix: the prefix of the Perfsonar MDM at RAL is not yet received. Working with RAL to sort this out. RAL should (must) open a ticket to have this tracked.
DE-KIT
- 21 GGUS tickets related to DE-KIT:
- 19 DE-KIT impacted
- 1 indirectly concerned (#59997: Cern-Milan down, backup via KIT ok)
- 1 error (#61582)
- details:
- ML2: 10 + 1 (wrong ticket category for #62238)
- ML3: 3
- IL2: 4
- + #61582: error, can be ignored
- + #59997: KIT not directly impacted
- IL3: 1 (#62381)
- the maintenance tickets for the link DE-KIT -> NL-T1 should be opened by DE-KIT (responsible for the link)
- a lot of flaps (< 1 minute) seen on several links during this period (July-September) -> tickets were not opened for all of them (night, week-end, short period...)
ES-PIC
- A few events since last meeting
- 3 not correlated to any GGUS by the system:
- 1268 ES-PIC 2010-08-28 23:10:01 2010-08-29 01:37:01 No info provided by RedIRIS
- 1262 ES-PIC 2010-08-23 10:00:01 2010-08-23 20:50:01 No info provided by RedIRIS
- 1235 ES-PIC 2010-07-25 23:00:01 2010-07-26 09:52:01 [IRIS-NOC #10047] Geant fiber cut. No GGUS opened
FR-CCIN2P3
- 1 single service impacting scheduled event: Major network upgrade at FR-CCIN2P3 2010-09-21
Events > 1 minute:
- CERN-IN2P3-LHCOPN-001
- 11 small flaps (< 4 min) in July, no reason found
- 2010-09-14 03:47:31 - 18:53:05 Link very unstable, #RENATER-2013370, replaced faulty card on an optical device in Geneva
- 2010-09-21 19:04:20 - 19:26:38 #GGUS-61949, network upgrade at FR-CCIN2P3
- GRIDKA-IN2P3-LHCOPN-001
- 2010-08-26 11:59:38 - 2010-08-27 22:46:53 #GGUS-61584, fibre cut between Lyon and Dijon
- 2010-09-21 19:04:20 - 19:26:49 #GGUS-61949, network upgrade at FR-CCIN2P3
- 2010-09-30 00:44:03 - 01:05:43, no reason found
AOB:
- Working with NL-T1 on routing policies; the current policies are very unclear, so it is unknown whether they are ok for everyone
- Traceroute seems filtered at some hop in DE-KIT making troubleshooting routing problems difficult. Hanno reported that US-FNAL-CMS (Vyto) had fixed a similar issue, and Aurelie agreed to contact them for details.
IT-INFN-CNAF
NDGF
Events pulled from NORDUnet's ticket system:
Unscheduled events:
Start: 09.07.2010 05:03 UTC
End: 21.07.2010 14:53 UTC
NORDUNETTICKET-568
GGUS 60092
Backup connection
The outage was caused by faulty hardware in the SARA router. A workaround is now in place and the link has been stable since then.
Start: 13.07.2010 12:53 UTC
End: 13.07.2010 12:54 UTC
NORDUNETTICKET-574
Primary connection
The link between Copenhagen and Geneva flapped three times. The flaps could not be confirmed by GEANT. It remains unclear what caused the flaps.
Start: 26.07.2010 09:58 UTC
End: 26.07.2010 09:59 UTC
NORDUNETTICKET-604
Primary connection
The link between NDGF and CERN flapped once. GEANT was informed, but their monitoring tools didn't detect the flap. The circuit has been stable since the incident.
Start: 11.08.2010 14:36 UTC
End: 11.08.2010 14:37 UTC
NORDUNETTICKET-636
GGUS 61122
Primary connection
The outage was due to a hardware failure in GEANT's network. No more flaps have been noticed.
Start: 16.08.2010 19:41 UTC
End: 16.08.2010 19:57 UTC
Start: 18.08.2010 18:27 UTC
End: 18.08.2010 18:53 UTC
NORDUNETTICKET-644
GGUS 61281
GEANT 7766, 7862
Primary connection
Emergency maintenance which, by mistake, was not announced to the community.
Start: 20.08.2010 08:26 UTC
End: 21.08.2010 01:14 UTC
NORDUNETTICKET-650
GGUS n/a
Backup connection.
A fiber break near Hamburg caused input power loss in multiple transponders. The fiber has been restored by Netherlight's carrier.
Start: 25.08.2010 07:06 UTC
End: 25.08.2010 07:32 UTC
NORDUNETTICKET-657
GGUS 61540
GEANT 7997
Primary connection.
There were a few flaps on the link between NDGF and CERN. GEANT noticed them as well, but couldn't find the cause.
Start: 25.09.2010 14:24 UTC
End: 26.09.2010 10:32 UTC
NORDUNETTICKET-716
NORDUNETTICKET-718
GGUS n/a
Both Primary and backup connection.
A major fiber cut in the stretch between Lehnsann and Neindorf caused the outage. All the services are up and running after fiber repair.
Note: NORDUnet will try to accommodate redundancy by making the necessary installations next time we are in Hamburg.
Scheduled events:
Start: 27.09.2010 22:46 UTC
End: 27.09.2010 23:29 UTC
NORDUNETTICKET-685
GGUS 62444
GEANT 8318
Our supplier performed a planned maintenance on our link between Denmark and Germany.
The event affected both the primary and secondary paths but had only a performance impact, as the backup through the Internet worked fine.
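The Start/End pairs in the NDGF event list use the form `DD.MM.YYYY HH:MM UTC`, so outage durations can be computed directly from them. The following is a minimal sketch, assuming that exact timestamp format; `parse_utc` and `duration` are illustrative helper names, and the three events are copied by hand from the list above.

```python
from datetime import datetime, timedelta, timezone

FMT = "%d.%m.%Y %H:%M"  # day.month.year hour:minute, as used in the list above


def parse_utc(stamp: str) -> datetime:
    """Parse 'DD.MM.YYYY HH:MM UTC' into a timezone-aware datetime."""
    text = stamp.replace(" UTC", "").strip()
    return datetime.strptime(text, FMT).replace(tzinfo=timezone.utc)


def duration(start: str, end: str) -> timedelta:
    """Return the elapsed time between two timestamps."""
    return parse_utc(end) - parse_utc(start)


# A few of the longer unscheduled events from the list above.
events = [
    ("NORDUNETTICKET-568", "09.07.2010 05:03 UTC", "21.07.2010 14:53 UTC"),
    ("NORDUNETTICKET-650", "20.08.2010 08:26 UTC", "21.08.2010 01:14 UTC"),
    ("NORDUNETTICKET-716", "25.09.2010 14:24 UTC", "26.09.2010 10:32 UTC"),
]

for ticket, start, end in events:
    print(ticket, duration(start, end))
```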
NL-T1
From 1-7-2010 until 30-9-2010 we achieved the following availability for the NL-T1 LHCOPN circuits:
- asgc-sara.r0.ams.asgc.net 99.981%
- dk-ndgf.nordu.net 96.830%
- gw-sara-local.lhc-tier1.triumf.ca 99.817%
- l513-c-rftec-2-be3.cern.ch 99.753%
- r-inet-gis-i-b1-1-sara.gridka.de 99.064%
- vlan2613-2.r-s-starlight-fnal.fnal.gov 99.285%
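To put these percentages in perspective, they can be converted into approximate downtime. This is only a sketch under an assumption not stated in the report: that availability is measured around the clock over the full period from 1-7-2010 to 30-9-2010 inclusive. The `downtime` helper is an illustrative name.

```python
from datetime import date, timedelta

# Assumption: the reporting period is 2010-07-01 to 2010-09-30 inclusive,
# measured 24 hours a day (92 days in total).
period = timedelta(days=(date(2010, 10, 1) - date(2010, 7, 1)).days)

# Availability figures copied from the list above (percent).
availability = {
    "asgc-sara.r0.ams.asgc.net": 99.981,
    "dk-ndgf.nordu.net": 96.830,
    "gw-sara-local.lhc-tier1.triumf.ca": 99.817,
    "l513-c-rftec-2-be3.cern.ch": 99.753,
    "r-inet-gis-i-b1-1-sara.gridka.de": 99.064,
    "vlan2613-2.r-s-starlight-fnal.fnal.gov": 99.285,
}


def downtime(percent_available: float, period: timedelta) -> timedelta:
    """Return the unavailable part of the period for a given availability."""
    return period * (1.0 - percent_available / 100.0)


for circuit, pct in availability.items():
    hours = downtime(pct, period).total_seconds() / 3600
    print(f"{circuit}: about {hours:.1f} h unavailable")
```

Under this assumption, the 96.830% figure for the NDGF circuit corresponds to roughly 70 hours of unavailability over the quarter.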
For all GGUS tickets assigned to NL-T1 between 1-7-2010 and 30-9-2010 this is the report based on the CSV export:
- 17 GGUS tickets
- IL2, 4
- IL3, 2
- Info, 1
- ML2, 8
- ML3, 2
- Closed 13 (on 30-9-2010)
Link related problems:
- CERN-SARA-LHCOPN-001 (3x ML2, 1x IL2)
- SARA-TRIUMF-LHCOPN-001 (2x ML3, 1x IL2)
- GRIDKA-SARA-LHCOPN-001 (2x ML2, 2x IL2)
- NDGF-SARA-LHCOPN-001 (1x ML3, 1x info)
One major IL2 on NDGF-SARA-LHCOPN-001 took us 6 days to diagnose together with NDGF, NORDUnet and SURFnet:
The problem was caused by the interface card of our Juniper router.
After six days (20-7-2010) we put a workaround in place, after which the link was stable again.
At the end of August we replaced the card in our router:
This solved the issue definitively and made the workaround obsolete.
We forgot to update the change management database for this (I have done this today).
Hanno Pet -- NL-T1 / SARA 5-10-2010
Some complex issues were very hard to troubleshoot: each component was fine individually, but the combination did not work.
TW-ASGC
Long BGP flap in September, but no clear cause was identified. After checking, Wenshui reported by e-mail that this was not service affecting.
UK-T1-RAL
US-FNAL-CMS
US-T1-BNL
The performance issue with IT-INFN-CNAF has not yet reached any kind of conclusion. A test box should be set up in order to make progress. (N.B.: This is not an LHCOPN issue.)
AOB
- The layout of the page about routing policies has changed: https://twiki.cern.ch/twiki/bin/view/LHCOPN/RoutingPolicies , please have a look and fill it in for your site. In two weeks, tickets will be opened against sites that have not filled it in.
- Is there anything in particular to report about Operations at the next LHCOPN meeting: http://indico.cern.ch/conferenceDisplay.py?confId=102716 ?
- At the next LHCOPN meeting several topics from WLCG will be put on the table
- They have no network support unit in GGUS (Not LHCOPN related)
- We often fail to update tickets regularly, giving the impression that nothing is being done
- There is a lack of clear responsibility for some network issues (not LHCOPN related)
- The separate helpdesk we have in GGUS is isolating us from WLCG and also makes interaction with supporters in the standard GGUS more difficult. Is this justified?
- Are there LHCOPN issues raised by WLCG which have no related ticket in the LHCOPN helpdesk?
- UK-T1-RAL seems not very responsive and not active enough in the LHCOPN TTS; to be discussed with Nick at the next LHCOPN meeting
Next Ops Phoneconf
The next Teleconference was scheduled for Tuesday, January 11th, 2011 - 16:30 Geneva time.