LHCOPN Operations Telecom 2011-04-12
The scope of this phoneconf is [2011-01-01, 2011-03-31].
Participation
Sites represented:
- CH-CERN: Edoardo Martelli, John Shade
- DE-KIT: Aurelie Reymund
- ES-PIC: Fernando Lopez
- FR-CCIN2P3: Guillaume Cessieux (Chair)
- IT-INFN-CNAF: Stefano Zani
- NDGF: Dennis Wallberg
- NL-T1: Sander Boele, Pieter de Boer
- TW-ASGC: Wen-Shui Chen
- US-FNAL-CMS: Vyto Grigaliunas
Apologies:
- CA-TRIUMF: Vitaliy Kondratenko
- UK-T1-RAL: Nick Moore
- US-T1-BNL: John Bigrow
Operations Overview
During time-window [2011-01-01, 2011-03-31]:
- 43 tickets: 15 IL2 (35%), 5 IL3 (12%), 5 Info (12%), 14 ML2 (32%), 4 ML3 (9%)
- Kind of problem: 41 connectivity issues (95%), 1 performance issue, 1 none
- 12 tickets (28%) reported impact on services: 4 Loss of service and 6 performance degradation
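The percentages in the ticket breakdown can be recomputed from the raw counts. The following is a minimal sketch using the category counts listed above; the names `TICKETS` and `shares` are illustrative and not part of any LHCOPN tooling. It reports one decimal place, so values may differ by a point from the whole-number figures in the minutes.

```python
# Ticket counts per category, copied from the overview above.
TICKETS = {"IL2": 15, "IL3": 5, "Info": 5, "ML2": 14, "ML3": 4}


def shares(counts):
    """Return each category's share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


# Print one line per category with its count and share.
for name, pct in shares(TICKETS).items():
    print(f"{name}: {TICKETS[name]} tickets ({pct}%)")
```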
Distribution of ticket assignments was as follows:
Pending issues (https://gus.fzk.de/pages/all_lhcopn.php):
Operations KPIs
Monitoring report
Computed by Sander Boele from the LHCOPN dashboard: http://casper.grid.sara.nl/
Top 20 source/destination pairs by number of measurements with packet loss > 0.1% during the last three months (less than 2% of data missing in the period):
| SRC | DEST | # of measurements |
| DE-KIT-HADES | CA-TRIUMF-HADES | 1118 |
| DE-KIT-HADES | US-FNAL-CMS-HADES | 913 |
| NL-T1-HADES | US-FNAL-CMS-HADES | 901 |
| US-T1-BNL-HADES | TW-ASGC-HADES | 666 |
| TW-ASGC-HADES | US-T1-BNL-HADES | 651 |
| IT-INFN-CNAF-HADES | TW-ASGC-HADES | 592 |
| TW-ASGC-HADES | IT-INFN-CNAF-HADES | 565 |
| TW-ASGC-HADES | CA-TRIUMF-HADES | 460 |
| US-FNAL-CMS-HADES | TW-ASGC-HADES | 363 |
| US-FNAL-CMS-HADES | NL-T1-HADES | 291 |
| US-FNAL-CMS-HADES | DE-KIT-HADES | 282 |
| TW-ASGC-HADES | US-FNAL-CMS-HADES | 269 |
| US-T1-BNL-HADES | CA-TRIUMF-HADES | 209 |
| CA-TRIUMF-HADES | UK-T1-RAL-HADES | 182 |
| CA-TRIUMF-HADES | ES-PIC-HADES | 180 |
| TW-ASGC-HADES | ES-PIC-HADES | 174 |
| US-FNAL-CMS-HADES | ES-PIC-HADES | 172 |
| US-T1-BNL-HADES | ES-PIC-HADES | 168 |
| US-FNAL-CMS-HADES | US-T1-BNL-HADES | 166 |
| DE-KIT-HADES | ES-PIC-HADES | 164 |
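A ranking like the one above can be produced by counting, per source/destination pair, the measurements whose packet loss exceeds the threshold. The following is only a sketch of that counting step: the record layout `(src, dest, loss_fraction)`, the function name and the sample values are assumptions made for illustration, and it is not the code used by the LHCOPN dashboard.

```python
from collections import Counter

LOSS_THRESHOLD = 0.001  # 0.1% expressed as a fraction


def top_lossy_pairs(measurements, n=20, threshold=LOSS_THRESHOLD):
    """Count measurements above the loss threshold per (src, dest) pair.

    `measurements` is an iterable of (src, dest, loss_fraction) tuples.
    Returns the n pairs with the most lossy measurements, highest first.
    """
    counts = Counter()
    for src, dest, loss in measurements:
        if loss > threshold:
            counts[(src, dest)] += 1
    return counts.most_common(n)


# Hypothetical sample data; the host names and loss values are made up.
sample = [
    ("A-HADES", "B-HADES", 0.004),
    ("A-HADES", "B-HADES", 0.002),
    ("A-HADES", "B-HADES", 0.0),
    ("C-HADES", "B-HADES", 0.003),
    ("C-HADES", "B-HADES", 0.0005),
]

for (src, dest), count in top_lossy_pairs(sample):
    print(src, dest, count)
```

Each direction is counted separately, which matches the table listing A to B and B to A as distinct rows.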
Backup tests league table
| Site | Date of last backup test report | Report within the last year? |
| CA-TRIUMF | 2008-06-03 | KO |
| CH-CERN | 2008-06-03 | KO |
| DE-KIT | 2009-10-14 | KO |
| ES-PIC | 2010-04-22 | OK |
| FR-CCIN2P3 | 2010-03-08 | OK |
| IT-INFN-CNAF | 2008-04-09 | KO |
| NDGF | 2008-04-09 | KO |
| NL-T1 | 2009-02-10 | KO |
| TW-ASGC | 2010-12-28 | OK |
| UK-T1-RAL | 2010-08-24 | OK (but was an issue reported about it during the last ops phoneconf?) |
| US-FNAL-CMS | 2008-04-24 | KO |
| US-T1-BNL | 2008-06-03 | KO |
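The OK/KO column can be derived mechanically from the report dates. The following is a minimal sketch, assuming the criterion is "last report at most 365 days old". The minutes do not state the reference date used, so `REFERENCE` below is a hypothetical value chosen for illustration, and the function and variable names are likewise invented. The result depends on that date: measured from the meeting date (2011-04-12), the FR-CCIN2P3 report of 2010-03-08 would be just over a year old.

```python
from datetime import date


def backup_status(last_report, reference, max_age_days=365):
    """Return "OK" if the last report is at most max_age_days old, else "KO"."""
    age_days = (reference - last_report).days
    return "OK" if age_days <= max_age_days else "KO"


# Hypothetical reference date; the minutes do not say which date was used.
REFERENCE = date(2011, 3, 1)

# A few report dates copied from the league table above.
LAST_REPORTS = {
    "DE-KIT": date(2009, 10, 14),
    "FR-CCIN2P3": date(2010, 3, 8),
    "TW-ASGC": date(2010, 12, 28),
}

for site, last in LAST_REPORTS.items():
    print(site, last.isoformat(), backup_status(last, REFERENCE))
```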
Site Reports
CA-TRIUMF
For all GGUS tickets assigned to TRIUMF between 1-04-2011 and 31-06-2011:
Opened last month and not closed yet:
On June 16, 2011, TRIUMF migrated from the CWDM system to the dedicated fibre circuits.
The following circuits have been affected:
- BNL-TRIUMF-LHCOPN-001 - 10G
- CERN-TRIUMF-LHCOPN-001 - B - 1G
- CERN-TRIUMF-LHCOPN-002 - P - 5G
- SARA-TRIUMF-LHCOPN-001 - T1 - 1G
CH-CERN
- CERN LHCOPN routers will be upgraded with new hardware. The first one will be replaced in May, the second one in June. Exact dates will be provided. Sites having two links to CERN will be migrated to the new hardware one link at a time to avoid full disconnection. This will be discussed site by site, and for most of them this should be a 10-minute maintenance. This will also be a good use case for backup tests.
DE-KIT
No service impacting event on any link:
- 0 GGUS tickets assigned to DE-KIT
- 1 planned maintenance at DFN (#65897)
- 1 link down event on the link DE-KIT/NL-T1 (#67086, duplicate #67087)
Aurelie recalled that it was decided not to open GGUS tickets for non-service-impacting events (for example maintenance announcements), which is why DE-KIT now has few tickets.
ES-PIC
- All GGUS tickets opened are related to ML2
- PIC yearly electrical maintenance will be from 19th to 20th April
FR-CCIN2P3
No network service impacting event during the time window.
- Link CERN-IN2P3-LHCOPN-001: No event!
- Link GRIDKA-IN2P3-LHCOPN-001:
- 1 fiber cut around Besançon-Dijon #RENATER-2130329, 2011-02-15 19:43 -> 2011-02-16 04:19
- 16 flaps (they appear regular: nearly all occurred on a Tuesday or Thursday between 02:00 and 03:00)
CCIN2P3 would like some brief backup tests to be carried out, since the routing between FR-CCIN2P3, DE-KIT and NL-T1 has become really complex, in particular to verify path symmetry.
IT-INFN-CNAF
A second link between CNAF and CERN was activated on January 27th. The two links are used with round-robin load balancing. Because they follow two very different paths, their RTTs differ, which may lead to some issues. This is being investigated. The effectiveness of the redundancy was fully tested, but the test was not reported.
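One way unequal delays can cause trouble under round-robin balancing is packet reordering. The following toy simulation is a sketch under stated assumptions: it assumes the balancing is done per packet (the minutes do not say whether it is per packet or per flow), and the delay values, send interval and function names are hypothetical.

```python
def arrival_order(n_packets, delays_ms, send_interval_ms=1.0):
    """Send packets round robin over links with the given one-way delays.

    Returns the packet sequence numbers in the order they arrive.
    """
    arrivals = []
    for seq in range(n_packets):
        link = seq % len(delays_ms)  # alternate links packet by packet
        arrival_time = seq * send_interval_ms + delays_ms[link]
        arrivals.append((arrival_time, seq))
    arrivals.sort()
    return [seq for _, seq in arrivals]


def count_reordered(order):
    """Count packets arriving after a packet with a higher sequence number."""
    highest_seen = -1
    late = 0
    for seq in order:
        if seq < highest_seen:
            late += 1
        else:
            highest_seen = seq
    return late


# Hypothetical one-way delays of 10 ms and 14 ms on the two links.
order = arrival_order(6, [10.0, 14.0])
print(order, count_reordered(order))

# With equal delays the packets arrive in order.
print(count_reordered(arrival_order(6, [10.0, 10.0])))
```

In this model reordering appears as soon as the delay difference exceeds the interval between consecutive packets, which is why per-flow balancing is often preferred when paths differ.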
NDGF
Our backup connection was re-routed internally in NORDUnet to gain physical redundancy, since it was running in the same physical fibre trunk, provided by GEANT, as the main connection.
Other than that nothing out of the ordinary to report.
NL-T1
For all GGUS tickets assigned to NL-T1 between 1-1-2011 and 31-3-2011 this is the report based on the CSV export:
- 16 GGUS tickets
- IL2, 8
- IL3, 1
- Info, 1
- ML2, 4
- ML3, 2
- Closed 16 (on 1-3-2011)
Link related problems:
- FERMI-SARA-LHCOPN-001 - T1 - 1G - (1x IL2, 1x ML3)
- GRIDKA-SARA-LHCOPN-001 - 10G - (2x IL2)
- NDGF-SARA-LHCOPN-001 - 10G - (1x IL2, 1x ML2)
- SARA-TRIUMF-LHCOPN-001 - T1 - 1G - (2x IL2, 3x ML2, 1x ML3)
Despite our new policy of separating SURFnet6/Netherlight NOC work from NL-T1 operations, a few tickets have still been logged for non-NL-T1 links. It is our policy not to do so.
- CERN-TRIUMF-LHCOPN-002 - P - 5G - (3x IL2, 1x Info)
Ticket 68777 was incorrectly opened as IL2, while it should have been ML2.
On January 18th we implemented a new routing setup allowing us to better serve Nordugrid and DE-KIT. In March we closed the long-standing ticket 62381; we fixed this issue by carefully applying route metrics.
Pieter de Boer -- NL-T1 / SARA 12/04/2011
TW-ASGC
Link CERN-ASGC-LHCOPN-003: two unscheduled downtimes and one scheduled downtime:
- The international carrier reported multiple fiber cuts on the backhaul near Amsterdam. 2011-03-02 23:08 - 2011-03-04 11:30
- Scheduled fiber replacement maintenance requested by the international carrier. The requested maintenance window was 8 hours, but the real downtime was less than 5 minutes. 2011-02-28 00:00 - 2011-02-28 08:00
- Unscheduled down time due to Japan-US submarine cable cut. 2011-02-16 22:47 - 2011-02-17 12:00
The procurement of the 2.5G link from Taipei to Amsterdam and CERN is delayed until mid-July because of price negotiations.
UK-T1-RAL
High utilization was noted on traffic incoming from CERN to RAL, and discussions are being held about load balancing across the two 10G trunks.
Scheduled internal RAL site router upgrade on 15-3-11 and 22-3-11. No Tier1 traffic via JANET during the upgrade.
- Since the upgrade, reported RAL internal performance problems have affected traffic via JANET, in the form of packet loss. A misconfiguration of Microsoft NLB is believed to be the cause and is being investigated.
US-FNAL-CMS
Some events within the USLHCnet domain, but nothing affecting service or L3.
US-T1-BNL
[Mail from John after the phoneconf: from BNL we've had minimal / no problems]
AOB
- Conclusions from Ops WG meeting 7: http://indico.cern.ch/materialDisplay.py?contribId=11&materialId=1&confId=129691
- Two items to go ahead
- Discussion with GGUS on how to precisely improve interactions between LHCOPN helpdesk and WLCG GGUS
- Discussion with Sander about gathering monitoring information indicating service impacting events happening on the LHCOPN
- John asked how the monitoring system behaves with the round-robin balancing over the two links at IT-INFN-CNAF
- Sander plans to correlate OWD with the traceroute database to avoid issues and to be able to handle RTT discrepancies
- The next LHCONE/LHCOPN meeting is on June 13th and 14th in Washington DC: http://indico.cern.ch/conferenceDisplay.py?confId=131550
Next Ops Phoneconf
The next teleconference is scheduled for Tuesday, July 5th, 2011, at 16:30 Geneva time (CEST).