Operations Telecom 8th October 2009

Sites represented: CCIN2P3, CERN, CNAF, DE-KIT, FNAL (report submitted later), NL-T1, PIC, TRIUMF

Operations Model Overview

GGUS report for the period [July 1st, October 1st[:

  • 77 tickets: 12 IL2, 1 IL3, 1 INFO, 58 ML2, 4 ML3, 1 TEST => 75% are ML2
  • 80% of problems are of the network connectivity kind.
  • Service impact reported in tickets: 13 Loss of service, 13 None service affecting, 13 Performance degradation, 7 Possible performance degradation, 30 Reduced redundancy, 1 Unknown => 55% of problems are not impacting services.
  • Number of tickets per root site: 6 CA-TRIUMF, 17 CH-CERN, 12 DE-KIT, 8 ES-PIC, 2 IT-INFN-CNAF, 5 NDGF, 18 NL-T1, 4 TW-ASGC, 2 UK-T1-RAL, 3 US-FNAL-CMS, 0 US-T1-BNL, 0 FR-CCIN2P3
  • Number of tickets per site reported as impacted: 24 CA-TRIUMF, 50 CH-CERN, 18 DE-KIT, 13 ES-PIC, 3 FR-CCIN2P3, 6 IT-INFN-CNAF, 9 NDGF, 29 NL-T1, 11 TW-ASGC, 5 UK-T1-RAL, 11 US-FNAL-CMS, 9 US-T1-BNL
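The percentages quoted in the bullets above can be re-derived from the raw counts. The following is a minimal sketch; it assumes that "not impacting services" means the "None service affecting" and "Reduced redundancy" categories taken together, which is an interpretation and not something the report states. The names `by_type`, `by_impact` and `share` are illustrative only.

```python
# Ticket counts copied from the GGUS report bullets.
by_type = {"IL2": 12, "IL3": 1, "INFO": 1, "ML2": 58, "ML3": 4, "TEST": 1}
by_impact = {
    "Loss of service": 13,
    "None service affecting": 13,
    "Performance degradation": 13,
    "Possible performance degradation": 7,
    "Reduced redundancy": 30,
    "Unknown": 1,
}


def share(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


total = sum(by_type.values())
ml2_share = share(by_type["ML2"], total)

# Assumption: "not impacting services" = none service affecting + reduced redundancy.
not_impacting = by_impact["None service affecting"] + by_impact["Reduced redundancy"]
not_impacting_share = share(not_impacting, total)

print(f"total tickets: {total}")
print(f"ML2 share: {ml2_share:.1f}%")
print(f"not impacting services: {not_impacting_share:.1f}%")
```

Under that assumption the two shares come out close to the 75% and 55% quoted in the report.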

Correlation between monitoring events and GGUS tickets for September: correlation.png

  • 17 events: 11 not handled (64%), 3 ok (17%), 3 suspicious.

Site Reports

ASGC

BNL

CCIN2P3

Only two events > 30 seconds detected by our monitoring system in the last four months:

Regular events: BGP glitches with CH-CERN (325 events < 30 sec, ~ 2/day) are being investigated with RENATER (thought to be due to either an optical equipment fault or BGP timers that are set too low).

CERN

Change of the CERN LHCOPN prefix. It was done with two tickets, one for the new announcement and one for the withdrawal of the old prefix. The first ticket was updated by almost all the T1s, giving feedback that they were accepting the new prefix; only two T1s had to be contacted for confirmation. In the withdrawal ticket there was an issue with communication inside CNAF (GARR, the network operator, was not receiving the tickets). What is the status? Stefano replied that this is not yet resolved. Edoardo stated that there were two possibilities: either people from GARR are added to the GGUS distribution list, or they are added to the CNAF mailing list. Stefano preferred the first option and said that he would discuss this with Marco and Guillaume.

CNAF

There is a routing issue between CNAF and GARR to be solved in order to permit automatic rerouting in case of problems on the PIC LHC-OPN link, as reported here: This activity is not easy because it requires changing the routing configuration on three different routers: one GARR Juniper with two different virtual router instances, and two INFN routers (Cisco and Extreme Networks). The configuration change won't be completed until the end of October.

DE-KIT

Approx. 12 GGUS tickets for the period 2009-07-01 -> 2009-10-01, mainly ML2. 3 duplicate tickets concerning DFN maintenance. The GGUS tickets were opened by CERN: should T1s open the maintenance tickets announced by the NREN? It was reiterated that although CERN had opened tickets for Tier-1s in the past, it was no longer doing this. Tier-1s should open their own tickets. The only exception will be when an incident/maintenance affects multiple sites; in this case CERN will open the ticket.

A backup test is planned for next Wednesday and more details will be made available on the 9th. Guillaume noted that there was a major event taking place at IN2P3 on this date and that if any significant support were required from IN2P3 then the test should be rescheduled. Aurelie will discuss this with Bruno.

FNAL

I still am confused as to who can open tickets on our behalf…the CERN NOC opened up several tickets on our behalf due to USLHCnet maintenance. I don’t have a problem with this and IMHO this is an efficient method of opening tickets, instead of us being middlemen, at least when it comes down to problems within USLHCnet. Please see comment above in the KIT report.

Can the MDM systems be integrated somehow with the ticketing system, so that if one of the systems detects a problem with a circuit, it can automatically generate a ticket on behalf of the site where the MDM system is? Or is that the plan with the MDM systems? This can be discussed at the next LHCOPN meeting.

NDGF

We don't have much to report, no big problems. We have some ongoing issues with a flapping backup link NDGF-SARA; this is under investigation.

We also find that the refresh in GGUS is too short, but I assume that this has been fixed already :)

Next meeting on 14 Jan is OK for us (me)

Some historical events that I pulled from our ticket system.

Downtime on the backup link:
- 27.02.2009. Broken XFP.
- 07.04.2009. 104m, carrier accidentally split the wrong fiber when doing maintenance.
- 22.04.2009. 2m, no RFO.
- 23.04.2009. 179m, planned maintenance.
- 07.05.2009. 392m, fiber cut.
- 15.05.2009. 3m, no RFO.
- 03.08.2009. 281m, the problem was a faulty amplifier in SURFnet.

Downtime on the primary:
- 13.03.2009. 4s outage, faulty OSCU board in GEANT2 network.
- 07.04.2009. 3s flap, no RFO.
- 08.04.2009. 6m, planned event.
- 19.05.2009. 6m, some maintenance work caused the outage.
- 18.05.2009. 190m, fiber cut between Denmark and Germany.
- 10.06.2009. 2144m, fiber cut between Obergerlafingen and Giebenach.
- 01.07.2009. 1205m, faulty amplifier in GEANT2 network.

11.05.2009. 2h downtime for NDGF when a router was relocated.
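The per-link durations listed above can be totalled with a few lines of Python. This is only a sketch: the duration strings are copied from the report, the 27.02.2009 broken-XFP event is left out because no duration is given for it, and the 2h router relocation is not counted against either link. The helper names `to_seconds` and `total_minutes` are illustrative.

```python
import re

# Durations copied from the NDGF event lists ("m" = minutes, "s" = seconds).
# The 27.02.2009 broken-XFP event has no duration in the report, so it is omitted.
backup = ["104m", "2m", "179m", "392m", "3m", "281m"]
primary = ["4s", "3s", "6m", "6m", "190m", "2144m", "1205m"]

SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600}


def to_seconds(text):
    """Convert a string such as '104m' or '4s' into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    value, unit = match.groups()
    return int(value) * SECONDS_PER_UNIT[unit]


def total_minutes(durations):
    """Sum a list of duration strings and return the total in minutes."""
    return sum(to_seconds(d) for d in durations) / 60


print(f"backup link:  {total_minutes(backup):.1f} min")
print(f"primary link: {total_minutes(primary):.1f} min")
```

The totals cover only the events that appear in the list, so they are a lower bound on the downtime actually experienced.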

NL-T1

From 1-7-2009 until 30-9-2009 we achieved the following IP uptimes for the NL-T1 10G LHCOPN connections:
- NDGF: 99.959%
- CH-CERN: 99.120%
- DE-KIT: 99.777%
This is based on a ping every five minutes to the router at the other side of each link.
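To put these percentages in perspective, they can be converted back into an approximate downtime. The sketch below assumes that the uptime is simply the fraction of successful pings and that the period runs from 1 July to 30 September 2009 inclusive; neither detail is spelled out in the report, and the function name `implied_downtime_minutes` is illustrative.

```python
from datetime import date

# Uptime percentages as reported for the NL-T1 LHCOPN links.
uptimes = {"NDGF": 99.959, "CH-CERN": 99.120, "DE-KIT": 99.777}

# Assumption: the period runs from 1 July to 30 September 2009 inclusive.
period_days = (date(2009, 9, 30) - date(2009, 7, 1)).days + 1
period_minutes = period_days * 24 * 60
ping_interval_minutes = 5
total_pings = period_minutes // ping_interval_minutes


def implied_downtime_minutes(uptime_percent):
    """Downtime in minutes implied by an uptime percentage over the period."""
    return period_minutes * (100.0 - uptime_percent) / 100.0


for peer, uptime in uptimes.items():
    down = implied_downtime_minutes(uptime)
    missed = down / ping_interval_minutes
    print(f"{peer}: ~{down:.0f} min down (~{missed:.0f} of {total_pings} pings)")
```

With a five-minute sampling interval, a single missed ping is counted as five minutes of downtime, so short flaps can be either missed entirely or overstated.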

Although we have opened many GGUS tickets we have not had real problems during this period.

My own report of GGUS tickets where NL-T1 is affected is: Opened: 34, IL2: 8, IL3: 1, INFO: 1, ML2: 20, ML3: 4 (based on the limited GGUS selection possibilities from 1-7-2009 until 30-9-2009). So there are a lot of ML2 tickets, which seems to match the common conclusion from the overall figures.

Some comments about GGUS from our operational people:
- In the GGUS ticket history it would be nice to see the organization name of the person who is adding something to the ticket history.
- GGUS is now using a five-minute refresh period to be sure the status of the dashboard is always up to date. This is good. But this refresh period is also used when opening a GGUS ticket and filling in all the details. If this takes longer than 5 minutes and the page refreshes, all information filled into the form is lost. This is not good. Guillaume confirmed that GGUS had been modified and that this should no longer be the case.

We have two open requests to give SNMP access to our border router, one from FR-CCIN2P3 and one from CH-CERN. I would like to discuss, maybe at the next LHCOPN meeting or during the conference call, whether this is still necessary and what it is used for. We have already given SNMP access to our border router for the NL-T1 pS machines (hades and mdm). Guillaume confirmed that IN2P3 no longer required this SNMP access. Edoardo suggested that the CERN access be discussed at the next LHCOPN meeting as part of a discussion on monitoring and how to provide WLCG with a quick overview of the status of the LHCOPN.

Hanno Pet -- NL-T1 / SARA

PIC

From 1-7-2009 until 30-9-2009 we had 9 GGUS tickets; 5 are unscheduled fiber cuts and 4 are L2 maintenance. 5 of those GGUS tickets affected both the primary and secondary LHCOPN links.

We're waiting for RedIRIS to get a new GEANT+ connection in order to provide an alternative path in the international part.

The issue https://gus.fzk.de/pages/ticket_lhcopn_details.php?ticket=50183 affecting ES-PIC and IT-INFN-CNAF is still open and assigned to IT-INFN-CNAF waiting for some routers to be reconfigured. Any update on this?

RAL

TRIUMF

No particular issues with TRIUMF. As usual, there are quite a lot of interventions affecting TRIUMF, as summarized in the operations model overview: for example, 6 GGUS tickets with TRIUMF as root site and 24 GGUS tickets reporting TRIUMF as impacted. Also, a large portion of BGP alarms is related to TRIUMF.

Miscellaneous

The next Teleconference was scheduled for the 14th January 2010.

-- WayneSalter - 10 Jul 2009

Topic attachments
correlation.png (33.4 K, 2009-10-08 10:03, GuillaumeCessieux)
Topic revision: r15 - 2010-11-24 - GuillaumeCessieux
 