DB Web>PostMortems>PostMortem06June2020 (2020-06-15, AndreiDumitru)

ATLAS Online database not accessible from the ATLAS Technical and Control Network (ATCN)


Description

The ATLAS Online Database (ATONR) was not accessible from the ATCN (ATLAS Technical and Control Network) from Saturday 6 June 2020 07:00 until Sunday 7 June 2020 19:20. The root cause was the failure of the switch that provides the link between the ATCN and the ATONR database in the Computer Center. The database itself remained up and running and accessible to clients on the GPN, but the majority of applications using this database are in the ATCN.

Impact

Database applications in the ATCN could not connect to the ATONR database. When the ATONR database is unavailable, the ATLAS Detector Control System (DCS) starts buffering detector data to disk; if database connectivity is not restored, the DCS systems eventually run out of local disk space.
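As a rough illustration of the buffering constraint, the remaining time before a buffer disk fills can be estimated from the free space and the write rate. This is a minimal sketch: the function name and the figures used are hypothetical and are not taken from the DCS systems, and a constant write rate is assumed.

```python
def hours_until_full(free_bytes: float, write_rate_bytes_per_s: float) -> float:
    """Estimate the hours left before a buffer disk fills at a constant write rate."""
    if write_rate_bytes_per_s <= 0:
        # Nothing is being buffered, so the disk never fills.
        return float("inf")
    return free_bytes / write_rate_bytes_per_s / 3600.0


# Hypothetical figures: 36 GB free, buffering at 1 MB/s.
remaining = hours_until_full(36e9, 1e6)
print(f"{remaining:.1f} hours of buffering left")
```

An estimate of this kind, computed per system, would give an early indication of which DCS hosts are closest to running out of space.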

Time line of the incident

06-June-2020 07:00 : Switch D513-A-IPY-LBR7T-11 failed. Connection from the ATLAS Technical and Control Network to the ATLAS Online database (ATONR) was lost. Database was accessible only through the GPN link.
06-June-2020 07:01 : Network monitoring tools detected the failure immediately. As this switch is defined as “EXPERIMENT”, an alarm was raised and an e-mail was sent to Network.Communications@cern.ch and atlas-tdaq-sysadmins@cern.ch at 07:01 on Saturday morning.
06-June-2020 10:42 : Ticket opened by ATLAS INC2455807 - (by default assigned to the Service Desk when opened; no Service Desk available during the weekend)
06-June-2020 10:50 : Email sent to certain people in IT-DB about the lack of connectivity from the ATCN to the ATONR database (no piquet service for Database Services during LS2)
07-June-2020 09:44 : The Computer Center operator was contacted by an ATLAS TDAQ Sysadmin via phone
07-June-2020 10:32 : Computer Center Operator contacted IT-DB via phone
07-June-2020 11:27 : Ticket INC2456692 opened by IT-DB to the Campus Network Service to track this issue and provide more details (by default assigned to the Service Desk when opened; no Service Desk available during the weekend)
07-June-2020 11:36 : Email sent to Computer.Operations@cern.ch with the details of ticket INC2456692, opened for the Campus Network Service, after discussing the issue by phone.
07-June-2020 13:15 : INC2456702 created by the Computer Center operator for the Campus Network Service piquet, reporting the issue and passing on the details given in INC2456692. The ticket was delayed because the ticket template did not accept the interface name, which is not a network device.
07-June-2020 15:00 : Network support Firstline was on site and started running tests to figure out what was wrong; nobody had told them the name of the affected network device.
07-June-2020 16:00 : The incident was escalated to an IT-CS engineer because Firstline was not able to identify the issue
07-June-2020 17:06 : ATLAS DCS informs us in INC2456692 that the ATLAS Detector Control System is currently buffering detector data to disk and in about 10 hours the local disk space will run out for some systems.
07-June-2020 17:15 : The IT-CS engineer identified the root cause (D513-A-IPY-LBR7T-11 down) and asked Firstline to check the switch
07-June-2020 17:16 : IT-DB followed up with the Computer Center operator to check the status: the issue had been worked on in INC2456702 since 13:15, but IT-DB was not aware of this ticket until this time.
07-June-2020 17:45 : Firstline started to check the right switch. The switch was up, including its uplinks, so they tested the fibre path, which was also fine.
07-June-2020 18:48 : Firstline rebooted the switch; at 18:50 it was up
07-June-2020 19:20 : Network connectivity is confirmed as working from ATCN to ATONR database by ATLAS.
08-June-2020 : Outage records were created for this issue OTG0057011 (Datacenter Network Service) and OTG0057013 (Oracle Database Service)
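
The delays in the time line can be made explicit by computing the time elapsed since the switch failure for each key event. The sketch below uses only timestamps listed above; the helper names `parse` and `elapsed` are illustrative.

```python
from datetime import datetime

FMT = "%d-%B-%Y %H:%M"  # matches timestamps such as "06-June-2020 07:00"


def parse(ts: str) -> datetime:
    return datetime.strptime(ts, FMT)


def elapsed(start: str, end: str) -> str:
    """Return the time between two timestamps formatted as 'HhMMm'."""
    minutes = int((parse(end) - parse(start)).total_seconds() // 60)
    return f"{minutes // 60}h{minutes % 60:02d}m"


FAILURE = "06-June-2020 07:00"
EVENTS = [
    ("First ticket opened by ATLAS (INC2455807)", "06-June-2020 10:42"),
    ("Computer Center operator contacted by phone", "07-June-2020 09:44"),
    ("Network ticket INC2456702 created", "07-June-2020 13:15"),
    ("Root cause identified", "07-June-2020 17:15"),
    ("Switch rebooted", "07-June-2020 18:48"),
    ("Connectivity confirmed by ATLAS", "07-June-2020 19:20"),
]

for label, ts in EVENTS:
    print(f"{elapsed(FAILURE, ts):>7}  {label}")
```

The last row gives the total outage of 36h20m, of which 26h44m passed before the Computer Center operator was first contacted.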

Outage records:

ServiceNow tickets:

https://cern.service-now.com/service-portal?id=ticket&table=incident&n=INC2455807 (opened by ATLAS for the Oracle Database Service)
https://cern.service-now.com/service-portal?id=ticket&table=incident&n=INC2456692 (opened by IT-DB for Campus Network Service, referenced in INC2456702)
https://cern.service-now.com/service-portal?id=ticket&table=incident&n=INC2456702 (opened by CC operator for Network Firstline support)

Analysis

Switch D513-A-IPY-LBR7T-11 failed on Saturday (06-June-2020) morning at 07:00. There is no network piquet actively checking alarms outside working hours; this is done only by the IT operator, and only for LCG/ITS/GPN devices. The network Firstline (technicians) and Neteng (engineers) are always ready to work remotely or on site. No alert was raised by the IT-DB monitoring because the database remained accessible through the General Purpose Network (GPN), which is the link used to monitor the ATLAS Online Database (ATONR).
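
The monitoring gap described here (a single probe over the GPN cannot detect a failure on the ATCN path) could be narrowed by probing the database listener over each network path separately. The following is a minimal sketch, not the IT-DB monitoring implementation: the function names are illustrative, the endpoints in the usage comment are hypothetical, and each probe only tests the path from the host it runs on, so an ATCN check would have to run from a machine inside the ATCN.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_paths(paths: dict) -> dict:
    """Probe every named path; paths maps a label to a (host, port) tuple."""
    return {name: tcp_reachable(host, port) for name, (host, port) in paths.items()}


# Hypothetical usage (host names are placeholders; 1521 is the default
# Oracle listener port and is assumed here, not confirmed for ATONR):
# status = check_paths({
#     "GPN": ("atonr-gpn.example.org", 1521),
#     "ATCN": ("atonr-atcn.example.org", 1521),
# })
# failed = [name for name, ok in status.items() if not ok]
```

With one result per path, an alert can be raised when any single path fails, even while the others remain healthy.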

When the switch failed, an email was sent to atlas-tdaq-sysadmins@cern.ch and Network.Communications@cern.ch informing them about the issue. The procedure described in the email was only followed the next day, on Sunday, when the Computer Center operator was contacted and the Oracle Database Service and Network Firstline support were then informed of the issue.

Since the switch is labeled as “EXPERIMENT” (devices whose names start with D) and it is hosted in the ATLAS domain, it is expected to be managed by the responsible declared in CSDB, in this case ATLAS-TDAQ-SYSADMIN. In this setup, they are the only ones notified if something happens on the device; the copy sent to Network.Communications@cern.ch is kept only for history and statistics. If the same issue happens to an LCG/ITS/GPN device, the CC Operators contact Firstline directly, and even if the alarm on the console is missed, a ticket is generated for them so that the incident is followed up.
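
The difference between the two notification paths can be summarised in a few lines of code. This is a simplified sketch of the rule as described above, not the actual logic of the network monitoring tools: the `Routing` type and `route_alarm` function are illustrative, and the only classification rule used is the one stated in this analysis (names starting with D denote “EXPERIMENT” devices).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Routing:
    category: str      # "EXPERIMENT" or "IT" (LCG/ITS/GPN devices)
    notified: tuple    # who is expected to act on the alarm
    auto_ticket: bool  # whether a follow-up ticket is generated automatically


def route_alarm(device: str, responsible: str) -> Routing:
    """Classify a device alarm according to the rule described in the analysis."""
    if device.upper().startswith("D"):
        # Experiment-managed device: only the responsible declared in CSDB acts.
        return Routing("EXPERIMENT", (responsible,), auto_ticket=False)
    # IT-managed device: the CC operator contacts Firstline and a ticket is created.
    return Routing("IT", ("CC Operator", "Firstline"), auto_ticket=True)


print(route_alarm("d513-a-ipy-lbr7t-11", "ATLAS-TDAQ-SYSADMIN"))
```

For the failed switch, the result has `auto_ticket=False`, which is why nothing was tracked until someone opened a ticket manually.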

There was a one-hour delay in opening ticket INC2456702 because the Operators could not use the template for network piquet tickets: the device was not recognized, since the earlier tickets mentioned only host interfaces and no network device name (e.g. the name of the switch). Network Firstline then also spent time trying to identify the corresponding network equipment, as they are not familiar with servers and interfaces, especially in IT-DB setups where a single device has several interfaces on different services (this is the setup required by Oracle Grid Infrastructure, the clustering software from Oracle used to provide high availability to Oracle databases).

Had ATLAS opened the ticket to the operators directly (this being an experiment switch), as described in the alert email, the Computer Center Operators would have been warned with the right information at hand (the name of the defective switch), and network Firstline would have been alerted directly with the name of the affected switch. No escalation to the IT-CS engineers would then have been required, and the issue could have been solved much faster.

Email reporting that `d513-a-ipy-lbr7t-11` stopped responding

-----Original Message-----
From: Network Communications <Network.Communications@cern.ch> 
Sent: Saturday, June 6, 2020 07:01
To: atlas-tdaq-sysadmins (ATLAS TDAQ System Administration related discussions) <atlas-tdaq-sysadmins@cern.ch>; Network Communications <Network.Communications@cern.ch>
Subject: [NEW] CRITICAL alarm on d513-a-ipy-lbr7t-11 (ID 2416511) : DEVICE HAS STOPPED RESPONDING TO POLLS (06/06 07:00:41)

Dear atlas-tdaq-sysadmins@cern.ch

The network monitoring reports the following on a device you manage:

Sat 06 Jun, 2020 - 07:00:41 - Device d513-a-ipy-lbr7t-11 of type FoundryIron has stopped responding to polls and/or external requests.  An alarm will be generated.   (event [0x00010d35])

If you require assistance from the IT/CS network support, please open a ticket through the following link or via the Service Desk (77777):
https://cern.service-now.com/service-portal?id=sc_cat_item&?name=incident&se=network-projects-experiments
Action will be taken within one working day.

In case of a critical network problem where urgent intervention is required from IT/CS, please call directly the Computer Centre operator (75011). 
At the same time send an email to Computer.Operations@cern.ch describing the problem and justifying its urgency. 
The meeting point (whenever applicable) and the contact person's phone number must be clearly stated in the incident description.

Before reporting an incident to IT/CS, please make sure that it is not due to any external condition, in particular lack of power or failure on the connected clients.

Best regards,
   The IT/CS Monitoring team.

More info:

atlas-tdaq-sysadmins@cern.ch

-- AndreiDumitru - 2020-06-08

Topic revision: r2 - 2020-06-15 - AndreiDumitru
