Loss of connectivity with accelerator databases in TN due to a network issue

Description

On 19th July, while preparing an intervention on the accelerator database NAS controllers, we noticed a loss of connectivity lasting about 20 minutes, from 10:25 to 10:48. No action had been taken on the controllers at that point, as the intervention was scheduled to start at 11:00. The databases were running as expected; only connectivity was lost.

Impact

  • No session could be established to the accelerator databases from GPN, even though the databases themselves were running normally.
  • This especially affected access control at the start of the technical stop (the database is required for entering / leaving the controlled zones when the mode is "restricted").

Time line of the incident

  • 10:30 Ruben Gaspar (IT/DB): I noticed I had lost all my ssh sessions to accelerator machines. No change had been made on the ACC NAS controllers. Asked around and checked the IT status board for other interventions taking place at the same time.
  • 10:44 Ruben Gaspar (IT/DB): mail sent to netops and the IT manager on duty. CT699404 REGISTERED [Connectivity to accelerators databases ] https://remedy01.cern.ch/cgi-bin/consult.cgi?caseid=CT0000000699404&email=ruben.gaspar.aparicio@cern.ch
  • 10:45 Zornitsa Zaharieva (BE/CO): mail asking for help sent to several accelerator and IT people.
  • ~10:48 Ruben Gaspar (IT/DB): connectivity to TN is back.
  • 10:57 confirmation from Zornitsa Zaharieva (BE/CO) that connectivity is back.
  • 14:46 David Gutierrez (IT/CS): mail explaining the cause of the error:

"While carrying out a patching operation, one of our technicians damaged by mistake the UTP cable providing connectivity to service S513-C-IP904. He took the initiative to replace it but without informing us in advance. We will review our procedures in order to prevent this mistake and confusion from happening again. Sorry for the inconveniences caused. Frédéric will comment on this issue at the next TIOC."

Analysis

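See the IT/CS explanation quoted in the time line above: the UTP cable providing connectivity to service S513-C-IP904 was damaged by mistake during a patching operation and was replaced without IT/CS being informed in advance.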

Follow up

Given the criticality of these systems, we think a failover solution should be studied. Human errors like the one in this incident, or hardware failures (broken fibers, optical adapters, …), may occur again in the future and disrupt access to TN services. This will be discussed with IT-CS.

-- RubenGaspar - 20-Jul-2010

Topic revision: r3 - 2010-07-21 - RubenGaspar
 