Power cut in the computer center

This page tracks the issues raised after the power cut that occurred on Saturday, December 18th 2010, from 12:00 to 16:00.

TODO put link to EN post-mortem.

Virtual Machines Infrastructure and monitor on SLS

  • Description: G0 Pool: error reading the config file on the NAS; dbvrtg038 was restored automatically as soon as the connection came back. D-Pool completely down. SLS entry down?
  • Impact: All dbvrtd machines down; not critical because they are used only for tests. Production pools stayed up and the G0 Pool came back automatically.
  • Analysis: The D-Pool machines went down and came back up automatically, but OVS was not able to resync the pool. dbsrvd250, the manager node of the pool, came back but could not communicate properly with the other servers, so the whole pool remained down.
  • Follow up: The firewall configuration on the D-Pool blocked communication between the servers after the shutdown. The firewall has to be checked: everything started to run as soon as we stopped it. The SLS monitor of the VMs is still down and needs investigation.

J2EE Public Service

  • Description: one container failed to start up
  • Impact: 3 applications unavailable (form-gen, mtf-gmaps, ts-project-mtfStats)
  • Analysis: the ports used for RMI binding are sometimes switched between containers; when that happens, a container may fail to start because the port it needs is already in use by another container
  • Follow up: need to investigate more
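
As a starting point for that investigation, the port clash described in the analysis could be detected before a container is started. The following is a minimal sketch, not the actual start-up procedure: the function names, the container names and the idea of a per-container port map are assumptions made for illustration. It only takes a snapshot, so a port reported as free can still be taken a moment later.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP listener can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
    return True


def find_port_conflicts(assigned_ports, host="127.0.0.1"):
    """Return {container: port} for every assigned port that is already taken.

    assigned_ports is a hypothetical mapping of container name to the RMI
    port that container expects to bind.
    """
    return {
        name: port
        for name, port in assigned_ports.items()
        if not port_is_free(port, host)
    }
```

Running find_port_conflicts just before each container start would at least log which container lost its port, which is the information missing from the current analysis.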

OracleHR DEV and TEST

  • Description: DCERNI and TCERNI were not restarted after reboot of the nodes
  • Impact: developers could not connect to dev/test OracleHR
  • Analysis: OnDemand did not restart the instances; an SR was opened
  • Follow up: RCF 3-16UQWCJ and 3-16UQWCT have been created to restart the services. SR 3-2591273251 has been created to perform root cause analysis. Whatever the reason, we should also check Pcerni, as it may suffer from the same problem.

EDMS

  • Some containers required a manual relocation.

Castor DB

  • Description: Alice Stager instance did not start in node dbsrvc203
  • Impact: Service working on second node
  • Analysis: A faulty value for sga_target prevented the instance from completing the startup process. The fix was to remove the parameter from the spfile; this parameter is not used in our systems
  • Follow up:

  • Description: SRM Public host failed to boot - dbsrvc217
  • Impact: Service working on the second node
  • Analysis: a hardware problem prevented the node from booting. This requires vendor intervention
  • Follow up: The sysadmin team will coordinate this intervention.

HRT/CET prod

  • Description: HRT/CET applications were partially unavailable after the reboot of one of the two nodes
  • Impact: one connection out of two failed
  • Analysis: Apache on dbsrvg3XXX was configured not to start automatically
  • Follow up: fixed the configuration (tab.xml); the service will now start at boot time

AIS RAC service relocation

  • Description: the services in AISDB did not end up on the correct cluster nodes.
  • Impact: a couple of days later, this caused an overload on some of the nodes.
  • Analysis: The fix was to relocate the services to their correct nodes.
  • Follow up: A monitoring event will be developed to catch such an issue; in addition, automated relocation will be considered.
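
The monitoring event mentioned in the follow-up amounts to comparing where each service is expected to run with where it actually runs. The sketch below illustrates that comparison only; the function name, service names and node names are hypothetical, and how the actual placement is collected from the cluster is left out.

```python
def misplaced_services(preferred, actual):
    """Return {service: (preferred_node, actual_node)} for services on the wrong node.

    Both arguments map a service name to a node name. A service missing
    from `actual` is reported with None as its actual node.
    """
    problems = {}
    for service, wanted in preferred.items():
        running_on = actual.get(service)  # None means the service is not running
        if running_on != wanted:
            problems[service] = (wanted, running_on)
    return problems


# Hypothetical placement data, for illustration only.
preferred = {"svc_a": "node1", "svc_b": "node2", "svc_c": "node3"}
actual = {"svc_a": "node1", "svc_b": "node1"}

for service, (wanted, found) in sorted(misplaced_services(preferred, actual).items()):
    print(f"{service}: expected on {wanted}, found on {found}")
```

An empty result means every service is on its preferred node; anything else could raise the alert or feed the automated relocation under consideration.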

NAS controller loop for backup to disk NAS

  • Description: one fibre channel loop was lost on one NAS controller.
  • Impact: The controller service is not affected as we have configured multipathing.
  • Analysis: A case with Netapp support is open.
  • Follow up: Validation of the GBIC and the fibre cable is under way.

Hardware issues

  • Description: five servers now have one of two power supplies in fault.
  • Impact: only one node is left in the RAC cluster for dbsrvc217; this is transparent to the service
  • Analysis: Tickets are being raised in itcm.
  • Follow up: IT-CF and the supplier are replacing the components (dbsrvc217: down after power cut, motherboard changed; dbsrvd112: power supply fault; dbsrvd114: disk fault; dbsrvd108: power supply fault; dbsrvd107: power supply fault; dbsrvd110: errors, power removed and server now ok)

---++ Service name template

   * *Description:*
   * *Impact:*
   * *Analysis:*
   * *Follow up:*
Topic revision: r7 - 2011-01-07 - EricGrancher
 