Postmortems for Service Incidents

Please use the PostMortemTemplate.

Link Date (of incident) Brief Description Follow-up implemented? Owner
PostMortem06June2020 6 June 2020 Unable to access ATONR database from ATLAS Technical Network no Andrei
PostMortem27May2020 27th May 2020 RAC52 storage outage no Eva
PostMortem19Nov17 19th November 2017 RAC52 storage outage no Miro
PostMortem21Jan16 21st January 2016 RAC51 storage switch extension outage no Paul
PostMortem07Jan16 7th Jan 2016 Cinder database restored from backup no David
PostMortem26May15 26th May 2015 Qualiac and other long-term export files deleted from TSM Yes Szymon
PostMortem18Oct14 18th October 2014 DBoD VOMS database unavailable Yes Eva, Ignacio
PostMortem4Oct14 4th October 2014 DBoD FTS3 database unavailable Yes Eva, Ruben
PostMortem7Jun14 7th June 2014 Multiple node restarts after power cut Yes Luca
PostMortem28Dec13 28th December 2013 ADCR_ADG_RAC7 database suffered from multiple block corruptions after disk failures Yes Emil
PostMortem18Sept13 18th September 2013 Database services unavailable or degraded after the network intervention Yes Eva
PostMortem15Aug13 15th August 2013 Performance problems on the Opendays ticketing application Yes Giacomo
PostMortem24Jul13 24th July 2013 Several databases not available due to HW problem in dbnasr1132 Yes Nilo
PostMortem23Jul13 23rd July 2013 ACCLOG down due to failed ONTAP upgrade Yes Nilo
PostMortem15Jan13 15th January 2013 ADCR database suffered from lost writes yes (exceptions: castor databases) Emil
PostMortem04Apr13 04th April 2013 SIR application does not display correctly yes Nicolas
PostMortem04Mar13 04th March 2013 OVM hypervisor pool reboot no Giacomo
PostMortem28Feb13 28th February 2013 ACCLOG instance freeze no (reviewed) Anton
PostMortem22Feb13 22nd February 2013 Storage maintenance operation breaking OVM infrastructure yes Ruben
PostMortem04Jan13 04th January 2013 Accelerator databases unavailability caused by problems with technical network yes (reviewed) Szymon
PostMortem20Nov12 20th November 2012 Problem with VAULT storage upgrade yes (reviewed) Ruben
PostMortem05Oct12 5th October 2012 OVM problems from October to December 2012 yes (reviewed) Giacomo
PostMortem9July12 9th July 2012 Problem with CASTOR storage upgrade yes Eric
PostMortem24Jun12 24th June 2012 Problem with ACCCON (1 node hang) no Eric
PostMortem22May12 22nd May 2012 Problem with CMSONR database during snapshot taking Yes (reviewed) Kate
PostMortem24Apr12 24th April 2012 Problem during NAS upgrade. Some database services affected no Eric
PostMortem11Apr12 11th April 2012 Short unavailability of ACCMEAS & LASER databases Yes (reviewed) Eva
PostMortem2Apr12 2nd April 2012 Loss of dev and test servers in virtual server   Paul
PostMortem28Mar12 28th March 2012 Unexpected problems during Apex upgrade in ITCORE Yes (reviewed) Kate
PostMortem21Mar12 21st March 2012 Unexpected problems during Qualiac migration to Safe Host Yes (reviewed) Eva
PostMortem7Mar12 7th March 2012 Qualiac database (WCERNP) not available Yes pending decisions Eva
PostMortem24Nov11 24th November 2011 ADCR database unavailable Yes (reviewed) Eva
PostMortem21Oct11 21st October 2011 OVM attempted suicide   Ewan
PostMortem12Oct11 11th-12th October 2011 ATLR database high load Yes (reviewed) Eva
PostMortem10Oct11 15th September-10th October 2011 CMSR database node reboots Yes (reviewed) Jacek
PostMortem27Sep11 27th September 2011 CMSR database hung following vendor mistake during broken disk replacement Yes (reviewed) Eva
PostMortem9August11 9th August 2011 ACCLOG blocked due to lack of space yes (reviewed) Ruben
PostMortem21Jul11 21st July 2011 MySQL backend down yes (reviewed) Ruben
PostMortem28Apr11 28th April 2011 CMSR database unavailable Yes (reviewed) Eva
PostMortem15Mar11 15th March 2011 CMSR database unavailable Yes, related to PostMortem11Mar11 Jacek
PostMortem11Mar11 11th March 2011 CMSR database unavailable Yes (reviewed) Jacek
PostMortem26Feb11 26th February 2011 LHCBR database got stuck Yes (reviewed) Jacek
PostMortem28Jan11 28th January 2011 Aislogin not available Not finished Nicolas
PostMortem25Jan11 25th January 2011 LCGR extended downtime during db upgrade to 10.2.0.5 Yes (reviewed) Eva
PostMortem18Dec10ATLARC 18th-22nd December 2010 Issues with ATLARC DB following the power cut on 18th of December Yes (reviewed) Marcin
PostMortem18Dec10 18th December 2010 Power cut in the Computer Center   Eric
PostMortem16Dec10 16th December 2010 ATLR database affected by FC switch replacement Yes (reviewed) Marcin
PostMortem15Dec10 15th December 2010 Reboots of Instance 4 of ATLR database Yes (reviewed) Marcin
PostMortem02Nov10 2nd November 2010 Local filesystem got full on two LHCBR nodes Yes (reviewed) Marcin
PostMortem26Oct10 26th October 2010 Data inconsistency after recovery of SARA Yes (reviewed) Przemek
PostMortem15Sep10 14th September 2010 Real time downstream was not set for LFC replication No (reviewed) Marcin
PostMortem14Sep10 14th September 2010 Instability of node 2 and 4 of CMSR Yes, related to PostMortem20Aug10 (reviewed) Marcin
PostMortem10Sep10 10th September 2010 Replication for ATLAS conditions and LHCB conditions to SARA stopped Yes (reviewed) Marcin
PostMortem23Aug10 23rd August 2010 ATLAS data streaming to Tier1 sites stopped Yes (reviewed) Luca
PostMortem20Aug10 20th August 2010 Instability of node 3 and 4 of CMSR Yes (reviewed) Jacek
PostMortem11Aug10 11th August 2010 Correction of wrong shelfid No need Eric
PostMortem09Aug10 9th August 2010 LHCB Online database service extended downtime following power cut Yes (reviewed) Dawid
PostMortem19July10 19th July 2010 ~ 10:25-10:48 Loss of connectivity with Accelerators databases due to a network problem Being discussed with CS Ruben (reviewed)
PostMortem15Jul10 16th Jul 2010 BAAN LDAP login broken Yes (reviewed) Giacomo
PostMortem13July10 13th July 2010, 1:30-9:15 A few short interruptions of replication of CMS data from online to offline Yes (reviewed) Jacek
PostMortem26June10 26th June 2010, 15:00-16:00 9 services not failed over properly on ATLR after node eviction Yes but not updated Jacek
PostMortem24June10 24th June - 25th June 2010 LHCb propagation to PIC hung Yes (reviewed) Jacek
PostMortem02June10 2nd June - 3rd June 2010 Database issues after patching (April PSU) Yes (reviewed) Eva
PostMortem31May10 31st May - 2nd June 2010 Database issues during patching (April PSU) Yes (reviewed) Eva
PostMortem26May10 26th May 2010 CMSR node broken Yes (reviewed) Eva
PostMortem29Nov10 29th Nov 2010 Castor Name Server not available   Nilo
Topic revision: r99 - 2020-06-08 - AndreiDumitru
 