Link |
Date (of incident) |
Brief Description |
Follow-up implemented? |
Owner |
PostMortem06June2020 |
6 June 2020 |
Unable to access ATONR database from ATLAS Technical Network |
no |
Andrei |
PostMortem27May2020 |
27th May 2020 |
RAC52 storage outage |
no |
Eva |
PostMortem19Nov17 |
19th November 2017 |
RAC52 storage outage |
no |
Miro |
PostMortem21Jan16 |
21st January 2016 |
RAC51 storage switch extension outage |
no |
Paul |
PostMortem07Jan16 |
7th Jan 2016 |
Cinder database restored from backup |
no |
David |
PostMortem26May15 |
26th May 2015 |
Qualiac and other long-term export files deleted from TSM |
Yes |
Szymon |
PostMortem18Oct14 |
18th October 2014 |
DBoD VOMS database unavailable |
Yes |
Eva, Ignacio |
PostMortem4Oct14 |
4th October 2014 |
DBoD FTS3 database unavailable |
Yes |
Eva, Ruben |
PostMortem7Jun14 |
7th June 2014 |
Multiple node restarts after power cut |
Yes |
Luca |
PostMortem28Dec13 |
28th December 2013 |
ADCR_ADG_RAC7 database suffered from multiple block corruptions after disk failures |
Yes |
Emil |
PostMortem18Sept13 |
18th September 2013 |
Database services unavailable or degraded after the network intervention |
Yes |
Eva |
PostMortem15Aug13 |
15th August 2013 |
Performance problems on the Opendays ticketing application |
Yes |
Giacomo |
PostMortem24Jul13 |
24th July 2013 |
Several databases not available due to HW problem in dbnasr1132 |
Yes |
Nilo |
PostMortem23Jul13 |
23rd July 2013 |
ACCLOG down due to failed ONTAP upgrade |
Yes |
Nilo |
PostMortem15Jan13 |
15th January 2013 |
ADCR database suffered from lost writes |
yes (exceptions: castor databases) |
Emil |
PostMortem04Apr13 |
04th April 2013 |
SIR application does not display correctly |
yes |
Nicolas |
PostMortem04Mar13 |
04th March 2013 |
OVM hypervisor pool reboot |
no |
Giacomo |
PostMortem28Feb13 |
28th February 2013 |
ACCLOG instance freeze |
no (reviewed) |
Anton |
PostMortem22Feb13 |
22nd February 2013 |
Storage maintenance operation breaking OVM infrastructure |
yes |
Ruben |
PostMortem04Jan13 |
04th January 2013 |
Accelerator databases unavailability caused by problems with technical network |
yes (reviewed) |
Szymon |
PostMortem20Nov12 |
20th November 2012 |
Problem with VAULT storage upgrade |
yes (reviewed) |
Ruben |
PostMortem05Oct12 |
5th October 2012 |
OVM problems from October to December 2012 |
yes (reviewed) |
Giacomo |
PostMortem9July12 |
9th July 2012 |
Problem with CASTOR storage upgrade |
yes |
Eric |
PostMortem24Jun12 |
24th June 2012 |
Problem with ACCCON (1 node hang) |
no |
Eric |
PostMortem22May12 |
22nd May 2012 |
Problem with CMSONR database during snapshot taking |
Yes (reviewed) |
Kate |
PostMortem24Apr12 |
24th April 2012 |
Problem during NAS upgrade. Some database services affected |
no |
Eric |
PostMortem11Apr12 |
11th April 2012 |
Short unavailability of ACCMEAS & LASER databases |
Yes (reviewed) |
Eva |
PostMortem2Apr12 |
2nd April 2012 |
Loss of dev and test servers in virtual server |
|
Paul |
PostMortem28Mar12 |
28th March 2012 |
Unexpected problems during Apex upgrade in ITCORE |
Yes (reviewed) |
Kate |
PostMortem21Mar12 |
21st March 2012 |
Unexpected problems during Qualiac migration to Safe Host |
Yes (reviewed) |
Eva |
PostMortem7Mar12 |
7th March 2012 |
Qualiac database (WCERNP) not available |
Yes, pending decisions |
Eva |
PostMortem24Nov11 |
24th November 2011 |
ADCR database unavailable |
Yes (reviewed) |
Eva |
PostMortem21Oct11 |
21st October 2011 |
OVM attempted suicide |
|
Ewan |
PostMortem12Oct11 |
11th-12th October 2011 |
ATLR database high load |
Yes (reviewed) |
Eva |
PostMortem10Oct11 |
15th September-10th October 2011 |
CMSR database node reboots |
Yes (reviewed) |
Jacek |
PostMortem27Sep11 |
27th September 2011 |
CMSR database hung following vendor mistake during broken disk replacement |
Yes (reviewed) |
Eva |
PostMortem9August11 |
9th August 2011 |
ACCLOG blocked due to lack of space |
yes (reviewed) |
Ruben |
PostMortem21Jul11 |
21st July 2011 |
MySQL backend down |
yes (reviewed) |
Ruben |
PostMortem28Apr11 |
28th April 2011 |
CMSR database unavailable |
Yes (reviewed) |
Eva |
PostMortem15Mar11 |
15th March 2011 |
CMSR database unavailable |
Yes, related to PostMortem11Mar11 |
Jacek |
PostMortem11Mar11 |
11th March 2011 |
CMSR database unavailable |
Yes (reviewed) |
Jacek |
PostMortem26Feb11 |
26th February 2011 |
LHCBR database got stuck |
Yes (reviewed) |
Jacek |
PostMortem28Jan11 |
28th January 2011 |
Aislogin not available |
Not finished |
Nicolas |
PostMortem25Jan11 |
25th January 2011 |
LCGR extended downtime during db upgrade to 10.2.0.5 |
Yes (reviewed) |
Eva |
PostMortem18Dec10ATLARC |
18th-22nd December 2010 |
Issues with ATLARC DB following the power cut on 18th of December |
Yes (reviewed) |
Marcin |
PostMortem18Dec10 |
18th December 2010 |
Power cut in the Computer Center |
|
Eric |
PostMortem16Dec10 |
16th December 2010 |
ATLR database affected by FC switch replacement |
Yes (reviewed) |
Marcin |
PostMortem15Dec10 |
15th December 2010 |
Reboots of Instance 4 of ATLR database |
Yes (reviewed) |
Marcin |
PostMortem02Nov10 |
2nd November 2010 |
Local filesystem got full on two LHCBR nodes |
Yes (reviewed) |
Marcin |
PostMortem26Oct10 |
26th October 2010 |
Data inconsistency after recovery of SARA |
Yes (reviewed) |
Przemek |
PostMortem15Sep10 |
14th September 2010 |
Real time downstream was not set for LFC replication |
No (reviewed) |
Marcin |
PostMortem14Sep10 |
14th September 2010 |
Instability of nodes 2 and 4 of CMSR |
Yes, related to PostMortem20Aug10 (reviewed) |
Marcin |
PostMortem10Sep10 |
10th September 2010 |
Replication for ATLAS conditions and LHCB conditions to SARA stopped |
Yes (reviewed) |
Marcin |
PostMortem23Aug10 |
23rd August 2010 |
ATLAS data streaming to Tier1 sites stopped |
Yes (reviewed) |
Luca |
PostMortem20Aug10 |
20th August 2010 |
Instability of nodes 3 and 4 of CMSR |
Yes (reviewed) |
Jacek |
PostMortem11Aug10 |
11th August 2010 |
Correction of wrong shelfid |
No need |
Eric |
PostMortem09Aug10 |
9th August 2010 |
LHCB Online database service extended downtime following power cut |
Yes (reviewed) |
Dawid |
PostMortem19July10 |
19th July 2010 ~ 10:25-10:48 |
Loss of connectivity with Accelerators databases due to a network problem |
Being discussed with CS (reviewed) |
Ruben |
PostMortem15Jul10 |
16th Jul 2010 |
BAAN LDAP login broken |
Yes (reviewed) |
Giacomo |
PostMortem13July10 |
13th July 2010, 1:30-9:15 |
A few short interruptions of replication of CMS data from online to offline |
Yes (reviewed) |
Jacek |
PostMortem26June10 |
26th June 2010, 15:00-16:00 |
9 services did not fail over properly on ATLR after node eviction |
Yes, but not updated |
Jacek |
PostMortem24June10 |
24th June - 25th June 2010 |
LHCb propagation to PIC hung |
Yes (reviewed) |
Jacek |
PostMortem02June10 |
2nd June - 3rd June 2010 |
Database issues after patching (April PSU) |
Yes (reviewed) |
Eva |
PostMortem31May10 |
31st May - 2nd June 2010 |
Database issues during patching (April PSU) |
Yes (reviewed) |
Eva |
PostMortem26May10 |
26th May 2010 |
CMSR node broken |
Yes (reviewed) |
Eva |
PostMortem29Nov10 |
29th Nov 2010 |
Castor Name Server not available |
|
Nilo |