
Issues with ATLARC DB following the power cut at CERN CC

Description

The ATLAS archive DB (ATLARC) could not be restarted following the power cut at the CERN CC on 18 December 2010.

Impact

The following Oracle services were unavailable from 12:40 on 18 December 2010 to 00:00 on 23 December 2010: atlas_tags_writer, atlas_tags, atlas_tagsprod, atlas_arch.

Timeline of the incident

Saturday, 18th December 2010, 12:40 - ATLARC DB went completely down due to the power cut.
Saturday, 18th December 2010, 17:30 - standard procedures to restart the database failed.
Saturday, 18th December 2010, 20:00 - a severity 1 SR (service request) was opened with Oracle Support.
Sunday, 19th December 2010, 17:50 - it turned out that one of the redo logs needed for database crash recovery was corrupted. The archivelog produced from the corrupted redo log was also not usable for crash recovery. The database had to be restored from a tape backup, but the restore could not be started immediately due to problems with TSM63.
Monday, 20th December 2010, 13:30 - restore of the database from tape backup started. The database was restored using separate hardware.
Wednesday, 22nd December 2010, 20:00 - restore finished. Recovery was possible only up to 18.12.2010 at 7:30, as some backups sent to TSM after this time were lost.
Thursday, 23rd December 2010, 00:00 - database made available to users.
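
The durations implied by the timeline above can be worked out with a short calculation. This is only an illustrative sketch: the variable names are made up for this example, and all timestamps are assumed to be in the same local time.

```python
from datetime import datetime

# Timestamps copied from the timeline (same local time zone assumed).
crash = datetime(2010, 12, 18, 12, 40)            # DB went down
restore_started = datetime(2010, 12, 20, 13, 30)  # tape restore started
restore_finished = datetime(2010, 12, 22, 20, 0)  # tape restore finished
service_restored = datetime(2010, 12, 23, 0, 0)   # DB available to users
recovery_point = datetime(2010, 12, 18, 7, 30)    # latest recoverable time


def hours(delta):
    """Convert a timedelta to a number of hours."""
    return delta.total_seconds() / 3600


outage_hours = hours(service_restored - crash)
restore_hours = hours(restore_finished - restore_started)
data_loss_hours = hours(crash - recovery_point)

print(f"Service outage:      {outage_hours:.1f} h")     # 107.3 h
print(f"Tape restore:        {restore_hours:.1f} h")    # 54.5 h
print(f"Unrecoverable window: {data_loss_hours:.1f} h")  # 5.2 h
```

By this arithmetic the services were down for roughly four and a half days, of which the tape restore itself took about 54.5 hours. The gap between the 07:30 recovery point and the 12:40 crash comes to a little over five hours, somewhat more than the 4 hours quoted in the Analysis.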

Analysis

The root cause of the DB crash, and of the subsequent failure to recover the DB services, is related to the power cut and its interaction with the high DB load at the time of the incident and the hardware in use. In particular, a lost write at the OS level or, more likely, at the storage array level is suspected, as indicated by Oracle Support. Moreover, we cannot fully exclude that a software bug in the Oracle RDBMS created a corrupted log entry.

A second issue, which affected the restore time, was related to the on-disk backup: the DBA performing the recovery found these backups (which are intended to speed up recovery in case of corruption) in an unusable state. Additionally, the restore of the DB from tape backups was severely slowed down by the overload of the external tape system services (in particular of TSM63). The database could only be restored to a point in time 4 hours before the actual DB crash, as the backups of the archivelogs following that point had been lost by the tape backup system in a separate incident caused by the power cut. We understand from ATLAS that the missing data can be regenerated. Please note that a standby service is not available for ATLARC, as agreed in the SLA with ATLAS.

Follow up

Check the disk arrays originally used by the ATLARC DB for corruption issues. Consider moving the ATLARC DB services to critical power in the future. Add monitoring and procedures for checking the consistency of on-disk backups.

The ATLARC database was moved to a new backup system in October 2012 and is now backed up both to disk and to tape.

Topic revision: r5 - 2013-01-24 - MarcinBlaszczyk
