Qualiac database unavailable

Description

The Qualiac database was unavailable in the early morning of 07-Mar-2012. Interactive Qualiac users were unable to use the system.

Impact

The Qualiac application was not available. Overnight extractions from other databases did not run or did not complete, leading to manual data input by AIS.

Timeline of the incident

  • 06-Mar-12 18:00 - Anton Topurov: Scheduled database stop, creation of snapshots of the NAS volumes, and database start.
  • 07-Mar-12 03:16 - Enterprise Manager alarm on dump area used 97%
  • 07-Mar-12 03:27 - The volume mounted on /ORA/dbs00/WCERNP, which hosts the online redo logs, reached its maximum possible size (the autosize maximum) and the database crashed.
  • 07-Mar-12 03:40 - NAS alarm on volume full
  • 07-Mar-12 06:50 - Jan Janke (user) complains about access to Qualiac
  • 07-Mar-12 07:43 - Eric Grancher adds extra space to the volume, database available again
  • 07-Mar-12 07:48 - Jean-Luc Dublet restarts the Qualiac job launcher. Qualiac service fully available again.
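
The durations implied by the timeline above can be worked out with a short sketch. The timestamps are taken from the entries above; the variable and function names are illustrative only and not part of any Qualiac or monitoring tool.

```python
from datetime import datetime

# Timestamps copied from the timeline (all on 07-Mar-2012).
crash = datetime(2012, 3, 7, 3, 27)         # database crashed
first_report = datetime(2012, 3, 7, 6, 50)  # first user complaint
db_back = datetime(2012, 3, 7, 7, 43)       # space added, database available
service_back = datetime(2012, 3, 7, 7, 48)  # job launcher restarted


def minutes_between(start, end):
    """Return the whole number of minutes from start to end."""
    return int((end - start).total_seconds() // 60)


db_outage_min = minutes_between(crash, db_back)
service_outage_min = minutes_between(crash, service_back)
unreported_min = minutes_between(crash, first_report)

print(f"Database down: {db_outage_min} min")
print(f"Service down: {service_outage_min} min")
print(f"Crash to first user report: {unreported_min} min")
```

Under these timestamps the database was down for 4 h 16 min, and the Qualiac service as a whole for 4 h 21 min.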

Analysis

  • To facilitate the tests ahead of the migration of Qualiac to Safehost, a consistent copy of the Qualiac database was needed. To obtain it, NAS snapshot technology was used: snapshots were created on all the volumes while the database was down, and the content of the snapshots was then copied to the destination Safehost machine.
  • The intervention was done outside working hours. The copy of the datafiles was started at 18:30 and left to run overnight. The snapshots were to be removed the next morning, once the copy had finished.
  • On the assumption that snapshots use reserved space rather than the space of the volume itself, only 3GB of free space had been left on the volume hosting the online redo logs. By default, however, snapshots are not configured with reserved space, so the snapshot consumed the space of the volume itself.
  • The files in dbs00 total around 7GB (the online redo logs are 6 x 1GB), so the snapshot of this volume can grow to 6-7GB, depending on the activity of the database. Because of the scheduled extraction jobs that run during the night, the Qualiac database is quite active, and the snapshot of the redo logs therefore used all the available space on the volume.
  • Once the volume was full and could not autoextend any further, the database crashed because it could no longer write to the dump area or to the redo logs.
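
The space arithmetic behind this analysis can be sketched as follows. This is a simplified model, not the NAS vendor's accounting: it assumes a copy-on-write snapshot retains every block that is overwritten after the snapshot is taken, so its growth is bounded by the size of the files that existed at snapshot time. The figures (6 x 1GB redo logs, about 7GB of files, 3GB free) come from the analysis above; the function names are hypothetical.

```python
def snapshot_growth_gb(overwritten_gb, files_at_snapshot_gb):
    """Space a snapshot retains under the simplified model: every
    overwritten block is kept, capped at the size of the snapshotted files."""
    return min(overwritten_gb, files_at_snapshot_gb)


def headroom_gb(free_gb, growth_gb):
    """Free space left on the volume after snapshot growth.
    A negative result means the volume fills up."""
    return free_gb - growth_gb


redo_groups = 6        # number of online redo logs (from the analysis)
redo_size_gb = 1       # size of each redo log in GB
files_in_dbs00_gb = 7  # approximate total size of files in dbs00
free_gb = 3            # free space left on the volume

# One full cycle through all redo logs overwrites every redo block once.
redo_total_gb = redo_groups * redo_size_gb
growth = snapshot_growth_gb(redo_total_gb, files_in_dbs00_gb)
remaining = headroom_gb(free_gb, growth)

print(f"Snapshot growth after one full redo cycle: {growth} GB")
print(f"Headroom on the volume: {remaining} GB")
```

With these numbers the headroom is negative by 3GB: a single full cycle through the redo logs is enough to exhaust the free space, which is consistent with the volume filling up during the active overnight extraction window.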

Follow up

  • New procedures being discussed - to be updated!

-- AntonTopurov - 14-Mar-2012

Topic revision: r3 - 2012-03-27 - EvaDafonte
 