SHORT UNAVAILABILITY OF ACCMEAS & LASER DATABASES
Description
The Oracle virtual IPs (VIPs) and listeners of the ACCMEAS and LASER databases went down on the afternoon of Wednesday 11 April 2012, making those databases unavailable for about 5 minutes.
Impact
- It was not possible to make any new connections to the ACCMEAS and LASER databases. Existing connections were unaffected.
Timeline of the incident
- 11-Apr-12 16:46 - ACCMEAS and LASER database VIPs and listeners went down. The monitoring systems (RACMON and OEM) detected this immediately.
- 11-Apr-12 16:50 - Dawid manually restarted the VIPs and listeners.
Analysis
- A five-year-old configuration change in the Quattor templates introduced a hidden dependency, unnoticed until now, which led to a network restart on the ACCMEAS and LASER database servers.
- Further investigation showed that:
  - RHEL4 machines (dbracdes, dbraccastor and dbracacc clusters) were exposed to the same problem;
  - RHEL5 machines (dbgen3 and database clusters) did not have the same dependency, but two other components added a dependency on network: rac and sendmail.
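
As an illustration only, the kind of check that would expose such hidden dependencies can be sketched in Python. The sketch assumes a simplified model in which each configuration component lists the components it depends on, and dispatching a component also runs everything it depends on. The DEPENDS_ON map and both function names are hypothetical; the edges are only the ones named in this report, and this is not how the Quattor tooling itself is implemented.

```python
from collections import deque

# Hypothetical dependency map: component -> components it depends on.
# Only the edges mentioned in this report are listed; the real
# templates may define more.
DEPENDS_ON = {
    "nfs": ["network"],
    "rac": ["network"],
    "sendmail": ["network"],
    "dirperm": [],
    "network": [],
}


def components_pulled_in(component, depends_on):
    """Return every component reachable from `component` via dependencies."""
    seen = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def risky_components(depends_on, target="network"):
    """List components whose dispatch would also run `target`."""
    return sorted(
        name
        for name in depends_on
        if target in components_pulled_in(name, depends_on)
    )


if __name__ == "__main__":
    print(risky_components(DEPENDS_ON))
```

Under these assumptions, the script reports nfs, rac and sendmail as components that pull in network. Once their dependency lists are emptied, the result is an empty list.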
Follow up
- Remove the network component dependencies from other components:
  - On dbracdes, dbraccastor, dbracacc clusters (10g on RHEL4):
    - turn off the "dispatch" of dirperm
    - remove the nfs -> network dependency
  - On ncm-rac configuration (10g on RHEL4 and RHEL5):
    - remove the rac -> network dependency
  - On hardware bunch and service templates for dbgen3 and database clusters (10g and 11g on RHEL5):
    - remove the sendmail -> network dependency
- All actions implemented by Giacomo on Thursday 12 April (15:30).
- State explicitly in the operator instructions that the network must not be restarted on the database servers (a message will be displayed at login).
--
EvaDafonte - 13-Apr-2012