Database issues during patching, May 31st 2010 - June 2nd 2010
Description
Several issues were discovered during the patching of a number of databases. Some were related to the update of the OS packages (with no impact on the database services), others to the Oracle patches.
Impact
- CMSONR, LCGR and ATLR: During patching, cluster-wide issues affected connectivity and the use of services on these databases. As a consequence, the interventions were not rolling.
Time line of the incident
- 31.05.2010 - quattor issues during the CMSONR kernel updates. Clusterware misbehavior during the deployment of the Oracle patches: all nodes blocked.
- 31.05.2010 - CMSR: after patching, the cluster partition was missing on one node and the instance could not start.
- 01.06.2010 - issues during the LCGR and ATLR patching.
Analysis
- CMSONR database patching on 31.05: A problem with the quattor profiles was discovered during the ccm-fetch run: the quattor certificate had expired, which made it impossible to run the OS updates.
- CMSR database patching on 31.05: After patching one of the cluster nodes, the cluster partition disappeared and the instance could not start. This was fixed by adding the partition back. No impact on the database services.
- The cluster-wide issues that affected connectivity and the use of services on the CMSONR, LCGR and ATLR databases are being investigated. Many other databases were patched successfully: all test and integration databases, and the CMSR, ATONR, ATLARC, LHCBONR, LHCBR and downstream production databases. The problem might be related to the number of nodes per cluster.
- In addition, a filesystem (FS) label misconfiguration leading to a kernel panic was discovered during the reboot of some nodes.
Follow up
- The quattor certificate problem is being investigated with quattor support.
- FS label misconfiguration: ticket opened with the sysadmins and list of affected machines provided; fixed on Friday 4th June 2010.
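As an illustration only, an expired certificate of the kind reported above could be caught before an intervention with a small pre-check. The sketch below is not part of the quattor tooling. It assumes the expiry date has already been read from the certificate as text, in the format printed by `openssl x509 -enddate -noout` (the part after `notAfter=`). The function names, the 14-day threshold and the dates in the usage lines are hypothetical.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Return the number of whole days left before a certificate expires.

    not_after: expiry date as text, e.g. "Jun 30 00:00:00 2010 GMT".
    now: reference time in seconds since the epoch (defaults to the current time).
    A negative result means the certificate has already expired.
    """
    expiry = ssl.cert_time_to_seconds(not_after)  # parsed as UTC
    if now is None:
        now = time.time()
    return int((expiry - now) // SECONDS_PER_DAY)


def needs_renewal(not_after, min_days=14, now=None):
    """True if the certificate expires in fewer than min_days (hypothetical threshold)."""
    return days_until_expiry(not_after, now) < min_days


# Usage with made-up dates (the real expiry date is not given in this report):
reference = ssl.cert_time_to_seconds("May 31 00:00:00 2010 GMT")
if needs_renewal("May 30 12:00:00 2010 GMT", now=reference):
    print("certificate expired or close to expiry: renew before patching")
```

Running such a check a few days before a scheduled patching would leave time to renew the certificate without delaying the OS updates.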
--
EvaDafonte - 07-Jun-2010