ACCLOG frozen instance 2, LHCLOGDB service unavailable
Description
ACCLOG instance number 2 was inaccessible due to a memory problem. The LHCLOGDB service running on this instance was unavailable.
Impact
- Service LHCLOGDB was inaccessible
Time line of the incident
- 28-Feb-13 04:03 - ACCLOG2 instance froze due to problems with memory
- 28-Feb-13 04:19 - RACMON SMS alert sent to the shift phone.
- 28-Feb-13 04:25 - Investigation started
- 28-Feb-13 04:44 - Service LHCLOGDB manually relocated to the surviving instance.
- 28-Feb-13 04:48 - Instance killed by the person on shift.
- 28-Feb-13 04:48 - Automatic restart by clusterware failed with "unable to allocate Large Pages".
- 28-Feb-13 04:53 - Successful manual start of the instance.
- 28-Feb-13 05:12 - Service LHCLOGDB relocated to its preferred instance.
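The intervals implied by the time line can be worked out with a short sketch. It assumes that the service became usable again as soon as it was relocated at 04:44, which the report does not state explicitly; the variable names are illustrative only.

```python
from datetime import datetime

# Timestamp format used in the time line, e.g. "28-Feb-13 04:03".
# Note: %b relies on English month abbreviations in the current locale.
FMT = "%d-%b-%y %H:%M"


def minutes_between(start, end):
    """Return the whole minutes elapsed between two time line entries."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


freeze = "28-Feb-13 04:03"     # ACCLOG2 instance froze
alert = "28-Feb-13 04:19"      # RACMON alert sent
relocated = "28-Feb-13 04:44"  # service moved to the surviving instance
restored = "28-Feb-13 05:12"   # service back on its preferred instance

detection_delay = minutes_between(freeze, alert)
# Assumption: the service was reachable again once relocated.
service_outage = minutes_between(freeze, relocated)
time_to_normal = minutes_between(freeze, restored)

print(detection_delay, service_outage, time_to_normal)
```

Under that assumption the service was unavailable for 41 minutes, of which the first 16 passed before the alert was sent.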
Analysis
- ACCLOG2 instance was inaccessible from 04:03. Monitoring reported:
acclog: Error monitoring service lhclog: ORA-01034: ORACLE not available
acclog: Error monitoring service lhclog: ORA-27123: unable to attach to shared memory segment
Process W000 died, see its trace file
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
- Clusterware did not notice any anomalies, so the LHCLOGDB service was not automatically relocated to the surviving instance 1 and remained unavailable.
- Connections to the instance failed, so it had to be killed. The restart of the instance by clusterware then failed on huge pages allocation.
- A manual restart a few minutes later did not encounter memory problems and succeeded.
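Since the automatic restart failed on huge pages allocation while a manual start a few minutes later succeeded, a pre-start check of the free huge pages could show whether a start is likely to hit the same error. The following is a minimal sketch, assuming a Linux host that exposes the usual huge page counters in /proc/meminfo. The sample values and the SGA sizes are invented for illustration and are not taken from the incident.

```python
import re


def parse_hugepages(meminfo_text):
    """Extract huge page counters from /proc/meminfo-style text."""
    fields = {}
    for key in ("HugePages_Total", "HugePages_Free",
                "HugePages_Rsvd", "Hugepagesize"):
        match = re.search(rf"^{key}:\s+(\d+)", meminfo_text, re.MULTILINE)
        if match:
            fields[key] = int(match.group(1))
    return fields


def enough_free_hugepages(meminfo_text, sga_bytes):
    """Return True if the unreserved free huge pages can hold sga_bytes."""
    info = parse_hugepages(meminfo_text)
    page_bytes = info["Hugepagesize"] * 1024  # Hugepagesize is given in kB
    needed = -(-sga_bytes // page_bytes)      # ceiling division
    # Reserved pages are still counted as free, so subtract them.
    available = info["HugePages_Free"] - info.get("HugePages_Rsvd", 0)
    return available >= needed


# Illustrative sample, not data from the incident.
SAMPLE = """\
HugePages_Total:    4096
HugePages_Free:     1024
HugePages_Rsvd:        0
Hugepagesize:       2048 kB
"""

GIB = 1024 ** 3
print(enough_free_hugepages(SAMPLE, 1 * GIB))  # 512 pages needed
print(enough_free_hugepages(SAMPLE, 4 * GIB))  # 2048 pages needed
```

On a real host the text would be read from /proc/meminfo, and the required size would come from the instance's actual memory configuration.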
Follow up
- All IT-DB monitoring tools (RACMON, EM11, legacy scripts) detected the problem quickly.
- Clusterware does not detect such problems.
- Service relocation in such cases is to be implemented with scripts; this is not very straightforward.
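One possible shape for such a script is sketched below. It is not the implemented solution: it only decides, from the text of a monitoring probe, whether to build an `srvctl relocate service` command, and it does not execute anything. The database name `ACCLOG` and the instance names `ACCLOG1` and `ACCLOG2` are assumptions, and the function names are hypothetical. The two error codes are the ones reported during this incident.

```python
# Error codes seen in the monitoring output during this incident.
FATAL_PATTERNS = ("ORA-01034", "ORA-27123")


def instance_looks_frozen(probe_output):
    """Return True if the probe output contains one of the known errors."""
    return any(code in probe_output for code in FATAL_PATTERNS)


def relocation_command(db, service, old_inst, new_inst):
    """Build the srvctl call that moves a service between instances."""
    return ["srvctl", "relocate", "service",
            "-d", db, "-s", service,
            "-i", old_inst, "-t", new_inst]


def plan_relocation(probe_output, db, service, old_inst, new_inst):
    """Return the command to run, or None if no action is needed."""
    if instance_looks_frozen(probe_output):
        return relocation_command(db, service, old_inst, new_inst)
    return None


probe = ("acclog: Error monitoring service lhclog: "
         "ORA-01034: ORACLE not available")
# Database and instance names below are assumed, not confirmed.
command = plan_relocation(probe, "ACCLOG", "LHCLOGDB", "ACCLOG2", "ACCLOG1")
print(command)
```

Keeping the decision separate from the execution makes it possible to log or review the planned command before anything is run, which matters because a false positive would move a healthy service.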