IT-DB Virtualisation Infrastructure degraded

Description

A pool of 19 OracleVM hypervisors located at SafeHost, hosting various production AIS application servers and databases on demand, experienced a series of spontaneous reboots over October and the beginning of November.

Impact

The affected services experienced interruptions ranging from 10 minutes (when there was enough capacity to restart the VM elsewhere) to half a day (worst case, when the whole pool came back in a bad state and heavy clean-up was needed).

Unless otherwise noted, the services were moved to physical hardware within one week from the start of the instabilities.

Production applications affected:

  • Qualiac
  • OracleHR
  • AISBatchP
  • Kitry (not moved to physical HW as the procedure is very complicated and involves an external company)

Production Oracle databases affected:

  • OracleHR

DBOD MySQL databases affected:

  • Drupalprod
  • Webcast
  • Mastro
  • Trac_svn
  • Hc_cms
  • Hc_lhcb
  • Hc_core
  • Puppet
  • Hc_atlas (moved after 2 weeks)
  • Nova (moved after 2 weeks)
  • Avisual (moved after 3 weeks)
  • Atlas_mc (moved after 3 weeks)
  • Aistoold (moved after 3 weeks)
  • Copvss (moved after 3 weeks)
  • Cstest1 (moved after 3 weeks)
  • Olabpvss (moved after 3 weeks)

Timeline of the incident

  • 05-Oct-2012 - first reboots, support request with Oracle opened
  • Until 12-Nov-2012 - further reboots, implementation of diagnostics, debugging and tentative fixes
  • 12-Nov-2012 - HA timeouts increased, as eventually suggested by Oracle Support

Analysis

As a foreword, it has to be noted that the affected pool had been running rock-solid for 6 months prior to the start of the problems.

The initial investigations showed no evidence of hardware or network problems on either the hypervisors or the storage infrastructure used to provide high availability (NFS). There was also absolutely no trace of the reboots on the hypervisors themselves: it was as if they had been cold-rebooted (syslog files truncated, etc.). We needed to gather more data and debugging information, while trying to reduce the service downtime as much as possible.

Oracle Support provided debugging and diagnostic tools, which were automated and deployed on all the hypervisors. At the same time we put in place extra monitoring to be warned earlier in case of further reboots, which helped and resulted in faster recoveries. As the problems kept recurring, we decided to start moving production services to physical hardware (see the notes under "Impact" above).
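
As an illustration of the kind of extra monitoring mentioned above, the sketch below (hypothetical, not the actual probe we deployed) can be run from cron on a hypervisor: it compares the current uptime with the time elapsed since the previous run and flags an unannounced reboot. The state file path and the alerting mechanism are placeholders.

    #!/usr/bin/env python
    # Hypothetical reboot detector (illustration only): run periodically from cron
    # on each hypervisor; it compares the current uptime with the time elapsed
    # since the previous run and flags an unannounced reboot if uptime is shorter.
    import os
    import time

    STATE_FILE = "/var/tmp/last_reboot_check"  # placeholder path

    def uptime_seconds():
        # /proc/uptime contains "<uptime> <idle>" in seconds
        with open("/proc/uptime") as f:
            return float(f.read().split()[0])

    def main():
        now = time.time()
        up = uptime_seconds()
        if os.path.exists(STATE_FILE):
            last_check = os.path.getmtime(STATE_FILE)
            if up < now - last_check:
                # Replace with the site alerting mechanism (mail, SNMP trap, ...)
                print("ALERT: unexpected reboot detected, uptime is only %.0f s" % up)
        # Touch the state file to record this check
        with open(STATE_FILE, "w") as f:
            f.write(str(now))

    if __name__ == "__main__":
        main()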

After a few weeks, the collection of a large amount of debugging information, and the installation of further diagnostic tools (including a few different kernels), Oracle Support suggested increasing the HA timeouts from the current 1 minute to 4 minutes. We performed the operation on November 12, and since then no further reboots have been observed.

We can therefore conclude that the root cause of the problem is the duration of the timeout set in the HA layer used by OracleVM (based on OCFS2). Specifically, the "heartbeat" functionality (used by every hypervisor to check the availability of the shared storage) appears to have triggered the reboots when the shared storage did not respond within the configured timeout.
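
For reference, in a standard OCFS2/o2cb setup the self-fencing timeout is usually derived from the O2CB_HEARTBEAT_THRESHOLD setting in /etc/sysconfig/o2cb, with timeout_seconds = (threshold - 1) * 2 (so roughly 31 for a 1-minute timeout and 121 for a 4-minute one). The small helper below only illustrates that arithmetic under this assumption; it is not the exact procedure applied to our pool.

    # Illustration only: the usual o2cb arithmetic relating the heartbeat dead
    # threshold (O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb) to the time a
    # node waits before self-fencing. Assumes the standard formula
    #   timeout_seconds = (threshold - 1) * 2
    # which may differ from the values actually tuned on our pool.

    def threshold_for_timeout(timeout_seconds):
        """Return the O2CB_HEARTBEAT_THRESHOLD needed for a given timeout."""
        return timeout_seconds // 2 + 1

    def timeout_for_threshold(threshold):
        """Return the self-fencing timeout (seconds) implied by a threshold."""
        return (threshold - 1) * 2

    if __name__ == "__main__":
        print(threshold_for_timeout(60))    # ~1 minute timeout -> 31
        print(threshold_for_timeout(240))   # ~4 minute timeout -> 121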

We have been able to measure the time taken to perform heartbeat operations, as shown in the graph below (note the logarithmic scale):

[Figure: Heartbeat duration count (attachment ovm_hb_1.png)]

We observed a few unusually high heartbeat durations, sometimes exceeding 1 minute. Initial investigations showed no match for those high values on the NFS storage infrastructure side, which at first suggested that the problem was linked with the NFS-related code used by OCFS2 when performing heartbeat operations. However, we then happened to reproduce the problem with a very peculiar workload on another volume hosted on the same NFS filer serving the OVM HA volume. The fact that the problem can be reproduced outside OCFS2, even though the storage infrastructure itself shows no evidence of it, points to a problem on the storage side.
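
As a rough illustration of how such write latencies can be observed from a client, the sketch below times small synchronous writes to a file on an NFS mount and reports slow samples; the mount point, thresholds and interval are placeholders, and this is not the exact workload with which we reproduced the problem.

    #!/usr/bin/env python
    # Rough latency probe (illustration only, not the workload used to reproduce
    # the problem): time a small synchronous write to a file on the NFS mount and
    # report samples that take unusually long, mimicking a heartbeat-style I/O.
    import os
    import time

    PROBE_FILE = "/nfs/ovm_pool/latency_probe"   # placeholder mount point
    THRESHOLD_S = 5.0                            # report anything slower than this
    INTERVAL_S = 2.0

    def probe_once():
        start = time.time()
        fd = os.open(PROBE_FILE, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.write(fd, b"x" * 512)   # heartbeat writes are small sector-sized I/Os
            os.fsync(fd)               # force the write out to the filer
        finally:
            os.close(fd)
        return time.time() - start

    if __name__ == "__main__":
        while True:
            elapsed = probe_once()
            if elapsed > THRESHOLD_S:
                print("slow write: %.1f s" % elapsed)
            time.sleep(INTERVAL_S)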

We have opened a case with NetApp support to work on that "high write timeout" behaviour.

While working on this case, Oracle Support has filed a number of bugs (also related to problems we experienced with some of the kernels provided during the investigations), which will be followed up internally by development and whose fixes will eventually end up in the mainline OracleVM code.

Follow up

  • The recommended timeouts have also been set on the other OracleVM pools
  • Kernel dumps have been enabled on the other OracleVM pools
  • We are in the process of gradually adding systems back to the problematic OracleVM pool, while keeping the situation closely monitored

Reference documents

  • the recommended values are about the same as the ones described in the "Oracle VM and NetApp Storage Best Practices Guide", http://www.netapp.com/us/media/tr-3712.pdf
  • Oracle "My Oracle Support" service request 3-6285488751
  • NetApp Support Case 2003833908