IT-DB Virtualisation Infrastructure degraded

Description

A pool of 19 OracleVM hypervisors located at SafeHost, hosting various production AIS application servers and databases on demand, experienced a series of spontaneous reboots over October and the beginning of November.

Impact

The affected services experienced interruptions ranging from 10 minutes (when there was enough capacity to restart the VM elsewhere) to half a day (worst case, when the whole pool came back in a bad state and heavy clean-up was needed).

Unless otherwise noted, the services were moved to physical hardware within one week from the start of the instabilities.

Production applications affected:

  • Qualiac
  • OracleHR
  • AISBatchP
  • Kitry (not moved to physical HW as the procedure is very complicated and involves an external company)

Production Oracle databases affected:

  • OracleHR

DBOD MySQL databases affected:

  • Drupalprod
  • Webcast
  • Mastro
  • Trac_svn
  • Hc_cms
  • Hc_lhcb
  • Hc_core
  • Puppet
  • Hc_atlas (moved after 2 weeks)
  • Nova (moved after 2 weeks)
  • Avisual (moved after 3 weeks)
  • Atlas_mc (moved after 3 weeks)
  • Aistoold (moved after 3 weeks)
  • Copvss (moved after 3 weeks)
  • Cstest1 (moved after 3 weeks)
  • Olabpvss (moved after 3 weeks)

Timeline of the incident

  • 05-Oct-2012 - first reboots, support request with Oracle opened
  • Until 12-Nov-2012 - further reboots, implementation of diagnostics, debugging and tentative fixes
  • 12-Nov-2012 - HA timeouts increased, as eventually suggested by Oracle Support

Analysis

As a foreword, it has to be noted that the affected pool had been running rock-solid for 6 months prior to the start of the problems.

The initial investigations showed no evidence of hardware or network problems on either the hypervisors or the storage infrastructure used to provide high availability (NFS). There was also absolutely no trace of the reboots on the hypervisors themselves: it was as if they had been cold-rebooted (syslog files truncated, etc.). We needed to gather more data and debugging information, while trying to reduce the service downtime as much as possible.

Oracle Support provided debugging and diagnostic tools, which were automated and deployed on all the hypervisors. At the same time we put in place extra monitoring to be warned earlier in case of further reboots, which helped and resulted in faster recoveries. As the problems kept recurring, we decided to start moving production services to physical hardware (see the notes under "Impact" above).
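
As an illustration of the kind of extra monitoring mentioned above, the sketch below (hypothetical, not the actual probe we deployed) can be run from cron on a hypervisor: it compares the current uptime with the time elapsed since the previous run and flags an unannounced reboot. The state file path and the alerting mechanism are placeholders.

    #!/usr/bin/env python
    # Hypothetical reboot detector (illustration only): run periodically from cron
    # on each hypervisor; it compares the current uptime with the time elapsed
    # since the previous run and flags an unannounced reboot if uptime is shorter.
    import os
    import time

    STATE_FILE = "/var/tmp/last_reboot_check"  # placeholder path

    def uptime_seconds():
        # /proc/uptime contains "<uptime> <idle>" in seconds
        with open("/proc/uptime") as f:
            return float(f.read().split()[0])

    def main():
        now = time.time()
        up = uptime_seconds()
        if os.path.exists(STATE_FILE):
            last_check = os.path.getmtime(STATE_FILE)
            if up < now - last_check:
                # Replace with the site alerting mechanism (mail, SNMP trap, ...)
                print("ALERT: unexpected reboot detected, uptime is only %.0f s" % up)
        # Touch the state file to record this check
        with open(STATE_FILE, "w") as f:
            f.write(str(now))

    if __name__ == "__main__":
        main()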

After a few weeks, the collection of a large amount of debugging information, and the installation of further diagnostic tools (including a few different kernels), Oracle Support suggested increasing the HA timeouts from the current 1 minute to 4 minutes. We performed the operation on November 12, and since then no further reboots have been observed.

We can therefore conclude that the root cause of the problem is the duration of the timeout set in the HA layer used by OracleVM (based on OCFS2). Specifically, the "heartbeat" functionality (used by every hypervisor to check the availability of the shared storage) appears to have triggered the reboots when the shared storage did not respond within the configured timeout.
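
For reference, in a standard OCFS2/o2cb setup the self-fencing timeout is usually derived from the O2CB_HEARTBEAT_THRESHOLD setting in /etc/sysconfig/o2cb, with timeout_seconds = (threshold - 1) * 2 (so roughly 31 for a 1-minute timeout and 121 for a 4-minute one). The small helper below only illustrates that arithmetic under this assumption; it is not the exact procedure applied to our pool.

    # Illustration only: the usual o2cb arithmetic relating the heartbeat dead
    # threshold (O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb) to the time a
    # node waits before self-fencing. Assumes the standard formula
    #   timeout_seconds = (threshold - 1) * 2
    # which may differ from the values actually tuned on our pool.

    def threshold_for_timeout(timeout_seconds):
        """Return the O2CB_HEARTBEAT_THRESHOLD needed for a given timeout."""
        return timeout_seconds // 2 + 1

    def timeout_for_threshold(threshold):
        """Return the self-fencing timeout (seconds) implied by a threshold."""
        return (threshold - 1) * 2

    if __name__ == "__main__":
        print(threshold_for_timeout(60))    # ~1 minute timeout -> 31
        print(threshold_for_timeout(240))   # ~4 minute timeout -> 121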

We have been able to measure the time taken to perform heartbeat operations, as shown in the graph below (note the logarithmic scale):

[Figure: Heartbeat duration count (attachment ovm_hb_1.png)]

We observed a few unusually high heartbeat durations, sometimes exceeding 1 minute. Initial investigations showed no match for those high values on the NFS storage infrastructure side, which at first suggested that the problem was linked with the NFS-related code used by OCFS2 when performing heartbeat operations. However, we then happened to reproduce the problem with a very peculiar workload on another volume hosted on the same NFS filer serving the OVM HA volume. The fact that the problem can be reproduced outside OCFS2, even though the storage infrastructure itself shows no evidence of it, points to a problem on the storage side.
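
As a rough illustration of how such write latencies can be observed from a client, the sketch below times small synchronous writes to a file on an NFS mount and reports slow samples; the mount point, thresholds and interval are placeholders, and this is not the exact workload with which we reproduced the problem.

    #!/usr/bin/env python
    # Rough latency probe (illustration only, not the workload used to reproduce
    # the problem): time a small synchronous write to a file on the NFS mount and
    # report samples that take unusually long, mimicking a heartbeat-style I/O.
    import os
    import time

    PROBE_FILE = "/nfs/ovm_pool/latency_probe"   # placeholder mount point
    THRESHOLD_S = 5.0                            # report anything slower than this
    INTERVAL_S = 2.0

    def probe_once():
        start = time.time()
        fd = os.open(PROBE_FILE, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            os.write(fd, b"x" * 512)   # heartbeat writes are small sector-sized I/Os
            os.fsync(fd)               # force the write out to the filer
        finally:
            os.close(fd)
        return time.time() - start

    if __name__ == "__main__":
        while True:
            elapsed = probe_once()
            if elapsed > THRESHOLD_S:
                print("slow write: %.1f s" % elapsed)
            time.sleep(INTERVAL_S)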

We have opened a case with NetApp support to work on that "high write timeout" behaviour.

While working on this case, Oracle Support has filed a number of bugs (also related to problems we experienced with some of the kernels provided during the investigations), which will be followed up internally by development and whose fixes will eventually end up in the mainline OracleVM code.

Follow up

  • The recommended timeouts have also been set on the other OracleVM pools
  • Kernel dumps have been enabled on the other OracleVM pools
  • We are in the process of gradually adding systems back to the problematic OracleVM pool, while keeping the situation closely monitored

Reference documents

  • the recommended values are about the same as the ones described in the "Oracle VM and NetApp Storage Best Practices Guide", http://www.netapp.com/us/media/tr-3712.pdf
  • Oracle "My Oracle Support" service request 3-6285488751
  • NetApp Support Case 2003833908