Storage upgrade on NetApp Vault clusters.

Description

As recommended by NetApp support, and following its procedure, an upgrade was attempted on two storage clusters. One upgrade succeeded without incident. The second failed, causing a downtime of about 20 minutes for several services, notably: devdb11, itcore (including LAS and rmdy), desfound_tcp, and histo.

The intervention was announced on the IT/SSB: https://itssb.web.cern.ch/planned-intervention/upgrade-ontap-version-two-nas-clusters-several-db-services-affected-lemonrac An incident report was published on the IT/SSB: http://itssb.web.cern.ch/service-incident/loss-db-services-during-storage-ugprade/20-11-2012

Impact

  • Connectivity to these databases was lost for about 20 minutes: devdb11, itcore (including LAS and rmdy), desfound_tcp, and histo. Users could not connect, but no committed transactions were lost.
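
For illustration, a minimal way to probe connectivity from a client host during such an outage, assuming the Oracle client tools are installed and that the affected service names resolve as TNS aliases (the polling loop is a hypothetical sketch, not part of the intervention):

    # One-shot listener probe; prints "OK (...)" if the listener answers
    $ tnsping devdb11

    # Poll until the service answers again
    $ while ! tnsping devdb11 | grep -q '^OK'; do sleep 30; done; echo "devdb11 reachable again"

Note that tnsping only exercises the listener; a full login test (e.g. with sqlplus) is needed to confirm the instance itself is open.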

Timeline of the incident

  • 20/11/2012 17:30 The software is ready and all firmware upgrades have been done online. The intervention starts. A takeover is attempted (a console sketch of the command sequence follows this timeline).
  • 20/11/2012 17:31 The takeover is forced. It succeeds. One node is upgraded with no issue.
  • 20/11/2012 17:32 I call NetApp support for advice, suspecting we are going to have a problem. The upgrade method is re-checked; it is fine. Support advises performing a 'system power off' in case the second node in that cluster hangs, and expects that the node will be taken over despite the operating-system mismatch.
  • 20/11/2012 17:45 Giveback is initiated. It succeeds.
  • 20/11/2012 17:53 The upgrade of the second node fails: the controller hangs. After a few minutes of waiting, following NetApp support's indication, a 'system power off' is issued. Unfortunately, the other node (already upgraded) does not take over.
  • 20/11/2012 18:25 The cluster is upgraded and back online.
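
For reference, a sketch of the console sequence behind the timeline above, assuming a Data ONTAP 7-mode HA pair. The node names (filer-a, filer-b) are placeholders, and the exact takeover flag used during the intervention is an assumption:

    # Confirm HA state before starting
    filer-b> cf status

    # 17:31 - force the takeover of the partner so it can boot the new release
    filer-b> cf takeover -f

    # 17:45 - return service to the upgraded node
    filer-b> cf giveback

    # 17:53 - the remaining node hangs during its own upgrade;
    # power-cycle it from its Service Processor / RLM, as advised by support
    SP filer-a> system power off
    SP filer-a> system power on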

Analysis

  • My conclusion is that there is some issue in ONTAP 8.0.2P1 that makes an upgrade to a major release fail. The main challenge is that once the second node needs to be upgraded, there is no command to force the operation or break out of its workflow: the node enters a kind of infinite loop, and the only way out I have found, confirmed by NetApp support, is to power-cycle the controller. The impact on the databases involved varies: a single-instance database froze, while a RAC database may reboot at least one node or likewise remain frozen. Either way, this implies downtime for the users.
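
As an illustration of that varying behaviour, a few standard Oracle checks a DBA could run after such a freeze; these are generic commands, not a transcript from the incident, and devdb11 is used only as a placeholder name (which of the affected databases are RAC is not stated above):

    # RAC: did any instance get evicted or restarted?
    $ srvctl status database -d devdb11

    # RAC: overall Clusterware health on a node
    $ crsctl check crs

    # Single instance: confirm the instance came back and is open
    $ echo "select status from v\$instance;" | sqlplus -S / as sysdba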

  • Upgrades of ONTAP are done for many reasons: bug fixes, firmware that is only supported together with newer releases, new features, etc.

  • Although it makes the whole process slower, a possible way to bypass this issue would be to perform a minor upgrade first, during which a force operation can be provoked, and only then attempt the major one (a sketch of this two-step path follows). This should be confirmed through NetApp case 2003719156. It would also mean understanding why the second controller in a cluster gets into a hanging state.
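
A sketch of that two-step path, again assuming 7-mode commands; the image file names are placeholders, and the flag choice reflects my understanding of the non-disruptive upgrade procedure rather than a NetApp-validated runbook:

    # Step 1: minor upgrade within the same major release,
    # where a forced takeover is still possible
    filer-a> software update <minor-image>.tgz -r
    filer-b> cf takeover -f
    filer-b> cf giveback
    # ...repeat for the partner...

    # Step 2: the major upgrade, using the version-mismatch takeover
    filer-a> software update <major-image>.tgz -r
    filer-b> cf takeover -n
    filer-b> cf giveback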

Follow up

  • A case has been opened with NetApp support: 2003719156.
  • My previous analysis remains the best approach for this type of upgrade:
"Although it makes the whole process slower, a possible way to bypass this issue would be to perform a minor upgrade first, during which a force operation can be provoked, and only then attempt the major one. This should be confirmed through NetApp case 2003719156. It would also mean understanding why the second controller in a cluster gets into a hanging state."