Problem during a NAS upgrade. Several database services were affected: Drupal, EDH, AIS RAC (IMPACT, Foundation, PPT, etc.), NOVA, hammercloud (Atlas) and others.

Description

This intervention was required in order to upgrade the BIOS and the NVRAM firmware on the pair of storage controllers dbnasg403/g404. It required a restart of the dbnasg403 controller from the high-availability pair. Unfortunately this restart did not proceed normally: the second controller did not take over the load automatically. After waiting 7 minutes, and after checking with our local NetApp support contact, we decided to launch a takeover from the partner node to speed up the failover. This took about 5 more minutes, so for just over 12 minutes the controller was not serving data. The upgrade was done following the NetApp instructions and in the same way as it had been done for dbnasg401/g402 (for which the intervention was transparent for user applications).

The announcement of the intervention was posted on http://itssb.web.cern.ch/planned-intervention/firmware-upgrade-some-components-safe-host-controllers/24-04-2012. The incident was posted (thanks to Derek and Catherine) on http://itssb.web.cern.ch/service-incident/problems-drupal-and-ais-applications/04-04-2012.

Phone contact was established with some application service managers (e.g. Drupal and AIS) during the intervention. Unfortunately the IT SSB itself is not available when disruptive operations are carried out on the underlying storage supporting that application. We were in contact with Catherine Delamare to communicate the status.

Impact

This unfortunately led to a large number of database services not serving requests during this time: the Drupal MySQL database, the AIS RAC and EDH Oracle databases, the NOVA and hammercloud (Atlas) MySQL databases, etc.

Although we tried to recover rapidly from the controller freeze, some database services (Drupal, AIS RAC (Foundation, IMPACT, EDH), Database on Demand) were affected for around 15 minutes. Some of the AIS applications had to be restarted, which led to up to 90 minutes of unavailability. One of the follow-ups is to understand why the application servers and applications could not reconnect automatically as they should have done (and as has happened in the past).
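
As a rough illustration of the reconnection logic that was missing, the sketch below (Python with the cx_Oracle driver) retries a failed connection for a bounded time instead of giving up at the first error. All names, credentials and timeouts are placeholders and not the actual AIS or IMPACT configuration, which would normally handle this through connection-pool settings rather than a hand-written loop.

  import time
  import cx_Oracle  # Oracle client driver; any driver raising an error on a dead session behaves similarly

  # Placeholder connection details -- not the real AIS/IMPACT configuration.
  USER = "app_user"
  PASSWORD = "app_password"
  DSN = "db-cluster.example.org:1521/app_service"   # EZConnect string: host:port/service_name

  def connect_with_retry(max_wait_seconds=900, delay_seconds=10):
      """Keep trying to (re)connect until the database answers or the retry budget is exhausted."""
      deadline = time.time() + max_wait_seconds
      while True:
          try:
              return cx_Oracle.connect(USER, PASSWORD, DSN)
          except cx_Oracle.DatabaseError:
              if time.time() >= deadline:
                  raise   # the outage lasted longer than we are willing to wait
              time.sleep(delay_seconds)   # back off, then try again

  # Used in place of a one-shot connect, a short storage freeze no longer forces
  # an application-server restart: the session is simply re-established.
  connection = connect_with_retry()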

Time line of the incident

NAS timeline

  • Tue Apr 24 17:01:01: Intervention starts on dbnasg401. We followed the NetApp procedure starting with a 'reboot'; the partner does the takeover in about 25 seconds. This is the normal behaviour. Both the BIOS and the NVRAM firmware are upgraded.
  • Tue Apr 24 17:09:13: Giveback is initiated and dbnasg401 is back and serving data.
  • Tue Apr 24 17:13:56: The same procedure is applied on dbnasg403. A 'reboot' is initiated that should trigger a takeover by the partner node (dbnasg404), as on the previous cluster. After a few minutes (less than 5) there is still no takeover. We called our local NetApp support person in order to determine the impact of forcing a takeover. About 3 minutes later we proceeded to force the takeover, which took place another 3 minutes later (see the sketch after this timeline).
  • Tue Apr 24 17:24:11: The takeover of dbnasg403 finally took place. Data access to this controller's file systems should have resumed around that time. We then faced another problem with 'some active data on the NVRAM'. This cannot be addressed at this point of the upgrade, as the node has already been taken over. This is exactly what we tried to avoid, and why we opened NetApp case 2002847075 a few months ago, which led to the intervention procedure we applied on the shosts clusters. It is a kind of bug on FAS3240 controllers. We tried to boot the system in maintenance mode and halted again in an attempt to clean the NVRAM SSD. This did not work either. During this operation the partner node had taken over, so there was no problem with data access to the file systems on this controller.
  • Tue Apr 24 17:55:32: We decided to continue the intervention on dbnasg404 by forcing a takeover. This would also confirm that the cluster is in a good state. The takeover works fine and the system upgrades the BIOS and the NVRAM firmware.
  • Tue Apr 24 18:10:20: Intervention is over.
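
The manual decision described above (wait for the automatic takeover, then force one from the partner) can be summarised as a small monitoring sketch. This is only an illustration in Python: it assumes password-less SSH access to the filers and a 7-Mode style 'cf status' / 'cf takeover' command pair; the exact commands, options and output strings depend on the Data ONTAP release, so everything below should be treated as a placeholder rather than the procedure actually used.

  import subprocess
  import time

  PARTNER = "dbnasg404"        # node expected to take over the load
  WAIT_BEFORE_FORCING = 300    # seconds granted to the automatic takeover before intervening

  def run_on_filer(node, command):
      """Run a command on a filer over SSH and return its output (assumes key-based access)."""
      return subprocess.check_output(["ssh", node, command], text=True)

  def ensure_takeover():
      """Poll the partner; if it has not taken over within the timeout, request the takeover explicitly."""
      deadline = time.time() + WAIT_BEFORE_FORCING
      while time.time() < deadline:
          status = run_on_filer(PARTNER, "cf status")      # placeholder HA status query
          if "has taken over" in status:                   # placeholder string; real output differs per release
              return
          time.sleep(15)
      run_on_filer(PARTNER, "cf takeover")                 # automatic failover did not happen: request it

  ensure_takeover()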

AIS RAC Prod timeline

  • Tue Apr 24, 17:13 - the volume holding the online redo logs freezes
  • Tue Apr 24, 17:24 - volumes available again
  • Tue Apr 24, 17:24 - 17:32 - instance 3 is overloaded by a wave of connections; no more connections to the AIS RAC production database are possible
  • Tue Apr 24, 17:32 - we kill instance 3 to unblock the situation
  • Tue Apr 24, 17:32 - instance 2 is evicted by the clusterware
  • Tue Apr 24, 17:33 - instances 2 and 3 start up
  • Tue Apr 24, 17:35 - Foundation services automatically migrate to instance 4
  • Tue Apr 24, 17:35 - all services are available (except specific PPT services that connect to a specific instance rather than through the service name; see the sketch after this timeline)
  • Tue Apr 24, 17:40 - Foundation services relocated manually back to instance 3
  • Tue Apr 24, 17:56 - PPT services relocated manually back to instance 2
  • Tue Apr 24 , 18:42 - restart of application servers to fix last PPT services
  • Tue Apr 24 , 18:45 - last PPT services made available again
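
The reason the PPT services behaved differently is visible in the connect descriptor: a session requested through the RAC service name can be served by any instance currently offering that service, whereas a descriptor pinned to one instance is tied to that instance's availability. The descriptors below are illustrative placeholders (hosts, service and instance names are invented), not the actual AIS/PPT configuration.

  import cx_Oracle   # shown only to give the descriptors a concrete use; all names are placeholders

  # Service-based descriptor: the listener can hand the session to any instance
  # currently offering "ppt_service", so a relocated service keeps accepting connections.
  DSN_SERVICE = (
      "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=ais-rac.example.org)(PORT=1521))"
      "(CONNECT_DATA=(SERVICE_NAME=ppt_service)))"
  )

  # Instance-pinned descriptor: the session must land on instance AISDBP2; if that
  # instance is down or evicted, connections fail until that instance itself is back.
  DSN_INSTANCE = (
      "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=ais-rac.example.org)(PORT=1521))"
      "(CONNECT_DATA=(SERVICE_NAME=ppt_service)(INSTANCE_NAME=AISDBP2)))"
  )

  connection = cx_Oracle.connect("app_user", "app_password", DSN_SERVICE)

Relocating a service (as was done manually at 17:40 and 17:56) only helps clients using the first form; clients pinned to an instance have to wait for that instance to come back or be reconfigured.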

Analysis

  • Due to the NFS freeze some databases hung: the AIS databases (AISDBP, EDHP, WCERNP, BAAN6P, ERTP, PAYP) and the Drupal/DBoD MySQL databases.
  • There was no data loss or corruption at the database level; as noted at the C5 meeting by Derek, the impossibility to reconnect other than by restarting the application servers led to data loss at the application level.
  • The procedure followed had been worked out with NetApp support and had worked successfully on one cluster. A case has been opened with NetApp support.

Follow up

  • IT-DB has to make sure that even for such "a priori transparent" operations, AIS, Drupal and others are explicitly aware of the intervention and agree that it can be made (knowing that there is always a risk, and making sure that the period is not a "special" one, like the introduction of IMPACT for a number of accelerators on April 24th).
  • Understand why the failover was not performed automatically (NetApp case 2003092535, opened April 24th at 19:16) and, once understood, finish the patching. A future intervention will be announced using the normal channels, including the TIOC, listing the affected services and likely with a downtime.
  • Following case 2003092535, the reboot failed due to a combination of Bug 514876 (deadlock over a mutex that prevents stderr session destruction and causes a giveback or shutdown to hang), Bug 542290 (not publicly available) and Bug 384214 (not publicly available) affecting our current ONTAP version. An intervention will be scheduled in order to upgrade the ONTAP release and finish the previous intervention.
  • Understand why some applications (for example IMPACT) could not reconnect to the database server automatically (database or application server configuration? the application itself?); this is the reason why application data (cached in the application) was lost. A test could be done on the AIS RAC test environment with the needed announcements, coordination and the right persons present.
  • List the dependencies of IMPACT.
  • Any intervention which can affect the IMPACT application has to be coordinated with the TIOC.

-- RubenGaspar - 24-Apr-2012
