RAC51 storage switch extension outage

Description

A planned intervention to add additional storage switches to RAC51 resulted in a major service outage. All AIS services plus EDMS suffered an outage of about 10 minutes, as the Oracle RAC servers in RAC51 rebooted following an incident on the storage switches.

Impact

  • All active users of AIS applications and EDMS.

Timeline of the incident

  • 21-Jan-16 08:22 - CRS core dump on some servers
  • 21-Jan-16 08:25 - confirmation from IT/CS that the sixth storage switch had been added to RAC51
  • 21-Jan-16 08:26 - unable to ping the storage or other servers over the interconnect
  • 21-Jan-16 08:31 - 34 Oracle RAC servers rebooting
  • 21-Jan-16 08:32 - contacted CCCR on 72201 to report the issue
  • 21-Jan-16 08:33 - Oracle restarting
  • 21-Jan-16 08:36 -
  • 21-Jan-16 08:39 - Oracle databases up (except the Oracle on-demand cluster)
  • 21-Jan-16 08:51 - transparent rebalancing of Oracle services across nodes
  • 21-Jan-16 08:53 - contacted CCCR to inform them that on our side the databases are back
  • 21-Jan-16 09:00 - investigating why itrac5114 and itrac5142 have lost all network connectivity
  • 21-Jan-16 10:30 - decision to reinstall itrac5114 and also prepare itrac5146 to be part of the cluster, while investigation continues on itrac5142
  • 21-Jan-16 11:30 - itrac5146 available to the on-demand team
  • 21-Jan-16 12:00 - itrac5114 available to the on-demand team
  • 21-Jan-16 12:30 - the remaining on-demand databases being restored
  • 21-Jan-16 14:30 - all databases running as normal

Analysis

  • The addition of a new switch to the storage network caused an unforeseen network disruption.
  • The Oracle RAC nodes were no longer able to see the CRS volume over NFS and rebooted (a sketch of this self-fencing behaviour follows the report below).
  • Below is the report from Eric Sallaz:
Today, Thursday 21 January, the intervention started at 07h30, as announced in OTG https://cern.service-now.com/service-portal/view-outage.do?n=OTG0027624.

The first part of the work, which consisted in adding a third switch to the secondary IRF, went well.

Then we added the third switch to the primary IRF and encountered a connectivity problem on most of the servers. This was almost certainly due to the LAG between each pair of switches in racks RA02, RA05 and RA07 that interconnects the two IRFs.

When this last switch was added, it joined the primary IRF automatically, as expected; in doing so, all of its port numbers were renumbered according to its position in the IRF. As a result, the port interconnecting the two switches was renumbered from 1/0/50 to 3/0/50 and dropped out of the LAG.

This created a loop, which may explain the problems seen on the servers. Once this was identified, the interconnection was removed, the IRF configuration was modified, and the connection was re-established. The LAG was then complete and the IRF switches stable.
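
The renumbering failure mode is easiest to see in a small simulation. The sketch below (Python, assumed behaviour rather than vendor code) shows how position-based renumbering renames a member port from 1/0/50 to 3/0/50 when the switch joins the primary IRF as member 3, so a LAG definition that still references 1/0/50 no longer matches the live interconnect port and the link drops out of the aggregate:

<verbatim>
# Illustrative simulation of IRF position-based port renumbering.
# Port names follow the "member/slot/port" scheme from the report.

def renumber(port: str, new_member: int) -> str:
    """When a switch joins an IRF, its ports take the new member ID."""
    _, slot, num = port.split("/")
    return f"{new_member}/{slot}/{num}"

# LAG membership as configured before the switch joined the primary IRF:
lag_members = {"1/0/50"}

# The new switch is added as IRF member 3, so its ports are renumbered:
live_port = renumber("1/0/50", new_member=3)   # -> "3/0/50"

if live_port not in lag_members:
    # The interconnect port is no longer aggregated: the parallel links
    # between the two IRFs forward frames independently, creating a loop.
    print(f"{live_port} left the LAG -> switching loop between the IRFs")
</verbatim>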
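
On the database side, the reboots follow from standard Oracle Clusterware behaviour: a node that loses access to the shared CRS/voting storage for longer than the disk timeout self-fences (reboots) rather than risk a split-brain. The Python sketch below is a minimal illustration of that logic only; the path, timeout value and reboot call are hypothetical stand-ins for what the CSS daemon actually does:

<verbatim>
import os
import time

# Hypothetical values: the real CSS voting-file path and disk timeout differ.
VOTING_FILE = "/nfs/crs/voting_disk"   # CRS volume mounted over NFS (assumption)
DISK_TIMEOUT_S = 200                   # stand-in for the CSS disk timeout

def storage_reachable() -> bool:
    """One heartbeat: attempt a small read from the voting file."""
    try:
        with open(VOTING_FILE, "rb") as f:
            f.read(512)
        return True
    except OSError:
        return False

def heartbeat_loop() -> None:
    last_ok = time.monotonic()
    while True:
        if storage_reachable():
            last_ok = time.monotonic()
        elif time.monotonic() - last_ok > DISK_TIMEOUT_S:
            # The node can no longer see the CRS volume: self-fence by
            # rebooting, which is what the RAC51 nodes did on 21 January.
            os.system("reboot")  # illustrative only
            return
        time.sleep(1)
</verbatim>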

-- PaulSmith - 2016-01-22
