RAC51 storage switch extension outage

Description

A planned intervention to add additional storage switches to RAC51 resulted in a major service outage. All AIS services plus EDMS suffered an outage of about 10 minutes, as the Oracle RAC servers in RAC51 rebooted following an incident on the storage switches.

Impact

  • All active users of AIS applications and EDMS.

Timeline of the incident

  • 21-Jan-16 08:22 - CRS core dump on some servers
  • 21-Jan-16 08:25 - confirmation from IT/CS that the sixth storage switch had been added to RAC51
  • 21-Jan-16 08:26 - unable to ping the storage or other servers over the interconnect
  • 21-Jan-16 08:31 - 34 Oracle RAC servers rebooting
  • 21-Jan-16 08:32 - contacted CCCR on 72201 to report the issue
  • 21-Jan-16 08:33 - Oracle restarting
  • 21-Jan-16 08:36 -
  • 21-Jan-16 08:39 - Oracle databases up (except the Oracle on-demand cluster)
  • 21-Jan-16 08:51 - transparent rebalancing of Oracle services across nodes
  • 21-Jan-16 08:53 - contacted CCCR to inform them that on our side the databases are back
  • 21-Jan-16 09:00 - investigating why itrac5114 and itrac5142 have lost all network connectivity
  • 21-Jan-16 10:30 - decision to reinstall itrac5114 and also prepare itrac5146 to be part of the cluster, while investigation continues on itrac5142
  • 21-Jan-16 11:30 - itrac5146 available to the on-demand team
  • 21-Jan-16 12:00 - itrac5114 available to the on-demand team
  • 21-Jan-16 12:30 - the remaining on-demand databases being restored
  • 21-Jan-16 14:30 - all databases running as normal

Analysis

  • The addition of a new switch to the storage network caused an unforeseen network disruption.
  • The Oracle RAC nodes were no longer able to see the CRS volume over NFS and rebooted (a sketch of this self-fencing behaviour follows the report below).
  • Below is the report from Eric Sallaz:
Today, Thursday 21 January, the intervention started at 07h30, as announced in OTG https://cern.service-now.com/service-portal/view-outage.do?n=OTG0027624.

The first part of the work, which consisted in adding a third switch to the secondary IRF, went well.

Then we added the third switch to the primary IRF and encountered a connectivity problem on most of the servers. This was almost certainly due to the LAG between each pair of switches in racks RA02, RA05 and RA07 that interconnects the two IRFs.

When this last switch was added, it joined the primary IRF automatically, as expected; in doing so, all of its port numbers were renumbered according to its position in the IRF. As a result, the port interconnecting the two switches was renumbered from 1/0/50 to 3/0/50 and dropped out of the LAG.

This created a loop, which may explain the problems seen on the servers. Once this was identified, the interconnection was removed, the IRF configuration was modified, and the connection was re-established. The LAG was then complete and the IRF switches stable.
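
The renumbering failure mode is easiest to see in a small simulation. The sketch below (Python, assumed behaviour rather than vendor code) shows how position-based renumbering renames a member port from 1/0/50 to 3/0/50 when the switch joins the primary IRF as member 3, so a LAG definition that still references 1/0/50 no longer matches the live interconnect port and the link drops out of the aggregate:

<verbatim>
# Illustrative simulation of IRF position-based port renumbering.
# Port names follow the "member/slot/port" scheme from the report.

def renumber(port: str, new_member: int) -> str:
    """When a switch joins an IRF, its ports take the new member ID."""
    _, slot, num = port.split("/")
    return f"{new_member}/{slot}/{num}"

# LAG membership as configured before the switch joined the primary IRF:
lag_members = {"1/0/50"}

# The new switch is added as IRF member 3, so its ports are renumbered:
live_port = renumber("1/0/50", new_member=3)   # -> "3/0/50"

if live_port not in lag_members:
    # The interconnect port is no longer aggregated: the parallel links
    # between the two IRFs forward frames independently, creating a loop.
    print(f"{live_port} left the LAG -> switching loop between the IRFs")
</verbatim>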
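
On the database side, the reboots follow from standard Oracle Clusterware behaviour: a node that loses access to the shared CRS/voting storage for longer than the disk timeout self-fences (reboots) rather than risk a split-brain. The Python sketch below is a minimal illustration of that logic only; the path, timeout value and reboot call are hypothetical stand-ins for what the CSS daemon actually does:

<verbatim>
import os
import time

# Hypothetical values: the real CSS voting-file path and disk timeout differ.
VOTING_FILE = "/nfs/crs/voting_disk"   # CRS volume mounted over NFS (assumption)
DISK_TIMEOUT_S = 200                   # stand-in for the CSS disk timeout

def storage_reachable() -> bool:
    """One heartbeat: attempt a small read from the voting file."""
    try:
        with open(VOTING_FILE, "rb") as f:
            f.read(512)
        return True
    except OSError:
        return False

def heartbeat_loop() -> None:
    last_ok = time.monotonic()
    while True:
        if storage_reachable():
            last_ok = time.monotonic()
        elif time.monotonic() - last_ok > DISK_TIMEOUT_S:
            # The node can no longer see the CRS volume: self-fence by
            # rebooting, which is what the RAC51 nodes did on 21 January.
            os.system("reboot")  # illustrative only
            return
        time.sleep(1)
</verbatim>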

-- PaulSmith - 2016-01-22
