
DB Web>PostMortems>PostMortem24Jun12 (2012-06-24, EricGrancher)


ACCCON partially not reachable

Description

ACCCON was partially unreachable (service acc_settings_db) between about 5:29 and 6:10 on 24-Jun-2012.

Impact

The acc_settings_db database service was not reachable and already established sessions were blocked.

Time line of the incident

24-Jun-2012 around 5:29 (exact time not known) the machine dbsrva207 started to hang
24-Jun-2012 5:32 machine dbsrva207 evicted by dbsrva206 (expected)
24-Jun-2012 5:40 service LSA_s not moved to dbsrva206 (should have been)
24-Jun-2012 5:57 called by Chris Roderick
24-Jun-2012 6:05 forced a restart of the machine (remote-power-control cycle dbsrva207)
24-Jun-2012 6:10 service available again: service LSA_DB automatically moved to dbsrva206
24-Jun-2012 6:23 service moved back to dbsrva207

Analysis

The machine appears to have hit a local storage issue. From the dbsrva207 console log:

megasas: RESET -208368307 cmd=2a  retries=0
megasas: [ 0]waiting for 116 commands to complete
megasas: [ 5]waiting for 116 commands to complete
megasas: [10]waiting for 116 commands to complete
...
megasas[0]: Dumping Frame Phys Address of all pending cmds in FW
megasas[0]: Total OS Pending cmds : 116
megasas[0]: 64 bit SGLs were sent to FW
megasas[0]: Pending OS cmds in FW : 
megasas[0]: Frame addr :0x0009c000 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x0, lba lo : 0x70f9061, lba_hi : 0x0, sense_buf addr : 0x5600,sge count : 0x1
...
megasas[0]: Pending Internal cmds in FW : 
0x00577000 : <3>megasas[0]: Dumping Done.
megasas: failed to do reset
megasas: RESET -208368307 cmd=2a  retries=0
megasas: cannot recover from previous reset failures
megasas: RESET -208368307 cmd=2a  retries=0
megasas: cannot recover from previous reset failures
SCSI error : <0 2 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 118459761
Buffer I/O error on device sda6, logical block 1935381
lost page write due to I/O error on sda6
scsi0 (0:0): rejecting I/O to offline device
...
end_request: I/O error, dev sda, sector 118459777
Buffer I/O error on device sda6, logical block 1935383
lost page write due to I/O error on sda6
SCSI error : <0 2 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 118650489
Buffer I/O error on device sda6, logical block 1959222
...
Aborting journal on device sda6.
end_request: I/O error, dev sda, sector 68620533
ext3_abort called.
EXT3-fs error (device sda6): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
scsi0 (0:0): rejecting I/O to offline device
...
EXT3-fs error (device sda2) in start_transaction: Journal has aborted
EXT3-fs error (device sda2) in start_transaction: Journal has aborted
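
The failure signature in this console log (controller reset failure, I/O errors, aborted journal, read-only remount) could in principle be detected automatically. The following is a minimal sketch in Python, assuming the console log is available as plain text lines. The pattern labels and the function name `storage_failure_signs` are hypothetical and built only from the messages quoted above; this is not an existing monitoring tool.

```python
import re

# Patterns taken from the dbsrva207 console log quoted above.
# The labels are hypothetical names chosen for this sketch.
PATTERNS = {
    "controller_reset_failed": re.compile(
        r"megasas: (failed to do reset|cannot recover from previous reset failures)"
    ),
    "io_error": re.compile(r"end_request: I/O error, dev \w+, sector \d+"),
    "device_offline": re.compile(r"rejecting I/O to offline device"),
    "journal_aborted": re.compile(
        r"Aborting journal on device|Journal has aborted|Detected aborted journal"
    ),
    "remounted_read_only": re.compile(r"Remounting filesystem read-only"),
}


def storage_failure_signs(lines):
    """Count how many lines match each failure pattern."""
    counts = {label: 0 for label in PATTERNS}
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Sample lines copied from the console log above.
sample = [
    "megasas: failed to do reset",
    "megasas: cannot recover from previous reset failures",
    "end_request: I/O error, dev sda, sector 118459761",
    "scsi0 (0:0): rejecting I/O to offline device",
    "Aborting journal on device sda6.",
    "Remounting filesystem read-only",
]
signs = storage_failure_signs(sample)
print(signs)
```

Which combination of counts should trigger an alarm or a forced reboot is a policy decision that this sketch deliberately leaves open.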

From the alert log of node 1 (dbsrva206): the second instance was evicted, but the LSA_S service was not moved to this node.

Sun Jun 24 05:32:08 2012
minact-scn: Inst 1 is now the master inc#:3 mmon proc-id:7893 status:0x7
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x05b0.95a3faac gcalc-scn:0x05b0.9b977e65
minact-scn: Master returning as live inst:2 has inc# mismatch instinc:0 cur:3 errcnt:0
...
Sun Jun 24 05:38:24 2012
Remote instance kill is issued with system inc 3
Remote instance kill map (size 1) : 2 
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x10000000
The instance eviction map is 2 
Reconfiguration started (old inc 3, new inc 5)
List of instances:
 1 (myinst: 1) 
...
Sun Jun 24 05:40:42 2012
ALTER SYSTEM SET service_names='ACCCON_S','SYS$ACCCON_SYS.ANYDATA_QUEUE.ACCCON.WORLD' SCOPE=MEMORY SID='ACCCON1';

Node 2 was evicted by node 1:

from /ORA/dbs01/crs/log/dbsrva206/alertdbsrva206.log 
2012-06-24 05:38:24.078 [cssd(11462)]CRS-1663:Member kill issued by PID 7865 for 1 members, group DBACCCON. Details at (:CSSGM00044:) in /ORA/dbs01/crs/log/dbsrva206/cssd/ocssd.log.

from /ORA/dbs01/crs/log/dbsrva206/cssd/ocssd.log
2012-06-24 05:32:50.876: [    CSSD][1103702368]clssgmHandleMasterMemberExit: [s(2) d(1)]
2012-06-24 05:32:50.876: [    CSSD][1103702368]clssgmRemoveMember: grock IGACCCONSYS$BACKGROUND, member number 2 (0x2a9aaa5d20) node number 2 state 0x0 grock type 2
2012-06-24 05:32:50.876: [    CSSD][1103702368]clssgmGrantLocks: 2-> new master (1/1) group IGACCCONSYS$BACKGROUND
2012-06-24 05:32:50.876: [    CSSD][1103702368]clssgmQueueGrockEvent: groupName(IGACCCONSYS$BACKGROUND) count(1) master(1) event(1), incarn 3, mbrc 1, to member 1, events 0x0, state 0x0
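
The excerpts above use two timestamp formats (`Sun Jun 24 05:32:08 2012` in the database alert log, `2012-06-24 05:38:24.078` in the CRS logs), which makes them awkward to read as one sequence. A minimal sketch in Python, assuming the log lines are available as strings, that normalises both formats so events from the different logs can be sorted together. `parse_timestamp` is a hypothetical helper, and parsing the day and month names assumes an English/C locale.

```python
import re
from datetime import datetime

# Timestamp at the start of CRS/ocssd lines, e.g. "2012-06-24 05:38:24.078"
CRS_TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")
# Stand-alone timestamp line in the alert log, e.g. "Sun Jun 24 05:32:08 2012"
ALERT_TS = re.compile(r"^(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})$")


def parse_timestamp(line):
    """Return a datetime for a recognised timestamp, or None otherwise."""
    line = line.strip()
    m = CRS_TS.match(line)
    if m:
        return datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
    m = ALERT_TS.match(line)
    if m:
        return datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
    return None


# (source, line) pairs copied from the excerpts above, deliberately unordered.
events = [
    ("alert log", "Sun Jun 24 05:38:24 2012"),
    ("ocssd.log", "2012-06-24 05:32:50.876: [    CSSD][1103702368]"
                  "clssgmHandleMasterMemberExit: [s(2) d(1)]"),
    ("alert log", "Sun Jun 24 05:32:08 2012"),
]
ordered = sorted(events, key=lambda e: parse_timestamp(e[1]))
for source, line in ordered:
    print(parse_timestamp(line), source)
```

The sketch assumes all logs are written in the same time zone and with synchronised clocks; neither is confirmed by the excerpts.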

Follow up

investigate forcing a reboot when such I/O errors occur
investigate why the node was not evicted
check why SMS was not sent

Topic revision: r2 - 2012-06-24 - EricGrancher
