PostMortem13Jul09 < LCG



The problem was related to a query requested by ATLAS operations to re-initialize the default group information for some locations (mcdisk, datadisk). Although it had been tested on intr, the query was suboptimal and created row lock contention. The ATLAS expert did not manage to kill it immediately through the session manager, and Luca Canali took over. One mistake was to check the execution and the plan only on intr, where the cardinality of locations is lower than on production. On production the execution plan of the 'update in' changed to a full table scan, which favored contention with the real load. The query has been rewritten with the help of the ATLAS DB experts and its execution is being monitored (https://oraweb.cern.ch/pls/atlr/webinstance.sessions.show_sessions); for the moment it is still transparent, as it was originally supposed to be. An e-log ticket will be posted for new interventions in the future.
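The lesson about checking the plan on the target database can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Oracle, so it only shows the idea (inspect the plan of the 'update in' statement on the database where it will run, and refuse to proceed on a full table scan). The table name replicas, its columns, and the helper names are hypothetical; the real ATLAS schema and the Oracle plan tooling are not described in this report.

```python
import sqlite3

# Hypothetical statement shaped like the 'update in' described above.
UPDATE_SQL = "UPDATE replicas SET grp = ? WHERE location IN (?, ?)"
PARAMS = ("default", "mcdisk", "datadisk")


def plan_details(conn, sql, params=()):
    """Return the human-readable steps of SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the textual description of the step.
    return [row[-1] for row in rows]


def has_full_scan(conn, sql, params=()):
    """True if any plan step is a table scan rather than an index search."""
    return any(step.startswith("SCAN") for step in plan_details(conn, sql, params))


def build_demo_db(with_index):
    """Create an in-memory table (assumed schema), optionally indexed on location."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE replicas (id INTEGER PRIMARY KEY, location TEXT, grp TEXT)"
    )
    if with_index:
        conn.execute("CREATE INDEX replicas_location ON replicas (location)")
    return conn


if __name__ == "__main__":
    for with_index in (False, True):
        conn = build_demo_db(with_index)
        print("index:", with_index, "full scan:", has_full_scan(conn, UPDATE_SQL, PARAMS))
```

The same statement yields a different plan depending on the database it runs against, which is why a plan checked on one instance says little about another.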

Addition from the Physics Database Service team (Luca Canali): the ATLAS offline ATLR services suffered a DB service stop, with several connected sessions killed, on Sunday 12th at 10:26. Full service connectivity was restored at 10:28. A second, similar issue happened at 10:32 and full service connectivity was restored at 10:36. Applications with 'automatic reconnect' experienced two glitches in the time window mentioned. The event was triggered by what should have been a transparent change to fix an error message from the monitoring system: the flash recovery area was filling up due to DB growth and the parameter db_recovery_file_dest_size needed to be increased accordingly. An erroneous setting of the parameter by the DBA triggered Oracle to perform the service relocation described above. The parameter was set correctly at 10:36 by the DBA, and the services were restarted and have been fully available since then.
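As an illustration only, a value intended for db_recovery_file_dest_size could be sanity-checked before it is applied. The sketch below is a generic guard written for this page, not the procedure used by the DBA team, and the report does not say what the erroneous value was. The function names and the example figures are hypothetical; the K, M and G suffixes are assumed to be powers of 1024.

```python
import re

# Multipliers for the size suffixes accepted by this sketch (assumption: binary units).
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a size string such as '500G' or '1048576' to a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMG]?)\s*", text.upper())
    if match is None:
        raise ValueError("unrecognised size: %r" % text)
    return int(match.group(1)) * _UNITS[match.group(2)]


def check_new_limit(current_limit, space_used, new_limit):
    """Return the new limit in bytes, or raise ValueError if it is not an increase.

    The new limit must exceed both the current limit and the space already used.
    """
    current = parse_size(current_limit)
    used = parse_size(space_used)
    new = parse_size(new_limit)
    if new <= current:
        raise ValueError("new limit does not increase the current limit")
    if new <= used:
        raise ValueError("new limit is below the space already in use")
    return new


if __name__ == "__main__":
    # Hypothetical figures: a typo such as '500M' instead of '500G' is rejected.
    print(check_new_limit("400G", "390G", "500G"))
    try:
        check_new_limit("400G", "390G", "500M")
    except ValueError as err:
        print("rejected:", err)
```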

-- SimoneCampana - 13 Jul 2009

Topic revision: r2 - 2009-07-16 - MariaGirone
