The problem was caused by a query that ATLAS operations requested in order to
re-initialize the default group information for some locations (mcdisk,
datadisk). Although the query had been tested on intr, it was suboptimal and
created row lock contention. The ATLAS expert did not manage to kill it
immediately through the session manager, and Luca Canali took over. One
mistake was to check the execution and the plan only on intr, where the
cardinality of locations is lower than on production. On production the
execution plan of the 'update in' changed to a full table scan, which favors
contention with the real load. The query has been rewritten with the help of
the ATLAS DB experts and its execution is being monitored
(https://oraweb.cern.ch/pls/atlr/webinstance.sessions.show_sessions);
so far it has remained transparent, as it was originally supposed
to be. An e-log ticket will be posted for new interventions in the future.

Addition from the Physics Database Service team (Luca Canali): the ATLAS offline ATLR services suffered a DB service stop, with several connected sessions killed, on Sunday 12th at 10:26. Full service connectivity was restored at 10:28. A second, similar issue occurred at 10:32, and full service connectivity was restored at 10:36. Applications with 'automatic reconnect' experienced two glitches in the time window mentioned. The event was triggered by what should have been a transparent change to fix an error message
from the monitoring system. Namely, the flash recovery area was filling up due to DB growth, and the parameter db_recovery_file_dest_size needed to be increased accordingly. An erroneous setting of the parameter by the DBA caused Oracle to perform service relocation
as described above. The DBA set the parameter correctly at 10:36; the services were restarted and have been fully available since then.
--
SimoneCampana - 13 Jul 2009