Database issues after patching, June 2nd - 3rd 2010

Description

The Oracle PSU APR 2010 patch passed validation on the test and integration databases but proved unsuitable for production on the ATONR, ATLR and LHCBR databases.

Impact

  • After the patch was applied, spikes of high load were observed on the above-mentioned databases, affecting database access and the quality of the database services.

Time line of the incident

  • 31.05.2010 - patch applied on ATONR.
  • 01.06.2010 - patch applied on ATLR and LHCBR
  • 01.06.2010 - ATLAS users reported connection problems for some database sessions on the ATONR database while the ATLR database was being patched
  • 01.06.2010 - once the patch was applied, high load spikes were observed on the ATLR and LHCBR databases every few hours, seriously affecting production database services

Analysis

  • The main symptoms seen in production since PSU APR10 was applied are large load spikes (reported in Oracle mainly as 'wait events' related to commit time) that occur every few hours and last 5-10 minutes. During these spikes, database access and the quality of the DB services are compromised, which substantially affects production. The alert log file shows ORA-07445: exception encountered: core dump [ksxpmprp()+196], and the corresponding trace file references the sys.aud$ table.
  • The problem seems to affect databases where auditing is enabled and COOL is used.
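
The alert log check described above can be sketched as a short script. This is a minimal illustration, not the procedure used during the incident: the function name `find_core_dumps` and the sample log lines around the error are hypothetical, and only the ORA-07445 message text is taken from this report.

```python
import re

# Matches the error text quoted in this report and captures the location
# inside the first pair of square brackets, e.g. "ksxpmprp()+196".
ORA_07445 = re.compile(r"ORA-07445: exception encountered: core dump \[([^\]]+)\]")


def find_core_dumps(alert_log_text):
    """Return (line_number, location) pairs for each ORA-07445 entry."""
    hits = []
    for number, line in enumerate(alert_log_text.splitlines(), start=1):
        match = ORA_07445.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


# Hypothetical excerpt: only the ORA-07445 line mirrors the real message.
sample = "\n".join([
    "some unrelated alert log entry",
    "ORA-07445: exception encountered: core dump [ksxpmprp()+196]",
    "another unrelated alert log entry",
])

for line_number, location in find_core_dumps(sample):
    print(f"line {line_number}: core dump at {location}")
```

In practice the text would be read from the database's alert log file, whose location depends on the installation.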

Follow up

  • Patch rolled back on the ATONR, ATLR, LHBONR and LHCBR databases (rolling intervention) on Wednesday 2nd June, and on CMSONR and CMSR on Thursday 10th June.
  • Open a Service Request with Oracle for further investigation - Oracle SR number 3-1826315781
    • A fix patch has been provided; however, it still needs to be validated in a test environment where the problem reproduces.
  • Try to reproduce the problem on the integration/test databases.
  • Recommendation from CERN to Tier1 sites: sites that have applied the April PSU patch on their databases, have auditing enabled, and use COOL or similar software (multiple sessions connected to one server process) to access the database are advised to roll back the patch.
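
The recommendation to Tier1 sites combines three conditions, all of which must hold. A minimal sketch of that rule follows; the `SiteDatabase` type, its field names and the example database names are hypothetical and exist only to make the logic explicit.

```python
from dataclasses import dataclass


@dataclass
class SiteDatabase:
    # All field names are illustrative assumptions, not an existing schema.
    name: str
    april_psu_applied: bool
    auditing_enabled: bool
    # True when COOL or similar software connects multiple sessions
    # to one server process.
    multi_session_access: bool


def should_roll_back(db):
    """Apply the recommendation: roll back only if all three conditions hold."""
    return db.april_psu_applied and db.auditing_enabled and db.multi_session_access


# Hypothetical databases used only to exercise the rule.
affected = SiteDatabase("example_db_a", True, True, True)
no_auditing = SiteDatabase("example_db_b", True, False, True)

for db in (affected, no_auditing):
    action = "roll back the patch" if should_roll_back(db) else "no action advised"
    print(f"{db.name}: {action}")
```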

-- EvaDafonte - 07-Jun-2010

Topic revision: r2 - 2010-06-11 - EvaDafonte
 