AISLogin not available

Description

One in two connections to AISLogin and to all the websites served by the application server ias_ais03_prod (aismedia.cern.ch, e-groups.cern.ch, found.cern.ch, graybook.cern.ch, identity-management.cern.ch, safety.cern.ch) failed from 9:45 to 10:30. All connections failed from 10:30 to 11:10.

Impact

  • Users were unable to connect to the above websites. As a side effect, users without an AISLogin cookie could not log in to any other AISLogin-protected application.

Time line of the incident

  • 28-Jan-2011 04:09 - Lemon notices "swap full" on one of the two servers running ias_ais04_prod (dbsrvg3301)
  • 28-Jan-2011 09:30 - Identified the cause of the alarm: HTTP_Server component using about 30 GB of memory
  • 28-Jan-2011 09:45 - Restarted Oracle Application Server HTTP_Server component on dbsrvg3301
  • 28-Jan-2011 09:45 - HTTP_Server didn't restart correctly, making all connections to the server fail
  • 28-Jan-2011 10:15 - Mail from AIS concerning failing connections
  • 28-Jan-2011 10:30 - Attempted to restart HTTP_Server on both servers; it failed to start on either, so all connections failed. Started investigating why it did not start up.
  • 28-Jan-2011 11:10 - Configuration fixed and components restarted.
  • 28-Jan-2011 11:15 - Mail to AIS announcing that service was restored.

Analysis

There are two problems:

  • The HTTP_Server component of Oracle iAS was consuming a huge amount of memory (30 GB). It had to be restarted because in the past this condition had led to a server reboot due to lack of memory.
  • The HTTP_Server failed to start up because its configuration referenced an SSL certificate that no longer existed; it had been removed a few days earlier during a general clean-up of old certificates. The certificate was no longer needed for operations, but references to it remained in some parts of the configuration file. Because the configuration is checked for correctness before the component starts, this made the startup fail.
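
The second problem could have been caught before the restart by checking that every certificate path referenced in the configuration still exists. The following is a minimal sketch of such a check, not the procedure actually used: the directive names (SSLCertificateFile, SSLCertificateKeyFile, SSLWallet), the "file:" prefix handling and the function names are assumptions for illustration and would need adjusting to the real configuration files.

```python
import os
import re

# Assumed directive names; adjust to whatever the real configuration uses.
CERT_DIRECTIVES = ("SSLCertificateFile", "SSLCertificateKeyFile", "SSLWallet")


def referenced_cert_paths(config_text, directives=CERT_DIRECTIVES):
    """Return (line_number, directive, path) for each certificate directive.

    Commented-out lines are ignored because the directive must be the first
    token on the line.
    """
    pattern = re.compile(
        r"^\s*(" + "|".join(re.escape(d) for d in directives) + r")\s+(\S+)",
        re.IGNORECASE,
    )
    found = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        match = pattern.match(line)
        if match:
            path = match.group(2).strip('"')
            if path.startswith("file:"):
                # Assumed wallet-style location prefix.
                path = path[len("file:"):]
            found.append((number, match.group(1), path))
    return found


def missing_cert_paths(config_text, exists=os.path.exists):
    """Return the referenced entries whose path does not exist on disk."""
    return [e for e in referenced_cert_paths(config_text) if not exists(e[2])]
```

Run against the configuration text before a restart, a non-empty result from missing_cert_paths would indicate that the component is likely to fail its start-up check. The exists parameter is injectable so the check can be exercised without touching the file system.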

Follow up

  • We'll open an SR to investigate the memory leak (TODO: update with SR #) - solved by cron job
  • Our configuration management framework that automatically generates the application server configuration files will be extended to handle the location of SSL certificates. - DONE
  • TODO: All AIS applications will be moved to SSO
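
This page does not describe what the cron job for the memory leak does. As a purely hypothetical illustration, a periodic check could sum the resident memory of the HTTP server processes and flag when a limit is exceeded, so that a restart can be planned before swap fills up. The process name "httpd", the 8 GiB limit and all function names below are assumptions.

```python
import subprocess


def total_rss_kib(ps_output, process_name):
    """Sum the RSS column (KiB) of 'rss command' lines for one command name."""
    total = 0
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if (
            len(fields) == 2
            and fields[0].isdigit()
            and fields[1].strip() == process_name
        ):
            total += int(fields[0])
    return total


def memory_exceeded(ps_output, process_name="httpd", limit_gib=8.0):
    """Return True when the summed RSS is above the limit (assumed values)."""
    return total_rss_kib(ps_output, process_name) > limit_gib * 1024 * 1024


def current_ps_output():
    """Collect 'rss command' pairs without a header line using ps."""
    result = subprocess.run(
        ["ps", "-e", "-o", "rss=", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A cron entry could call memory_exceeded(current_ps_output()) and send a notification or trigger a controlled restart when it returns True. Whether to restart automatically is a separate operational decision that this sketch does not make.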
Topic revision: r2 - 2013-01-10 - NicolasMarescaux