Accelerator databases unavailability

Description

Accelerator databases (acclog, acccon, accmeas, eniaar, laser, tim, zora, susi, encvorcl) could not be accessed from our machines due to problems with the Technical Network.

Impact

  • Accelerator databases (acclog, acccon, accmeas, eniaar, laser, tim, zora, susi, encvorcl) were not accessible from any of our machines. There were no other problems with the databases: during the whole period of the incident they were running fine (no restarts, etc.).

Timeline of the incident

  • 04-Jan-13 00:33 - First RACMON alerts regarding connectivity problems (see follow-up section).
  • 04-Jan-13 ~00:50 - Called the operator, who informed us that a network problem was being solved.
  • 04-Jan-13 01:04 - Incident posted on SSB: http://itssb.web.cern.ch/service-incident/accelerator-databases-unavailable/04-01-2013.
  • 04-Jan-13 01:11 - Accelerator databases e-group informed about the incident.
  • 04-Jan-13 ~01:20 - Could not use lxtnadm to reach the machines: login to lxtnadm worked, but ssh from there to the servers hosting the databases did not (see follow-up section).
  • 04-Jan-13 ~01:20 - Found out that even the consoles were not reachable (see follow-up section).
  • 04-Jan-13 ~01:35 - Called the operator again, who assured us that the problem was being solved.
  • 04-Jan-13 01:40 - Entry about the network problems posted on the SSB by the operator: http://itssb.web.cern.ch/service-incident/incident-router-and-dns/04-01-2013.
  • 04-Jan-13 01:55 - Problem fixed.
  • 04-Jan-13 02:05 - Called the operator again, who assured us that the fix was final, although the root cause was not yet clear to him.
  • 04-Jan-13 02:09 - Accelerator databases e-group informed that connectivity was fixed.
  • 04-Jan-13 02:35 - While checking the impact, found many failed logins as user ABOP to the ACCCON database from cs-ccr-log4.cern.ch; users were informed (see follow-up section).
  • 04-Jan-13 10:43 - ABOP session count went back to normal.

Analysis

  • There were no database-related problems. The network problem was solved and the root cause analysis of the unavailability was done by the Network Team; see http://itssb.web.cern.ch/service-incident/incident-router-and-dns/04-01-2013
  • Immediately after the network was back, problems with logging in as the ABOP user to the ACCCON database started. Analysis showed that the maximum session limit set by the profile assigned to the user had been reached (see follow-up section and the sketch after this list).
  • Even though there is piquet coverage over the end-of-year closure, during the rest of the year the service is covered on a best-effort basis. It is the first time in the past years that the SUSI and ZORA databases have had an outage outside working hours.
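
Below is a minimal sketch, in Python with the python-oracledb driver, of how such a check can be done: it compares the current session count of an account against the SESSIONS_PER_USER limit of its profile. The connect string, the monitoring account and its credentials are placeholder assumptions for illustration, not part of the actual setup; the check needs access to DBA_USERS, DBA_PROFILES and V$SESSION.

import oracledb

DSN = "db-host.example.cern.ch/ACCCON"   # placeholder connect string (assumption)
MONITORED_USER = "ABOP"


def session_usage(conn, username):
    """Return (current_session_count, profile_limit) for the given account."""
    with conn.cursor() as cur:
        # SESSIONS_PER_USER limit of the profile assigned to the account
        cur.execute(
            """SELECT p.limit
                 FROM dba_profiles p
                 JOIN dba_users u ON u.profile = p.profile
                WHERE u.username = :uname
                  AND p.resource_name = 'SESSIONS_PER_USER'""",
            uname=username,
        )
        row = cur.fetchone()
        limit = row[0] if row else None   # a number as text, 'UNLIMITED' or 'DEFAULT'

        # current sessions of the account
        cur.execute(
            "SELECT COUNT(*) FROM v$session WHERE username = :uname",
            uname=username,
        )
        current = cur.fetchone()[0]
    return current, limit


if __name__ == "__main__":
    # credentials are placeholders; a real check would read them from a wallet/config
    with oracledb.connect(user="monitoring", password="***", dsn=DSN) as conn:
        current, limit = session_usage(conn, MONITORED_USER)
        print(f"{MONITORED_USER}: {current} sessions, SESSIONS_PER_USER limit = {limit}")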

Follow up

  • The Network Team was asked for more details regarding the incident. After analysis, they believe that some routers on the TN, including the one in building 513, hit a bug that stopped the routing process and prevented them from sharing routing information. The end systems (i.e. the database servers) had local connectivity and could reach any service connected to the same router, but could not reach anything outside it. This means that between 00:33 and 01:54 our database servers could not access any system on the TN outside the Data Center, nor anything on the GPN.
  • A SNOW ticket was opened to clarify the problem with no access to the consoles. The Remote Console Team replied that the accessibility problems were caused by the fact that the affected hosts, as well as their IPMI (console) interfaces, are connected to the same router, T513-c-rhp82-1, which serves the Technical Network - the one affected by the bug. lxtnadm was accessible because it is on the GPN, but there was no connection from it to the target *-IPMI interfaces, which is why the consoles could not be reached. According to them it was designed that way (no mistake in the configuration).
  • The failed logins as the ABOP user were caused by problems with the connection pool in the Tomcat server running the E-Logbook application: all available connections in the pool were used but never released (even the unused ones). A restart of the Tomcat server by the application manager solved the problem. Thanks to the maximum session count being properly set in the database for this user, the problem did not affect the whole database and stayed local to this particular application, which could not log in; we did not hit more severe problems such as exhausting all resources on the operating system side. After similar incidents, affected databases should be monitored for unusual (higher) session counts, because application connection pools may have gone wild; in such cases we should inform the application owners so they can reinitialize their connection pools (see the monitoring sketch after this list).
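
As an illustration of that follow-up idea, the sketch below (Python, python-oracledb; the connect string, credentials, thresholds and notification are placeholder assumptions, not an existing tool) polls V$SESSION periodically and flags accounts whose session count exceeds a per-account threshold, so that the application owners can be asked to reinitialize their connection pools.

import time

import oracledb

DSN = "db-host.example.cern.ch/ACCCON"   # placeholder connect string (assumption)
THRESHOLDS = {"ABOP": 50}                # assumed per-account session thresholds
POLL_SECONDS = 300


def count_sessions(conn):
    """Return {username: session count} for all non-background sessions."""
    with conn.cursor() as cur:
        cur.execute(
            """SELECT username, COUNT(*)
                 FROM v$session
                WHERE username IS NOT NULL
                GROUP BY username"""
        )
        return dict(cur.fetchall())


if __name__ == "__main__":
    # credentials are placeholders; a real tool would read them from a wallet/config
    with oracledb.connect(user="monitoring", password="***", dsn=DSN) as conn:
        while True:
            counts = count_sessions(conn)
            for user, threshold in THRESHOLDS.items():
                if counts.get(user, 0) > threshold:
                    # placeholder for a real notification (e-mail, SNOW ticket, ...)
                    print(f"WARNING: {user} has {counts[user]} sessions "
                          f"(threshold {threshold}) - inform the application owner")
            time.sleep(POLL_SECONDS)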

-- SzymonSkorupinski - 07-Jan-2013
