The Easter Cron, or "Why did cron stop?"

Description

  • A routine glibc update across the computer centre resulted in less than 1% of the hosts being left with no running crond.
  • This coincided with a recently updated lemon deployment in which actuators, and consequently operator alarms, were non-functional.

Impact

  • crond was not running on around 40 hosts (number to be confirmed), starting at various times from the afternoon of Tuesday 7th. This included voms, fts, srm and VOBox hosts; the full list is below.
  • Cron not running does no harm by itself. Of course, other items depending on cron then start to fail.
    1. Monitoring of the SRM service for SLS was unavailable from 14:00 on the 7th to 10:00 on the 8th.
    2. VOMS requires cron both to be monitored and to stay healthy.
      • Alice voms on lcg-voms.cern.ch faulty from 18:35 Wednesday 8th April to 19:20 on Friday 10th April.
      • Clients randomly pick one of voms.cern.ch or lcg-voms.cern.ch; those picking the latter will have hung.
      • The clients hang forever, so the VOBoxes accumulated a lot of idle processes requiring clean-up.
      • For users this was only an annoyance; VOBoxes would have been protected by retries.
      • The VOMS service was available throughout, only degraded.
    3. Loss of LHCb and CMS SRM for CERN CA users for 1 hour on Sunday morning.
      • CRLs were out of date (a quick check is sketched after this list).
    4. Free space monitoring of T0EXPORT for CMS via SLS was unavailable from Tuesday 7th to Tuesday 14th of April.
      • A ticket was sent to sls.support on the 7th by CMS. (Can't find the ticket at the moment)
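
As an aside on the expired CRLs, a quick way to see whether an installed CRL has run out is to ask openssl for its update times. This is only a sketch: on grid hosts the CRLs normally sit under /etc/grid-security/certificates with hash-based names, and the <hash>.r0 file name below is a placeholder.

# openssl crl -in /etc/grid-security/certificates/<hash>.r0 -noout -lastupdate -nextupdate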

Time Line of Incident

The log entries below are from voms115 as an example; the other hosts are the same, give or take a few hours.

Monday April 6th

  • New versions of lemon-agent, lemon-sensor-exception and lemon-xml-api are configured as the default package versions in CDB. At a later date lemon-sensor-exception is found to be broken, disabling actuators and operator alarms. This update was part of the routine preprod to prod update.

Tuesday April 7th

  • The SMOD kicks off a routine package update (nothing special about it, this is normal). This would have included the updated lemon packages from the day before as well.
    • 13:03  Executing: /usr/sbin/spma_ncm_wrapper.sh spma_ncm
  • The package update included a new glibc. The post-install script of glibc executes /usr/sbin/glibc_post_upgrade.i686, which causes a re-exec of init.
    • 13:08  init: Trying to re-exec init
  • This effectively causes a restart of all the services running under init, including ssh, syslog, ntpd and also crond. The other services all restarted okay, but on some boxes, e.g. voms, crond stopped and then did not start again. crond not running on voms115 and similar hosts is where everything started to go wrong. See question A.
    • 13:08  crond: crond shutdown succeeded
  • crond stopping is confirmed by the last log line in /var/log/cron being at 13:01. Cron jobs normally run at least every 20 minutes. (A quick local check for this failure mode is sketched after this list.)
  • 14:00 The SLS status of SRM monitoring starts to show an "unknown" status due to cron not running on srmmon02.
  • SLS monitoring of T0EXPORT for CMS becomes unavailable at some point. CMS submit a ticket to sls.support.
    • It is unclear whether this is SLS itself with a broken cron or, as I expect, some lower-level information source with a broken cron.
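
A quick local check for this failure mode, as referenced above (a minimal sketch; service and log paths as on these SLC4 hosts):

# /sbin/service crond status || /sbin/service crond start   # restart crond if it is not running
# tail -1 /var/log/cron                                      # the last entry should be at most ~20 minutes old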

Wednesday April 8th

  • 10:00 SLS status for SRM restored via a manual start of crond on srmmon02 by the SRM service manager. cron not running here was thought to be an isolated case.
  • The operator logs into fts115, apparently looking for the cause of an rpmverify alarm.
    • 13:38  Authorized to root, krb5 principal console@CERN.CH
    • 13:39  grep RPMV /var/log/messages
  • The cause is almost certainly that rpmverify has not run because cron is not running.
  • The operator runs a procedure; there is no change to the alarm state. In fact the procedure needs an update for this particular failure mode (todo).
  • The operator at this point does not contact the service manager.
  • 18:35 Retrospectively, from Alice's log files, this is when the Alice VOMS service becomes degraded, with lcg-voms.cern.ch going bad for Alice. voms.cern.ch continues to function.

Thursday April 9th

  • In CDB the default version of lemon-sensor-exception has been downgraded from 1.4.6-1 to 1.4.5-1. Presumably someone has noticed that the recently updated lemon-sensor-exception was not working: no actuators and no alarms to operators.
  • VOMS service manager away from work for the day.
  • 11:09 - Alice submits GGUS:47801; the ticket is passed through CERN-ROC to the service manager efficiently. The VOMS service manager, at home, is not reading his mail closely enough.
  • 12:13 - The operator opens a ticket to the VOMS service manager concerning the rpmverify alarm. Again, the service manager at home is not reading his mail closely enough.
  • Nothing else happens for the day.

Friday April 10th

  • 10:26 Alice involves Maarten explicitly in GGUS:47801; Maarten sends private mail to the VOMS service manager.
  • 18:46 The VOMS service manager sees Maarten's comment and updates GGUS:47801 to say it is being looked at.
  • 18:58 The VOMS service is no longer degraded; confirmation that a problem existed and its resolution were completed together.
  • 19:00 The VOMS service manager decides not to post to http://cern.ch/ssb. There is no obvious urgency for a service that is only degraded and has been in that state for over a day.
  • 19:26 GGUS:47801 updated to let Alice know.
  • 20:15 Enough analysis has happened to decide the problem won't return. This postmortem is started.
  • 23:30 The VOMS service manager realizes this is potentially a general cron problem across the computer centre.

Saturday April 11th

  • 00:43 Since the glibc package update was of course a global computer centre update, all nodes must be considered. Conveniently, the default lemon configuration templates include a count of crond processes.
  • lemon-cli -s --script -m 43 -n '*' | grep '0$'
cernhome21 43 1217313968 0
dbsrvd283 43 1227098644 0
fts111 43 1239403455 0
fts112 43 1239403357 0
fts113 43 1239403183 0
fts114 43 1239403249 0
fts202 43 1239402979 0
fts203 43 1239403364 0
fts205 43 1239403076 0
fts206 43 1239402938 0
fts207 43 1239403483 0
fts208 43 1239403400 0
lxb1937 43 1231147615 0
lxb7565 43 1239403098 0
lxc1rb46 43 1224233736 0
lxc1rc09 43 1228830351 0
lxc1rc21 43 1224235759 0
lxc1rf03old 43 1224232038 0
lxc2rb46 43 1224233633 0
lxc2rc09 43 1228830380 0
lxc3ed08old 43 1224231560 0
lxfs5564 43 1219928557 0
lxfsrd0501 43 1219928381 0
lxfsrl1103 43 1237327047 0
lxn1179 43 1221655073 0
srm502 43 1239403093 0
srm503 43 1239403137 0
srm602 43 1239403138 0
srm603 43 1239403153 0
srmmon01 43 1239403097 0
voatlas16 43 1239403372 0
vocms58 43 1239403132 0
 
  • The hosts involved are mainly, and most importantly, SRM and FTS.
  • 01:18 The VOMS service manager sends details and diagnosis to the SMOD. SRM and FTS are all under the SMOD's control; they can follow up (a possible way to restart crond across the affected hosts is sketched after this list).
  • There is a more general point here: given that crond was not running, why did the existing lemon sensor and exception apparently not run the actuator or tell anyone? See question B.
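
One possible follow-up across the affected hosts, reusing the lemon-cli query above; this is only a sketch, assuming root ssh access from an admin node and the output format shown above (host, metric id, timestamp, value):

for h in $(lemon-cli -s --script -m 43 -n '*' | awk '$4 == 0 {print $1}'); do
    ssh root@$h '/sbin/service crond status || /sbin/service crond start'
done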

Sunday April 12th

  • 12:00 CRLs expired on the LHCb and CMS SRM service. The SMOD contacted a storage expert and realized this was caused by the death of cron, as detailed in the earlier mail.
  • 13:00 LHCb and CMS SRM Service restored, both CMS and LHCb notified. SSB entry requested for SRM.
  • 13:00 According to the lemon metric the following hosts are still not running crond: cernhome21, dbsrvd283, lxb1937, lxb7565, lxc1rb46, lxc1rc09, lxc1rc21, lxc1rf03old, lxc2rb46, lxc2rc09, lxc3ed08old, lxfs5564, lxfsrd0501, lxfsrl1103, lxn1179, voatlas16, vocms58. Some of these don't actually appear to exist or be running.
  • 16:00 Conveniently, the SRM, FTS and VOMS service managers and the SMOD all meet at a party and discuss the situation a little.

Monday April 13th

  • 02:35 Message sent to the MOD about the incident by the VOMS service manager; it includes the SRM outage as well.
  • 12:45 Neither of the requests from the SRM or VOMS service managers has been added to the CERN SSB?

Tuesday April 14th

  • 09:00 The VOMS manager and FIO members discuss the situation.

Wednesday April 15th

  • 10:00 The SMOD triggers a site-wide SPMA update. At this point lemon-metric-exception is known to be at a working version across the computer centre.

Question A: Why did cron stop and not restart when init was re-exec'ed?

To be determined. Given this happened on four boxes, I can downgrade glibc and upgrade again to see if this is reliably bad. While it was less than 1% of hosts, particular services look to have been hit worse: voms, fts and srm. All to be investigated; at the very least it should be reported to linux support. Action on Steve.

This is the SPMA log for glibc.

2009/04/07-13:08:04        erase  glibc 2.3.4 2.41 i686
2009/04/07-13:08:04        install http://swrep/swrep/i386_slc4/ glibc 2.3.4 2.41.el4_7.1 i686
2009/04/07-13:08:04        install http://swrep/swrep/i386_slc4/ glibc 2.3.4 2.41.el4_7.1 i686
2009/04/07-13:08:04        replace   glibc-common 2.3.4 2.41 i386  with  http://swrep/swrep/i386_slc4/ glibc-common 2.3.4 2.41.el4_7.1 i386
2009/04/07-13:08:04        replace   glibc-headers 2.3.4 2.41 i386  with  http://swrep/swrep/i386_slc4/ glibc-headers 2.3.4 2.41.el4_7.1 i386
2009/04/07-13:08:40 [WARN] warning: package glibc = 2.3.4-2.41.el4_7.1 was already added, replacing with glibc <= 2.3.4-2.41.el4_7.1

It's not obvious why that is an erase and install for glibc but a replace for glibc-common. There is no reason why that should matter, though.

So the summary of package updates is:

From                                 To
glibc-headers-2.3.4-2.41.i386.rpm    glibc-headers-2.3.4-2.41.el4_7.1.i386.rpm
glibc-common-2.3.4-2.41.i386.rpm     glibc-common-2.3.4-2.41.el4_7.1.i386.rpm
glibc-2.3.4-2.41.i686.rpm            glibc-2.3.4-2.41.el4_7.1.i686.rpm

Trying a simple downgrade and upgrade.

# rpm -Uvh --oldpackage glibc-2.3.4-2.41.i686.rpm  glibc-common-2.3.4-2.41.i386.rpm \
                                         glibc-headers-2.3.4-2.41.i386.rpm
# spma_wrapper.sh
Then crond is successfully stopped and started. This is far less than the actual package update, so it is not a real repeat of what happened; a closer repeat is sketched below.
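
A closer repeat would downgrade all three packages and then go through the same wrapper that notd executed on the 7th, watching crond throughout. This is only a sketch; the wrapper invocation is taken from the 13:03 entry in the timeline above, and whether it pulls in exactly the same package set needs checking.

# rpm -Uvh --oldpackage glibc-2.3.4-2.41.i686.rpm glibc-common-2.3.4-2.41.i386.rpm glibc-headers-2.3.4-2.41.i386.rpm
# pgrep crond                                  # note the crond pid before the update
# /usr/sbin/spma_ncm_wrapper.sh spma_ncm       # same entry point notd executed at 13:03 on the 7th
# pgrep crond || echo "crond did not survive"  # check whether crond is still running afterwards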

Question B: Why did the configured lemon metric and exception for crond not alert anyone?

There is a metric, number 43, that counts the number of crond processes. The plots below are for the window of cron downtime on voms115; both graphs cover the period 01:00 April 6th to 01:00 April 11th. The first is the number of crond processes, showing that 0 were detected during the downtime. The second is the exception metric, set to non-zero because the crond count was less than one. Given there is an exception, it's unclear to me why the operators were not alerted. Asking the lemon team. Done. (A spot-check of the live metric with lemon-cli is shown after the plots.)

  • Count of crond processes over the interesting period:
    cron-metric.png

  • Exception values for the crond count; I thought a non-zero value raised an alarm:
    cron-exception.png
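
For reference, the same lemon-cli invocation as in the Saturday entry can be used to spot-check the live value of metric 43 on a single host; the output format is assumed to be the same as above (host, metric id, timestamp, crond count).

lemon-cli -s --script -m 43 -n voms115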

The summary of events related to lemon.

The lemon-metric-exception deployed at version 1.4.6 was found to be broken. Summary of the timeline:

  • April 6th - New lemon version configured as default in CDB.
  • April 7th - The newly configured lemon version is guaranteed to be deployed due to the site-wide SPMA update.
  • April 9th - The old version of lemon-metric-exception is configured in CDB.
  • April 11th - The VOMS service manager raises a ticket with lemon.support about the faulty metrics.
  • April 15th - The old version of lemon-metric-exception is now known to be deployed everywhere due to a site-wide SPMA update.
  • April 15th - No response yet to the lemon.support ticket from the 11th.

Followup

VOMS Response Slow

A GGUS ticket for VOMS out of hours is dependent on best effort at 3 layers: TPM, CERN-ROC and the VOMS service manager. This should be considered.

Update Procedures for SSB Notices.

The 2009 procedure linked from http://cern.ch/ssb makes no mention that a different address should be used out of hours. The page should be updated, but I would recommend that it be the same address regardless of time, with the handover managed behind the scenes. Action on Steve to make the request.
  • April 14th 12:30 The request has been made.
  • April 14th. Response from the MOD, an extremely polite reply. The suggestions for addressing the out-of-hours SSB procedure will be considered next week once particular staff have returned. This is fine, no particular urgency.

On Call SMOD

Next time, email requests to the on-call SMOD should be followed up with a telephone call if they require action before the next working day.

Service Manager Notification

Service managers really need automatic notification that their nodes are changing due to someone else's commit, e.g. a change summary per CDB cluster or something similar. Ask elfms support if they have considered anything. Action on Steve.

SLS Monitoring, Lemon, Callouts.

On a few occasions throughout this incident SLS noticed a problem with a service that the current lemon configuration, and consequently the operators, did not. Call-outs are only made on lemon alarms, so no call-out is made despite SLS clearly detecting a problem. Various things can be considered, each with benefits and drawbacks.
  • Call out on SLS status?
  • Lemon metric based on SLS status? (A bit circular; a rough sketch follows this list.)
  • Migrate existing SLS tests to lemon metrics, and have SLS use the lemon metrics.
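
A very rough sketch of the second option: a locally run script that turns an SLS availability figure into a single number that a lemon metric could sample. The SLS URL, the service id and the XML element name here are assumptions and would need checking against the SLS documentation.

#!/bin/sh
# Assumed SLS service id and XML endpoint; both need verifying against SLS.
SERVICE=${1:-CASTOR-SRM-CMS}
URL="http://sls.cern.ch/sls/getServiceInfo.php?id=$SERVICE"
# Extract an <availability> value; print 0 if SLS is unreachable or the format differs.
AVAIL=$(wget -q -O - "$URL" | sed -n 's/.*<availability>\([0-9.]*\)<\/availability>.*/\1/p')
echo "${AVAIL:-0}"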

Catching a Lemon Release before Production Next Time.

In principle this should be caught in the preprod->prod release cycle. Some checks will be added to preprod testing to confirm that actuators and alarms work; a minimal example of such a check is sketched below.
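
A minimal sketch of such a check, using crond and the actuator behaviour from this incident; the 10 minute wait is a guess at how long the sensor, exception and actuator need to react.

#!/bin/sh
# Run on a preprod host only: deliberately provoke the crond exception.
/sbin/service crond stop
sleep 600                            # allow time for the sensor, exception and actuator
if pgrep crond >/dev/null; then
    echo "OK: actuator restarted crond"
else
    echo "FAIL: crond still down, actuator or alarm path is broken"
    /sbin/service crond start        # clean up after the test
fi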

Full syslog from voms115.

Apr  7 13:03:34 voms115 notd[3727]: Connection from lxservb04.cern.ch port 33853/tcp
Apr  7 13:03:34 voms115 notd[3727]: Received tokens: "'change' 'spma_ncm'"
Apr  7 13:03:34 voms115 notd[3727]: Executing "/usr/sbin/notd_exec '-c' 'notd' 'lxservb04.cern.ch' '33853' 'tcp' '--' 'change' 'spma_ncm'"
Apr  7 13:03:34 voms115 notd_exec[30373]: Executing: /usr/sbin/spma_ncm_wrapper.sh spma_ncm
Apr  7 13:08:27 voms115 init: Trying to re-exec init
Apr  7 13:08:27 voms115 sshd[3104]: Received signal 15; terminating.
Apr  7 13:08:27 voms115 sshd: sshd -TERM succeeded
Apr  7 13:08:30 voms115 sshd[30949]: Server listening on :: port 22.
Apr  7 13:08:30 voms115 sshd[30949]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Apr  7 13:08:30 voms115 sshd:  succeeded
Apr  7 13:08:30 voms115 crond: crond shutdown succeeded
Apr  7 13:08:31 voms115 kernel: Kernel logging (proc) stopped.
Apr  7 13:08:31 voms115 kernel: Kernel log daemon terminating.
Apr  7 13:08:32 voms115 syslog: klogd shutdown succeeded
Apr  7 13:08:32 voms115 exiting on signal 15
Apr  7 13:08:32 voms115 syslogd 1.4.1: restart.
Apr  7 13:08:32 voms115 syslog: syslogd startup succeeded
Apr  7 13:08:32 voms115 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Apr  7 13:08:32 voms115 syslog: klogd startup succeeded
Apr  7 13:08:32 voms115 sshd[30949]: Received signal 15; terminating.
Apr  7 13:08:32 voms115 sshd: sshd -TERM succeeded
Apr  7 13:08:32 voms115 syslog: syslogd shutdown succeeded 
Apr  7 13:08:35 voms115 sshd[31041]: Server listening on :: port 22.
Apr  7 13:08:35 voms115 sshd[31041]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Apr  7 13:08:35 voms115 sshd:  succeeded
Apr  7 13:08:36 voms115 named[2911]: shutting down: flushing changes
Apr  7 13:08:36 voms115 named[2911]: stopping command channel on 127.0.0.1#953
Apr  7 13:08:36 voms115 named:  succeeded
Apr  7 13:08:36 voms115 named[2911]: no longer listening on 127.0.0.1#53
Apr  7 13:08:36 voms115 named[2911]: exiting
Apr  7 13:08:38 voms115 named[31076]: starting BIND 9.2.4 -u named
Apr  7 13:08:38 voms115 named[31076]: using 8 CPUs
Apr  7 13:08:38 voms115 named[31076]: loading configuration from '/etc/named.conf'
Apr  7 13:08:38 voms115 named[31076]: listening on IPv4 interface lo, 127.0.0.1#53
Apr  7 13:08:38 voms115 named[31076]: command channel listening on 127.0.0.1#953
Apr  7 13:08:38 voms115 named[31076]: running
Apr  7 13:08:38 voms115 named: named startup succeeded
Apr  7 13:08:38 voms115 ntpd[3124]: ntpd exiting on signal 15
Apr  7 13:08:38 voms115 ntpd: ntpd shutdown succeeded

-- SteveTraylen - 10 Apr 2009
