The Easter Cron, or "Why did cron stop?"

Description

  • A routine glibc update across the computer centre resulted in less than 1% of the hosts being left with no running crond.
  • This coincided with a recently updated lemon deployment in which actuators, and consequently operator alarms, were non-functional.

Impact

  • crond was not running on around 40 hosts (number to be confirmed), starting at various times from the afternoon of Tuesday 7th. This included voms, fts, srm and VOBox hosts; the full list is below.
  • Cron not running does no harm by itself. Of course, other items depending on cron then start to fail.
    1. Monitoring of the SRM service for SLS was unavailable from 14:00 on the 7th to 10:00 on the 8th.
    2. VOMS requires cron both to be monitored and to stay healthy.
      • Alice voms on lcg-voms.cern.ch faulty from 18:35 Wednesday 8th April to 19:20 on Friday 10th April.
      • Clients randomly pick one of voms.cern.ch or lcg-voms.cern.ch; those picking the latter will have hung.
      • The clients hang forever, so the VOBoxes accumulated a lot of idle processes requiring clean-up.
      • For users this was only an annoyance; VOBoxes would have been protected by retries.
      • The VOMS service was available throughout, only degraded.
    3. Loss of LHCb and CMS SRM for CERN CA users for 1 hour on Sunday morning.
      • CRLs were out of date (a quick check is sketched after this list).
    4. Free space monitoring of T0EXPORT for CMS via SLS was unavailable from Tuesday 7th to Tuesday 14th of April.
      • A ticket was sent to sls.support on the 7th by CMS. (Can't find the ticket at the moment)
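
As an aside on the expired CRLs, a quick way to see whether an installed CRL has run out is to ask openssl for its update times. This is only a sketch: on grid hosts the CRLs normally sit under /etc/grid-security/certificates with hash-based names, and the <hash>.r0 file name below is a placeholder.

# openssl crl -in /etc/grid-security/certificates/<hash>.r0 -noout -lastupdate -nextupdate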

Time Line of Incident

The log entries below are from voms115 as an example; the other hosts are the same, give or take a few hours.

Monday April 6th

  • New versions of lemon-agent, lemon-sensor-exception and lemon-xml-api are configured as the default package versions in CDB. At a later date lemon-sensor-exception is found to be broken, disabling actuators and operator alarms. This update was part of the routine preprod to prod update.

Tuesday April 7th

  • The SMOD kicks off a routine package update (nothing special about it, this is normal). This would have included the updated lemon packages from the day before as well.
    • 13:03  Executing: /usr/sbin/spma_ncm_wrapper.sh spma_ncm
  • The package update included a new glibc. The post-install script of glibc executes /usr/sbin/glibc_post_upgrade.i686, which causes a re-exec of init.
    • 13:08  init: Trying to re-exec init
  • This effectively causes a restart of all the services running under init, including ssh, syslog, ntpd and also crond. The other services all restarted okay, but on some boxes, e.g. voms, crond stopped and then did not start again. crond not running on voms115 and similar hosts is where everything started to go wrong. See question A.
    • 13:08  crond: crond shutdown succeeded
  • crond stopping is confirmed by the last log line in /var/log/cron being at 13:01. Cron jobs normally run at least every 20 minutes. (A quick local check for this failure mode is sketched after this list.)
  • 14:00 The SLS status of SRM monitoring starts to show an "unknown" status due to cron not running on srmmon02.
  • SLS monitoring of T0EXPORT for CMS becomes unavailable at some point. CMS submit a ticket to sls.support.
    • It is unclear whether this is SLS itself with a broken cron or, as I expect, some lower-level information source with a broken cron.
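
A quick local check for this failure mode, as referenced above (a minimal sketch; service and log paths as on these SLC4 hosts):

# /sbin/service crond status || /sbin/service crond start   # restart crond if it is not running
# tail -1 /var/log/cron                                      # the last entry should be at most ~20 minutes old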

Wednesday April 8th

  • 10:00 SLS status for SRM restored via a manual start of crond on srmmon02 by the SRM service manager. cron not running here was thought to be an isolated case.
  • The operator logs into fts115, apparently looking for the cause of an rpmverify alarm.
    • 13:38  Authorized to root, krb5 principal console@CERN.CH
    • 13:39  grep RPMV /var/log/messages
  • The cause is almost certainly that rpmverify has not run because cron is not running.
  • The operator runs a procedure; there is no change to the alarm state. In fact the procedure needs an update for this particular failure mode (todo).
  • The operator at this point does not contact the service manager.
  • 18:35 Retrospectively, from Alice's log files, this is when the Alice VOMS service becomes degraded, with lcg-voms.cern.ch going bad for Alice. voms.cern.ch continues to function.

Thursday April 9th

  • In CDB the default version of lemon-sensor-exception has been downgraded from 1.4.6-1 to 1.4.5-1. Presumably someone has noticed that the recently updated lemon-sensor-exception was not working: no actuators and no alarms to operators.
  • VOMS service manager away from work for the day.
  • 11:09 - Alice submits GGUS:47801; the ticket is passed through CERN-ROC to the service manager efficiently. The VOMS service manager, at home, is not reading his mail closely enough.
  • 12:13 - The operator opens a ticket to the VOMS service manager concerning the rpmverify alarm. Again, the service manager at home is not reading his mail closely enough.
  • Nothing else happens for the day.

Friday April 10th

  • 10:26 Alice involves Maarten explicitly in GGUS:47801; Maarten sends private mail to the VOMS service manager.
  • 18:46 The VOMS service manager sees Maarten's comment and updates GGUS:47801 to say it is being looked at.
  • 18:58 The VOMS service is no longer degraded; confirmation that a problem existed and its resolution were completed together.
  • 19:00 The VOMS service manager decides not to post to http://cern.ch/ssb. There is no obvious urgency for a service that is only degraded and has been in that state for over a day.
  • 19:26 GGUS:47801 updated to let Alice know.
  • 20:15 Enough analysis has happened to decide the problem won't return. This postmortem is started.
  • 23:30 The VOMS service manager realizes this is potentially a general cron problem across the computer centre.

Saturday April 11th

  • 00:43 Since the glibc package update was of course a global computer centre update, all nodes must be considered. Conveniently, the default lemon configuration templates include a count of crond processes.
  • lemon-cli -s --script -m 43 -n '*' | grep '0$'
cernhome21 43 1217313968 0
dbsrvd283 43 1227098644 0
fts111 43 1239403455 0
fts112 43 1239403357 0
fts113 43 1239403183 0
fts114 43 1239403249 0
fts202 43 1239402979 0
fts203 43 1239403364 0
fts205 43 1239403076 0
fts206 43 1239402938 0
fts207 43 1239403483 0
fts208 43 1239403400 0
lxb1937 43 1231147615 0
lxb7565 43 1239403098 0
lxc1rb46 43 1224233736 0
lxc1rc09 43 1228830351 0
lxc1rc21 43 1224235759 0
lxc1rf03old 43 1224232038 0
lxc2rb46 43 1224233633 0
lxc2rc09 43 1228830380 0
lxc3ed08old 43 1224231560 0
lxfs5564 43 1219928557 0
lxfsrd0501 43 1219928381 0
lxfsrl1103 43 1237327047 0
lxn1179 43 1221655073 0
srm502 43 1239403093 0
srm503 43 1239403137 0
srm602 43 1239403138 0
srm603 43 1239403153 0
srmmon01 43 1239403097 0
voatlas16 43 1239403372 0
vocms58 43 1239403132 0
 
  • The hosts involved are mainly, and most importantly, SRM and FTS.
  • 01:18 The VOMS service manager sends details and diagnosis to the SMOD. SRM and FTS are all under the SMOD's control; they can follow up (a possible way to restart crond across the affected hosts is sketched after this list).
  • There is a more general point here: given that crond was not running, why did the existing lemon sensor and exception apparently not run the actuator or tell anyone? See question B.
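
One possible follow-up across the affected hosts, reusing the lemon-cli query above; this is only a sketch, assuming root ssh access from an admin node and the output format shown above (host, metric id, timestamp, value):

for h in $(lemon-cli -s --script -m 43 -n '*' | awk '$4 == 0 {print $1}'); do
    ssh root@$h '/sbin/service crond status || /sbin/service crond start'
done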

Sunday April 12th

  • 12:00 CRLs expired on the LHCb and CMS SRM service. The SMOD contacted a storage expert and realized this was caused by the death of cron, as detailed in the earlier mail.
  • 13:00 LHCb and CMS SRM Service restored, both CMS and LHCb notified. SSB entry requested for SRM.
  • 13:00 According to the lemon metric the following hosts are still not running crond: cernhome21, dbsrvd283, lxb1937, lxb7565, lxc1rb46, lxc1rc09, lxc1rc21, lxc1rf03old, lxc2rb46, lxc2rc09, lxc3ed08old, lxfs5564, lxfsrd0501, lxfsrl1103, lxn1179, voatlas16, vocms58. Some of these don't actually appear to exist or be running.
  • 16:00 Conveniently, the SRM, FTS and VOMS service managers and the SMOD all meet at a party and discuss the situation a little.

Monday April 13th

  • 02:35 Message sent to the MOD about the incident by the VOMS service manager; it includes the SRM outage as well.
  • 12:45 Neither of the requests from the SRM or VOMS service managers has been added to the CERN SSB?

Tuesday April 14th

  • 09:00 The VOMS manager and FIO members discuss the situation.

Wednesday April 15th

  • 10:00 The SMOD triggers a site-wide SPMA update. At this point lemon-metric-exception is known to be at a working version across the computer centre.

Question A: Why did cron stop and not restart when init was re-exec'ed?

To be determined. Given this happened on four boxes, I can downgrade glibc and upgrade again to see if this is reliably bad. While it was less than 1% of hosts, particular services look to have been hit worse: voms, fts and srm. All to be investigated; at the very least it should be reported to linux support. Action on Steve.

This is the SPMA log for glibc.

2009/04/07-13:08:04        erase  glibc 2.3.4 2.41 i686
2009/04/07-13:08:04        install http://swrep/swrep/i386_slc4/ glibc 2.3.4 2.41.el4_7.1 i686
2009/04/07-13:08:04        install http://swrep/swrep/i386_slc4/ glibc 2.3.4 2.41.el4_7.1 i686
2009/04/07-13:08:04        replace   glibc-common 2.3.4 2.41 i386  with  http://swrep/swrep/i386_slc4/ glibc-common 2.3.4 2.41.el4_7.1 i386
2009/04/07-13:08:04        replace   glibc-headers 2.3.4 2.41 i386  with  http://swrep/swrep/i386_slc4/ glibc-headers 2.3.4 2.41.el4_7.1 i386
2009/04/07-13:08:40 [WARN] warning: package glibc = 2.3.4-2.41.el4_7.1 was already added, replacing with glibc <= 2.3.4-2.41.el4_7.1

It's not obvious why that is an erase and install for glibc but a replace for glibc-common. There is no reason why that should matter, though.

So the summary of package updates is:

From                                 To
glibc-headers-2.3.4-2.41.i386.rpm    glibc-headers-2.3.4-2.41.el4_7.1.i386.rpm
glibc-common-2.3.4-2.41.i386.rpm     glibc-common-2.3.4-2.41.el4_7.1.i386.rpm
glibc-2.3.4-2.41.i686.rpm            glibc-2.3.4-2.41.el4_7.1.i686.rpm

Trying a simple downgrade and upgrade.

# rpm -Uvh --oldpackage glibc-2.3.4-2.41.i686.rpm  glibc-common-2.3.4-2.41.i386.rpm \
                                         glibc-headers-2.3.4-2.41.i386.rpm
# spma_wrapper.sh
Then crond is successfully stopped and started. This is far less than the actual package update, so it is not a real repeat of what happened; a closer repeat is sketched below.
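
A closer repeat would downgrade all three packages and then go through the same wrapper that notd executed on the 7th, watching crond throughout. This is only a sketch; the wrapper invocation is taken from the 13:03 entry in the timeline above, and whether it pulls in exactly the same package set needs checking.

# rpm -Uvh --oldpackage glibc-2.3.4-2.41.i686.rpm glibc-common-2.3.4-2.41.i386.rpm glibc-headers-2.3.4-2.41.i386.rpm
# pgrep crond                                  # note the crond pid before the update
# /usr/sbin/spma_ncm_wrapper.sh spma_ncm       # same entry point notd executed at 13:03 on the 7th
# pgrep crond || echo "crond did not survive"  # check whether crond is still running afterwards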

Question B: Why did the configured lemon metric and exception for crond not alert anyone?

There is a metric, number 43, that counts the number of crond processes. The plots below are for the window of cron downtime on voms115; both graphs cover the period 01:00 April 6th to 01:00 April 11th. The first is the number of crond processes, showing that 0 were detected during the downtime. The second is the exception metric, set to non-zero because the crond count was less than one. Given there is an exception, it's unclear to me why the operators were not alerted. Asking the lemon team. Done. (A spot-check of the live metric with lemon-cli is shown after the plots.)

  • Count of crond processes over the interesting period:
    cron-metric.png

  • Exception values for the crond count; I thought a non-zero value raised an alarm:
    cron-exception.png
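
For reference, the same lemon-cli invocation as in the Saturday entry can be used to spot-check the live value of metric 43 on a single host; the output format is assumed to be the same as above (host, metric id, timestamp, crond count).

lemon-cli -s --script -m 43 -n voms115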

The summary of events related to lemon.

The lemon-metric-exception deployed at version 1.4.6 was found to be broken. Summary of the timeline:

  • April 6th - New lemon version configured as default in CDB.
  • April 7th - The newly configured lemon version is guaranteed to be deployed due to the site-wide SPMA update.
  • April 9th - The old version of lemon-metric-exception is configured in CDB.
  • April 11th - The VOMS service manager raises a ticket with lemon.support about the faulty metrics.
  • April 15th - The old version of lemon-metric-exception is now known to be deployed everywhere due to a site-wide SPMA update.
  • April 15th - No response yet to the lemon.support ticket from the 11th.

Followup

VOMS Response Slow

A GGUS ticket for VOMS out of hours is dependent on best effort at 3 layers: TPM, CERN-ROC and the VOMS service manager. This should be considered.

Update Procedures for SSB Notices.

The 2009 procedure linked from http://cern.ch/ssb makes no mention that a different address should be used out of hours. The page should be updated, but I would recommend that it be the same address regardless of time, with the handover managed behind the scenes. Action on Steve to make the request.
  • April 14th 12:30 The request has been made.
  • April 14th. Response from the MOD, an extremely polite reply. The suggestions for addressing the out-of-hours SSB procedure will be considered next week once particular staff have returned. This is fine, no particular urgency.

On Call SMOD

Next time, email requests to the on-call SMOD should be followed up with a telephone call if they require action before the next working day.

Service Manager Notification

Service managers really need automatic notification that their nodes are changing due to someone else's commit, e.g. a change summary per CDB cluster or something similar. Ask elfms support if they have considered anything. Action on Steve.

SLS Monitoring, Lemon, Callouts.

On a few occasions throughout this incident SLS noticed a problem with a service that the current lemon configuration, and consequently the operators, did not. Call-outs are only made on lemon alarms, so no call-out is made despite SLS clearly detecting a problem. Various things can be considered, each with benefits and drawbacks.
  • Call out on SLS status?
  • Lemon metric based on SLS status? (A bit circular; a rough sketch follows this list.)
  • Migrate existing SLS tests to lemon metrics, and have SLS use the lemon metrics.
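
A very rough sketch of the second option: a locally run script that turns an SLS availability figure into a single number that a lemon metric could sample. The SLS URL, the service id and the XML element name here are assumptions and would need checking against the SLS documentation.

#!/bin/sh
# Assumed SLS service id and XML endpoint; both need verifying against SLS.
SERVICE=${1:-CASTOR-SRM-CMS}
URL="http://sls.cern.ch/sls/getServiceInfo.php?id=$SERVICE"
# Extract an <availability> value; print 0 if SLS is unreachable or the format differs.
AVAIL=$(wget -q -O - "$URL" | sed -n 's/.*<availability>\([0-9.]*\)<\/availability>.*/\1/p')
echo "${AVAIL:-0}"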

Catching a Lemon Release before Production Next Time.

In principle this should be caught in the preprod->prod release cycle. Some checks will be added to preprod testing to confirm that actuators and alarms work; a minimal example of such a check is sketched below.
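
A minimal sketch of such a check, using crond and the actuator behaviour from this incident; the 10 minute wait is a guess at how long the sensor, exception and actuator need to react.

#!/bin/sh
# Run on a preprod host only: deliberately provoke the crond exception.
/sbin/service crond stop
sleep 600                            # allow time for the sensor, exception and actuator
if pgrep crond >/dev/null; then
    echo "OK: actuator restarted crond"
else
    echo "FAIL: crond still down, actuator or alarm path is broken"
    /sbin/service crond start        # clean up after the test
fi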

Full syslog from voms115.

Apr  7 13:03:34 voms115 notd[3727]: Connection from lxservb04.cern.ch port 33853/tcp
Apr  7 13:03:34 voms115 notd[3727]: Received tokens: "'change' 'spma_ncm'"
Apr  7 13:03:34 voms115 notd[3727]: Executing "/usr/sbin/notd_exec '-c' 'notd' 'lxservb04.cern.ch' '33853' 'tcp' '--' 'change' 'spma_ncm'"
Apr  7 13:03:34 voms115 notd_exec[30373]: Executing: /usr/sbin/spma_ncm_wrapper.sh spma_ncm
Apr  7 13:08:27 voms115 init: Trying to re-exec init
Apr  7 13:08:27 voms115 sshd[3104]: Received signal 15; terminating.
Apr  7 13:08:27 voms115 sshd: sshd -TERM succeeded
Apr  7 13:08:30 voms115 sshd[30949]: Server listening on :: port 22.
Apr  7 13:08:30 voms115 sshd[30949]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Apr  7 13:08:30 voms115 sshd:  succeeded
Apr  7 13:08:30 voms115 crond: crond shutdown succeeded
Apr  7 13:08:31 voms115 kernel: Kernel logging (proc) stopped.
Apr  7 13:08:31 voms115 kernel: Kernel log daemon terminating.
Apr  7 13:08:32 voms115 syslog: klogd shutdown succeeded
Apr  7 13:08:32 voms115 exiting on signal 15
Apr  7 13:08:32 voms115 syslogd 1.4.1: restart.
Apr  7 13:08:32 voms115 syslog: syslogd startup succeeded
Apr  7 13:08:32 voms115 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Apr  7 13:08:32 voms115 syslog: klogd startup succeeded
Apr  7 13:08:32 voms115 sshd[30949]: Received signal 15; terminating.
Apr  7 13:08:32 voms115 sshd: sshd -TERM succeeded
Apr  7 13:08:32 voms115 syslog: syslogd shutdown succeeded 
Apr  7 13:08:35 voms115 sshd[31041]: Server listening on :: port 22.
Apr  7 13:08:35 voms115 sshd[31041]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Apr  7 13:08:35 voms115 sshd:  succeeded
Apr  7 13:08:36 voms115 named[2911]: shutting down: flushing changes
Apr  7 13:08:36 voms115 named[2911]: stopping command channel on 127.0.0.1#953
Apr  7 13:08:36 voms115 named:  succeeded
Apr  7 13:08:36 voms115 named[2911]: no longer listening on 127.0.0.1#53
Apr  7 13:08:36 voms115 named[2911]: exiting
Apr  7 13:08:38 voms115 named[31076]: starting BIND 9.2.4 -u named
Apr  7 13:08:38 voms115 named[31076]: using 8 CPUs
Apr  7 13:08:38 voms115 named[31076]: loading configuration from '/etc/named.conf'
Apr  7 13:08:38 voms115 named[31076]: listening on IPv4 interface lo, 127.0.0.1#53
Apr  7 13:08:38 voms115 named[31076]: command channel listening on 127.0.0.1#953
Apr  7 13:08:38 voms115 named[31076]: running
Apr  7 13:08:38 voms115 named: named startup succeeded
Apr  7 13:08:38 voms115 ntpd[3124]: ntpd exiting on signal 15
Apr  7 13:08:38 voms115 ntpd: ntpd shutdown succeeded

-- SteveTraylen - 10 Apr 2009
