FIO-Fs has reported issues with LSF killing jobs because they exceed the CPU time limit. On investigation, it turned out that these jobs could not have accumulated that much CPU time - in other words, the per-process statistics under
/proc/PID/stat
are corrupted.
Filed with Red Hat as
IT 127884, July 07.
LSF side
On the LSF side, the process responsible on each machine is called "pim". It reads
/proc/PID/stat
regularly, and summarizes the information into
/tmp/pim.info.HOSTNAME
.
If the CPU time exceeds the limit (wherever this is configured), the process is killed via SIGXCPU/signal 24 (CPU time limit exceeded); this should drop a core.
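A rough sketch of the check pim presumably performs (the limit handling and function names here are assumptions for illustration, not pim's actual code):

```python
import os
import signal

# Hypothetical sketch of pim's per-process check: sum utime+stime
# (fields 14/15 of /proc/PID/stat, in clock ticks) and send SIGXCPU
# once the process passes its limit.
def cpu_seconds(utime_ticks, stime_ticks, hz):
    return (utime_ticks + stime_ticks) / hz

def check_cpu_limit(pid, limit_seconds):
    hz = os.sysconf("SC_CLK_TCK")
    with open("/proc/%d/stat" % pid) as f:
        # field 2, "(cmd)", may contain spaces - split after the last ')'
        rest = f.read().rsplit(")", 1)[1].split()
    used = cpu_seconds(int(rest[11]), int(rest[12]), hz)  # fields 14, 15
    if used > limit_seconds:
        # SIGXCPU's default action terminates the process and dumps core
        os.kill(pid, signal.SIGXCPU)
    return used
```

If /proc/PID/stat reports garbage utime/stime, a check like this will kill a process that never actually ran that long - which matches the symptom.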
Central LSF accounting is collected in
/afs/cern.ch/project/lsf/7.0/mnt/batch/work/batch/acct
.
Ulrich already filters abnormal CPU time from the site statistics due to this bug. He has a program to translate the LSF accounting logs, see
/afs/cern.ch/user/u/uschwick/Programming/LSF/AnaLsbAcct/
Kernel side
/proc/pid/stat
is filled out by
fs/proc/array.c:do_task_stat()
, see there for the format. In 2.6.9 (starting at 1=pid, 2=(cmd), ...):
field 14 is utime, user space (converted via
jiffies_to_clock_t()
, i.e. in "ticks")
field 15: stime, system time (time spent in system calls)
field 16: cutime (utime sum for children - waited-for and "ignored"?)
field 17: cstime (stime sum for children)
field 22: start_time: in jiffies, need to compare to uptime.
(small tool to convert is at
procstat, assumes HZ=100)
(see
man 2 times
on how these fields are supposed to be interpreted).
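As an illustration, the fields above can be pulled out of a stat line like this (a sketch assuming the 2.6 format listed above; note that field 2, the command name, can itself contain spaces, so the line is split after the last closing parenthesis):

```python
import os

# Sketch: extract the CPU-time fields from a /proc/PID/stat line.
# Field numbers follow the list above (14=utime, 15=stime, 16=cutime,
# 17=cstime, 22=start_time); values are in clock ticks, so divide by
# the tick rate (sysconf(_SC_CLK_TCK), typically 100) to get seconds.
def parse_stat(line, hz):
    # field 2, "(cmd)", may contain spaces - split after the last ')'
    rest = line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3]
    field = lambda n: int(rest[n - 3])
    return {
        "utime": field(14) / hz,       # user-space CPU time, seconds
        "stime": field(15) / hz,       # system CPU time, seconds
        "cutime": field(16) / hz,      # children's utime (waited-for)
        "cstime": field(17) / hz,      # children's stime
        "start_time": field(22) / hz,  # since boot; compare to uptime
    }

hz = os.sysconf("SC_CLK_TCK")
with open("/proc/self/stat") as f:
    print(parse_stat(f.read(), hz))
```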
case: RFIOd
(This is where the case was originally reported.) A short analysis on lxfsra0803 (with lots of rfio core files under /tmp) shows:
1 #0 0x0000003857872400 in memcpy () from /lib64/tls/libc.so.6
40 #0 0x0000003857f0b222 in __read_nocancel () from /lib64/tls/libpthread.so.0
154 #0 0x00000038578c0545 in __select_nocancel () from /lib64/tls/libc.so.6
206 #0 0x0000003857f0c0b4 in wait () from /lib64/tls/libpthread.so.0
(i.e. most of the rfiods got killed while in wait(), the others in functions that are supposed to be sleeping/blocked, i.e. should not consume CPU time). Similar pattern on lxfsra0802.
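That a blocked process really accumulates next to no CPU time is easy to confirm from the resource usage that wait4() reports for a reaped child - a minimal sketch:

```python
import os
import time

# A child that sits blocked (here in nanosleep, analogous to an rfiod
# blocked in wait()/select()/read()) accumulates essentially no CPU
# time; os.wait4() returns the reaped child's real user/system time.
pid = os.fork()
if pid == 0:
    time.sleep(0.5)  # blocked the whole time, not running
    os._exit(0)
_, status, rusage = os.wait4(pid, 0)
cpu = rusage.ru_utime + rusage.ru_stime
print("child: ~0.5s wall time, %.3fs CPU time" % cpu)
```

The CPU time comes out far below the wall time - so the large values /proc/PID/stat reports for processes stuck in these calls cannot be real.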
Backtrace for the
wait()
case:
#0 0x0000003857f0c0b4 in wait () from /lib64/tls/libpthread.so.0
#1 0x000000000040873a in check_child_exit (block=1) at rfio_serv.c:1868
#2 0x00000000004051f3 in main (argc=19, argv=0x7fbffff6e8) at rfio_serv.c:705
Backtrace for
select_nocancel()
:
#0 0x00000038578c0545 in __select_nocancel () from /lib64/tls/libc.so.6
#1 0x000000000040899f in srrequest (s=0, bet=0x7fbffff1f8) at rfio_calls.c:325
#2 0x0000000000405c3f in doit (s=0, fromp=0x7fbffff590, mode=0) at rfio_serv.c:1175
#3 0x000000000040546d in main (argc=19, argv=0x7fbffff6e8) at rfio_serv.c:804
Backtrace for
read_nocancel()
:
#0 0x0000003857f0b222 in __read_nocancel () from /lib64/tls/libpthread.so.0
#1 0x000000000041e1d7 in srread64 (s=0, infop=0x7fbffff340, fd=1) at rfio_call64.c:1176
#2 0x00000000004068bf in doit (s=0, fromp=0x7fbffff590, mode=0) at rfio_serv.c:1298
#3 0x000000000040546d in main (argc=19, argv=0x7fbffff6e8) at rfio_serv.c:804
Presumably these are simply the functions where rfiod spends most of its time, i.e. waiting for IO. There seems to be a pattern, where the parent is stuck in
wait()
, and one of its children in
read()
or
select()
. Both processes get killed at the same time by LSF - but at times a larger group of processes gets killed together:
# ls -l core* | sort -k7,8 | sed -e 's/core.*/core-XXX/' | uniq -c -f 6 | sort -rn
47 -rw------- 1 stage st 901120 Mar 12 17:14 core-XXX
34 -rw------- 1 stage st 901120 Mar 11 14:35 core-XXX
30 -rw------- 1 stage st 901120 Mar 11 14:38 core-XXX
24 -rw------- 1 stage st 901120 Mar 10 16:20 core-XXX
23 -rw------- 1 stage st 901120 Mar 11 14:08 core-XXX
22 -rw------- 1 stage st 901120 Mar 10 15:02 core-XXX
19 -rw------- 1 stage st 901120 Mar 11 12:51 core-XXX
14 -rw------- 1 stage st 901120 Mar 12 16:47 core-XXX
14 -rw------- 1 stage st 901120 Mar 12 10:14 core-XXX
12 -rw------- 1 stage st 901120 Mar 10 16:29 core-XXX
10 -rw------- 1 stage st 901120 Mar 12 10:16 core-XXX
10 -rw------- 1 stage st 901120 Mar 11 14:36 core-XXX
10 -rw------- 1 stage st 901120 Mar 10 16:43 core-XXX
10 -rw------- 1 stage st 901120 Mar 10 15:03 core-XXX
8 -rw------- 1 stage st 901120 Mar 12 16:45 core-XXX
7 -rw------- 1 stage st 901120 Mar 10 17:09 core-XXX
6 -rw------- 1 stage st 901120 Mar 17 16:44 core-XXX
...
One REMEDY case with the "user" side is
CT509028; looks like they can reproduce this with a 30% chance.
case: user jobs in lxbatch