FIO-Fs has reported issues with LSF killing jobs for exceeding the CPU time limit. On investigation, it turned out that these jobs could not have accumulated that much CPU time - in other words, the per-process statistics under /proc/PID/stat are corrupted.

Filed with Red Hat as IT 127884, July 07.

LSF side

On the LSF side, the process responsible on each machine is called "pim". It reads /proc/PID/stat regularly and summarizes the information into /tmp/pim.info.HOSTNAME. If the CPU time exceeds the limit (wherever this is configured), the process is killed via SIGXCPU/signal 24 (CPU time limit exceeded); this should drop a core.
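As a quick sanity check of the "should drop a core" part (a throwaway test written for this note, not anything pim actually runs): an unhandled SIGXCPU terminates the receiving process with a core dump by default, provided the core size limit allows it.

/* throwaway test: SIGXCPU (signal 24) is fatal and dumps core by default */
#include <signal.h>

int main(void)
{
    raise(SIGXCPU);   /* default disposition: terminate + dump core */
    return 0;         /* not reached */
}

Compile with "gcc -o xcpu xcpu.c", run with "ulimit -c unlimited; ./xcpu", and a core file appears - the same mechanism that produces the rfiod cores discussed below.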

Central LSF accounting is collected in /afs/cern.ch/project/lsf/7.0/mnt/batch/work/batch/acct. Because of this bug, Ulrich already filters abnormal CPU times out of the site statistics. He has a program to translate the LSF accounting logs; see /afs/cern.ch/user/u/uschwick/Programming/LSF/AnaLsbAcct/

Kernel side

/proc/PID/stat is filled in by fs/proc/array.c:do_task_stat(); see there for the format. In 2.6.9, counting fields from 1=pid, 2=(cmd), ...:

field 14: utime, time spent in user space (converted via jiffies_to_clock_t(), i.e. in "ticks")

field 15: stime, time spent in system calls

field 16: cutime (utime sum for children - waited-for and "ignored"?)

field 17: cstime (stime sum for children)

field 22: start_time: in jiffies, needs to be compared to the uptime.

(a small tool to convert these values is at procstat; it assumes HZ=100)

(see man 2 times on how these fields are supposed to be interpreted).
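For reference, a minimal stand-in for such a converter (a sketch written for this note, not the actual procstat tool; the name statdump is invented): it reads /proc/PID/stat, pulls out fields 14-17 and 22, and prints them in seconds using sysconf(_SC_CLK_TCK) (100 on these machines, matching the HZ=100 assumption above), together with the current uptime for comparison.

/* statdump.c - sketch of a /proc/PID/stat converter (not the real procstat) */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    char path[64], buf[4096];
    snprintf(path, sizeof(path), "/proc/%s/stat", argv[1]);
    FILE *f = fopen(path, "r");
    if (!f || !fgets(buf, sizeof(buf), f)) {
        perror(path);
        return 1;
    }
    fclose(f);

    /* field 2 (cmd) is in parentheses and may contain spaces, so locate
     * the last ')' and parse the numeric fields that follow it */
    char *p = strrchr(buf, ')');
    if (!p) {
        fprintf(stderr, "unexpected format in %s\n", path);
        return 1;
    }
    p += 2;                      /* skip ") " -> field 3 (state) */

    unsigned long long utime, stime, starttime;
    long long cutime, cstime;
    /* skip fields 3-13, read 14-17, skip 18-21, read 22 */
    if (sscanf(p, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u "
                  "%llu %llu %lld %lld %*d %*d %*d %*d %llu",
               &utime, &stime, &cutime, &cstime, &starttime) != 5) {
        fprintf(stderr, "failed to parse %s\n", path);
        return 1;
    }

    double tick = (double)sysconf(_SC_CLK_TCK);   /* 100 on these nodes */
    double uptime = 0.0;
    FILE *u = fopen("/proc/uptime", "r");
    if (u) {
        fscanf(u, "%lf", &uptime);
        fclose(u);
    }

    printf("utime   %.2f s\n", utime / tick);
    printf("stime   %.2f s\n", stime / tick);
    printf("cutime  %.2f s\n", cutime / tick);
    printf("cstime  %.2f s\n", cstime / tick);
    printf("started %.2f s after boot (uptime %.2f s)\n",
           starttime / tick, uptime);
    return 0;
}

Run as ./statdump PID; the difference between the uptime and the start time gives the process age, and utime+stime should never exceed that age times the number of CPUs - which is exactly the sanity check that fails on the affected machines.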

case: RFIOd

(This is where the case was originally reported.) A short analysis on lxfsra0803 (which has lots of rfio core files under /tmp) shows:

      1 #0  0x0000003857872400 in memcpy () from /lib64/tls/libc.so.6
     40 #0  0x0000003857f0b222 in __read_nocancel () from /lib64/tls/libpthread.so.0
    154 #0  0x00000038578c0545 in __select_nocancel () from /lib64/tls/libc.so.6
    206 #0  0x0000003857f0c0b4 in wait () from /lib64/tls/libpthread.so.0
(i.e. most of the rfiods got killed while in wait(), the others in functions that are supposed to be sleeping/blocked, i.e. should not consume CPU time). Similar pattern on lxfsra0802. Backtrace for the wait() case:
#0  0x0000003857f0c0b4 in wait () from /lib64/tls/libpthread.so.0
#1  0x000000000040873a in check_child_exit (block=1) at rfio_serv.c:1868
#2  0x00000000004051f3 in main (argc=19, argv=0x7fbffff6e8) at rfio_serv.c:705
Backtrace for select_nocancel():
#0  0x00000038578c0545 in __select_nocancel () from /lib64/tls/libc.so.6
#1  0x000000000040899f in srrequest (s=0, bet=0x7fbffff1f8) at rfio_calls.c:325
#2  0x0000000000405c3f in doit (s=0, fromp=0x7fbffff590, mode=0) at rfio_serv.c:1175
#3  0x000000000040546d in main (argc=19, argv=0x7fbffff6e8) at rfio_serv.c:804
Backtrace for read_nocancel():
#0  0x0000003857f0b222 in __read_nocancel () from /lib64/tls/libpthread.so.0
#1  0x000000000041e1d7 in srread64 (s=0, infop=0x7fbffff340, fd=1) at rfio_call64.c:1176
#2  0x00000000004068bf in doit (s=0, fromp=0x7fbffff590, mode=0) at rfio_serv.c:1298
#3  0x000000000040546d in main (argc=19, argv=0x7fbffff6e8) at rfio_serv.c:804
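These frames fit the generic forking-daemon layout. Purely as an illustration (this is NOT the actual rfio_serv.c code - the structure, all names, the port number and MAXCHILD are invented), a parent that forks one child per connection and reaps children in a blocking wait(), while each child sits in select()/read() on its client socket, would be stopped in exactly these three places:

/* illustrative generic forking server, not rfiod */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <sys/wait.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define MAXCHILD 8

static void serve_client(int s)
{
    char buf[4096];
    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(s, &rfds);
        if (select(s + 1, &rfds, NULL, NULL, NULL) <= 0)  /* the __select_nocancel frames */
            break;
        if (read(s, buf, sizeof(buf)) <= 0)               /* the __read_nocancel frames */
            break;
        /* ... handle the request ... */
    }
    close(s);
}

int main(void)
{
    int lsock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_ANY);
    sa.sin_port = htons(5001);                /* arbitrary port for the sketch */
    if (lsock < 0 || bind(lsock, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
        listen(lsock, 16) < 0) {
        perror("socket/bind/listen");
        return 1;
    }

    int nchildren = 0;
    for (;;) {
        /* with enough children outstanding, the parent blocks in wait() -
         * the frame in which most of the parent-side cores were taken */
        while (nchildren >= MAXCHILD && wait(NULL) > 0)
            nchildren--;

        int c = accept(lsock, NULL, NULL);
        if (c < 0)
            continue;
        pid_t pid = fork();
        if (pid == 0) {                       /* child: serve one client, then exit */
            close(lsock);
            serve_client(c);
            _exit(0);
        }
        if (pid > 0)
            nchildren++;
        close(c);

        /* opportunistic, non-blocking reap */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            nchildren--;
    }
}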

Presumably these are simply the functions where rfiod spends most of its time, i.e. waiting for I/O. There seems to be a pattern where the parent is stuck in wait() and one of its children in read() or select(). Both processes get killed at the same time by LSF - but at times a larger group of processes is killed together:

# ls -l core* | sort -k7,8 | sed -e 's/core.*/core-XXX/' | uniq -c -f 6 | sort -rn
     47 -rw-------  1 stage st  901120 Mar 12 17:14 core-XXX
     34 -rw-------  1 stage st  901120 Mar 11 14:35 core-XXX
     30 -rw-------  1 stage st  901120 Mar 11 14:38 core-XXX
     24 -rw-------  1 stage st  901120 Mar 10 16:20 core-XXX
     23 -rw-------  1 stage st  901120 Mar 11 14:08 core-XXX
     22 -rw-------  1 stage st  901120 Mar 10 15:02 core-XXX
     19 -rw-------  1 stage st  901120 Mar 11 12:51 core-XXX
     14 -rw-------  1 stage st  901120 Mar 12 16:47 core-XXX
     14 -rw-------  1 stage st  901120 Mar 12 10:14 core-XXX
     12 -rw-------  1 stage st  901120 Mar 10 16:29 core-XXX
     10 -rw-------  1 stage st  901120 Mar 12 10:16 core-XXX
     10 -rw-------  1 stage st  901120 Mar 11 14:36 core-XXX
     10 -rw-------  1 stage st  901120 Mar 10 16:43 core-XXX
     10 -rw-------  1 stage st  901120 Mar 10 15:03 core-XXX
      8 -rw-------  1 stage st  901120 Mar 12 16:45 core-XXX
      7 -rw-------  1 stage st  901120 Mar 10 17:09 core-XXX
      6 -rw-------  1 stage st  901120 Mar 17 16:44 core-XXX
...

One REMEDY case from the "user" side is CT509028; it looks like they can reproduce this with a 30% chance.

case: user jobs in lxbatch
