RB/WMS Monitoring Tool HowTo

This page describes the metrics used and displayed on the RB/WMS monitoring page.

Description of the different metrics

Load

A process stuck in an infinite loop can sometimes be detected from the load (the value given in the table is the average load over the last 15 minutes). It is also useful to check the system CPU utilization, but that value is unfortunately not shown in this table. I usually check it with Lemon on the following web pages:

A high load can be caused by:

  • A gahp_server in an infinite loop can be killed (when "strace -p PID" shows no system calls being executed).

  • An edg-wl-renewd process in an infinite loop can be restarted: /etc/init.d/edg-wl-proxyrenewal restart

This may cause the edg-wl-workload_manager process to crash, in which case it can be restarted as well: /etc/init.d/edg-wl-wm restart

The problems with these PR and WM services occur when some job is cancelled multiple times. In that case one may have to restart both the PR and WM services once for every repeated cancellation.
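The load check can also be done by hand on the node. The following is a minimal sketch, assuming a Linux node with /proc/loadavg and the procps version of ps; the function names load15 and top_cpu are illustrative and not part of the monitoring tool.

```shell
#!/bin/bash
# Print the 15-minute load average (third field of /proc/loadavg).
# The optional argument only exists so the function can be tried on a saved copy.
load15() {
    awk '{ print $3 }' "${1:-/proc/loadavg}"
}

# List the N processes using the most CPU (default 5, plus the header line),
# to help spot a looping gahp_server or edg-wl-renewd.
top_cpu() {
    ps -eo pid,pcpu,etime,comm --sort=-pcpu | head -n "$(( ${1:-5} + 1 ))"
}
```

Once a suspect PID has been found with top_cpu, "strace -p PID" shows whether the process still executes system calls.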

input.fl

It represents the number of unprocessed entries in the input.fl file, the input queue for the WM service. A Lemon sensor should be written for it.

queue.fl

It represents the number of unprocessed entries in the queue.fl file, the input queue for the Job Controller service. A Lemon sensor should be written for it.
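According to the sensor description further down this page, an unprocessed entry is a line ending with the string ' g'. A minimal sketch of such a count, assuming that line-based reading of the file is adequate (the function name count_unprocessed is hypothetical):

```shell
#!/bin/bash
# Count the entries of a .fl file that are still unprocessed, i.e. the
# lines ending with the string ' g'. The -a option makes grep treat the
# file as text even if it contains binary data.
count_unprocessed() {
    # grep -c prints the count; it exits non-zero when the count is 0,
    # which is not an error here.
    grep -a -c ' g$' "$1" || true
}

# Examples (LCG RB paths as listed on this page):
#   count_unprocessed /var/edgwl/workload_manager/input.fl
#   count_unprocessed /var/edgwl/jobcontrol/queue.fl
```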

dg20logd

It represents the number of (temporary) files dg20logd_* in /var/tmp. Each file contains buffered events for some job, to be logged in the Logging and Bookkeeping service by the edg-wl-locallogger daemons.
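A minimal sketch of how this count could be obtained by hand; the function name count_dglogd is hypothetical, and only files directly under the directory are counted.

```shell
#!/bin/bash
# Count the dg20logd_* files directly under a directory (default /var/tmp).
count_dglogd() {
    find "${1:-/var/tmp}" -maxdepth 1 -type f -name 'dg20logd_*' | wc -l | tr -d ' '
}
```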

input.fl size

The size of the input.fl file in bytes, including the processed entries that have not yet been deleted. The procedure to follow when the file becomes too large is:

  • Block jobs submission on this node.
  • Disable the cron job /etc/cron.d/glite-wms-check-daemons.cron.
  • Stop the NS, WMproxy and LM services.
  • Let all the jobs in /var/glite/workload_manager/input.fl be treated (status 'g' --> 'i').
  • Stop the WM and delete the input.fl file.
  • Restart the WM, NS, WMproxy and LM services.
  • Enable again the cron job /etc/cron.d/glite-wms-check-daemons.cron.
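The step that waits for all jobs to be treated can be checked from the shell. This is a sketch only: it assumes that an untreated entry is a line ending with ' g', as stated in the sensor description on this page, and the function names and the 60-second polling interval are arbitrary choices.

```shell
#!/bin/bash
# Return 0 when no line of the given file still ends with ' g'
# (nothing left to be treated), non-zero otherwise.
# A missing file also counts as drained.
is_drained() {
    ! grep -a -q ' g$' "$1" 2> /dev/null
}

# Poll until the file is drained; the second argument is the polling
# interval in seconds (default 60).
wait_until_drained() {
    local file=$1 interval=${2:-60}
    until is_drained "$file"; do
        sleep "$interval"
    done
}

# Example, before stopping the WM:
#   wait_until_drained /var/glite/workload_manager/input.fl
```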

tmp

It represents the occupancy (in %) of the partition /tmp/.
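A minimal sketch of reading this value with df; the function name partition_usage is hypothetical, and the parsing assumes the POSIX output format of "df -P" with a file system name that contains no spaces.

```shell
#!/bin/bash
# Print the occupancy (in %, without the % sign) of the partition
# holding the given path (default /tmp).
partition_usage() {
    df -P "${1:-/tmp}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

The same function can be pointed at the Sandbox directory to read the Sandbox occupancy.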

FD

The number of file descriptors opened by the log monitor process (it cannot exceed 1024). The procedure to follow when the count approaches this limit, as given by Maarten, is:

  • Edit /etc/cron.d/edg-wl-check-daemons and comment out the cron job.
  • /etc/init.d/edg-wl-lm stop
  • cd /var/edgwl/logmonitor/CondorG.log
  • find CondorG.*.log -mtime +30 -print -exec mv {} ./recycle/ \;
  • cd / ; /etc/init.d/edg-wl-lm start
  • Edit /etc/cron.d/edg-wl-check-daemons and uncomment the cron job.

This field is much more important for the LCG RBs, since the log monitor process there opens a lot of file descriptors.
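The descriptor count of a running process can be read from /proc. A minimal sketch, assuming a Linux node and root privileges when the process belongs to another user; the function name count_fds is hypothetical.

```shell
#!/bin/bash
# Print the number of file descriptors currently opened by the process
# with the given PID.
count_fds() {
    ls "/proc/$1/fd" | wc -l | tr -d ' '
}

# Example for the log monitor of an LCG RB (process name as given on
# this page; pgrep -o selects the oldest matching process):
#   count_fds "$(pgrep -o -f edg-wl-log_monitor)"
```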

job queue

The file size of /opt/condor/var/condor/spool/job_queue.log. If ever it reaches 1.8 GB, edg-wl-jc must be stopped, the file must be moved to another directory (and gzipped), and edg-wl-jc restarted. This will cause the RB to forget about all unfinished jobs, so one may want to block job submission for a few days and allow the RB to drain before resorting to this measure. Note that to block job submission, the rule GD_RB_BLOCKED must be set in the GD Firewall Database (persons to contact: Yvan.Calas@cernNOSPAMPLEASE.ch or Romain.Wartel@cernNOSPAMPLEASE.ch).
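To follow the growth of this file by hand, its size can be printed in kB, the unit used for the size thresholds on this page. A minimal sketch; the function name size_kb is hypothetical.

```shell
#!/bin/bash
# Print the size of a file in kB, rounded up.
size_kb() {
    local bytes
    bytes=$(wc -c < "$1") || return 1
    echo $(( (bytes + 1023) / 1024 ))
}

# Example: size_kb /opt/condor/var/condor/spool/job_queue.log
```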

Sandbox

If the sandbox area is almost full (90%), you should check that no huge sandboxes have been generated or, more likely, that a large number of sandboxes have not been left behind without user cleanup.

Particular users responsible for a large part of the disk usage can be contacted via e-mail or even blocked in an emergency (details later). You can also execute the cleanup-sandboxes cron job (on LCG RBs) with tighter parameters found in /opt/lcg/etc/cleanup-sandboxes.conf. On gLite WMS nodes see the cron job glite-wms-purger.cron.
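To see what takes the most space, the largest entries directly under the sandbox directory can be listed. This is a sketch only: the function name largest_entries is hypothetical, and how the directory is organised below SandboxDir is not described on this page, so the listing may need to be repeated one level down.

```shell
#!/bin/bash
# List the N largest entries (default 10) directly under a directory,
# biggest first, with sizes in kB.
largest_entries() {
    local dir=$1 n=${2:-10}
    find "$dir" -mindepth 1 -maxdepth 1 -exec du -sk {} + | sort -rn | head -n "$n"
}

# Example (LCG RB path): largest_entries /var/edgwl/SandboxDir 20
```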

LL, LB, PX, LM, JC, NS, WM

These fields show the status of the services; each should be equal to zero. Normally, the services with a non-zero value are restarted automatically by the cron job edg-wl-check-daemons (resp. glite-wms-check-daemons.cron). For some daemons the automatic restart fails after a crash, because it tries to start the service without first stopping it (to remove locks etc.).

| Field | Service | LCG RB | gLite WMS |
| LL | locallogger | /etc/init.d/edg-wl-locallogger | /opt/glite/etc/init.d/glite-lb-locallogger |
| LB | logging and bookkeeping | /etc/init.d/edg-wl-lbserver | /opt/glite/etc/init.d/glite-lb-bkserverd |
| PX | proxy renewal | /etc/init.d/edg-wl-proxyrenewal | /opt/glite/etc/init.d/glite-proxy-renewald |
| LM | log monitor | /etc/init.d/edg-wl-lm | /opt/glite/etc/init.d/glite-wms-lm |
| JC | job controller | /etc/init.d/edg-wl-jc | /opt/glite/etc/init.d/glite-wms-jc |
| NS | network server | /etc/init.d/edg-wl-ns | /opt/glite/etc/init.d/glite-wms-ns |
| WM | workload manager | /etc/init.d/edg-wl-wm | /opt/glite/etc/init.d/glite-wms-wm |
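When the automatic restart fails, a manual restart that stops the service before starting it can help. A minimal sketch, assuming the init scripts support the usual status, stop and start actions; the function name restart_if_down is hypothetical and this is not the logic of the check-daemons cron job.

```shell
#!/bin/bash
# Check one service through its init script and, if the status is
# non-zero, stop it explicitly before starting it again.
restart_if_down() {
    local script=$1
    [ -x "$script" ] || return 0          # init script not installed on this node
    if ! "$script" status > /dev/null 2>&1; then
        "$script" stop  > /dev/null 2>&1  # exit code ignored; meant to remove stale locks
        "$script" start > /dev/null 2>&1
    fi
}

# Example: restart_if_down /etc/init.d/edg-wl-wm
```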

MJ

It checks the status of the service lcg-mon-job-status. I wrote a small cron job for this service, since it can stop several times a day. Note that I submitted a bug report (https://savannah.cern.ch/bugs/index.php?23659) about this.

 
                [root@rb104 cron.hourly]# cat /etc/cron.hourly/lcg-mon-job-status.cron
                #!/bin/bash

                # This script checks that the service lcg-mon-job-status is running.
                # The service is restarted if needed.

                MAILTO=""
                SCRIPT_LOCATION=/etc/rc.d/init.d

                # Don't do the check if the system has been up less than 20 minutes.
                if [ -r /proc/uptime ]; then
                    # Keep only the integer part of the first field (uptime in seconds).
                    up=$(sed -e 's/[ .].*//' /proc/uptime)
                    if [ "$up" -lt 1200 ]; then
                        exit 0
                    fi
                fi

                if [ -f "${SCRIPT_LOCATION}/lcg-mon-job-status" ]; then
                    "${SCRIPT_LOCATION}/lcg-mon-job-status" status > /dev/null 2>&1
                    if [ $? -ne 0 ]; then
                        "${SCRIPT_LOCATION}/lcg-mon-job-status" restart > /dev/null 2>&1
                    fi
                fi

Running jobs

It represents the number of jobs in the state Running in the condor queue.

Current jobs

It represents the number of jobs in the states Running or Idle in the condor queue.
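Both numbers can be derived from the JobStatus attribute of the jobs in the condor queue, where 1 means Idle and 2 means Running. The following is a sketch of one way to count them, not necessarily how the monitoring page obtains its values; the function name count_jobs is hypothetical.

```shell
#!/bin/bash
# Read one Condor JobStatus code per line on stdin and print
# "running=<n> current=<m>", where current = running + idle.
count_jobs() {
    awk '$1 == 1 { idle++ } $1 == 2 { run++ }
         END { printf "running=%d current=%d\n", run, run + idle }'
}

# Example: condor_q -format "%d\n" JobStatus | count_jobs
```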

Jobs treated

It represents the number of jobs treated in the previous day.

Thresholds used for these metrics

| Metric | Lower threshold | Upper threshold |
| load | 10 | 20 |
| input.fl | 250 | 500 |
| queue.fl | 250 | 500 |
| dg20logd | 500 | 1000 |
| history size | 1048576 kB | 1572864 kB |
| input.fl size | 102400 kB | 204800 kB |
| job_queue.log | 1048576 kB | 1572864 kB |
| FD | 700 | 900 |
| sandbox | 80% | 90% |
| tmp | 50% | 70% |
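A value can be compared against the two thresholds of a metric as in the sketch below. The labels OK, WARNING and CRITICAL, and the choice to treat a value equal to a threshold as having reached it, are assumptions made for illustration; the page does not say how the monitoring tool presents each range.

```shell
#!/bin/bash
# Classify a metric value against its two thresholds.
# Usage: classify VALUE LOWER UPPER
# Prints OK below LOWER, WARNING from LOWER up to (but excluding) UPPER,
# and CRITICAL from UPPER upwards. Non-integer values such as the load work too.
classify() {
    awk -v v="$1" -v lo="$2" -v hi="$3" 'BEGIN {
        if (v + 0 >= hi + 0)      print "CRITICAL"
        else if (v + 0 >= lo + 0) print "WARNING"
        else                      print "OK"
    }'
}

# Example with the load thresholds: classify 12.4 10 20
```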

RB/WMS/WMSLB/LB development

Here is the list of sensors used to monitor the health of the LCG RB, gLite WMS, gLite LB and gLite WMSLB types of node:

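| node | date | load | %user | %sys | %idle | %io | status | idle | running | jobs | input | queue | dglogd | /tmp | Sand | MB hist | MB jobq | MB input | #FD LM | #FD GM | PX | LM | JC | WM | LB | LL | BK | NS | WMP |
| RB | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | - | X | X | X | X | - | X | X | X | - |
| WMS | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | - | - | X |
| LB | X | X | X | X | X | X | X | - | - | - | - | - | - | X | - | - | - | - | - | - | - | - | - | - | - | X | X | - | - |
| WMSLB | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | - | X |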

where:

  • node: type of the node (LCG RB, gLite WMS, gLite LB or gLite WMSLB).
  • date: date/time of the last monitoring sample (format: MM/DD-HH:MM).
  • load: average load of the node over the last 15 minutes.
  • %user: % CPU in user mode.
  • %sys: % CPU in system mode.
  • %idle: % CPU in idle task.
  • %io: % CPU in IO wait mode.
  • status: status of the node in SMS ('P' for production, 'M' for maintenance).
  • idle: number of jobs in the state 'idle' in the condor queue.
  • running: number of jobs in the state 'running' in the condor queue.
  • jobs: number of jobs treated during the last day by condor.
  • input: number of unprocessed jobs in the file /var/{edgwl,glite}/workload_manager/input.fl. More precisely, it represents the number of lines ending with the string ' g' (Workload Manager input file).
  • queue: number of unprocessed jobs in the file /var/{edgwl,glite}/jobcontrol/queue.fl. More precisely, it represents the number of lines ending with the string ' g' (Job Controller input file).
  • dglogd: number of dg*logd* files in /var/tmp (LCG RB) or in /var/glite/log (gLite WMS and WMSLB). Each file contains buffered events for some job, to be logged in the logging and bookkeeping service by the locallogger daemon.
  • /tmp: occupancy (in %) of the /tmp partition.
  • Sand: occupancy (in %) of the Sandbox partition (i.e. /var/{edgwl,glite}/SandboxDir/).
  • MB hist: size (in MB) of the /opt/condor/var/condor/spool/history file (LCG RB) or /var/local/condor/spool/history (gLite WMS or gLite WMSLB).
  • MB jobq: size (in MB) of the /opt/condor/var/condor/spool/job_queue.log file (LCG RB) or /var/local/condor/spool/job_queue.log (gLite WMS or gLite WMSLB).
  • MB input: size (in MB) of the /var/{edgwl,glite}/workload_manager/input.fl file.
  • #FD LM: number of file descriptors opened by the log monitor process (edg-wl-log_monitor or glite-wms-log_monitor). It cannot exceed 1024.
  • #FD GM: number of file descriptors opened by the condor_gridmanager process. It cannot exceed 1024.
  • PX: status of the proxy renewal service ( edg-wl-proxyrenewal or glite-proxy-renewald).
  • LM: status of the log monitor service ( edg-wl-log_monitor or glite-wms-log_monitor).
  • JC: status of the job controller service ( edg-wl-jc or glite-wms-jc).
  • WM: status of the workload manager service (edg-wl-wm or glite-wms-wm).
  • LB: status of the LB proxy service ( glite-lb-proxy).
  • LL: status of the locallogger service ( edg-wl-locallogger or glite-lb-locallogger).
  • BK: status of the bkserver service ( edg-wl-lbserver or glite-lb-bkserverd).
  • NS: status of the network server service ( edg-wl-ns).
  • WMP: status of the wmproxy service (glite-wms-wmproxy).
Topic revision: r9 - 2008-08-27 - RobertoSantinel
 