RB/WMS Monitoring Tool HowTo
This page describes the different metrics used and displayed on the RB/WMS monitoring page.
Description of the different metrics
Load
Sometimes a process gets stuck in an infinite loop, which can be detected through the load (the value given in the table is the average load during the last 15 minutes). It is also useful to check the system CPU utilization, but unfortunately that value is not in this table; I usually check it with Lemon on the following web pages:
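For reference, the 15-minute average shown in the table is the third field of /proc/loadavg on Linux:

```shell
# Read the 15-minute load average: the third field of /proc/loadavg.
awk '{ print $3 }' /proc/loadavg
```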
A high load can be caused by:
- A gahp_server in an infinite loop can be killed (when "strace -p PID" shows no system calls being executed).
- An edg-wl-renewd process in an infinite loop can be restarted: /etc/init.d/edg-wl-proxyrenewal restart
This may cause the edg-wl-workload_manager process to crash, in which case it can be restarted as well:
/etc/init.d/edg-wl-wm restart
The problems with these PR and WM services occur when some job is cancelled multiple times. In that case one may have to restart both the PR and WM services once for every repeated cancellation.
input.fl
It represents the number of unprocessed entries in the input.fl file, the input queue for the WM service.
A lemon sensor should be written.
queue.fl
It represents the number of unprocessed entries in the queue.fl file, the input queue for the Job Controller service.
A lemon sensor should be written.
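Until such sensors exist, the counts for input.fl and queue.fl can be obtained by hand: as noted in the sensor list at the bottom of this page, unprocessed entries are the lines ending in ' g'. A minimal sketch, with a fabricated sample file standing in for /var/glite/workload_manager/input.fl or /var/glite/jobcontrol/queue.fl:

```shell
# Count unprocessed entries: they are the lines ending in ' g'
# (processed entries end in ' i'). The sample data below is fabricated;
# on a real node, run grep directly on input.fl or queue.fl.
FL=$(mktemp)
printf '%s\n' 'entry1 ... g' 'entry2 ... i' 'entry3 ... g' > "$FL"
grep -c ' g$' "$FL"    # prints 2
rm -f "$FL"
```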
dg20logd
It represents the number of (temporary) files dg20logd_* in
/var/tmp. Each file contains buffered events for some job, to be logged in the Logging and Bookkeeping service by the edg-wl-locallogger daemons.
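A quick way to obtain this count on the command line (a minimal sketch; the temp directory and sample files below are fabricated, on a real LCG RB the directory is /var/tmp):

```shell
# Count the dg20logd_* buffer files; a temp dir with sample files
# stands in for /var/tmp here.
DIR=$(mktemp -d)
touch "$DIR/dg20logd_job1" "$DIR/dg20logd_job2"
find "$DIR" -maxdepth 1 -name 'dg20logd_*' | wc -l    # prints 2
rm -rf "$DIR"
```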
input.fl size
The size of the input.fl file in bytes, including processed entries that have not yet been deleted. The procedure to follow is:
- Block job submission on this node.
- Disable the cron job /etc/cron.d/glite-wms-check-daemons.cron.
- Stop the NS, WMproxy and LM services.
- Let all the jobs in /var/glite/workload_manager/input.fl be treated (status 'g' --> 'i').
- Stop the WM and delete the input.fl file.
- Restart the WM, NS, WMproxy and LM services.
- Enable again the cron job /etc/cron.d/glite-wms-check-daemons.cron.
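The drain step of the procedure above can be checked mechanically: input.fl is drained once no line ends in ' g' any more. A sketch, with a fabricated sample file standing in for /var/glite/workload_manager/input.fl:

```shell
# Check that every entry has been treated (status 'g' --> 'i'), i.e.
# no line still ends in ' g'. Sample file stands in for the real one.
FL=$(mktemp)
printf '%s\n' '... entry1 i' '... entry2 i' > "$FL"   # all processed
if [ "$(grep -c ' g$' "$FL")" -eq 0 ]; then
    echo "input.fl drained: safe to stop the WM and delete it"
fi
rm -f "$FL"
```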
tmp
It represents the occupancy (in %) of the /tmp partition.
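The percentage can be read straight from df; a sketch using the root partition as a stand-in, since /tmp is not a separate partition on every node:

```shell
# Occupancy (in %) of a partition as df reports it; '/' stands in
# for /tmp here. With POSIX output (-P), field 5 is the capacity.
df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
```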
FD
The number of file descriptors opened by the log monitor process (it must not exceed 1024!). The procedure is the one given by Maarten a few days ago, i.e.:
- Edit /etc/cron.d/edg-wl-check-daemons and comment out the cron job.
- /etc/init.d/edg-wl-lm stop
- cd /var/edgwl/logmonitor/CondorG.log
- find CondorG.*.log -mtime +30 -print -exec mv {} ./recycle/ \;
- cd / ; /etc/init.d/edg-wl-lm start
- Edit /etc/cron.d/edg-wl-check-daemons and uncomment the cron job.
This field is much more important for the LCG RBs, since the log monitor process opens a lot of file descriptors.
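On Linux the per-process FD count can be read from /proc. A minimal sketch, with the current shell's own PID standing in for the log monitor's (on a node, substitute the PID of edg-wl-log_monitor):

```shell
# Count the open file descriptors of a process via /proc; $$ (the
# shell itself) stands in for the log monitor PID here.
ls /proc/$$/fd | wc -l
```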
job queue
The file size of /opt/condor/var/condor/spool/job_queue.log. If it ever reaches 1.8 GB, edg-wl-jc must be stopped, the file must be moved to another directory (and gzipped), and edg-wl-jc restarted. This will cause the RB to forget about all unfinished jobs, so one may want to block job submission for a few days and allow the RB to drain before resorting to this measure. Note that to block job submission, the rule
GD_RB_BLOCKED must be set in the GD Firewall Database (persons to contact:
Yvan.Calas@cernNOSPAMPLEASE.ch or
Romain.Wartel@cernNOSPAMPLEASE.ch).
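The move-and-gzip part of that measure can be sketched as follows (temp directories stand in for the real spool; on the RB, edg-wl-jc must be stopped before the move and restarted afterwards):

```shell
# Rotate job_queue.log out of the spool and compress it. Sample temp
# dirs stand in for /opt/condor/var/condor/spool and an archive area.
SPOOL=$(mktemp -d); ARCHIVE=$(mktemp -d)
echo "dummy queue data" > "$SPOOL/job_queue.log"
mv "$SPOOL/job_queue.log" "$ARCHIVE/job_queue.log.old"
gzip "$ARCHIVE/job_queue.log.old"
ls "$ARCHIVE"    # job_queue.log.old.gz
rm -rf "$SPOOL" "$ARCHIVE"
```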
Sandbox
If the sandbox area is almost full (90%), you should check that no huge sandboxes have been generated, or (more likely) that no large number of sandboxes has been left behind without user cleanup.
Particular users responsible for a large part of the disk usage can be contacted via e-mail or even blocked in an emergency (details later). You can also execute the cleanup-sandboxes cron job (on LCG RBs) with tighter parameters found in /opt/lcg/etc/cleanup-sandboxes.conf. On gLite WMS nodes see the cron job glite-wms-purger.cron.
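To spot the heaviest users, du can rank the per-job sandbox directories by size. A sketch with a fabricated sample tree standing in for /var/{edgwl,glite}/SandboxDir:

```shell
# Rank sandbox directories by disk usage, largest first; the sample
# tree (jobA, jobB) is fabricated for illustration.
SBDIR=$(mktemp -d)
mkdir -p "$SBDIR/jobA" "$SBDIR/jobB"
dd if=/dev/zero of="$SBDIR/jobA/big.out" bs=1024 count=64 2>/dev/null
du -sk "$SBDIR"/* | sort -rn | head -5   # largest sandboxes first
rm -rf "$SBDIR"
```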
LL, LB, PX, LM, JC, NS, WM
It represents the status of the services. It should be equal to zero. Normally, the services with a non-zero value are restarted automatically, thanks to the cron job edg-wl-check-daemons (resp. glite-wms-check-daemons.cron). For some daemons the automatic restart fails after a crash, as it tries to start the service without first stopping it (to remove locks etc.).
| Field | Service | LCG RB | gLite WMS |
| LL | locallogger | /etc/init.d/edg-wl-locallogger | /opt/glite/etc/init.d/glite-lb-locallogger |
| LB | logging and bookkeeping | /etc/init.d/edg-wl-lbserver | /opt/glite/etc/init.d/glite-lb-bkserverd |
| PX | proxy renewal | /etc/init.d/edg-wl-proxyrenewal | /opt/glite/etc/init.d/glite-proxy-renewald |
| LM | log monitor | /etc/init.d/edg-wl-lm | /opt/glite/etc/init.d/glite-wms-lm |
| JC | job controller | /etc/init.d/edg-wl-jc | /opt/glite/etc/init.d/glite-wms-jc |
| NS | network server | /etc/init.d/edg-wl-ns | /opt/glite/etc/init.d/glite-wms-ns |
| WM | workload manager | /etc/init.d/edg-wl-wm | /opt/glite/etc/init.d/glite-wms-wm |
MJ
It checks the status of the service lcg-mon-job-status. I wrote a little cron job for this process, since it can stop several times per day... Note that I submitted a bug (https://savannah.cern.ch/bugs/index.php?23659) for that.
[root@rb104 cron.hourly]# cat /etc/cron.hourly/lcg-mon-job-status.cron
#!/bin/bash
# This script checks that the service lcg-mon-job-status is running.
# The service is restarted if needed.
MAILTO=""
SCRIPT_LOCATION=/etc/rc.d/init.d

# Don't do the check if the system has been up less than 20 minutes.
if [ -r /proc/uptime ]; then
    up=`cat /proc/uptime | sed -e 's/[ \.].*//'`
    if [ $up -lt 1200 ]; then
        exit 0
    fi
fi

if [ -f ${SCRIPT_LOCATION}/lcg-mon-job-status ]; then
    ${SCRIPT_LOCATION}/lcg-mon-job-status status > /dev/null 2>&1
    if [ $? -ne 0 ]; then
        ${SCRIPT_LOCATION}/lcg-mon-job-status restart > /dev/null 2>&1
    fi
fi
Running jobs
It represents the number of jobs in the state Running in the condor queue.
Current jobs
It represents the number of jobs in the states Running or Idle in the condor queue.
Jobs treated
It represents the number of jobs treated in the previous day.
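These three metrics hang together: in Condor's JobStatus encoding, 1 is Idle and 2 is Running, and "current jobs" is their sum. A sketch over a fabricated list of job states standing in for the real condor queue:

```shell
# Derive the metrics from a list of Condor JobStatus values
# (1 = Idle, 2 = Running); the sample list is fabricated.
states='2 2 1 2 1'
running=$(echo "$states" | tr ' ' '\n' | grep -c '^2$')
idle=$(echo "$states" | tr ' ' '\n' | grep -c '^1$')
echo "running=$running idle=$idle current=$((running + idle))"
# prints: running=3 idle=2 current=5
```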
Thresholds used for these metrics
| Metric | Lower threshold | Upper threshold |
| load | 10 | 20 |
| input.fl | 250 | 500 |
| queue.fl | 250 | 500 |
| dg20logd | 500 | 1000 |
| history size | 1048576 kB | 1572864 kB |
| input.fl size | 102400 kB | 204800 kB |
| job_queue.log | 1048576 kB | 1572864 kB |
| FD | 700 | 900 |
| sandbox | 80 | 90 |
| tmp | 50 | 70 |
RB/WMS/WMSLB/LB development
Here is the list of sensors used to monitor the health of the LCG RB, gLite WMS, gLite LB and gLite WMSLB types of node:
where:
- node: type of the node (LCG RB, gLite WMS, gLite LB and gLite WMSLB).
- date: date/time of the last monitoring sample (format: MM/DD-HH:MM).
- load: load of the node in the last 15 minutes.
- %user: % CPU in user mode.
- %sys: % CPU in system mode.
- %idle: % CPU in idle task.
- %io: % CPU in IO wait mode.
- status: status of the node in SMS ('P' for production, 'M' for maintenance).
- idle: number of jobs in the state 'idle' in the condor queue.
- running: number of jobs in the state 'running' in the condor queue.
- jobs: number of jobs treated during the last day by condor.
- input: number of unprocessed jobs in the file /var/{edgwl,glite}/workload_manager/input.fl. More precisely, it represents the number of lines ending with the string ' g' (Workload Manager input file).
- queue: number of unprocessed jobs in the file /var/{edgwl,glite}/jobcontrol/queue.fl. More precisely, it represents the number of lines ending with the string ' g' (Job Controller input file).
- dglogd: number of dg*logd* files in /var/tmp (LCG RB) or in /var/glite/log (gLite WMS and WMSLB). Each file contains buffered events for some job, to be logged in the logging and bookkeeping service by the locallogger daemon.
- /tmp: occupancy (in %) of the /tmp partition.
- Sand: occupancy (in %) of the Sandbox partition (i.e. /var/{edgwl,glite}/SandboxDir/).
- MB hist: size (in MB) of the /opt/condor/var/condor/spool/history file (LCG RB) or /var/local/condor/spool/history (gLite WMS or gLite WMSLB).
- MB jobq: size (in MB) of the /opt/condor/var/condor/spool/job_queue.log file (LCG RB) or /var/local/condor/spool/job_queue.log (gLite WMS or gLite WMSLB).
- MB input: size (in MB) of the /var/{edgwl,glite}/workload_manager/input.fl file.
- #FD LM: number of file descriptors opened by the log monitor process ( edg-wl-log_monitor or glite-wms-log_monitor). It can not be above 1024.
- #FD GM: number of file descriptors opened by the condor_gridmanager process. It can not be above 1024.
- PX: status of the proxy renewal service ( edg-wl-proxyrenewal or glite-proxy-renewald).
- LM: status of the log monitor service ( edg-wl-log_monitor or glite-wms-log_monitor).
- JC: status of the job controller service ( edg-wl-jc or glite-wms-jc).
- WM: status of the workload manager service ( edg-wl-wm or glite-wms-wm).
- LB: status of the LB proxy service ( glite-lb-proxy).
- LL: status of the locallogger service ( edg-wl-locallogger or glite-lb-locallogger).
- BK: status of the bkserver service ( edg-wl-lbserver or glite-lb-bkserverd).
- NS: status of the network server service ( edg-wl-ns).
- WMP: status of the wmproxy service ( glite-wms-wmproxy).