Put the node in maintenance (see the procedure here) before any intervention on the middleware services; otherwise the intervention will trigger alarms.
Quattor configuration of the sensors and exceptions
- All the sensors and exceptions are defined in the CDB template pro_monitoring_metrics_gridrb.tpl. This template is included in the CDB template pro_system_gridrb.tpl.
- An example of a metric and an exception is:
"/system/monitoring/metric/_5240" = nlist( # Metric ID
"name", "ns_count", # Name of the metric
"descr", "Number of network server processes running.", # Description of the metric
"class", "system.processCount", # Class of the metric
"param", list("cmdline", "xxxxxxxxxxxxxxxxx", "uid", "edguser"), # Process to check
"period", 60,
"smooth", nlist("typeString", false, "maxdiff", 0.0, "maxtime", 600),
"active", true,
"latestonly", false,
);
"/system/monitoring/exception/_30108" = nlist( # Exception ID
"name", "ns_wrong", # Name of the exception
"descr", "Number of network server process wrong.", # Description of the exception
"active", true,
"latestonly", false,
"importance", 1,
"alarmtext", "ns_wrong", # Name of the exception displayed on the Lemon page
"correlation", "5240:1 < 1", # Exception triggered when the network server is not running
"actuator", nlist("execve", "xxxxxxxxxxxxxxxxxxxx", # Command to execute when the NS is not running
"maxruns", 1,
"timeout", 0,
"active", true)
);
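To make the "correlation" field of the example concrete, here is a minimal Python sketch of how a rule such as "5240:1 < 1" could be read and evaluated. It assumes the rule has the form `<metric ID>:<field number> <operator> <value>` and that field numbers are 1-based; the function names and the `samples` dictionary are hypothetical and are not part of Lemon or Quattor.

```python
import operator
import re

# Comparison operators this sketch understands (an assumption; the real
# correlation grammar may support more).
_OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}

# Two-character operators are listed first so that "<=" is not read as "<".
_RULE = re.compile(
    r"^\s*(\d+):(\d+)\s*(<=|>=|==|!=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$"
)


def parse_correlation(rule):
    """Split a rule such as "5240:1 < 1" into (metric_id, field, op, threshold)."""
    match = _RULE.match(rule)
    if match is None:
        raise ValueError("unsupported correlation: %r" % rule)
    metric_id, field, op, threshold = match.groups()
    return int(metric_id), int(field), op, float(threshold)


def is_triggered(rule, samples):
    """Return True if the rule holds for the given samples.

    `samples` maps a metric ID to the list of its field values; field
    numbers are taken to be 1-based (assumption).
    """
    metric_id, field, op, threshold = parse_correlation(rule)
    value = samples[metric_id][field - 1]
    return _OPS[op](value, threshold)
```

With this sketch, `is_triggered("5240:1 < 1", {5240: [0]})` holds when no network server process is counted, and the same rule does not hold for `{5240: [1]}`.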
Exceptions defined for the cluster gridrb
| Exception name | Description | Service to restart | Comment |
| NS_WRONG | Number of network server process(es) running wrong (should be one). | glite-wms-ns | - |
| WM_WRONG | Number of workload manager process(es) running wrong (should be one). | glite-wms-wm | - |
| JOBCONTROLLER_WRONG | Number of job controller process(es) running wrong (should be one and four). | glite-wms-jc | - |
| CONDORMASTER_WRONG | Number of Condor master process(es) running wrong (should be one). | glite-wms-jc | - |
| CONDORSCHEDD_WRONG | Number of Condor scheduler process(es) running wrong (should be one). | glite-wms-jc | - |
| LOGD_WRONG | Number of logd process(es) running wrong (should be one). | glite-lb-locallogger | - |
| INTERLOGD_WRONG | Number of interlogd process(es) running wrong (should be one). | glite-lb-locallogger | - |
| RENEWD_WRONG | Number of renewd process(es) running wrong (should be two). | glite-proxy-renewald | - |
| LM_WRONG | Number of log monitor process(es) running wrong (should be one). | glite-wms-lm | - |
| FTPD_WRONG | Number of ftpd process(es) running wrong (should be one). | glite-wms-ftpd | - |
| BKSERVERD_WRONG | Number of bkserverd process(es) running wrong (should be between 1 and 11). | See OP | - |
| FTPD_WRONG | Number of ftpd process(es) running wrong (should be between 1 and 2000). | See OP | - |
| LBPROXY_WRONG | Number of LB proxy processes running wrong (should be 11 ???). | glite-lb-proxy | - |
| WMPROXY_WRONG | Number of WMProxy server processes running (should be one). | glite-wms-wmproxy | Needs to be refined to count the httpd processes |
| RGMA_COUNT | Number of RGMA processes running wrong (should be 1). | rgma-servicetool | - |
| WMS_DG20LOGD_HIGH | Number of dglogd files in /tmp (gLite 3.0) or in /var/glite/log (gLite 3.1) too high (should be < 1000). | - | Backlog |
| INPUTFL_SIZE | Size of the /var/glite/workload_manager/input.fl file too high (should be < 1.5GB). | - | Backlog |
| QUEUEFL_SIZE | Size of the /var/glite/jobcontrol/queue.fl file too high (should be < 1.5GB). | - | Backlog |
| INPUTFL_WRONG | Number of records in file /var/glite/workload_manager/input.fl | - | Not yet implemented |
| QUEUEFL_WRONG | Number of records in file /var/glite/jobcontrol/queue.fl | - | Not yet implemented |
| LOGS_FULL | Middleware log files partition (/data01) full. | - | - |
| SANDBOX_FULL | Sandbox partition (/data02) full. | - | - |
| MYSQL_FULL | MySQL partition (/data03) full. | - | - |
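As an illustration of how the process-count exceptions listed above behave, the following Python sketch checks observed process counts against the expected values. Only entries whose expected value is stated unambiguously are included; JOBCONTROLLER_WRONG, the two FTPD_WRONG entries, LBPROXY_WRONG and WMPROXY_WRONG are left out on purpose. The dictionary and the function are hypothetical helpers, not part of the Lemon or Quattor configuration.

```python
# Expected process-count ranges as (min, max), copied from the list of
# exceptions above. Hypothetical helper data, not a Lemon structure.
EXPECTED_COUNTS = {
    "NS_WRONG": (1, 1),
    "WM_WRONG": (1, 1),
    "CONDORMASTER_WRONG": (1, 1),
    "CONDORSCHEDD_WRONG": (1, 1),
    "LOGD_WRONG": (1, 1),
    "INTERLOGD_WRONG": (1, 1),
    "RENEWD_WRONG": (2, 2),
    "LM_WRONG": (1, 1),
    "BKSERVERD_WRONG": (1, 11),
    "RGMA_COUNT": (1, 1),
}


def raised_exceptions(observed):
    """Return the sorted exception names whose observed count is out of range.

    `observed` maps an exception name to the number of matching processes;
    names without an entry in EXPECTED_COUNTS are ignored.
    """
    raised = []
    for name, count in observed.items():
        bounds = EXPECTED_COUNTS.get(name)
        if bounds is None:
            continue
        low, high = bounds
        if not low <= count <= high:
            raised.append(name)
    return sorted(raised)
```

For example, a node with no network server process and twelve bkserverd processes would be reported by `raised_exceptions({"NS_WRONG": 0, "BKSERVERD_WRONG": 12, "WM_WRONG": 1})` as `["BKSERVERD_WRONG", "NS_WRONG"]`.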
Operations help guide for gLite WMS nodes
The operational procedures (OP) can be found here.
--
YvanCalas - 19 Mar 2007