LemonMonitoringWMS < LCG

Quattor configuration of the sensors and exceptions
Exceptions defined for the cluster gridrb
Operations help guide for gLite WMS nodes

Put the node in maintenance (see procedure here) before intervening on the middleware services; otherwise the intervention will trigger alarms.

Quattor configuration of the sensors and exceptions

All the sensors and exceptions are defined in the CDB template pro_monitoring_metrics_gridrb.tpl. This template is included in the CDB template pro_system_gridrb.tpl.

An example of a metric and an exception is:

"/system/monitoring/metric/_5240" = nlist(  # Metric ID
        "name",         "ns_count",   # Name of the metric
        "descr",        "Number of network server processes running.",   # Description of the metric
        "class",        "system.processCount",   # Class of the metric
        "param",        list("cmdline", "xxxxxxxxxxxxxxxxx", "uid", "edguser"),   # Process to check
        "period",       60,
        "smooth",       nlist("typeString", false, "maxdiff", 0.0, "maxtime", 600),
        "active",       true,
        "latestonly",   false,
);

"/system/monitoring/exception/_30108" = nlist(   # Exception ID
        "name",         "ns_wrong",   # Name of the exception
        "descr",        "Number of network server process wrong.",   # Description of the exception
        "active",       true,
        "latestonly",   false,
        "importance",   1,
        "alarmtext",    "ns_wrong",   # Name of the exception displayed on the Lemon page
        "correlation",  "5240:1 < 1",   # Exception triggered when the network server is not running (metric 5240 below 1)
        "actuator",     nlist("execve",  "xxxxxxxxxxxxxxxxxxxx",   # Command to execute when the NS is not running
                              "maxruns", 1,
                              "timeout", 0,
                              "active",  true)
);

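To make the correlation string in the example easier to read, here is a minimal Python sketch of how a simple expression such as "5240:1 < 1" could be evaluated. It assumes the form "metric ID:field position, operator, threshold" and is not Lemon's actual parser; the function name correlation_fires and the samples dictionary are hypothetical names introduced for illustration only.

```python
import operator
import re

# Comparison operators handled by this sketch (an assumption, not Lemon's full grammar).
_OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}

# "<metric id>:<field position> <operator> <threshold>", e.g. "5240:1 < 1"
_PATTERN = re.compile(r"^\s*(\d+):(\d+)\s*(<=|>=|==|!=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


def correlation_fires(expression, samples):
    """Return True if the simple correlation holds for the given samples.

    `samples` maps a metric ID to the list of field values of its latest
    sample; field positions are assumed to be 1-based, as in "5240:1".
    """
    match = _PATTERN.match(expression)
    if match is None:
        raise ValueError(f"unsupported correlation: {expression!r}")
    metric_id, field, op, threshold = match.groups()
    value = samples[int(metric_id)][int(field) - 1]
    return _OPS[op](float(value), float(threshold))


# Metric 5240 (ns_count) reports no network server process: the exception fires.
print(correlation_fires("5240:1 < 1", {5240: [0]}))  # True
# One process is running: no exception.
print(correlation_fires("5240:1 < 1", {5240: [1]}))  # False
```

Read this way, the exception _30108 fires as soon as the process count reported by metric _5240 drops below one, which is when the actuator command would be run.
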
Exceptions defined for the cluster gridrb

Exception name	Description	Service to restart	Comment
NS_WRONG	Number of network server process(es) running wrong (should be one).	glite-wms-ns	-
WM_WRONG	Number of workload manager process(es) running wrong (should be one).	glite-wms-wm	-
JOBCONTROLLER_WRONG	Number of job controller process(es) running wrong (should be between one and four).	glite-wms-jc	-
CONDORMASTER_WRONG	Number of Condor master process(es) running wrong (should be one).	glite-wms-jc	-
CONDORSCHEDD_WRONG	Number of Condor scheduler process(es) running wrong (should be one).	glite-wms-jc	-
LOGD_WRONG	Number of logd process(es) running wrong (should be one).	glite-lb-locallogger	-
INTERLOGD_WRONG	Number of interlogd process(es) running wrong (should be one).	glite-lb-locallogger	-
RENEWD_WRONG	Number of renewd process(es) running wrong (should be two).	glite-proxy-renewald	-
LM_WRONG	Number of log monitor process(es) running wrong (should be one).	glite-wms-lm	-
FTPD_WRONG	Number of ftpd process(es) running wrong (should be one).	glite-wms-ftpd	-
BKSERVERD_WRONG	Number of bkserverd process(es) running wrong (should be between 1 and 11).	See OP	-
FTPD_WRONG	Number of ftpd process(es) running wrong (should be between 1 and 2000).	See OP	-
LBPROXY_WRONG	Number of LB proxy process(es) running wrong (should be 11 ???).	glite-lb-proxy	-
WMPROXY_WRONG	Number of WMProxy server process(es) running wrong (should be one).	glite-wms-wmproxy	Needs to be refined to count the httpd processes
RGMA_COUNT	Number of RGMA processes running wrong (should be 1).	rgma-servicetool	-
WMS_DG20LOGD_HIGH	Number of dglogd files in /tmp (gLite 3.0) or in /var/glite/log (gLite 3.1) too high (should be < 1000).	-	Backlog
INPUTFL_SIZE	Size of the /var/glite/workload_manager/input.fl file too high (should be < 1.5GB).	-	Backlog
QUEUEFL_SIZE	Size of the /var/glite/jobcontrol/queue.fl too high (should be < 1.5GB).	-	Backlog
INPUTFL_WRONG	Number of records in file /var/glite/workload_manager/input.fl	-	Not yet implemented
QUEUEFL_WRONG	Number of records in file /var/glite/jobcontrol/queue.fl	-	Not yet implemented
LOGS_FULL	Middleware log files partition (/data01) full.	-	-
SANDBOX_FULL	Sandbox partition (/data02) full.	-	-
MYSQL_FULL	MySQL partition (/data03) full.	-	-
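
As an illustration of how the process-count rows of the table can be read, the following Python sketch encodes a few of them (expected range and service to restart) and returns a suggested action for an observed count. It is a reading aid under stated assumptions, not part of the Lemon or Quattor configuration: the names EXPECTED and triage are hypothetical, only a subset of unambiguous rows is copied, and "see OP" stands for the operational procedures referenced in the table.

```python
# Expected process counts (min, max) and the service to restart, copied from
# a subset of the rows of the table. None means "see the operational
# procedures (OP)" instead of a direct service restart.
EXPECTED = {
    "NS_WRONG": (1, 1, "glite-wms-ns"),
    "WM_WRONG": (1, 1, "glite-wms-wm"),
    "LOGD_WRONG": (1, 1, "glite-lb-locallogger"),
    "RENEWD_WRONG": (2, 2, "glite-proxy-renewald"),
    "BKSERVERD_WRONG": (1, 11, None),
}


def triage(exception_name, process_count):
    """Return the suggested action for an observed process count (hypothetical helper)."""
    low, high, service = EXPECTED[exception_name]
    if low <= process_count <= high:
        return "ok"
    if service is None:
        return "see OP"
    return f"restart {service}"


print(triage("NS_WRONG", 0))          # restart glite-wms-ns
print(triage("RENEWD_WRONG", 2))      # ok
print(triage("BKSERVERD_WRONG", 12))  # see OP
```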

Operations help guide for gLite WMS nodes

The operational procedures (OP) can be found here.

-- YvanCalas - 19 Mar 2007

Topic revision: r5 - 2007-08-02 - YvanCalas
