Put the node in maintenance (see procedure here) before any intervention on the middleware services; otherwise the intervention will trigger alarms.

Quattor configuration of the sensors and exceptions

  • All the sensors and exceptions are defined in the CDB template pro_monitoring_metrics_gridrb.tpl. This template is included in the CDB template pro_system_gridrb.tpl.

  • An example of a metric and an exception is:
"/system/monitoring/metric/_5240" = nlist(  # Metric ID
        "name",         "ns_count",   # Name of the metric
        "descr",        "Number of network server processes running.",   # Description of the metric
        "class",        "system.processCount",   # Class of the metric
        "param",        list("cmdline", "xxxxxxxxxxxxxxxxx", "uid", "edguser"),   # Process to check
        "period",       60,
        "smooth",       nlist("typeString", false, "maxdiff", 0.0, "maxtime", 600),
        "active",       true,
        "latestonly",   false,
);

"/system/monitoring/exception/_30108" = nlist(   # Exception ID                  
        "name",         "ns_wrong",   # Name of the exception
        "descr",        "Number of network server process wrong.",   # Description of the exception
        "active",       true,
        "latestonly",   false,
        "importance",   1,
        "alarmtext",    "ns_wrong",   # Name of the exception displayed on the Lemon page
        "correlation",  "5240:1 < 1",   # Exception triggered when the network server is not running
        "actuator",     nlist("execve",  "xxxxxxxxxxxxxxxxxxxx",   # Command to execute when the NS is not running
                              "maxruns", 1,
                              "timeout", 0,
                              "active",  true)
);
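
The same pattern could presumably be repeated for each service to monitor. The sketch below adapts the example above to the workload manager (exception WM_WRONG, service glite-wms-wm). It is illustrative only: the IDs _5241 and _30109 and the names wm_count and wm_wrong are hypothetical, the uid is assumed to be the same as for the network server, and the cmdline and execve values are left as placeholders exactly as in the example above.

```pan
# Hypothetical metric ID: pick a value not already used in your templates.
"/system/monitoring/metric/_5241" = nlist(
        "name",         "wm_count",   # Hypothetical metric name
        "descr",        "Number of workload manager processes running.",
        "class",        "system.processCount",   # Same class as the ns_count metric
        "param",        list("cmdline", "xxxxxxxxxxxxxxxxx", "uid", "edguser"),   # Placeholder cmdline; uid is an assumption
        "period",       60,
        "smooth",       nlist("typeString", false, "maxdiff", 0.0, "maxtime", 600),
        "active",       true,
        "latestonly",   false,
);

# Hypothetical exception ID: pick a value not already used in your templates.
"/system/monitoring/exception/_30109" = nlist(
        "name",         "wm_wrong",   # Hypothetical exception name
        "descr",        "Number of workload manager process wrong.",
        "active",       true,
        "latestonly",   false,
        "importance",   1,
        "alarmtext",    "wm_wrong",   # Text displayed on the Lemon page
        "correlation",  "5241:1 < 1",   # Refers to the metric ID defined above
        "actuator",     nlist("execve",  "xxxxxxxxxxxxxxxxxxxx",   # Placeholder: command restarting glite-wms-wm
                              "maxruns", 1,
                              "timeout", 0,
                              "active",  true)
);
```

Note that the number in the correlation string must match the metric ID the exception is meant to watch; the two definitions are linked only through that number.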

Exceptions defined for the cluster gridrb

| Exception name | Description | Service to restart | Comment |
| NS_WRONG | Number of network server process(es) running wrong (should be one). | glite-wms-ns | - |
| WM_WRONG | Number of workload manager process(es) running wrong (should be one). | glite-wms-wm | - |
| JOBCONTROLLER_WRONG | Number of job controller process(es) running wrong (should be between one and four). | glite-wms-jc | - |
| CONDORMASTER_WRONG | Number of Condor master process(es) running wrong (should be one). | glite-wms-jc | - |
| CONDORSCHEDD_WRONG | Number of Condor scheduler process(es) running wrong (should be one). | glite-wms-jc | - |
| LOGD_WRONG | Number of logd process(es) running wrong (should be one). | glite-lb-locallogger | - |
| INTERLOGD_WRONG | Number of interlogd process(es) running wrong (should be one). | glite-lb-locallogger | - |
| RENEWD_WRONG | Number of renewd process(es) running wrong (should be two). | glite-proxy-renewald | - |
| LM_WRONG | Number of log monitor process(es) running wrong (should be one). | glite-wms-lm | - |
| FTPD_WRONG | Number of ftpd process(es) running wrong (should be one). | glite-wms-ftpd | - |
| BKSERVERD_WRONG | Number of bkserverd process(es) running wrong (should be between 1 and 11). | See OP | - |
| FTPD_WRONG | Number of ftpd process(es) running wrong (should be between 1 and 2000). | See OP | - |
| LBPROXY_WRONG | Number of LB proxy processes running wrong (should be 11 ???). | glite-lb-proxy | - |
| WMPROXY_WRONG | Number of WMProxy server processes running wrong (should be one). | glite-wms-wmproxy | Needs to be refined to count the httpd processes |
| RGMA_COUNT | Number of RGMA processes running wrong (should be 1). | rgma-servicetool | - |
| WMS_DG20LOGD_HIGH | Number of dglogd files in /tmp (gLite 3.0) or in /var/glite/log (gLite 3.1) too high (should be < 1000). | - | Backlog |
| INPUTFL_SIZE | Size of the /var/glite/workload_manager/input.fl file too high (should be < 1.5GB). | - | Backlog |
| QUEUEFL_SIZE | Size of the /var/glite/jobcontrol/queue.fl file too high (should be < 1.5GB). | - | Backlog |
| INPUTFL_WRONG | Number of records in file /var/glite/workload_manager/input.fl | - | Not yet implemented |
| QUEUEFL_WRONG | Number of records in file /var/glite/jobcontrol/queue.fl | - | Not yet implemented |
| LOGS_FULL | Middleware log files partition (/data01) full. | - | - |
| SANDBOX_FULL | Sandbox partition (/data02) full. | - | - |
| MYSQL_FULL | MySQL partition (/data03) full. | - | - |

Operations help guide for gLite WMS nodes

The operational procedures (OP) can be found here.

-- YvanCalas - 19 Mar 2007

Topic revision: r5 - 2007-08-02 - YvanCalas
 