Lemon metrics, alarms and recovery procedures
The Lemon metrics and exceptions are defined in the metrics manager.
General description and notes
There are two types of exceptions, both at importance level 2 and both with stored history:
- Raise an exception if the metric counting the service processes is less than 1 five times in 5 minutes.
- Raise an exception if the metric counting the server live checks has found no positive matches (PING OK < 1) and has found failures (PING FAILED > 0) five times in 5 minutes.
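The two rules can be read as a check over the most recent samples. The sketch below is only an illustration, not the Lemon configuration itself: the function names are hypothetical, and it assumes that "five times in 5 minutes" means five consecutive samples that all breach the threshold.

```shell
#!/bin/sh
# Hypothetical sketch of the two exception rules (not Lemon's own code).

# should_raise_process_alarm COUNT...
# Each argument is one sample of the service process count.
should_raise_process_alarm() {
    [ "$#" -ge 5 ] || return 1            # assumed: five samples are required
    for count in "$@"; do
        [ "$count" -lt 1 ] || return 1    # one healthy sample cancels the alarm
    done
    return 0
}

# should_raise_ping_alarm OK:FAILED...
# Each argument is one live-check sample written as "PING_OK:PING_FAILED".
should_raise_ping_alarm() {
    [ "$#" -ge 5 ] || return 1
    for sample in "$@"; do
        ok=${sample%%:*}
        failed=${sample##*:}
        { [ "$ok" -lt 1 ] && [ "$failed" -gt 0 ]; } || return 1
    done
    return 0
}

# Example:
#   should_raise_process_alarm 0 0 0 0 0 && echo "process alarm"
#   should_raise_ping_alarm 0:1 0:2 0:1 0:1 0:3 && echo "ping alarm"
```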
General recovery information
STEP 1
- Put cms-service-webtools in the watch list. Always update the ticket about any actions you take on the machines.
STEP 2
- If the alarm(s) are NOT described here, i.e. they are the usual alarms, escalate to the sysadmins with a standard ticket.
- If the alarm(s) are listed below:
- Log onto the node as root. If the host key has changed, delete the old one from your .ssh/known_hosts file to be able to log in.
- Follow the alarm procedures to check service status, possibly restart it, and check status again. Update the ticket with your actions.
- If the procedure solved the problem (i.e. service started and alarms disappeared), mark the ticket as Resolved (close it).
- Otherwise, report on the ticket that the alarm persisted and assign it to the CMS Webtools 2nd Line Support from the CMS Webtools Functional Element. Don’t close the ticket.
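For the host key step, OpenSSH can remove the stale entry for you instead of editing the file by hand. The helper below is a minimal sketch: the function name and the example host name are hypothetical, and it assumes your known_hosts file is in the default location unless a second argument is given.

```shell
#!/bin/sh
# forget_host_key HOST [KNOWN_HOSTS_FILE]
# Removes every key recorded for HOST using "ssh-keygen -R".
# ssh-keygen keeps the previous file contents next to it with an ".old" suffix.
forget_host_key() {
    host=$1
    known_hosts=${2:-"$HOME/.ssh/known_hosts"}
    ssh-keygen -R "$host" -f "$known_hosts"
}

# Example (hypothetical host name):
#   forget_host_key vocms-example.cern.ch
```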
Alarm procedures
CMSWEB_CONFDB_IS_DOWN or CMSWEB_CONFDB_IS_NOT_RESPONDING
- Check service status. It should say the service is RUNNING, with a process ID.
$ sudo -H -u _confdb bashs -l -c "/data/srv/current/config/confdb/manage status"
confdb is RUNNING, PID 3238
- Check whether the server processes are running. There should be one python and one rotatelogs process.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_confdb'
_confdb 3238 0.6 1.2 1655016 49916 ? Sl Sep 09 00:17:40 python /data/srv/HG1509j/sw.pre/slc6_amd64_gcc481/cms/confdb/1.1.1.fix-comp/Application_py266/Main.py
_confdb 3239 0.0 0.0 19428 752 ? S Sep 09 00:00:00 rotatelogs /data/srv/logs/confdb/confdb-%Y%m%d.log 86400
- See if the service is listening on its port. It should be the process listed above.
$ sudo /usr/sbin/lsof -P -i :8340
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 3238 _confdb 4u IPv4 31584 0t0 TCP *:8340 (LISTEN)
- Restart the service. It should end by saying the service is RUNNING, with a process ID.
$ sudo -H -u _confdb bashs -l -c "/data/srv/current/config/confdb/manage restart 'I did read documentation'"
stopping confdb
Stopping 3238 (3238 Sep 9 python /data/srv/HG1509j/sw.pre/slc6_amd64_gcc481/cms/confdb/1.1.1.fix-comp/Application_py266/Main.py ):
starting confdb
$ sudo -H -u _confdb bashs -l -c "/data/srv/current/config/confdb/manage status"
confdb is RUNNING, PID 19613
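The status and port checks above can be compared mechanically. The following is a minimal sketch with hypothetical helper names; it only parses text in the format shown in the sample output, and it assumes, as in the confdb sample, that the PID reported by the status command is also the listening process. For services whose listener is a child process this equality will not hold.

```shell
#!/bin/sh
# status_pids: reads "manage status" output on stdin and prints the PID list
# that follows "is RUNNING, PID" (prints nothing if that text is absent).
status_pids() {
    sed -n 's/^.* is RUNNING, PID \([0-9 ]*\)$/\1/p'
}

# listener_pids: reads "lsof -P -i :PORT" output on stdin and prints the
# unique PIDs of sockets in the LISTEN state, one per line.
listener_pids() {
    awk '/\(LISTEN\)$/ { print $2 }' | sort -un
}

# Example on a node (needs the same sudo rights as the manual steps):
#   running=$(sudo -H -u _confdb bashs -l -c "/data/srv/current/config/confdb/manage status" | status_pids)
#   listening=$(sudo /usr/sbin/lsof -P -i :8340 | listener_pids)
#   [ -n "$running" ] && [ "$running" = "$listening" ] && echo "confdb looks healthy"
```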
CMSWEB_COUCHDB_IS_DOWN or CMSWEB_COUCHDB_IS_NOT_RESPONDING
- Check service status. It should say “is running” and tell the process id.
$ sudo -H -u _couchdb bashs -l -c "/data/srv/current/config/couchdb/manage status"
Apache CouchDB is running as process 25070, time to relax.
No active tasks (e.g. compactions)
- Check whether the server processes are running. There should be a beam.smp and a heart Erlang process. Ignore the wrapper /bin/sh processes.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_couchdb'
_couchdb 25052 0.0 0.0 63988 1288 ? S Mar 14 00:00:00 /bin/sh -e /data/srv/be1103a/sw/slc5_amd64_gcc434/external/couchdb/1.0.1-cmp6/bin/couchdb -a /data/srv/be1103a/sw/slc5_amd64_gcc434/external/couchdb/1.0.1-cmp6/etc/couchdb/default.ini -a /data/srv/be1103a/sw/slc5_amd64_gcc434/external/couchdb/1.0.1-cmp6/etc/couchdb/local.ini -a /data/srv/current/config/couchdb//local.ini -a /data/srv/current/config/couchdb//monitoring.ini -a /data/srv/current/auth/couchdb/hmackey.ini -b -r 0 -p /data/srv/state/couchdb/couchdb.pid -o /data/srv/logs/couchdb/20110314-112639.stdout -e /data/srv/logs/couchdb/20110314-112639.stderr -R
_couchdb 25069 0.0 0.0 63988 736 ? S Mar 14 00:00:00 \_ /bin/sh -e /data/srv/be1103a/sw/slc5_amd64_gcc434/external/couchdb/1.0.1-cmp6/bin/couchdb -a /data/srv/be1103a/sw/slc5_amd64_gcc434/external/couchdb/1.0.1-cmp6/etc/couchdb/default.ini -a /data/srv/be1103a/sw/slc5_amd64_gcc434/external/couchdb/1.0.1-cmp6/etc/couchdb/local.ini -a /data/srv/current/config/couchdb//local.ini -a /data/srv/current/config/couchdb//monitoring.ini -a /data/srv/current/auth/couchdb/hmackey.ini -b -r 0 -p /data/srv/state/couchdb/couchdb.pid -o /data/srv/logs/couchdb/20110314-112639.stdout -e /data/srv/logs/couchdb/20110314-112639.stderr -R
_couchdb 25070 1.1 0.6 113124 25000 ? Sl Mar 14 00:17:14 \_ /data/srv/be1103a/sw/slc5_amd64_gcc434/external/erlang/R12B_5-cmp6/lib/erlang/erts-5.6.5/bin/beam.smp -Bd -K true -A 4 -- -root /data/srv/be1103a/sw/slc5_amd64_gcc434/external/erlang/R12B_5-cmp6/lib/erlang -progname erl -- -home /data/empty -noshell -noinput -sasl errlog_type error -couch_ini /data/srv/be1103a/sw/slc5_amd64_gcc434/external/couchdb/1.0.1-cmp6/etc/couchdb/default.ini /data/srv/be1103a/sw/slc5_amd64_gcc434/external/couchdb/1.0.1-cmp6/etc/couchdb/local.ini /data/srv/be1103a/sw/slc5_amd64_gcc434/external/couchdb/1.0.1-cmp6/etc/couchdb/default.ini /data/srv/be1103a/sw/slc5_amd64_gcc434/external/couchdb/1.0.1-cmp6/etc/couchdb/local.ini /data/srv/current/config/couchdb//local.ini /data/srv/current/config/couchdb//monitoring.ini /data/srv/current/auth/couchdb/hmackey.ini -s couch -pidfile /data/srv/state/couchdb/couchdb.pid -heart
_couchdb 25083 0.0 0.0 3680 468 ? Ss Mar 14 00:00:00 \_ heart -pid 25070 -ht 11
- See if the service is listening on its port. It should be the process listed above.
$ sudo /usr/sbin/lsof -P -i :5984
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
beam.smp 25070 _couchdb 10u IPv4 88528455 TCP *:5984 (LISTEN)
- Restart the service. It should end by saying “is running” with a process id.
$ sudo -H -u _couchdb bashs -l -c "/data/srv/current/config/couchdb/manage restart 'I did read documentation'"
Apache CouchDB has been shutdown.
waiting for couchdb...
waiting for couchdb...
waiting for couchdb...
waiting for couchdb...
waiting for couchdb...
waiting for couchdb...
waiting for couchdb...
2013-09-12 16:56:01 [INFO] Visit your CouchApp here: http://localhost:5984/acdcserver/_design/ACDC/index.html
2013-09-12 16:56:02 [INFO] Visit your CouchApp here: http://localhost:5984/acdcserver/_design/GroupUser/index.html
2013-09-12 16:56:02 [INFO] Visit your CouchApp here: http://localhost:5984/alertscollector/_design/AlertsCollector/index.html
2013-09-12 16:56:04 [INFO] Visit your CouchApp here: http://localhost:5984/asynctransfer/_design/AsyncTransfer/index.html
2013-09-12 16:56:07 [INFO] Visit your CouchApp here: http://localhost:5984/wmstats/_design/WMStats/index.html
2013-09-12 16:56:07 [INFO] Visit your CouchApp here: http://localhost:5984/workloadsummary/_design/WorkloadSummary/index.html
2013-09-12 16:56:08 [INFO] Visit your CouchApp here: http://localhost:5984/t1_uk_ral_statistics/_design/statistics/index.html
2013-09-12 16:56:09 [INFO] Visit your CouchApp here: http://localhost:5984/t1_us_fnal_statistics/_design/statistics/index.html
2013-09-12 16:56:10 [INFO] Visit your CouchApp here: http://localhost:5984/t1_tw_asgc_statistics/_design/statistics/index.html
2013-09-12 16:56:11 [INFO] Visit your CouchApp here: http://localhost:5984/t1_it_cnaf_statistics/_design/statistics/index.html
2013-09-12 16:56:11 [INFO] Visit your CouchApp here: http://localhost:5984/t1_de_kit_statistics/_design/statistics/index.html
2013-09-12 16:56:12 [INFO] Visit your CouchApp here: http://localhost:5984/t1_fr_ccin2p3_statistics/_design/statistics/index.html
2013-09-12 16:56:13 [INFO] Visit your CouchApp here: http://localhost:5984/t1_es_pic_statistics/_design/statistics/index.html
2013-09-12 16:56:14 [INFO] Visit your CouchApp here: http://localhost:5984/tier0_wmstats/_design/WMStats/index.html
2013-09-12 16:56:15 [INFO] Visit your CouchApp here: http://localhost:5984/t0_workloadsummary/_design/WorkloadSummary/index.html
Installing WorkQueue into workqueue
Installing WorkQueue into workqueue_inbox
$ sudo -H -u _couchdb bashs -l -c "/data/srv/current/config/couchdb/manage status"
Apache CouchDB is running as process 459246, time to relax.
No active tasks (e.g. compactions)
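CouchDB reports its status in a different format from the other services, so a separate parser is needed if you want to script the checks above. This is a sketch with hypothetical function names; it assumes the status line and the ps listing look like the samples shown in this section.

```shell
#!/bin/sh
# couchdb_pid: reads "manage status" output on stdin and prints the PID from
# the "is running as process N" line (prints nothing if the line is absent).
couchdb_pid() {
    sed -n 's/^Apache CouchDB is running as process \([0-9][0-9]*\),.*$/\1/p'
}

# couchdb_processes_ok: reads the ps listing on stdin and succeeds only if it
# contains both a beam.smp line and a separate "heart -pid" line.
couchdb_processes_ok() {
    awk '
        /\/beam\.smp /       { beam = 1 }
        / heart -pid [0-9]+/ { heart = 1 }
        END { exit !(beam && heart) }
    '
}

# Example:
#   ps f -A -o user:10,pid,args | grep '^_couchdb' | couchdb_processes_ok \
#       && echo "beam.smp and heart are both present"
```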
CMSWEB_CRABCACHE_IS_DOWN or CMSWEB_CRABCACHE_IS_NOT_RESPONDING
- Check service status. It should say “is RUNNING” and tell the process ID.
$ sudo -H -u _crabcache bashs -l -c "/data/srv/current/config/crabcache/manage status"
crabcache is RUNNING, PID 3914
- Check whether the server processes are running. There should be python processes for wmc-httpd.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_crabcache'
_crabcache 3915 0.0 0.1 101460 9000 ? S 20:09:21 00:00:00 python /data/srv/beHG1309c/sw/slc5_amd64_gcc461/cms/crabcache/3.1.1pre1-comp5/bin/wmc-httpd -v -d /data/srv/state/crabcache -l |rotatelogs /data/srv/logs/crabcache/crabcache-%Y%m%d.log 86400 /data/srv/current/config/crabcache/config.py
_crabcache 3935 0.0 0.1 103512 10940 ? Sl 20:09:21 00:00:03 \_ python /data/srv/beHG1309c/sw/slc5_amd64_gcc461/cms/crabcache/3.1.1pre1-comp5/bin/wmc-httpd -v -d /data/srv/state/crabcache -l |rotatelogs /data/srv/logs/crabcache/crabcache-%Y%m%d.log 86400 /data/srv/current/config/crabcache/config.py
- See if the service is listening on its port. It should be the process listed above.
$ sudo /usr/sbin/lsof -P -i :8271
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 3935 _crabcache 3u IPv4 18866 0t0 TCP *:8271 (LISTEN)
- Restart the service and check its status. It should end by saying “is RUNNING” with a process id.
$ sudo -H -u _crabcache bashs -l -c "/data/srv/current/config/crabcache/manage restart 'I did read documentation'"
stopping crabcache
Killing crabcache pgid 3914 .
starting crabcache
crabcache not running, not killing
$ sudo -H -u _crabcache bashs -l -c "/data/srv/current/config/crabcache/manage status"
crabcache is RUNNING, PID 466555
CMSWEB_CRABSERVER_IS_DOWN or CMSWEB_CRABSERVER_IS_NOT_RESPONDING
- Check service status. It should say “is RUNNING” and tell the process ID.
$ sudo -H -u _crabserver bashs -l -c "/data/srv/current/config/crabserver/manage status"
crabserver is RUNNING, PID 457300
- Check whether the server processes are running. There should be python processes for wmc-httpd.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | egrep '^_crabserver|100039'
100039 457301 0.0 0.1 101392 9040 ? S 16:51:39 00:00:00 python /data/srv/beHG1309c/sw/slc5_amd64_gcc461/cms/crabserver/3.2.0pre16/bin/wmc-httpd -r -d /data/srv/state/crabserver -l |rotatelogs /data/srv/logs/crabserver/crabserver-%Y%m%d.log 86400 /data/srv/current/config/crabserver/config.py
100039 457302 0.0 0.2 161504 21756 ? Sl 16:51:39 00:00:00 \_ python /data/srv/beHG1309c/sw/slc5_amd64_gcc461/cms/crabserver/3.2.0pre16/bin/wmc-httpd -r -d /data/srv/state/crabserver -l |rotatelogs /data/srv/logs/crabserver/crabserver-%Y%m%d.log 86400 /data/srv/current/config/crabserver/config.py
- See if the service is listening on its port. It should be the process listed above.
$ sudo /usr/sbin/lsof -P -i :8270
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 457302 _crabserver 3u IPv4 4057144 0t0 TCP *:8270 (LISTEN)
- Restart the service and check its status. It should end by saying “is RUNNING” with a process id.
$ sudo -H -u _crabserver bashs -l -c "/data/srv/current/config/crabserver/manage restart 'I did read documentation'"
I did read documentation
stopping crabserver
Killing crabserver pgid 457300 .
starting crabserver
crabserver not running, not killing
$ sudo -H -u _crabserver bashs -l -c "/data/srv/current/config/crabserver/manage status"
crabserver is RUNNING, PID 470257
CMSWEB_DAS_WEB_IS_DOWN or CMSWEB_DAS_WEB_IS_NOT_RESPONDING or CMSWEB_DAS_KWS_IS_DOWN or CMSWEB_DAS_KWS_IS_NOT_RESPONDING
- Check and possibly restart MongoDB first.
- Check service status. It should say das is RUNNING and give a process ID.
$ sudo -H -u _das bashs -l -c "/data/srv/current/config/das/manage status"
das is RUNNING, PID 3758
- Check whether the server processes are running. There should be one python DAS process.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_das'
_das 3758 7.8 1.7 2134328 143004 ? Sl 20:09:19 01:40:01 python -u /data/srv/beHG1309c/sw/slc5_amd64_gcc461/cms/das/2.0.0-comp/lib/python2.6/site-packages/DAS/web/das_server.py
- See if the service is listening on its ports. It should be the process listed above.
$ sudo /usr/sbin/lsof -P -i :8212 -i :8214
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 3903 _das 28u IPv4 22164 0t0 TCP *:8214 (LISTEN)
python 3903 _das 48u IPv4 22133 0t0 TCP *:8212 (LISTEN)
- Restart the service. It should end with das is RUNNING, followed by a process id.
$ sudo -H -u _das bashs -l -c "/data/srv/current/config/das/manage restart 'I did read documentation'"
stopping das
starting das
$ sudo -H -u _das bashs -l -c "/data/srv/current/config/das/manage status"
das is RUNNING, PID 473189
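DAS listens on two ports, so the port check above has to confirm both. A minimal sketch, assuming lsof output in the format shown; the function name is hypothetical.

```shell
#!/bin/sh
# all_ports_listening PORT...
# Reads "lsof -P -i" output on stdin and succeeds only if every given port
# appears on a line that ends in "(LISTEN)".
all_ports_listening() {
    lsof_text=$(cat)
    for port in "$@"; do
        printf '%s\n' "$lsof_text" | grep -Eq ":${port} \(LISTEN\)\$" || return 1
    done
    return 0
}

# Example:
#   sudo /usr/sbin/lsof -P -i :8212 -i :8214 | all_ports_listening 8212 8214 \
#       || echo "DAS is missing a listener"
```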
CMSWEB_DBS_IS_DOWN or CMSWEB_DBS_IS_NOT_RESPONDING
- Check service status. It should say dbs is RUNNING and give a process ID.
$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status"
dbs is RUNNING, PID 3771
- Check whether the server processes are running. There should be one python DBS process.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_dbs'
_dbs 3771 0.1 0.9 516252 77412 ? Sl 20:09:18 00:01:21 python -u /data/srv/beHG1309c/sw/slc5_amd64_gcc461/cms/dbs3/3.1.6a-comp/lib/python2.6/site-packages/WMCore/WebTools/Root.py --ini=/data/srv/current/config/dbs/DBS.py
_dbs 3772 0.0 0.0 15056 724 ? S 20:09:18 00:00:00 rotatelogs /data/srv/logs/dbs/dbs-%Y%m%d.log 86400
- See if the service is listening on its port. It should be the process listed above.
$ sudo /usr/sbin/lsof -P -i :8250
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 3771 _dbs 3u IPv4 20025 0t0 TCP *:8250 (LISTEN)
- Restart the service. It should end with dbs is RUNNING, followed by a process id.
$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage restart 'I did read documentation'"
stopping dbs
Stopping 3771 (3771 20:09 python -u /data/srv/beHG1309c/sw/slc5_amd64_gcc461/cms/dbs3/3.1.6a-comp/lib/python2.6/site-packages/WMCore/WebTools/Root.py --ini=/data/srv/current/config/dbs/DBS.py ):
starting dbs
$ sudo -H -u _dbs bashs -l -c "/data/srv/current/config/dbs/manage status"
dbs is RUNNING, PID 475539
CMSWEB_DBSMIGRATION_IS_DOWN or CMSWEB_DBSMIGRATION_IS_NOT_RESPONDING
- Check service status. It should say dbsmigration is RUNNING and give a process ID.
$ sudo -H -u _dbsmigration bashs -l -c "/data/srv/current/config/dbsmigration/manage status"
dbsmigration is RUNNING, PID 3710
- Check whether the server processes are running. There should be one python StartUp.py process.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | egrep '^_dbsmigration|100046'
100046 3710 1.2 0.4 599116 36584 ? Sl 20:09:19 00:16:01 python -u /data/srv/beHG1309c/sw/slc5_amd64_gcc461/cms/dbs3-migration/3.1.6a//lib/python2.6/site-packages/dbs/components/migration/StartUp.py --config=/data/srv/current/config/dbsmigration/Migration.py
100046 3711 0.0 0.0 15056 716 ? S 20:09:19 00:00:00 rotatelogs /data/srv/logs/dbsmigration/dbsmigration-%Y%m%d.log 86400
- See if the service is listening on its port. It should be the process listed above.
$ sudo /usr/sbin/lsof -P -i :8251
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 3710 _dbsmigration 3u IPv4 19493 0t0 TCP localhost.localdomain:8251 (LISTEN)
- Restart the service. It should end with dbsmigration is RUNNING, followed by a process id.
$ sudo -H -u _dbsmigration bashs -l -c "/data/srv/current/config/dbsmigration/manage restart 'I did read documentation'"
stopping dbsmigration
Stopping 3710 (3710 20:09 python -u /data/srv/beHG1309c/sw/slc5_amd64_gcc461/cms/dbs3-migration/3.1.6a//lib/python2.6/site-packages/dbs/components/migration/StartUp.py --config=/data/srv/current/config/dbsmigration/Migration.py ):
starting dbsmigration
$ sudo -H -u _dbsmigration bashs -l -c "/data/srv/current/config/dbsmigration/manage status"
dbsmigration is RUNNING, PID 477474
CMSWEB_DMWMMON_IS_DOWN or CMSWEB_DMWMMON_IS_NOT_RESPONDING
- Check service status. It should say dmwmmon is RUNNING, followed by a list of process ids.
$ sudo -H -u _dmwmmon bashs -l -c "/data/srv/current/config/dmwmmon/manage status"
dmwmmon is RUNNING, PID 4415 4437 4438 4439 4440 4441 4442
- Check whether the server processes are running. There should be one main httpd process followed by various other httpd processes and four rotatelogs processes.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_dmwmmon'
_dmwmmon 4415 0.0 0.2 104640 18132 ? Ss 20:09:29 00:00:00 /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/dmwmmon/server.conf -DDMWMMON_DATASVC -DDMWMMON_WEB
_dmwmmon 4429 0.0 0.0 15692 908 ? S 20:09:29 00:00:00 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/rotatelogs -f /data/srv/logs/dmwmmon/error_log_%Y%m%d.txt 86400
_dmwmmon 4430 0.0 0.0 15692 900 ? S 20:09:29 00:00:00 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/rotatelogs -f /data/srv/logs/dmwmmon/datasvc-error-%Y%m%d.log 86400
_dmwmmon 4431 0.0 0.0 15692 900 ? S 20:09:29 00:00:00 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/rotatelogs -f /data/srv/logs/dmwmmon/access_log_%Y%m%d.txt 86400
_dmwmmon 4432 0.0 0.0 15692 908 ? S 20:09:29 00:00:00 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/rotatelogs -f /data/srv/logs/dmwmmon/datasvc-access-%Y%m%d.log 86400
_dmwmmon 4437 0.0 0.2 132368 21788 ? S 20:09:29 00:00:01 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/dmwmmon/server.conf -DDMWMMON_DATASVC -DDMWMMON_WEB
_dmwmmon 4438 0.0 0.2 132388 21792 ? S 20:09:29 00:00:01 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/dmwmmon/server.conf -DDMWMMON_DATASVC -DDMWMMON_WEB
_dmwmmon 4439 0.0 0.2 132388 21760 ? S 20:09:29 00:00:00 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/dmwmmon/server.conf -DDMWMMON_DATASVC -DDMWMMON_WEB
_dmwmmon 4440 0.0 0.1 104776 11760 ? S 20:09:29 00:00:00 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/dmwmmon/server.conf -DDMWMMON_DATASVC -DDMWMMON_WEB
_dmwmmon 4441 0.0 0.1 104776 11760 ? S 20:09:29 00:00:00 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/dmwmmon/server.conf -DDMWMMON_DATASVC -DDMWMMON_WEB
_dmwmmon 4442 0.0 0.1 104776 11760 ? S 20:09:29 00:00:00 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/dmwmmon/server.conf -DDMWMMON_DATASVC -DDMWMMON_WEB
- See if the service is listening on its port. It should be the httpd processes listed above.
$ sudo /usr/sbin/lsof -P -i :8280
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
httpd 4415 _dmwmmon 3u IPv6 19508 0t0 TCP *:8280 (LISTEN)
httpd 4437 _dmwmmon 3u IPv6 19508 0t0 TCP *:8280 (LISTEN)
httpd 4438 _dmwmmon 3u IPv6 19508 0t0 TCP *:8280 (LISTEN)
httpd 4439 _dmwmmon 3u IPv6 19508 0t0 TCP *:8280 (LISTEN)
httpd 4440 _dmwmmon 3u IPv6 19508 0t0 TCP *:8280 (LISTEN)
httpd 4441 _dmwmmon 3u IPv6 19508 0t0 TCP *:8280 (LISTEN)
httpd 4442 _dmwmmon 3u IPv6 19508 0t0 TCP *:8280 (LISTEN)
- Restart the service. It should end with dmwmmon is RUNNING, followed by a list of process ids.
$ sudo -H -u _dmwmmon bashs -l -c "/data/srv/current/config/dmwmmon/manage restart 'I did read documentation'"
stopping dmwmmon
Stopping 4415 (4415 20:09 /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/dmwmmon/server.conf -DDMWMMON_DATASVC -DDMWMMON_WEB):
starting dmwmmon
Stopping httpd: [FAILED]
Starting httpd: [ OK ]
$ sudo -H -u _dmwmmon bashs -l -c "/data/srv/current/config/dmwmmon/manage status"
dmwmmon is RUNNING, PID 482412 482419 482422 482423 482424 482425 482426
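Because dmwmmon reports a list of PIDs, "the processes listed above" means every listening PID should appear in the status list. The helper below is a hypothetical sketch of that membership test; both arguments are space-separated PID lists that you have extracted from the status and lsof output yourself.

```shell
#!/bin/sh
# every_pid_listed "STATUS_PIDS" "LISTENER_PIDS"
# Succeeds when the listener list is non-empty and each of its PIDs also
# appears in the status list.
every_pid_listed() {
    [ -n "$2" ] || return 1
    for pid in $2; do
        case " $1 " in
            *" $pid "*) ;;        # found, check the next one
            *) return 1 ;;        # a listener the status command did not report
        esac
    done
    return 0
}

# Example with the sample values from this section:
#   every_pid_listed "4415 4437 4438 4439 4440 4441 4442" "4415 4437 4438" \
#       && echo "all listeners belong to dmwmmon"
```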
CMSWEB_DQM_DEV_AGENTS_IS_DOWN or CMSWEB_DQM_DEV_AGENTS_IS_NOT_RESPONDING or CMSWEB_DQM_OFFLINE_AGENTS_IS_DOWN or CMSWEB_DQM_OFFLINE_AGENTS_IS_NOT_RESPONDING or CMSWEB_DQM_RELVAL_AGENTS_IS_DOWN or CMSWEB_DQM_RELVAL_AGENTS_IS_NOT_RESPONDING
- Note that different servers run different combinations of DQM applications. No one server runs them all.
- Check agent status. It should say “file agents” are RUNNING, with at least four process ids. For the “offline” group, “stageout” should be RUNNING, and “afs sync” may be RUNNING (it runs intermittently). For all other application groups only “file agents” should be RUNNING.
# CMSWEB_DQM_OFFLINE_* (vocms0138)
$ sudo -H -u _dqmgui bashs -l -c "/data/srv/current/config/dqmgui/manage xstatus agents,stageout,afs-sync"
dqmgui file agents is RUNNING, PID 4372 4374 4377 4378 4381 4384 4387 4390 4392 4395 4402 4405 4407 4408 4411 4414 4416 4418 16733
dqmgui stageout agent is RUNNING, PID 4357 4362
dqmgui afs sync is NOT RUNNING.
# CMSWEB_DQM_RELVAL_* (vocms0139)
$ sudo -H -u _dqmgui bashs -l -c "/data/srv/current/config/dqmgui/manage xstatus agents"
dqmgui file agents is RUNNING, PID 4384 4387 4389 4390 4393 4395 4399 4402 17132 17210
# CMSWEB_DQM_DEV_* (others)
$ sudo -H -u _dqmgui bashs -l -c "/data/srv/current/config/dqmgui/manage xstatus agents"
dqmgui file agents is RUNNING, PID 5867 5870 5873 5876
- Check whether the agent processes are running. For each application group (dev, offline, relval) there are at least four agents (visDQMReceiveDaemon, visDQMImportDaemon, visDQMRootFileQuotaControl, visDQMVerControlDaemon), each with a rotatelogs process. The application group is part of the path name argument, e.g. .../dqmgui/offline/...
# CMSWEB_DQM_OFFLINE_* (vocms0138)
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | egrep '^(_dqmgui|cmsweb)' | egrep '[ ]python.*(Daemon|Stager|Verifier|QuotaControl)|[/]agent-'
cmsweb 4357 0.0 0.0 17728 5628 ? S Jun 21 00:00:02 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMZipCastorStager /data/srv/state/dqmgui/offline/agents/stageout /data/srv/state/dqmgui/offline/zipped /castor/cern.ch/cms/store/dqm/data/dqmdata /data/srv/state/dqmgui/offline/agents/verify
cmsweb 4358 0.0 0.0 15056 708 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/offline/agent-zstager-%Y%m%d.log 86400
cmsweb 4362 0.0 0.0 17012 4820 ? S Jun 21 00:00:00 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMZipCastorStager /data/srv/state/dqmgui/caf/agents/stageout /data/srv/state/dqmgui/caf/zipped /castor/cern.ch/cms/store/dqm/data/caf /data/srv/state/dqmgui/caf/agents/verify
cmsweb 4363 0.0 0.0 15056 708 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/caf/agent-zstager-%Y%m%d.log 86400
_dqmgui 4372 0.1 0.1 126276 38440 ? S Jun 21 00:21:48 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMReceiveDaemon /data/srv/state/dqmgui/offline/uploads /data/srv/state/dqmgui/offline/data /data/srv/state/dqmgui/offline/agents/register /data/srv/state/dqmgui/offline/agents/zip
_dqmgui 4373 0.0 0.0 15056 712 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/offline/agent-receive-%Y%m%d.log 86400
_dqmgui 4374 0.0 0.0 74012 6676 ? S Jun 21 00:00:12 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMZipDaemon /data/srv/state/dqmgui/offline/agents/zip /data/srv/state/dqmgui/offline/data /data/srv/state/dqmgui/offline/zipped /data/srv/state/dqmgui/offline/agents/freezer
_dqmgui 4377 0.0 0.0 15056 724 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/offline/agent-zip-%Y%m%d.log 86400
_dqmgui 4378 0.0 0.0 72260 4892 ? S Jun 21 00:00:27 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMZipFreezeDaemon /data/srv/state/dqmgui/offline/agents/freezer /data/srv/state/dqmgui/offline/zipped 7 /data/srv/state/dqmgui/offline/agents/stageout
_dqmgui 4379 0.0 0.0 15056 708 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/offline/agent-zfreeze-%Y%m%d.log 86400
_dqmgui 4381 0.0 0.0 95424 7740 ? S Jun 21 00:00:01 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMZipCastorVerifier /data/srv/state/dqmgui/offline/agents/verify lilopera@cern.ch /data/srv/state/dqmgui/offline/zipped /castor/cern.ch/cms/store/dqm/data/dqmdata 7 /data/srv/state/dqmgui/offline/agents/clean
_dqmgui 4383 0.0 0.0 15056 712 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/offline/agent-zverifier-%Y%m%d.log 86400
_dqmgui 4384 0.0 0.3 203080 92456 ? S Jun 21 00:08:12 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMImportDaemon /data/srv/state/dqmgui/offline/agents/register /data/srv/state/dqmgui/offline/data /data/srv/state/dqmgui/offline/ix /data/srv/state/dqmgui/offline/agents/qcontrol /data/srv/state/dqmgui/offline/agents/vcontrol
_dqmgui 4386 0.0 0.0 15056 708 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/offline/agent-import-%Y%m%d.log 86400
_dqmgui 4387 31.7 0.1 103112 35840 ? S Jun 21 2-13:17:17 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMRootFileQuotaControl /data/srv/state/dqmgui/offline/agents/qcontrol /data/srv/state/dqmgui/offline/agents/register /data/srv/state/dqmgui/offline/data /data/srv/current/config/dqmgui/quota-rfqc-offline-offline.py
_dqmgui 4389 0.0 0.0 15056 708 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/offline/agent-qcontrol-%Y%m%d.log 86400
_dqmgui 4390 0.0 0.0 73000 5696 ? S Jun 21 00:00:25 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMVerControlDaemon /data/srv/state/dqmgui/offline/agents/vcontrol /data/srv/state/dqmgui/offline/data
_dqmgui 4391 0.0 0.0 15056 712 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/offline/agent-vcontrol-%Y%m%d.log 86400
_dqmgui 4394 0.0 0.0 15056 712 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/offline/agent-osync-%Y%m%d.log 86400
_dqmgui 4395 0.2 0.1 127288 39484 ? S Jun 21 00:26:38 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMCreateInfoDaemon /data/srv/state/dqmgui/offline/data/OnlineData /data/srv/state/dqmgui/offline/agents/zip
_dqmgui 4396 0.0 0.0 15056 712 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/offline/agent-coinfo-%Y%m%d.log 86400
_dqmgui 4402 0.0 0.0 93764 5848 ? S Jun 21 00:00:00 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMReceiveDaemon /data/srv/state/dqmgui/caf/uploads /data/srv/state/dqmgui/caf/data /data/srv/state/dqmgui/caf/agents/register /data/srv/state/dqmgui/caf/agents/zip
_dqmgui 4403 0.0 0.0 15056 708 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/caf/agent-receive-%Y%m%d.log 86400
_dqmgui 4405 0.0 0.0 72004 4692 ? S Jun 21 00:00:00 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMZipDaemon /data/srv/state/dqmgui/caf/agents/zip /data/srv/state/dqmgui/caf/data /data/srv/state/dqmgui/caf/zipped /data/srv/state/dqmgui/caf/agents/freezer
_dqmgui 4407 0.0 0.0 15056 708 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/caf/agent-zip-%Y%m%d.log 86400
_dqmgui 4408 0.0 0.0 71984 4596 ? S Jun 21 00:00:00 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMZipFreezeDaemon /data/srv/state/dqmgui/caf/agents/freezer /data/srv/state/dqmgui/caf/zipped 7 /data/srv/state/dqmgui/caf/agents/stageout
_dqmgui 4409 0.0 0.0 15056 704 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/caf/agent-zfreeze-%Y%m%d.log 86400
_dqmgui 4411 0.0 0.0 95332 7392 ? S Jun 21 00:00:00 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMZipCastorVerifier /data/srv/state/dqmgui/caf/agents/verify lilopera@cern.ch /data/srv/state/dqmgui/caf/zipped /castor/cern.ch/cms/store/dqm/data/caf 7 /data/srv/state/dqmgui/caf/agents/clean
_dqmgui 4412 0.0 0.0 15056 708 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/caf/agent-zverifier-%Y%m%d.log 86400
_dqmgui 4414 0.0 0.0 95912 6196 ? S Jun 21 00:00:00 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMImportDaemon /data/srv/state/dqmgui/caf/agents/register /data/srv/state/dqmgui/caf/data /data/srv/state/dqmgui/caf/ix /data/srv/state/dqmgui/caf/agents/qcontrol /data/srv/state/dqmgui/caf/agents/vcontrol
_dqmgui 4415 0.0 0.0 15056 700 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/caf/agent-import-%Y%m%d.log 86400
_dqmgui 4416 0.0 0.0 71680 4316 ? S Jun 21 00:00:00 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMRootFileQuotaControl /data/srv/state/dqmgui/caf/agents/qcontrol /data/srv/state/dqmgui/caf/agents/register /data/srv/state/dqmgui/caf/data /data/srv/current/config/dqmgui/quota-rfqc-offline-caf.py
_dqmgui 4417 0.0 0.0 15056 700 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/caf/agent-qcontrol-%Y%m%d.log 86400
_dqmgui 4418 0.0 0.0 71648 4140 ? S Jun 21 00:00:00 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMVerControlDaemon /data/srv/state/dqmgui/caf/agents/vcontrol /data/srv/state/dqmgui/caf/data
_dqmgui 4419 0.0 0.0 15056 704 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/caf/agent-vcontrol-%Y%m%d.log 86400
# CMSWEB_DQM_RELVAL_* (vocms0139)
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_dqmgui' | egrep 'Daemon|Stager|Verifier|QuotaControl|agent-'
_dqmgui 4384 0.4 0.1 124032 36276 ? S Jun 21 00:57:13 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMReceiveDaemon /data/srv/state/dqmgui/relval/uploads /data/srv/state/dqmgui/relval/data /data/srv/state/dqmgui/relval/agents/register /data/srv/state/dqmgui/relval/agents/zip
_dqmgui 4385 0.0 0.0 15056 712 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/relval/agent-receive-%Y%m%d.log 86400
_dqmgui 4387 0.0 0.0 72240 5008 ? S Jun 21 00:00:18 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMZipDaemon /data/srv/state/dqmgui/relval/agents/zip /data/srv/state/dqmgui/relval/data /data/srv/state/dqmgui/relval/zipped /data/srv/state/dqmgui/relval/agents/freezer
_dqmgui 4389 0.0 0.0 15056 712 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/relval/agent-zip-%Y%m%d.log 86400
_dqmgui 4390 0.0 0.0 72352 4928 ? S Jun 21 00:00:15 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMZipFreezeDaemon /data/srv/state/dqmgui/relval/agents/freezer /data/srv/state/dqmgui/relval/zipped 7 /data/srv/state/dqmgui/relval/agents/stageout
_dqmgui 4391 0.0 0.0 15056 708 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/relval/agent-zfreeze-%Y%m%d.log 86400
_dqmgui 4393 0.0 0.0 95368 7540 ? S Jun 21 00:00:00 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMZipCastorVerifier /data/srv/state/dqmgui/relval/agents/verify lilopera@cern.ch /data/srv/state/dqmgui/relval/zipped /castor/cern.ch/cms/store/dqm/data/dqmdata 7 /data/srv/state/dqmgui/relval/agents/clean
_dqmgui 4394 0.0 0.0 15056 708 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/relval/agent-zverifier-%Y%m%d.log 86400
_dqmgui 4395 0.0 0.1 114764 25036 ? S Jun 21 00:08:55 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMImportDaemon /data/srv/state/dqmgui/relval/agents/register /data/srv/state/dqmgui/relval/data /data/srv/state/dqmgui/relval/ix /data/srv/state/dqmgui/relval/agents/qcontrol /data/srv/state/dqmgui/relval/agents/vcontrol
_dqmgui 4398 0.0 0.0 15056 708 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/relval/agent-import-%Y%m%d.log 86400
_dqmgui 4399 4.2 0.0 76384 9080 ? S Jun 21 08:13:30 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMRootFileQuotaControl /data/srv/state/dqmgui/relval/agents/qcontrol /data/srv/state/dqmgui/relval/agents/register /data/srv/state/dqmgui/relval/data /data/srv/current/config/dqmgui/quota-rfqc-offline-relval.py
_dqmgui 4401 0.0 0.0 15056 708 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/relval/agent-qcontrol-%Y%m%d.log 86400
_dqmgui 4402 0.0 0.0 82328 14948 ? S Jun 21 00:01:31 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/visDQMVerControlDaemon /data/srv/state/dqmgui/relval/agents/vcontrol /data/srv/state/dqmgui/relval/data
_dqmgui 4404 0.0 0.0 15056 708 ? S Jun 21 00:00:00 rotatelogs /data/srv/logs/dqmgui/relval/agent-vcontrol-%Y%m%d.log 86400
# CMSWEB_DQM_DEV_* (others)
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_dqmgui' | egrep 'Daemon|Stager|Verifier|QuotaControl|agent-'
_dqmgui 5867 0.0 0.0 93808 5872 ? S Jun 27 00:00:00 python /data/srv/hg1107a/sw.pre/slc5_amd64_gcc434/cms/dqmgui/6.0.6-comp/bin/visDQMReceiveDaemon /data/srv/state/dqmgui/dev/uploads /data/srv/state/dqmgui/dev/data /data/srv/state/dqmgui/dev/agents/register
_dqmgui 5869 0.0 0.0 15052 704 ? S Jun 27 00:00:00 rotatelogs /data/srv/logs/dqmgui/dev/agent-receive-%Y%m%d.log 86400
_dqmgui 5870 0.0 0.0 95928 6168 ? S Jun 27 00:00:00 python /data/srv/hg1107a/sw.pre/slc5_amd64_gcc434/cms/dqmgui/6.0.6-comp/bin/visDQMImportDaemon /data/srv/state/dqmgui/dev/agents/register /data/srv/state/dqmgui/dev/data /data/srv/state/dqmgui/dev/ix /data/srv/state/dqmgui/dev/agents/qcontrol /data/srv/state/dqmgui/dev/agents/vcontrol
_dqmgui 5872 0.0 0.0 15052 700 ? S Jun 27 00:00:00 rotatelogs /data/srv/logs/dqmgui/dev/agent-import-%Y%m%d.log 86400
_dqmgui 5873 0.0 0.0 71696 4308 ? S Jun 27 00:00:00 python /data/srv/hg1107a/sw.pre/slc5_amd64_gcc434/cms/dqmgui/6.0.6-comp/bin/visDQMRootFileQuotaControl /data/srv/state/dqmgui/dev/agents/qcontrol /data/srv/state/dqmgui/dev/agents/register /data/srv/state/dqmgui/dev/data /data/srv/current/config/dqmgui/quota-rfqc-offline-dev.py
_dqmgui 5875 0.0 0.0 15052 700 ? S Jun 27 00:00:00 rotatelogs /data/srv/logs/dqmgui/dev/agent-qcontrol-%Y%m%d.log 86400
_dqmgui 5876 0.0 0.0 71660 4124 ? S Jun 27 00:00:00 python /data/srv/hg1107a/sw.pre/slc5_amd64_gcc434/cms/dqmgui/6.0.6-comp/bin/visDQMVerControlDaemon /data/srv/state/dqmgui/dev/agents/vcontrol /data/srv/state/dqmgui/dev/data
_dqmgui 5878 0.0 0.0 15052 700 ? S Jun 27 00:00:00 rotatelogs /data/srv/logs/dqmgui/dev/agent-vcontrol-%Y%m%d.log 86400
-
Restart the agents. The final status should say the file agents are RUNNING, with a list of process ids. Only the “file agents” can be restarted from the command line.
$ sudo -H -u _dqmgui bashs -l -c "/data/srv/current/config/dqmgui/manage xrestart agents 'I did read documentation'" stopping dqmgui agents Stopping dqmgui file agents: Stopping 5876 (5876 Jun 27 python /data/srv/hg1107a/sw.pre/slc5_amd64_gcc434/cms/dqmgui/6.0.6-comp/bin/visDQMVerControlDaemon): SIGTERM Stopping 5873 (5873 Jun 27 python /data/srv/hg1107a/sw.pre/slc5_amd64_gcc434/cms/dqmgui/6.0.6-comp/bin/visDQMRootFileQuotaControl): SIGTERM Stopping 5870 (5870 Jun 27 python /data/srv/hg1107a/sw.pre/slc5_amd64_gcc434/cms/dqmgui/6.0.6-comp/bin/visDQMImportDaemon): SIGTERM Stopping 5867 (5867 Jun 27 python /data/srv/hg1107a/sw.pre/slc5_amd64_gcc434/cms/dqmgui/6.0.6-comp/bin/visDQMReceiveDaemon): SIGTERM starting dqmgui agents $ sudo -H -u _dqmgui bashs -l -c "/data/srv/current/config/dqmgui/manage xstatus agents" dqmgui file agents is RUNNING, PID 24416 24419 24422 24425
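The status line can also be checked mechanically. The following is a minimal sketch, not part of the manage script: `running_pids` and `count_running_pids` are hypothetical helper names, and they assume the status output keeps the `<label> is RUNNING, PID ...` form shown above. In the sample above four agents were stopped, so four PIDs would be expected afterwards.

```shell
# Hypothetical helpers (not part of the manage script).

# running_pids LABEL: read status output on stdin and print the PID list
# from the line "LABEL is RUNNING, PID ...". Prints nothing if there is
# no such line. LABEL is used inside a sed pattern, so it must not
# contain "/" or regex metacharacters.
running_pids() {
  sed -n "s/^$1 is RUNNING, PID //p"
}

# count_running_pids LABEL: print how many PIDs that line lists (0 if none).
count_running_pids() {
  running_pids "$1" | awk '{ n += NF } END { print n + 0 }'
}
```

To use it, pipe the output of the `manage xstatus agents` command shown above into `count_running_pids 'dqmgui file agents'` and compare the result with the number of agents that were stopped.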
CMSWEB_DQM_DEV_WEB_IS_DOWN or CMSWEB_DQM_DEV_WEB_IS_NOT_RESPONDING or CMSWEB_DQM_OFFLINE_WEB_IS_DOWN or CMSWEB_DQM_OFFLINE_WEB_IS_NOT_RESPONDING or CMSWEB_DQM_RELVAL_WEB_IS_DOWN or CMSWEB_DQM_RELVAL_WEB_IS_NOT_RESPONDING
-
Note that different servers run different combinations of DQM applications. No one server runs them all.
-
Check web server status. It should report RUNNING, with one or more process ids, for each of “webserver”, “renderer”, “collector” and “loggers”.
# CMSWEB_DQM_OFFLINE_* (vocms0138) $ sudo -H -u _dqmgui bashs -l -c "/data/srv/current/config/dqmgui/manage xstatus webserver,renderer,collector,loggers" dqmgui webserver is RUNNING, PID 4466 4468 4523 6648 dqmgui renderer is RUNNING, PID 4648 4658 4668 4669 4670 6650 6651 6652 6653 6654 dqmgui collector is RUNNING, PID 4422 dqmgui loggers is RUNNING, PID 392 808 1631 1632 1920 1958 1998 2032 2589 3143 3314 3425 3749 3874 4208 4373 4377 4379 4383 4386 4389 4391 4394 4396 4403 4407 4409 4412 4415 4417 4419 4423 4521 4522 4623 4641 4729 4804 5209 5571 5749 5751 5864 6253 6649 6799 6911 7016 7244 7387 7471 8125 8164 8333 8700 8825 8876 9069 9455 9670 10219 10551 10790 11003 11129 11623 11956 11969 12026 12433 12555 12757 13193 13304 13824 13898 13947 13961 14024 14072 14493 15258 15269 15348 15386 15460 15581 15765 16088 16788 16843 16877 16992 17176 17304 17843 18137 18141 19098 19111 19625 19688 19957 20068 20810 20876 20921 20988 21161 21293 21404 22237 22252 22348 22364 22597 23143 24126 24224 24335 24409 24734 25292 25561 25619 25672 26390 26895 26983 27006 27547 28029 28734 29068 29148 29916 30027 30293 30353 30834 31438 31500 31549 32056 32739 # CMSWEB_DQM_RELVAL_* (vocms0139) $ sudo -H -u _dqmgui bashs -l -c "/data/srv/current/config/dqmgui/manage xstatus webserver,renderer,collector,loggers" dqmgui webserver is RUNNING, PID 4594 4596 dqmgui renderer is RUNNING, PID 4600 4601 4602 4603 4604 dqmgui collector is RUNNING, PID 4405 dqmgui loggers is RUNNING, PID 4385 4389 4391 4394 4398 4401 4404 4406 4595 4599 # CMSWEB_DQM_DEV_* (others) $ sudo -H -u _dqmgui bashs -l -c "/data/srv/current/config/dqmgui/manage xstatus webserver,renderer,collector,loggers" dqmgui webserver is RUNNING, PID 6100 6183 dqmgui renderer is RUNNING, PID 6320 6321 6322 6323 6324 dqmgui collector is RUNNING, PID 5881 dqmgui loggers is RUNNING, PID 5882 6182 6319 24418 24421 24424 24427
-
Check whether the web servers are running. For each application group (caf, dev, offline, relval) there should be two “monGui” and two “rotatelogs” processes plus five “visDQMRender” processes. The application group is part of the path name argument, e.g. .../dqmgui/offline/... or .../server-conf-offline.py
.# CMSWEB_DQM_OFFLINE_* (vocms0138) $ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_dqmgui' | egrep 'monGui|visDQMRender|[/](web|render)log-' _dqmgui 4466 0.0 0.0 100844 12904 ? Ss Jun 21 00:00:00 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/monGui /data/srv/current/config/dqmgui/server-conf-offline.py _dqmgui 4522 0.0 0.0 15056 728 ? S Jun 21 00:00:29 \_ rotatelogs /data/srv/logs/dqmgui/offline/weblog-%Y%m%d.log 86400 _dqmgui 6648 13.7 9.2 2515400 2294488 ? Sl Jun 22 22:33:00 \_ python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/monGui /data/srv/current/config/dqmgui/server-conf-offline.py _dqmgui 6649 0.0 0.0 15056 712 ? S Jun 22 00:00:06 \_ rotatelogs /data/srv/logs/dqmgui/offline/renderlog-%Y%m%d.log 86400 _dqmgui 6650 0.3 0.2 148392 72444 ? S Jun 22 00:37:38 \_ visDQMRender --state-directory /data/srv/state/dqmgui/offline/render-0 --load /data/srv/state/dqmgui/offline/render/d876a04d7823a99352eb427ead21046a.ext _dqmgui 6651 0.1 0.2 146972 71004 ? S Jun 22 00:19:40 \_ visDQMRender --state-directory /data/srv/state/dqmgui/offline/render-1 --load /data/srv/state/dqmgui/offline/render/d876a04d7823a99352eb427ead21046a.ext _dqmgui 6652 0.1 0.2 141432 63036 ? S Jun 22 00:11:04 \_ visDQMRender --state-directory /data/srv/state/dqmgui/offline/render-2 --load /data/srv/state/dqmgui/offline/render/d876a04d7823a99352eb427ead21046a.ext _dqmgui 6653 0.0 0.2 129760 51356 ? S Jun 22 00:00:32 \_ visDQMRender --state-directory /data/srv/state/dqmgui/offline/render-3 --load /data/srv/state/dqmgui/offline/render/d876a04d7823a99352eb427ead21046a.ext _dqmgui 6654 0.0 0.1 124268 42952 ? S Jun 22 00:00:10 \_ visDQMRender --state-directory /data/srv/state/dqmgui/offline/render-4 --load /data/srv/state/dqmgui/offline/render/d876a04d7823a99352eb427ead21046a.ext _dqmgui 4468 0.0 0.0 100644 12828 ? 
Ss Jun 21 00:00:00 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/monGui /data/srv/current/config/dqmgui/server-conf-offline-caf.py _dqmgui 4521 0.0 0.0 15056 716 ? S Jun 21 00:00:00 \_ rotatelogs /data/srv/logs/dqmgui/caf/weblog-%Y%m%d.log 86400 _dqmgui 4523 0.0 0.4 392760 111776 ? Sl Jun 21 00:00:48 \_ python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/monGui /data/srv/current/config/dqmgui/server-conf-offline-caf.py _dqmgui 4641 0.0 0.0 15056 704 ? S Jun 21 00:00:00 \_ rotatelogs /data/srv/logs/dqmgui/caf/renderlog-%Y%m%d.log 86400 _dqmgui 4648 0.0 0.1 104400 31748 ? S Jun 21 00:00:00 \_ visDQMRender --state-directory /data/srv/state/dqmgui/caf/render-0 --load /data/srv/state/dqmgui/caf/render/d876a04d7823a99352eb427ead21046a.ext _dqmgui 4658 0.0 0.1 104400 31744 ? S Jun 21 00:00:00 \_ visDQMRender --state-directory /data/srv/state/dqmgui/caf/render-1 --load /data/srv/state/dqmgui/caf/render/d876a04d7823a99352eb427ead21046a.ext _dqmgui 4668 0.0 0.1 104400 31744 ? S Jun 21 00:00:00 \_ visDQMRender --state-directory /data/srv/state/dqmgui/caf/render-2 --load /data/srv/state/dqmgui/caf/render/d876a04d7823a99352eb427ead21046a.ext _dqmgui 4669 0.0 0.1 104396 31748 ? S Jun 21 00:00:00 \_ visDQMRender --state-directory /data/srv/state/dqmgui/caf/render-3 --load /data/srv/state/dqmgui/caf/render/d876a04d7823a99352eb427ead21046a.ext _dqmgui 4670 0.0 0.1 104404 31748 ? S Jun 21 00:00:00 \_ visDQMRender --state-directory /data/srv/state/dqmgui/caf/render-4 --load /data/srv/state/dqmgui/caf/render/d876a04d7823a99352eb427ead21046a.ext # CMSWEB_DQM_RELVAL_* (vocms0139) $ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_dqmgui' | egrep 'monGui|visDQMRender|[/](web|render)log-' _dqmgui 4594 0.0 0.0 100680 12828 ? Ss Jun 21 00:00:00 python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/monGui /data/srv/current/config/dqmgui/server-conf-offline-relval.py _dqmgui 4595 0.0 0.0 15056 712 ? 
S Jun 21 00:00:01 \_ rotatelogs /data/srv/logs/dqmgui/relval/weblog-%Y%m%d.log 86400 _dqmgui 4596 11.2 11.6 3119136 2867164 ? Sl Jun 21 21:45:43 \_ python /data/srv/be1106h/sw/slc5_amd64_gcc434/cms/dqmgui/6.0.6/bin/monGui /data/srv/current/config/dqmgui/server-conf-offline-relval.py _dqmgui 4599 0.0 0.0 15056 708 ? S Jun 21 00:00:00 \_ rotatelogs /data/srv/logs/dqmgui/relval/renderlog-%Y%m%d.log 86400 _dqmgui 4600 0.0 0.2 128216 50404 ? S Jun 21 00:02:25 \_ visDQMRender --state-directory /data/srv/state/dqmgui/relval/render-0 --load /data/srv/state/dqmgui/relval/render/d876a04d7823a99352eb427ead21046a.ext _dqmgui 4601 0.0 0.1 124884 47068 ? S Jun 21 00:00:56 \_ visDQMRender --state-directory /data/srv/state/dqmgui/relval/render-1 --load /data/srv/state/dqmgui/relval/render/d876a04d7823a99352eb427ead21046a.ext _dqmgui 4602 0.0 0.1 118540 40836 ? S Jun 21 00:00:17 \_ visDQMRender --state-directory /data/srv/state/dqmgui/relval/render-2 --load /data/srv/state/dqmgui/relval/render/d876a04d7823a99352eb427ead21046a.ext _dqmgui 4603 0.0 0.1 124556 42648 ? S Jun 21 00:00:01 \_ visDQMRender --state-directory /data/srv/state/dqmgui/relval/render-3 --load /data/srv/state/dqmgui/relval/render/d876a04d7823a99352eb427ead21046a.ext _dqmgui 4604 0.0 0.1 119804 37720 ? S Jun 21 00:00:00 \_ visDQMRender --state-directory /data/srv/state/dqmgui/relval/render-4 --load /data/srv/state/dqmgui/relval/render/d876a04d7823a99352eb427ead21046a.ext # CMSWEB_DQM_DEV_* (others) $ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_dqmgui' | egrep 'monGui|visDQMRender|[/](web|render)log-' _dqmgui 17923 0.0 0.1 100660 12840 ? Ss 14:05:10 00:00:00 python /data/srv/hg1107a/sw.pre/slc5_amd64_gcc434/cms/dqmgui/6.0.6-comp/bin/monGui /data/srv/current/config/dqmgui/server-conf-offline-dev.py _dqmgui 17924 0.0 0.0 15052 700 ? S 14:05:11 00:00:00 \_ rotatelogs /data/srv/logs/dqmgui/dev/weblog-%Y%m%d.log 86400 _dqmgui 17925 0.6 0.4 280256 40564 ? 
Sl 14:05:11 00:00:03 \_ python /data/srv/hg1107a/sw.pre/slc5_amd64_gcc434/cms/dqmgui/6.0.6-comp/bin/monGui /data/srv/current/config/dqmgui/server-conf-offline-dev.py _dqmgui 17926 0.0 0.0 15052 704 ? S 14:05:13 00:00:00 \_ rotatelogs /data/srv/logs/dqmgui/dev/renderlog-%Y%m%d.log 86400 _dqmgui 17927 0.0 0.3 104748 32180 ? S 14:05:13 00:00:00 \_ visDQMRender --state-directory /data/srv/state/dqmgui/dev/render-0 --load /data/srv/state/dqmgui/dev/render/4334c49b6b11ebf66efba34892c407ab.ext _dqmgui 17928 0.0 0.3 104748 32180 ? S 14:05:13 00:00:00 \_ visDQMRender --state-directory /data/srv/state/dqmgui/dev/render-1 --load /data/srv/state/dqmgui/dev/render/4334c49b6b11ebf66efba34892c407ab.ext _dqmgui 17929 0.0 0.3 104752 32188 ? S 14:05:13 00:00:00 \_ visDQMRender --state-directory /data/srv/state/dqmgui/dev/render-2 --load /data/srv/state/dqmgui/dev/render/4334c49b6b11ebf66efba34892c407ab.ext _dqmgui 17930 0.0 0.3 104756 32184 ? S 14:05:13 00:00:00 \_ visDQMRender --state-directory /data/srv/state/dqmgui/dev/render-3 --load /data/srv/state/dqmgui/dev/render/4334c49b6b11ebf66efba34892c407ab.ext _dqmgui 17931 0.0 0.3 104748 32184 ? S 14:05:13 00:00:00 \_ visDQMRender --state-directory /data/srv/state/dqmgui/dev/render-4 --load /data/srv/state/dqmgui/dev/render/4334c49b6b11ebf66efba34892c407ab.ext
-
See if the service is listening to its ports. It should be the processes listed above.
# CMSWEB_DQM_OFFLINE_* (vocms138) $ sudo /usr/sbin/lsof -P -i :8080 COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME python 6648 _dqmgui 3u IPv4 14726590 TCP *:8080 (LISTEN) # CMSWEB_DQM_RELVAL_* (vocms139) $ sudo /usr/sbin/lsof -P -i :8081 COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME python 4596 _dqmgui 3u IPv4 21607 TCP *:8081 (LISTEN) # CMSWEB_DQM_DEV_* (others) $ sudo /usr/sbin/lsof -P -i :8060 COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME python 6183 _dqmgui 8u IPv4 28507 TCP *:8060 (LISTEN)
-
Restart the service. The final status should say the web server is RUNNING, with a list of process ids.
$ sudo -H -u _dqmgui bashs -l -c "/data/srv/current/config/dqmgui/manage xrestart webserver 'I did read documentation'" stopping dqmgui webserver Stopping server at port 8060 in /data/srv/state/dqmgui/dev: 6100.. 6320 6321 6322 6323 6324 starting dqmgui webserver Plug-in render is up-to-date Starting server at port 8060 in /data/srv/state/dqmgui/dev $ sudo -H -u _dqmgui bashs -l -c "/data/srv/current/config/dqmgui/manage xstatus webserver,renderer,collector,loggers" dqmgui webserver is RUNNING, PID 17923 17925 dqmgui renderer is RUNNING, PID 17927 17928 17929 17930 17931 dqmgui collector is RUNNING, PID 5881 dqmgui loggers is RUNNING, PID 5882 17924 17926 24418 24421 24424 24427
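The per-group process counts from the web server check (two monGui, two rotatelogs, five visDQMRender) can be tallied with a small filter. This is a sketch under assumptions: `count_group_procs` is a hypothetical name, and the mapping from group to configuration file (`server-conf-offline.py` for offline, `server-conf-offline-<group>.py` for the others) is inferred from the sample listings in this section, not from the DQM GUI configuration itself.

```shell
# count_group_procs GROUP: read `ps` lines on stdin and print, for one
# application group (caf, dev, offline, relval), the number of monGui,
# web/render rotatelogs and visDQMRender processes.
count_group_procs() {
  awk -v group="$1" '
    BEGIN {
      # Assumption from the sample output: the "offline" group uses
      # server-conf-offline.py, the others server-conf-offline-GROUP.py.
      conf = (group == "offline") ? "server-conf-offline.py" : "server-conf-offline-" group ".py"
      dir = "/dqmgui/" group "/"
    }
    # index() does a literal substring match, so dots are not wildcards.
    /monGui/ && index($0, conf)                            { mongui++ }
    /rotatelogs/ && index($0, dir) && /(web|render)log-/   { logs++ }
    /visDQMRender/ && index($0, dir)                       { render++ }
    END { printf "monGui=%d rotatelogs=%d visDQMRender=%d\n", mongui, logs, render }
  '
}
```

Fed with the `ps ... | grep '^_dqmgui'` output shown above, a healthy group should print `monGui=2 rotatelogs=2 visDQMRender=5`.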
CMSWEB_FRONTEND_IS_DOWN or CMSWEB_FRONTEND_IS_NOT_RESPONDING
At the moment this is merely a collection of cron jobs which run from time to time. If these alarms are raised, it means the cron jobs have failed; there is nothing to restart. Simply inform the service managers about the alarm as described at the top of this document.
CMSWEB_GITWEB_IS_DOWN or CMSWEB_GITWEB_IS_NOT_RESPONDING
-
Check service status. It should say gitweb is RUNNING, followed by a list of process ids.
$ sudo -H -u _gitweb bashs -l -c "/data/srv/current/config/gitweb/manage status" gitweb is RUNNING, PID 3962 382505 443308 449868 472274 473402 483430
-
Check whether the server processes are running. There should be one main httpd process followed by several child httpd processes and two rotatelogs processes.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_gitweb' _gitweb 3962 0.0 0.0 48816 5116 ? Ss 20:09:21 00:00:00 /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/gitweb/server.conf _gitweb 3963 0.0 0.0 15676 880 ? S 20:09:21 00:00:00 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/rotatelogs -f /data/srv/logs/gitweb/error_log_%Y%m%d.txt 86400 _gitweb 3964 0.0 0.0 15676 888 ? S 20:09:21 00:00:00 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/rotatelogs -f /data/srv/logs/gitweb/access_log_%Y%m%d.txt 86400 _gitweb 382505 0.0 0.2 81132 19160 ? S 13:14:25 00:00:12 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/gitweb/server.conf _gitweb 443308 0.0 0.2 81132 19152 ? S 16:07:48 00:00:04 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/gitweb/server.conf _gitweb 449868 0.0 0.2 81220 19200 ? S 16:28:44 00:00:04 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/gitweb/server.conf _gitweb 472274 0.0 0.2 81132 19044 ? S 17:27:06 00:00:02 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/gitweb/server.conf _gitweb 473402 0.0 0.2 81116 19136 ? S 17:29:06 00:00:01 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/gitweb/server.conf _gitweb 483430 0.1 0.2 80984 19020 ? S 17:58:39 00:00:00 \_ /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/gitweb/server.conf
-
See if the service is listening to its ports. It should be the processes listed above.
$ sudo /usr/sbin/lsof -P -i :8290 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME httpd 3962 _gitweb 3u IPv6 18322 0t0 TCP *:8290 (LISTEN) httpd 382505 _gitweb 3u IPv6 18322 0t0 TCP *:8290 (LISTEN) httpd 443308 _gitweb 3u IPv6 18322 0t0 TCP *:8290 (LISTEN) httpd 449868 _gitweb 3u IPv6 18322 0t0 TCP *:8290 (LISTEN) httpd 472274 _gitweb 3u IPv6 18322 0t0 TCP *:8290 (LISTEN) httpd 473402 _gitweb 3u IPv6 18322 0t0 TCP *:8290 (LISTEN) httpd 483430 _gitweb 3u IPv6 18322 0t0 TCP *:8290 (LISTEN)
-
Restart the service. It should end with gitweb is RUNNING, followed by a list of process ids.
$ sudo -H -u _gitweb bashs -l -c "/data/srv/current/config/gitweb/manage restart 'I did read documentation'" stopping gitweb Stopping 3962 (3962 20:09 /data/srv/beHG1309c/sw/slc5_amd64_gcc461/external/apache2/2.2.24gsi-comp2/bin/httpd -f /data/srv/state/gitweb/server.conf ): starting gitweb Stopping httpd: [FAILED] Starting httpd: [ OK ] $ sudo -H -u _gitweb bashs -l -c "/data/srv/current/config/gitweb/manage status" gitweb is RUNNING, PID 488726 488732 488734 488735 488736 488737 488738
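The comparison between the port listing and the status output ("it should be the processes listed above") can be sketched as follows. Both function names are hypothetical, and the sketch assumes the `lsof` output has the PID in the second column and marks listeners with `(LISTEN)`, as in the sample above.

```shell
# listening_pids: read `lsof -P -i :PORT` output on stdin and print the
# PIDs of the listening processes, sorted, on one line.
listening_pids() {
  awk '/\(LISTEN\)/ { print $2 }' | sort -n | paste -s -d ' ' -
}

# pids_subset "LISTENERS" "KNOWN": succeed only if every PID in the first
# list also appears in the second list (for example the PIDs from
# `manage status`). Reports the first unexpected PID on stderr.
pids_subset() {
  ps_known=" $(echo $2) "          # collapse whitespace, pad with spaces
  for ps_pid in $1; do
    case $ps_known in
      *" $ps_pid "*) ;;            # PID is known
      *) echo "unexpected PID: $ps_pid" >&2; return 1 ;;
    esac
  done
  return 0
}
```

An empty result from `listening_pids` means nothing is listening on the port, which is the condition the alarm procedure is trying to detect.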
CMSWEB_MONGODB_IS_DOWN or CMSWEB_MONGODB_IS_NOT_RESPONDING
-
Check service status. It should say the service is RUNNING, with the process id.
$ sudo -H -u _mongodb bashs -l -c "/data/srv/current/config/mongodb/manage status" mongodb is RUNNING, PID 25627
-
Check whether the server processes are running. There should be one mongod process and one rotatelogs process.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_mongodb' _mongodb 25627 1.3 13.8 2825540 563052 ? Sl Mar 14 00:19:52 /data/srv/be1103a/sw/slc5_amd64_gcc434/external/mongo/1.6.5-cmp/bin/mongod --dbpath=/data/srv/state/mongodb/db --port 8230 --nohttpinterface --bind_ip 127.0.0.1 _mongodb 25628 0.0 0.0 15052 720 ? S Mar 14 00:00:00 rotatelogs /data/srv/logs/mongodb/mongodb-%Y%m%d.log 86400
-
See if the service is listening to its port. It should be the process listed above.
$ sudo /usr/sbin/lsof -P -i :8230 | grep LISTEN mongod 25627 _mongodb 5u IPv4 88529973 TCP localhost.localdomain:8230 (LISTEN)
-
Restart the service. It should end by saying the service is RUNNING, with process id.
$ sudo -H -u _mongodb bashs -l -c "/data/srv/current/config/mongodb/manage restart 'I did read documentation'" stopping mongodb Waiting for mongod to exit... starting mongodb waiting for mongodb... mongodb is ready Clean das_db MongoDB shell version: 2.4.4 connecting to: 127.0.0.1:8230/test $ sudo -H -u _mongodb bashs -l -c "/data/srv/current/config/mongodb/manage status" mongodb is RUNNING, PID 490199
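A restarted service may need a few seconds before its status reports RUNNING. The following is a minimal sketch of a retry loop around the status check. `wait_until_running` is a hypothetical helper, the retry count and delay are arbitrary, and it assumes the status line keeps the `is RUNNING, PID <n>` form shown above.

```shell
# wait_until_running TRIES DELAY CMD [ARGS...]
# Run CMD up to TRIES times, sleeping DELAY seconds between attempts,
# until its output contains "is RUNNING, PID <number>".
# Returns 0 on success, 1 if the service never reported RUNNING.
wait_until_running() {
  wur_tries=$1
  wur_delay=$2
  shift 2
  while [ "$wur_tries" -gt 0 ]; do
    if "$@" 2>/dev/null | grep -q 'is RUNNING, PID [0-9]'; then
      return 0
    fi
    wur_tries=$((wur_tries - 1))
    # Only sleep if another attempt follows.
    [ "$wur_tries" -gt 0 ] && sleep "$wur_delay"
  done
  return 1
}
```

For example, `wait_until_running 6 10 sudo -H -u _mongodb bashs -l -c "/data/srv/current/config/mongodb/manage status"` would poll the status command from this procedure for up to about a minute. If it fails, follow the escalation steps at the top of this document.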
CMSWEB_OVERVIEW_IS_DOWN or CMSWEB_OVERVIEW_IS_NOT_RESPONDING
-
Check service status. It should say the service is RUNNING, with two process ids.
$ sudo -H -u _overview bashs -l -c "/data/srv/current/config/overview/manage status" overview is RUNNING, PID 25055 25211
-
Check whether the server processes are running. It should list two monGui python processes and one rotatelogs process.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_overview' _overview 25055 0.0 0.1 103172 4676 ? Ss Mar 14 00:00:00 python /data/srv/be1103a/sw/slc5_amd64_gcc434/cms/overview/6.0.1/bin/monGui /data/srv/current/config/overview/server-conf.py _overview 25210 0.0 0.0 15056 652 ? S Mar 14 00:00:00 \_ rotatelogs /data/srv/logs/overview/weblog-%Y%m%d.log 86400 _overview 25211 7.0 70.1 3036108 2842616 ? Sl Mar 14 01:42:41 \_ python /data/srv/be1103a/sw/slc5_amd64_gcc434/cms/overview/6.0.1/bin/monGui /data/srv/current/config/overview/server-conf.py
-
See if the service is listening to its port. It should be one of the processes listed above.
$ sudo /usr/sbin/lsof -P -i :9000 COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME python 25211 _overview 4u IPv4 88299354 TCP *:9000 (LISTEN)
-
Restart the service. It should end by saying the service is RUNNING, with two process ids.
$ sudo -H -u _overview bashs -l -c "/data/srv/current/config/overview/manage restart 'I did read documentation'" stopping overview Stopping server at port 9000 in /data/srv/state/overview: 25055.... starting overview Starting server at port 9000 in /data/srv/state/overview $ sudo -H -u _overview bashs -l -c "/data/srv/current/config/overview/manage status" overview is RUNNING, PID 13198 13206
CMSWEB_PHEDEX_DATASVC_IS_DOWN or CMSWEB_PHEDEX_DATASVC_IS_NOT_RESPONDING or CMSWEB_PHEDEX_WEBAPP_IS_DOWN or CMSWEB_PHEDEX_WEBAPP_IS_NOT_RESPONDING or CMSWEB_PHEDEX_WEB_IS_DOWN or CMSWEB_PHEDEX_WEB_IS_NOT_RESPONDING (the description for CMSWEB_PHEDEX_GRAPHS* alarms comes later)
-
Note that PhEDEx Web, PhEDEx WebApp and PhEDEx Data Service use the same Apache container, so restarting one will restart all three.
-
Check service status. It should say phedex and phedex grapher are RUNNING, with several process ids in the former.
$ sudo -H -u _phedex bashs -l -c "/data/srv/current/config/phedex/manage status" phedex is RUNNING, PID 11068 11644 17432 25398 26950 28201 29420 phedex grapher is RUNNING, PID 25423
-
Check whether the server processes are running. This should list several httpd + rotatelogs processes. For CMSWEB_PHEDEX_WEB_* exceptions there should be one python process with “PlotConfig” in the path name.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_phedex' _phedex 25398 0.0 0.2 134672 10596 ? Ss Mar 14 00:00:00 /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/httpd -f /data/srv/state/phedex/server.conf -DPHEDEX_DATASVC -DPHEDEX_WEBAPP -DPHEDEX_WEB _phedex 25401 0.0 0.0 17520 828 ? S Mar 14 00:00:00 \_ /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/rotatelogs -f /data/srv/logs/phedex/error_log_%Y%m%d.txt 86400 _phedex 25402 0.0 0.0 17520 680 ? S Mar 14 00:00:00 \_ /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/rotatelogs -f /data/srv/logs/phedex/phedex_web_error_log_%Y%m%d 86400 _phedex 25403 0.0 0.0 17520 680 ? S Mar 14 00:00:00 \_ /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/rotatelogs -f /data/srv/logs/phedex/phedex_webapp_error_log_%Y%m%d 86400 _phedex 25404 0.0 0.0 17520 680 ? S Mar 14 00:00:00 \_ /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/rotatelogs -f /data/srv/logs/phedex/phedex_datasvc_error_log_%Y%m%d 86400 _phedex 25405 0.0 0.0 17520 680 ? S Mar 14 00:00:00 \_ /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/rotatelogs -f /data/srv/logs/phedex/access_log_%Y%m%d.txt 86400 _phedex 25406 0.0 0.0 17520 824 ? S Mar 14 00:00:00 \_ /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/rotatelogs -f /data/srv/logs/phedex/phedex_web_access_log_%Y%m%d 86400 _phedex 25407 0.0 0.0 17520 828 ? S Mar 14 00:00:00 \_ /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/rotatelogs -f /data/srv/logs/phedex/phedex_webapp_access_log_%Y%m%d 86400 _phedex 25408 0.0 0.0 17520 828 ? S Mar 14 00:00:00 \_ /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/rotatelogs -f /data/srv/logs/phedex/phedex_datasvc_access_log_%Y%m%d 86400 _phedex 11644 0.0 0.9 166072 38248 ? 
S 09:27:06 00:00:02 \_ /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/httpd -f /data/srv/state/phedex/server.conf -DPHEDEX_DATASVC -DPHEDEX_WEBAPP -DPHEDEX_WEB _phedex 17432 0.0 0.8 163000 34840 ? S 09:54:42 00:00:01 \_ /data/srv/state/phedex/htdocs/WebSite/access25 -f /data/srv/state/phedex/server.conf -DPHEDEX_DATASVC -DPHEDEX_WEBAPP -DPHEDEX_WEB _phedex 26950 0.0 0.8 162660 34500 ? S 10:30:53 00:00:01 \_ /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/httpd -f /data/srv/state/phedex/server.conf -DPHEDEX_DATASVC -DPHEDEX_WEBAPP -DPHEDEX_WEB _phedex 28201 0.0 0.8 162736 34388 ? S 10:37:24 00:00:01 \_ /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/httpd -f /data/srv/state/phedex/server.conf -DPHEDEX_DATASVC -DPHEDEX_WEBAPP -DPHEDEX_WEB _phedex 29420 0.0 0.8 162736 34384 ? S 10:42:55 00:00:01 \_ /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/httpd -f /data/srv/state/phedex/server.conf -DPHEDEX_DATASVC -DPHEDEX_WEBAPP -DPHEDEX_WEB _phedex 11068 0.0 0.8 162736 34284 ? S 11:39:30 00:00:00 \_ /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/httpd -f /data/srv/state/phedex/server.conf -DPHEDEX_DATASVC -DPHEDEX_WEBAPP -DPHEDEX_WEB _phedex 25423 0.0 0.6 478816 27600 ? Sl Mar 14 00:00:26 python -u /data/srv/be1103a/sw/slc5_amd64_gcc434/cms/PHEDEX-web/4.0.1/Documentation/WebSite/PlotConfig/tools/phedex-web.py /data/srv/state/phedex/etc/graphs_website_prod.xml _phedex 25424 0.0 0.0 17508 828 ? S Mar 14 00:00:00 rotatelogs /data/srv/logs/phedex/graphs-%Y%m%d.log 86400
-
See if the service is listening to its port. It should list several httpd processes from the process listing above (not all of them; there are two sets of httpd processes).
# CMSWEB_PHEDEX_DATASVC_IS_DOWN, CMSWEB_PHEDEX_DATASVC_IS_NOT_RESPONDING # CMSWEB_PHEDEX_WEBAPP_IS_DOWN, CMSWEB_PHEDEX_WEBAPP_IS_NOT_RESPONDING $ sudo /usr/sbin/lsof -P -i :7001 COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME httpd 11068 _phedex 4u IPv6 88298090 TCP *:7001 (LISTEN) httpd 11644 _phedex 4u IPv6 88298090 TCP *:7001 (LISTEN) httpd 17432 _phedex 4u IPv6 88298090 TCP *:7001 (LISTEN) httpd 25398 _phedex 4u IPv6 88298090 TCP *:7001 (LISTEN) httpd 26950 _phedex 4u IPv6 88298090 TCP *:7001 (LISTEN) httpd 28201 _phedex 4u IPv6 88298090 TCP *:7001 (LISTEN) httpd 29420 _phedex 4u IPv6 88298090 TCP *:7001 (LISTEN) # CMSWEB_PHEDEX_WEB_IS_DOWN, CMSWEB_PHEDEX_WEB_IS_NOT_RESPONDING $ sudo /usr/sbin/lsof -P -i :7101 | grep LISTEN httpd 14202 _phedex 8u IPv6 90379573 TCP *:7101 (LISTEN) httpd 14216 _phedex 8u IPv6 90379573 TCP *:7101 (LISTEN) httpd 14217 _phedex 8u IPv6 90379573 TCP *:7101 (LISTEN) httpd 14218 _phedex 8u IPv6 90379573 TCP *:7101 (LISTEN) httpd 14220 _phedex 8u IPv6 90379573 TCP *:7101 (LISTEN) httpd 14253 _phedex 8u IPv6 90379573 TCP *:7101 (LISTEN) httpd 14771 _phedex 8u IPv6 90379573 TCP *:7101 (LISTEN) httpd 15379 _phedex 8u IPv6 90379573 TCP *:7101 (LISTEN) httpd 15399 _phedex 8u IPv6 90379573 TCP *:7101 (LISTEN) httpd 15591 _phedex 8u IPv6 90379573 TCP *:7101 (LISTEN) httpd 15646 _phedex 8u IPv6 90379573 TCP *:7101 (LISTEN)
-
Restart the service. It should end by saying httpd started OK, and the status should say RUNNING, with several process ids. It should also say the graph server is RUNNING, with its process id.
$ sudo -H -u _phedex bashs -l -c "/data/srv/current/config/phedex/manage restart 'I did read documentation'" stopping phedex Stopping 29420 (29420 10:42 /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/httpd -f /data/srv/state/phedex/server.conf -DPHEDEX_DATASVC -DPHEDEX_WEBAPP): Stopping 26950 (26950 10:30 /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/httpd -f /data/srv/state/phedex/server.conf -DPHEDEX_DATASVC -DPHEDEX_WEBAPP): Stopping 25398 (25398 Mar 14 /data/srv/be1103a/sw/slc5_amd64_gcc434/external/apache2/2.2.16gsi-cmp2/bin/httpd -f /data/srv/state/phedex/server.conf -DPHEDEX_DATASVC): Stopping 25423 (25423 Mar 14 python -u /data/srv/be1103a/sw/slc5_amd64_gcc434/cms/PHEDEX-web/4.0.1/Documentation/WebSite/PlotConfig/tools/phedex-web.py /data/srv/state/phedex/etc/graphs_website_prod.xml): starting phedex Stopping httpd: [FAILED] Starting httpd: [ OK ] $ sudo -H -u _phedex bashs -l -c "/data/srv/current/config/phedex/manage status" phedex is RUNNING, PID 14202 14215 14216 14217 14218 14220 14253 phedex grapher is RUNNING, PID 14226
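Since the status reports two components here, both lines need to be RUNNING before the ticket can be resolved. A minimal sketch of that check, assuming the status lines start with the labels `phedex` and `phedex grapher` exactly as in the sample above (`all_running` is a hypothetical helper name):

```shell
# all_running LABEL...: read status output on stdin and succeed only if
# every LABEL has a line of the form "LABEL is RUNNING, PID <number>".
# The first missing label is reported on stderr.
all_running() {
  ar_text=$(cat)
  for ar_label in "$@"; do
    if ! printf '%s\n' "$ar_text" | grep -q "^${ar_label} is RUNNING, PID [0-9]"; then
      echo "not RUNNING: ${ar_label}" >&2
      return 1
    fi
  done
  return 0
}
```

Pipe the output of the `manage status` command above into `all_running 'phedex' 'phedex grapher'`. The labels are anchored at the start of the line, so `phedex` does not accidentally match the `phedex grapher` line.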
CMSWEB_PHEDEX_GRAPHS_IS_DOWN or CMSWEB_PHEDEX_GRAPHS_IS_NOT_RESPONDING
-
Check whether the server processes are running. This should list a python process and rotatelogs.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_phedex' | grep -i graph _phedex 14226 1.7 1.2 441584 50052 pts/0 Sl 11:49:58 00:00:01 python -u /data/srv/be1103a/sw/slc5_amd64_gcc434/cms/PHEDEX-web/4.0.1/Documentation/WebSite/PlotConfig/tools/phedex-web.py /data/srv/state/phedex/etc/graphs_website_prod.xml _phedex 14227 0.0 0.0 17508 888 pts/0 S 11:49:58 00:00:00 rotatelogs /data/srv/logs/phedex/graphs-%Y%m%d.log 86400
-
See if the service is listening to its port. It should be the process listed above.
$ sudo /usr/sbin/lsof -P -i :7102 | grep LISTEN python 14226 _phedex 3u IPv4 90379739 TCP localhost.localdomain:7102 (LISTEN)
-
Restart the service. Status should say the graph service is RUNNING, with process id. It will also mention httpd is running; you can ignore that part.
$ sudo -H -u _phedex bashs -l -c "/data/srv/current/config/phedex/manage rgraphs 'I did read documentation'" Stopping 14226 (14226 11:49 python -u /data/srv/be1103a/sw/slc5_amd64_gcc434/cms/PHEDEX-web/4.0.1/Documentation/WebSite/PlotConfig/tools/phedex-web.py /data/srv/state/phedex/etc/graphs_website_prod.xml ): $ sudo -H -u _phedex bashs -l -c "/data/srv/current/config/phedex/manage status" phedex is RUNNING, PID 14202 14216 14217 14218 14220 14253 14771 phedex grapher is RUNNING, PID 14878
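The PhEDEx alarms map to three different ports in the listings above (7001, 7101 and 7102). As a sketch, the lookup could be written as a small function. `phedex_port_for_alarm` is a hypothetical name, and the port numbers are taken from the sample `lsof` commands in this document, not from the PhEDEx configuration.

```shell
# phedex_port_for_alarm ALARM: print the port to inspect with lsof for a
# given PhEDEx alarm name. Returns 1 for names it does not recognise.
phedex_port_for_alarm() {
  case $1 in
    CMSWEB_PHEDEX_DATASVC_*|CMSWEB_PHEDEX_WEBAPP_*) echo 7001 ;;
    CMSWEB_PHEDEX_WEB_*)    echo 7101 ;;
    CMSWEB_PHEDEX_GRAPHS_*) echo 7102 ;;
    *) echo "unknown PhEDEx alarm: $1" >&2; return 1 ;;
  esac
}
```

For example, `sudo /usr/sbin/lsof -P -i :"$(phedex_port_for_alarm CMSWEB_PHEDEX_GRAPHS_IS_DOWN)"` would run the port check from the graphs procedure.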
CMSWEB_REQMGR_IS_DOWN or CMSWEB_REQMGR_IS_NOT_RESPONDING
-
Check service status. It should say the service is RUNNING, with process id.
$ sudo -H -u _reqmgr bashs -l -c "/data/srv/current/config/reqmgr/manage status" reqmgr is RUNNING, PID 3719
-
Check whether the server processes are running. There should be one python process and one rotatelogs process.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_reqmgr[^2]' _reqmgr 3719 4.8 4.4 680020 360176 ? Sl 20:09:18 01:04:10 python -u -m WMCore.WebTools.Root --ini=/data/srv/current/config/reqmgr/ReqMgrConfig.py _reqmgr 3720 0.0 0.0 15056 716 ? S 20:09:18 00:00:00 rotatelogs /data/srv/logs/reqmgr/reqmgr-%Y%m%d.log 86400
-
See if the service is listening to its port. It should be the process listed above.
$ sudo /usr/sbin/lsof -P -i :8240 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME python 3719 _reqmgr 3u IPv4 19943 0t0 TCP *:8240 (LISTEN)
-
Restart the service. It should end by saying the service is RUNNING, with process id.
$ sudo -H -u _reqmgr bashs -l -c "/data/srv/current/config/reqmgr/manage restart 'I did read documentation'" stopping reqmgr Stopping 3719 (3719 20:09 python -u -m WMCore.WebTools.Root --ini=/data/srv/current/config/reqmgr/ReqMgrConfig.py): starting reqmgr $ sudo -H -u _reqmgr bashs -l -c "/data/srv/current/config/reqmgr/manage status" reqmgr is RUNNING, PID 492484
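The process check in this procedure filters with `grep '^_reqmgr[^2]'`. The `[^2]` requires a character other than `2` after the account name, which keeps processes owned by `_reqmgr2` out of the listing. An alternative sketch that compares the user column exactly, assuming the user name is the first whitespace-separated field as in the `ps` format used above (`lines_for_user` is a hypothetical helper name):

```shell
# lines_for_user USER: read `ps` output on stdin and print only the lines
# whose first column is exactly USER, so "_reqmgr" does not also match
# "_reqmgr2".
lines_for_user() {
  awk -v user="$1" '$1 == user'
}
```

With this filter the same call works for both accounts, e.g. `ps f -A -o user:10,pid,args | lines_for_user _reqmgr`.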
CMSWEB_REQMGR2_IS_DOWN or CMSWEB_REQMGR2_IS_NOT_RESPONDING
-
Check service status. It should say the service is RUNNING, with process id.
$ sudo -H -u _reqmgr2 bashs -l -c "/data/srv/current/config/reqmgr2/manage status" reqmgr2 is RUNNING, PID 3933
-
Check whether the server processes are running. There should be two python processes and one rotatelogs process.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_reqmgr2' _reqmgr2 3932 0.0 0.0 15056 716 ? S 20:09:21 00:00:00 rotatelogs /data/srv/logs/reqmgr2/reqmgr2-%Y%m%d.log 86400 _reqmgr2 3934 0.0 0.1 101396 9064 ? S 20:09:21 00:00:00 python /data/srv/beHG1309c/sw/slc5_amd64_gcc461/cms/reqmgr2/0.9.79a/bin/wmc-httpd -v -d /data/srv/state/reqmgr2 -l |rotatelogs /data/srv/logs/reqmgr2/reqmgr2-%Y%m%d.log 86400 /data/srv/current/config/reqmgr2/config.py _reqmgr2 3957 0.0 0.2 115360 17244 ? Sl 20:09:21 00:01:13 \_ python /data/srv/beHG1309c/sw/slc5_amd64_gcc461/cms/reqmgr2/0.9.79a/bin/wmc-httpd -v -d /data/srv/state/reqmgr2 -l |rotatelogs /data/srv/logs/reqmgr2/reqmgr2-%Y%m%d.log 86400 /data/srv/current/config/reqmgr2/config.py
-
See if the service is listening to its port. It should be the process listed above.
$ sudo /usr/sbin/lsof -P -i :8246 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME python 3957 _reqmgr2 3u IPv4 19383 0t0 TCP *:8246 (LISTEN)
-
Restart the service. It should end by saying the service is RUNNING, with process id.
$ sudo -H -u _reqmgr2 bashs -l -c "/data/srv/current/config/reqmgr2/manage restart 'I did read documentation'" stopping reqmgr2 Killing reqmgr2 pgid 3933 . starting reqmgr2 reqmgr2 not running, not killing $ sudo -H -u _reqmgr2 bashs -l -c "/data/srv/current/config/reqmgr2/manage status" reqmgr2 is RUNNING, PID 495352
CMSWEB_REQMON_IS_DOWN or CMSWEB_REQMON_IS_NOT_RESPONDING
-
Check service status. It should say the service is RUNNING, with process id.
$ sudo -H -u _reqmon bashs -l -c "/data/srv/current/config/reqmon/manage status"
wmstatsserver is RUNNING, PID 5236
-
Check whether the server processes are running. There should be two python processes (a parent and its child) and one rotatelogs process.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_reqmon'
_reqmon 5235 0.0 0.0 19428 828 pts/0 S 11:46:34 00:00:00 rotatelogs /data/srv/logs/reqmon/reqmon-%Y%m%d.log 86400
_reqmon 5237 0.0 0.2 143504 10104 ? S 11:46:34 00:00:00 python /data/srv/HG1509j/sw.pre/slc6_amd64_gcc481/cms/reqmon/1.0.9_cmsweb.patch1/bin/wmc-httpd -r -d /data/srv/state/reqmon -l |rotatelogs /data/srv/logs/reqmon/reqmon-%Y%m%d.log 86400 /data/srv/current/config/reqmon/config.py
_reqmon 5238 0.1 0.4 1135736 16864 ? Sl 11:46:34 00:00:02 \_ python /data/srv/HG1509j/sw.pre/slc6_amd64_gcc481/cms/reqmon/1.0.9_cmsweb.patch1/bin/wmc-httpd -r -d /data/srv/state/reqmon -l |rotatelogs /data/srv/logs/reqmon/reqmon-%Y%m%d.log 86400 /data/srv/current/config/reqmon/config.py
-
Check that the service is listening on its port. The listening process should be the child python process listed above.
$ sudo /usr/sbin/lsof -P -i :8249
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python 5238 _reqmon 4u IPv4 9487927 0t0 TCP *:8249 (LISTEN)
-
Restart the service, then check its status again. The status check should report the service as RUNNING, with a new process id.
$ sudo -H -u _reqmon bashs -l -c "/data/srv/current/config/reqmon/manage restart 'I did read documentation'"
stopping reqmon
Killing wmstatsserver pgid 5236 .
starting reqmon
wmstatsserver not running, not killing
$ sudo -H -u _reqmon bashs -l -c "/data/srv/current/config/reqmon/manage status"
wmstatsserver is RUNNING, PID 16470
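The process-count step (two python, one rotatelogs) can also be checked mechanically instead of by eye. The sketch below is one possible way, not an official tool. The helper name expect_procs is hypothetical, and it assumes the interpreter's command name is exactly python on the node.

```shell
# Hypothetical helper: reads one command name per line on stdin and succeeds
# only if exactly $2 of those lines are equal to $1.
expect_procs() {
    # grep -c prints the count but exits non-zero on zero matches, hence "|| true".
    actual=$(grep -cxF -- "$1" || true)
    [ "$actual" -eq "$2" ]
}

# Example use on a node. `ps -u USER -o comm=` prints only the command name
# of each process owned by USER, without a header line:
#   names=$(ps -u _reqmon -o comm=)
#   printf '%s\n' "$names" | expect_procs python 2 || echo "unexpected python count"
#   printf '%s\n' "$names" | expect_procs rotatelogs 1 || echo "unexpected rotatelogs count"
```

Counting command names avoids the false matches a plain grep on the full ps line would give, since the python command lines also contain the string rotatelogs.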
CMSWEB_SITEDB_IS_DOWN or CMSWEB_SITEDB_IS_NOT_RESPONDING
-
Check service status. It should show that both sitedb and the legacy sitedb are RUNNING, with one process id for each.
$ sudo -H -u _sitedb bashs -l -c "/data/srv/current/config/sitedb/manage status"
sitedb is RUNNING, PID 4909
sitedb legacy is RUNNING, PID 4912
-
Check whether the server processes are running. There should be the python wmc-httpd server (a parent and its child process), one python cmsWeb process, and two rotatelogs processes, one for each log.
$ ps f -A -o user:10,pid,%cpu,%mem,vsz,rss,tty,stat,start,time,args | grep '^_sitedb'
_sitedb 4908 0.0 0.0 15052 708 ? S 12:07:47 00:00:00 rotatelogs /data/srv/logs/sitedb/sitedb-%Y%m%d.log 86400
_sitedb 4910 0.0 0.0 101992 9516 ? S 12:07:47 00:00:00 python /data/srv/be1111h/sw/slc5_amd64_gcc461/cms/sitedb/2.0.0/bin/wmc-httpd -r -d /data/srv/state/sitedb -l |rotatelogs /data/srv/logs/sitedb/sitedb-%Y%m%d.log 86400 /data/srv/current/config/sitedb/config.py
_sitedb 4911 0.0 0.1 239748 25920 ? Sl 12:07:47 00:00:01 \_ python /data/srv/be1111h/sw/slc5_amd64_gcc461/cms/sitedb/2.0.0/bin/wmc-httpd -r -d /data/srv/state/sitedb -l |rotatelogs /data/srv/logs/sitedb/sitedb-%Y%m%d.log 86400 /data/srv/current/config/sitedb/config.py
_sitedb 4912 0.0 0.2 374116 51728 ? Sl 12:07:47 00:00:15 python -m cmsWeb --base-url https://cmsweb.cern.ch/sitedb -p 8055 --default-page /sitelist/ --my-sitedb-ini=/data/srv/current/auth/sitedb/sitedb.ini
_sitedb 4913 0.0 0.0 15052 704 ? S 12:07:47 00:00:00 rotatelogs /data/srv/logs/sitedb/sitedb-legacy-%Y%m%d.log 86400
-
Check that the service is listening on both of its ports: 8055 for the legacy cmsWeb server and 8051 for wmc-httpd. The listening processes should be the ones listed above.
$ sudo /usr/sbin/lsof -P -i :8055
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
python 4912 _sitedb 3u IPv4 24748 TCP *:8055 (LISTEN)
$ sudo /usr/sbin/lsof -P -i :8051
COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME
python 4911 _sitedb 6u IPv4 24130 TCP *:8051 (LISTEN)
-
Restart the service, then check its status again. The status check should report both sitedb and sitedb legacy as RUNNING, with a new process id for each.
$ sudo -H -u _sitedb bashs -l -c "/data/srv/current/config/sitedb/manage restart 'I did read documentation'"
stopping sitedb
Killing sitedb pgid 4909 .
Stopping 4912 (4912 12:07 python -m cmsWeb --base-url https://cmsweb.cern.ch/sitedb):
starting sitedb
sitedb not running, not killing
$ sudo -H -u _sitedb bashs -l -c "/data/srv/current/config/sitedb/manage status"
sitedb is RUNNING, PID 585
sitedb legacy is RUNNING, PID 587
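For the port check, the PID reported by lsof is what gets compared against the ps listing. The sketch below pulls that PID out of the lsof output so the comparison is easier to do. The helper name listening_pids is hypothetical, and it assumes the column layout shown in the lsof examples in this document (PID in the second column, state in the last).

```shell
# Hypothetical helper: reads `lsof -P -i :PORT` output on stdin and prints the
# PID (second column) of every line whose last column is "(LISTEN)".
listening_pids() {
    awk '$NF == "(LISTEN)" { print $2 }'
}

# Example use on a node, for the two SiteDB ports shown above:
#   for port in 8055 8051; do
#       echo "port $port: $(sudo /usr/sbin/lsof -P -i ":$port" | listening_pids)"
#   done
```

An empty result for a port means nothing is listening there, which matches the IS_NOT_RESPONDING situation this procedure is meant to diagnose.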
CMSWEB_WORKQUEUE_IS_DOWN or CMSWEB_WORKQUEUE_IS_NOT_RESPONDING
At the moment this service is merely a collection of cron jobs that run periodically. If these alarms are raised, the cron jobs have failed; there is nothing to restart. Simply inform the service managers about the alarm as described at the top of this document.
CMSWEB_DATA_FULL
This means the data partition hosting the services is nearly full. Simply inform the service managers about the alarm as described at the top of this document. Do NOT try to clean up anything.
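When reporting this alarm it can help to quote the current usage figure in the ticket. The following read-only sketch extracts that figure from df output and removes nothing from the disk. The helper name usage_percent is hypothetical, and the mount point /data in the usage example is an assumption based on the /data/srv paths used throughout this document.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints the Capacity
# value of the first filesystem line as a bare number (no % sign).
usage_percent() {
    awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Example use on a node, ASSUMING the data partition is mounted on /data:
#   echo "data partition usage: $(df -P /data | usage_percent)%"
```

The -P option asks df for the POSIX output format, which keeps each filesystem on a single line so the fifth column is always the capacity.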