CMSWEB service lemon alarms

Lemon metrics, alarms and recovery procedures

The Lemon metrics and exceptions are defined in the metrics manager.

General description and notes

There are two types of exceptions. Both are at importance level 2 and have their history stored:

  1. Raise an exception if the metric that counts service processes reports fewer than 1 process 5 times in 5 minutes.

  2. Raise an exception if the metric that counts server liveness checks has found no positive matches (PING OK < 1) and has found failures (PING FAILED > 0) 5 times in 5 minutes.
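
The two rules above can be sketched in a few lines of Python. This is only an illustration of the logic, not the Lemon configuration itself. The names `is_down`, `is_not_responding` and `ExceptionRule` are hypothetical. Reading "5 times in 5 minutes" as "at least 5 matching samples within a sliding 300-second window" is also an assumption.

```python
from collections import deque

WINDOW_SECONDS = 300   # "in 5 minutes" (assumed to be a sliding window)
REQUIRED_HITS = 5      # "5 times"


def is_down(process_count):
    """Rule 1: the process-count metric reports fewer than 1 process."""
    return process_count < 1


def is_not_responding(ping_ok, ping_failed):
    """Rule 2: no positive matches (PING OK < 1) and some failures (PING FAILED > 0)."""
    return ping_ok < 1 and ping_failed > 0


class ExceptionRule:
    """Hypothetical helper: raise once the condition matched often enough recently."""

    def __init__(self, required_hits=REQUIRED_HITS, window_seconds=WINDOW_SECONDS):
        self.required_hits = required_hits
        self.window_seconds = window_seconds
        self._hits = deque()  # timestamps (seconds) of samples that matched

    def observe(self, timestamp, condition_met):
        """Record one sample; return True if the exception should be raised."""
        if condition_met:
            self._hits.append(timestamp)
        # Forget matches that have fallen out of the sliding window.
        while self._hits and timestamp - self._hits[0] >= self.window_seconds:
            self._hits.popleft()
        return len(self._hits) >= self.required_hits


# Usage: one sample per minute, with the service process missing every time.
rule = ExceptionRule()
raised = [rule.observe(minute * 60, is_down(0)) for minute in range(5)]
print(raised)
```

With one sample per minute, the sketch raises the exception on the fifth consecutive bad sample. Matches spread more thinly than 5 per 300 seconds never reach the threshold.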

General recovery information

STEP 1

STEP 2

Alarm procedures

CMSWEB_CONFDB_IS_DOWN or CMSWEB_CONFDB_IS_NOT_RESPONDING

CMSWEB_COUCHDB_IS_DOWN or CMSWEB_COUCHDB_IS_NOT_RESPONDING

CMSWEB_CRABCACHE_IS_DOWN or CMSWEB_CRABCACHE_IS_NOT_RESPONDING

CMSWEB_CRABSERVER_IS_DOWN or CMSWEB_CRABSERVER_IS_NOT_RESPONDING

CMSWEB_DAS_WEB_IS_DOWN or CMSWEB_DAS_WEB_IS_NOT_RESPONDING or CMSWEB_DAS_KWS_IS_DOWN or CMSWEB_DAS_KWS_IS_NOT_RESPONDING

CMSWEB_DBS_IS_DOWN or CMSWEB_DBS_IS_NOT_RESPONDING

CMSWEB_DBSMIGRATION_IS_DOWN or CMSWEB_DBSMIGRATION_IS_NOT_RESPONDING

CMSWEB_DMWMMON_IS_DOWN or CMSWEB_DMWMMON_IS_NOT_RESPONDING

CMSWEB_DQM_DEV_AGENTS_IS_DOWN or CMSWEB_DQM_DEV_AGENTS_IS_NOT_RESPONDING or CMSWEB_DQM_OFFLINE_AGENTS_IS_DOWN or CMSWEB_DQM_OFFLINE_AGENTS_IS_NOT_RESPONDING or CMSWEB_DQM_RELVAL_AGENTS_IS_DOWN or CMSWEB_DQM_RELVAL_AGENTS_IS_NOT_RESPONDING

CMSWEB_DQM_DEV_WEB_IS_DOWN or CMSWEB_DQM_DEV_WEB_IS_NOT_RESPONDING or CMSWEB_DQM_OFFLINE_WEB_IS_DOWN or CMSWEB_DQM_OFFLINE_WEB_IS_NOT_RESPONDING or CMSWEB_DQM_RELVAL_WEB_IS_DOWN or CMSWEB_DQM_RELVAL_WEB_IS_NOT_RESPONDING

CMSWEB_FRONTEND_IS_DOWN or CMSWEB_FRONTEND_IS_NOT_RESPONDING

At the moment this is merely a collection of cron jobs which run from time to time. If these alarms are raised, it means the cron jobs have failed; there is nothing to restart. Simply inform the service managers about the alarm as described at the top of this document.

CMSWEB_GITWEB_IS_DOWN or CMSWEB_GITWEB_IS_NOT_RESPONDING

CMSWEB_MONGODB_IS_DOWN or CMSWEB_MONGODB_IS_NOT_RESPONDING

CMSWEB_OVERVIEW_IS_DOWN or CMSWEB_OVERVIEW_IS_NOT_RESPONDING

CMSWEB_PHEDEX_DATASVC_IS_DOWN or CMSWEB_PHEDEX_DATASVC_IS_NOT_RESPONDING or CMSWEB_PHEDEX_WEBAPP_IS_DOWN or CMSWEB_PHEDEX_WEBAPP_IS_NOT_RESPONDING or CMSWEB_PHEDEX_WEB_IS_DOWN or CMSWEB_PHEDEX_WEB_IS_NOT_RESPONDING (the description for CMSWEB_PHEDEX_GRAPHS* alarms comes later)

CMSWEB_PHEDEX_GRAPHS_IS_DOWN or CMSWEB_PHEDEX_GRAPHS_IS_NOT_RESPONDING

CMSWEB_REQMGR_IS_DOWN or CMSWEB_REQMGR_IS_NOT_RESPONDING

CMSWEB_REQMGR2_IS_DOWN or CMSWEB_REQMGR2_IS_NOT_RESPONDING

CMSWEB_REQMON_IS_DOWN or CMSWEB_REQMON_IS_NOT_RESPONDING

CMSWEB_SITEDB_IS_DOWN or CMSWEB_SITEDB_IS_NOT_RESPONDING

CMSWEB_WORKQUEUE_IS_DOWN or CMSWEB_WORKQUEUE_IS_NOT_RESPONDING

At the moment this service is merely a collection of cron jobs which run from time to time. If these alarms are raised, it means the cron jobs have failed; there is nothing to restart. Simply inform the service managers about the alarm as described at the top of this document.

CMSWEB_DATA_FULL

This means the data partition where the services are hosted is almost full. Simply inform the service managers about the alarm as described at the top of this document. Do NOT try to clean up anything.