This page describes how to maintain DM/WM software monitoring specifications. All the code resides in the DM/WM GitHub deployment repository. Each service has a sub-directory containing the deployment script (deploy), the server management script (manage), the monitoring specs, and any configuration files except security-sensitive authentication data.
Please do not commit any changes directly. Submit tested changes as pull requests, following the dev-git guidelines. The pull request should include all changes pertaining to the service: deploy scripts, manage scripts, monitoring specs and server configuration files. We will review your patch and may request changes. Once the changes are approved, we will commit them to GitHub for you. Assume that your deployment action may be progressively cleaned up with every new server version you submit.
A central ServerMonitor script scans the project configuration directories $root/current/config/* for files whose names match monitoring*.ini. It reloads any recently changed configuration files, and probes all servers every 30 seconds or so. The monitor looks for three things: whether server processes are running, whether server logs contain any error messages, and whether the server responds to HTTP requests. Each server monitor file specifies how to monitor one server.
The server monitor collects all this data and feeds it into a log file, which is then scanned by site monitoring tools (at CERN, the lemon system). If expected output isn’t found in this feed, alarms will eventually be raised, operators notified and action taken. For a server to be properly monitored, the following are needed:
The service can have one or more monitoring spec files. If it has just one, it should be called monitoring.ini, and in this case the “stem” is the project directory name, myapp for $root/current/config/myapp/monitoring.ini. If it has several, the files should be called monitoring-*.ini, and the “stem” is whatever is matched by the star. The stem is what is used to report monitoring results in the output feed.
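The naming rule above can be sketched in Python as follows. This is only an illustration and is not taken from ServerMonitor itself: the function name monitoring_stem and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def monitoring_stem(spec_path: str) -> str:
    """Return the stem under which results for a spec file are reported.

    Hypothetical helper illustrating the naming rule; it is not part of
    ServerMonitor.
    """
    path = PurePosixPath(spec_path)
    name = path.name
    if name == "monitoring.ini":
        # Single spec file: the stem is the project directory name.
        return path.parent.name
    prefix, suffix = "monitoring-", ".ini"
    if (name.startswith(prefix) and name.endswith(suffix)
            and len(name) > len(prefix) + len(suffix)):
        # Several spec files: the stem is whatever the star matched.
        return name[len(prefix):-len(suffix)]
    raise ValueError(f"not a monitoring spec file: {spec_path}")


# Example paths are made up for illustration.
print(monitoring_stem("/data/current/config/myapp/monitoring.ini"))         # myapp
print(monitoring_stem("/data/current/config/myapp/monitoring-agents.ini"))  # agents
```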
The file can have the following definitions:
LOG_FILES='*.log'
LOG_ERROR_REGEX='^Traceback .most recent|\] HTTP Traceback'
PS_REGEX="Root.py.*[/]T0MonConfig.py"
PING_URL="http://localhost:8300/"
PING_REGEX=".*T0Mon main page.*"
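Before submitting a spec it can help to try LOG_ERROR_REGEX against a few log lines. The sketch below uses Python's re module and assumes the monitor interprets the pattern in a compatible way, which this page does not state. The sample log lines and the helper name is_error_line are invented for illustration.

```python
import re

# Pattern copied from the LOG_ERROR_REGEX definition above.
LOG_ERROR_REGEX = r'^Traceback .most recent|\] HTTP Traceback'


def is_error_line(line: str) -> bool:
    """Return True if the log line matches LOG_ERROR_REGEX."""
    return re.search(LOG_ERROR_REGEX, line) is not None


# Sample log lines are invented for illustration.
samples = [
    "Traceback (most recent call last):",
    "[28/Feb/2011:12:10:07] HTTP Traceback (most recent call last):",
    "[28/Feb/2011:12:10:08] ENGINE Bus STARTED",
]
for line in samples:
    print(is_error_line(line), line)
```

With these samples, the first two lines are flagged and the third is not.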
You can check for lemon errors with sudo /usr/sbin/lemon-host-check -sh. If there are no exceptions, the output will look like this:
$ sudo /usr/sbin/lemon-host-check -sh
[INFO] lemon-host-check version 1.3.3 started by root at Mon Feb 28 12:26:23 2011 on vocms106.cern.ch
[VERB] Exceptions: 0 - Running actuators: 0 - Disabled exceptions: 0 - State: Production
The typical monitoring feed in $root/logs/admin/feed.lemon consists of a series of lines like these:
Feb 28 12:10:11 t0mon PING OK <http://localhost:8300/>
Feb 28 12:10:12 t0mon PROCESS pid:27446
If the server isn’t running, then PROCESS lines are simply missing and lemon will eventually raise an alarm. A failed PING might look like this:
Feb 28 12:10:07 phedex-web PING FAILED url:<http://localhost:7101/phedex>
An error scanned from server logs might look like this:
Feb 28 12:10:07 sitedb ERROR file:<.../sitedb-20110228.log>:50149 line:<[snip] error: Traceback (most recent call last):>
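When inspecting the feed by hand, a small parser can split each line into its parts. The sketch below is inferred only from the sample lines above (timestamp, stem, record type, remainder); it is not the parser that lemon or ServerMonitor uses, and the names FeedEntry and parse_feed_line are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class FeedEntry(NamedTuple):
    timestamp: str  # e.g. "Feb 28 12:10:11"
    stem: str       # e.g. "t0mon"
    kind: str       # e.g. "PING", "PROCESS", "ERROR"
    detail: str     # rest of the line, left unparsed


# Layout inferred from the sample feed lines; an assumption, not a spec.
FEED_LINE = re.compile(
    r"^(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<stem>\S+)\s+(?P<kind>[A-Z]+)\s*(?P<detail>.*)$"
)


def parse_feed_line(line: str) -> Optional[FeedEntry]:
    """Parse one feed line, or return None if it does not fit the layout."""
    match = FEED_LINE.match(line.strip())
    if match is None:
        return None
    return FeedEntry(**match.groupdict())


entry = parse_feed_line("Feb 28 12:10:07 phedex-web PING FAILED url:<http://localhost:7101/phedex>")
print(entry.stem, entry.kind, entry.detail)
```

Filtering parsed entries by kind and stem gives a quick view of which servers reported PROCESS lines in a given interval.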
Here is an example configuration that can be used to monitor the DAS web server, along with sample output and its explanation:
# Monitor DAS web server process running under _das account
PROCESS_OWNER="_das"
PROCESS_REGEX_NAME="/DAS/web/das_server.*"
PROCESS_ACTIVITY="cpu,mem,threads,user,system,rss,vms"
With this configuration you’ll see the following output in the feed log file:
Jul 17 17:33:13 _das SYSTEM PID: 8635, NAME: das_server.py, STATUS: sleeping,
CPU: 0.0%, MEM: 1.7%, THR: 121, USER: 9615.04,
SYSTEM: 1558.9, RSS: 36139008, VMS: 1451147264
This output consists of the following metrics:
ENABLE_IF='app_is_enabled and host == "vocms138"'
LOG_FILES='offline/agent-*.log'
LOG_ERROR_REGEX='ERROR|WARNING'
LOG_ERROR_REGEX='Traceback \(most recent|.*cherrypy\.error.*|File ".*", line \d+, in'
REPORT_FILES='offline/agent-*.log'
REPORT_REGEX='\[(?P<AGENT>[A-Za-z0-9]+)/\d+\] found (?P<QUEUE>\d+) new files'
PS_REGEX='[/]dqmgui/[-a-z0-9.]+/bin/visDQM[A-Za-z0-9]+(Daemon|Stager|Verifier|QuotaControl) .*/state/dqmgui/offline/agents'
The typical monitoring feed in $root/logs/admin/feed.lemon would then have lines like these:
Oct 15 11:56:10 dqm-offline-agents REPORT AGENT=visDQMRootFileQuotaControl QUEUE=15
Oct 15 11:56:40 dqm-offline-agents REPORT AGENT=visDQMVerControlDaemon QUEUE=2
Oct 15 11:57:10 dqm-offline-agents REPORT AGENT=visDQMImportDaemon QUEUE=8
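The relationship between REPORT_REGEX and the REPORT lines in the feed can be sketched as follows. Judging from the sample feed, each named group appears to be reported as NAME=value; the actual ServerMonitor code is not shown here, so treat this as an assumption. The agent log line and the helper name report_fields are hypothetical.

```python
import re
from typing import Optional

# Pattern copied from the REPORT_REGEX definition above.
REPORT_REGEX = re.compile(
    r'\[(?P<AGENT>[A-Za-z0-9]+)/\d+\] found (?P<QUEUE>\d+) new files'
)


def report_fields(line: str) -> Optional[str]:
    """Format the named groups of a matching line as NAME=value pairs."""
    match = REPORT_REGEX.search(line)
    if match is None:
        return None
    # Emit groups in the order in which they appear in the pattern.
    names = sorted(REPORT_REGEX.groupindex, key=REPORT_REGEX.groupindex.get)
    return " ".join(f"{name}={match.group(name)}" for name in names)


# Hypothetical agent log line; real log lines may look different.
print(report_fields("[visDQMImportDaemon/4321] found 8 new files"))
# AGENT=visDQMImportDaemon QUEUE=8
```

Lines that do not match the pattern yield None and would produce no REPORT entry in this sketch.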