Ganga/DIANE Monitoring
The Ganga/DIANE monitoring dashboard runs on port 80 at
http://gangamon.cern.ch; the underlying host is currently
voatlas90. You can view its Quattor templates
here
Instructions for users
How to configure Ganga to send monitoring messages?
.gangarc:
[MonitoringServices]
Executable/* = Ganga.Lib.MonitoringServices.MSGMS.MSGMS
[MSGMS]
server = ganga.msg.cern.ch
port = 6163
See also:
/afs/cern.ch/sw/arda/install/su3/2009/config-lostman.gear
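As a quick sanity check of the settings above, the fragment can be parsed with Python's standard INI parser. This is only a sketch: it assumes the .gangarc fragment is plain INI syntax, and the helper name read_msgms_settings is made up for illustration (it is not part of Ganga).

```python
import configparser

# Fragment copied from the .gangarc example above.
GANGARC_FRAGMENT = """
[MonitoringServices]
Executable/* = Ganga.Lib.MonitoringServices.MSGMS.MSGMS
[MSGMS]
server = ganga.msg.cern.ch
port = 6163
"""


def read_msgms_settings(text):
    """Return (server, port, monitoring class) from a .gangarc-style string.

    Hypothetical helper for illustration only; Ganga has its own config reader.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names such as "Executable/*" as written
    parser.read_string(text)
    server = parser.get("MSGMS", "server")
    port = parser.getint("MSGMS", "port")
    service = parser.get("MonitoringServices", "Executable/*")
    return server, port, service


if __name__ == "__main__":
    print(read_msgms_settings(GANGARC_FRAGMENT))
```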
How to configure DIANE to send monitoring messages?
Messages are sent automatically. Monitoring may be disabled in the run() function of the run file:
config.MSGMonitoring.MSG_MONITORING_ENABLED = False
Production service operations
The code is developed in SVN:
http://svnweb.cern.ch/world/wsvn/ganga/trunk/external/dashb
The code is deployed in:
/data/django/dashboard
MySQL DB access: configuration
/data/django/dashboard/server/monitoringsite/settings.py
We use gangage as a service account.
How to deploy a new version of the service?
Make sure that the developers have already tagged the code in SVN as dashboard-X-Y.
Login as
gangage@gangamon
For convenience define:
export DASHTAG=dashboard-X-Y
Export the SVN tag:
cd /data/django
svn export svn+ssh://svn.cern.ch/reps/ganga/tags/packages/$DASHTAG
Fix configuration strings and write-protect the release area:
cd /data/django/service
python configure-access.py $DASHTAG
chmod -R a-w /data/django/$DASHTAG
Switch the production code:
cd /data/django
rm -f dashboard && ln -s $DASHTAG dashboard
Restart the service.
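The "rm -f dashboard && ln -s ..." step leaves a short window in which the dashboard link does not exist. A possible alternative, sketched below, creates the new link under a temporary name and renames it over the old one. This is not the procedure used in production; the function name switch_symlink and the tags in the demonstration are made up for illustration, and os.replace requires Python 3.3 or later.

```python
import os
import tempfile


def switch_symlink(base_dir, tag, link_name="dashboard"):
    """Point base_dir/link_name at the release directory `tag`.

    A temporary link is created first and then renamed over the old one,
    so the link never disappears between the two steps.
    """
    target = os.path.join(base_dir, tag)
    if not os.path.isdir(target):
        raise ValueError("release directory not found: %s" % target)
    link_path = os.path.join(base_dir, link_name)
    tmp_path = link_path + ".new"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)  # clean up a leftover from an interrupted run
    os.symlink(tag, tmp_path)  # relative link, like `ln -s $DASHTAG dashboard`
    os.replace(tmp_path, link_path)  # rename(2): atomic on POSIX file systems
    return os.readlink(link_path)


if __name__ == "__main__":
    # Demonstration in a scratch directory, not in /data/django.
    scratch = tempfile.mkdtemp()
    for tag in ("dashboard-1-0", "dashboard-1-1"):  # made-up tags
        os.mkdir(os.path.join(scratch, tag))
        print(switch_symlink(scratch, tag))
```

The rename only works if the existing dashboard entry is itself a symbolic link (as in the procedure above), not a real directory.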
How to restart the service?
Login as
gangage@gangamon
The collector is restarted automatically via a crontab which is defined in Quattor.
All service related commands are in /data/django/service.
To disable the collector for maintenance (e.g. a code upgrade), do this:
cd /data/django/service
./switch_to_maintanance
......
./switch_to_production
./status_gm_service
Manual check if the collector is running:
ps aux | grep runcollector
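Note that a plain grep over the ps listing also matches the grep process itself. A small sketch of the same check that filters this out is shown below; the function name collector_lines is hypothetical and the check is purely textual, exactly like the shell one-liner.

```python
def collector_lines(ps_output, needle="runcollector"):
    """Return the lines of a `ps` listing that mention the collector.

    Lines belonging to a grep process are skipped, since `ps | grep`
    always matches its own grep command.
    """
    matches = []
    for line in ps_output.splitlines():
        if needle in line and "grep" not in line:
            matches.append(line.strip())
    return matches
```

It could be fed with the output of subprocess.check_output(["ps", "aux"], universal_newlines=True); an empty result means no collector process was found.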
Restart apache (as root):
sudo /etc/init.d/httpd restart
Automatic check that the service is available (with alarm)
Not implemented yet.
Creation of 'user name' lists
Both Gangamon and Dianemon present an interactive list of users as a starting point for interaction with the interface.
These lists are generated from data held in the gangamon_users database table, which flags a user as having used either Ganga, Diane, or both.
To populate this table we need to run over all users in the Ganga and Diane tables. This is currently done once every 5 minutes,
which seems to be a good compromise between how quickly a user might expect their jobs to appear on the display and how intensive
the query to extract user names is (the operation takes around 12 seconds to complete).
Rather than request a new cronjob for the host, we added a command to the script that is called every minute to check that the Gangamon services are running. The update script lives at /data/django/service/update_users.sh and is called from within /data/django/service/restart_gm_service_if_needed.
Note that although the update_users.sh script is called every minute, it has an internal check that only permits the database query/update to run once every 5 minutes.
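The idea of such an internal check can be sketched as follows. This is not the content of update_users.sh (which is a shell script and is not reproduced here); it only illustrates one common way to throttle a job that is invoked every minute, using the modification time of a stamp file. The function name should_run and the stamp file are assumptions.

```python
import os
import tempfile
import time


def should_run(stamp_path, min_interval=300, now=None):
    """Return True (and refresh the stamp) if min_interval seconds have passed.

    The modification time of stamp_path records the last permitted run.
    """
    now = time.time() if now is None else now
    try:
        last = os.path.getmtime(stamp_path)
    except OSError:
        last = None  # no stamp file yet: this is the first run
    if last is not None and now - last < min_interval:
        return False
    with open(stamp_path, "a"):
        pass  # create the stamp file if it does not exist
    os.utime(stamp_path, (now, now))
    return True


if __name__ == "__main__":
    stamp = os.path.join(tempfile.mkdtemp(), "update_users.stamp")
    print(should_run(stamp))  # first call is allowed
    print(should_run(stamp))  # an immediate second call is refused
```

With the default of 300 seconds, a caller that runs every minute performs the expensive query on roughly every fifth invocation.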
Backup
Historically (voatlas65) backups went to:
voatlas30:/data/gangamon/gangamon
We now use TSM to back up the /data/backups/data directory of voatlas90. The client to interact with the backup server is dsmc.
To query the backup status of a file, call
dsmc query backup "/data/backups/data/*"
from voatlas90's command line:
[0]=> dsmc query backup "/data/backups/data/*"
IBM Tivoli Storage Manager
Command Line Backup/Archive Client Interface
Client Version 5, Release 5, Level 1.0
Client date/time: 12/09/2010 12:19:30
(c) Copyright by IBM Corporation and other(s) 1990, 2008. All Rights Reserved.
Node Name: VOATLAS90
Session established with server TSM66: Linux/x86_64
Server Version 5, Release 5, Level 4.1
Server date/time: 12/09/2010 12:19:30 Last access: 12/09/2010 12:18:51
          Size  Backup Date          Mgmt Class  A/I  File
          ----  -----------          ----------  ---  ----
           0 B  12/03/2010 12:49:13  DEFAULT     A    /data/backups/data/backup.log
  81,150,805 B  12/09/2010 03:13:52  DEFAULT     A    /data/backups/data/django.tgz
  89,699,745 B  12/09/2010 03:13:55  DEFAULT     A    /data/backups/data/gangamon.mysql.backup.gz
Restoring backups
This page explains how you use the restore command to pull backups from the tape storage.
Some examples:
- Restore a particular file to a given (new) location
dsmc restore /data/backups/data/django.tgz /tmp/
- Restore all files in the backup to their default locations
dsmc restore "/data/backups/data/*"
- Restore a file to the state it was in as of 1:00 PM on August 17th 2010
dsmc restore -pitd=8/17/2010 -pitt=13:00:00 /data/backups/data/django.tgz
So, for example, the gangamon.mysql.backup.gz file was pulled from tape in the following way:
#> dsmc restore /data/backups/data/gangamon.mysql.backup.gz /tmp/mkenyon/
IBM Tivoli Storage Manager
Command Line Backup/Archive Client Interface
Client Version 5, Release 5, Level 1.0
Client date/time: 12/09/2010 14:30:48
(c) Copyright by IBM Corporation and other(s) 1990, 2008. All Rights Reserved.
Node Name: VOATLAS90
Session established with server TSM66: Linux/x86_64
Server Version 5, Release 5, Level 4.1
Server date/time: 12/09/2010 14:30:48 Last access: 12/09/2010 14:30:01
Restore function invoked.
Restoring 89,699,745 /data/backups/data/gangamon.mysql.backup.gz --> /tmp/mkenyon/gangamon.mysql.backup.gz [Done]
Restore processing finished.
Total number of objects restored: 1
Total number of objects failed: 0
Total number of bytes transferred: 85.55 MB
Data transfer time: 1.32 sec
Network data transfer rate: 66,309.05 KB/sec
Aggregate data transfer rate: 28,200.17 KB/sec
Elapsed processing time: 00:00:03
A quick checksum was then calculated to confirm that the file recovered from the backup is identical to the latest tarball on voatlas90:
#> md5sum /data/backups/data/gangamon.mysql.backup.gz /tmp/mkenyon/gangamon.mysql.backup.gz
6368376e8eac07a11cffd8fc6010ad7d /data/backups/data/gangamon.mysql.backup.gz
6368376e8eac07a11cffd8fc6010ad7d /tmp/mkenyon/gangamon.mysql.backup.gz
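The same comparison can be scripted, for instance when several restored files need checking. The sketch below uses only the Python standard library; the function names md5_of_file and files_match are made up for illustration.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(path_a, path_b):
    """True if both files have the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)


if __name__ == "__main__":
    # Self-contained demonstration with two scratch files.
    scratch = tempfile.mkdtemp()
    original = os.path.join(scratch, "original.gz")
    restored = os.path.join(scratch, "restored.gz")
    for path in (original, restored):
        with open(path, "wb") as handle:
            handle.write(b"same content")
    print(files_match(original, restored))
```

Reading in chunks keeps memory use flat even for backup files of around 90 MB, like the one restored above.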
Disk space management - database logs
You may need to purge old MySQL binary logs from time to time:
mysql> PURGE BINARY LOGS BEFORE '2010-10-1 22:46:26';
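If the purge is done regularly, the cut-off timestamp can be computed rather than typed by hand. The sketch below only builds the statement text; the helper name purge_statement and the idea of a fixed retention period in days are assumptions, since no retention policy is stated on this page.

```python
from datetime import datetime, timedelta


def purge_statement(keep_days, now=None):
    """Build a PURGE BINARY LOGS statement that keeps the last keep_days days."""
    now = datetime.now() if now is None else now
    cutoff = now - timedelta(days=keep_days)
    return "PURGE BINARY LOGS BEFORE '%s';" % cutoff.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print(purge_statement(7))
```

The resulting line can be pasted at the mysql> prompt after checking that the cut-off date is the intended one.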
Production server configuration
The software is managed by quattor (list of exceptions below).
Lemon monitoring page:
http://lemonweb.cern.ch/lemon-status/info.php?entity=voatlas90&type=host
Relevant quattor templates:
Installation notes for voatlas65 (local changes done by hand)
Local changes to voatlas65:
- added /etc/httpd/conf.d/dashboard.conf
- easy_installed the simplejson package in /usr/lib/python2.4/site-packages
- created symlink: ln -s /usr/lib/python2.4/site-packages/django /usr/lib64/python2.5/site-packages/django
- created symlink: ln -s /usr/lib/python2.4/site-packages/yaml /usr/lib64/python2.5/site-packages/yaml
- disabled mod_python by renaming its configuration file to /etc/httpd/conf.d/python.conf__
- gangausage app: added pygooglechart in /data/django/external; it is referred to by the /data/django/dashboard/server/django.wsgi file
Migration of services from voatlas65 to voatlas90
In addition to the above (including the voatlas65-specific notes), some directories that aren't in SVN were rsynced across from voatlas65 to voatlas90:
- voatlas65:/data/django/service/*
- voatlas65:/data/django/external (which contains the pygooglechart package)
Note that /data/django/TRUNK/server/monitoringsite/settings.py had to be modified to include the correct MySQL password and SECRET_KEY, as these aren't stored in SVN.
The dashboard tools were checked out of the SVN Trunk:
cd /data/django
export SVNURL=svn+ssh://gangage@svn.cern.ch/reps/ganga
svn co $SVNURL/trunk/external/dashb ./TRUNK
ln -s TRUNK dashboard
Apache configuration
# cat /etc/httpd/conf.d/dashboard.conf
WSGIScriptAlias /django /data/django/dashboard/server/django.wsgi
Alias /django_media /data/django/dashboard/server/monitoringsite/media
Alias /diane /data/django/dashboard/dianetaskmonitor/client
Alias /ganga /data/django/dashboard/gangataskmonitor/client
WSGIScriptAlias / /data/django/dashboard/server/django.wsgi
MSG Information
Destinations consumed by runcollector
Currently there is no protection for the production queues in the MSG server.
NEVER run the collector in production mode outside the scope of the production service.
See the runcollector.py source for up-to-date destinations.
As of April 2013:
- Server: ganga.msg.cern.ch, which is actually a DNS alias in front of three Apollo messaging servers.
- Port: 6163
- Ganga destinations: /queue/ganga.status, /topic/ganga.status
- Diane destinations: /queue/diane.journal, /queue/diane.status
TODO:
- Restrict read-access of queues to official runcollector.
Web access to the message queues
You need a valid Grid certificate to access these pages (tested with Firefox).
Production server:
https://gridmsg101.cern.ch/admin/queues.jsp
Development server:
https://gridmsg001.cern.ch/admin/queues.jsp
Creating development environment
It is recommended to keep the development environment as close as possible to the production server, to avoid compatibility issues when moving the code into production.
Whenever possible, try to use the same versions as installed on the gangamon server.
Check the installed packages in production.
What do we need?
- Apache2 webserver (with mod_wsgi installed)
- MySQL database
- Python 2.5+
- SVN
- Django
Ubuntu
Install Environment
Install apache, mysql, python, svn
sudo apt-get install apache2 mysql-server python subversion
Install mod_wsgi for apache
sudo apt-get install libapache2-mod-wsgi
Install MySQL support for Python
sudo apt-get install python-mysqldb
Install Django:
sudo apt-get install python-django python-django-doc
If you want the latest development version
here are some hints
See also:
http://docs.djangoproject.com/en/1.1/intro/install
Setup MySQL database
Create the MySQL database (gangamon) and a user with proper privileges (replace myuser and mypassword with real values):
> mysql -u root
CREATE DATABASE gangamon CHARACTER SET utf8;
CREATE USER 'myuser'@'localhost' IDENTIFIED BY 'mypassword';
GRANT ALL PRIVILEGES ON gangamon.* to 'myuser'@'localhost';
Here is more information on this:
Install Ganga/Diane Dashboard application
In production the application is installed in /data/django. We assume the same location for the development environment because you need root access anyway to modify the Apache configuration files when setting up the application. It also makes it simpler to manage the transition between development and production environments.
Create working copy of Ganga/Diane Dashboard application:
mkdir -p /data/django
cd /data/django
svn co svn+ssh://svn.cern.ch/reps/ganga/trunk/external/dashb dashboard
Update apache configuration:
sudo cp /data/django/dashboard/server/dashboard.conf /etc/apache2/conf.d
sudo service apache2 restart
Note: on Scientific Linux the apache conf path is:
/etc/httpd/conf.d
Update settings.py file with DB connection information:
cd /data/django/dashboard/server/monitoringsite
cp settings.py.TEMPLATE settings.py
#edit settings.py and update DATABASE_USER, DATABASE_PASSWORD and SECRET_KEY
NEVER commit settings.py file to SVN (it contains sensitive information)
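For orientation, the relevant part of settings.py might look like the sketch below. It assumes the old-style (Django 1.1 era) DATABASE_* setting names; which of them settings.py.TEMPLATE already fills in is not stated here, so treat everything except DATABASE_USER, DATABASE_PASSWORD and SECRET_KEY as an assumption. The helper make_secret_key is made up for illustration: run it once and paste the result, rather than generating a new key on every start.

```python
import random
import string


def make_secret_key(length=50):
    """Generate a random string suitable for pasting in as SECRET_KEY.

    Hypothetical helper; it uses the operating system's random source.
    """
    rng = random.SystemRandom()
    alphabet = string.ascii_letters + string.digits + "!@#$%^&*(-_=+)"
    return "".join(rng.choice(alphabet) for _ in range(length))


# Database connection (values are the placeholders used on this page).
DATABASE_ENGINE = "mysql"         # assumption: MySQL backend, as in production
DATABASE_NAME = "gangamon"        # database created in the MySQL setup step
DATABASE_USER = "myuser"          # placeholder, replace with your user
DATABASE_PASSWORD = "mypassword"  # placeholder, replace with your password
DATABASE_HOST = ""                # empty string means localhost
DATABASE_PORT = ""                # empty string means the default port

# Paste a value generated once with make_secret_key(); keep it out of SVN.
SECRET_KEY = "replace-with-a-generated-value"

if __name__ == "__main__":
    print(make_secret_key())
```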
Rename file settings.js-example (located in client/media/scripts/) to settings.js
Initialize django databases:
cd /data/django/dashboard/server/monitoringsite
python manage.py syncdb
You should get this output:
Creating table auth_permission
[...]
You just installed Django's auth system, which means you don't have any superusers defined.
Would you like to create one now? (yes/no): yes
Username (Leave blank to use 'myuser'):
E-mail address: myuser@some.mail
Password:
Password (again):
Superuser created successfully.
Installing index for auth.Permission model
[...]
That's it! Try
http://localhost/monitoring
If you want to test with a web browser running somewhere other than localhost, one more fix is needed: the JavaScript web application needs to be told at which URL Django is serving data. Edit the /data/django/dashboard/client/media/scripts/settings.js file and replace localhost with your fully qualified server name.
You may also run a collector in test mode (using test queues).
Development of Task Monitoring applications
Code base for taskmonitor ("TRUNK") is developed here:
http://svnweb.cern.ch/world/wsvn/ganga/trunk/external/dashb/taskmonitor
DIANE task monitoring is a branch of taskmonitor and it is kept here:
http://svnweb.cern.ch/world/wsvn/ganga/trunk/external/dashb/dianetaskmonitor
The branch may/should be kept up-to-date with the trunk. Simply:
svn merge svn+ssh://svn.cern.ch/reps/ganga/trunk/external/dashb/taskmonitor
The following are added and are DIANE-specific:
- client/media/scripts/settings.js (a simple svn add settings.js)
The following are svn copies:
- client/media/css is a copy of client/media/css_example (and may be merged within the branch separately if needed)
- index.html (a copy of index.html_example)
The corresponding commands are:
svn cp index.html_example index.html
cd client/media
svn cp css_example css
svn ci -m "your commit message"
Similarly for Ganga (TODO).
Send test messages to your development service instance
Run collector in the test mode
By default the collector runs in the test mode. So simply run:
python manage.py runcollector
It will use the development server gridmsg001.cern.ch instead of the production server.
The MSG destinations used by the collector will be the same as on the production server.
Configure Ganga to send messages in test mode
Now, suppose that you'd like to test the gangausage or monitoring messages. You may force Ganga to send messages to the development MSG server like this:
ganga -o[MSGMS]server=gridmsg001.cern.ch
This works as of release 5.5.14 (previously the usage destination was hardcoded).
Configure DIANE to send messages in test mode
This works with code later than the 2.1 release.
Set the DIANE_MSG_TEST environment variable and all messages will be published to the development server.
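The switch can be pictured roughly as below. This is not the DIANE source code, only a sketch of the behaviour described above: it assumes that the mere presence of DIANE_MSG_TEST in the environment (with any value) selects the development server, and the function name select_msg_server is made up.

```python
import os

PRODUCTION_SERVER = "ganga.msg.cern.ch"    # production alias named on this page
DEVELOPMENT_SERVER = "gridmsg001.cern.ch"  # development server named on this page


def select_msg_server(environ=None):
    """Pick the MSG server depending on the DIANE_MSG_TEST variable.

    Assumption: any value, even an empty one, switches to test mode.
    """
    environ = os.environ if environ is None else environ
    if "DIANE_MSG_TEST" in environ:
        return DEVELOPMENT_SERVER
    return PRODUCTION_SERVER


if __name__ == "__main__":
    print(select_msg_server())
```

In a shell, "export DIANE_MSG_TEST=1" before starting DIANE would then be enough to enter test mode under this assumption.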
Notes
Other ways to restart Apache
/etc/init.d/apache2 restart
Or
/etc/init.d/httpd restart
Service Documentation Form
The service documentation form is available here:
GangaDIANEMonitoring
Dashboard Release Notes
1-14 : 01/04/2011
Gangamon: Greatly optimised the DB lookups for finding Ganga jobs with subjobs. Previously, this was done by iterating over the table for each job to count its subjobs. Now we collect this information via ActiveMQ and place it straight into the gangamon_gangajobdetail table. This required changes to the server code (views.py, model.py and eventproc.py), plus the addition of a column in the gangajobdetail table to hold the 'number of subjobs' integer. Note that the (single-line) change to eventproc.py isn't included in the code tagged as 1-14 in SVN. It has since been added to the dashboard trunk, so it will automatically be included in 1-15 when that is tagged.
--
JakubMoscicki - 2009-09-04
--
JakubMoscicki - 10-Nov-2010