Support XRootD Federation Monitoring Dashboard
This page documents the support procedures for the XRootD Federation Monitoring Dashboard, downstream of the GLED collector.
Machines
- ATLAS - dashb-atlas-xrootd-transfers.cern.ch (new style LB alias)
- dashb-ai-520.cern.ch
- dashb-ai-521.cern.ch
- CMS - dashb-cms-xrootd-transfers.cern.ch (old style LB alias)
- dashb-ai-522.cern.ch
- dashb-ai-523.cern.ch
All machines are managed by Puppet.
Messaging infrastructure
- Production broker
- Topics
- ATLAS
- FAX - /topic/xrootd.atlas.fax
- EOS - /topic/xrootd.atlas.eos
- CMS
- AAA - /topic/xrootd.cms.aaa
- EOS - /topic/xrootd.cms.eos
- Queues
- ATLAS
- FAX - /queue/Consumer.dashb-atlas.xrootd.atlas.fax
- EOS - /queue/Consumer.dashb-atlas.xrootd.atlas.eos
- CMS
- AAA - /queue/Consumer.dashb-cms.xrootd.cms.aaa
- EOS - /queue/Consumer.dashb-cms.xrootd.cms.eos
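The queue names listed above appear to follow a fixed pattern: the topic name, prefixed with `Consumer.` and a per-VO consumer name. A minimal sketch of that mapping is given here, assuming the pattern holds for every stream; the function name `queue_for_topic` is hypothetical and not part of the dashboard code.

```python
def queue_for_topic(consumer, topic):
    """Derive the consumer queue name from a topic name.

    Assumption: queues are named /queue/Consumer.<consumer>.<topic name>,
    as suggested by the entries listed on this page.
    """
    prefix = "/topic/"
    if not topic.startswith(prefix):
        raise ValueError("not a topic destination: %r" % topic)
    return "/queue/Consumer.%s.%s" % (consumer, topic[len(prefix):])


print(queue_for_topic("dashb-cms", "/topic/xrootd.cms.aaa"))
# /queue/Consumer.dashb-cms.xrootd.cms.aaa
```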
Stompclt: message consumption
Stompclt is used to consume messages from a SINGLE broker and store them on local disk.
The stompclt command should not be run manually; it is handled by simplevisor (see the next section).
All configuration files can be found in: /opt/dashboard/etc/dashboard-simplevisor/
Example of a configuration file:
vim /opt/dashboard/etc/dashboard-simplevisor/atlas_fax_consumer.cfg
<incoming-broker>
auth = "x509 cert=/home/dboard/.security/usercert.pem,key=/home/dboard/.security/userkey.pem"
</incoming-broker>
subscribe = "destination=/queue/Consumer.dashb-atlas.xrootd.atlas.fax"
outgoing-queue = "path=/opt/dashboard/var/messages/atlas_fax"
reliable = true
heart-beat = true
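Since one such file is needed per stream, the configuration above can be treated as a template. The sketch below renders the same content for a given queue and spool directory. It is only an illustration: `render_consumer_config` is a hypothetical helper, and the default certificate paths are simply copied from the example above.

```python
CONSUMER_TEMPLATE = """\
<incoming-broker>
auth = "x509 cert={cert},key={key}"
</incoming-broker>
subscribe = "destination={queue}"
outgoing-queue = "path={spool}"
reliable = true
heart-beat = true
"""


def render_consumer_config(queue, spool,
                           cert="/home/dboard/.security/usercert.pem",
                           key="/home/dboard/.security/userkey.pem"):
    """Return the text of a stompclt consumer configuration file."""
    return CONSUMER_TEMPLATE.format(cert=cert, key=key, queue=queue, spool=spool)


print(render_consumer_config(
    "/queue/Consumer.dashb-atlas.xrootd.atlas.fax",
    "/opt/dashboard/var/messages/atlas_fax",
))
```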
Basic commands for stompclt
Start a consumer:
stompclt --incoming-broker-uri stomp+ssl://gridmsg202.cern.ch:6162 --conf /opt/dashboard/etc/dashboard-simplevisor/atlas_fax_consumer.cfg --pidfile /opt/dashboard/var/lock/gridmsg202.cern.ch-atlas_fax.pid --daemon
Check status of a specific consumer:
stompclt --pidfile /opt/dashboard/var/lock/gridmsg202.cern.ch-atlas_fax.pid --status
Stop a consumer:
stompclt --pidfile /opt/dashboard/var/lock/gridmsg202.cern.ch-atlas_fax.pid --quit
List all running consumers:
ps -ef |grep stompclt
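The commands above differ only in the broker host and the stream name. The following sketch builds them from those two values, assuming the file naming seen in the examples (`<stream>_consumer.cfg` and `<broker>-<stream>.pid`) applies to every consumer. The helper names are hypothetical; only the stompclt options already shown on this page are used.

```python
LOCK_DIR = "/opt/dashboard/var/lock"
CONF_DIR = "/opt/dashboard/etc/dashboard-simplevisor"


def pidfile(broker, stream):
    """PID file path for one consumer (naming taken from the examples)."""
    return "%s/%s-%s.pid" % (LOCK_DIR, broker, stream)


def start_command(broker, stream, port=6162):
    """Argument list that starts a consumer as a daemon."""
    return [
        "stompclt",
        "--incoming-broker-uri", "stomp+ssl://%s:%d" % (broker, port),
        "--conf", "%s/%s_consumer.cfg" % (CONF_DIR, stream),
        "--pidfile", pidfile(broker, stream),
        "--daemon",
    ]


def status_command(broker, stream):
    """Argument list that checks the status of a consumer."""
    return ["stompclt", "--pidfile", pidfile(broker, stream), "--status"]


print(" ".join(start_command("gridmsg202.cern.ch", "atlas_fax")))
```

The lists are returned rather than executed, so they can be inspected or logged; in production the consumers are started by simplevisor, not by hand.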
Monitoring of stompclt
Stompclt clients are supervised by simplevisor and are silently restarted (if needed) every 30s.
Simplevisor: supervisor for all stompclt instances
The example in the previous section shows that one stompclt instance is needed per broker and per queue. As a result, many stompclt clients run simultaneously on a single machine, and some occasionally crash. Simplevisor is used to supervise all these instances.
The simplevisor configuration file needs to be regenerated every time machines are added to or removed from the LB alias. A helper script has been developed to handle the creation of the configuration file; it can be customized for ATLAS or CMS. The generated file should then be copied to the proper directory:
cd /opt/dashboard/doc/simplevisor/
python conf-generator.py <vo>
mv consumer-simplevisor.cfg /opt/dashboard/etc/dashboard-simplevisor/
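To see why the file has to be regenerated, it helps to enumerate what simplevisor supervises: one service per stream and per broker. The sketch below is not the logic of conf-generator.py, which is not reproduced on this page; it only illustrates the combination, assuming the LB alias resolves to the list of brokers and that services are named `<stream>-<broker>`. Both function names are hypothetical.

```python
from itertools import product


def supervised_services(streams, brokers):
    """One service name per (stream, broker) pair."""
    return ["%s-%s" % (stream, broker) for stream, broker in product(streams, brokers)]


def needs_regeneration(configured, streams, brokers):
    """True when the configured services no longer match the broker list."""
    return set(configured) != set(supervised_services(streams, brokers))


streams = ["atlas_fax", "atlas_eos"]
brokers = ["gridmsg202.cern.ch", "gridmsg203.cern.ch"]
services = supervised_services(streams, brokers)
print(services[0])  # atlas_fax-gridmsg202.cern.ch

# A broker joins the alias: the existing configuration is now stale.
print(needs_regeneration(services, streams, brokers + ["gridmsg204.cern.ch"]))  # True
```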
Basic commands for simplevisor
Start simplevisor:
simplevisor --conf /opt/dashboard/etc/dashboard-simplevisor/consumer-simplevisor.cfg --daemon start
Check status of simplevisor:
simplevisor --conf /opt/dashboard/etc/dashboard-simplevisor/consumer-simplevisor.cfg check
dashb-supervisor: OK, as expected
atlas_fax: OK, as expected
atlas_fax-gridmsg202.cern.ch: OK, running, as expected
atlas_fax-gridmsg206.cern.ch: OK, running, as expected
atlas_fax-gridmsg203.cern.ch: OK, running, as expected
atlas_fax-gridmsg204.cern.ch: OK, running, as expected
atlas_fax-gridmsg205.cern.ch: OK, running, as expected
atlas_eos: OK, as expected
atlas_eos-gridmsg202.cern.ch: OK, running, as expected
atlas_eos-gridmsg206.cern.ch: OK, running, as expected
atlas_eos-gridmsg203.cern.ch: OK, running, as expected
atlas_eos-gridmsg204.cern.ch: OK, running, as expected
atlas_eos-gridmsg205.cern.ch: OK, running, as expected
Stop simplevisor:
simplevisor --conf /opt/dashboard/etc/dashboard-simplevisor/consumer-simplevisor.cfg stop
Monitoring of simplevisor
Every 30 minutes, a cronjob checks that simplevisor is running and that all the underlying services are in the expected state. If any problem occurs, an automatic restart is triggered and a notification is sent.
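The check performed by such a cronjob could be sketched as a parser over the output of the simplevisor `check` command shown earlier. This is an illustration, not the actual cron script. It assumes every line has the form `<name>: <state>` and treats any state that does not start with `OK` as a problem; the non-OK line in the sample is invented for the example, since this page only shows healthy output.

```python
def unexpected_services(check_output):
    """Return the names of services whose reported state is not OK."""
    problems = []
    for line in check_output.splitlines():
        name, sep, state = line.strip().partition(": ")
        # Lines without a "name: state" shape are ignored.
        if sep and not state.startswith("OK"):
            problems.append(name)
    return problems


sample = """\
dashb-supervisor: OK, as expected
atlas_fax: OK, as expected
atlas_fax-gridmsg202.cern.ch: OK, running, as expected
atlas_fax-gridmsg203.cern.ch: FAIL, hypothetical wording for illustration
"""

print(unexpected_services(sample))
```

A non-empty result would be the trigger for the restart and the notification.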
Collector: insertion of messages into the DB
All the messages temporarily stored in the local queue then need to be sent to the DB.
Basic commands for the collector
/opt/dashboard/bin/dashb-agent-list
/opt/dashboard/bin/dashb-agent-start xrootd_atlas
/opt/dashboard/bin/dashb-agent-restart xrootd_atlas
/opt/dashboard/bin/dashb-agent-stop xrootd_atlas
/opt/dashboard/bin/dashb-agent-status xrootd_atlas
Monitoring of the collectors
Every 30 minutes, a cronjob checks that the collectors are running as expected and restarts them if needed.
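A check-and-restart step of this kind could look like the sketch below, built on the `dashb-agent-*` commands listed above. It is not the actual cron script. It assumes that `dashb-agent-status` exits with a non-zero code when the agent is not running, which this page does not confirm; `ensure_running` is a hypothetical name, and the `run` parameter exists only so the logic can be exercised without touching a real agent.

```python
import subprocess

BIN_DIR = "/opt/dashboard/bin"


def ensure_running(agent, run=subprocess.call):
    """Restart the agent if its status check fails.

    Returns True when a restart was triggered, False otherwise.
    Assumption: a non-zero exit code from dashb-agent-status means
    the agent is not running as expected.
    """
    if run(["%s/dashb-agent-status" % BIN_DIR, agent]) == 0:
        return False
    run(["%s/dashb-agent-restart" % BIN_DIR, agent])
    return True


# Example call on a dashboard machine (not executed here):
#   ensure_running("xrootd_atlas")
```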
Database
The database consists of two sets of schemas per VO.
| Name | Type of data | Scope | User | Procedure |
| #VO#_XRD_MON | raw | Shared with popularity | Owner, Writer | Monitoring |
| #VO#_XROOTD_DASHBOARD | aggregate | Dashboard | Owner, Writer, Reader | Aggregation |
The collector should exclusively write to #VO#_XRD_MON_W.
The web application should exclusively read from #VO#_XROOTD_DASHBOARD_R.
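The two rules above can be expressed as a small lookup from component to database account. The sketch assumes that the `#VO#` placeholder is replaced by the upper-case VO name (for example `ATLAS`), which this page does not state explicitly; the function names are hypothetical.

```python
def collector_account(vo):
    """Writer account on the raw schema, used only by the collector."""
    # Assumption: #VO# expands to the upper-case VO name.
    return "%s_XRD_MON_W" % vo.upper()


def webapp_account(vo):
    """Reader account on the aggregate schema, used only by the web application."""
    return "%s_XROOTD_DASHBOARD_R" % vo.upper()


for vo in ("atlas", "cms"):
    print(vo, collector_account(vo), webapp_account(vo))
```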
Monitoring & Automatic recovery procedure
To optimize the reaction time and the recovery time, various procedures have been set up. Depending on the impact, two kinds of action can be taken: N (notification) or R (service restart). The following table describes the strategy implemented on the machines.
--
AlexandreBeche - 14 Nov 2013