Support XRootD Federation Monitoring Dashboard

This page documents the support of the XRootD Federation Monitoring Dashboard.

This page describes the support procedures for the XRootD Monitoring Dashboard downstream of the GLED collector.

Machines

  • ATLAS - dashb-atlas-xrootd-transfers.cern.ch (new style LB alias)
    • dashb-ai-520.cern.ch
    • dashb-ai-521.cern.ch
  • CMS - dashb-cms-xrootd-transfers.cern.ch (old style LB alias)
    • dashb-ai-522.cern.ch
    • dashb-ai-523.cern.ch
All machines are managed by Puppet.

Messaging infrastructure

  • Production broker
    • dashb-mb.cern.ch
      • gridmsg20[2-6].cern.ch
  • Topics
    • ATLAS
      • FAX - /topic/xrootd.atlas.fax
      • EOS - /topic/xrootd.atlas.eos
    • CMS
      • AAA - /topic/xrootd.cms.aaa
      • EOS - /topic/xrootd.cms.eos
  • Queues
    • ATLAS
      • FAX - /queue/Consumer.dashb-atlas.xrootd.atlas.fax
      • EOS - /queue/Consumer.dashb-atlas.xrootd.atlas.eos
    • CMS
      • AAA - /queue/Consumer.dashb-cms.xrootd.cms.aaa
      • EOS - /queue/Consumer.dashb-cms.xrootd.cms.eos
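The queue names above appear to mirror the topic names, with a per-consumer prefix: /topic/<name> maps to /queue/Consumer.<consumer>.<name>. This pattern is inferred from the destinations listed on this page, not taken from the broker configuration. A minimal Python sketch of the mapping, where consumer_queue is a hypothetical helper:

```python
def consumer_queue(consumer, topic):
    """Return the per-consumer queue that mirrors a topic.

    ASSUMPTION (inferred from the destinations listed on this page):
    /topic/<name>  ->  /queue/Consumer.<consumer>.<name>
    """
    prefix = "/topic/"
    if not topic.startswith(prefix):
        raise ValueError("not a topic destination: %r" % topic)
    return "/queue/Consumer.%s.%s" % (consumer, topic[len(prefix):])


if __name__ == "__main__":
    # Print the queue expected for each topic listed above.
    for vo, source in [("atlas", "fax"), ("atlas", "eos"),
                       ("cms", "aaa"), ("cms", "eos")]:
        topic = "/topic/xrootd.%s.%s" % (vo, source)
        print(topic, "->", consumer_queue("dashb-" + vo, topic))
```

A check like this can help spot a copy-paste mistake when a new topic or consumer is added.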

Stompclt: message consumption

Stompclt consumes messages from a SINGLE broker and stores them on local disk. The stompclt command should not be run manually; it is handled by simplevisor (see the next section).

All configuration files can be found in: /opt/dashboard/etc/dashboard-simplevisor/

Example of a configuration file:

vim /opt/dashboard/etc/dashboard-simplevisor/atlas_fax_consumer.cfg
<incoming-broker>
   auth = "x509 cert=/home/dboard/.security/usercert.pem,key=/home/dboard/.security/userkey.pem"
</incoming-broker>

subscribe = "destination=/queue/Consumer.dashb-atlas.xrootd.atlas.fax"
outgoing-queue = "path=/opt/dashboard/var/messages/atlas_fax"

reliable = true
heart-beat = true

Basic commands for stompclt

Start a consumer:

stompclt --incoming-broker-uri stomp+ssl://gridmsg202.cern.ch:6162 --conf /opt/dashboard/etc/dashboard-simplevisor/atlas_fax_consumer.cfg --pidfile /opt/dashboard/var/lock/gridmsg202.cern.ch-atlas_fax.pid --daemon
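Because one consumer is needed per broker, the command above is repeated for every host behind dashb-mb.cern.ch. The following Python sketch only prints the command lines, using the flags, port, and paths from the example above; it does not start anything, since consumers are normally started by simplevisor. The helper names (broker_hosts, stompclt_start_command) are hypothetical, as is the assumption that every consumer follows the <name>_consumer.cfg and <broker>-<name>.pid naming seen in the example.

```python
BROKER_PORT = 6162  # port used in the example command
CONF_DIR = "/opt/dashboard/etc/dashboard-simplevisor"
LOCK_DIR = "/opt/dashboard/var/lock"


def broker_hosts(first=202, last=206):
    """Expand gridmsg20[2-6].cern.ch into individual host names."""
    return ["gridmsg%d.cern.ch" % n for n in range(first, last + 1)]


def stompclt_start_command(broker, consumer):
    """Build the argument list that starts one consumer against one broker.

    ASSUMPTION: config and pid files follow the naming of the example.
    """
    return [
        "stompclt",
        "--incoming-broker-uri", "stomp+ssl://%s:%d" % (broker, BROKER_PORT),
        "--conf", "%s/%s_consumer.cfg" % (CONF_DIR, consumer),
        "--pidfile", "%s/%s-%s.pid" % (LOCK_DIR, broker, consumer),
        "--daemon",
    ]


if __name__ == "__main__":
    # One command line per broker for the atlas_fax consumer.
    for host in broker_hosts():
        print(" ".join(stompclt_start_command(host, "atlas_fax")))
```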

Check status of a specific consumer:

stompclt --pidfile /opt/dashboard/var/lock/gridmsg202.cern.ch-atlas_fax.pid --status

Stop a consumer:

stompclt --pidfile /opt/dashboard/var/lock/gridmsg202.cern.ch-atlas_fax.pid --quit

List all running consumers:

ps -ef | grep stompclt

Monitoring of stompclt

Stompclt clients are supervised by simplevisor and are silently restarted (if needed) every 30s.

Simplevisor: supervisor for all stompclt instances

The example in the previous section shows that one stompclt instance is needed per broker and per queue. As a result, many stompclt clients run simultaneously on a single machine, and some of them occasionally crash. Simplevisor is used to supervise all these instances.

The simplevisor configuration file needs to be regenerated every time machines are added to or removed from the LB alias. A helper script has been developed to create the configuration file; it can be customized for ATLAS or CMS. The generated file should then be moved to the proper directory:

cd /opt/dashboard/doc/simplevisor/
python conf-generator.py <vo>
mv consumer-simplevisor.cfg /opt/dashboard/etc/dashboard-simplevisor/

Basic commands for simplevisor

Start simplevisor:

simplevisor --conf /opt/dashboard/etc/dashboard-simplevisor/consumer-simplevisor.cfg --daemon start

Check status of simplevisor:

simplevisor --conf /opt/dashboard/etc/dashboard-simplevisor/consumer-simplevisor.cfg check
dashb-supervisor: OK, as expected
  atlas_fax: OK, as expected
    atlas_fax-gridmsg202.cern.ch: OK, running, as expected
    atlas_fax-gridmsg206.cern.ch: OK, running, as expected
    atlas_fax-gridmsg203.cern.ch: OK, running, as expected
    atlas_fax-gridmsg204.cern.ch: OK, running, as expected
    atlas_fax-gridmsg205.cern.ch: OK, running, as expected
  atlas_eos: OK, as expected
    atlas_eos-gridmsg202.cern.ch: OK, running, as expected
    atlas_eos-gridmsg206.cern.ch: OK, running, as expected
    atlas_eos-gridmsg203.cern.ch: OK, running, as expected
    atlas_eos-gridmsg204.cern.ch: OK, running, as expected
    atlas_eos-gridmsg205.cern.ch: OK, running, as expected
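When many consumers are supervised, it can be easier to filter the check output than to read it. The Python sketch below assumes each line has the form "<name>: <status>" and that a healthy entry's status starts with "OK", as in the output above. The non-OK line in the sample is invented for illustration; the real wording printed by simplevisor for a failed entry may differ.

```python
def unexpected_entries(check_output):
    """Return the names of entries whose status does not start with "OK".

    ASSUMPTION: every status line looks like "<name>: <status>".
    """
    bad = []
    for line in check_output.splitlines():
        name, sep, status = line.strip().partition(": ")
        if sep and not status.startswith("OK"):
            bad.append(name)
    return bad


# The last line is a HYPOTHETICAL failure, not real simplevisor output.
SAMPLE = """\
dashb-supervisor: OK, as expected
  atlas_fax: OK, as expected
    atlas_fax-gridmsg202.cern.ch: OK, running, as expected
    atlas_fax-gridmsg203.cern.ch: FAIL, stopped (hypothetical line)
"""

if __name__ == "__main__":
    print(unexpected_entries(SAMPLE))
```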

Stop simplevisor:

simplevisor --conf /opt/dashboard/etc/dashboard-simplevisor/consumer-simplevisor.cfg stop

Monitoring of simplevisor

Every 30 minutes, a cronjob checks that simplevisor is running and that all the underlying services are in the expected state. If a problem is found, an automatic restart is triggered and a notification is sent.

Collector: inserting messages into the database

All the messages temporarily stored in the local queue then need to be sent to the DB.

Basic commands for the collector

/opt/dashboard/bin/dashb-agent-list
/opt/dashboard/bin/dashb-agent-start xrootd_atlas
/opt/dashboard/bin/dashb-agent-restart xrootd_atlas
/opt/dashboard/bin/dashb-agent-stop xrootd_atlas
/opt/dashboard/bin/dashb-agent-status xrootd_atlas

Monitoring of the collectors

Every 30 minutes, a cronjob checks that the collectors are running as expected and restarts them if needed.
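The actual cronjob is not reproduced on this page. As a rough illustration of the check-then-restart logic, here is a Python sketch built on the dashb-agent-status and dashb-agent-restart commands listed above. It assumes that dashb-agent-status exits with 0 when the agent is running, which is not confirmed here; ensure_running and its run parameter are hypothetical names.

```python
import subprocess

AGENT_BIN = "/opt/dashboard/bin"


def ensure_running(agent, run=subprocess.call):
    """Restart `agent` if its status command reports a problem.

    ASSUMPTION: dashb-agent-status exits with 0 when the agent is running.
    `run` takes an argument list and returns an exit code; it is a
    parameter so the logic can be exercised without the real commands.
    Returns True if a restart was triggered, False otherwise.
    """
    if run([AGENT_BIN + "/dashb-agent-status", agent]) == 0:
        return False
    run([AGENT_BIN + "/dashb-agent-restart", agent])
    return True


# Example call on a dashboard machine (not executed here):
#     ensure_running("xrootd_atlas")
```

Passing the command runner as an argument keeps the decision logic separate from the commands themselves, so it can be tried out on a machine where the dashboard tools are not installed.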

Database

The database consists of two sets of schemas per VO:

Name                   Type of data  Scope                   User                   Procedure
#VO#_XRD_MON           raw           Shared with popularity  Owner, Writer          Monitoring
#VO#_XROOTD_DASHBOARD  aggregate     Dashboard               Owner, Writer, Reader  Aggregation

The collector should write exclusively to #VO#_XRD_MON_W.

The web application should read exclusively from #VO#_XROOTD_DASHBOARD_R.

Monitoring & Automatic recovery procedure

To optimize reaction and recovery times, various procedures have been set up. Depending on the impact, two kinds of action can be taken: N (notification) or R (service restart). The following table describes the strategy implemented on the machines.

Service  Machine       Consumer crash  Simplevisor crash  Collector crash  httpd crash  Local queue stacked  List alarmed
FAX      dashb-ai-520  R               N+R                N+R              R            N                    dashb-fax-alarms@cernNOSPAMPLEASE.ch
FAX      dashb-ai-521  R               N+R                N+R              R            N                    dashb-fax-alarms@cernNOSPAMPLEASE.ch
AAA      dashb-ai-522  R               N+R                N+R              R            N                    dashb-aaa-alarms@cernNOSPAMPLEASE.ch
AAA      dashb-ai-523  R               N+R                N+R              R            N                    dashb-aaa-alarms@cernNOSPAMPLEASE.ch
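The strategy in the table is the same on all four machines, so it can be written as a small lookup. The Python sketch below is only an illustration of how the N and R codes combine; the event names are taken from the table headers, and the reaction helper is hypothetical.

```python
# Action codes from the table: N = notification, R = service restart.
ACTIONS = {
    "consumer crash": "R",
    "simplevisor crash": "N+R",
    "collector crash": "N+R",
    "httpd crash": "R",
    "local queue stacked": "N",
}


def reaction(event):
    """Return (notify, restart) booleans for a given event."""
    codes = ACTIONS[event].split("+")
    return ("N" in codes, "R" in codes)


if __name__ == "__main__":
    for event in ACTIONS:
        notify, restart = reaction(event)
        print("%-20s notify=%s restart=%s" % (event, notify, restart))
```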

-- AlexandreBeche - 14 Nov 2013

Topic revision: r1 - 2013-11-14 - AlexandreBeche
 