BNL Historical Batch Monitoring

Since the mid-2000s we used a homegrown system: a script ran condor_status, wrote the results to RRD files, and displayed them with some CGI scripts on a webserver.

This was very inflexible (RRDs cannot change their schema or size once created), monitoring new pools and experiments required manually editing a lot of code, and the retention policy could not be changed once collection had begun.

This was bad because institutions could not be added dynamically; everything had to be pre-defined and queue names remapped into the RRDs. Furthermore it was impossible to plot everything we wanted: with that many statistics and files we would have been approaching a reimplementation of Ganglia.

Current Monitoring

Some talks that give an overview:

2017 HTCondor Week: Analytics_with_HTCondor_Data.pdf, and a HEPiX talk from 2016

See slides 3-4 in the HEPiX talk for an overview of the old RRD-based monitoring

The htcondor week talk gets across the point I was trying to make about accounting vs analytics

Limited public instance of our grafana: here

Technology Choices:
  1. Fifemon: data ingest from htcondor (heavily modified, link is to my github version)
  2. Graphite: 4 node clustered graphite with replication and SSDs for ingest (includes other collectd data as well)
  3. Grafana: front end for making plots from graphite
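
As a concrete illustration of the data path through this stack, here is a minimal sketch (not our production code): query a pool's collector with the HTCondor Python bindings, count slots by state, and push the counts to Graphite over the plaintext protocol. The host, port, pool address and metric names are placeholder assumptions.

```python
import socket
import time

import htcondor  # HTCondor Python bindings

GRAPHITE_HOST = "graphite.example.org"   # placeholder, not our real host
GRAPHITE_PORT = 2003                     # Graphite plaintext-protocol port

def collect_slot_states(pool):
    """Count startd slots by State (Claimed, Unclaimed, ...) in one pool."""
    coll = htcondor.Collector(pool)
    counts = {}
    for ad in coll.query(htcondor.AdTypes.Startd, projection=["State"]):
        state = ad.get("State", "Unknown")
        counts[state] = counts.get(state, 0) + 1
    return counts

def send_to_graphite(metrics):
    """Send 'path value timestamp' lines over Graphite's plaintext protocol."""
    now = int(time.time())
    payload = "".join("%s %d %d\n" % (path, value, now)
                      for path, value in metrics.items())
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT)) as sock:
        sock.sendall(payload.encode())

if __name__ == "__main__":
    counts = collect_slot_states("pool-a.example.org:9618")
    send_to_graphite({"htcondor.pool_a.slots.%s" % s: n
                      for s, n in counts.items()})
```
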
Current dashboard example:

Screenshot_20180924_142444.png

Experiences

Clustered Graphite: pretty good, but handling partitions requires some manual scripting. Tooling exists to filter the metrics and join the data back together, but occasional manual intervention is a must.
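
"Manual scripting" here mostly means back-filling gaps after a split. A rough sketch of the idea, assuming a replica node's copy of the same .wsp file is mounted or rsynced locally (paths are made up; in practice tools like carbonate's whisper-fill do this more carefully):

```python
import time

import whisper  # the library Graphite/carbon uses for .wsp files

def fill_gaps(local_path, replica_path, lookback=7 * 24 * 3600):
    """Back-fill datapoints missing locally from a replica's copy of the file."""
    until = int(time.time())
    since = until - lookback

    local_info, local_vals = whisper.fetch(local_path, since, until)
    replica_info, replica_vals = whisper.fetch(replica_path, since, until)
    assert local_info == replica_info, "files must share retention/step"
    start, _end, step = replica_info

    # keep only points the local node is missing but the replica has
    missing = [(start + i * step, r)
               for i, (l, r) in enumerate(zip(local_vals, replica_vals))
               if l is None and r is not None]
    if missing:
        whisper.update_many(local_path, missing)
    return len(missing)

if __name__ == "__main__":
    filled = fill_gaps(
        "/opt/graphite/storage/whisper/htcondor/pool_a/slots/Claimed.wsp",
        "/mnt/replica_copy/htcondor/pool_a/slots/Claimed.wsp")
    print("back-filled %d datapoints" % filled)
```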

Grafana: no complaints here, very powerful plotting tools, even alerting

Fifemon: we modified it to add custom points in the metric hierarchy and to remove the daemon aspect, running it as a cron job instead (memory leaks in the Python bindings drove this, as did the desire to run queries against our many pools without that many daemons running). The flexibility is great; we can generate per-user plots, for example.
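
The cron-job shape of the modified probe is roughly the following sketch (the pool names and the probe_lib module are invented; it reuses the collect_slot_states / send_to_graphite helpers from the earlier sketch): one short-lived pass over all pools, then exit.

```python
# one_shot_probe.py -- hypothetical; run from cron, e.g.:
#   */5 * * * *  condor  /usr/bin/python3 /usr/local/bin/one_shot_probe.py

# helpers from the ingest sketch earlier on this page (assumed importable)
from probe_lib import collect_slot_states, send_to_graphite

POOLS = {                       # made-up metric prefixes -> collector addresses
    "htcondor.pool_a.slots": "pool-a.example.org:9618",
    "htcondor.pool_b.slots": "pool-b.example.org:9618",
}

def main():
    for prefix, collector in POOLS.items():
        counts = collect_slot_states(collector)
        send_to_graphite({"%s.%s" % (prefix, state): n
                          for state, n in counts.items()})
    # the process exits after one pass; cron launches a fresh one next time,
    # so any memory leaked by the bindings never accumulates

if __name__ == "__main__":
    main()
```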

Extra Tooling (CGroups)

As mentioned in the talks linked above, we also have some code (again, on my github) that I wrote to pull data from cgroups in htcondor (assuming your jobs run in cgroups) and update graphite with it: this enables plots of per-slot efficiency and RAM+swap usage, among other cool things.
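
The core of it is simple; here is a sketch assuming cgroup v1 and the default BASE_CGROUP=htcondor hierarchy (actual paths depend on your configuration, and the metric names are invented): walk the per-slot cgroups, read the CPU and memory counters, and feed them to the Graphite sender shown earlier.

```python
import glob
import os

# For each HTCondor job cgroup (cgroup v1, BASE_CGROUP=htcondor assumed),
# read cumulative CPU time and current RAM / RAM+swap usage.

CPU_ROOT = "/sys/fs/cgroup/cpuacct/htcondor"
MEM_ROOT = "/sys/fs/cgroup/memory/htcondor"

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

def slot_metrics():
    metrics = {}
    for cpu_dir in glob.glob(os.path.join(CPU_ROOT, "*")):
        if not os.path.isdir(cpu_dir):
            continue
        slot = os.path.basename(cpu_dir)            # e.g. ..._slot1_1@host
        mem_dir = os.path.join(MEM_ROOT, slot)
        if not os.path.isdir(mem_dir):
            continue
        metrics["slots.%s.cpu_ns" % slot] = read_int(
            os.path.join(cpu_dir, "cpuacct.usage"))
        metrics["slots.%s.rss_bytes" % slot] = read_int(
            os.path.join(mem_dir, "memory.usage_in_bytes"))
        swap_file = os.path.join(mem_dir, "memory.memsw.usage_in_bytes")
        if os.path.exists(swap_file):               # only if swap accounting is on
            metrics["slots.%s.ram_plus_swap_bytes" % slot] = read_int(swap_file)
    return metrics

if __name__ == "__main__":
    # sanitise slot names before using them as Graphite paths; here we just print,
    # in practice pass the dict to send_to_graphite() from the earlier sketch
    for path, value in sorted(slot_metrics().items()):
        print(path, value)
```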

Accounting

A quick word about how we do accounting: we parse the schedd history files on the submit nodes with a daily cron job that dumps all records to MySQL.

Our dumping script is history_dump.py.txt
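
For illustration only, a stripped-down sketch of the same idea (this is not the attached script): it pulls recently completed jobs through the Python bindings' Schedd.history() call instead of parsing the history file on disk, and inserts them into a MySQL table via pymysql. The table layout, credentials, and attribute list are assumptions.

```python
import time

import htcondor
import pymysql   # assumption: a pymysql-style MySQL client

ATTRS = ["ClusterId", "ProcId", "Owner", "AcctGroup",
         "QDate", "JobStartDate", "CompletionDate",
         "RemoteWallClockTime", "RemoteUserCpu", "NumJobStarts"]

def dump_history(since_epoch):
    """Insert job ads completed after since_epoch into a MySQL table."""
    schedd = htcondor.Schedd()
    db = pymysql.connect(host="db.example.org", user="acct",
                         password="secret", database="accounting")
    cur = db.cursor()
    sql = ("INSERT IGNORE INTO job_history "
           "(cluster_id, proc_id, owner, acct_group, q_date, start_date, "
           " completion_date, wallclock_s, user_cpu_s, num_starts) "
           "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")
    for ad in schedd.history("CompletionDate > %d" % since_epoch,
                             ATTRS, -1):            # -1 = return all matches
        cur.execute(sql, tuple(ad.get(a) for a in ATTRS))
    db.commit()
    db.close()

if __name__ == "__main__":
    dump_history(int(time.time()) - 86400)   # run daily from cron
```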

This allows querying things like average job latency, job runtime, evictions, and efficiency on a per-experiment and per-time-period basis.
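
Against the hypothetical job_history table from the sketch above, such queries look roughly like this (column names are the invented ones from that sketch):

```python
# Average queue->start latency, runtime, and CPU efficiency per experiment
# (accounting group) per month, using the hypothetical job_history table.
QUERY = """
SELECT acct_group,
       DATE_FORMAT(FROM_UNIXTIME(completion_date), '%Y-%m') AS month,
       AVG(start_date - q_date)                             AS avg_latency_s,
       AVG(wallclock_s)                                      AS avg_runtime_s,
       AVG(user_cpu_s / NULLIF(wallclock_s, 0))              AS avg_cpu_efficiency
FROM job_history
GROUP BY acct_group, month
ORDER BY month, acct_group
"""

cur.execute(QUERY)          # cur: a cursor from the pymysql sketch above
for row in cur.fetchall():
    print(row)
```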

This is something we would like to work on, perhaps replacing it with an ELK stack that reads both startd and schedd history and enables correlation between the two, for better debugging as well as better accounting.

-- WilliamSK - 2018-09-24

Topic attachments
| Attachment | Size | Date | Who |
| Analytics_with_HTCondor_Data.pdf | 1094.6 K | 2018-09-24 | WilliamSK |
| Screenshot_20180924_142444.png | 128.3 K | 2018-09-24 | WilliamSK |
| history_dump.py.txt | 10.9 K | 2018-09-24 | WilliamSK |
| t3_queues.png | 52.6 K | 2018-09-24 | WilliamSK |