BNL Historical Batch Monitoring
Since the mid-2000s we used a homegrown system that ran condor_status, wrote the output to RRD files, and displayed them with some CGI scripts on a webserver.
This was very inflexible: RRDs cannot change their schema or size, monitoring new pools and experiments required manually editing a lot of code, and the retention policy could not be changed once set.
This was bad because institutions could not be added dynamically; everything had to be pre-defined, with queue names remapped into the RRD schema.
Furthermore it was impossible to plot everything we wanted: there were too many statistics and files, and we would have been approaching a reimplementation of ganglia.
Current Monitoring
Some talks that give an overview:
2017 HTCondor Week: Analytics_with_HTCondor_Data.pdf
HEPiX in 2016
See slides 3-4 in the HEPiX talk for an overview of the old RRD-based monitoring.
The HTCondor Week talk gets across the point I was trying to make about accounting vs. analytics.
A limited public instance of our Grafana:
here
Technology Choices:
- Fifemon: data ingest from HTCondor (heavily modified; the link is to my GitHub version)
- Graphite: a four-node clustered Graphite setup with replication and SSDs for ingest (also holds other collectd data)
- Grafana: front end for making plots from Graphite
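To give a flavor of the ingest side, here is a minimal, hypothetical sketch of how a Fifemon-style collector can push pool metrics to Graphite over its plaintext protocol. The metric paths, host, and port are illustrative assumptions, not our actual configuration:

```python
import socket
import time

def format_metrics(metrics, timestamp=None):
    """Render a {path: value} dict into Graphite plaintext-protocol lines."""
    ts = int(timestamp if timestamp is not None else time.time())
    return "".join(f"{path} {value} {ts}\n" for path, value in metrics.items())

def send_to_graphite(metrics, host="graphite.example.com", port=2003):
    """Open a TCP connection to the carbon relay and send the metrics."""
    payload = format_metrics(metrics).encode("ascii")
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(payload)

# Example: per-pool slot counts, e.g. as gathered from condor_status.
pool_metrics = {
    "htcondor.poolA.slots.busy": 1200,
    "htcondor.poolA.slots.idle": 80,
}
# send_to_graphite(pool_metrics)  # needs a live carbon relay, so not run here
```

Because each sample is just a timestamped line per metric path, adding a new pool or experiment only means emitting a new path; nothing has to be pre-declared the way it did with RRDs.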
Current dashboard example:
Experiences
Clustered Graphite: pretty good, but handling partitions requires some manual scripting. Tooling exists to filter and join plots together, but occasional manual intervention is a must.
Grafana: no complaints here, very powerful plotting tools, even alerting
Fifemon: modified to add custom points in the metric hierarchy and to run as a cronjob rather than a daemon (memory leaks in the Python bindings drove this, as did the desire to run many queries against our many pools without so many daemons running). The flexibility is great: we can generate plots on a per-user basis.
Extra Tooling (CGroups)
Mentioned in my talks linked above, we also have some code (again, on my GitHub) that I wrote to pull data from cgroups in HTCondor (assuming you're using cgroups) and push it to Graphite. This enables plots of per-slot efficiency and RAM+swap usage, among other cool things.
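For a rough idea of what such a collector does, here is a minimal, hypothetical sketch of computing per-slot CPU efficiency and memory usage from cgroup-v1 accounting files. The path layout and helper names are assumptions for illustration; the real code on GitHub does considerably more:

```python
import os

def read_cgroup_value(path):
    """Read a single integer counter from a cgroup accounting file."""
    with open(path) as f:
        return int(f.read().strip())

def slot_efficiency(cpu_ns_start, cpu_ns_end, wall_seconds, ncores):
    """CPU efficiency over an interval: CPU-seconds used / CPU-seconds available.

    cpuacct.usage reports cumulative CPU time in nanoseconds, so we sample it
    at the start and end of the interval and take the difference.
    """
    cpu_seconds = (cpu_ns_end - cpu_ns_start) / 1e9
    return cpu_seconds / (wall_seconds * ncores)

def slot_memory(cgroup_dir):
    """Return (memory, swap) usage in bytes for one slot's cgroup.

    memory.memsw.usage_in_bytes includes swap when swap accounting is enabled,
    so swap is the difference between the two counters.
    """
    mem = read_cgroup_value(os.path.join(cgroup_dir, "memory.usage_in_bytes"))
    memsw_path = os.path.join(cgroup_dir, "memory.memsw.usage_in_bytes")
    swap = 0
    if os.path.exists(memsw_path):
        swap = read_cgroup_value(memsw_path) - mem
    return mem, swap
```

Sampling cpuacct.usage once per collection interval and dividing the delta by wall time times the slot's core count yields the efficiency number that ends up in Graphite.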
Accounting
A quick word about how we do accounting: we parse the schedd history files on the submit nodes with a daily cron job that dumps all records to
MySQL
Our dumping script is
history_dump.py.txt
This allows querying things like average job latency, job runtime, evictions, and efficiency on a per-experiment and per-time-period basis.
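As a sketch of the kind of aggregation this enables, here is a minimal, hypothetical summary over parsed history records. Each record is assumed to carry the job ClassAd timestamps QDate, JobStartDate, and CompletionDate plus an "experiment" tag derived from the accounting group; the actual work happens in SQL against the MySQL dump:

```python
from collections import defaultdict

def summarize_jobs(records):
    """Aggregate parsed schedd-history records into per-experiment averages.

    Latency is queue wait (start minus submit); runtime is completion minus
    start. Both are averaged per experiment.
    """
    sums = defaultdict(lambda: {"latency": 0.0, "runtime": 0.0, "n": 0})
    for r in records:
        s = sums[r["experiment"]]
        s["latency"] += r["JobStartDate"] - r["QDate"]
        s["runtime"] += r["CompletionDate"] - r["JobStartDate"]
        s["n"] += 1
    return {
        exp: {"avg_latency": s["latency"] / s["n"],
              "avg_runtime": s["runtime"] / s["n"]}
        for exp, s in sums.items()
    }
```

Grouping by a time bucket as well as by experiment gives the per-time-period breakdowns mentioned above.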
We would like to work on this accounting setup and perhaps replace it with an ELK stack that reads both startd and schedd history, enabling correlation between the two for better debugging as well as better accounting.
--
WilliamSK - 2018-09-24