BNL Historical Batch Monitoring
Since the mid-2000s we used a homegrown system that ran condor_status, wrote the output to RRD files, and displayed them with some CGI scripts on a webserver.
This was very inflexible: RRDs cannot change their schema or size, monitoring new pools and experiments required manually editing a lot of code, and the retention policy could not be changed once set.
This was bad because institutions could not be added dynamically; everything had to be pre-defined, with queue names remapped into the RRD schema.
Furthermore it was impossible to plot everything we wanted: there were too many statistics and files, and we would have been approaching a reimplementation of ganglia.
Current Monitoring
Some talks that give an overview:
2017 HTCondor Week: Analytics_with_HTCondor_Data.pdf
HEPiX in 2016
See slides 3-4 in the HEPiX talk for an overview of the old RRD-based monitoring.
The HTCondor Week talk gets across the point I was trying to make about accounting vs. analytics.
A limited public instance of our Grafana:
here
Technology Choices:
- Fifemon: data ingest from HTCondor (heavily modified; the link is to my GitHub version)
- Graphite: a four-node clustered Graphite setup with replication and SSDs for ingest (also holds other collectd data)
- Grafana: front end for making plots from Graphite
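To give a flavor of the ingest side, here is a minimal, hypothetical sketch of how a Fifemon-style collector can push pool metrics to Graphite over its plaintext protocol. The metric paths, host, and port are illustrative assumptions, not our actual configuration:

```python
import socket
import time

def format_metrics(metrics, timestamp=None):
    """Render a {path: value} dict into Graphite plaintext-protocol lines."""
    ts = int(timestamp if timestamp is not None else time.time())
    return "".join(f"{path} {value} {ts}\n" for path, value in metrics.items())

def send_to_graphite(metrics, host="graphite.example.com", port=2003):
    """Open a TCP connection to the carbon relay and send the metrics."""
    payload = format_metrics(metrics).encode("ascii")
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(payload)

# Example: per-pool slot counts, e.g. as gathered from condor_status.
pool_metrics = {
    "htcondor.poolA.slots.busy": 1200,
    "htcondor.poolA.slots.idle": 80,
}
# send_to_graphite(pool_metrics)  # needs a live carbon relay, so not run here
```

Because each sample is just a timestamped line per metric path, adding a new pool or experiment only means emitting a new path; nothing has to be pre-declared the way it did with RRDs.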
Current dashboard example:
Experiences
Clustered Graphite: pretty good, but handling partitions requires some manual scripting. Tooling exists to filter and join plots together, but occasional manual intervention is a must.
Grafana: no complaints here, very powerful plotting tools, even alerting
Fifemon: modified to add custom points in the metric hierarchy and to run as a cronjob rather than a daemon (memory leaks in the Python bindings drove this, as did the desire to run many queries against our many pools without so many daemons running). The flexibility is great: we can generate plots on a per-user basis.
Extra Tooling (CGroups)
Mentioned in my talks linked above, we also have some code (again, on my GitHub) that I wrote to pull data from cgroups in HTCondor (assuming you're using cgroups) and push it to Graphite. This enables plots of per-slot efficiency and RAM+swap usage, among other cool things.
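For a rough idea of what such a collector does, here is a minimal, hypothetical sketch of computing per-slot CPU efficiency and memory usage from cgroup-v1 accounting files. The path layout and helper names are assumptions for illustration; the real code on GitHub does considerably more:

```python
import os

def read_cgroup_value(path):
    """Read a single integer counter from a cgroup accounting file."""
    with open(path) as f:
        return int(f.read().strip())

def slot_efficiency(cpu_ns_start, cpu_ns_end, wall_seconds, ncores):
    """CPU efficiency over an interval: CPU-seconds used / CPU-seconds available.

    cpuacct.usage reports cumulative CPU time in nanoseconds, so we sample it
    at the start and end of the interval and take the difference.
    """
    cpu_seconds = (cpu_ns_end - cpu_ns_start) / 1e9
    return cpu_seconds / (wall_seconds * ncores)

def slot_memory(cgroup_dir):
    """Return (memory, swap) usage in bytes for one slot's cgroup.

    memory.memsw.usage_in_bytes includes swap when swap accounting is enabled,
    so swap is the difference between the two counters.
    """
    mem = read_cgroup_value(os.path.join(cgroup_dir, "memory.usage_in_bytes"))
    memsw_path = os.path.join(cgroup_dir, "memory.memsw.usage_in_bytes")
    swap = 0
    if os.path.exists(memsw_path):
        swap = read_cgroup_value(memsw_path) - mem
    return mem, swap
```

Sampling cpuacct.usage once per collection interval and dividing the delta by wall time times the slot's core count yields the efficiency number that ends up in Graphite.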
Accounting
A quick word about how we do accounting: we parse the schedd history files on the submit nodes with a daily cron job that dumps all records to
MySQL
Our dumping script is
history_dump.py.txt
This allows querying things like average job latency, job runtime, evictions, and efficiency on a per-experiment and per-time-period basis.
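As a sketch of the kind of aggregation this enables, here is a minimal, hypothetical summary over parsed history records. Each record is assumed to carry the job ClassAd timestamps QDate, JobStartDate, and CompletionDate plus an "experiment" tag derived from the accounting group; the actual work happens in SQL against the MySQL dump:

```python
from collections import defaultdict

def summarize_jobs(records):
    """Aggregate parsed schedd-history records into per-experiment averages.

    Latency is queue wait (start minus submit); runtime is completion minus
    start. Both are averaged per experiment.
    """
    sums = defaultdict(lambda: {"latency": 0.0, "runtime": 0.0, "n": 0})
    for r in records:
        s = sums[r["experiment"]]
        s["latency"] += r["JobStartDate"] - r["QDate"]
        s["runtime"] += r["CompletionDate"] - r["JobStartDate"]
        s["n"] += 1
    return {
        exp: {"avg_latency": s["latency"] / s["n"],
              "avg_runtime": s["runtime"] / s["n"]}
        for exp, s in sums.items()
    }
```

Grouping by a time bucket as well as by experiment gives the per-time-period breakdowns mentioned above.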
We would like to work on this accounting setup and perhaps replace it with an ELK stack that reads both startd and schedd history, enabling correlation between the two for better debugging as well as better accounting.
--
WilliamSK - 2018-09-24