HTCondor Monitoring at UW-Madison
Our monitoring is based around the output of the
condor_gangliad,
which polls the Collector every minute.
Ganglia
Our primary dashboard is the
"Miron dashboard",
which is focused on reporting the aggregated utilization of our HTCondor pool.
For example, we can see how many CPU cores are in use,
why some CPU cores are idle (e.g. machines have run out of allocatable memory),
and how many machines are in the draining state.
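As a rough illustration of the kind of aggregation behind such a dashboard, the sketch below sums CPU cores by slot state. It is not how the condor_gangliad computes its metrics. It assumes each slot ad is available as a plain dictionary carrying the standard `State` and `Cpus` attributes; the function name `cores_by_state` and the sample ads are hypothetical.

```python
from collections import defaultdict


def cores_by_state(slot_ads):
    """Sum the Cpus attribute of each slot ad, grouped by its State."""
    totals = defaultdict(int)
    for ad in slot_ads:
        # Ads missing an attribute are counted under "Unknown" / 0 cores.
        state = ad.get("State", "Unknown")
        totals[state] += int(ad.get("Cpus", 0))
    return dict(totals)


# Hypothetical sample data standing in for slot ads from a Collector.
sample_ads = [
    {"Name": "slot1@a.example", "State": "Claimed", "Cpus": 8},
    {"Name": "slot2@a.example", "State": "Unclaimed", "Cpus": 4},
    {"Name": "slot1@b.example", "State": "Drained", "Cpus": 16},
    {"Name": "slot1@c.example", "State": "Claimed", "Cpus": 2},
]

if __name__ == "__main__":
    for state, cpus in sorted(cores_by_state(sample_ads).items()):
        print(f"{state}: {cpus} cores")
```

In a real pool the ads would come from the Collector, for example through the HTCondor Python bindings, rather than from a hard-coded list.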
Todd has made available both our gangliad config and Ganglia dashboard config
as part of the materials for his
European HTCondor Workshop "Monitoring HTCondor" talk.
When our admins spot unusual behavior in our utilization graphs,
they can bring up utilization graphs of individual machines or users
to help narrow down the culprit(s).
Grafana
We (attempt to) forward many of the metrics we monitor with Ganglia
to a Graphite Carbon database.
For example,
the aggregated pool metrics from our Ganglia Miron dashboard are forwarded,
and we have
a Grafana dashboard that mirrors our Ganglia dashboard.
How to forward metrics to a Carbon database from Ganglia:
# Forward metrics to Carbon via UDP
# added to /etc/ganglia/gmetad.conf
carbon_server localhost
carbon_port 2003
carbon_protocol udp
graphite_prefix "ganglia"
(Warning, this will generate a lot of errors in your Carbon logs!)
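To check that Carbon is actually accepting UDP metrics on the port configured above, one option is to send a single test metric by hand. The sketch below uses Carbon's plaintext protocol (one `path value timestamp` line per metric). The metric name `ganglia.test.cpus_in_use` and the helper names are made up for illustration, and it assumes Carbon's UDP listener is enabled on localhost port 2003.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one metric as a newline-terminated Carbon plaintext line."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_udp(line, host="localhost", port=2003):
    """Send one plaintext line to Carbon over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical test metric; look for it under the "ganglia" prefix.
    send_udp(carbon_line("ganglia.test.cpus_in_use", 42))
```

Because UDP is connectionless, the send succeeds even when nothing is listening, so the only real confirmation is the metric showing up in Graphite or Grafana.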
Future monitoring
We are working on converting the
condor_gangliad to a more generic
condor_metricd
that not only sends raw metrics to Ganglia but can also send aggregated metrics to
Carbon,
InfluxDB, etc., and/or write them to disk as JSON files.
We are also working on a web-based "HTCondor View" tool that
reads and displays graphs from these JSON files.