HTCondor Monitoring at UW-Madison

Our monitoring is built around the output of condor_gangliad, which polls the Collector every minute.

Ganglia

Our primary dashboard is the "Miron dashboard", which reports the aggregated utilization of our HTCondor pool. For example, it shows how many CPU cores are in use, why some CPU cores are idle (e.g., because machines have run out of allocatable memory), and how many machines are in the draining state.
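
As a rough illustration of the kind of aggregation behind such a dashboard, the sketch below sums CPU cores per slot state. It is not how condor_gangliad is implemented: the function name `cores_by_state` and the sample data are hypothetical, and the slot ads are plain dictionaries standing in for machine ClassAds (using their `State` and `Cpus` attributes) rather than the result of a real Collector query.

```python
from collections import defaultdict


def cores_by_state(slot_ads):
    """Sum the Cpus attribute of each slot ad, grouped by its State."""
    totals = defaultdict(int)
    for ad in slot_ads:
        # A missing State is bucketed as "Unknown"; a missing Cpus counts as 0.
        totals[ad.get("State", "Unknown")] += int(ad.get("Cpus", 0))
    return dict(totals)


# Hypothetical sample data standing in for ads returned by the Collector.
sample_ads = [
    {"Name": "slot1@node1", "State": "Claimed", "Cpus": 8},
    {"Name": "slot2@node1", "State": "Unclaimed", "Cpus": 4},
    {"Name": "slot1@node2", "State": "Drained", "Cpus": 16},
    {"Name": "slot1@node3", "State": "Claimed", "Cpus": 2},
]

print(cores_by_state(sample_ads))
# {'Claimed': 10, 'Unclaimed': 4, 'Drained': 16}
```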

Todd has made both our gangliad config and our Ganglia dashboard config available as part of the materials for his European HTCondor Workshop talk, "Monitoring HTCondor".

When our admins spot unusual behavior in our utilization graphs, they can bring up utilization graphs of individual machines or users to help narrow down the culprit(s).

Grafana

We (attempt to) forward many of the metrics we monitor with Ganglia to a Graphite Carbon database. For example, the aggregated pool metrics from our Ganglia Miron dashboard are forwarded, and we have a Grafana dashboard that mirrors our Ganglia dashboard.

How to forward metrics to a Carbon database from Ganglia:

# Forward metrics to Carbon via UDP
# added to /etc/ganglia/gmetad.conf
carbon_server localhost
carbon_port 2003
carbon_protocol udp
graphite_prefix "ganglia"

(Warning: this will generate a lot of errors in your Carbon logs!)
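
To check that Carbon is receiving data independently of gmetad, a test metric can be sent by hand. The sketch below uses Carbon's plaintext protocol (one `path value timestamp` line per metric) over UDP. It assumes Carbon's UDP listener is enabled on port 2003 on the same host; the helper names and the metric name `ganglia.test.cpus_in_use` are made up for illustration.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one newline-terminated line of Carbon's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    """Send a single metric to Carbon as one UDP datagram (fire and forget)."""
    payload = format_metric(path, value, timestamp).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Hypothetical metric name under the "ganglia" prefix configured above.
    send_metric("ganglia.test.cpus_in_use", 42)
```

Because UDP is connectionless, `send_metric` returns without error even if nothing is listening, so confirm the result in Graphite or Grafana rather than relying on the exit status.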

Future monitoring

We are working on converting condor_gangliad into a more generic condor_metricd that not only sends raw metrics to Ganglia but can also send aggregated metrics to Carbon, InfluxDB, etc., and/or write them to disk as JSON files.
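
To make the "write them to disk" idea concrete, here is a minimal sketch of dumping one snapshot of aggregated metrics to a JSON file. The record layout (`timestamp` plus a `metrics` mapping), the function name `write_snapshot`, and the file name are purely assumptions for illustration; the actual condor_metricd output format is not described here.

```python
import json
import os
import tempfile
import time


def write_snapshot(metrics, path, timestamp=None):
    """Write one timestamped snapshot of aggregated metrics as a JSON file."""
    record = {
        "timestamp": int(time.time()) if timestamp is None else int(timestamp),
        "metrics": metrics,
    }
    # Write to a temporary file in the same directory, then rename it into
    # place, so a reader never sees a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(record, tmp)
    os.replace(tmp_path, path)
    return record


if __name__ == "__main__":
    # Hypothetical metric names and output file.
    write_snapshot({"cpus_in_use": 10, "cpus_idle": 4}, "pool_metrics.json")
```

Renaming a finished temporary file into place is a common choice when another process, such as a web front end, may read the file while it is being updated.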

We are working on a web-based "HTCondor View" tool that reads and displays graphs from these JSON files.

Topic revision: r3 - 2018-12-21 - JasonPattonExternal
 