HEPIX Monitoring Special Interest Group

The SIG was formed in 2016 with the following aims:

This page is a first attempt at creating a monitoring structure for groups. The initial idea was to set up a collaboration, but before any collaboration can take place we need a starting point. Please also provide contact information so others know whom to reach out to.

Type desired (T)

  1. Monitoring - Simple monitoring of the health of hosts, services, or the center.
  2. Performance measurement - Data collected to measure how services/programs are functioning.
  3. Both - A combination of the two types above. This could be provided by the same package or by multiple packages.

Effort (E)

  1. No people - There are no resources available to create new functions; the site wants something that just works.
  2. Part-time staff - The people involved would only be available to keep things running and provide some customization.
  3. Full-time staff - Research and development can be provided.

Possible solutions (This list is not intended to be comprehensive, since I do not know all possible solutions. Please update and modify as needed; I would like to create a straw man to start discussion.) The T and E markings give the minimum requirements for each install type.

  • Simple single package install T1 or T2, E1
    • Nagios
    • Cacti
    • Cloud - There are several cloud based companies that provide monitoring and/or performance packages for clusters.
    • netdata - https://github.com/firehol/netdata
  • Some number of packages with easy setup scripts T1 and T2, E1
    • Base ELK stack - This can be installed straight from the website packages. Configurations can be fairly basic. It can be installed on most machines and requires only Java.
    • LDMS - A complete set of data collection, analysis, and display tools that builds from a small number of programs. It was initially written for Cray systems but can be adapted to non-Cray systems.
  • Center-wide T3, E2 or E3
    • ELK + others (Please list packages within the description of the monitoring type.)
      • NERSC has one type of solution. We use collectd and custom scripts to collect data and send it through Logstash to RabbitMQ, and from there through Logstash to Elastic. oVirt and Docker are both used to help provide HA services, with Rancher providing Docker management. (I need to provide a full list of applications and links.) (Contact: clwhitney@lblNOSPAMPLEASE.gov)
    • LDMS + MySQL + custom scripts - NCSA currently uses this solution for Blue Waters. This mostly works with a good DBA. There are some growing/scaling pains.
  • General purpose data displays
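
The T/E markings above can be read as a small lookup table. The following Python sketch is only an illustration of that reading, not an agreed schema: the names MonType, Effort, Category and candidates are hypothetical, and it assumes that "T1 or T2" means a category covers one type at a time, that "T1 and T2" or "T3" means it covers both, and that the E marking is the minimum effort a site must have available.

```python
from dataclasses import dataclass
from enum import IntEnum


class MonType(IntEnum):
    MONITORING = 1   # T1
    PERFORMANCE = 2  # T2
    BOTH = 3         # T3


class Effort(IntEnum):
    NO_PEOPLE = 1    # E1
    PART_TIME = 2    # E2
    FULL_TIME = 3    # E3


@dataclass(frozen=True)
class Category:
    name: str
    covers_both: bool    # assumption: True when marked "T1 and T2" or "T3"
    min_effort: Effort   # assumption: the E marking is a minimum
    examples: tuple


# Straw-man categories copied from the list above.
CATEGORIES = (
    Category("Simple single package install", False, Effort.NO_PEOPLE,
             ("Nagios", "Cacti", "Cloud", "netdata")),
    Category("Some number of packages with easy setup scripts", True, Effort.NO_PEOPLE,
             ("Base ELK stack", "LDMS")),
    Category("Center-wide", True, Effort.PART_TIME,
             ("ELK + others", "LDMS + mysql + custom scripts")),
)


def candidates(wanted: MonType, available: Effort) -> list:
    """Return the names of categories a site could consider."""
    needs_both = wanted is MonType.BOTH
    return [
        c.name
        for c in CATEGORIES
        if available >= c.min_effort and (c.covers_both or not needs_both)
    ]
```

For example, a site that wants both monitoring and performance data but has no staff to spare would call candidates(MonType.BOTH, Effort.NO_PEOPLE) and be pointed at the packages with easy setup scripts.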

Collaboration efforts. These are projects that were started with the idea that others could participate. Please also list failed experiments here so others do not have to travel down the same path.

  • NERSC summer student program. We hired 4 summer students to generate dashboards and other graphs of performance data from the collected data. Issue: The students did not have a sufficient understanding of the underlying data and the structure of the data center, so they struggled to come up with meaningful dashboards. Mentor staff also did not have sufficient time to spend with the students to teach them what they needed to know. (Contact: clwhitney@lblNOSPAMPLEASE.gov)

Recipes.

  • Elastic and related Docker images for installation on Mac or Windows. Instructions
  • Elastic, netdata and syslog-ng setup by Fabien Wernli Instructions

Useful links

Software: This is a list of custom software that the sites develop. Other links should be included above.

Monitoring references:

  • Mailing lists/efforts
    • CUG (Cray User Group) HPC monitoring effort. Several labs/groups are working with Cray to define and obtain health and performance data from the Cray systems.
    • Cray Sonexion - There is an effort to gather Sonexion Lustre data and provide it to off-system resources. (Contact: clwhitney@lblNOSPAMPLEASE.gov)
  • Other sites
  • Conferences
    • Monitorama Portland
    • Elasticon SF

If you would like to participate in this activity, please (fill)

Meetings and Minutes

-- HelgeMeinhard - 2016-05-10

Topic revision: r8 - 2017-10-16 - FabienWernliExternal
 