MONIT feedback

ALICE

* How much does your experiment rely on the MONIT infrastructure?

  • EOS dashboards
  • IT services health checks in case of suspected problems
  • SAM A/R numbers

* For what tasks?

  • Given above

* Are you happy with it?

  • Yes, the system is stable

* What is missing in your view?

  • No new requirements

* Issues/bottlenecks?

  • No

* Any monitoring solutions which are used but not provided centrally (even commercial)?

* Other comments

ATLAS

* How much does your experiment rely on the MONIT infrastructure?

  • Grafana dashboards are critical for daily operations, to monitor jobs and data transfers
  • These dashboards are also critical for accounting: weekly/monthly/yearly reports of HS06, disk space, etc.
  • Some less-critical usage of Kibana for analytics purposes

* For what tasks?

  • See above

* Are you happy with it?

  • Yes, we are happy with the stability of the service and with the MONIT team's quick reactions to problems

* What is missing in your view?

  • The ability to export plots as images and to produce publication-quality figures

* Issues/bottlenecks?

  • No, these days the performance is satisfactory, even when loading years of historic data

* Any monitoring solutions which are used but not provided centrally (even commercial)?

  • An ATLAS Elastic/Kibana system at the University of Chicago is heavily used for analytics.
    • Also critical for squid monitoring
    • It generally provides (or at least did in the past) more flexibility and higher performance than MONIT at CERN.
  • awstats and mrtg provided by Frontier/squid group
  • Specialised monitoring using home-made scripts and e.g. rrdtool for static monitoring plots

* Other comments

  • Centrally managed dashboards are stable and well-maintained. Private or development dashboards tend to quickly get out of date or break, so they should be avoided and cleaned up regularly, especially when moving servers/platforms/software versions.
  • We meet once every two weeks with representatives of the MONIT team to discuss issues and requirements. It is very useful to have this regular direct communication channel.

CMS

* How much does your experiment rely on the MONIT infrastructure?

CMS relies heavily on the MONIT infrastructure (ES, InfluxDB, Grafana) for operations and accounting: basically all O&C subsystems store/visualise data in MONIT (submission infrastructure, sites, WMA, CRAB, Rucio and DDM, CMSweb, opportunistic resources). We also rely on the SAM, ETF, and HammerCloud infrastructure. These services are the basis of all our site and service monitoring and are critical to the operation of the CMS computing grid and the success of the experiment.
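
As a rough illustration of the producer side described above, the following is a minimal Python sketch of pushing a JSON document to the MONIT HTTP entry point; the endpoint, producer name, document type and payload fields are assumptions for illustration only and are not taken from this page.

    # Minimal sketch of a data producer sending one JSON document to MONIT.
    # Endpoint, producer name, document type and payload are hypothetical.
    import json
    import time

    import requests

    MONIT_ENDPOINT = "http://monit-metrics.cern.ch:10012/"  # assumed endpoint

    doc = {
        "producer": "cms",                     # hypothetical producer name
        "type": "wma_jobs",                    # hypothetical document type
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
        "data": {"site": "T2_CH_CERN", "running": 1234, "pending": 56},
    }

    # Assumption: the endpoint accepts a JSON list of documents.
    resp = requests.post(
        MONIT_ENDPOINT,
        data=json.dumps([doc]),
        headers={"Content-Type": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()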

* For what tasks?

To monitor the health, operational status, and use of our computing systems, to debug problems, and to produce historical and performance plots.

* Are you happy with it?

Yes, both with the infrastructure and with the support from the MONIT team.

* What is missing in your view?

We run our own instances of Prometheus+AlertManager (with VictoriaMetrics as backend storage). It would obviously be great to have these services provided centrally.
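
Since the Prometheus/VictoriaMetrics stack mentioned above is experiment-operated, the following minimal Python sketch shows how such a setup is typically queried through VictoriaMetrics' Prometheus-compatible HTTP API; the host name and metric/job labels are hypothetical.

    # Minimal sketch of an instant query against a VictoriaMetrics instance
    # via its Prometheus-compatible API; host, port and metric are hypothetical.
    import requests

    VM_URL = "http://cms-monit-vm.cern.ch:8428"  # assumed VictoriaMetrics endpoint

    resp = requests.get(
        f"{VM_URL}/api/v1/query",
        params={"query": 'up{job="cms_exporters"}'},
        timeout=10,
    )
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        print(series["metric"], series["value"])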

* Issues/bottlenecks?

Sometimes we experience performance issues with data of large cardinality, or sudden increases in data volume; these are usually handled together with the MONIT or messaging team to work on solutions. Typically such issues are related to problems on the experiment side (data producers), and very rarely to the MONIT infrastructure.

* Any monitoring solutions which are used but not provided centrally (even commercial)?

See above.

We also have several workflows running Spark jobs on HDFS data, and Sqoop jobs to dump DB tables, for example for Rucio space monitoring. These data are then visualised in Grafana and/or with custom HTML pages.
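
As an illustration of the kind of Spark workflow mentioned above, here is a minimal PySpark sketch that aggregates a (hypothetical) Sqoop dump of the Rucio replicas table per RSE; the HDFS path and column names are assumptions, not the actual CMS schema.

    # Minimal PySpark sketch: aggregate Rucio replica dumps on HDFS per RSE.
    # The input path and column names (RSE_ID, BYTES) are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("rucio_space_monitoring").getOrCreate()

    replicas = spark.read.parquet("hdfs:///project/monitoring/rucio/replicas/")

    space_per_rse = (
        replicas.groupBy("RSE_ID")
        .agg(F.sum("BYTES").alias("total_bytes"),
             F.count("*").alias("n_replicas"))
    )

    # The aggregated result could then be written out and pushed to MONIT/Grafana.
    space_per_rse.show(20, truncate=False)
    spark.stop()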

* Other comments

Sometimes it is not clear which CERN IT group (MONIT, messaging, HDFS, DB) needs to be contacted for support/information, and bouncing SNOW tickets back and forth among different groups can be time consuming. Ideally, a common interface between the CERN IT services/groups and the experiment would be useful. In general, SNOW tickets (this is not specific to MONIT, it is a general issue) are not practical when several groups/teams are involved, because their visibility is limited: people need to be added to the ticket manually and, depending on the settings, they cannot see its history.

LHCb

* How much does your experiment rely on the MONIT infrastructure?

    • Daily operations rely on a few Grafana dashboards, and only in a minor way.
    • The ongoing plan is to expand the current usage and make MONIT-exposed dashboards the most valuable source of information (also) for LHCb distributed computing.

* For what tasks?

  Grafana dashboards are used for:

    • Monitoring of IT-provided services (e.g. hosts, MySQL, ActiveMQ) used for the LHCb central services infrastructure
    • FTS transfer monitoring
    • LHCb Core SW (mostly CVMFS installations) [production level]
    • DIRAC distributed computing monitoring

* Are you happy with it?

    • The service is stable and rather easy to use and expand.

* What is missing in your view?

    • Nothing for now, but we will have a better idea once MONIT is used in production (see the answer to the first question above).

* Issues/bottlenecks?

    • No

* Any monitoring solutions which are used but not provided centrally (even commercial)?

    • Not that I am aware of

* Other comments

WLCG central operations

* How much does your experiment rely on the MONIT infrastructure?

    • Apart from standard monitoring tools like SiteMon, WLCG uses the MONIT infrastructure for accounting needs. The WAU and WSSA applications rely on MONIT.

* For what tasks?

    • See above

* Are you happy with it?

    • Yes

* What is missing in your view?

    • It would be good to be able to query the ES indices directly without going through a proxy
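
For illustration, a direct query against an ES index (without the Grafana proxy) could look like the minimal Python sketch below; the endpoint, index pattern, field name and credentials are hypothetical placeholders.

    # Minimal sketch of a direct Elasticsearch _search query, bypassing the proxy.
    # Endpoint, index pattern, timestamp field and credentials are hypothetical.
    import requests

    ES_URL = "https://monit-es.cern.ch:9203"   # assumed ES endpoint
    INDEX = "monit_prod_wlcg_raw_metric-*"     # assumed index pattern

    query = {
        "size": 10,
        "query": {"range": {"metadata.timestamp": {"gte": "now-1d"}}},
    }

    resp = requests.post(
        f"{ES_URL}/{INDEX}/_search",
        json=query,
        auth=("user", "password"),  # placeholder credentials
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json()["hits"]["total"])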

* Issues/bottlenecks?

    • No

* Any monitoring solutions which are used but not provided centrally (even commercial)?

* Other comments

-- JuliaAndreeva - 2022-10-05
