Grid Site Monitoring Questionnaire

In December 2006, in order to consolidate the mandate and understand the existing deployment of monitoring tools within the infrastructure, a questionnaire was circulated to all site administrator contact addresses registered for LCG in the Grid Operations Centre database. The results presented below were also summarised at the WLCG Collaboration Meeting Monitoring BOF.

Questions

1) What local fabric monitoring system do you use?

GridICE/Lemon
Nagios
Other (please specify)
None

2) Which Grid level sensors do you use?

which services are monitored
what values/metrics are measured

3) Who provided the sensors?

4) Is your fabric monitoring part of any regional/off-site monitoring framework?

who are you linked with
generally, how is this implemented

5) When you learn that something is wrong with the services at your site, what is the most frequent way you are informed?

looking in the local fabric or Grid monitoring system
getting a trouble ticket
getting a mail/telephone call from VOs/users
other (please specify)

6) Briefly describe what you see as your top 3 monitoring priorities to help improve your service reliability/availability

Replies

Over 200 sites were polled. Following a reminder, 34 responses were received (prior to 17 Jan 2007) and analysed. Due to variations in the detail and clarity of the responses, the following inevitably includes some approximation.

What local fabric monitoring system do you use?

The majority of respondents used a local monitoring framework, with a majority using multiple frameworks in combination. The count of sites in each category was:

Nagios: 22
GridICE/Lemon: 10
Other: 13 (in the majority of cases Ganglia, used in combination with GridICE/Lemon or Nagios)
None: 3

Which Grid level sensors do you use?

12 sites reported monitoring some Grid services, most commonly the CE and SE.

Who provided the sensors?

Excluding reports of the SAM sensors (described variously as gLite, LCG etc.), 6 sites reported using sensors supplied by ROCs: CE (2), AP (2) and IT (2).

Is your fabric monitoring part of any regional/off-site monitoring framework?

10 sites reported being part of a regional framework, but few details were provided as to implementation. (There was clearly duplication with the previous question.)

When you learn that something is wrong with the services at your site, what is the most frequent way you are informed?

Local monitoring: 21
Support Ticket: 10
Looking at SAM/GSTAT: 4
Direct from User/VO: 3
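
As a rough aid to reading these figures, the counts can be converted to shares of all mentions. The sketch below is illustrative only: the dictionary keys are shortened labels and the helper name `shares` is hypothetical. The four counts sum to more than the 34 replies analysed, which may mean that some sites named more than one channel or may reflect the approximation noted earlier; the source does not say which.

```python
# Counts copied from the list above; labels are shortened for illustration.
RESPONSES = 34  # replies analysed, as stated in the Replies section

channels = {
    "Local monitoring": 21,
    "Support ticket": 10,
    "Looking at SAM/GSTAT": 4,
    "Direct from user/VO": 3,
}


def shares(counts):
    """Return each count as a percentage of all mentions, to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_mentions = sum(channels.values())

if __name__ == "__main__":
    for name, pct in shares(channels).items():
        print(f"{name}: {pct}% of mentions")
    print(f"mentions: {total_mentions}, replies analysed: {RESPONSES}")
```

Percentages here are relative to the total number of mentions rather than to the number of responding sites, since the per-site breakdown is not available.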

Briefly describe what you see as your top 3 monitoring priorities to help improve your service reliability/availability

Due to the variety of response styles and levels of detail, it is hard to tabulate responses to this question. The following strong themes and repeated keywords were noted:

single view - common interface - global view
unified tools - repository
more/deeper diagnostics
more flexible – alarm levels
improved/reliable/redundant SAM
hardware/network monitoring

Despite the focus on monitoring, several sites highlighted non-monitoring priorities for improving reliability:

Working/debugged middleware
Reliable hardware
Experience/knowledge transfer

-- IanNeilson - 05 Mar 2007

Topic revision: r2 - 2007-03-07 - IanNeilson
