Present: Dario Barberis, Simone Campana, Alastair Dewhurst, Alessandro Di Girolamo, Dave Dykstra, Luis Linares, Stefan Roiser, Andrea Valassi
Not present: Alexandre Beche, Doug Benjamin, Barry Blumenfeld
Discussed the difficulty of changing the IP address of the squid monitoring machines (currently a single, virtual address), because a change affects every squid's Access Control List and the firewalls at a majority of sites, which must allow incoming UDP/SNMP queries from MRTG. The suggestion is to make the change this time so that all IP addresses at CERN are allowed, giving maximum flexibility in where the monitoring machines are located.
- Barry asked CMS T2 mailing list, very little objection so far
- Simone will ask at ATLAS Jamboree December 10/11
- Stefan will ask LHCb regarding CVMFS squid monitoring (and it would be good to include such squids in the MRTG monitoring)
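As a rough illustration of what allowing a whole address range rather than a single monitoring IP would mean on the site side, the sketch below checks whether a query's source address falls inside a list of allowed networks. The networks shown are placeholder documentation ranges (RFC 5737), not real CERN ranges, and the function and variable names are hypothetical.

```python
import ipaddress

# Placeholder networks taken from the RFC 5737 documentation ranges.
# These are NOT real CERN address ranges; substitute the agreed ranges.
ALLOWED_MONITOR_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed_monitor(source_ip, networks=ALLOWED_MONITOR_NETWORKS):
    """Return True if source_ip lies inside any of the allowed networks."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in networks)
```

With a range-based rule like this, the monitoring machines could move to any address inside the allowed networks without sites having to update their ACLs or firewalls again.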
SAM/SUM tests currently read from an XML file known as ATP
- The squid monitoring service could also use it, and probably should
- GOCDB & OIM are only for declaring services and up/down time
- ATP accepts VO-specific additions from another XML file; ATLAS generates this from AGIS
- Where would the XML file come from for other VOs? One possibility is to extend AGIS to be for all VOs
The user-based monitoring (SAM/SUM) uses the proxy port 3128, but MRTG monitoring uses port 3401. The names & IP addresses used for the same squid are also fairly often different, with one on a private network and one on the public network. There are definitely known cases of non-standard ports for MRTG monitoring (for example, servers running multiple squid processes), and there may also be cases of non-standard user ports. The user port is definitely different for the reverse-proxy squids used on servers, although that's not relevant to the SAM/SUM tests.
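One way to picture the consequence for any shared squid list is a record that keeps the user-facing and monitoring endpoints separate, with the usual ports as defaults that can be overridden. This is only a sketch; the class name, field names, and host names are hypothetical and do not reflect the ATP or AGIS schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SquidEntry:
    """Hypothetical record describing one squid as seen by both monitors."""
    user_host: str             # name used by jobs and SAM/SUM tests
    monitor_host: str          # name queried by MRTG, may be on another network
    user_port: int = 3128      # usual proxy port
    monitor_port: int = 3401   # usual SNMP port for MRTG

    def user_endpoint(self):
        return (self.user_host, self.user_port)

    def monitor_endpoint(self):
        return (self.monitor_host, self.monitor_port)


# Example: a second squid process on the same server, monitored on a
# non-standard SNMP port (the host names are made up).
second = SquidEntry("squid.private.example", "squid.public.example",
                    monitor_port=3402)
```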
Discussed SquidMonitoringTaskForceQuestions
- The failover mechanism isn't planned to change to reconcile the differences between the CMS & ATLAS experiments. There are failovers to the reverse-proxy squids used for all services, which should be monitored in a common way. This is not yet done for CVMFS stratum 1 or for ATLAS Frontier server squids, only for CMS Frontier (based on the awstats squid monitor). CMS also wants to monitor their centralized backup proxies this way.
- & 6. There's mostly a consensus that squids should be separately tested in SAM/SUM as a first-class service of their own, not only associated with CEs and servers as they are now, to help narrow down root causes when there's a problem. This implies they need to be defined in GOCDB/OIM. One possibility is to use the output of MRTG to determine if they are up or down, but if so, in order to avoid false alarms due to dropped UDP packets, it should look at them multiple times to see if a problem persists. Or we could have the squid monitoring machines generate another web page that only shows errors if there are multiple failures in a row and have the SAM/SUM tests use those.
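The "multiple failures in a row" idea can be sketched as follows, assuming the monitoring machine keeps a short history of poll results per squid. The function name, the history format, and the threshold of 3 are assumptions for illustration, not something agreed at the meeting.

```python
def persistent_failures(history, threshold=3):
    """Return the squids whose last `threshold` polls all failed.

    history maps a squid name to a list of booleans, oldest first,
    where True means the poll got an answer. A single dropped UDP
    packet gives one isolated False and does not trigger an error.
    """
    down = []
    for name, results in history.items():
        recent = results[-threshold:]
        # Require a full window of failures before reporting.
        if len(recent) == threshold and not any(recent):
            down.append(name)
    return sorted(down)


# Hypothetical poll history for three squids.
polls = {
    "squid-a": [True, False, False, False],   # persistent failure
    "squid-b": [False, True, False],          # isolated drops only
    "squid-c": [False, False],                # not enough samples yet
}
```

Here only `squid-a` would be listed as down; a SAM/SUM test reading such a summary would not alarm on the isolated drops seen for `squid-b`.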
We will continue discussing the questions at the next meeting in 2 weeks.