Squid Monitoring Task Force Questions

This is a list of questions from one of the messages in the thread for this task force. Please feel free to add your comments on each topic!

1 Deployment model?

Initial description:
- Andrea: We should accept the ATLAS and CMS Squid models as they are. ATLAS are happy with their deployment and failover mechanisms and will not change it. We should only discuss common (WLCG) monitoring for Squid services (which exist for both ATLAS and CMS). [DETAILS. In ATLAS each site (generally) has a local Squid and a primary Frontier server, as well as a failover Squid and a failover Frontier server (four combinations). In CMS IIUC there is a failover to a single Frontier and there is or will be a single failover Squid. The CMS model allows monitoring by checking the single failover node to see if some Squids failed and their connections failed over - but many of us think that Squid monitoring should be done by monitoring directly if each Squid is working or not].
- Dave: CMS is also moving toward a primary & backup frontier service and a primary & backup squid proxy service. It's just that there will be only 2 each worldwide, and the backup squid proxies will be closely monitored.
Further developments:
- The issue was discussed at the Squid Monitoring Task Force meeting of 2012 Nov 30.
Agreed: monitoring should apply to the existing deployment models; deployment models will not change to adapt to monitoring.
- Actions: nothing to do.

2 SAM tests of Squid services?

Initial description:
- Andrea: we should agree if we want to have SAM tests of Squid services. I suggest we should. In this case, we should add Squid services in GOCDB and OIM (now they only exist in AGIS which is ATLAS specific). This is the step that would make the Squid service WLCG compliant, similar to e.g. SRM or CE. [DETAILS. Broadly speaking, there are two sets of monitoring tests ATLAS is using at the moment for Squids/Frontier: the MRTG/awstats tests from frontier.cern.ch and the internal SAM tests of ATLAS testing WNs. MRTG both provides detailed plots for experts and is used for on/off monitoring of sites. These direct SAM tests of Squid nodes would sit somewhere in between or better replace/include the on/off MRTG tests, see below].
- Alastair: It's still not clear to me that we know what we want to actually be monitoring. While it's not perfect, I am happy with the information the ATLAS monitoring provides me with. I would like to add the ability to see if a squid service is in a downtime, but it doesn't need drastic changes. I am happy for the parts of ATLAS monitoring which are common with CMS to be done in one place. As far as I can see the commonality is only the MRTG monitoring. This does provide us with a very basic (up/down) measure as well as providing more detailed statistics that are useful for experts.
- Dave: Yes of course we'll have SAM tests, at least the ones we already have.
Further developments:
- The issue was discussed at the Squid Monitoring Task Force meeting of 2012 Nov 30.
Agreed: direct SAM tests of Squid services should be set up. Those Squids for which SAM tests are required must be declared in GOCDB.
- Action: add Squid services in GOCDB (see below).
- Action: implement SAM tests of Squid services (see below).
- Action: decide whether all Squids should be declared in GOCDB or not (see below).

3 Squids for Frontier and CVMFS: two different services in GOCDB/OIM?

Andrea: it makes sense to define SquidFrontier and SquidCvmfs as two separate services. The same nodes or clusters can be used for both services, but they should be declared in GOCDB/OIM, and appear in SAM tests, as two logically different entities (reasons could be that we would like to have specific tests for the two types, or just that they may have different criticalities for the experiments, e.g. CVMFS squids are vital for experiment A while the Frontier squids are less so).
Alastair: No, it doesn't make sense. As I said in my last email, the GOCDB/OIM is designed for sites to declare downtimes for their services. It has no ability to determine how that service is configured or used by the VO.
Dave: I'm not sure what I think about that.
Agreed on 2013-02-01: there will be only one service type in GOCDB/OIM.

4 Experiment-specific SAM tests via WN

Andrea: ATLAS will keep its WN-based SAM tests, as these tests are good to measure the usability of the site from the experiment perspective (I imagine CMS will also keep other internal tests). [DETAILS. The internal SAM tests are visible on the ATLAS dashboard "site status board" (SSB) http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview#currentView=Frontier_Squid2&highlight=false . Clicking on any site brings you to the dashboard Site Usability Monitor (SUM) http://dashb-atlas-sum.cern.ch/dashboard/request.py/latestresultssmry-sum#profile=ATLAS&group=All+sites&site[]=All+Sites&flavour[]=All+Service+Flavours&flavour[]=All+CE+flavours&metric[]=org.atlas.WN-FrontierSquid&status[]=All+Exit+Status . If you click there on the "OK" or "W" to the right of each site, you see the results of a test for that site, eg Triumf http://dashb-atlas-sum.cern.ch/dashboard/request.py/getMetricResultDetails?hostName=ce1.triumf.ca&flavour=CREAM-CE&metric=org.atlas.WN-FrontierSquid&timeStamp=2012-11-11T18:55:22Z . The test runs a simple job on the worker node (WN) trying to retrieve conditions from all combinations of Frontier and Squid servers and failovers: it is green if all are ok, yellow (warning) if one fails.]
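
The green/yellow logic of the WN test described above can be sketched roughly as follows. This is only an illustration, not the actual ATLAS test: the function name, the `probe` callable and the host names are hypothetical, and the source does not say what the test reports when every combination fails, so the sketch simply returns WARNING for any failure.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple


def wn_frontier_squid_status(
    servers: Sequence[str],
    proxies: Sequence[str],
    probe: Callable[[str, str], bool],
) -> Tuple[str, List[Tuple[str, str]]]:
    """Try every (Frontier server, Squid proxy) combination.

    `probe` is a hypothetical callable that returns True when conditions
    could be retrieved from `server` through `proxy`.
    Returns ("OK", []) if all combinations work, otherwise
    ("WARNING", [failed combinations]).
    """
    failed = [(s, p) for s, p in product(servers, proxies) if not probe(s, p)]
    # Assumption: any failure maps to WARNING (yellow); all-ok maps to OK (green).
    status = "OK" if not failed else "WARNING"
    return status, failed


if __name__ == "__main__":
    # Hypothetical primary/failover names, with a fake probe in which only
    # the failover-server/failover-squid path is broken.
    servers = ["frontier-primary.example.org", "frontier-failover.example.org"]
    proxies = ["squid-local.example.org", "squid-failover.example.org"]
    broken = ("frontier-failover.example.org", "squid-failover.example.org")
    print(wn_frontier_squid_status(servers, proxies, lambda s, p: (s, p) != broken))
```

Passing the probe in as a parameter keeps the combination logic separate from the actual retrieval of conditions, which a real test would perform with the Frontier client.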

5 Should we keep MRTG available for experts?

Andrea: we should keep MRTG detailed plots available for experts. I understand that both ATLAS and CMS think these may be useful and must be kept. But these should be made more WLCG-like, see below. [DETAILS. MRTG/awstats are visible from the ATLAS SSB http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteview#currentView=Frontier_Squid&highlight=false . Clicking on the Frontier tests gives awstats details/plots. Clicking on the Squid tests gives MRTG details/plots. About Squid/MRTG tests in particular, these are already used to give two pieces of information: the on/off status of each site, and the detailed plots for experts (by clicking on each site).]
Agreed: We should keep MRTG available for experts

6 Transform MRTG on/off ATLAS tests into SAM tests?

Initial description:
- Andrea: (Assuming that we agree that we need SAM tests for squids), we should transform the existing MRTG on/off tests (those that retrieve Squid version and uptime and use this to determine if the service is up) into SAM tests. This requires Squid services to be defined in GOCDB/OIM. But otherwise the machinery is largely there (what is used to give on/off results for Squids in the ATLAS SSB). [DETAILS. The SAM WN tests rely on the AGIS information system of ATLAS to know how Squids are configured at each site. IIUC AGIS contains only one service DNS alias, then individual nodes are hardcoded in MRTG. Eg for Triumf http://atlas-agis.cern.ch/agis/site/main/67/ shows Squid http://t1-squid.triumf.ca:3129. The xml sent by Alastair is consistent with this as it is produced/extracted from AGIS, http://atlas-agis-api-dev.cern.ch/request/atp/xml/. The ATLAS Frontier_Squid SSB already uses the MRTG information for on/off monitoring to show green/red boxes. The on/off monitoring is done by MRTG queries, aggregated for all ATLAS sites on http://frontier.cern.ch/squidstats/frontier-squid.data.]
- Dave: It might be helpful to have the MRTG detection of response available as a SAM test. I would combine it with the failover test because they're both determined at the monitoring server.
- Dave: the reliability of the on/off test should be improved. Since SNMP is UDP-based, sometimes the packets are simply lost. It should only downgrade status if it sees about 3 missed replies in a row.
- Barry: This does not seem particularly useful to me. The CMS SAM squid test already checks every squid at a grid site for functionality. Adding MRTG to this would only be a test on monitoring so it could only give a WARNING. Those messages are generally ignored by non-experts. I don't think they affect site status in CMS. In addition, there are some grid sites that refuse to allow MRTG monitoring and many non-grid sites that do have MRTG monitoring. The power of MRTG is that it runs every 5 minutes independently of the grid while SAM tests (which are necessary) can have many hours of latency.
Further developments:
- Andrea: during the Squid Monitoring Task Force meeting of 2012 Nov 30 it was mentioned that MRTG monitoring may give false alarms. We should take this into account (eg by implementing retrial in SAM as suggested by Simone or by changing the squid monitoring server to only report a problem after multiple missed responses as suggested by Dave).
- Agreed on 2013-02-01: there will be a SAM test for each registered squid service based on MRTG monitoring being down for a number of reporting periods in a row.
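
The idea of downgrading status only after several missed replies in a row can be sketched as a small counter. This is an illustration under stated assumptions, not the monitoring server's implementation: the class name is hypothetical, and the default of 3 follows Dave's "about 3 missed replies in a row", whereas the agreement only says "a number of reporting periods".

```python
from dataclasses import dataclass


@dataclass
class SquidReplyTracker:
    """Track consecutive missed monitoring replies for one squid."""

    max_missed: int = 3       # assumed threshold, taken from Dave's suggestion
    missed_in_a_row: int = 0

    def record(self, replied: bool) -> str:
        """Record one reporting period and return the resulting status."""
        if replied:
            # Any successful reply resets the streak, so isolated lost
            # UDP packets never accumulate into a false alarm.
            self.missed_in_a_row = 0
        else:
            self.missed_in_a_row += 1
        return "DOWN" if self.missed_in_a_row >= self.max_missed else "OK"


if __name__ == "__main__":
    tracker = SquidReplyTracker()
    for replied in [True, False, False, True, False, False, False]:
        print(replied, tracker.record(replied))
```

With this rule, two lost replies followed by a good one leave the status at OK; only the third consecutive miss reports DOWN.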

7 Squids in GOCDB: aliases or individual cluster nodes?

Andrea: We should declare Squids in GOCDB with a single URL and use other mechanisms to map these into individual nodes (eg for MRTG monitoring). Dynamic lookup with some caching may be the best option here. For both the on/off MRTG tests and the detailed MRTG tests (which can be exposed as links in the former), using GOCDB will make these tests more WLCG-like (than hardcoding nodes or using AGIS as presently done by CMS and ATLAS). In other words what I am suggesting is NOT to declare individual nodes anywhere (which will be very difficult to maintain), but to declare just the main service and discover dynamically the nodes behind. This is my answer to your question 2.
Dave: The problem with that is that it doesn't work for some cases. For example, places like CERN remove nodes that are down from round-robin aliases. Also some sites use non-standard ports for the MRTG monitoring, and sometimes squid machines run more than one squid on different ports.
Agreed on 2013-02-01: the information system will list aliases, and any additional information needed for monitoring will be on the squid monitoring server.
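
One possible way to combine the two sources of information is sketched below: the alias comes from the information system, and per-alias overrides held on the monitoring server take precedence when they exist (for non-standard ports, several squids per machine, or aliases that drop nodes that are down). The function names, the override format and the example hosts and ports are all hypothetical.

```python
import socket
from typing import Callable, Dict, List, Tuple

Target = Tuple[str, int]  # (host or IP address, monitoring port)


def resolve_alias(alias: str) -> List[str]:
    """Return the sorted, de-duplicated addresses currently behind a DNS alias."""
    infos = socket.getaddrinfo(alias, None, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def monitoring_targets(
    alias: str,
    default_port: int,
    overrides: Dict[str, List[Target]],
    resolver: Callable[[str], List[str]] = resolve_alias,
) -> List[Target]:
    """List the nodes to monitor for one registered alias.

    `overrides` stands for the additional information kept on the squid
    monitoring server (hypothetical format). When an alias has an entry
    there, it is used as-is; otherwise the alias is resolved dynamically
    and every address is monitored on `default_port`.
    """
    if alias in overrides:
        return list(overrides[alias])
    return [(addr, default_port) for addr in resolver(alias)]


if __name__ == "__main__":
    # Fake resolver so the example runs without network access.
    fake = lambda alias: ["192.0.2.1", "192.0.2.2"]
    extra = {"squid-b.example.org": [("node1.example.org", 3402), ("node1.example.org", 3403)]}
    print(monitoring_targets("squid-a.example.org", 3401, extra, resolver=fake))
    print(monitoring_targets("squid-b.example.org", 3401, extra, resolver=fake))
```

Keeping the overrides on the monitoring server means sites only register the alias, while the cases Dave lists are handled without declaring individual nodes in GOCDB/OIM.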

8 SAM tests of CVMFS?

Andrea: It may be that SAM tests of the availability of the CVMFS repository (e.g. ls /cvmfs/my_file) are also useful. But these are the responsibility of the CVMFS task force and we should not discuss them.

9 Which cluster for squid monitor machine?

Initial description:
- Andrea: Concerning your question 1 about which cluster should hold the monitoring machine, this is a point that we did not discuss and I have not thought much about. I have a vague understanding anyway that there are no "voWLCG" machines: generally ATLAS/CMS use the same tests from their own machines to monitor the same services (eg CE or SRM). Monitoring machines that monitor common ATLAS/CMS services do exist (one example being the perfsonar monitoring done from BNL?) but in the end are managed by either ATLAS or CMS, so maybe this is the category we fit in and the monitoring machine can remain a voCMS node?
Further developments:
- Dave: Simone told me that he and Ian Fisk discussed the choice of cluster and decided on having them be vocms machines, a contribution from CMS to WLCG. The public alias will be wlcg-squid-monitor.cern.ch. I requested from the CMS VO Coordinator that the virtual machines be allocated, and I updated the task force twiki with this agreement. It may take extra time for them to be allocated because they're a special request: they need to be on two different VM clusters yet still be able to share a virtual IP address. I also asked that at least one be on critical power.
Agreed: node wlcg-squid-monitor.cern.ch will be a CMS virtual vobox (contribution from CMS to WLCG).
- Actions: work needs to be done to set up this machine.

10 How should we monitor failovers at the central squids?

During the Squid Monitoring Task Force meeting of 2012 Nov 30 (and at other times) we discussed having an additional SAM test for sites that is based on worker node clients failing past the local squids and directly contacting central squids. Such failovers are currently detected at CMS Frontier launchpads based on awstats monitoring, although it is not yet reported in a SAM test. CMS is planning to extend this to monitor central backup proxies in the same way. The same mechanism could be applied to ATLAS Frontier launchpad reverse proxy squids (ATLAS has no central backup proxies) and to CVMFS stratum 1 server squids. Should we add a general SAM test for reporting significant failover activity from sites detected by the squid monitoring server?
Agreed: such a SAM/SUM test will be made, although it is a low priority for ATLAS and they don't currently plan to use it.
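
A rough sketch of how "significant failover activity" could be reported per site is given below. It is not the awstats-based mechanism mentioned above: it assumes the monitoring server already has a list of client addresses that contacted a central squid directly, plus a hypothetical mapping from site names to their network ranges, and it uses an arbitrary request-count threshold.

```python
from collections import Counter
from ipaddress import ip_address, ip_network
from typing import Dict, Iterable, List


def failover_counts(
    client_ips: Iterable[str],
    site_networks: Dict[str, List[str]],
) -> Counter:
    """Count direct connections to a central squid per site.

    `site_networks` maps a site name to its CIDR ranges (hypothetical input).
    Clients that match no known site are ignored.
    """
    nets = {site: [ip_network(c) for c in cidrs] for site, cidrs in site_networks.items()}
    counts: Counter = Counter()
    for ip in client_ips:
        addr = ip_address(ip)
        for site, site_nets in nets.items():
            if any(addr in net for net in site_nets):
                counts[site] += 1
                break  # attribute each request to at most one site
    return counts


def sites_with_significant_failover(counts: Counter, threshold: int) -> List[str]:
    """Return the sites whose direct-connection count reaches the threshold."""
    return sorted(site for site, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    networks = {"SITE_A": ["192.0.2.0/24"], "SITE_B": ["198.51.100.0/24"]}
    clients = ["192.0.2.10", "192.0.2.11", "192.0.2.10", "198.51.100.5"]
    counts = failover_counts(clients, networks)
    print(counts, sites_with_significant_failover(counts, threshold=2))
```

The threshold is what separates occasional direct contacts from a site whose local squids are not being used; its value would have to be tuned against real traffic.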

11 Should all Squids be declared in GOCDB?

Andrea: during the Squid Monitoring Task Force meeting of 2012 Nov 30 it was agreed that we should implement direct SAM tests of Squid services. This requires those services to be declared in GOCDB. However we also discussed many times that some sites may prefer not to be declared in GOCDB and may still want to receive MRTG monitoring, even if SAM tests are not possible.
Dave: the extra monitoring information on the monitoring server will support being able to monitor squids that are not registered in GOCDB or OIM.

-- AndreaValassi - 21-Nov-2012

Topic revision: r9 - 2013-02-07 - DaveDykstra
