LCG Web>WLCGCommonComputingReadinessChallenges>WLCGOperationsWeb>WLCGOpsCoordination>SAMquestionnaireSiteReplies (2018-01-20, MaartenLitmaath)


Site replies to SAM questionnaire
- CERN
- ASGC
- BNL
- CNAF
- FNAL
- IN2P3
- JINR
- KISTI
- KIT
- NDGF
- NL-T1
- NRC-KI
- PIC
- RAL
- TRIUMF


See Input regarding SAM tests and Site availability reports

CERN

Do you use SAM tests for operations? Yes
- If yes
  - do you use tests of the critical profiles of the LHC VOs? Yes
  - tests of other profiles
  - do you rely on SAM/ARGO/... results for any non-LHC VOs? yes, Ops
  - how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) daily
  - how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) ETF UI, Dashboard, soon API

Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
- what framework are you using? HC
- how different are those tests compared to SAM? HC runs experiment payload

Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? ...
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) When they come out
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
- to estimate the quality of the sites? Yes
- for reporting to funding agencies? Yes

ASGC

Do you use SAM tests for operations? Not exactly. We do rely on the SAM tests, but only as a baseline: if we cannot pass them, our services are definitely not in good shape, but even if we pass them, production jobs or transfers may still fail.
- If yes
  - do you use tests of the critical profiles of the LHC VOs? Yes, but they are not enough
  - tests of other profiles No
  - do you rely on SAM/ARGO/... results for any non-LHC VOs? No
  - how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) Daily
  - how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) Dashboard.

Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
- what framework are you using? We use Nagios to test our services, but it checks different aspects, e.g. hardware status, host certificate lifetime, etc.
- how different are those tests compared to SAM?

Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? I think it's good enough at this moment
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Monthly.
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
- to estimate the quality of the sites? Yes
- for reporting to funding agencies? Yes. It is a reasonable report, and we are fine with it.

BNL

Do you use SAM tests for operations? Yes
- If yes
  - do you use tests of the critical profiles of the LHC VOs? Yes
  - tests of other profiles Not much
  - do you rely on SAM/ARGO/... results for any non-LHC VOs? No
  - how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) Daily
  - how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) WLCG Dashboard

Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
- what framework are you using? Nagios, crontab, home-made test suites for individual services, and monitoring components that sometimes come with the service package itself
- how different are those tests compared to SAM? Very different. Actually this is the wrong question to ask: because SAM serves mainly for reporting to funding agencies, it should stay with the very basics, clear and simple. The monitoring and testing framework our site uses for operations, on the other hand, is more complicated, fine-grained and proactive. These two things should not be mixed or compared.
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? SAM does not cover one particular case very well: if site services go down unscheduled for reasons beyond the site's control, e.g. a natural disaster, the SAM numbers for both availability and reliability are affected. At least the reliability number should not be affected in this case. Can SAM handle this automatically, or should the site follow a procedure to correct it?
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) each month, as soon as they come out.
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
- to estimate the quality of the sites? Yes
- for reporting to funding agencies? This requires careful thought. But in the short to medium term, there is really no alternative method that is uniformly available.
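
BNL's remark about unscheduled outages can be illustrated with a small calculation. The sketch below assumes the commonly quoted definitions, in which availability is uptime over all known time and reliability additionally excludes scheduled downtime from the denominator. The function names and the hour figures are hypothetical and are not taken from any SAM tool or report.

```python
def availability(up_hours, total_hours, unknown_hours=0.0):
    """Uptime as a fraction of all known time (assumed definition)."""
    known = total_hours - unknown_hours
    return up_hours / known if known > 0 else 0.0


def reliability(up_hours, total_hours, scheduled_down_hours=0.0, unknown_hours=0.0):
    """Uptime as a fraction of known time outside scheduled downtime (assumed definition)."""
    known = total_hours - scheduled_down_hours - unknown_hours
    return up_hours / known if known > 0 else 0.0


# Hypothetical 30-day month (720 h) containing a single 48 h outage.
TOTAL = 720.0
OUTAGE = 48.0
up = TOTAL - OUTAGE

# Outage declared as scheduled downtime: only availability drops.
scheduled = (availability(up, TOTAL),
             reliability(up, TOTAL, scheduled_down_hours=OUTAGE))

# Same outage, but unscheduled: availability and reliability drop together.
unscheduled = (availability(up, TOTAL), reliability(up, TOTAL))

print("scheduled:   A=%.3f R=%.3f" % scheduled)
print("unscheduled: A=%.3f R=%.3f" % unscheduled)
```

Under these assumptions an unscheduled outage lowers both numbers equally, which is the case BNL describes. Excluding such an outage from the reliability figure would need either a downtime declaration or a manual correction.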

CNAF

FNAL

IN2P3

Do you use SAM tests for operations? Yes
- If yes
  - do you use tests of the critical profiles of the LHC VOs? Yes
  - tests of other profiles Sometimes but not often
  - do you rely on SAM/ARGO/... results for any non-LHC VOs? Yes, from the OPS VO
  - how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) Each day, twice
  - how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) Mainly via the WLCG Dashboard and the ETF Nagios UI, but we use the API to import results into our local monitoring

Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
- what framework are you using? Nagios
- how different are those tests compared to SAM? These tests focus more on the technical aspects of the hosts (port status, host service checks, file system checks, device status, ...)

Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Today, yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? Critical profiles of SAM tests (used for availability/reliability reports) should be based only on the services declared in GOCDB and should be defined/approved by WLCG. Other SAM test profiles can be defined to check other "views" such as package deployment and versions, specific configurations, security status, ...
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Each month
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
- to estimate the quality of the sites? Yes
- for reporting to funding agencies? Yes, if we can ensure that the SAM tests used to define the availability relate exclusively to the status of services (as defined in GOCDB)

JINR

Do you use SAM tests for operations? Yes
- If yes
  - do you use tests of the critical profiles of the LHC VOs? Yes, ATLAS and CMS
  - tests of other profiles Yes, ATLAS and CMS
  - do you rely on SAM/ARGO/... results for any non-LHC VOs? Yes, OPS SAM
  - how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) Several times a day
  - how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) ETF Nagios UI,VO-specific tools, like CMS SSB for example

Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
- what framework are you using? Local Nagios
- how different are those tests compared to SAM? Mostly for h/w and OS operations.

Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? No
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Never
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
- to estimate the quality of the sites? No
- for reporting to funding agencies? No

KISTI

Do you use SAM tests for operations? Yes
- If yes
  - do you use tests of the critical profiles of the LHC VOs? Yes. We use ALICE related profiles for T1: ALICE_CE, ALICE_VOBOX, AliEn-SE
  - tests of other profiles No
  - do you rely on SAM/ARGO/... results for any non-LHC VOs? We keep an eye on the ARGO Nagios tests for OPS, since we are an EGI site
  - how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) We have a dedicated team to monitor the SAM tests, since site availability is critical to our funding. Monitoring runs 24 hours a day, 7 days a week, with on-call duty for recovery.
  - how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) ETF Nagios UI and WLCG Dashboard + ALICE specific monitoring
Do you run tests other than the ones of SAM to check availability of your services? Not yet implemented, but we plan to do so.
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes. We think that current LHC VO specific SAM tests are designed to reflect the minimum requirements from LHC VOs for use of sites.
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? Detailed information on test results (both for success and failure) would help us to understand the system.
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) each month, as soon as they come out
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
- to estimate the quality of the sites? Yes
- for reporting to funding agencies? Yes. We might change when there is a better alternative to the current system.

KIT

Do you use SAM tests for operations? No, unless our local VO contacts notice irregularities
- If yes
  - how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) Once per week out of “tradition” (middleware meeting).
  - how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) WLCG Dashboard.

Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
- what framework are you using? Lots of bash/perl scripts supplying Icinga and Ganglia
- how different are those tests compared to SAM? Probably not very different, i.e. the same systems are tested, but with different parameters. Also, those tests are rarely VO-specific (since we support all four major WLCG experiment VOs).

Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? We think that, in its current form, there is room for improvement. As with every metric that tries to boil down complex systems to a single number between 0 and 1, there are always situations where the algorithm is unfit and/or unfair. All sites differ with regard to the services they provide to their individual user base, running on heterogeneous systems at varying scale with unequal budgets. In our opinion, what should matter most is the actual performance and throughput delivered by a site and how close that was to its obligation. SAM continuously tests the same things, namely that a site is able to fulfil certain tasks at any time (which obviously never comes close to the production workload). When SAM tests fail because a site is overloaded by production activity (e.g. SAM jobs are stuck in a queue and time out), then availability is lost despite normal progress. We have also seen that too many monitoring tasks consume a significant portion of resources just for testing workflows. Lastly, it is good that some SAM tests are marked as critical. Yet it is still possible that a site is mostly operational and, from a VO's perspective, deemed unusable at the same time because one out of a dozen critical SAM tests has failed. In our opinion, that is an unreasonable judgement.
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? Record whether a site is able to bear its burden, so to speak. It is illusory to expect that everything will work all the time. Focus on services that are in demand then and there. In case something is broken, notifications have to be generated quickly, and when a site shows effort to improve the situation in a timely manner, that should be acknowledged, too. Of course, CNAF has a period of 0 % SAM availability, but is that important considering their situation? In other words, SAM verifies that a site is able to meet expectations in theory at a certain point in time, which naturally is helpful to know. But for judging the “quality of a site”, the overall efficiency and accomplishments should weigh more heavily. SAM on its own is not enough.

How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Just casually
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports? Only if those are not bare numbers drawn from SAM. Granted, some kind of report is necessary to generate pressure and foster competitiveness among the sites. Some last words that we could not fit in previously, and that explain the rather bad reputation SAM has at our site: to our knowledge, there is no portal/API/source that is guaranteed to be available to fetch the SAM test results from (without VO-specific access or privileges). Once in the past we had the SAM test results incorporated in our alarms workflow, and because the source back then was broken for more than a week, our own monitoring was quite disturbed. In addition, the actual testing framework and results are not transparent enough for us to understand what is going on or going wrong.
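
KIT's point about a single failing critical test can be made concrete with a toy comparison. The sketch below assumes, as a simplification, that a site counts as usable only when every test in the critical profile passes. The test names, the status strings and both functions are hypothetical and do not reproduce the actual SAM algorithm.

```python
def site_ok_all_critical(results, critical):
    """Strict rule (assumed): the site is up only if every critical test is OK."""
    return all(results[name] == "OK" for name in critical)


def fraction_passing(results, critical):
    """Share of critical tests that passed, for comparison with the strict rule."""
    passed = sum(1 for name in critical if results[name] == "OK")
    return passed / len(critical)


# Hypothetical profile of a dozen critical tests, one of which fails.
critical = ["test_%02d" % i for i in range(12)]
results = {name: "OK" for name in critical}
results["test_07"] = "CRITICAL"

strict = site_ok_all_critical(results, critical)
share = fraction_passing(results, critical)

print("strict rule says site is up:", strict)
print("share of critical tests passing: %.2f" % share)
```

With these inputs the strict rule reports the site as down while eleven of twelve tests pass, which is the situation KIT calls an unreasonable judgement. A fractional score is shown only for contrast and is not something the replies propose in this form.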

NDGF

Do you use SAM tests for operations?
- If yes
  - do you use tests of the critical profiles of the LHC VOs? Yes. Our Nagios actually monitors the ATLAS monitoring, so we get alerts from that if ATLAS thinks we are down.
  - tests of other profiles No
  - do you rely on SAM/ARGO/... results for any non-LHC VOs? No
  - how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) It is done by Nagios every 5 minutes, and we look at the A/R metrics in a weekly meeting.
  - how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) API
Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
- what framework are you using? Local Nagios
- how different are those tests compared to SAM? They look at the individual dCache pools, not the whole system
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? It is nice to have the view from the project using their framework. Bugs have been found when our Nagios says all is green but ATLAS says no.
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Each month
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
- to estimate the quality of the sites? Yes
- for reporting to funding agencies? Yes

NL-T1

Do you use SAM tests for operations? No, with one exception. For certain classes of tests (I think the ops tests), we get a ticket from the NGI saying that a test is failing. We used to use the SAM tests for the experiments, but the transport of these through the Nagios network was discontinued, meaning that access to these tests relies on visiting a VO-specific Dashboard (correct me if I am wrong).

Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
- what framework are you using? Nagios
- how different are those tests compared to SAM? A quite comprehensive set of tests: many of the old HR.srce Nagios tests, EMI middleware tests, and our own tests on things like kernel versions, file system health, available disk space and memory, space left on CVMFS cache partitions, etc.

Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Not anymore. Usually when SAM says we are down, it’s wrong. It could be a good metric if the tests are better.
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? Either the test is good enough that both site and experiment look at it, or else drop it. Make it easy to import the test into a site Nagios: in general, an off-site dashboard will not get looked at by a site admin supporting 40 VOs.

How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Used to be never, until I found out how wrong they can be.
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
- to estimate the quality of the sites? Given a solution for the critical test, I think it is perfectly reasonable. This means that the tests should never say “down” when the site is in fact not down and the experiment is in fact using the site.
- for reporting to funding agencies? for us this question is less important, although you never know who is looking at these numbers.

NRC-KI

Do you use SAM tests for operations? Yes
- If yes
  - do you use tests of the critical profiles of the LHC VOs? Yes
  - tests of other profiles Sometimes
  - do you rely on SAM/ARGO/... results for any non-LHC VOs? Use for OPS tests
  - how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) Daily
  - how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) ETF Nagios UI, WLCG Dashboard.

Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
- what framework are you using? Local Nagios
- how different are those tests compared to SAM?

Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes, since they describe the site in terms of its functionality as the grid center.
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? No, the current set of tests is adequate.
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Monthly.
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
- to estimate the quality of the sites? Yes
- for reporting to funding agencies? Yes

PIC

Do you use SAM tests for operations? Yes
- If yes
  - do you use tests of the critical profiles of the LHC VOs? Yes. We use the Critical profiles of the LHC VOs supported in the PIC Tier-1: ATLAS, CMS and LHCb. Tests and recovery procedures are documented for the 24x7 shifter on duty at PIC. We also host a fraction of the Spanish ATLAS Tier-2. Since it is embedded in the system, failures in the Tier-2 also profit from the 24x7 support. Yes, we use the ATLAS critical profile.
  - tests of other profiles Yes, but not with 24x7 support. Experiment contacts react to problems on tests in those profiles, the individual checks that are not included in the CRITICAL profile. Typically, this support is in 8x5.
  - do you rely on SAM/ARGO/... results for any non-LHC VOs? Yes, We monitor and have integrated ARGO Nagios tests for OPS.
  - how often do you check the results of SAM tests We developed and offered to the community Nagios plugins for Experiment SAM tests and OPS tests. They are available here:
    https://github.com/jcasals/nagios-plugins-egi_ops
    https://github.com/jcasals/nagios-plugins-lcgsam
    All of the tests from CRITICAL profiles are integrated in our Nagios/Icinga system and we are notified of any failure as soon as it appears in the system. These failures are managed 24x7 by the PIC team. In case of a failure, we often check ETF and ARGO to re-submit tests and understand the failure. The experiment contacts often (daily) check the tests from other profiles.
  - how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) In-house plugins to insert results into the local Nagios/Icinga system; ETF; ARGO. We are subscribed to alerts from CMS SSB. We often check the WLCG Dashboard to understand errors, and to make historical plots of the reliability/availability of PIC and comparisons to other sites.
Do you run tests other than the ones of SAM to check availability of your services? Yes. Basic checks integrated in Nagios: mounted areas OK, service checks OK, etc.
If yes
- what framework are you using? Local Nagios
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes, although it might be improved and expanded with other metrics that are relevant and/or new (for example CMS Site Readiness or ATLAS Usability). We do need such estimators to keep track of the health of the sites.
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? It would be easier for us to debug and understand what is wrong if the SAM tests gave more logs when a test fails.
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) each month, as soon as they are available
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
- to estimate the quality of the sites? Yes
- for reporting to funding agencies? Yes (for both). It is an important ingredient for cross-comparing site quality and it is highly appreciated by evaluators.

RAL

Do you use SAM tests for operations? Yes
- If yes
  - do you use tests of the critical profiles of the LHC VOs? Yes
  - tests of other profiles Sometimes look at critical_full (e.g. for CMS)
  - do you rely on SAM/ARGO/... results for any non-LHC VOs? Use for OPS tests
  - how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) At least daily. We have Nagios tests which check the state of (some of the SAM) availabilities and in some cases call our on-call team if they are failing.
  - how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) All of these listed. Depends on what we wish to do. Some more useful for drilling down into problems, others better for checking status.

Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
- what framework are you using? Local Nagios
- how different are those tests compared to SAM? We run low level tests (nodes, services etc) as well as a limited number of local tests of functionality.

Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? The tests need to be a reasonable reflection of the type of task that the VO wishes to carry out on the storage, CE or whichever service.
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Monthly.
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
- to estimate the quality of the sites? Yes (of course, if someone has a better idea, we would be pleased to hear it).
- for reporting to funding agencies? Yes

TRIUMF

Do you use SAM tests for operations? Yes, we use the SAM test results for our 24x7 ops and paging around the clock, which is quite important, and they complement all of our other monitoring mesh. There is also a very strong correlation between SAM tests failing and real site issues, making them useful.
- If yes
  - do you use tests of the critical profiles of the LHC VOs? Yes, we use the ATLAS critical profile.
  - tests of other profiles No
  - do you rely on SAM/ARGO/... results for any non-LHC VOs? Yes, for example for OPS SAM tests.
  - how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) We check each day. Actually, as mentioned above, we use the SAM test results for our 24x7 ops and paging around the clock, so we have a local Nagios probe which checks every 15 minutes.
  - how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) API, ETF Nagios UI, and the WLCG Dashboard from time to time
Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
- what framework are you using? Local Nagios monitoring system and Ganglia system
- how different are those tests compared to SAM? Focusing more on node health, port check etc
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? It would be easier for us to debug and understand what is wrong if the SAM tests gave more logs when a test fails.
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Each month
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
- to estimate the quality of the sites? Yes
- for reporting to funding agencies? Yes
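
Several sites (NDGF, PIC, RAL, TRIUMF) describe a local Nagios probe that checks SAM results and pages the on-call team. The sketch below shows only the decision step of such a probe, using the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). How the results are fetched is left out on purpose: the input format, the key names and the `max_age_minutes` threshold are hypothetical, not part of any SAM or ETF interface.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def summarise(results, max_age_minutes=60):
    """Map a list of SAM test results to a Nagios exit code and message.

    `results` is a list of dicts with the hypothetical keys
    'test', 'status' and 'age_minutes'.
    """
    if not results:
        return UNKNOWN, "UNKNOWN - no SAM results available"

    # Stale results say nothing about the site, so do not raise an alarm on them.
    stale = [r["test"] for r in results if r["age_minutes"] > max_age_minutes]
    if stale:
        return UNKNOWN, "UNKNOWN - stale results: " + ", ".join(stale)

    failed = [r["test"] for r in results if r["status"] == "CRITICAL"]
    if failed:
        return CRITICAL, "CRITICAL - failing: " + ", ".join(failed)

    odd = [r["test"] for r in results if r["status"] not in ("OK", "WARNING")]
    if odd:
        return UNKNOWN, "UNKNOWN - unexpected status for: " + ", ".join(odd)

    warned = [r["test"] for r in results if r["status"] == "WARNING"]
    if warned:
        return WARNING, "WARNING - degraded: " + ", ".join(warned)

    return OK, "OK - all %d tests passed" % len(results)


# Hypothetical input: one fresh failure among fresh successes.
sample = [
    {"test": "ce-job-submit", "status": "OK", "age_minutes": 10},
    {"test": "se-put", "status": "CRITICAL", "age_minutes": 12},
]
code, message = summarise(sample)
print(code, message)
```

A real plugin would print the message and exit with the returned code. Reporting UNKNOWN for missing or stale data, rather than CRITICAL, is one way to avoid the situation KIT describes, where a broken results source disturbed the site's own alarms.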

Topic revision: r3 - 2018-01-20 - MaartenLitmaath
