do you use tests of the critical profiles of the LHC VOs? Yes
tests of other profiles
do you rely on SAM/ARGO/... results for any non-LHC VOs? yes, Ops
how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) daily
how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) ETF UI, Dashboard, soon API
Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
what framework are you using? HC
how different are those tests compared to SAM? HC runs experiment payload
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? ...
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) When they come out
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
to estimate the quality of the sites? Yes
for reporting to funding agencies? Yes
ASGC
Do you use SAM tests for operations? Not exactly. We do rely on the SAM tests, but only as a baseline: if we cannot pass the SAM tests, our services are definitely not in good shape, but even if we pass them, production jobs or transfers may still fail.
If yes
do you use tests of the critical profiles of the LHC VOs? Yes, but they are not enough
tests of other profiles No
do you rely on SAM/ARGO/... results for any non-LHC VOs? No
how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) Daily
how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) Dashboard.
Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
what framework are you using? We have Nagios to test our services, but it checks different aspects, e.g. hardware status, host certificate lifetime, etc.
how different are those tests compared to SAM?
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? I think it's good enough at this moment
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Monthly.
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
to estimate the quality of the sites? Yes
for reporting to funding agencies? Yes. It's a reasonable report, and we are fine with it.
BNL
Do you use SAM tests for operations? Yes
If yes
do you use tests of the critical profiles of the LHC VOs? Yes
tests of other profiles Not much
do you rely on SAM/ARGO/... results for any non-LHC VOs? No
how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) Daily
how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) WLCG Dashboard
Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
what framework are you using? Nagios, crontab, home-made test suites for individual services, and monitoring components that sometimes come with the service package itself
how different are those tests compared to SAM? Very different. Actually this is the wrong question to ask: because SAM serves mainly for reporting to funding agencies, it should stay with the very basics, clear and simple. The monitoring and testing framework our site uses for operations, on the other hand, is more complicated, fine-grained and proactive. These two things should not be mixed or compared.
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? SAM doesn’t cover one particular case very well: if site services go down unscheduled for uncontrollable reasons, e.g. a natural disaster, the SAM numbers are affected in both availability and reliability. At least the reliability number should not be affected in this case. Can SAM handle this automatically, or should the site follow a procedure to correct it?
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) each month, as soon as they come out.
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
to estimate the quality of the sites? Yes
for reporting to funding agencies? This requires careful thought. But in the short-to-medium term there is really no alternative method that is uniformly available.
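To illustrate the BNL remark on unscheduled outages, here is a minimal sketch of how such an outage enters the two numbers. It assumes the usual definitions (availability is uptime over the time with known status; reliability is the same, but with scheduled downtime also removed from the denominator). The function names and the 720-hour example month are hypothetical and are not taken from the SAM implementation.

```python
def availability(up_h, total_h, unknown_h=0.0):
    """Uptime divided by the time for which the status is known."""
    known = total_h - unknown_h
    return up_h / known if known > 0 else 0.0


def reliability(up_h, total_h, excused_down_h=0.0, unknown_h=0.0):
    """Uptime divided by known time minus excused (e.g. scheduled) downtime."""
    expected = total_h - unknown_h - excused_down_h
    return up_h / expected if expected > 0 else 0.0


# Hypothetical month: 720 h, 24 h scheduled downtime, 48 h unscheduled outage.
total_h, scheduled_h, unscheduled_h = 720.0, 24.0, 48.0
up_h = total_h - scheduled_h - unscheduled_h

avail = availability(up_h, total_h)
# Only the scheduled downtime is excused: the outage still lowers reliability.
rel_current = reliability(up_h, total_h, excused_down_h=scheduled_h)
# Variant suggested in the answer above: the uncontrollable outage is excused too.
rel_excused = reliability(
    up_h, total_h, excused_down_h=scheduled_h + unscheduled_h
)

print(f"availability          {avail:.3f}")
print(f"reliability (current) {rel_current:.3f}")
print(f"reliability (excused) {rel_excused:.3f}")
```

Under these assumptions the outage lowers both numbers unless it is explicitly excused, which is the correction procedure the answer asks about.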
CNAF
FNAL
IN2P3
Do you use SAM tests for operations? Yes
If yes
do you use tests of the critical profiles of the LHC VOs? Yes
tests of other profiles Sometimes but not often
do you rely on SAM/ARGO/... results for any non-LHC VOs? Yes from OPS VO
how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) Each day, twice
how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) Mainly via the WLCG Dashboard and the ETF Nagios UI, but we also use the API to import results into our local monitoring
Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
what framework are you using? Nagios
how different are those tests compared to SAM? These tests focus more on the technical aspects of the hosts (port status, host service checks, file system checks, device status, ...)
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Today Yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? The critical profiles of SAM tests (used for the availability/reliability reports) should be based only on the services declared in GOCDB and should be defined and approved by WLCG. Other SAM test profiles can be defined to check other "views", such as package deployment and versions, specific configurations, security status, ...
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Each month
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
to estimate the quality of the sites? Yes
for reporting to funding agencies? Yes, if we can ensure that the SAM tests used to define the availability relate exclusively to the status of services (as defined in GOCDB)
JINR
Do you use SAM tests for operations? Yes
If yes
do you use tests of the critical profiles of the LHC VOs? Yes, ATLAS and CMS
tests of other profiles Yes, ATLAS and CMS
do you rely on SAM/ARGO/... results for any non-LHC VOs? Yes, OPS SAM
how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) Several times a day
how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) ETF Nagios UI and VO-specific tools, for example CMS SSB
Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
what framework are you using? Local nagios
how different are those tests compared to SAM? Mostly for h/w and OS operations.
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? No
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Never
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
to estimate the quality of the sites? No
for reporting to funding agencies? No
KISTI
Do you use SAM tests for operations? Yes
If yes
do you use tests of the critical profiles of the LHC VOs? Yes. We use ALICE related profiles for T1: ALICE_CE, ALICE_VOBOX, AliEn-SE
tests of other profiles No
do you rely on SAM/ARGO/... results for any non-LHC VOs? We keep an eye on the ARGO Nagios tests for OPS since we are an EGI site
how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) We have a dedicated team monitoring the SAM tests, since site availability is critical to our funding. 24x7 monitoring and on-call duties for recovery are in operation.
how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) ETF Nagios UI and WLCG Dashboard + ALICE specific monitoring
Do you run tests other than the ones of SAM to check availability of your services? Not yet implemented, but we plan to do so.
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes. We think that current LHC VO specific SAM tests are designed to reflect the minimum requirements from LHC VOs for use of sites.
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? Detailed information on test results (both for success and failure) would help us to understand the system.
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) each month, as soon as they come out
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
to estimate the quality of the sites? Yes
for reporting to funding agencies? Yes. We might change when we have a better alternative to the current system.
KIT
Do you use SAM tests for operations? No, unless our local VO contacts notice irregularities
If yes
how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) Once per week out of “tradition” (middleware meeting).
how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) WLCG Dashboard.
Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
what framework are you using? Lots of bash/perl scripts supplying Icinga and Ganglia
how different are those tests compared to SAM? Probably not very different, i.e. the same systems are tested, but with different parameters. Also, those tests are rarely VO specific (since we support all four major WLCG experiment VOs).
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? We think that, in its current form, there is room for improvement. As with every metric that tries to boil down complex systems to a simple number between 0 and 1, there are always situations where the algorithm is unfit and/or unfair. All sites differ with regard to the services they provide to their individual user base, running on heterogeneous systems at variable scale with unequal budgets. In our opinion, what should matter most is the actual performance and throughput delivered by a site and how close that was to its obligation. SAM continuously tests the same thing: that a site is able to fulfil certain tasks at any time (which obviously never comes close to the production workload). When SAM tests fail because a site is overloaded by production activity (e.g. SAM jobs are stuck in a queue and time out), then availability is lost despite normal progress. We have also seen that too many monitoring tasks will consume a significant portion of resources just for testing workflows. Lastly, it is good that some SAM tests are marked as critical. Yet it is still possible that a site is mostly operational and (from a VO's perspective) is deemed unusable at the same time, because one out of a dozen critical SAM tests has failed. In our opinion, that’s an unreasonable judgement.
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? Record whether a site is able to bear its burden, so to speak. It is illusory to expect that everything will work all the time. Focus on services that are in demand then and there. In case something is broken, notifications have to be generated quickly, and when a site shows effort to improve the situation in a timely manner, that should be acknowledged, too. Of course, CNAF has a period of 0 % SAM availability, but is that important considering their situation? In other words, SAM verifies that a site is able to meet expectations in theory at a certain point in time, which naturally is helpful to know. But for judging the “quality of a site”, the overall efficiency and accomplishments should weigh more heavily – SAM on its own is not enough.
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Just casually
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports Only if those are not bare numbers drawn from SAM. Granted, some kind of report is necessary, to generate pressure and foster competitiveness among the sites. Some last words that we could not fit in previously and that explain the rather bad reputation SAM has at our site: to our knowledge, there is no portal/API/source that is guaranteed to be available to fetch the SAM test results from (w/o VO specific access or privileges). Once in the past we had the SAM test results incorporated in our alarms workflow, and because the source back then was broken for more than a week, our own monitoring was quite disturbed. On top of that, the actual testing framework and results are not transparent enough for us to understand what is going on or going wrong.
NDGF
Do you use SAM tests for operations?
If yes
do you use tests of the critical profiles of the LHC VOs? Yes. Our Nagios actually monitors the ATLAS monitoring, so we get alerts from it if ATLAS thinks we are down.
tests of other profiles No
do you rely on SAM/ARGO/... results for any non-LHC VOs? No
how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) It's done by Nagios every 5 minutes, and we look at the A/R metrics in a weekly meeting.
how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) API
Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
what framework are you using? Local Nagios
how different are those tests compared to SAM? They look at the individual dCache pools, not the whole system
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? It's nice to have the view from the project using their framework. Bugs have been found when our Nagios says all is green but ATLAS says no.
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Each month
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
to estimate the quality of the sites? Yes
for reporting to funding agencies? Yes
NL-T1
Do you use SAM tests for operations? No, with one exception. For certain classes of tests (I think the ops tests), we get a ticket from the NGI saying that this test is failing. We used to use the SAM tests for the experiments, but the transport of these through the Nagios network was discontinued, meaning that access to these tests relies on visiting a VO-specific Dashboard (correct me if I am wrong).
Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
what framework are you using? Nagios
how different are those tests compared to SAM? A quite comprehensive set of tests: many of the old HR.srce Nagios tests, EMI middleware tests, and our own tests on things like kernel versions, file system health, available disk space and memory, space left on CVMFS cache partitions, etc.
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Not anymore. Usually when SAM says we are down, it’s wrong. It could be a good metric if the tests are better.
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? Either the test is good enough so that both site and experiment look at it, or else drop it. Make it easy to import the test into a site Nagios; in general an off-site dashboard will not get looked at by a site admin supporting 40 VOs.
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Used to be never, until I found out how wrong they can be.
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
to estimate the quality of the sites? Given a solution for the critical test, I think it is perfectly reasonable. This means that the tests should never say “down” when the site is in fact not down and the experiment is in fact using the site.
for reporting to funding agencies? for us this question is less important, although you never know who is looking at these numbers.
NRC-KI
Do you use SAM tests for operations? Yes
If yes
do you use tests of the critical profiles of the LHC VOs? Yes
tests of other profiles Sometimes
do you rely on SAM/ARGO/... results for any non-LHC VOs? Use for OPS tests
how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) Daily
how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) ETF Nagios UI, WLCG Dashboard.
Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
what framework are you using? Local Nagios
how different are those tests compared to SAM?
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes, since they describe the site in terms of its functionality as the grid center.
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? No, the current set of tests is adequate.
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Monthly.
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
to estimate the quality of the sites? Yes
for reporting to funding agencies? Yes
PIC
Do you use SAM tests for operations? Yes
If yes
do you use tests of the critical profiles of the LHC VOs? Yes. We use the critical profiles of the LHC VOs supported at the PIC Tier-1: ATLAS, CMS and LHCb. Tests and recovery procedures are documented for the 24x7 shifter on duty at PIC. We also host a fraction of the Spanish ATLAS Tier-2. Since it is embedded in the system, failures in the Tier-2 also benefit from the 24x7 support. Yes, we use the ATLAS critical profile.
tests of other profiles Yes, but not with 24x7 support. Experiment contacts react to problems with tests in those profiles, i.e. the individual checks that are not included in the CRITICAL profile. Typically, this support is 8x5.
do you rely on SAM/ARGO/... results for any non-LHC VOs? Yes, we monitor and have integrated the ARGO Nagios tests for OPS.
how often do you check the results of SAM tests We developed and offered to the community Nagios plugins for the experiment SAM tests and the OPS tests. They are available here: https://github.com/jcasals/nagios-plugins-egi_ops https://github.com/jcasals/nagios-plugins-lcgsam All of the tests from the CRITICAL profiles are integrated in our Nagios/Icinga system and we are notified of any failure as soon as it appears in the system. These failures are managed 24x7 by the PIC team. In case of a failure, we often check ETF and ARGO to re-submit tests and understand the failure. The experiment contacts check the tests from other profiles often (daily).
how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) In-house plugins to insert results into the local Nagios/Icinga system; ETF; ARGO. We are subscribed to alerts from CMS SSB. We often check the WLCG Dashboard to understand errors, and to make historical plots of the reliability/availability of PIC and comparisons with other sites.
Do you run tests other than the ones of SAM to check availability of your services? Yes. Basic checks integrated in Nagios, mounted areas ok, service checks ok, etc...
If yes
what framework are you using? Local Nagios
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes, although it might be improved and expanded with other metrics that are relevant and/or new (for example CMS Site Readiness or ATLAS Usability). We do need such estimators to keep track of the health of the sites.
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? It would be easier for us to debug and understand what's wrong if the SAM tests gave more logs when a test fails.
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) each month, as soon as they are available
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
to estimate the quality of the sites? Yes
for reporting to funding agencies? Yes. Yes for both: it is an important ingredient for cross-comparing site quality, and it is highly appreciated by evaluators.
RAL
Do you use SAM tests for operations? Yes
If yes
do you use tests of the critical profiles of the LHC VOs? Yes
tests of other profiles Sometimes look at critical_full (e.g. for CMS)
do you rely on SAM/ARGO/... results for any non-LHC VOs? Use for OPS tests
how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) At least daily. We have Nagios tests which check the state of (some of) the SAM availabilities and in some cases call our on-call team if they are failing.
how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) All of these listed. Depends on what we wish to do. Some more useful for drilling down into problems, others better for checking status.
Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
what framework are you using? Local nagios
how different are those tests compared to SAM? We run low level tests (nodes, services etc) as well as a limited number of local tests of functionality.
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? The tests need to be a reasonable reflection of the type of task that the VO wishes to carry out on the storage, CE or whichever service.
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Monthly.
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports:
to estimate the quality of the sites? Yes (of course - if someone has a better idea then we would be pleased to hear it).
for reporting to funding agencies? Yes
TRIUMF
Do you use SAM tests for operations? Yes, we use the SAM test results for our 24x7 operations and around-the-clock paging, which is quite important, and they complement the rest of our monitoring mesh. There is also a very strong correlation between failing SAM tests and real site issues, which makes them useful.
If yes
do you use tests of the critical profiles of the LHC VOs? Yes, we use the ATLAS critical profile.
tests of other profiles No
do you rely on SAM/ARGO/... results for any non-LHC VOs? Yes, for example for OPS SAM tests.
how often do you check the results of SAM tests (daily, weekly, monthly, few times per year, only if some updates performed) We check each day. Actually, as mentioned above, we use the SAM test results for our 24x7 operations and around-the-clock paging, so we have a local Nagios probe which checks every 15 minutes.
how do you access them (API for importing to site local monitoring, ETF Nagios UI,WLCG Dashboard,VO-specific tools, other) API, ETF Nagios UI, and the WLCG Dashboard from time to time
Do you run tests other than the ones of SAM to check availability of your services? Yes
If yes
what framework are you using? Local Nagios monitoring system and Ganglia system
how different are those tests compared to SAM? They focus more on node health, port checks, etc.
Do you think site Availability/Reliability based on SAM tests is a good metric for site quality estimation? Yes
Any suggestions for changes in SAM tests to make them more realistic for site quality estimation? It would be easier for us to debug and understand what's wrong if the SAM tests gave more logs when a test fails.
How often do you check the WLCG availability reports (monthly as soon as they come out,once each few months,once per year, only if some updates performed, never) Each month
Do you think we (WLCG) should continue to rely on site Availability/Reliability reports: