WLCG 2015 Site Survey
Number of Virtual Organisations
The number of supported LHC and non-LHC virtual organisations is plotted below (histograms are stacked, not superimposed).
It can be seen that 41% of the Tier-1s and 43% of the Tier-2s support a single LHC VO, and that the average number of non-LHC VOs supported is 7.2.
Effort by service
Sites were asked to quantify the amount of effort they dedicate to operating several types of services and to other activities. It is understood that these estimates can be affected by a large uncertainty (including misinterpretation of the question, which was apparent in a few cases): utmost care must therefore be taken in drawing strong conclusions from them.
The average FTE and its error for each category are plotted below.
And in tabular format:
   Service / activity              Mean (FTE)  Error (FTE)
 1 Batch system                    0.27        0.03
 2 Worker node middleware          0.19        0.02
 3 Storage system                  0.68        0.10
 4 Networking                      0.31        0.04
 5 Computing elements              0.22        0.02
 6 perfSONAR                       0.11        0.01
 7 Local monitoring                0.28        0.03
 8 Squid for Frontier/CVMFS        0.11        0.02
 9 ARGUS or GUMS                   0.08        0.01
10 Information system              0.14        0.05
11 VO boxes                        0.09        0.02
12 Other Grid services             0.28        0.05
13 Giving support via tickets      0.22        0.04
14 Experiment contacts             0.29        0.06
15 WLCG meetings                   0.09        0.02
16 Participation in WLCG TFs/WGs   0.09        0.02
17 Testing new technologies        0.23        0.04
18 Other WLCG-related tasks        0.63        0.16
The sum of the averages is 4.5 FTE, that is, the average number of people working full time at a WLCG site; the average over all categories is 0.25 FTE.
Not unexpectedly, storage is the service requiring the largest amount of effort, 16% of the total, followed by "other" WLCG-related tasks (whose nature varies from site to site) and networking.
Experiment contact duties take about a third of an FTE, which is consistent with the fact that only CMS has a contact at every site, while the other experiments have contacts at Tier-1 sites at most.
Infrastructure services such as perfSONAR, Squid and ARGUS/GUMS require very little manpower.
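As a cross-check, the aggregate figures can be recomputed from the tabulated per-category means. This is a minimal sketch: the table values are rounded to two decimals, so the recomputed sum (about 4.3 FTE) and mean (about 0.24 FTE) differ slightly from the quoted 4.5 and 0.25 FTE, which presumably come from unrounded survey data; the storage share comes out at 16% either way.

```python
# Recompute the aggregate effort figures from the per-category mean
# FTE values quoted in the survey table above. Note: the tabulated
# values are rounded, so the recomputed totals differ slightly from
# the figures quoted in the text.

effort = {
    "Batch system": 0.27,
    "Worker node middleware": 0.19,
    "Storage system": 0.68,
    "Networking": 0.31,
    "Computing elements": 0.22,
    "perfSONAR": 0.11,
    "Local monitoring": 0.28,
    "Squid for Frontier/CVMFS": 0.11,
    "ARGUS or GUMS": 0.08,
    "Information system": 0.14,
    "VO boxes": 0.09,
    "Other Grid services": 0.28,
    "Giving support via tickets": 0.22,
    "Experiment contacts": 0.29,
    "WLCG meetings": 0.09,
    "Participation in WLCG TFs/WGs": 0.09,
    "Testing new technologies": 0.23,
    "Other WLCG-related tasks": 0.63,
}

total = sum(effort.values())                      # ~4.3 FTE per site
mean = total / len(effort)                        # ~0.24 FTE per category
storage_share = effort["Storage system"] / total  # ~16% of the total

print(f"total={total:.2f} FTE, mean={mean:.2f} FTE, "
      f"storage share={100 * storage_share:.0f}%")
```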
With a view to reducing the total amount of effort, several complementary approaches can be considered:
- eliminating the need for a service/activity (which may or may not require replacing it with a less time-consuming alternative)
- reducing the time spent on a service/activity
- reducing the number of sites
Comments from Jeremy
- Cost reduction:
- better trained staff (but training is a cost)
- better logging services and search capabilities
- ensuring efficient communications
- incident prevention by having well patched systems
- removing CSIRT contributions is a false economy, as they pay for themselves many times over when incidents occur
- FTE estimation reliability:
- tasks greatly differ from site to site due to supported experiments, size, amount of available effort, internal constraints, etc.
- this variation is much larger than systematic effects due e.g. to misinterpreted questions (a smaller effect) or to selective memory biased towards the recent past (a larger effect)
- relative fractions are much more important than the overall scale
- local leverage must be considered. Consolidating too much may cause loss of leveraged effort and increase facility spending (that can otherwise be absorbed by the local institute)
By Tier type
The above plots are separately available for
Tier-1s and
Tier-2s.
TODO
- Make separate plots for Tier-1s and Tier-2s (DONE)
- Measure how the effort scales with the size of the site (DONE)
- Estimate (not from the survey data) how the effort on central WLCG operations depends on sites (define exactly what "central operations" means)
Service upgrades and changes
Suggestions from sites on how to improve service upgrade operations
TO DO.
Communication
Preferred communication channels
Sites were asked to rank in order of importance the preferred communication channels to be used by WLCG central operations whenever a certain action is requested from a site (e.g. service upgrades, reconfigurations, etc.). The answers are summarised below.
Broadcasts and GGUS tickets are considered by far the best methods to communicate requests to sites, while operations meetings are far behind.
The recommendation is to ensure that requests to sites are always backed up by broadcasts, and also by tickets whenever appropriate, for example when it is important to track the progress of the request implementation. The drawback of tickets is that their tracking may generate a large amount of work for the person in charge of submitting and following up on them.
Recommendations from sites concerning communication between WLCG operations and sites
- Make clearer the difference between requests and requirements (x2)
- Important service requests should come from WLCG as formal requests, endorsed by the WLCG MB (x2)
- Have clearer deadlines
- Add a GGUS SU for WLCG operations
- Use more formal tracking of requests
- Further enhance the role of WLCG operations in making the experiments converge in their demands
- Maintain a list of VO and WLCG requests
- Produce a (bi-)weekly email summarizing interesting information for sites, more concise than the minutes, self-explanatory but with pointers to details
- Define and document all the ways in which WLCG can communicate with sites and define with checklists and standard procedures how the various operations at sites are to be done
- Have a high level view document of the direction of travel of WLCG, including introduction of new features required by the experiments and solutions to shortfalls, but without all the details; this would allow sites to know where they stand with respect to the plan
- Produce lower-level plans showing all the details of what is going to happen in the coming weeks
- This requires creating and updating timeline plans, checking progress and defining milestones
- One important idea is that of "transactions". For instance, updates to VOMS servers and their associated records in the Operations Portal should be viewed as a single transaction: one should never be changed without the other. This can be enforced via an SOP that makes it clear. Sites can then rely on the records in the Ops Portal being canonical and binding
Recommendations from sites concerning communication among sites
- Consolidate information in a single entry point to updated information about the middleware, known issues, recipes, and the experiment software, public and indexed by search engines (x4)
- Organize a face-to-face meeting at CERN between site operators, or an annual T1 jamboree (x2)
- Issue an electronic newsletter
- Have more site experience talks in WLCG workshops
- The twikis should be updated giving explicitly the version stamp of the software
- Create new e-groups to share information on site-specific services and/or issues
- Consolidate the planning function to improve the consistency of messages to sites, to avoid having many ways of managing, tracking and tracing operational developments, and to have consistent communications
- Avoid having important information hidden in tickets – if a problem/fix is identified, everyone should be notified
- Have fewer sources of information and more things coming from official channels
Other used, and desirable communication channels
- Mailing lists and their web archives, which must be searchable, for announcements and frequent communication, avoiding fragmentation (x4)
- HEPIX meetings (x4)
- Open and searchable wikis, provided they are carefully kept up to date for official documentation (x4)
- GDB (x2)
- Summaries with experience from sites
- Searches on GGUS tickets
- TF/WG
- CHEP
- WLCG broadcast list
- Creation of an open wlcg-discussion list for sites, to discuss new technologies, new services, deployments, validations, etc.
- Short and structured meetings, mailing lists and wikis, whose content stems from careful planning that prevents conflicting messages from being sent
Suggestions from sites on how to improve the WLCG operations coordination meeting
- The meeting is effective, the format is reasonable and it covers important issues (x2)
- Limit the duration to one hour and focus the meeting on a subset of issues (x2)
- Have the technical discussions about the new upgrades during the meetings (x2)
- Make the purpose of the meeting more clear
- Eliminate it and have just the GDB
- Start a bit later for US audiences -- i.e. 4 pm CERN time
- Add an ‘actions from/to sites’ section when the meeting minutes are circulated
- The minutes of the "WLCG operations coordination meeting" appear to be a kind of planning process. If so, they are very important. So, refine the results of this "planning meeting" and add them to the "centralised plan", so that any admin can see at a glance what is coming up. Also, add it to the "central documentation and training" repository. Furthermore, publish the SOP (standard operating procedure) that describes this "WLCG operations coordination" process, so that all site admins can know for sure how planning in the organisation is done and what its outputs are. Make sure the output of the meeting is not the minutes, but a plan!
Site reasons for not contributing currently to task forces or working groups
- Lack of manpower (x6)
- Timezone difference
- Task forces and working groups are the pioneers for our field of research and more participation is desirable
- Do not see them as useful
- Effort from site admins should rather go into maintaining a central documentation and training repository; moreover, the work of the task forces and working groups should be better advertised in a public forum
Site suggestions on how to improve GGUS
- Allow sites to close the tickets assigned to them rather than having to wait for the submitter’s response (x2)
- Allow the creator of a ticket to close it
- Suggested solutions or references to the support forum should be included in GGUS
- The "red" colour seems to be inappropriately applied even to tickets that are on hold with reminder dates in the far future and low priority. We believe that it should only be used as an incident tracker and not a project management tool.
- Add a billing system, even if based on virtual money
- Easy programmatic access to the current and historical content
- Specific documentation of all the functionalities of the tool (sometimes difficult to know to whom to escalate a problem, or how to do it...)
- Not a GGUS-related issue, but some VO-twiki pages relate to functionalities on GGUS which are deprecated. For example, the WLCG procedure to open a ticket when SAM tests need a correction point to a non-existing Support Unit
- Sites normally receive tickets which are not very well detailed. A tutorial or examples on the GGUS webpage on “how to” open a ‘good ticket’ would be appreciated
- Search on content, for example with a query builder or an SQL window where one can dredge out any info based on whatever criteria
- A few tweaks to the UI (like not risking losing your post if the previous post you're replying to has been modified)
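One of the requests above is easy programmatic access to current and historical ticket content. Purely as an illustration of what such an interface could look like, the sketch below builds a query URL for a hypothetical REST endpoint; the host, path and parameter names are all invented for this example and do not correspond to any interface GGUS is stated to provide.

```python
# Sketch of programmatic ticket queries against a *hypothetical*
# GGUS-like REST API. The endpoint and parameter names are invented.
from urllib.parse import urlencode

BASE = "https://ggus.example.org/api/tickets"  # hypothetical endpoint


def build_ticket_query(site, status="open", since=None):
    """Build a URL querying tickets assigned to a given site."""
    params = {"assigned_site": site, "status": status}
    if since:
        # Restrict to tickets modified after a date, e.g. "2015-01-01"
        params["modified_since"] = since
    return BASE + "?" + urlencode(params)


url = build_ticket_query("RAL-LCG2", status="open", since="2015-01-01")
print(url)
```

A site could then feed such queries into its local monitoring, addressing both the "programmatic access" and the "search on content" suggestions.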
Monitoring
Site suggestions on how to improve SAM
- Have a self-explaining output to allow troubleshooting (x3)
- Provide a better documentation for SAM tests where the exact operations being tested are explained so the site can replicate the test (x2)
- Send SAM tests via pilot jobs to avoid prioritisation issues on the batch system (x2)
- Send email notifications to the site when a critical test is failing (x2)
- Reduce the rate of false positives (e.g. tests pass but the corresponding functionality is broken) and negatives (e.g. failures due to central problems) (x3)
- Link to the test documentation in the test output
- Generate RSS feeds from Nagios
- Have more meaningful error codes in SAM tests, linked to the possible solution (instead of having to rely on search engines)
- Document how to rerun a SAM test “on demand”
- Document how to interface a local Nagios with SAM tests, with the ability to use certificates for authorisation
- Increase the test frequency
- Adapt SAM tests to modern protocols (avoid using the gLite WMS and SRM when not relevant)
- Allow sites to choose the level of criticality for each test
- Interface the SAM framework with the CERN GNI monitoring, to make its use for daily monitoring of service operations more practical
- Have a consistent meaning of the profiles across VOs (CRITICAL, CRITICAL_FULL, etc.)
- Use more realistic (=longer) timeouts for SRM tests
- Have a central home page that links all test result pages that pertain to a site
Suggestions from sites on how to improve site monitoring in general
- Make more use of the experiment monitoring, often far more useful than the “standard” tools (x2)
- Make sure that the relevant monitoring metric returns OK within a short time after the problem has been fixed
- Use browser plug-ins (e.g. Nagios Checker for Firefox) for a real-time view
- Make sure that all relevant monitoring portals (including by the experiments) are advertised and made easily accessible
- Avoid spending time solving problems that affect SAM tests but not real operations
- Do not impose on sites how they should monitor their infrastructure (cf. perfSONAR)
- Use data analytics
- Provide a minimal dashboard showing the overall status of sites by VO
- Fully exploit the local Nagios probes and the log parser alerts, which can be more useful than the central monitoring
- Provide a set of diagnostic programs that an admin can download (e.g. via github or CVMFS) and run manually to determine status. This set should test all functionality required by the experiments. In the event of failures, the package should automatically create a diagnostic tarball and send it to a central service for central debugging (x2)
- For the WLCG Tier-1 reports, in addition to the SAM metrics, show also additional information such as: production/analysis jobs success rates, production/analysis CPU efficiency, transferred data volumes and data transfer quality (import/export), LHCOPN saturation (in hours per month) for each of the supported experiments. Most of this can be easily gathered from the experiments and it could be shown as a monthly executive summary
- Complete and automate all the resource accounting used in the WLCG reports
- Use tools like Rob Fay's "testnodes" suite, which tests WNs for black-hole conditions and up-to-dateness, putting them offline if necessary; the spacemon suite for alerting on full disks; and the VomsSnooper suite for imposing and checking the validity of VOMS records
- Use SAM as a single entry point to see all relevant monitoring information for the VO, such that the site is green if and only if the VO can use it
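The downloadable diagnostic suite suggested above could be sketched as follows. This is a minimal illustration, not an existing WLCG tool: the single check, the check names and the output path are placeholder assumptions, and a real suite would test the full functionality required by the experiments (CVMFS, squid, storage access, etc.).

```python
# Minimal sketch of a site diagnostic runner: execute a set of checks,
# collect the results, and bundle them into a tarball that could be
# sent to a central service for debugging. The "dns" check below is a
# placeholder for real experiment-functionality tests.
import io
import json
import socket
import tarfile


def check_hostname_resolves():
    """Placeholder check: verify the local hostname resolves."""
    try:
        socket.gethostbyname(socket.gethostname())
        return True, "hostname resolves"
    except OSError as exc:
        return False, str(exc)


CHECKS = {"dns": check_hostname_resolves}


def run_diagnostics(tarball_path):
    """Run all checks and bundle the results into a gzipped tarball."""
    results = {name: {"ok": ok, "detail": detail}
               for name, (ok, detail) in
               ((n, f()) for n, f in CHECKS.items())}
    payload = json.dumps(results, indent=2).encode()
    with tarfile.open(tarball_path, "w:gz") as tar:
        info = tarfile.TarInfo("diagnostics/results.json")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return results


results = run_diagnostics("/tmp/site-diagnostics.tar.gz")
print(results)
```

Each check would map to one piece of required functionality, so a failing entry in the tarball points directly at the broken service.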
Service administration
The following plots show how easy (or difficult) sites find several aspects in the operations of the most important WLCG-related services.
Comments from sites about WLCG operations in general
- Provide clear instructions for new institutes wanting to join WLCG as a new site, without assuming any background in Grid computing (x2)
- The forums we get the most out of are HEPiX and our weekly regional Tier-2 ops meeting, because the information comes either from other sites or from our Tier-1. Information flows better through the Tier-1 than via a broadcast, as there is someone we know whom we can ask questions and who can either answer them or pass them up the chain
- The number of services not really required for operations should be reduced (e.g. glexec, perfSONAR): WLCG services should be simpler and fewer (x2)
- The WLCG approach is too bureaucratic and the survey is too long, with too much pressure on sites to fill it in
- The lack of proper support for ARGUS (x2) and other tools (Torque/Maui) is a concern
- Many issues with service administration were caused by YAIM and its less than optimal implementations for the various packages; moreover, its future is unclear (x2)
- The LSF integration into Grid services is sub-optimal
- The survey didn't really address time spent managing hardware, user documentation, or Xrootd and FTS explicitly
- WLCG should play a bigger role in networking, as this really crosses experimental boundaries
- After all these years, WLCG remains an EU-centric organization and often lags on technology innovation and support. E.g. experiments themselves drove the innovation in data federations, and poured enormous resources into supporting the WLCG monitoring group (rather than the other way around), and still are responsible for software deployment
- The future support of services post-EGI/EMI is unclear
- Sites are regularly requested by experiments to provide ‘special features’ not specified on the VO-cards. This should be avoided
- The XrootD deployment and upgrades need further attention. It is not clear to sites which plugins to install, where to find support, and where the plugins/packages are located. The whole XrootD deployment is in the hands of the VOs and it should rather be addressed by WLCG Ops
- It is hard for sites which joined only recently to really understand what WLCG is, what it does, with whom it interfaces, where it is going and how everything works
- The WLCG goals should be consolidated into a published timeline of events, showing who is responsible for what and when, keeping track of events and updating the plan accordingly, and making all this accessible to all sites
- Any material that is not an "event" or an "activity" should go into documentation, e.g. functional and non-functional requirements, expectations, goals, etc.
Suggestions from sites on the adoption of new technologies
- WLCG should invest in containerisation of middleware services to reduce the workload on sites (x2)
- Have Puppet modules officially maintained by middleware/service operators (x2)
- Run Grid services and farm nodes on a cloud infrastructure (e.g. Openstack) (x3)
- WLCG should better define a roadmap for Cloud Computing adoption
- When considering new technologies, stick with industry-standard solutions, except in those cases in which there is a very good reason why this is not possible
- Support volunteer computing techniques also as a possible solution for small Tier-2 sites, needing much less effort than a full-fledged Grid site
- Run many-core jobs
- Have HTTP-based storage federations (to move away from HEP specific solutions like xrootd)
- Consider SDN (software defined networks)
- Organise a yearly European HTCondor workshop with developers in Europe
- WLCG should provide tested and optimised disk images compatible with any cloud infrastructure. Ideally, running a WLCG site should mean running a few (1-3) virtual machines with WLCG-provided images, configured via a small and well documented configuration file written by the site admin, plus the virtual machines (also based on WLCG-provided images) for WNs, storage servers, etc.
- WLCG should move to a model whereby site admins are no longer required to install local services beyond core cluster functionality: no more CEs, WN software, information services, or unreliable and poorly supported software used by a small community. Site admins should use only widely adopted standard tools and software: nothing more than providing an ssh account and a container edge service which could be used to host additional needed services (squid cache, etc.)
- WLCG should keep the barrier of entry for potential opportunistic resource providers at an absolute minimum: provide a Grid model leveraging container technologies such as Docker or Rocket, where site admins deploy a simple server and a trusted, centrally managed service agent deploys the required services as run-time apps
- WLCG oversees all of the services/middleware deployed. The project should ‘recommend’ and ‘unrecommend’ some of them, based on the mid- to long-term requirements from the experiments
- Sites are losing the capacity to control services (remote FTS3 servers, XrootD redirectors, job submission through pilots which mixes analysis/production activities, …)
- ElasticSearch/Logstash/Kibana is getting popular in WLCG. WLCG might want to gather logstash templates for WLCG services and make them available to sites through a central repository
- A large number of sites are looking into CEPH as a storage technology. Among many pros, it enables SSD caching out of the box. This could be a game changer for more efficient WAN transfers now that uplinks are being upgraded to the next order (100 Gbps). So supporting CEPH as an SE technology could be very beneficial
- The excellent work implementing ATLAS multicore on ARC/Condor should be completed with good documentation!
- The use of github and the sharing of solutions among sites, for "official" middleware and other installed software, is taking off and should be further encouraged in WLCG
Maria D.'s bullet points for slides
These are notes from the drupal results' view. If the plots and spreadsheet lead to another conclusion, please ignore.
- Domains we are doing well:
- The Middleware (MW) releases' frequency
- WLCG support (incl. documentation)
- The repositories' trend (all known ones are used; EPEL is the most popular)
- Site-WLCG_Ops communication
- Info share across sites
- The channels we use are good and sufficient (email, meetings, wikis etc ...)
- Our TFs and WGs are considered useful
- GGUS is very much appreciated as a support tool
- The 3pm Ops call is considered useful by the T0/T1s and the frequency is fine.
- The SAM test results are complete and reliable
- Batch systems' installation, upgrade, operation, troubleshooting is mostly considered Normal or Rather Easy.
- Domains we need to improve:
- Documentation and repository location discovery
- Upgrade procedures
- The role of WLCG Operations between the Experiments and their collaborating Sites
- The content of the fortnightly WLCG Ops Coord meeting (66/101 sites Never or Rarely attend). Maybe add a cross-sites communication item in the agenda?
- Invest some effort to improve the SAM documentation
- Batch systems' support is considered problematic
- Overall remarks
- Sites mostly rely on real jobs and transfers to assess their performance
- The majority of replies about all MW components are in the Normal or Rather Easy columns
Pepe's slides presented at the Amsterdam GDB on 2015/03/11 HERE
Alessandra's slides presented at the Okinawa WLCG Workshop on 2015/04/11 HERE
Andrea's slides presented at the Okinawa CHEP 2015 on 2015/04/13 HERE
--
AndreaSciaba - 2015-02-24