WLCG 2015 Site Survey
Number of Virtual Organisations
The number of supported LHC and non-LHC virtual organisations is plotted below (histograms are stacked, not superimposed).
It can be seen that 41% of the Tier-1s and 43% of the Tier-2s support a single LHC VO, and that the average number of non-LHC VOs supported is 7.2.
Effort by service
Sites were asked to quantify the amount of effort they dedicate to operating several types of services and to other activities. It is understood that these estimates can be affected by a large uncertainty (including misinterpretation of the question, which was apparent in a few cases): utmost care must therefore be taken in drawing strong conclusions from them.
The average FTE and its error for each category are plotted below.
And in tabular format:
   Service / activity              Mean (FTE)  Error (FTE)
 1 Batch system                    0.27        0.03
 2 Worker node middleware          0.19        0.02
 3 Storage system                  0.68        0.10
 4 Networking                      0.31        0.04
 5 Computing elements              0.22        0.02
 6 perfSONAR                       0.11        0.01
 7 Local monitoring                0.28        0.03
 8 Squid for Frontier/CVMFS        0.11        0.02
 9 ARGUS or GUMS                   0.08        0.01
10 Information system              0.14        0.05
11 VO boxes                        0.09        0.02
12 Other Grid services             0.28        0.05
13 Giving support via tickets      0.22        0.04
14 Experiment contacts             0.29        0.06
15 WLCG meetings                   0.09        0.02
16 Participation in WLCG TFs/WGs   0.09        0.02
17 Testing new technologies        0.23        0.04
18 Other WLCG-related tasks        0.63        0.16
The sum of the averages is 4.5 FTE, that is, the average number of people working full time at a WLCG site; the average over all categories is 0.25 FTE.
Not unexpectedly, storage is the service requiring the largest amount of effort, 16% of the total, followed by "other" WLCG-related tasks (whose nature varies from site to site) and networking.
Experiment contact duties take about a third of an FTE, which is consistent with the fact that only CMS has a contact at every site, while the other experiments have contacts at Tier-1 sites at most.
Infrastructure services such as perfSONAR, Squid and ARGUS/GUMS require very little manpower.
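As a cross-check, the aggregate figures can be recomputed from the tabulated per-category means. This is a minimal sketch: the table values are rounded to two decimals, so the recomputed sum (about 4.3 FTE) and mean (about 0.24 FTE) differ slightly from the quoted 4.5 and 0.25 FTE, which presumably come from unrounded survey data; the storage share comes out at 16% either way.

```python
# Recompute the aggregate effort figures from the per-category mean
# FTE values quoted in the survey table above. Note: the tabulated
# values are rounded, so the recomputed totals differ slightly from
# the figures quoted in the text.

effort = {
    "Batch system": 0.27,
    "Worker node middleware": 0.19,
    "Storage system": 0.68,
    "Networking": 0.31,
    "Computing elements": 0.22,
    "perfSONAR": 0.11,
    "Local monitoring": 0.28,
    "Squid for Frontier/CVMFS": 0.11,
    "ARGUS or GUMS": 0.08,
    "Information system": 0.14,
    "VO boxes": 0.09,
    "Other Grid services": 0.28,
    "Giving support via tickets": 0.22,
    "Experiment contacts": 0.29,
    "WLCG meetings": 0.09,
    "Participation in WLCG TFs/WGs": 0.09,
    "Testing new technologies": 0.23,
    "Other WLCG-related tasks": 0.63,
}

total = sum(effort.values())                      # ~4.3 FTE per site
mean = total / len(effort)                        # ~0.24 FTE per category
storage_share = effort["Storage system"] / total  # ~16% of the total

print(f"total={total:.2f} FTE, mean={mean:.2f} FTE, "
      f"storage share={100 * storage_share:.0f}%")
```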
With a view to reducing the total amount of effort, several complementary approaches can be considered:
- eliminating the need for a service/activity (which may or may not require replacing it with a less time-consuming alternative)
- reducing the time spent on a service/activity
- reducing the number of sites
Comments from Jeremy
- Cost reduction:
- better trained staff (but training is a cost)
- better logging services and search capabilities
- ensuring efficient communications
- incident prevention by having well patched systems
- removing CSIRT contributions is a false economy, as they pay for themselves many times over when incidents occur
- FTE estimation reliability:
- tasks greatly differ from site to site due to supported experiments, size, amount of available effort, internal constraints, etc.
- this variation is much larger than systematic effects due e.g. to misinterpreted questions (a smaller effect) or to selective memory biased towards the recent past (a larger effect)
- relative fractions are much more important than the overall scale
- local leverage must be considered. Consolidating too much may cause loss of leveraged effort and increase facility spending (that can otherwise be absorbed by the local institute)
By Tier type
The above plots are separately available for
Tier-1s and
Tier-2s.
TODO
- Make separate plots for Tier-1s and Tier-2s (DONE)
- Measure how the effort scales with the size of the site (DONE)
- Estimate (not from the survey data) how the effort on central WLCG operations depends on sites (define exactly what "central operations" means)
Service upgrades and changes
Suggestions from sites on how to improve service upgrade operations
TO DO.
Communication
Preferred communication channels
Sites were asked to rank in order of importance the preferred communication channels to be used by WLCG central operations whenever a certain action is requested from a site (e.g. service upgrades, reconfigurations, etc.). The answers are summarised below.
Broadcasts and GGUS tickets are considered by far the best methods to communicate requests to sites, while operations meetings are far behind.
The recommendation is to ensure that requests to sites are always backed up by broadcasts, and also by tickets whenever appropriate, for example when it is important to track the progress of the request implementation. The drawback of tickets is that their tracking may generate a large amount of work for the person in charge of submitting and following up on them.
Recommendations from sites concerning communication between WLCG operations and sites
- Make clearer the difference between requests and requirements (x2)
- Important service requests should come from WLCG as formal requests, endorsed by the WLCG MB (x2)
- Have clearer deadlines
- Add a GGUS SU for WLCG operations
- Use more formal tracking of requests
- Further enhance the role of WLCG operations in making the experiments converge in their demands
- Maintain a list of VO and WLCG requests
- Produce a (bi-)weekly email summarizing interesting information for sites, more concise than the minutes, self-explanatory but with pointers to details
- Define and document all the ways in which WLCG can communicate with sites and define with checklists and standard procedures how the various operations at sites are to be done
- Have a high level view document of the direction of travel of WLCG, including introduction of new features required by the experiments and solutions to shortfalls, but without all the details; this would allow sites to know where they stand with respect to the plan
- Produce lower-level plans showing all the details of what is going to happen in the coming weeks
- This requires creating and updating timeline plans, checking progress and defining milestones
- One important idea is that of "transactions". For instance, updates to VOMS servers and their associated records in the Operations Portal should be viewed as a single transaction: one should never be changed without the other. This can be enforced via an SOP that makes it clear. Sites can then rely on the records in the Ops Portal being canonical and binding
Recommendations from sites concerning communication among sites
- Consolidate information in a single entry point to updated information about the middleware, known issues, recipes, and the experiment software, public and indexed by search engines (x4)
- Organize a face-to-face meeting at CERN between site operators, or an annual T1 jamboree (x2)
- Issue an electronic newsletter
- Have more site experience talks in WLCG workshops
- The twikis should be updated giving explicitly the version stamp of the software
- Create new e-groups to share information on site-specific services and/or issues
- Consolidate the planning function to improve the consistency of messages to sites, to avoid having many ways of managing, tracking and tracing operational developments, and to have consistent communications
- Avoid having important information hidden in tickets – if a problem/fix is identified, everyone should be notified
- Have fewer sources of information and more things coming from official channels
Other used, and desirable communication channels
- Mailing lists and their web archives, which must be searchable, for announcements and frequent communication, avoiding fragmentation (x4)
- HEPIX meetings (x4)
- Open and searchable wikis, provided they are carefully kept up to date for official documentation (x4)
- GDB (x2)
- Summaries with experience from sites
- Searches on GGUS tickets
- TF/WG
- CHEP
- WLCG broadcast list
- Creation of an open wlcg-discussion list for sites, to discuss new technologies, new services, deployments, validations, etc.
- Short and structured meetings, mailing lists and wikis, whose content stems from careful planning that prevents conflicting messages from being sent
Suggestions from sites on how to improve the WLCG operations coordination meeting
- The meeting is effective, the format is reasonable and it covers important issues (x2)
- Limit the duration to one hour and focus the meeting on a subset of issues (x2)
- Have the technical discussions about the new upgrades during the meetings (x2)
- Make the purpose of the meeting more clear
- Eliminate it and have just the GDB
- Start a bit later for US audiences -- i.e. 4 pm CERN time
- Add an ‘actions from/to sites’ section when the meeting minutes are circulated
- The minutes of the "WLCG operations coordination meeting" appear to be a kind of planning process. If so, they are very important. So, refine the results of this "planning meeting" and add them to the "centralised plan", so that any admin can see at a glance what is coming up. Also, add it to the "central documentation and training" repository. Furthermore, publish the SOP (standard operating procedure) that describes this "WLCG operations coordination" process, so that all site admins can know for sure how planning in the organisation is done and what its outputs are. Make sure the output of the meeting is not the minutes, but a plan!
Site reasons for not contributing currently to task forces or working groups
- Lack of manpower (x6)
- Timezone difference
- Task forces and working groups are the pioneers for our field of research and more participation is desirable
- Do not see them as useful
- Effort from site admins should rather go into maintaining a central documentation and training repository; moreover, the work of the task forces and working groups should be better advertised in a public forum
Site suggestions on how to improve GGUS
- Allow sites to close the tickets assigned to them rather than having to wait for the submitter’s response (x2)
- Allow the creator of a ticket to close it
- Suggested solutions or references to the support forum should be included in GGUS
- The "red" colour seems to be inappropriately applied even to tickets that are on hold with reminder dates in the far future and low priority. We believe that it should only be used as an incident tracker and not a project management tool.
- Add a billing system, even if based on virtual money
- Easy programmatic access to the current and historical content
- Specific documentation of all the functionalities of the tool (sometimes difficult to know to whom to escalate a problem, or how to do it...)
- Not a GGUS-related issue, but some VO-twiki pages relate to functionalities on GGUS which are deprecated. For example, the WLCG procedure to open a ticket when SAM tests need a correction point to a non-existing Support Unit
- Sites normally receive tickets which are not very well detailed. A tutorial or examples on the GGUS webpage on “how to” open a ‘good ticket’ would be appreciated
- Search on content, for example with a query builder or an SQL window where one can dredge out any info based on whatever criteria
- A few tweaks to the UI (like not risking losing your post if the previous post you're replying to has been modified)
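One of the requests above is easy programmatic access to current and historical ticket content. Purely as an illustration of what such an interface could look like, the sketch below builds a query URL for a hypothetical REST endpoint; the host, path and parameter names are all invented for this example and do not correspond to any interface GGUS is stated to provide.

```python
# Sketch of programmatic ticket queries against a *hypothetical*
# GGUS-like REST API. The endpoint and parameter names are invented.
from urllib.parse import urlencode

BASE = "https://ggus.example.org/api/tickets"  # hypothetical endpoint


def build_ticket_query(site, status="open", since=None):
    """Build a URL querying tickets assigned to a given site."""
    params = {"assigned_site": site, "status": status}
    if since:
        # Restrict to tickets modified after a date, e.g. "2015-01-01"
        params["modified_since"] = since
    return BASE + "?" + urlencode(params)


url = build_ticket_query("RAL-LCG2", status="open", since="2015-01-01")
print(url)
```

A site could then feed such queries into its local monitoring, addressing both the "programmatic access" and the "search on content" suggestions.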
Monitoring
Site suggestions on how to improve SAM
- Have a self-explaining output to allow troubleshooting (x3)
- Provide a better documentation for SAM tests where the exact operations being tested are explained so the site can replicate the test (x2)
- Send SAM tests via pilot jobs to avoid prioritisation issues on the batch system (x2)
- Send email notifications to the site when a critical test is failing (x2)
- Reduce the rate of false positives (e.g. tests pass but the corresponding functionality is broken) and negatives (e.g. failures due to central problems) (x3)
- Link to the test documentation in the test output
- Generate RSS feeds from Nagios
- Have more meaningful error codes in SAM tests, linked to the possible solution (instead of having to rely on search engines)
- Document how to rerun a SAM test “on demand”
- Document how to interface a local Nagios with SAM tests, with the ability to use certificates for authorisation
- Increase the test frequency
- Adapt SAM tests to modern protocols (avoid using the gLite WMS and SRM when not relevant)
- Allow sites to choose the level of criticality for each test
- Interface the SAM framework with the CERN GNI monitoring, to make its use for daily monitoring of service operations more practical
- Have a consistent meaning of the profiles across VOs (CRITICAL, CRITICAL_FULL, etc.)
- Use more realistic (=longer) timeouts for SRM tests
- Have a central home page that links all test result pages that pertain to a site
Suggestions from sites on how to improve site monitoring in general
- Make more use of the experiment monitoring, often far more useful than the “standard” tools (x2)
- Make sure that the relevant monitoring metric returns OK within a short time after the problem has been fixed
- Use browser plug-ins (e.g. Nagios Checker for Firefox) for a real-time view
- Make sure that all relevant monitoring portals (including by the experiments) are advertised and made easily accessible
- Avoid spending time solving problems that affect SAM tests but not real operations
- Do not impose on sites how they should monitor their infrastructure (cf. perfSONAR)
- Use data analytics
- Provide a minimal dashboard showing the overall status of sites by VO
- Fully exploit the local Nagios probes and the log parser alerts, which can be more useful than the central monitoring
- Provide a set of diagnostic programs that an admin can download (e.g. via github or CVMFS) and run manually to determine status. This set should test all functionality required by the experiments. In the event of failures, the package should automatically create a diagnostic tarball and send it to a central service for central debugging (x2)
- For the WLCG Tier-1 reports, in addition to the SAM metrics, show also additional information such as: production/analysis jobs success rates, production/analysis CPU efficiency, transferred data volumes and data transfer quality (import/export), LHCOPN saturation (in hours per month) for each of the supported experiments. Most of this can be easily gathered from the experiments and it could be shown as a monthly executive summary
- Complete and automate all the resource accounting used in the WLCG reports
- Use tools like Rob Fay's "testnodes" suite, which tests WNs for black-hole conditions and up-to-dateness, putting them offline if necessary; the spacemon suite for alerting on full disks; and the VomsSnooper suite for imposing and checking the validity of VOMS records
- Use SAM as a single entry point to see all relevant monitoring information for the VO, such that the site is green if and only if the VO can use it
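The downloadable diagnostic suite suggested above could be sketched as follows. This is a minimal illustration, not an existing WLCG tool: the single check, the check names and the output path are placeholder assumptions, and a real suite would test the full functionality required by the experiments (CVMFS, squid, storage access, etc.).

```python
# Minimal sketch of a site diagnostic runner: execute a set of checks,
# collect the results, and bundle them into a tarball that could be
# sent to a central service for debugging. The "dns" check below is a
# placeholder for real experiment-functionality tests.
import io
import json
import socket
import tarfile


def check_hostname_resolves():
    """Placeholder check: verify the local hostname resolves."""
    try:
        socket.gethostbyname(socket.gethostname())
        return True, "hostname resolves"
    except OSError as exc:
        return False, str(exc)


CHECKS = {"dns": check_hostname_resolves}


def run_diagnostics(tarball_path):
    """Run all checks and bundle the results into a gzipped tarball."""
    results = {name: {"ok": ok, "detail": detail}
               for name, (ok, detail) in
               ((n, f()) for n, f in CHECKS.items())}
    payload = json.dumps(results, indent=2).encode()
    with tarfile.open(tarball_path, "w:gz") as tar:
        info = tarfile.TarInfo("diagnostics/results.json")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return results


results = run_diagnostics("/tmp/site-diagnostics.tar.gz")
print(results)
```

Each check would map to one piece of required functionality, so a failing entry in the tarball points directly at the broken service.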
Service administration
The following plots show how easy (or difficult) sites find several aspects in the operations of the most important WLCG-related services.
Comments from sites about WLCG operations in general
- Provide clear instructions for new institutes wanting to join WLCG as a new site, without assuming any background in Grid computing (x2)
- The forums we get the most out of are HEPiX and our weekly regional Tier-2 ops meeting, because the information comes either from other sites or from our Tier-1. Information flows better through the Tier-1 than via a broadcast, as there is someone we know whom we can ask questions and who can either answer them or pass them up the chain
- The number of services not really required for operations should be reduced (e.g. glexec, perfSONAR): WLCG services should be simpler and fewer (x2)
- The WLCG approach is too bureaucratic and the survey is too long, with too much pressure on sites to fill it in
- The lack of proper support for ARGUS (x2) and other tools (Torque/Maui) is a concern
- Many issues with service administration were caused by YAIM and its less than optimal implementations for the various packages; moreover, its future is unclear (x2)
- The LSF integration into Grid services is sub-optimal
- The survey didn't really address time spent managing hardware, user documentation, or Xrootd and FTS explicitly
- WLCG should play a bigger role in networking, as this really crosses experimental boundaries
- After all these years, WLCG remains an EU-centric organization and often lags on technology innovation and support. E.g. experiments themselves drove the innovation in data federations, and poured enormous resources into supporting the WLCG monitoring group (rather than the other way around), and still are responsible for software deployment
- The future support of services post-EGI/EMI is unclear
- Sites are regularly requested by experiments to provide ‘special features’ not specified on the VO-cards. This should be avoided
- The XrootD deployment and upgrades need further attention. It is not clear to sites which plugins to install, where to find support, and where the plugins/packages are located. The whole XrootD deployment is in the hands of the VOs and it should rather be addressed by WLCG Ops
- It is hard for sites which joined only recently to really understand what WLCG is, what it does, with whom it interfaces, where it is going and how everything works
- The WLCG goals should be consolidated into a published timeline of events, showing who is responsible for what and when, keeping track of events and updating the plan accordingly, and making all this accessible to all sites
- Any material that is not an "event" or an "activity" should go into documentation, e.g. functional and non-functional requirements, expectations, goals, etc.
Suggestions from sites on the adoption of new technologies
- WLCG should invest in containerisation of middleware services to reduce the workload on sites (x2)
- Have Puppet modules officially maintained by middleware/service operators (x2)
- Run Grid services and farm nodes on a cloud infrastructure (e.g. Openstack) (x3)
- WLCG should better define a roadmap for Cloud Computing adoption
- When considering new technologies, stick with industry-standard solutions, except in those cases in which there is a very good reason why this is not possible
- Support volunteer computing techniques also as a possible solution for small Tier-2 sites, needing much less effort than a full-fledged Grid site
- Run many-core jobs
- Have HTTP-based storage federations (to move away from HEP specific solutions like xrootd)
- Consider SDN (software defined networks)
- Organise a yearly European HTCondor workshop with developers in Europe
- WLCG should provide tested and optimised disk images compatible with any cloud infrastructure. Ideally, running a WLCG site should mean running a few (1-3) virtual machines with WLCG-provided images, configured via a small and well documented configuration file written by the site admin, plus the virtual machines (also based on WLCG-provided images) for WNs, storage servers, etc.
- WLCG should move to a model whereby site admins are no longer required to install local services beyond core cluster functionality: no more CEs, WN software, information services, or unreliable and poorly supported software used by a small community. Site admins should use only widely adopted standard tools and software: nothing more than providing an ssh account and a container edge service which could be used to host additional needed services (squid cache, etc.)
- WLCG should keep the barrier of entry for potential opportunistic resource providers at an absolute minimum: provide a Grid model leveraging container technologies such as Docker or Rocket, where site admins deploy a simple server and a trusted, centrally managed service agent deploys the required services as run-time apps
- WLCG oversees all of the services/middleware deployed. The project should ‘recommend’ and ‘unrecommend’ some of them, based on the mid- to long-term requirements from the experiments
- Sites are losing the capacity to control services (remote FTS3 servers, XrootD redirectors, job submission through pilots which mixes analysis/production activities, …)
- ElasticSearch/Logstash/Kibana is getting popular in WLCG. WLCG might want to gather logstash templates for WLCG services and make them available to sites through a central repository
- A large number of sites are looking into CEPH as a storage technology. Among many pros, it enables SSD caching out of the box. This could be a game changer for more efficient WAN transfers now that uplinks are being upgraded to the next order (100 Gbps). So supporting CEPH as an SE technology could be very beneficial
- The excellent work implementing ATLAS multicore on ARC/Condor should be completed with good documentation!
- The use of github and the sharing of solutions among sites, for "official" middleware and other installed software, is taking off and should be further encouraged in WLCG
Maria D.'s bullet points for slides
These are notes from the drupal results' view. If the plots and spreadsheet lead to another conclusion, please ignore.
- Domains we are doing well:
- The Middleware (MW) releases' frequency
- WLCG support (incl. documentation)
- The repositories' trend (all known ones are used; EPEL is the most popular)
- Site-WLCG_Ops communication
- Info share across sites
- The channels we use are good and sufficient (email, meetings, wikis etc ...)
- Our TFs and WGs are considered useful
- GGUS is very much appreciated as a support tool
- The 3pm Ops call is considered useful by the T0/T1s and the frequency is fine.
- The SAM test results are complete and reliable
- Batch systems' installation, upgrade, operation, troubleshooting is mostly considered Normal or Rather Easy.
- Domains we need to improve:
- Documentation and repository location discovery
- Upgrade procedures
- The role of WLCG Operations between the Experiments and their collaborating Sites
- The content of the fortnightly WLCG Ops Coord meeting (66/101 sites Never or Rarely attend). Maybe add a cross-sites communication item in the agenda?
- Invest some effort to improve the SAM documentation
- Batch systems' support is considered problematic
- Overall remarks
- Sites mostly rely on real jobs and transfers to assess their performance
- The majority of replies about all MW components are in the Normal or Rather Easy columns
Pepe's slides presented at the Amsterdam GDB on 2015/03/11 HERE
Alessandra's slides presented at the Okinawa WLCG Workshop on 2015/04/11 HERE
Andrea's slides presented at the Okinawa CHEP 2015 on 2015/04/13 HERE
--
AndreaSciaba - 2015-02-24