WLCG Operations and Tools TEG - WG2

This page collects the input received for the areas covered by WG2, namely:

Support tools

  • (TF) To improve: All these services rely on EGI/NGI sustainability.

Ticketing tools

GGUS

  • (SP) Works well: GGUS works very well
  • (SP): "Change" tickets to sites, that is tickets requesting a change, are fundamentally different from "incident" tickets because the former do not impead correct functioning an may remain open for a long time without that being a problem. Unfortunately this causes the site to be red in the monitoring. A solution would be to implement an alternate ticket type called 'Change request', as done for alert and tram tickets.
  • (SP) It is not always obvious whether a ticket should be submitted to GGUS, to Savannah or to another tool
  • (AF) Sometimes tickets to other SUs which are redirected to other tools are set to unsolved, making it difficult to track progress and causing tickets to linger for months.
  • (MD) Works well: An advantage of GGUS is that it is possible to get the desired enhancements and bug fixes from the developers.
  • (MD) Works well: GGUS is the official reporting tool for WLCG
  • (MD) Works well: GGUS is integrated with other tools, like the operations portal (ex-CIC), GOCDB and OIM
  • (TF) Works well: GGUS is very stable.
  • (PES) Works well: GGUS works relatively well.
  • (ATLAS) Works well: the best features of GGUS are the team and alarm ticket categorisations (sites can easily distinguish "blocking" issues) and the possibility to directly send tickets to sites.
  • (MD) Top 3 problem: GGUS development is complicated by the different priorities for the collaborating projects (EGI, EMI, WLCG).
  • (MD) Top 3 problem: There is no User Support Working Group, like the old EGEE USAG, to discuss the development plans in agreement with all partners (VOs, sites, management).
  • (ATLAS) Top 3 problem: GGUS should better report what is going on at a site in terms of tickets, downtimes and actions. Programmatic access to the GGUS information should be stabilised.
  • (ATLAS) Top 3 problem: Alarm tickets are not yet very reliable regarding their assignment to the proper SU.
  • (PES) Top 3 problem: The emails sent as alarms are decorated with many '*' characters which, when the message is forwarded to a mobile phone as an SMS, prevent any of the useful information from getting through (see the sketch after this list).
  • (PES) Top 3 problem: Routing in GGUS is suboptimal for less frequent flows of alarms, for example Site -> VO tickets are often redirected back to the site.
  • (PES) Top 3 problem: The VO information in the EGI operations portal is often out of date (groups and roles).
  • (PES) Top 3 problem: The current interface between GGUS and SNOW is far from being bullet proof.
  • (MD) To improve: Sometimes the same things are reported again and again in different meetings. [why under GGUS?]
  • (MD) To improve: There were many changes imposed by external tools which wasted time for no advantage (e.g. moving from Savannah to RT [?], changing the URL of the CIC portal, ...).
  • (MD) To improve: There were several bugs in the interfaces with external ticketing systems or other tools (VOMS, GOCDB, OIM, etc.).
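
The SMS issue above invites a simple site-side workaround. Below is a minimal, hypothetical sketch in Python of stripping the '*' decoration from an alarm email before forwarding it as an SMS; the function name and the 160-character limit are illustrative, not part of GGUS:

    # Hypothetical sketch: strip the '*' decoration from a GGUS alarm
    # email so that the useful content survives forwarding as an SMS.
    import re

    def alarm_to_sms(subject: str, body: str, max_len: int = 160) -> str:
        text = subject + ' ' + body
        text = re.sub(r'\*+', ' ', text)           # drop the decoration
        text = re.sub(r'\s+', ' ', text).strip()   # collapse whitespace
        return text[:max_len]                      # fit in a single SMS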

Savannah

  • (CMS) Works well: Savannah has a mapping of all the site admins and responsible persons, which allows tickets to be efficiently assigned to them. It also allows the creation of squads related to services and tools.
  • (CMS) Top 3 problem: The Savannah-to-GGUS bridge should be improved: it should not be possible to write comments in Savannah on a bridged ticket, and a bridged ticket can be converted into a team or alarm ticket only from the GGUS portal.
  • (ATLAS) Works well: The best feature of Savannah is the ability to move tickets from one squad to another (for example from operations to development).

Other ticketing systems

  • (CMS) Works well: TRAC, a web-based project management tool, has proven to be useful for the development teams.
  • (CMS) Top 3 problem: The TRAC deployment at CERN is very unreliable, apparently due to lack of manpower. For this reason CMS is considering moving to GitHub, for both reliability and feature reasons.

Accounting tools

  • (TF) Works well: APEL is very stable.
  • (TF) Top 3 problem: There is no open access to control/monitoring information that would make it possible to know which sites are reporting accounting data incorrectly.
  • (TF) Top 3 problem: The accuracy of the accounting information should be assessed for all accounting systems in use.
  • (TF) Top 3 problem: There are no requirements defined by WLCG for storage accounting. (JG) The Installed Capacity document is the WLCG requirement for what should be collected (see the sketch below): https://twiki.cern.ch/twiki/pub/LCG/WLCGCommonComputingReadinessChallenges/WLCG_GlueSchemaUsage-1.8.pdf
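
As a concrete illustration of the checks implied above, the capacity figures requested by the Installed Capacity document are published as Glue 1.3 GlueSA objects and can be read back from a top-level BDII. A minimal sketch, assuming the third-party Python ldap3 library; lcg-bdii.cern.ch and port 2170 are the conventional top-level BDII endpoint, used here purely for illustration:

    # Minimal sketch: read back the published storage capacity figures
    # (Glue 1.3 GlueSA objects) from a top-level BDII, so that obviously
    # unrealistic values can be spotted.  Assumes the ldap3 library.
    from ldap3 import Server, Connection

    conn = Connection(Server('lcg-bdii.cern.ch', port=2170),
                      auto_bind=True)  # BDIIs accept anonymous binds
    conn.search('Mds-Vo-name=local,o=grid',
                '(&(objectClass=GlueSA)(GlueSATotalOnlineSize=*))',
                attributes=['GlueSATotalOnlineSize',
                            'GlueSAUsedOnlineSize'])
    for sa in conn.entries:
        # Sizes are published in GB; zero or negative totals are the
        # kind of inaccuracy reported above.
        print(sa.entry_dn, sa.GlueSATotalOnlineSize, sa.GlueSAUsedOnlineSize)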

Administration tools

GOCDB4

  • (AF) GOCDB4's architecture is more complicated, with very few benefits for the sysadmin.
  • (SP) The information can get stale.
  • (CMS) Works well: GOCDB is useful to keep track of all downtimes.
  • (CMS) Top 3 problem: GOCDB does not list which VOs are supported by a given service.
  • (ATLAS) Top 3 problem: Having to deal with two different systems, GOCDB and OIM, is painful for operations. Experiments need a place where sites can publish their "latest news" (like the CERN IT Service Status Board) and be able to programmatically query it.

Downtime alerts

  • (AF) Some people receive multiple notifications (due to having multiple roles) and end up deleting them before reading them.
  • (CMS) To improve: It took an inordinate amount of time to get a reliable way to monitor downtimes from GOCDB and OIM, and even today it takes a lot of effort to keep this information reliable. Better exposure of downtime information from GOCDB would reduce this manpower need (see the sketch below).
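
A minimal sketch of the downtime polling meant above, using the public GOCDB programmatic interface; the URL, the get_downtime method and the XML element names follow the GOCDB-PI documentation as we understand it and should be verified against the current version:

    # Minimal sketch: poll the public GOCDB programmatic interface for
    # the downtimes of one site.  The endpoint and XML element names are
    # taken from the GOCDB-PI documentation and should be double-checked.
    import urllib.request
    import xml.etree.ElementTree as ET

    SITE = 'EXAMPLE-SITE'  # hypothetical site name
    url = ('https://goc.egi.eu/gocdbpi/public/'
           '?method=get_downtime&topentity=' + SITE)

    with urllib.request.urlopen(url) as response:
        root = ET.parse(response).getroot()

    for dt in root.findall('DOWNTIME'):
        print(SITE, dt.findtext('SEVERITY'), dt.findtext('DESCRIPTION'))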

Underlying services

Messaging services

ActiveMQ

  • (SP) Works well: No problems ever observed.
  • (ATLAS) Works well: No issues whatsoever with the messaging system.
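
For context, clients typically exchange messages with the ActiveMQ brokers over the STOMP protocol. A minimal publishing sketch, assuming the third-party Python stomp.py library; the broker host and the destination name below are purely illustrative:

    # Minimal sketch: publish one message to an ActiveMQ broker over
    # STOMP, assuming the stomp.py library.  Host and destination are
    # illustrative placeholders, not real WLCG endpoints.
    import stomp

    conn = stomp.Connection([('broker.example.org', 61613)])
    conn.connect(wait=True)  # anonymous connection, for illustration
    conn.send(destination='/topic/example.monitoring',
              body='service=CE status=OK')
    conn.disconnect()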

Information system

  • (CMS) Top 3 problem: The BDII service is very unreliable, as the information can vanish when a local BDII has problems reporting values.
  • (CMS) Top 3 problem: Retrieving information from the BDII is not straightforward due to the complexity of the LDAP schema and requires non-trivial code to be written (see the sketch after this list).
  • (CMS) Top 3 problem: Information providers are unreliable: very often, and in particular for storage information, the space usage numbers are obviously not realistic. The published information should be validated much more thoroughly.
  • (TF) Top 3 problem: There is no satisfactory way to use the BDII for ARC (workarounds have to be deployed which are not generally applicable).
  • (TF) Top 3 problem: The future of the BDII is not clear.
  • (TF) Top 3 problem: The policies for publishing resources in the BDII are not satisfactory: uncertified sites not registered in GOCDB or OIM are publishing services in the top BDII.
  • (ATLAS) Top 3 problem: The published information is often wrong or missing.
  • (PES) Top 3 problem: BDII validation tools are still nonexistent; the only way to validate anything is to put it in production and wait for the tests to fail.
  • (PES) Top 3 problem: The BDII system is very unreliable: the information vanishes when a local BDII has problems reporting values, and sites' information providers are unreliable.
  • (CMS) To improve: The BDII information should be certified and audited by WLCG and better tools to get the information should be provided.
  • (TF) To improve: Rely on the NGIs for the provisioning of core services such as the BDII. EGI is aiming at a minimum availability of 99% for NGI-provided top-BDII services.
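
To illustrate the LDAP complexity mentioned above, here is a minimal sketch of querying a top-level BDII for its published services, assuming the third-party Python ldap3 library; lcg-bdii.cern.ch and port 2170 are the conventional top-level BDII endpoint, used here purely for illustration:

    # Minimal sketch: list the services published in a top-level BDII.
    # Assumes the ldap3 library (pip install ldap3).
    from ldap3 import Server, Connection

    conn = Connection(Server('lcg-bdii.cern.ch', port=2170),
                      auto_bind=True)  # BDIIs accept anonymous binds
    # Glue 1.3 entries live under the Mds-Vo-name=local,o=grid base DN.
    conn.search('Mds-Vo-name=local,o=grid',
                '(objectClass=GlueService)',
                attributes=['GlueServiceType', 'GlueServiceEndpoint'])
    for entry in conn.entries:
        print(entry.GlueServiceType, entry.GlueServiceEndpoint)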

Batch systems

  • (MD) Top 3 problem: Since October 2010 GGUS has had Support Units for LSF, Torque, Sun Grid Engine and Condor. None of the projects (EGEE, then EGI, or WLCG) funded this support activity in any way, so it has to happen on a best-effort basis.
  • (XE) Top 3 problem: With the increase in size of some sites, scalability issues have emerged (Torque/Maui), while the future of some batch systems, like SGE, is uncertain. There are only a few open-source alternatives, for example SLURM and Condor.

WLCG operations and procedures

  • (ATLAS) Works well: Experiments and sites have built up many procedures for operations. SIRs (Service Incident Reports) are very useful and should be extended to experiments (experiment-specific services can also fail and affect sites).
  • (LHCb) Works well: In most cases sites are responding quickly to issues reported and provide solutions. This is essential for operations.
  • (PIC) Works well: Communication with the experiments is very good: WLCG daily meetings, eLOGs and GGUS being the key pieces. This is fundamental for the sites as it promotes the cooperation between experiment experts and sites.
  • (ATLAS) Top 3 problem: Some important sites are still missing a strong contact with ATLAS, even if this is improving.
  • (ATLAS) Top 3 problem: There is good WLCG operations coordination, but there are no WLCG operations per se (the experiments follow up issues within their own operations teams). A communication channel for T2s is missing: the GDB meets only once a month, the T1SCM targets T1s, and the daily operations meeting is not followed by T2s.
  • (JT) Top 3 problem: Operations are still not routine, in the sense that experiments feel the need to have one of their own people tightly integrated into the operations team at each site. This scales only for the largest sites (and only if only the WLCG VOs have such people).
  • (CH T2) Top 3 problem: The ATLAS experiment changes its requirements too often, and the changes are not reflected in its VO card in the CIC portal. User abuses are difficult to trace due to the anonymity of pilot jobs.
  • (CH T2) Top 3 problem: The ATLAS software is too heavy on the shared filesystem, as each job does the equivalent of a 'find' on all the installed software files, resulting in denial of service, decreased overall performance and memory overconsumption (and swapping).
  • (LHCb) To improve: Training people for computing shifts is currently very time-consuming for the experts. The process of getting people ready for shifts should become much faster and more efficient, for example by providing better and up-to-date documentation, knowledge bases and a 'grid training'.

Contributors

AF - Alessandra Forti
ATLAS - the ATLAS experiment
CH T2 - Swiss Tier-2 sites
CMS - the CMS experiment
LHCb - the LHCb experiment
JG - John Gordon
JT - Jeff Templon
MD - Maria Dimou
PES - CERN IT-PES group
PIC - the PIC Tier-1 site
SP - Stuart Purdie
TF - Tiziana Ferrari
XE - Xavier Espinal

-- AndreaSciaba - 03-Nov-2011
