Underlying Services

Messaging services

Overview

The main messaging service currently used by WLCG is operated by EGI and consists of four tightly coupled brokers running ActiveMQ (two at CERN, one at AUTH and one at SRCE), designed to support Grid operational tools such as SAM.

In addition, CERN runs two additional brokers for testing and validation purposes, as well as two dedicated services (one for ATLAS/DDM, one for IT/ES), each made of two production brokers and one test broker. These dedicated services are being validated and are not yet used in production.

The applications currently using messaging are diverse. They include for instance:

  • APEL
  • SAM
  • Ganga/DIANE Monitoring
  • LFC Catalogue Synchronization (EMI prototype)
  • ATLAS/DDM Tracer Service (prototype)

This list is far from exhaustive; more and more users are interested in using messaging to simplify the Grid middleware.
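
As an illustration of how such applications interact with the service, the sketch below publishes a small monitoring record to a broker over STOMP, a protocol commonly used with ActiveMQ. It is a minimal sketch assuming the stomp.py library (4.x-style API); the broker host, port and destination name are hypothetical placeholders, not the production values.

  import json
  import stomp

  # Hypothetical broker endpoint and destination; real values depend on the deployment.
  BROKER = ("broker.example.org", 6163)
  DESTINATION = "/topic/example.monitoring.records"

  # stomp.py 4.x style connection; older versions have a slightly different API.
  conn = stomp.Connection(host_and_ports=[BROKER])
  conn.connect(wait=True)

  record = {"service": "example-ce.example.org", "metric": "job-submit", "status": "OK"}
  conn.send(destination=DESTINATION, body=json.dumps(record),
            headers={"persistent": "true"})

  conn.disconnect()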

What Works Well

The current service works well enough for the applications using it in production.

Top Three Issues

Although this service is successfully being used by today's applications, it is clear that it must evolve in three main directions:

  1. security
  2. scalability
  3. availability and reliability

Improved Security

The use of messaging in the Grid started in 2008 and slowly evolved from a prototype to a reliable service. Although security was not a priority at the beginning, the situation has changed and it must now be taken into account.
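
One concrete aspect of taking security into account at the transport level is X.509-based client authentication towards the brokers. The sketch below opens a TLS connection presenting a host certificate; the broker endpoint, the SSL port and the certificate paths are assumptions (typical grid locations), not the actual configuration of the service.

  import socket
  import ssl

  # Assumed values: an SSL-enabled broker port and the usual grid credential locations.
  BROKER = ("broker.example.org", 6162)
  HOSTCERT = "/etc/grid-security/hostcert.pem"
  HOSTKEY = "/etc/grid-security/hostkey.pem"
  CA_DIR = "/etc/grid-security/certificates"

  # Verify the broker against the grid CAs and present our own certificate (client auth).
  context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, capath=CA_DIR)
  context.load_cert_chain(certfile=HOSTCERT, keyfile=HOSTKEY)

  with socket.create_connection(BROKER) as raw_sock:
      with context.wrap_socket(raw_sock, server_hostname=BROKER[0]) as tls_sock:
          print("Authenticated TLS connection established, cipher:", tls_sock.cipher())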

Improved Scalability

More and more applications use or want to use messaging, leading to a major increase in capacity requirements. The "one size fits all" model currently in use cannot easily scale.

Improved Availability and Reliability

Although the four brokers used today form a loose cluster, the whole service needs to be stopped during some interventions (e.g. software upgrades or major configuration changes). In addition, some applications have single points of failure: the unavailability of one machine stops the application.
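
On the client side, part of the single-point-of-failure problem can be mitigated by letting applications know about more than one broker. The sketch below, again assuming stomp.py and hypothetical host names, configures a connection that tries a second broker if the first one is unavailable; it does not, of course, address the need to stop the whole service for interventions.

  import stomp

  # Two hypothetical brokers serving the same destinations; stomp.py tries them in
  # turn and reconnects to a working one if the current broker becomes unavailable.
  conn = stomp.Connection(
      host_and_ports=[("broker1.example.org", 6163), ("broker2.example.org", 6163)],
      reconnect_attempts_max=10,
  )
  conn.connect(wait=True)
  conn.send(destination="/topic/example.heartbeat", body="ping")
  conn.disconnect()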

Information services

Overview

The Grid information system enables users, applications and services to discover which services exist in a Grid infrastructure and retrieve information about their structure and state. A recent survey identified 6 main use cases:

  • Service discovery
  • Installed Software
  • Storage Capacity (to identify dark data and for storage accounting)
  • Batch system queue status (also required for pilot job submission)
  • Configuration (to populate VO-specific configuration databases)
  • Installed Capacity

The information system is a fully distributed hierarchical system based on OpenLDAP that has been in continuous operation since the first LCG-2 release. It was designed and developed to overcome the limitations of previous implementations of the information system. Its main components are listed below, followed by a minimal example of how a client can query the system:
  • a hierarchy of BDII services publishing information via LDAP, which includes resource-level BDIIs, site-level BDIIs and top-level BDIIs
  • a schema describing the information (the GLUE schema) and its LDAP implementation
  • information providers, which collect information at the resource level and send it to a resource-level BDII.
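
To make the hierarchy concrete, the sketch below queries a top-level BDII for the endpoints of a given service type using the python-ldap bindings. The host name is a placeholder; the port (2170), base DN and GLUE 1.3 attribute names are the customary ones but should be checked against the actual deployment.

  import ldap

  # Hypothetical top-level BDII; 2170 is the customary BDII port.
  TOP_BDII = "ldap://lcg-bdii.example.org:2170"
  BASE_DN = "o=grid"

  conn = ldap.initialize(TOP_BDII)
  conn.simple_bind_s()  # anonymous bind, as for normal BDII queries

  # Find all SRM endpoints published in the GLUE 1.3 schema.
  results = conn.search_s(
      BASE_DN, ldap.SCOPE_SUBTREE,
      "(&(objectClass=GlueService)(GlueServiceType=SRM))",
      ["GlueServiceEndpoint", "GlueServiceType"])

  for dn, attributes in results:
      print(dn, attributes.get("GlueServiceEndpoint"))

  conn.unbind_s()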

What Works Well

  • The information system infrastructure is reasonably stable, although the published information is sometimes unstable at the source. Overall, it meets the requirements of the WLCG production infrastructure; this has been achieved with a relatively small investment and support effort, especially when compared to other services.

  • A major achievement has been the agreement on a global information model (the GLUE schema), which took into account input from many diverse communities. Over time a substantial amount of operational experience has been accumulated.

Top Issues

The top issues with the information system can be classified in three categories:

  • Stability
  • Information validity
  • Information accuracy

Stability

  • The information system was originally designed to show the current state of a Grid infrastructure for workload scheduling purposes. Given the highly dynamic nature of the distributed computing infrastructure, any issue in the underlying Grid service (instabilities of the service or of its service-level BDII) or in the distributed infrastructure itself (site- or resource-level BDIIs disappearing, etc.) may cause information not to be consistently published, resulting in information instability. Today, the primary use cases of the information have changed to ones that require information stability, e.g. service discovery for operational tools. As such, this information instability is now seen as a major issue. Typical symptoms are:
    • information that vanishes or becomes stale
    • "random" output from queries (for example when more BDII nodes are behind an alias and one of them is not working well).
  • Open issues include how to deal with disappearing services, for example deciding for how long service information has to be cached (a toy sketch of such caching follows this list).
  • There is not yet a satisfactory way to use the BDII for ARC services (the workarounds that have to be deployed are not generally applicable).
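
The caching question raised above can be illustrated with a toy sketch: keep the last known record of a service for a grace period after it stops being published, so that a transient failure of a site- or resource-level BDII does not make the service appear to vanish. The data structures and the grace period below are illustrative assumptions only, not part of the BDII.

  import time

  GRACE_PERIOD = 3600  # seconds; choosing this value is exactly the open question above

  cache = {}  # service endpoint -> (record, last_seen_timestamp)

  def update(published_records):
      """Merge the latest published snapshot into the cache and expire stale entries."""
      now = time.time()
      for endpoint, record in published_records.items():
          cache[endpoint] = (record, now)
      for endpoint, (record, last_seen) in list(cache.items()):
          if now - last_seen > GRACE_PERIOD:
              del cache[endpoint]

  def lookup(endpoint):
      """Return the cached record, even if it disappeared from the last snapshot."""
      entry = cache.get(endpoint)
      return entry[0] if entry else None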

Information Validity

  • The published information is only as useful as it is reliable. A first level of validation comes from compliance with the GLUE schema; however, this still leaves a lot of margin for mistakes, in particular because the structure of the information is complex from a configuration point of view and not trivial for system administrators to set up, sometimes leaving room for different interpretations. This is partially addressed by validation tools (gstat-validation and glue-validator), but they cannot prevent invalid information from being published. Such checks need to be improved and executed before the information is published (a minimal example of such a pre-publication check is sketched after this list).

  • The policy for publishing resources is also not clear, as there are uncertified sites, not registered in GOCDB or OIM, which are publishing services in the top-level BDII.
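
As an illustration of a pre-publication check (in the spirit of glue-validator, but in no way a replacement for it), the toy sketch below verifies that a few mandatory GLUE attributes are present and non-empty before an entry is allowed to be published. The attribute list is a simplified assumption, not the full schema.

  # Simplified, illustrative subset of mandatory GLUE 1.3 attributes per object class.
  MANDATORY = {
      "GlueService": ["GlueServiceUniqueID", "GlueServiceType", "GlueServiceEndpoint"],
      "GlueSite": ["GlueSiteUniqueID", "GlueSiteName"],
  }

  def validate_entry(object_class, attributes):
      """Return a list of problems; an empty list means the entry may be published."""
      problems = []
      for name in MANDATORY.get(object_class, []):
          if not attributes.get(name):
              problems.append("missing or empty attribute: %s" % name)
      return problems

  entry = {"GlueServiceUniqueID": "srm://se.example.org", "GlueServiceType": "SRM",
           "GlueServiceEndpoint": ""}
  print(validate_entry("GlueService", entry))
  # ['missing or empty attribute: GlueServiceEndpoint'] -> do not publish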

Information Accuracy

  • Even if information is validated it can still be incorrect or too old. As the information system is not optimized for dynamic information, it was observed that the latency for a change in a dynamic value to propagate to a top-level BDII can be up to 10 minutes. This casts some doubt on the usefulness of dynamic information.
  • Information providers are unreliable, in particular for storage information: very often the space usage numbers are obviously unrealistic or are missing altogether. The published information should be better validated at the WLCG and the site level, for instance with simple sanity checks such as the one sketched below.
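
A complementary, equally simple sanity check on the accuracy side could reject obviously unrealistic storage numbers before they are propagated, as in the illustrative sketch below; the thresholds are arbitrary assumptions.

  def storage_numbers_plausible(used_gb, total_gb, max_total_gb=100 * 1000 * 1000):
      """Reject obviously wrong space-usage figures (all thresholds are illustrative)."""
      if used_gb is None or total_gb is None:
          return False                      # missing altogether
      if used_gb < 0 or total_gb <= 0:
          return False                      # negative or zero capacity
      if used_gb > total_gb:
          return False                      # more space used than available
      if total_gb > max_total_gb:
          return False                      # implausibly large for a single storage element
      return True

  print(storage_numbers_plausible(used_gb=350000, total_gb=500000))   # True
  print(storage_numbers_plausible(used_gb=-1, total_gb=500000))       # False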

To improve

  • The BDII information should be more thoroughly certified and audited by WLCG. A strong push in this direction would come from a decision of the experiments to use the BDII information.
  • Retrieving information from the BDII is not straightforward due to the complexity of the LDAP schema and requires non-trivial code to be written; therefore better client tools to get the information should be provided (a possible shape for such a helper is sketched after this list).
  • It was suggested that we should rely more on the NGIs for the provisioning of core services such as the BDII. EGI is aiming at a 99% minimum availability for NGI-provided top-level BDII services.
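
What a "better client tool" could look like is illustrated below: a thin helper that hides the LDAP details behind a single call, building on the query sketch in the Overview. The function name, default host and example service type are hypothetical.

  import ldap

  def find_service_endpoints(service_type, top_bdii="ldap://lcg-bdii.example.org:2170"):
      """Return the published endpoints of all services of the given type."""
      conn = ldap.initialize(top_bdii)
      try:
          conn.simple_bind_s()
          results = conn.search_s(
              "o=grid", ldap.SCOPE_SUBTREE,
              "(&(objectClass=GlueService)(GlueServiceType=%s))" % service_type,
              ["GlueServiceEndpoint"])
          # Recent python-ldap versions return attribute values as bytes.
          return [attrs["GlueServiceEndpoint"][0]
                  for _dn, attrs in results if "GlueServiceEndpoint" in attrs]
      finally:
          conn.unbind_s()

  # Example: list all published CREAM CE endpoints.
  print(find_service_endpoints("org.glite.ce.CREAM"))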

Suggestions

The information system mixes three types of information that have different properties and requirements, which we will call structural data, meta data and state data.

Structural data is information about the existence of sites, services, etc. It contains information that is static throughout the lifetime of the service, such as UniqueID and Type; as a result this information is never modified, and the overall set changes only very slowly. Meta data is all other information about a service/site that may be modified throughout its lifetime, excluding state data; typically such changes coincide with a service update. State data is information about the current state of the service and is transient and highly dynamic, such as the number of running jobs.

Each type of information should be provided by a system optimized for that purpose. This would result in three systems, which we will call the Service Catalogue, the Meta Data Catalogue and the Service Monitoring System.

The Service Catalogue presents an interface for discovering services and should be restricted to structural data. This should be a generic service for all infrastructures/VOs.

The Meta Data Catalogue should use the Service Catalogue to discover which services exist and contact them directly to obtain further information about them. This could be either a generic service or a VO-specific tool, such as the existing configuration databases, in which the VOs can annotate this information with their own naming and semantics.

The Service Monitoring System would carry the transient state information. A publish/subscribe model would be suitable for such a system.
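
The split proposed above can be made concrete with a few illustrative record types: structural data for the Service Catalogue, meta data for the Meta Data Catalogue, and state data for the Service Monitoring System. The fields chosen below are assumptions for the sake of the example, not a proposed schema.

  from dataclasses import dataclass, field
  from typing import Dict

  @dataclass(frozen=True)
  class StructuralRecord:          # Service Catalogue: never modified after creation
      unique_id: str
      service_type: str
      site: str

  @dataclass
  class MetaDataRecord:            # Meta Data Catalogue: changes with service updates
      unique_id: str
      version: str
      endpoints: Dict[str, str] = field(default_factory=dict)

  @dataclass
  class StateRecord:               # Service Monitoring System: transient, highly dynamic
      unique_id: str
      running_jobs: int
      timestamp: float

  ce = StructuralRecord("ce01.example.org", "ComputingElement", "EXAMPLE-SITE")
  state = StateRecord(ce.unique_id, running_jobs=1234, timestamp=1323648000.0)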

Batch systems

Overview

Batch systems are the core of the computing activities at the sites: they have to provide a reliable framework to fulfill the project requirements in terms of Grid integration, accounting and information systems. Provisioning of the Grid layer for the local batch systems at the sites was an activity driven by the Batch System Support team in EGEE, where support was funded for all the batch systems in use: LSF, Torque/Maui, (S)GE and Condor. Explicit funding for batch systems disappeared in the transition to EMI/EGI. So far this has been working well thanks to the involvement of sites that hold large expertise in running these systems, and the support team also integrated new ones, (S)GE and Condor, in the past couple of years.

Top issues

Sites are getting bigger as they keep up with the pledges, and the size of the clusters is in constant evolution in terms of number of nodes. Sites can also be affected by a "change of direction" of the system they use, as happened recently with SGE. It is known that Torque/Maui faces serious scalability problems, and (S)GE has an uncertain future, with many forks appearing from the original system, which is no longer open source (OGE). As a consequence, some sites may be pushed either to move from their original batch system to a new one (hence losing the accumulated know-how) or to adopt a commercial solution (MOAB, PBS-Pro, LSF, etc.). The decision requires a non-negligible investment in engineering time and/or money. Apart from the batch systems mentioned as having potential issues (PBS and GE), the most common open-source solutions left are Condor and SLURM:

  • CONDOR is known to scale very well; however, its concepts of queues, QoS and resource management are quite different from those of standard batch systems. The Grid layer was provisioned about two years ago in EGEE.
    • EGI did a survey in June 2011 (involving European, Latin American, Canadian and Asia-Pacific sites) to assess batch system deployment status and plans, with 70 replies collected from sites. Only three sites reported using CONDOR (1% of the sites that replied).
    • Certification activities for the Condor integration will probably have to restart.
  • SLURM is used with success in supercomputing centres (100k+ cores), but does not have the Grid layer ready (CREAM/BLAH, BDII and APEL). SLURM also misses some functionality needed by WLCG, such as the cpu_factor, which is used for homogeneous accounting purposes and by the batch system itself to apply correct CPU/wallclock time limits (a sketch of the idea follows this list). SLURM relies on a database, and issues like the cpu_factor can be solved relatively easily, but more issues may also appear in the non-Grid layer.
    • SLURM is supported in ARC and UNICORE.
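
The cpu_factor mentioned above is essentially a per-node scaling factor applied to CPU and wallclock figures so that accounting and limits are comparable across heterogeneous hardware. A minimal sketch of the idea is shown below; the node names and benchmark-derived values are made up, and whether the factor multiplies or divides a given quantity depends on the local convention.

  # Illustrative per-node scaling factors (e.g. derived from a benchmark relative to a
  # reference node); here faster nodes have a larger factor. The numbers are made up.
  CPU_FACTOR = {
      "node-old.example.org": 0.8,
      "node-new.example.org": 1.6,
  }

  def scaled_cpu_time(raw_cpu_seconds, node):
      """Normalise raw CPU time so that accounting is comparable across node types."""
      return raw_cpu_seconds * CPU_FACTOR.get(node, 1.0)

  def effective_wallclock_limit(limit_seconds, node):
      """Apply the inverse scaling so faster nodes get proportionally shorter limits."""
      return limit_seconds / CPU_FACTOR.get(node, 1.0)

  print(scaled_cpu_time(3600, "node-old.example.org"))             # 2880.0 normalised seconds
  print(effective_wallclock_limit(86400, "node-new.example.org"))  # 54000.0 seconds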

Suggestions

A good initiative could be to evaluate the possibility of a common effort (via a dedicated working group) for those sites that expect to have issues with their batch systems in the coming years.

EGI currently has an open survey to collect feedback from the sites on adding SLURM support in CREAM. To date, 53 sites have provided feedback; the deadline for participation is December 14th 2011. If there is sufficient interest, a requirement will be sent to EMI (the discussion in the Operations Management Board is foreseen for December 20th).

This possible common effort could be a big benefit in terms of investment and could result in a faster and better evaluation of the system, with the clear goal of having a scalable open-source batch system that can last for a long time. Condor is also a worthy option, but currently there is no real integration advantage for Condor over SLURM, given that the required effort is similar. A survey across WLCG sites may well indicate whether a proposal for a common working group is a good possibility, or perhaps new ideas may come from this initiative.

Improvement Areas

Impact scale: 10=Blocker, 5=Medium, 1=Low

  Impact   Area
  5        Improve the security of the messaging services
  5        Improve the scalability of the messaging services
  5        Improve the reliability and the availability of the messaging services
  5        Improve the stability of the information services
  5        Improve the validity of the information
  5        Improve the accuracy of the information
  5        Provide a long-term solution for a scalable and open-source batch system