Computing Technical Design Report

5.3 Integration of Grid Tools

5.3.1 The ATLAS Virtual Organization

It is assumed that all members of the ATLAS Collaboration are authorized to become members of the ATLAS Virtual Organization (VO), and are therefore allowed to submit jobs and gain access to Grid resources. While submitting jobs to the Grid is currently not a trivial operation, and has effectively been done only by production managers, we have to foresee facilitating access to the Grid for analysis users.

In order to open access to Grid computing facilities to all members of the ATLAS Collaboration, we are setting up an internal system of priority allocation for the usage of CPU and data storage resources. The problem is multi-dimensional, as one can define at least three dimensions that can affect the priority of a job on a given site:

  1. The role of each person in the computing structure;
  2. The activity (physics, combined performance, detector groups and central computing operations);
  3. The funding agency or institute affiliation of the person submitting the job.

The VOMS (Virtual Organization Management Service [5-2]) middleware package allows the definition of user groups and roles within the ATLAS Virtual Organization. We are initially setting up a VOMS group for each Physics Working Group, Combined Performance group and Detector System, as well as a generic ATLAS group and other groups for software testing, validation and central production activities.

At the time of registration with the ATLAS VO, a user will indicate the group (or groups) with which he or she is going to work, and in which capacity. The VO manager will check with the group leaders that the assignments and the proposed roles are correct.
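
For illustration, VOMS expresses group and role membership as Fully Qualified Attribute Names (FQANs) of the form /atlas/<group>/Role=<role>. The following Python sketch shows how such attributes could be decomposed into a group path and a role for use in a priority or mapping decision; the subgroup name in the example is purely illustrative.

    def parse_fqan(fqan):
        """Split an FQAN such as '/atlas/higgs/Role=production' into its
        group path and role (None if no role is asserted)."""
        group_parts = []
        role = None
        for part in fqan.strip("/").split("/"):
            if part.startswith("Role="):
                value = part.split("=", 1)[1]
                if value != "NULL":
                    role = value
            elif part.startswith("Capability="):
                continue  # capabilities are not used here
            else:
                group_parts.append(part)
        return "/" + "/".join(group_parts), role

    # Example attribute set a user might present (hypothetical subgroup):
    for fqan in ["/atlas/Role=NULL/Capability=NULL",
                 "/atlas/higgs/Role=production"]:
        print(parse_fqan(fqan))
    # ('/atlas', None)
    # ('/atlas/higgs', 'production')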

In order to implement an efficient allocation-sharing and priority system in the Grid access tools, it is necessary that the use of the resources dedicated to ATLAS be prioritized according to the Collaboration's policy. This could be achieved in principle via a central job distribution system to fully-dedicated clusters, but such a solution may suffer from bottlenecks and single points of failure. In a distributed submission system, local resource providers will have to be involved in defining and enforcing policies on priorities (in agreement with the Collaboration) through pseudo-dynamic updates to the local resource management systems' fair-share policies. These options are under investigation as part of this year's upgrade plan (see Section 5.4.7).
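
As a minimal sketch of the pseudo-dynamic update mechanism mentioned above, the following Python fragment translates notional Collaboration-level shares per VOMS group into per-group fair-share targets for a site, which a periodic update could then push to the local resource management system. The group names and share values are invented for illustration and do not represent an agreed policy.

    collaboration_shares = {          # fraction of the ATLAS allocation
        "/atlas/production": 0.50,
        "/atlas/physics":    0.30,
        "/atlas/detector":   0.15,
        "/atlas":            0.05,    # generic ATLAS members
    }

    def local_fair_shares(site_atlas_fraction, shares=collaboration_shares):
        """Scale the Collaboration-level shares by the fraction of the
        site's capacity pledged to ATLAS, yielding per-group fair-share
        targets that a pseudo-dynamic update could push to the local
        batch scheduler."""
        return {group: site_atlas_fraction * share
                for group, share in shares.items()}

    print(local_fair_shares(0.8))   # e.g. for a site pledging 80% to ATLAS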

In this context, one important issue to be addressed in the near future is the mapping of Grid users to local group accounts: which tools perform this mapping, and how local priorities are taken into account. Identity mapping is necessary when a site's resources do not use Grid credentials natively, but instead use a different mechanism to identify users, such as UNIX accounts or Kerberos principals. In these cases, the Grid credential for each incoming job must be associated with an appropriate site credential.

The Open Science Grid (OSG) in the US has set up the GUMS (Grid User Management System [5-3]) project to develop a Grid Identity Mapping Service. The GUMS server performs this mapping and communicates it to the gatekeepers. The gatekeepers are charged with enforcing the site mapping established by GUMS. The GUMS client (gatekeeper) portion consists of an administration tool for querying and/or changing the state of the server. While this tool is not yet of general use outside OSG, it could in the future be adopted more widely to manage local heterogeneous environments with multiple gatekeepers; it allows the implementation of a single site-wide usage policy, thereby providing better control and security for access to the site's Grid resources. Using GUMS, individual resource administrators are able to assign different mapping policies to different groups of users and define groups of hosts on which the mappings will be used. GUMS was designed to integrate with the site's local information services (such as HR databases or LDAP).
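
The following Python sketch illustrates the kind of site-wide mapping policy that a service such as GUMS administers: a Grid credential (certificate DN plus VOMS group) is mapped either to a shared group account or to a per-user pool account. The account names and policy are invented for illustration; this is not the GUMS interface itself.

    import itertools

    # Shared group accounts for well-defined activities; pool accounts
    # (assigned on first use) for everyone else. All names are invented.
    GROUP_ACCOUNTS = {
        "/atlas/production": "atlasprd",
        "/atlas/software":   "atlassgm",
    }
    _pool_counter = itertools.count(1)
    _pool_assignments = {}   # certificate DN -> assigned pool account

    def map_credential(dn, voms_group):
        """Return the local UNIX account for an incoming (DN, VOMS group)."""
        if voms_group in GROUP_ACCOUNTS:
            return GROUP_ACCOUNTS[voms_group]
        if dn not in _pool_assignments:
            _pool_assignments[dn] = "atlas%03d" % next(_pool_counter)
        return _pool_assignments[dn]

    print(map_credential("/DC=ch/DC=cern/CN=Jane Doe", "/atlas/production"))  # atlasprd
    print(map_credential("/DC=ch/DC=cern/CN=Jane Doe", "/atlas"))             # atlas001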

5.3.2 Data Management on the Grid

The data management system fulfils a set of distinct functions: global cataloguing of files, global movement of files, and space management on ATLAS-owned storage resources. In ATLAS we opted initially to realize the global catalogue function by building on the existing catalogues of the three Grid deployments (the Globus RLS in the case of NorduGrid and Grid3, the EDG-RLS in the case of the LCG). The data management system acted as a thin layer channelling catalogue requests to the respective Grid catalogues and collecting/aggregating the answers. At the same time it presented the users with a uniform interface on top of the native Grid data-management tools, both for the catalogue functions and the data movement functions. The current implementation of the data management system used in ATLAS [5-4] and its foreseen evolution are described in Section 4.6.
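
The "thin layer" just described can be pictured as a uniform catalogue interface in front of Grid-specific replica catalogues. The Python sketch below shows the aggregation pattern only; the backend classes and their methods are placeholders, not the real RLS client libraries.

    class ReplicaCatalogueBackend:
        """Interface implemented by each Grid-specific catalogue wrapper."""
        def replicas(self, lfn):
            raise NotImplementedError

    class GlobusRLSBackend(ReplicaCatalogueBackend):    # NorduGrid / Grid3
        def replicas(self, lfn):
            return []   # a real implementation would query the Globus RLS

    class EDGRLSBackend(ReplicaCatalogueBackend):       # LCG
        def replicas(self, lfn):
            return []   # a real implementation would query the EDG RLS

    class UnifiedCatalogue:
        """Channels each lookup to every backend and aggregates the
        answers, presenting a single interface to the user."""
        def __init__(self, backends):
            self.backends = backends
        def replicas(self, lfn):
            found = []
            for backend in self.backends:
                found.extend(backend.replicas(lfn))
            return found

    catalogue = UnifiedCatalogue([GlobusRLSBackend(), EDGRLSBackend()])
    print(catalogue.replicas("lfn:dc2.012345.simul._00001.pool.root"))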

Following the experience of Data Challenge 2 in 2004, we have re-examined the basic assumptions of this system. A large fraction of job failures were due to lack of robustness of the native Grid tools used for data cataloguing and data transfer, as well as to a lack of appropriate Storage Element space-management tools. We had to implement many work-arounds to offset the middleware problems (such as retries and time-outs in file transfers, resubmission of jobs to different sites when input data became unavailable etc.), and at the same time we asked middleware providers to develop more robust data-management systems.
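
A typical work-around of this kind wraps the native transfer tool in retries and time-outs. The sketch below shows the pattern in Python; the command name "grid-copy-tool" is a placeholder for whichever native Grid copy tool a job actually used.

    import subprocess, time

    def copy_with_retries(source, destination, attempts=3,
                          timeout=600, backoff=60):
        """Try a file transfer up to `attempts` times, killing any attempt
        that exceeds `timeout` seconds and pausing `backoff` seconds
        between attempts. Returns True on success."""
        for attempt in range(1, attempts + 1):
            try:
                subprocess.run(["grid-copy-tool", source, destination],
                               check=True, timeout=timeout)
                return True
            except (subprocess.CalledProcessError,
                    subprocess.TimeoutExpired) as err:
                print("attempt %d failed: %s" % (attempt, err))
                if attempt < attempts:
                    time.sleep(backoff)
        return False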

In order to build a robust Distributed Data Management system for ATLAS, we request that:

The data management system based on native Grid catalogues has proven not to scale to the current needs of the ATLAS experiment. We are therefore in the process of redesigning the system, basing it on a multi-tiered cataloguing system. The basic concept is to have each dataset (a coherent set of events, recorded or selected according to the same procedure) subdivided into a number of "data blocks". Each data block can consist of many files, but is considered a single indivisible unit in terms of data storage and transfer. Central catalogues need to know only the location of the Storage Elements where each data block is present (in addition of course to the relevant metadata); the mapping to the physical filenames is known only to local catalogues. All local catalogues are required to present the same interface to the ATLAS data management system, but could in principle be implemented using different Grid-specific technologies. More information on the new ATLAS Distributed Data Management System (DDM) can be found in Section 4.6.6.
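
The division of labour between central and local catalogues can be summarized with a small Python sketch: the central catalogue knows only which Storage Elements hold each data block of a dataset, while the mapping from logical to physical file names is kept in per-site local catalogues. All names and structures below are illustrative and do not reproduce the DDM schema.

    # Central catalogue: dataset -> data blocks, data block -> site locations.
    dataset_blocks = {
        "rome.higgs.recon.v1": ["rome.higgs.recon.v1_block0001",
                                "rome.higgs.recon.v1_block0002"],
    }
    block_locations = {
        "rome.higgs.recon.v1_block0001": ["CERN-SE", "BNL-SE"],
        "rome.higgs.recon.v1_block0002": ["BNL-SE"],
    }

    # Local catalogue at one site: logical file name -> physical file name.
    local_catalogue_bnl = {
        "recon._00001.pool.root": "srm://bnl.example/atlas/recon._00001.pool.root",
    }

    def sites_with_full_dataset(dataset):
        """Sites holding every data block of the dataset (blocks are the
        indivisible unit of storage and transfer)."""
        blocks = dataset_blocks[dataset]
        common = set(block_locations[blocks[0]])
        for block in blocks[1:]:
            common &= set(block_locations[block])
        return common

    print(sites_with_full_dataset("rome.higgs.recon.v1"))   # {'BNL-SE'}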

5.3.3 Job Submission to the Grid

ATLAS has gained experience of job submission through Data Challenges 1 and 2, and through its preparations for the Rome Physics Workshop in June 2005. At present the job-submission system works with LCG2, Grid3 and NorduGrid, and is starting to gain experience with gLite. It has also worked with the precursors to these Grid deployments. Based on this experience, it is now possible to describe the function of a generic job submission and to draw out some additional requirements.

5.3.3.1 Generic Job Submission to the Grid

An ATLAS Collaboration member is assumed to have already joined the ATLAS VO and to have been granted a certificate to be used for job submission. Tools on a user-interface machine then allow the user to generate, from this certificate, a delegated proxy. This is sent along with the job, enabling the job to perform Grid operations on behalf of the submitter.

In addition, in order to submit a job, the user must provide a job-description file (created either by hand, via automatic tools like a Production System executor, or directly from the user-interface tools), in which all the information about the job is specified: the executable to be run and its parameters, the files to be staged in to the execution host, the files to be staged out, and the requirements and preferences for the execution host (e.g. available memory, maximum CPU time granted to a job, etc.). Requirements and preferences concerning the system on which the job will run, and the execution of the job itself, are typically expressed as expressions evaluating to a boolean or a numerical value.
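
As an illustration of the information carried by such a file, the fragment below sketches a job description as a Python dictionary. The concrete syntax is Grid-specific (for example JDL on the LCG and XRSL on NorduGrid), and the field names and attribute expressions shown here are indicative only.

    job_description = {
        "executable":    "athena.sh",
        "arguments":     ["RecExCommon_topOptions.py"],
        "input_sandbox": ["athena.sh", "RecExCommon_topOptions.py"],
        "output_files":  ["recon.pool.root", "athena.log"],
        # Requirements evaluate to a boolean on each candidate host;
        # preferences (rank) evaluate to a number used to order them.
        "requirements":  "other.GlueHostMainMemoryRAMSize >= 1024",
        "rank":          "other.GlueCEStateFreeCPUs",
        "max_cpu_time":  1440,   # minutes
    }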

When ready, the submission system sends the job-description file - together with all the specified accessory files - to a Grid service that verifies the user credentials, takes care of finding a suitable resource and dispatches the job to it. The submission command returns a job identifier, which is unique to the job; this is used to reference the job in all the subsequent commands.

Using other tools and the job identifier, the user can poll the status of his or her job, delete it if desired, or retrieve the output of the finished job by transferring it to the submitting machine. In case of errors, the user can inspect the job history by querying a logging and book-keeping service, where the various middleware components push a log of the operations they perform on the job.
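
The complete cycle can be sketched as follows in Python, with the Grid-specific commands replaced by placeholder functions: submission returns an opaque job identifier, which is then used to poll the job status and, once the job has finished, to retrieve its output.

    import time, uuid

    def submit(job_description):
        """Hand the job description (and sandbox files) to the Grid
        service; return the unique job identifier it assigns
        (placeholder value here)."""
        return "https://wms.example:9000/%s" % uuid.uuid4()

    def status(job_id):
        return "Done"      # would query the logging and book-keeping service

    def get_output(job_id, target_dir):
        pass               # would transfer the output sandbox back

    job_id = submit({"executable": "athena.sh"})
    while status(job_id) not in ("Done", "Aborted"):
        time.sleep(300)    # poll every five minutes
    if status(job_id) == "Done":
        get_output(job_id, "./output")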

5.3.3.2 Requirements and Desirable Features for Job Submission

The job-management techniques described above are rather low-level, and are seldom used directly by the user, other than for occasional submission or for debugging purposes. More often, users implement their own tools, from simple scripts to large applications like the production system, to manage large quantities of jobs in a transparent way. For this reason, the user interface must also provide an API. Additionally, multiple-job submission with common required files and trivial variations in parameters should be possible through a single submission by the user.
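
The following Python sketch indicates how an API layer could expand a single user request, consisting of a common template and a list of trivially varying parameters, into many job descriptions, tagging them as one group for later monitoring. Field names are illustrative.

    def expand_jobs(template, parameter_list, tag=None):
        """Return one job description per entry in parameter_list,
        sharing the template's executable and input files. An optional
        tag lets the user later select this whole group of jobs."""
        jobs = []
        for i, params in enumerate(parameter_list):
            job = dict(template)
            job["arguments"] = template["arguments"] + list(params)
            job["group_tag"] = tag
            job["job_index"] = i
            jobs.append(job)
        return jobs

    template = {"executable": "athena.sh",
                "arguments": ["jobOptions.py"],
                "input_sandbox": ["athena.sh", "jobOptions.py"]}
    jobs = expand_jobs(template, [("--skipEvents", "0"),
                                  ("--skipEvents", "1000")],
                       tag="rome-test-01")
    print(len(jobs))   # 2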

In order to be able to optimize job submission and distribution on the Grid, a certain number of middleware services must be provided. First of all, one needs a reliable information system, so that jobs can be submitted to sites that support ATLAS, have the required system and software configuration, and have available capacity and network connectivity. Other workload-management services may be present, such as the Resource Broker on the LCG Grid. In all cases these services must be extremely reliable and scalable with a large number of jobs and/or bulk job submission.

Information for job monitoring, accounting, and error reporting must be provided consistently by Grid middleware. Jobs must never disappear without a trace. It is highly desirable that the reporting use a uniform schema for all Grids. In particular, Grid middleware must:

It is highly desirable that the user can specify job groups and be able to tag jobs at submission time, in order to interact with and monitor those specific jobs amongst the many they may have submitted.

Job distribution on the Grid will be done by a combination of Grid services (such as the Resource Broker for the LCG Grid) and directives given by the submission system. In order to optimize job distribution and minimize data transfer and network activity, it is desirable to have ways to send jobs "reasonably close" to where their input data reside. An exact, and dynamically updated, connectivity matrix between all Computing Elements and all Storage Elements is not needed: network bandwidths differ by orders of magnitude between local clusters and trans-oceanic links, so even a rough classification is sufficient. A simple matrix of the "distance" between CEs and SEs, expressed in coarse terms (local, close, same continent, faraway), could be used in match-making and improve, at least on average, network traffic and job throughput.
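
A minimal sketch of such a coarse "distance" matrix and its use in match-making is given below in Python; the site names, categories and distance values are invented for illustration.

    DISTANCE = {"local": 0, "close": 1, "same-continent": 2, "faraway": 3}

    # CE -> {SE -> distance category}
    ce_se_distance = {
        "CERN-CE": {"CERN-SE": "local", "LYON-SE": "close", "BNL-SE": "faraway"},
        "BNL-CE":  {"BNL-SE": "local", "CERN-SE": "faraway", "LYON-SE": "faraway"},
    }

    def rank_ces(input_data_ses, candidates=ce_se_distance):
        """Order candidate CEs by their worst-case distance to the SEs
        holding the job's input data (smaller is better)."""
        def cost(ce):
            return max(DISTANCE[candidates[ce].get(se, "faraway")]
                       for se in input_data_ses)
        return sorted(candidates, key=cost)

    print(rank_ces(["CERN-SE", "LYON-SE"]))   # ['CERN-CE', 'BNL-CE']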


