Computing Technical Design Report

4.5 Event Data

4.5.1 Introduction

The ATLAS event store is a multi-petabyte, globally distributed corpus of data describing real and simulated particle collisions and resulting interactions with the ATLAS detector. Event data are stored and managed at every stage of processing, ranging from readouts of the ATLAS detector written by the ATLAS Event Filter to highly distilled and derived data produced at advanced stages of analysis. The ATLAS Computing Model (Chapter 2) describes the principal processing stages (Raw, Event Summary Data, Analysis Object Data, and more) and event data flow. This information will not be repeated here.

This section describes the infrastructure used to write and read event data, including I/O infrastructure common to all data types, event-store technology choices, navigational infrastructure, principal event-store metadata components, and tools and services that support deployment and operation of a navigable, scalable, and maintainable event store.

4.5.2 Persistence Services

In the parlance of object-oriented programming, "persistence" refers to the capacity of an object to survive beyond the lifetime of the process that creates it. Persistence services provide the machinery to accomplish this, and encompass, for example, the means by which a data object's state may be saved in a file or database, and the corresponding means to restore the object to that state when it is needed as input to subsequent processing.

Athena inherits from Gaudi an approach to persistence that insulates physics code from specific persistence technologies. In this model, physics algorithms write and read data objects to and from an in-memory (transient) store that behaves as a blackboard for data sharing. In Athena, the blackboard infrastructure is known as StoreGate. Retrieval of an object from this transient store may in turn trigger data retrieval from files or databases, but this retrieval is intended to be transparent to physics users, as are any technology-specific operations involved in storage and retrieval of that data. Control framework I/O infrastructure allows specification of which objects in the transient store are to be written to persistent storage, and when: for example at the end of an iteration of an event loop, or at the end of a job.

Providing persistence services to the ATLAS control framework requires more than delivering the capability to write objects and read them back again. The machinery by which the transient store supports object identification, aliases, base-class and derived-class retrieval, and inter-object references must also be persistified: one must not only save and restore (the states of) physics data objects, one must also save and restore the state of StoreGate. Many subtleties in the design and implementation of ATLAS event-store software arise from this fact.

Writers of event data specify which objects are to be persistified by means of an "ItemList," a term inherited from Gaudi. Entries in an ItemList must contain sufficient information to identify uniquely the objects to be written. In Athena, the corresponding information is typically the {type, key} values used for StoreGate retrieval, where "type" may be a class name, or, equivalently, a unique class id (CLID), and "key" is an optional (character) string in the C++ sense. More compact notation allows specifications like "ESD" or "AOD" as shorthand for specific ItemLists, the contents of which may have been predefined at an ATLAS-wide or physics-working-group level.
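For illustration, ItemList entries might appear in an Athena job-options fragment (Python) roughly as follows. This is a minimal sketch assuming the "ClassName#key" convention used in contemporary AthenaPool examples; the stream handle, class names, and keys are placeholders rather than endorsed content definitions.

    # Job-options sketch (run within athena.py, which provides Algorithm, theApp, ...).
    Stream1 = Algorithm("Stream1")            # handle to an OutStream (see the following paragraph)
    Stream1.ItemList += [
        "EventInfo#*",                        # every object of this type, whatever its key
        "ElectronContainer#MyElectrons",      # one specific {type, key} pair
    ]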

"OutStreams" control the writing of selected events. An OutStream is configured at runtime, at which point it is associated with a specific output technology and any corresponding technology settings (the filename, for example, when POOL ROOT is chosen as the output technology). The OutStream is also associated with a corresponding ItemList, and optional event selection criteria (typically a choice made by applying one or more physics "filtering" algorithms). A job may have many OutStreams, each with its own selection criteria and ItemList (and output technology). Thus a single job may write events of interest to a given physics working group to one OutStream and events of interest to another working group to another stream; moreover, the physics working groups may implement different policies regarding which event data objects are retained in their streams.

The ATLAS Computing Model proposes writing disjoint streams of AOD, but this is a deployment policy, not a design restriction: event-store software supports writing any event, at any stage of processing, to one or to several streams. The ATLAS Computing Model proposes also that "primary AOD" definitions (ItemLists) be the same for all streams, but this is again a matter of deployment policy--it is not a restriction imposed by event-store software. (Recall that primary AOD refers to data written at the Tier-0 centre as one of the products of initial reconstruction. AOD potentially produced later at Tier-1 centres by physics working groups may vary from one group to another.) By making these specific choices, though, the Computing Model ensures that there is a master AOD "dataset" in which each event appears exactly once, one that is capable of supporting selections for physics analyses that cross stream boundaries because AOD content does not vary by stream. In this model, streams play the role of a physical clustering optimization, placing like events ("like" by at least one criterion--that used for stream selection) proximately in persistent storage.

Within an OutStream, finer placement control will be possible by mid-2005: when an event is selected for writing to a particular stream, one may specify, for example, that calorimetry be written to one file and tracker data to another. While the current Computing Model does not propose to take advantage of this feature, its presence gives ATLAS the flexibility to optimize data placement should certain kinds of data that must be retained for specific clients prove too unwieldy to deliver to every client at that processing stage.

4.5.3 Storage Technology and Technology Selection

The ATLAS control framework and database architecture support runtime selection of storage technologies; indeed, the architecture supports simultaneous use of multiple technologies in a single job, e.g., POOL ROOT files for event data, a relational database for event-level metadata (tags), and still another relational technology for conditions or interval-of-validity data.

The ATLAS computing effort has over the years demonstrated the flexibility of its architecture and its ability to support a variety of persistence technologies, including an object database (Objectivity/DB), direct use of ROOT, and use of ROOT via the LHC Computing Grid (LCG) common project persistence infrastructure known as POOL. While heterogeneity of storage technologies in the ATLAS event store is therefore possible in principle, there are significant advantages to making specific technology choices for deployment at large scales, and this is what ATLAS has done.

ATLAS deploys two principal technologies for event data. All derived and simulated data use LCG POOL with ROOT files as the storage backend. Only data written by the Event Filter (and data associated with Event Filter simulations) are different: raw event data are written into "bytestream" files in a format matched to ATLAS detector readout.

The LCG POOL project provides not only a persistence technology but also a capability to support runtime storage technology selection. POOL's newly introduced Relational Access Layer and object-relational mapping layers make it possible in principle to decide at runtime whether to write data to a POOL ROOT file or to a relational database. ATLAS retains its own technology direction layer, which predates the POOL project; thus, there are multiple loci in ATLAS software at which one might plug in new or alternative storage technologies should the need arise.

4.5.4 Persistence of the Event Data Model

A requirement of any event-store persistence-technology candidate is the ability to support the ATLAS event model. This does not necessarily require the ability to persistify every C++ language construct, but any limitations on what can and cannot be persistified must not unduly constrain the event data model. While the event model has not yet been finalized in every detail, an important event-store milestone was met in 2004: "POOL demonstrably capable of supporting the ATLAS event model," confirmed by a series of tests using ATLAS Data Challenge 2 data and the then-current event data model. Minor compromises were made and issues associated with persistence of certain CLHEP classes were uncovered, so work continues, but no major impediments to the use of POOL as the primary event-store technology were found.

4.5.5 Automatic Persistence

While software is in development, it is extremely convenient to be able to say "store my object" without concerning oneself with its persistent representation. In the Gaudi/Athena architecture, runtime specification of an output technology triggers loading of a technology-specific "conversion service", which provides the "converters" to map from transient-data object structures to possibly technology-specific persistent layouts and back again. When these converters can be generated automatically, it is a boon to developers. The ROOT and LCG SEAL and POOL efforts have delivered the means to accomplish this for a wide range of data types (perhaps with the aid of a selection file to indicate which data members need not be persistified); indeed, automatic converter generation is standard practice for today's ATLAS event model. When such an automatic mapping is impossible or inappropriate (e.g. inefficient in terms of space usage), a capability to invoke hand-coded converters is also provided.

An artifact of a pure persistence model is that the reader's view is the writer's view: one takes a snapshot of data objects in the transient store after a processing stage, and later readers may retrieve some or all of these objects. Often this is precisely what is wanted; the first algorithm of the next job picks up where the last algorithm of the previous job left off. It is not a view typically taken in databases, though, wherein clients extract the view they desire in a manner that may be conceptually independent of any particular data-writer's view.

ATLAS event-store persistence infrastructure has been designed to accommodate delivery of a client view (transient objects) different than that recorded persistently by providing a means to distinguish transient from persistent type identification, and machinery in both ATLAS-specific and POOL layers to handle cross-type mapping. (In principle, the persistent view need not be the writer's view either: the same machinery can support transient-to-persistent type remapping on output.) When ATLAS data taking begins, it is expected that the pure persistence approach (writer's view and reader's view are essentially the same, and persistence captures the writer's view) will dominate, but the more general capability incorporated into the event-store technology design may grow in importance as schema and even language choices evolve.

4.5.6 Navigational Infrastructure

In addition to support for persistence of inter-object references in the transient event data model, the ATLAS event store also retains navigational information to allow access to upstream data.

The model is this: when an event is written via an OutStream, each event data object in the associated ItemList is written. In addition, a master object (a DataHeader) is written, which tracks where each individual physics object has been written, along with its StoreGate identity (type and key), and any other information needed to restore the state of StoreGate. This DataHeader serves implicitly as the entry point to any event: when one retains a reference to a persistent event, it is in fact a reference to this DataHeader.

A DataHeader also retains references to DataHeaders from upstream processing stages. This allows back navigation to upstream data, so it is unnecessary to rewrite, say, Monte Carlo truth data in one's AOD, though one may certainly do so as a performance optimization. Back navigation happens by recursive delegation: a reader of AOD who asks for calorimeter cells (by type and key, in the StoreGate fashion) will trigger a retrieval request from the input (AOD) event, which will not find anything with that type and key attached to its DataHeader. It will then delegate the request to its parent DataHeader, and so on. Back navigation may optionally be turned off to avoid undesired traversal of multiple files, which may or may not be locally available. More sophisticated back navigation control is under development.
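As an illustration, back navigation might be switched on or off in the job options of an AOD reader roughly as follows; the property name reflects the AthenaPool event selector of this period and should be taken as indicative.

    # Job-options sketch for an AOD-reading job (run within athena.py).
    EventSelector = Service("EventSelector")
    EventSelector.BackNavigation = True     # delegate missed retrievals to upstream DataHeaders
    # EventSelector.BackNavigation = False  # confine retrievals to the input (AOD) files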

4.5.7 Schema Evolution

The ATLAS event data model will almost certainly change over time, and event-store planning must anticipate this. ATLAS must retain the ability to read both old and new data after such a change, regulate the introduction of such changes, minimize the need to run massive data conversion jobs when such changes are introduced, and maintain the machinery to support such data conversions when they are unavoidable. In database literature, such changes to the layout of persistent data structures are known as schema evolution.

ATLAS has chosen a technology for its event store that already provides the capability to accommodate minor schema changes (addition or deletion of data members, for example), and these capabilities are improving. For more substantial changes to the event data model, the architecture provides at least two loci for introduction of code to detect and compensate for changes in persistent-data format. This work can be done at the "converter" level in Gaudi/Athena persistence layers, or, at a lower level, via "custom streamers" in POOL. Tools to detect schema changes from one release to another have been delivered; these provide an aid to support policy-level control of changes to the content definition of endorsed event-store data products at various processing stages.

4.5.8 Event-level Metadata

In a multi-petabyte event store, a means to decide which events are of interest to a specific analysis without first retrieving and opening the files that contain them, and to efficiently retrieve exactly the events of interest and no others, is an important capability. For standard analyses, standard samples will be built, according to criteria provided by physics working groups. Streams written as a product of primary reconstruction at the Tier-0 centre provide one means of building these samples. For analyses that require samples that cross stream boundaries, or for finer subsamples within a standard dataset, event-level metadata (tag databases) provide a means to support event selection.

When an event is written, metadata about that event (event-level metadata) may be exported, along with a reference to the event itself. Quantities chosen for export are those likely to be used later for event-sample selection, and are defined by the physics and physics-analysis-tools working groups. These event tags may be written to a POOL ROOT file or to a relational database.

RegistrationStreams provide the control framework machinery to accomplish this export of event-level metadata, and POOL explicit collections provide the implementation. An event of interest to multiple analyses may be written exactly once, but the corresponding event tag and pointer to the event itself may be recorded in multiple collections; alternatively, the tag may be written to a single collection, with subsequent queries used to build multiple collections corresponding to multiple, specific analysis selections. Event-store back-navigation machinery provides the means by which a collection built, for example, after AOD has been written, may nonetheless support selections of data from upstream processing stages.
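A RegistrationStream might be configured roughly as in the following sketch, which writes event tags and DataHeader references to a POOL explicit collection. The property names follow contemporary RegistrationServices examples and are indicative only; the attribute-list key and collection name are placeholders.

    # Job-options sketch (run within athena.py). Assumes an upstream algorithm
    # has placed an attribute list of tag quantities in StoreGate.
    theApp.TopAlg += ["RegistrationStream/RegStream1"]

    RegStream1 = Algorithm("RegStream1")
    RegStream1.CollectionType   = "ExplicitROOT"         # POOL explicit collection in a ROOT file
    RegStream1.OutputCollection = "MyEventTags"          # placeholder collection name
    RegStream1.ItemList += ["DataHeader#*",              # reference to the event itself
                            "AthenaAttributeList#MyTag"] # tag attributes to export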

4.5.9 Event Selection and Input Specification

ATLAS event-store infrastructure provides many means to specify input to an event-processing job. The means most frequently used today is the name of a single file, but the infrastructure also supports lists of files (logical or physical), dataset names that map to lists of files, lists of specific events (via event collections/tag databases), specific event sets with filter criteria (query predicates applied to event collections), and heterogeneous combinations of all of these.

Only the primary input events need be specified to the Athena EventSelector. One specifies a file containing AOD, for example, as input, and back-navigation machinery (in the presence of an appropriate file catalogue) will find and open ESD if necessary.
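The following sketch contrasts plain-file input with input from an explicit tag collection carrying a query predicate. The include file and property names follow the AthenaPool event selector of this period and are indicative only; the file, collection, and attribute names are placeholders.

    # Job-options sketch (run within athena.py).
    include("AthenaPoolCnvSvc/ReadAthenaPool_jobOptions.py")
    EventSelector = Service("EventSelector")

    # (a) Plain POOL ROOT files as primary input:
    EventSelector.InputCollections = ["myAOD.0001.pool.root", "myAOD.0002.pool.root"]

    # (b) Alternatively, an explicit event collection (tag database) with a
    #     selection predicate; only events satisfying the query are read.
    # EventSelector.CollectionType   = "ExplicitROOT"
    # EventSelector.InputCollections = ["MyEventTags"]
    # EventSelector.Query            = "NLooseElectron > 0"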

4.5.10 Event Store Services

A variety of tools and services support identification, selection, navigation, and extraction of data samples of interest in the event store. Metadata associated with datasets (both file sets and event collections) support queries to identify which datasets are of interest. Middleware typically maps such selections into lists of logical files (or lists of file sets used as units of data distribution), which can then be located by other middleware services, including one or more replica catalogues. File management and cataloguing are discussed in Section 4.6.

Typical output of an event-level metadata query is a list of references to events that satisfy the selection predicate. Such output suffices for jobs that run, for example, at a Tier-1 centre, where all AOD reside on disk, but may not suffice for laptop-level analysis. For such purposes an associated utility to extract replicas of exactly the selected events and no others into one's personal files is also provided. A related utility that builds the list of logical files that contain the selected events has also been delivered.

A variety of diagnostic utilities--to determine the contents of an event data file, to verify that all data objects in the file are retrievable, and so on--have been deployed, and the list is growing.

4.5.11 Summary

One guiding principle of event-store architecture has been to design for generality, and to support specific deployment choices and constraints by means of (enforceable) policy. Thus the design supports multiple persistence technologies, but ATLAS will deploy a specific one (ROOT-file-based LCG POOL); the design supports possibly overlapping streams, while the computing-model policy may keep streams disjoint; the design supports content definitions that may vary from stream to stream, while the Computing Model may call for keeping definitions identical; the design supports arbitrary transient/persistent type remapping but deployment at turn-on may not exploit this. A second principle is that the user view of event data storage and access is the same for laptop users as for distributed Grid-production jobs--only the specific data sources and scale change.


