6.5 Experience with Data Challenges and other Mass Productions

To prepare for data taking in 2007, the LHC Computing Review [6-1] in 2001 recommended that the LHC experiments should carry out data challenges (DCs) of increasing size and complexity. A data challenge comprises, in essence, the simulation, done as realistically as possible, of data (events) from the detector, followed by the processing of those data using the software and computing infrastructure that will, with further development, be used for the real data when the LHC starts operating. The goals of the ATLAS Data Challenges are the validation of the ATLAS Computing Model, of the complete software suite, of the data model, and to ensure the correctness of the technical computing choices. In addition, the data produced allowed physicists to perform studies of the detector and of different physics channels.

The preparation of the simulated data consists in the generation of a large number (>107) of simulated "events". Each "event" is the result of a collision between two protons, and the full simulation requires the following steps:

Because of the very high intensity at which the LHC will operate, many proton-proton collisions will usually occur during the sensitive time of the particle detectors. Hence the digitized output from the ATLAS detector will contain the "piled-up" signals from the particles produced in several collisions. This leads to the next step:

These events can now be used to test the software suite that will be used on real LHC data:

The data fed to the reconstruction step are, as far as possible, indistinguishable from real data, with one important exception: since the original information on the particle trajectories and energies is known from the event generation step, this information ("truth") is also recorded so that the output from the reconstruction step can be compared with it.

Up to now two Data Challenges have been run, DC1 in 2002 and 2003, and DC2 in the second half of 2004, followed in the first half of 2005 by a large-scale production to provide data for physics studies in view of the ATLAS Physics Workshop in June 2005.

6.5.1 Data Challenge 1

ATLAS Data Challenge 1 ran from spring 2002 to spring 2003. The main goals of DC1 can be summarized as follows:

Put in place the full software chain;
Deploy a large-scale production system;
Provide data for the High-Level-Trigger community in view of their Technical Design Report;
Build the ATLAS DC community.

Ten million physics events and 40 million single-particle events for a total volume of about 70 TB were produced using 21 MSI2k-days. Forty institutes in 19 countries actively participated in the effort. A full report on the exercise can be found in See R. Sturrock et al., A Step Towards A Computing Grid For The LHC Experiments: ATLAS Data Challenge 1; CERN-PH-EP-2004-028.

It appeared clearly during DC1 how important it is to have efficient procedures to deploy and certify the various components of the software, to validate the many sites which participated in the enterprise, and to monitor the quality of the data which was produced. The Distribution and Validation kits which are now part of the standard ATLAS infrastructure come directly from the DC1 experience.

During DC1 we have seen the emergence of production on the Grid. The production was done exclusively on the Grid in NorduGrid, it was partially done on the Grid on the US test-bed, especially in the phase 2 of DC1, and serious testing was done with prototypes of EDG (European DataGrid) middleware in close collaboration with the EDG developers. The lessons learned were extremely important for the development of the Grid projects.

ATLAS DC1 has proved to be a very fruitful and useful enterprise, with much valuable experience gained, providing feedback and triggering interactions between various groups, for example groups involved in ATLAS computing (e.g., HLT, offline-software developers, Physics Group), Grid middleware developers, and the CERN IT Department.

One of the most important benefits of DC1 has also been to establish a very good collaborative spirit between all members of the DC team and to increase the momentum of ATLAS computing as a whole.

6.5.2 Data Challenge 2

One of the main lessons learned from DC1 was that the production system needed to be more automated. The system was redesigned in summer 2003 and developed in the following months. The system is described in Chapter 5 of this document.

In parallel the Grid middleware was evolving and it became clear that ATLAS had to use different Grid flavours, as they were becoming more mature.

It was then decided that Data Challenge 2 (DC2) should concentrate on the following aspects:

Deploy and use the new production system;
Use in a coherent way three of the existing Grid flavours: LCG, Grid3 in US and NorduGrid in Scandinavian and associated countries;
Perform a test of the operations that should be done in summer 2007 when the real data will be delivered by the detector. This test is called Tier-0 exercise.

The scale of the exercise was defined to produce at least the same amount of data as for DC1; that means fully simulate of the order of 10 million events.

The components of the software chain were the same as for DC1 with the following important differences:

All software components were the newly developed ones based on object-oriented technology. They were developed either by the ATLAS collaboration or by other projects like the Geant4 simulation or the OO persistency system POOL, both developed within the LCG Applications Area project.
In order to perform the Tier-0 exercise in conditions close those of the real situation in 2007, it was decided to add a new step in the production chain, called event mixing, in which different physics channels were mixed in such a way that the reconstruction program receives a realistic mixture of event types.
As part of the Tier-0 exercise it was also foreseen to build the TAG collections after reconstruction of the events, and to distribute in real time both the ESD and AOD data to the Tier-1s as defined in the Computing Model.

6.5.2.1 Phase 1 (Event Generation and Detector Simulation)

Phase 1 was run only on the Grid using the three available flavours: LCG, Grid3 and Nordu-Grid. Four types of jobs were run: first the event generation followed by the Geant4 simulation, then the digitization and/or the pile-up (coupled with the digitization) of the data. The output data of each step were persistified in POOL format and registered in the Grid catalogues.

6.5.2.2 Phase 2 (Event Mixing)

Phase 2 consisted of two steps. In the first one the digitized data, produced worldwide, was concentrated at CERN. In the second step the events of different physics channels, which were produced independently, were mixed in order to have runs with a realistic mixture of physics events (as would be produced at the output of the Event Filter). The input of the process was the POOL digitized data; the output was either RDOs (Reconstruction Data Objects) in POOL format or ByteStream (BS) data equivalent to raw data. Typically ten input files were used producing one output file, events were picked up randomly in an ad hoc proportion. Pile-up was run in parallel on a subsample of more than two million events.

6.5.2.3 Phase 3 (Tier-0 Exercise)

Phase 3 was supposed to be a simulation of what will be the production at Tier-0 in 2007. In a first step each RDO or BS file should have been reconstructed producing Event Summary Data (ESD) in POOL format. In the second step ESD files should have been read and Analysis Object Data (AOD), condensed useful information for analysis by the physicists, should have been produced again in POOL format. Finally in a third step we wanted to stream the AODs in ten different physics channels and produce event collections to be used by the different physics groups.

6.5.2.4 Running Experience

Phase 1: From July to mid September 2004 for the Geant4 simulation and in October and November 2004 for digitization, 10M physics events were produced, corresponding to 40 TB of data in 200 000 files. The three Grids were used roughly in the ratio LCG/Grid3/NorduGrid: 40/30/30% with a global efficiency of the order of 60%. In addition several million single-particle events and calibration events were produced in parallel. Since, as explained in the DC1 section, the pile-up step needed to use several input files, it was decided, for practical reasons, to run the process in a reduced number of sites where the "minimum-bias" and "cavern background" files were replicated in advance and to reduce the production to a subset of 3.5 million events.
Phase 2: Started in mid November 2004. Data transfer to CERN was done using the ATLAS data management system DonQuijote. In the initial phase the transfer rate was of the order of 2-3000 files per day, that means 500 GB/day. After some improvements the transfer rate increased to 10 000 files per day (1.5 TB/day).
Phase 3: For technical reasons the scope of the Tier-0 exercise was reduced, partly because some components of the software were not ready, partly because the resources at CERN were not fully available (due to the concurrent move to the new SLC3 operating system), and the need not to delay the start of the mass production for the physics workshop in June 2005 (Section 6.5.3 ). The exercise was reduced to a "proof of principle" and will be resumed in Summer 2005.

During the full period of DC2 several problems were encountered. During Phases 1 and 2 the following main problems were identified:

Stagein/out, file transfer and file registration problems;
Degradation of the response time of the central production database (Oracle);
Problems with site (mis)configuration, or temporary faults, such as losing connection between worker nodes and the shared filing system where ATLAS software is installed;
Problems with the LCG information system, leading to jobs going to the wrong queue;
Connections lost with the LCG Resource Broker, leading to "orphan" files that could not be registered;
Slowness of the LCG Resource Broker which limited the number of jobs we could run per day.

During Phase 3 we suffered also from problems with the POOL catalogue due to incompatibility of the LCG middleware and of the ATLAS software, as different versions were used and we had to set up the configuration by manual intervention. That showed the importance of a good synchronization in the development of software components used in different domains.

6.5.2.5 Conclusions

We started the DC2 production with a production system that was still under development and we discovered a lot of problems in the early days.

In addition, the development status of the services of the Grid caused troubles while the system was in operation. For example the Globus RLS (used by Grid3 and NorduGrid), the LCG Resource Broker and the information system were unstable during the initial phase.

Especially on the LCG Grid, we suffered from the lack of a uniform monitoring system and from the mis-configuration of sites and site-stability related problems.

Human errors, such as "expired proxy", and bad registration of files caused some additional troubles. We experienced network problems that caused the loss of connection between processes and Data Management System problems (e.g. loss of connection with the mass storage system).

Nevertheless we succeeded in running a large-scale production only on the Grid, using three Grid flavours, using an automatic production system that makes use of Grid infrastructure. Several "10 TB" of data were produced and moved among the different Grids using DonQuijote (ATLAS Data Management) servers. More than 200 000 jobs were submitted by the production system and we succeeded in running more than 2500 jobs per day.

6.5.3 Production for the ATLAS Physics Workshop

Immediately after DC2 we started the production of data for the ATLAS Physics community in view of the physics workshop in Rome in June 2005. That production was divided into one preparation phase (0) and 4 main phases:

Phase 0: Event generation;
Phase 1: Geant4 simulation, digitization and reconstruction;
Phase 2: Replication of ESD, AOD and CBNT to CERN;
Phase 3: Analysis;
Phase 4: Pile-up.

Except for Phase 0 the production was run using the ATLAS production system ProdSys. The production model is very close to what was done for DC2.

Phase 0 was run by physicists on non-Grid systems. The files were registered in the Grid catalogue and replicated on the Grid(s) where they should be used later on. The manual registration into the Production System led to human errors, which had to be subsequently corrected, and should be avoided in future.

Phase 1 was run on Grid only using the three flavours: LCG, Grid3 and NorduGrid. Three types of jobs were run: first the Geant4 simulation, then the digitization of the data, and finally the reconstruction of the data producing ESD, AOD and CBNT (Combined Ntuple) in the same job. For each of the three steps the output data was persistified in POOL format and registered in its corresponding Grid catalogue, except the CBNT output that is in ROOT format. From end-January to end-March 2005 more than 5 million events were produced.

Phase 2, the concentration of the data at CERN was done in April and May 2005, using the ATLAS Data Management System DonQuijote. File transfer rates were limited by the performance of the Castor mass storage system at CERN.

Phase 3: The analysis scenario is still being developed. The baseline model is to use both AODs and TAG collections. In a first step the number of AOD files is reduced (in a ratio of 20:1) by concatenation. At the same time we produce TAG collections that will then be used for analysis.

Phase 4: It is intended to prepare some pile-up samples at a luminosity of 1033. Three types of input files are needed: the signal, the minimum-bias, and the cavern background. These files have been produced and replicated to a limited number of sites that have enough resources to perform the exercise. It should be noted that the pile-up has been done at a luminosity ten times lower than the one used for DC2, in practice that means that the process is simpler and less stressing, in term of resources and I/O, than the same process run in DC2.

The four phases of the production were run in parallel, therefore we had at the same time in the system long jobs (simulation) and short jobs (reconstruction).

The ATLAS Production System (ProdSys) was run in the same way as for DC2 with a notable difference since we decided to use in addition a new version of the LCG executor using Condor-G to distribute jobs instead of the LCG Resource Broker. The net effect was an increase in the number of jobs running in parallel on LCG and a better use of the available resources. On the best day we ran more than 12 000 jobs.

More than 500 000 jobs were run, producing 6.1 million events fully simulated and reconstructed in 173 different physics samples, some of them with pile-up.

Despite the fact that we are trying to consolidate the production system, we have still many errors. The fraction of errors as measured for the LCG Grid is given in Table 6-1. It is computed as the ratio of the number of jobs failed for one specific error source over the total number of submitted jobs.

Table 6-1 Production for ATLAS Physics workshop failure rates on the LCG Grid.
System	Causes	%	Comments
Workload management		2.7
Data management
	stagein	25.5	~11.2% timeout ~7.3% connection to information system lost
	stageout	0.7
ATLAS or LCG Grid configuration
	Athena crash	4.3	Often "maxtime" not well defined
	User proxy expired	0.5
	Site configuration	0.7
Unclassified		5

Owing to the very high number of files produced we experienced problems with the CERN Castor mass storage system. The catalogue was so large that it became inefficient. With the help of the CERN Castor team we succeeded in keeping the system running. We also had problems with the EDG/LCG Replica Location Service which were solved only after five days of intensive debugging.

All these problems show how important it is to stress-test in due time all the components of the system.

6.5.4 Combined Test Beam

Another important component of the evaluation of the Computing Model was the analysis of the Combined Test Beam (CTB) data.

A full slice of the barrel detector of the ATLAS experiment at LHC was tested for six months in 2004 with beams of pions, muons, electrons and photons in the energy range 1-350 GeV in the H8 area of the CERN SPS. The set-up included elements of all ATLAS sub-detectors:

Inner Detector: Pixel, Semi-Conductor Tracker (SCT), Transition Radiation Tracker (TRT);
Electromagnetic Liquid Argon Calorimeter;
Hadronic Tile Calorimeter;
Muon System: Monitored Drift Tubes (MDT), Cathode Strip Chambers (CSC), Resistive Plate Chambers (RPC), Thin Gap Chambers (TGC).

Ninety million events, with an average size of 50 kB/event, were collected and stored on Castor, for a total of about 4.5 TB.

It has been a challenging pre-commissioning exercise: for the first time, the complete software suite developed for the full ATLAS experiment was used for real detector data. Important integration issues like combined simulation and reconstruction, connection with the online services and management of many different types of conditions data were addressed for the first time.

In order to manage simulation, digitization and reconstruction of such a large data volume, we adopted, for large-scale CTB operations, the production systems that had been developed for the Data Challenges:

for the applications that need access to the conditions database and are performed at the CERN LSF batch system (reconstruction, both of real and simulated data, and digitization of simulated data) we used the DC1 production infrastructure;
for the bulk production of simulated data on the Grid we used the production infrastructure developed for DC2 and for the Rome production.

6.5.4.1 Production Activities for Real Data

For book-keeping of the reconstruction of real data we used the AMI metadata catalogue and the AtCom [6-7] job submission system, with some new features to cope with CTB requirements.

Large-scale reconstruction for real data has already been performed twice in 2005 with two different software releases. In both cases a large sample of good runs (about 400, for a total of about 25 million events) was processed, half of them with the fully combined set-up and half of them with combined calorimetry only. See the CTB Offline Web Page [6-8] for more details.

The average CPU time per event is of the order of 1.5 kSI2k-seconds. All the runs were split into logical files of maximum 10 000 events each and for each job three types of output files were produced and stored on Castor:

ROOT files with the combined n-tuples (20 kB/event);
POOL files with ESDs (40 kB/event);
XML files storing the POOL file catalogue needed by the ESD files.

These productions were rather successful, with an overall failure rate of about 2%. New reprocessing of these data is already foreseen later in 2005 with the next major ATLAS software release.

6.5.4.2 Production Activities for Simulated Data

The Monte Carlo production for the 2004 Combined Test Beam started in September 2004. The production was divided in two parts: a preproduction phase and a production phase. The preproduction phase started in September 2004 and lasted until April 2005. During this period, Monte Carlo events were simulated, digitized and reconstructed using AtCom on the CERN LSF batch system and the AMI database was used for book-keeping. The goal of this production was twofold:

Test the different software packages that were part of the simulation, digitization and reconstruction in order to obtain a stable version;
Compare the first Monte Carlo reconstructed events with those produced by the reconstruction of real data.

The events were processed using different ATLAS software releases from 8.6.0 to 10.0.2. More than 500k electron, 900k pion, 350k muon and 10k proton events, for a total of >1.7 million events, were generated. The statistics of the simulated, digitized, and reconstructed samples is slightly different, mostly due to the fact that the software was being developed and not all detector components were available in a given release. The mean processing time per event ranged between 1.7 kSI2k-seconds for electron simulation and 1.0 kSI2k-seconds for pion reconstruction.

In May 2005 we started the production phase for a sample of 392 runs, corresponding to almost 4 million events, with the same conditions of the good "real" runs selected by the CTB groups. For that reason, we adopted the tools used for Rome production in order to cope with such a big production. For the simulation, job definition entries were filled in the ATLAS development database. We used the standard LCG Grid productions tools (the Windmill supervisor and the Lexor executor). ATLAS release 10.0.1 was used for the simulation together with a few additional software package tags. Once the jobs finished successfully, the output files were replicated to the CERN Storage Element.

The digitization and reconstruction of those events is being done at CERN using AtCom and ATLAS release 10.0.2. This part of the processing is not being submitted to the Grid since the CTB Conditions Database is not replicated worldwide.

More information about the produced samples can be found in the CTB Simulation web page listed in See ATLAS test beam simulation, http://atlas.web.cern.ch/Atlas/GROUPS/SOFTWARE/OO/testbeam/simulationCTB/productionMC.html.

6.5.4.3 Conclusions

The main problem that we had during these large-scale productions is that, even for stable and "validated" releases, a lot of "harmless errors" were issued during reconstruction and we had to wait for final answers from many developers before starting each production. This is due to the fact that usually the developers do not test the code on large statistics. So the "final" validation of reconstruction in a given release is quite painful.

The experience with AMI and AtCom has been quite positive: AMI is a useful book-keeping database and helps the end user to search for relevant information concerning the productions. It is of great help also for checking the production itself while it is runing. AtCom is a very practical tool for job submission but it cannot cope with the monitoring of a large number of jobs: job submission and monitoring should be done in groups of no more than 500 jobs at the same time. It was decided to use the Rome production Grid tools for CTB simulation on account of the large amount of runs that have to be simulated.

Concerning future activities, we know already that the simulation of data for the CTB will continue at least until the end of 2005 and that a full reprocessing of good real data will be done at least twice before the end of 2005, with two new major ATLAS releases, namely a new set of conditions data and improved combined reconstruction algorithms.