Data access and analysis
Page under construction
This page collects the most important information on accessing and analysing distributed data. It gives an overview of the applications, which are described in more detail on separate sub-pages. The documentation is restricted to the most commonly used tools; all of them are still evolving, in particular towards better interoperability. For each application the most important references are listed, together with a short summary of commands and options; each command links to its full description. Only part of the functionality is covered here, so please consult the detailed documentation for the rest. These pages are meant as a quick reference; they complement the tutorials and workbooks but do not replace them.
Please send feedback on any additional information that would make these pages more useful.
You need a Grid certificate to submit jobs and download data, see Starting on the Grid.
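Once the certificate is installed, a Grid session typically starts by creating a short-lived VOMS proxy from it. A minimal sketch is shown below; the exact commands and the VO name depend on your site's setup, and this assumes the VOMS client tools and your certificate (usually under `~/.globus`) are already configured:

```shell
# Create a proxy credential for the ATLAS VO from your Grid certificate
voms-proxy-init -voms atlas

# Inspect the proxy (issuer, VO attributes, remaining lifetime)
voms-proxy-info
```

Jobs and data transfers will fail with authentication errors once the proxy expires, so it is worth checking the remaining lifetime before submitting long sets of jobs.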
Introduction
Tools are available (see the summary table):
- to find the datasets, their file content and the conditions (jobOptions) under which they were produced;
- to submit jobs and monitor the job status and output files;
- to associate datasets (provenance), which are related through a sequence of production steps, like generation, simulation, reconstruction.
The data are processed and stored under three different Grid flavours (EGEE/LCG, OSG, NorduGrid), each with its own job-submission software and data catalogues. The data management layer Don Quixote 2 (DQ2) hides these differences from the user.
User productions are performed with the ProdSys database, Panda/pathena and Ganga; data analysis is performed with Panda/pathena (on OSG) and Ganga (on EGEE/LCG).
The Tag navigator uses the Tag database to select events.
Local batch systems are LSF at CERN and Condor at BNL.
Summary of applications and their services
| Application | Sub-page | Services |
|---|---|---|
| DQ2 | DQ2 | Datasets and files: browse, download |
| DQ2 - prod | DDM and DQ2: additional info for productions | |
| AMI | AMI | Datasets: browse, search, provenance, related files (jobOptions, selection cuts) |
| Panda / pathena | Panda | Datasets and files: browse, search, status of replications. Jobs: submit at BNL, check status |
| Ganga | Ganga | Jobs: submit at LCG, check status. Future: multiple backends (LCG, Panda, NorduGrid, LSF, etc.) |
| ProdSys | | Datasets: browse, production status per Tier |
Dataset types and their distribution
Data are organised in datasets, which in turn contain data files. Each dataset is described by metadata. A dataset can be available at one or more locations, and can be complete or incomplete at a given location. Be aware that at present only complete datasets can be used in dataset-based brokering.
Data types and their distribution over the Tier centres (according to the computing model) are shown in the table below. Data sizes are for release 12.0.6 without pile-up; MCTruth contributes an additional 0.08 MByte/event.
| Format | Size/evt (MByte) | Description | Comment |
|---|---|---|---|
| RDO | | Raw object data | Complete copy at CERN, another copy shared by the Tier1s. Dataset granularity per run and per stream. |
| ESD | | Event summary data | Complete copy at CERN, replicated at BNL; ESD available at the Tier1 where it was produced; access may be restricted |
| AOD | | Analysis object data | Replicated to all Tier1s; in addition replicated to Tier2 clouds |
| DPD | | Derived physics data | One global DPD, several special DPDs for physics groups |
| TAG | | Metadata | For quick selection of interesting events |
The size/event listed in the table is the present size; the target size is in all cases smaller.
Various formats are presently in use at the analysis level: AOD, DPD, HPTV (High Pt View ntuple), AANT (Athena Aware NTuple), SAN (Structured Analysis NTuple), pAOD (persistent AOD), CBNT (Combined Ntuple), and group-defined ntuples. The "Analysis Model" working group aims to converge on fewer formats in the future.
The ESD/AOD classes are merged in release 13; identical interfaces can be used for ESD and AOD analysis. Datasets have the same granularity at the RDO and ESD level, but will be merged at the AOD level into "super-datasets".
Location of datasets
Older datasets, especially RDOs, are archived, see Data Replication (part of ComputingOperations).
Distributed data access
Several tools with partly overlapping functionality (see the summary table) are available to browse and select data:
- DQ2 is the tool for downloading datasets or files, either from a local Tier store or Grid-wide. Datasets can be listed with a one-line command that accepts wildcards.
- AMI was chosen for browsing the metadata. Dataset searches are facilitated through drop-down menus with predefined choices. AMI also displays the "provenance" of a dataset, i.e. the production chain from MC dataset to AOD dataset(s).
- The Panda system, located at BNL, lists datasets and their file content, and the location of replicas at the different Tiers. It provides the tools for submitting jobs at BNL and following up the job status.
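As an illustration, dataset listing and download with the DQ2 end-user tools might look as follows. Command names and options vary between DQ2 client versions, and the dataset name below is a made-up placeholder, not a real dataset:

```shell
# List datasets matching a wildcard pattern (placeholder pattern)
dq2-ls 'trig1_misal1_mc12.*.AOD.*'

# List the files contained in one dataset
dq2-ls -f trig1_misal1_mc12.005001.pythia_minbias.recon.AOD.v12000601

# Download the dataset (from a local replica if one exists, otherwise Grid-wide)
# into the current directory
dq2-get trig1_misal1_mc12.005001.pythia_minbias.recon.AOD.v12000601
```

Check the DQ2 sub-page for the options supported by the client version installed at your site.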
Distributed production and analysis
Ganga may become the standard application for job submission and follow-up. It hides the differences between the Grid flavours, partitions a submitted request into several jobs, and merges the resulting output files.
Panda/pathena is a lightweight framework for job submission with similar functionality and good user support. It provides access to a large set of data on the BNL server. Starting from release 4.3, Ganga will interface to Panda.
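A pathena submission typically mirrors the corresponding local Athena command, with the input and output datasets named on the command line. A hedged sketch, in which the jobOptions file and both dataset names are placeholders:

```shell
# Submit an Athena analysis job to Panda; --inDS names the input dataset,
# --outDS the user output dataset (both are placeholder names here)
pathena AnalysisSkeleton_topOptions.py \
    --inDS trig1_misal1_mc12.005001.pythia_minbias.recon.AOD.v12000601 \
    --outDS user.jane.doe.test.AOD.v1

# Follow up submitted jobs with the Panda bookkeeping tool
pbook
```

The output dataset can afterwards be retrieved with the DQ2 tools like any other dataset.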
The ProdSys database allows users to browse data; ProdSys must be used for production jobs.
Use of Castor at CERN
Castor is the storage management system used at CERN. The most important Castor commands are listed on a separate page.
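As a quick reminder, some frequently used Castor commands are sketched below; the paths are placeholders, and the dedicated page should be consulted for the full list:

```shell
# List a directory in the Castor name space (placeholder path)
nsls -l /castor/cern.ch/user/j/jdoe

# Copy a file from Castor to local disk, and back
rfcp /castor/cern.ch/user/j/jdoe/myfile.root .
rfcp myfile.root /castor/cern.ch/user/j/jdoe/

# Pre-stage a file to the disk pool before reading it from a batch job
stager_get -M /castor/cern.ch/user/j/jdoe/myfile.root
```

Pre-staging avoids batch jobs blocking on tape recalls when the file is not already on disk.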
Information on CSC data
- Useful information on CSC data is collected in a separate page.
See also
- Athena DBAccess -- access to geometry and conditions data in Athena jobs. Use of SQLite files.
- Site downtime -- over next 7 days [may move this to page with operations status TBD]
- Grid glossary -- the Grid jargon
- Grid operations center -- showing downtimes
- Site at Lyon -- explaining event data access
- Usage of the Distributed Analysis tools -- in the ATLAS SUSY group, ATL-SOFT-INT-2007-005 (pdf, 8 pages, restricted access)
- Problems: PosixIO, disk pool manager (DPM) for Tier2 clouds
- Transient-persistent separation
Tutorials
- Distributed Analysis -- various Tutorials, covering different aspects of DDM and analysis tools
- Using Tags -- accessing tags and use of the Tag Navigator Tool (TNT)
- Pathena tutorial -- see BNL Jamboree March 2008
Acknowledgements
We appreciate input for this page and its sub-pages from experts and guinea pigs: Solveig Albrand, Stephane Jezequel, Dietrich Liko, Monika Wielers, [ TBD - add more names ]