Data access and analysis

Page under construction

This page collects the most important information on accessing and analysing distributed data. It gives an overview of the applications, each of which is described on a separate sub-page. The documentation is restricted to the most commonly used tools. All tools are still evolving, especially towards better interoperability. For each application the most important references are listed together with a short summary of commands and options. Each command is clickable for easy access to the full description. Only some of the services are described; please consult the detailed documentation to learn about the full functionality. These pages are meant as a quick reminder and pocketbook: they complement the tutorials and workbooks and do not replace them.

Please send feedback on any additional information that should be included to make these pages more useful.

You need a Grid certificate to submit jobs and download data; see Starting on the Grid.

Introduction

Tools are available (see the summary table)

The data are processed and stored under three different Grid flavours (EGEE/LCG, OSG and NorduGrid), each with its own software for submitting jobs and its own catalogues for storing the data. The data management layer Don Quixote 2 (DQ2) hides these differences from the user.

User productions are performed with the ProdSys database, Panda/pathena and Ganga; data analysis is performed with Panda/pathena (on OSG) and Ganga (on EGEE/LCG).

The Tag navigator uses the Tag database to select events.

Local batch systems are LSF at CERN and Condor at BNL.

Summary of applications and their services

| Application | Sub-page | Services |
|---|---|---|
| DQ2 | DQ2 | Datasets and files: browse, download |
| | DQ2 - prod | DDM and DQ2: additional info for productions |
| AMI | AMI | Datasets: browse, search, provenance, related files (jobOptions, selection cuts) |
| Panda / pathena | Panda | Datasets and files: browse, search, status of replications. Jobs: submit at BNL, check status |
| Ganga | Ganga | Jobs: submit at LCG, check status. Future: multiple backends (LCG, Panda, NorduGrid, LSF, etc.) |
| ProdSys | | Datasets: browse, production status per Tier |

Dataset types and their distribution

Data are organised in datasets, which in turn contain datafiles. Metadata describe a dataset. Datasets can be available at one or more locations. They can be complete or incomplete at a location. Be aware that at present only complete datasets can be used in a dataset-based brokering process.
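The rule that only complete datasets take part in dataset-based brokering can be sketched as a small data model. This is a minimal illustration and not DQ2 code: the class names `Replica` and `Dataset`, the function `brokerable_sites`, and the dataset and site names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of a dataset at a site (hypothetical model, not the DQ2 schema)."""
    site: str
    files_present: int


@dataclass
class Dataset:
    """A named collection of data files plus the sites that hold copies."""
    name: str
    n_files: int
    replicas: list = field(default_factory=list)

    def is_complete_at(self, replica: Replica) -> bool:
        # A replica is complete when the site holds every file of the dataset.
        return replica.files_present == self.n_files


def brokerable_sites(dataset: Dataset) -> list:
    """Return the sites holding a complete replica, in insertion order."""
    return [r.site for r in dataset.replicas if dataset.is_complete_at(r)]


# Made-up dataset and site names, for illustration only.
ds = Dataset(
    name="example.dataset.AOD",
    n_files=10,
    replicas=[Replica("SITE_A", 10), Replica("SITE_B", 7), Replica("SITE_C", 10)],
)
print(brokerable_sites(ds))
```

In this sketch SITE_B holds only 7 of the 10 files, so it is skipped and only SITE_A and SITE_C would be offered to a brokering step.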

Data types and their distribution at the Tier centres (according to the computing model) are shown in the table below. Data sizes are for release 12.0.6 without pileup; MCTruth contributes an additional 0.08 MByte/event.

| Format | Size/evt (MByte) | Description | Comment |
|---|---|---|---|
| RDO | 2 | Raw object data | Complete copy at CERN, another copy shared by the Tier1s. Dataset granularity per run and per stream. |
| ESD | 2.3 - 3.0 | Event summary data | Complete copy at CERN, replicated at BNL; ESD available at the Tier1 where it was produced; the access may be restricted. |
| AOD | 0.5 | Analysis object data | Replicated to all Tier1s; in addition replicated to Tier2 clouds. |
| DPD | | Derived physics data | One global DPD, several special DPDs for physics groups. |
| TAG | 0.1 | Meta data | For quick selection of interesting events. |

The size per event listed in the table is the present size; the target size is smaller in all cases.
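The per-event sizes can be turned into a rough estimate of the disk space a dataset needs. The sketch below uses only the numbers quoted in the table; the function name `dataset_size_gb` is hypothetical, 1 GByte is taken as 1000 MByte, and adding the MCTruth contribution on top of any single format is an assumption, since the text does not say which formats carry it.

```python
# Per-event sizes in MByte, copied from the table (release 12.0.6, no pileup).
# ESD is quoted as a range, so every entry is stored as (low, high).
SIZE_MB_PER_EVENT = {
    "RDO": (2.0, 2.0),
    "ESD": (2.3, 3.0),
    "AOD": (0.5, 0.5),
    "TAG": (0.1, 0.1),
}

# Additional MCTruth size in MByte/event (assumed to add to the chosen format).
MC_TRUTH_MB_PER_EVENT = 0.08


def dataset_size_gb(fmt, n_events, with_mc_truth=False):
    """Return (low, high) size estimates in GByte for n_events of one format."""
    low, high = SIZE_MB_PER_EVENT[fmt]
    extra = MC_TRUTH_MB_PER_EVENT if with_mc_truth else 0.0
    scale = n_events / 1000.0  # MByte -> GByte, assuming 1 GByte = 1000 MByte
    return ((low + extra) * scale, (high + extra) * scale)


for fmt in SIZE_MB_PER_EVENT:
    low, high = dataset_size_gb(fmt, 1_000_000)
    print(f"{fmt}: {low:.0f} - {high:.0f} GByte per million events")
```

With these numbers, one million AOD events come to roughly 500 GByte, while the same events in ESD format need between about 2300 and 3000 GByte.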

Various formats are presently in use at the analysis level: AOD, DPD, HPTV (High Pt View ntuple), AANT (Athena Aware NTuple), SAN (Structured Analysis NTuple), pAOD (persistent AOD), CBNT (Combined Ntuple), and group-defined ntuples. The "Analysis Model" working group aims to converge on fewer formats in the future.

The ESD/AOD classes are merged in release 13; identical interfaces can be used for ESD and AOD analysis. Datasets have the same granularity at the RDO and ESD level, but will be merged at the AOD level into "super-datasets".

Location of datasets

Older datasets, especially RDOs, are archived; see Data Replication (part of ComputingOperations).

Distributed data access

Several tools with partly overlapping functionality are available to browse and select data; see the summary table.

Distributed production and analysis

Ganga may become the standard application for submitting and following up jobs. It takes care of the differences between the Grid flavours, partitions a submitted request into several jobs and merges the resulting files.
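The split-and-merge pattern described here can be illustrated in a few lines. This is a generic sketch and does not use the Ganga API: the functions `split_into_subjobs`, `run_subjob` and `merge_outputs`, as well as the file names, are hypothetical, and the "job" is a local placeholder rather than a Grid submission.

```python
def split_into_subjobs(input_files, files_per_job):
    """Partition the input file list into consecutive chunks, one per subjob."""
    if files_per_job < 1:
        raise ValueError("files_per_job must be at least 1")
    return [
        input_files[i:i + files_per_job]
        for i in range(0, len(input_files), files_per_job)
    ]


def run_subjob(files):
    """Placeholder for a Grid job: returns one output record per input file."""
    return [f"processed:{name}" for name in files]


def merge_outputs(outputs):
    """Concatenate the per-subjob outputs into one result list, keeping order."""
    merged = []
    for out in outputs:
        merged.extend(out)
    return merged


# Made-up input file names, for illustration only.
files = [f"file_{i}.root" for i in range(5)]
subjobs = split_into_subjobs(files, files_per_job=2)
result = merge_outputs(run_subjob(chunk) for chunk in subjobs)
print(len(subjobs), "subjobs,", len(result), "merged records")
```

Five input files with two files per job give three subjobs, the last one holding the single remaining file; the merged result contains one record per input file in the original order.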

Panda/pathena is a lightweight framework for job submission with similar functionality and good user support. It provides access to a large set of data on the BNL server. Ganga will, starting from release 4.3, interface to Panda.

The ProdSys database allows data to be browsed; ProdSys must be used for production jobs.

Use of Castor at CERN

Castor is the storage management system used at CERN. The most important Castor commands are listed on a separate page.

Information on CSC data

See also

Tutorials

Acknowledgements

We appreciate input for this page and its sub-pages from experts and guinea pigs: Solveig Albrand, Stephane Jezequel, Dietrich Liko, Monika Wielers, [ TBD - add more names ]