Data access and analysis
Page under construction
This page collects the most important information on accessing and analysing distributed data. It gives an overview of the applications, which are described in more detail on separate sub-pages. The documentation is restricted to the most commonly used tools; all of them are still evolving, in particular towards better interoperability. For each application the most important references are listed, together with a short summary of commands and options; each command links to its full description. Only part of the functionality is covered here, so please consult the detailed documentation for the rest. These pages are meant as a quick reference; they complement the tutorials and workbooks but do not replace them.
Please send feedback on any additional information that would make these pages more useful.
You need a Grid certificate to submit jobs and download data, see Starting on the Grid.
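Once the certificate is installed, a Grid session typically starts by creating a short-lived VOMS proxy from it. A minimal sketch is shown below; the exact commands and the VO name depend on your site's setup, and this assumes the VOMS client tools and your certificate (usually under `~/.globus`) are already configured:

```shell
# Create a proxy credential for the ATLAS VO from your Grid certificate
voms-proxy-init -voms atlas

# Inspect the proxy (issuer, VO attributes, remaining lifetime)
voms-proxy-info
```

Jobs and data transfers will fail with authentication errors once the proxy expires, so it is worth checking the remaining lifetime before submitting long sets of jobs.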
Introduction
Tools are available (see the summary table):
- to find the datasets, their file content and the conditions (jobOptions) under which they were produced;
- to submit jobs and monitor the job status and output files;
- to associate datasets (provenance), which are related through a sequence of production steps, like generation, simulation, reconstruction.
The data are processed and stored under three different Grid flavours (EGEE/LCG, OSG, NorduGrid), each with its own job-submission software and data catalogues. The data management layer Don Quixote 2 (DQ2) hides these differences from the user.
User productions are performed with the ProdSys database, Panda/pathena and Ganga; data analysis is performed with Panda/pathena (on OSG) and Ganga (on EGEE/LCG).
The Tag navigator uses the Tag database to select events.
Local batch systems are LSF at CERN and Condor at BNL.
Summary of applications and their services
| Application | Sub-page | Services |
|---|---|---|
| DQ2 | DQ2 | Datasets and files: browse, download |
| DQ2 - prod | DDM and DQ2: additional info for productions | |
| AMI | AMI | Datasets: browse, search, provenance, related files (jobOptions, selection cuts) |
| Panda / pathena | Panda | Datasets and files: browse, search, status of replications. Jobs: submit at BNL, check status |
| Ganga | Ganga | Jobs: submit at LCG, check status. Future: multiple backends (LCG, Panda, NorduGrid, LSF, etc.) |
| ProdSys | | Datasets: browse, production status per Tier |
Dataset types and their distribution
Data are organised in datasets, which in turn contain data files. Each dataset is described by metadata. A dataset can be available at one or more locations, and can be complete or incomplete at a given location. Be aware that at present only complete datasets can be used in dataset-based brokering.
Data types and their distribution over the Tier centres (according to the computing model) are shown in the table below. Data sizes are for release 12.0.6 without pile-up; MCTruth contributes an additional 0.08 MByte/event.
| Format | Size/evt (MByte) | Description | Comment |
|---|---|---|---|
| RDO | | Raw object data | Complete copy at CERN, another copy shared by the Tier1s. Dataset granularity per run and per stream. |
| ESD | | Event summary data | Complete copy at CERN, replicated at BNL; ESD available at the Tier1 where it was produced; access may be restricted |
| AOD | | Analysis object data | Replicated to all Tier1s; in addition replicated to Tier2 clouds |
| DPD | | Derived physics data | One global DPD, several special DPDs for physics groups |
| TAG | | Metadata | For quick selection of interesting events |
The size/event listed in the table is the present size; the target size is in all cases smaller.
Various formats are presently in use at the analysis level: AOD, DPD, HPTV (High Pt View ntuple), AANT (Athena Aware NTuple), SAN (Structured Analysis NTuple), pAOD (persistent AOD), CBNT (Combined Ntuple), and group-defined ntuples. The "Analysis Model" working group aims to converge on fewer formats in the future.
The ESD/AOD classes are merged in release 13; identical interfaces can be used for ESD and AOD analysis. Datasets have the same granularity at the RDO and ESD level, but will be merged at the AOD level into "super-datasets".
Location of datasets
Older datasets, especially RDOs, are archived, see Data Replication (part of ComputingOperations).
Distributed data access
Several tools with partly overlapping functionality (see the summary table) are available to browse and select data:
- DQ2 is the tool for downloading datasets or files, either from a local Tier store or Grid-wide. Datasets can be listed with a one-line command that accepts wildcards.
- AMI was chosen for browsing the metadata. Dataset searches are facilitated through drop-down menus with predefined choices. AMI also displays the "provenance" of a dataset, i.e. the production chain from MC dataset to AOD dataset(s).
- The Panda system, located at BNL, lists datasets and their file content, and the location of replicas at the different Tiers. It provides the tools for submitting jobs at BNL and following up the job status.
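As an illustration, dataset listing and download with the DQ2 end-user tools might look as follows. Command names and options vary between DQ2 client versions, and the dataset name below is a made-up placeholder, not a real dataset:

```shell
# List datasets matching a wildcard pattern (placeholder pattern)
dq2-ls 'trig1_misal1_mc12.*.AOD.*'

# List the files contained in one dataset
dq2-ls -f trig1_misal1_mc12.005001.pythia_minbias.recon.AOD.v12000601

# Download the dataset (from a local replica if one exists, otherwise Grid-wide)
# into the current directory
dq2-get trig1_misal1_mc12.005001.pythia_minbias.recon.AOD.v12000601
```

Check the DQ2 sub-page for the options supported by the client version installed at your site.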
Distributed production and analysis
Ganga may become the standard application for job submission and follow-up. It hides the differences between the Grid flavours, partitions a submitted request into several jobs, and merges the resulting output files.
Panda/pathena is a lightweight framework for job submission with similar functionality and good user support. It provides access to a large set of data on the BNL server. Starting from release 4.3, Ganga will interface to Panda.
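A pathena submission typically mirrors the corresponding local Athena command, with the input and output datasets named on the command line. A hedged sketch, in which the jobOptions file and both dataset names are placeholders:

```shell
# Submit an Athena analysis job to Panda; --inDS names the input dataset,
# --outDS the user output dataset (both are placeholder names here)
pathena AnalysisSkeleton_topOptions.py \
    --inDS trig1_misal1_mc12.005001.pythia_minbias.recon.AOD.v12000601 \
    --outDS user.jane.doe.test.AOD.v1

# Follow up submitted jobs with the Panda bookkeeping tool
pbook
```

The output dataset can afterwards be retrieved with the DQ2 tools like any other dataset.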
The ProdSys database allows users to browse data; ProdSys must be used for production jobs.
Use of Castor at CERN
Castor is the storage management system used at CERN. The most important Castor commands are listed on a separate page.
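As a quick reminder, some frequently used Castor commands are sketched below; the paths are placeholders, and the dedicated page should be consulted for the full list:

```shell
# List a directory in the Castor name space (placeholder path)
nsls -l /castor/cern.ch/user/j/jdoe

# Copy a file from Castor to local disk, and back
rfcp /castor/cern.ch/user/j/jdoe/myfile.root .
rfcp myfile.root /castor/cern.ch/user/j/jdoe/

# Pre-stage a file to the disk pool before reading it from a batch job
stager_get -M /castor/cern.ch/user/j/jdoe/myfile.root
```

Pre-staging avoids batch jobs blocking on tape recalls when the file is not already on disk.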
Information on CSC data
- Useful information on CSC data is collected in a separate page.
See also
- Athena DBAccess -- access to geometry and conditions data in Athena jobs. Use of SQLite files.
- Site downtime -- over next 7 days [may move this to page with operations status TBD]
- Grid glossary -- the Grid jargon
- Grid operations center -- showing downtimes
- Site at Lyon -- explaining event data access
- Usage of the Distributed Analysis tools -- in the ATLAS SUSY group, ATL-SOFT-INT-2007-005 (pdf, 8 pages, restricted access)
- Problems: PosixIO, disk pool manager (DPM) for Tier2 clouds
- Transient-persistent separation
Tutorials
- Distributed Analysis -- various Tutorials, covering different aspects of DDM and analysis tools
- Using Tags -- accessing tags and use of the Tag Navigator Tool (TNT)
- Pathena tutorial -- see BNL Jamboree March 2008
Acknowledgements
We appreciate input for this page and its sub-pages from experts and guinea pigs: Solveig Albrand, Stephane Jezequel, Dietrich Liko, Monika Wielers, [ TBD - add more names ]