DataPreservation < Inspire


-- ZavenAkopov - 08-Dec-2009

This topic is a short compilation of general thoughts on how preservation of secondary data could be incorporated into INSPIRE and what structures should be used.

The first part is a general (public) introduction to the issue and is included only for the sake of completeness.
The second part covers the implementation structure and the definitions for the additional paper-related information (secondary data), followed by a specific example (to be enriched by a theory paper example).
Still to be added are options for storing experiments' internal notes, knowledge bases and wikis, as mentioned in Travis' talk at DPHEP.

A typical HEP experimental paper contains results from an analysis based on data collected over long periods of time, and presents these results concisely, according to the analysis capabilities of the collaboration at the moment of publication. However, continuous improvement in theory, experiment, simulation and data analysis, as well as the introduction of new ideas, interpretations or unexpected discoveries, may reveal the need to re-analyze the data. The data therefore need to remain available for long periods of time, even after the corresponding experiment ceases to exist. On a large scale this is a challenge to be solved by preserving the primary data (so-called raw data) through a set of measures ranging from the strategy for storing the data, its classification and indexing, to the persistency and long-term preservation of the scientific data collected over the lifetime of an experiment. The tools and programming environment needed to re-use and re-process the stored data must of course be preserved as well. This is a complex issue that requires combined efforts from major HEP institutions and IT centers, which have formed a *Data Preservation* task force to address these issues in a systematic way and to clarify the objectives and means of data persistency in HEP.

In parallel, the problem can be approached from another side, in line with the concerns above: the level of so-called "secondary data". This includes partly analyzed data in the form of, e.g., additional plots, supporting arguments, information on corrections and cuts, as well as data tables and Root files. Such data can be used either to reconstruct the presented results in different kinematic ranges (different binning), or to apply different conditions ("cuts") according to current developments in theory or experiment, and possibly to obtain alternative results. This can be implemented within the framework of tools that already exist today or are being developed.

Currently, articles submitted to e.g. arXiv or SPIRES mostly contain just the text of the article, usually in PDF format. Some also include links to the corresponding plots stored in online databases such as Durham, but beyond that there is usually little information on the supporting or additional data used to produce the results. Implementing additional features that allow storage of secondary data in INSPIRE, the successor of the SPIRES database currently under development, seems to be a feasible and practical way to tackle the above-mentioned issue and to provide another important building block in a homogeneous HEP analysis environment for physicists. A typical entry in the database would therefore contain not only the actual paper, its citation and relevant information, but also additional fields such as Tables, Scripts and Plots that support the data and can be used to re-analyze it or to compare it with theoretical models. These could either be stored as a whole or incorporated as persistent links to the relevant sources; in the latter case a very important condition is obviously the sustainability of these links. Such expanded entries would contain:

Tables that correspond to the plots presented in the paper and beyond; supporting tables. These should contain the numerical data as used in the analysis, as well as the detailed binning and kinematic ranges.
Additional plots and distributions, e.g. phase space and kinematic dependencies, where applicable.
Files containing code or Root scripts (provided the simplicity recommendations from the Root team are fulfilled) or Mathematica notebooks.

Obviously, along with the storage of such additional information, a suitable indexing scheme should be implemented to enable access to it via the search function, together with proper metadata assignment. This means that keywords can be assigned not only to the paper itself, as is currently implemented, but also to tables and plots, which could then be returned as full entries. The proposed metadata structure is to be defined for each record and consists of a main part that is the same for all records, and a more specific part that varies with the type of entry. The main part should contain:

Specific type of entry/Experimental or theory paper;
Type of reaction;
Experiment/Institute/Facility, whichever applies;
Type of accelerator, if applicable;

In addition, entries of type "Plots" and "Tables" should also contain information such as:

Axis labels, definition of axis variables or tabled entries;
Kinematic range of each plot/table;
Name/type of the distribution function, if applicable;
Whether the data represented is experimental or theoretical, or a combination of both;
Date / time or data taking period of the obtained results;

Entries of type "Scripts", in turn, should contain:

Version of software (Root, etc.) / production date;
Production environment / Platform / Libraries;
List of conditions (cuts), kinematic variables involved, and whether the cuts are hardcoded or not;
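
The two-part metadata structure described above (a main part shared by all records plus a part specific to the type of entry) could be modelled roughly as follows. This is only a sketch: the class names, field names and the `validate` helper are assumptions made for illustration and do not correspond to an existing INSPIRE schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class CommonMetadata:
    """Main part: the same fields for every record."""
    entry_type: str                      # "Plots", "Tables" or "Scripts"
    paper_kind: str                      # "experimental" or "theory"
    reaction: str
    experiment: str                      # experiment / institute / facility
    accelerator: Optional[str] = None    # only if applicable


@dataclass
class PlotOrTableMetadata:
    """Specific part for entries of type "Plots" and "Tables"."""
    variables: List[str]                 # axis labels or tabled quantities
    kinematic_range: str
    origin: str                          # "Exp", "Th" or a combination
    period: str                          # date or data-taking period
    function: Optional[str] = None       # distribution function, if applicable


@dataclass
class ScriptMetadata:
    """Specific part for entries of type "Scripts"."""
    software_version: str
    environment: str                     # platform / libraries
    cuts: List[str]
    cuts_hardcoded: bool


@dataclass
class SecondaryDataEntry:
    common: CommonMetadata
    specific: Union[PlotOrTableMetadata, ScriptMetadata]


# Which specific part each (assumed) entry type is expected to carry.
EXPECTED_SPECIFIC = {
    "Plots": PlotOrTableMetadata,
    "Tables": PlotOrTableMetadata,
    "Scripts": ScriptMetadata,
}


def validate(entry: SecondaryDataEntry) -> List[str]:
    """Return a list of problems; an empty list means the parts agree."""
    problems = []
    expected = EXPECTED_SPECIFIC.get(entry.common.entry_type)
    if expected is None:
        problems.append(f"unknown entry type: {entry.common.entry_type!r}")
    elif not isinstance(entry.specific, expected):
        problems.append(
            f"{entry.common.entry_type} entry needs {expected.__name__}"
        )
    return problems
```

Keeping the main part in its own class means the search function can treat all records uniformly, while the check in `validate` catches, for example, a "Scripts" entry that was given plot metadata by mistake.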

It should be mentioned here that it will eventually be essential to encourage the submission of user-friendly scripts, i.e. scripts that contain self-explanatory documentation or headers clearly outlining the possible variations and the variables that can be modified.
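
As an illustration of what such a user-friendly script might look like, here is a minimal sketch in Python. The header layout, the names `CUTS`, `passes` and `select`, and the z window are assumptions invented for this example; only the nu window is taken from the bins of Table 1 in Appendix A.

```python
"""Toy selection script with a self-explanatory header (hypothetical example).

Purpose     : keep events that pass a set of kinematic cuts.
Software    : Python 3, standard library only.
Inputs      : a list of events, each a dict of kinematic variables.
Adjustable  : CUTS below; each cut is (low, high), low inclusive, high exclusive.
Hardcoded   : nothing; pass a different dict to select() to vary the cuts.
"""

# Variable name -> (low, high). The nu window spans the bins of Table 1 in
# Appendix A; the z window is an invented value for illustration only.
CUTS = {
    "nu": (4.0, 23.5),  # GeV
    "z": (0.2, 1.0),
}


def passes(event, cuts=CUTS):
    """True if the event lies inside every cut window."""
    return all(low <= event[name] < high for name, (low, high) in cuts.items())


def select(events, cuts=CUTS):
    """Return the events that pass all cuts, preserving their order."""
    return [event for event in events if passes(event, cuts)]
```

Because the cuts are passed in as data instead of being buried in the selection logic, a later user can call, say, `select(events, {"nu": (4.0, 23.5)})` to drop the z condition without editing the script body.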

To demonstrate how the entries would then be stored, let us take an example article (see Appendix A) and form an entry. The paper is an experimental article reporting on hadronization in a nuclear environment. It has multiple tables as additional / secondary data, which can be stored in INSPIRE along with the text of the article. The record would then have the following information:

General:

Type of entry	Reaction	Experiment/Institute	Accelerator
Tables	Inclusive hadroproduction E+ A --> E+ H X	HERMES/DESY	HERA, E(P=1)=27.6GeV

Specific entries for each table - e.g. Table #1:

Variables	Kinematic range	Function	Exp/Th?	Date
x-axis: nu (GeV)	x-axis(nu): 4.0-23.5	pi+ multiplicity ratio (Helium/Deuterium) as a function of nu	Exp	1997-2005
y-axis: MULT(HE)/MULT(DEU)	-	-	-	-
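
To illustrate how a table stored in this way could be returned by the search function as an entry in its own right, the sketch below builds a small keyword index over the example record. The dictionary layout and the functions `tokenize`, `build_index` and `search` are assumptions made for this illustration, not an existing INSPIRE interface.

```python
import re
from collections import defaultdict

# The example record, reduced to the fields used for indexing.
record = {
    "paper": "Hadronization in semi-inclusive deep-inelastic scattering on nuclei",
    "general": {
        "entry_type": "Tables",
        "reaction": "Inclusive hadroproduction E+ A --> E+ H X",
        "experiment": "HERMES/DESY",
    },
    "tables": [
        {
            "id": "Table 1",
            "variables": ["x-axis: nu", "y-axis: MULT(HE)/MULT(DEU)"],
            "function": "pi+ multiplicity ratio (Helium/Deuterium) "
                        "as a function of nu",
            "origin": "Exp",
        },
    ],
}


def tokenize(text):
    """Lower-case keywords; '+' is kept so that 'pi+' stays one token."""
    return set(re.findall(r"[a-z0-9+]+", text.lower()))


def build_index(records):
    """Map each keyword to the (paper, table id) pairs it occurs in."""
    index = defaultdict(list)
    for rec in records:
        shared = [rec["general"]["reaction"], rec["general"]["experiment"]]
        for table in rec["tables"]:
            text = " ".join([table["function"], *table["variables"], *shared])
            for token in tokenize(text):
                index[token].append((rec["paper"], table["id"]))
    return index


def search(index, query):
    """Return the tables that match every keyword of the query."""
    hits = None
    for token in tokenize(query):
        found = set(index.get(token, []))
        hits = found if hits is None else hits & found
    return sorted(hits or [])
```

With this index, a query such as `search(build_index([record]), "helium multiplicity")` returns Table 1 together with the title of its paper, so the table is found directly and not only through the article it belongs to.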

The basic principle is that a record as a whole is always available to the scientist - a paper, with all of its additional information and data. This is what makes such an implementation preferable to the typical situation, in which a user searches for a paper on one site, then has to look for the tables on another site, and then perhaps has to ask the authors for additional information or numbers via e-mail, or search the experiment's or institute's website. (There are obviously many more advantages and motivations for INSPIRE to host additional data; these should be mentioned and added here.)

APPENDIX A.

Hadronization in semi-inclusive deep-inelastic scattering on nuclei

Authors: The HERMES Collaboration (A. Airapetian et al.)
(Submitted on 24 Apr 2007)
Abstract: A series of semi-inclusive deep-inelastic scattering measurements
on deuterium, helium, neon, krypton, and xenon targets has been performed
in order to study hadronization. The data were collected with the HERMES
detector at the DESY laboratory using a 27.6 GeV positron or electron beam.
Hadron multiplicities on nucleus A relative to those on the deuteron,
R_A^h, are presented for various hadrons (\pi^+, \pi^-, \pi^0, K^+, K^-, p,
and \bar{p}) as a function of the virtual-photon energy \nu, the fraction z
of this energy transferred to the hadron, the photon virtuality Q^2, and
the hadron transverse momentum squared p_t^2. The data reveal a systematic
decrease of R_A^h with the mass number A for each hadron type h.
Furthermore, R_A^h increases (decreases) with increasing values of \nu (z),
increases slightly with increasing Q^2, and is almost independent of p_t^2,
except at large values of p_t^2. For pions two-dimensional distributions
also are presented. These indicate that the dependences of R_A^{\pi} on \nu
and z can largely be described as a dependence on a single variable L_c,
which is a combination of \nu and z. The dependence on L_c suggests in
which kinematic conditions partonic and hadronic mechanisms may be
dominant. The behaviour of R_A^{\pi} at large p_t^2 constitutes tentative
evidence for a partonic energy-loss mechanism. The A-dependence of R_A^h is
investigated as a function of \nu, z, and of L_c. It approximately follows
an A^{\alpha} form with \alpha \approx 0.5 - 0.6.

Full text at: http://arxiv.org/pdf/0704.3270v1

For demonstration purposes, one of the tables (corresponding to a plot included in the paper) is shown here.

Table 1: PI+ multiplicity ratio (Helium/Deuterium) as a function of NU

x-axis header: NU IN GEV

y-axis header: MULT(HE)/MULT(DEUT)

x	y	dy	dy
4.0 6.0	0.968	+0.021 -0.021	+0.029 -0.029
6.0 8.0	0.97	+0.012 -0.012	+0.031 -0.031
8.0 10.0	0.951	+0.0090 -0.0090	+0.03 -0.03
10.0 12.0	0.958	+0.0080 -0.0080	+0.031 -0.031
12.0 14.0	0.962	+0.0080 -0.0080	+0.031 -0.031
14.0 16.0	0.969	+0.0080 -0.0080	+0.03 -0.03
16.0 18.0	0.968	+0.0090 -0.0090	+0.029 -0.029
18.0 20.0	0.964	+0.011 -0.011	+0.028 -0.028
20.0 23.5	0.973	+0.012 -0.012	+0.028 -0.028
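
A table stored in this form can be turned back into structured rows, from which metadata such as the kinematic range can be derived instead of being typed in by hand. The sketch below assumes that the whitespace-separated columns are the lower and upper bin edges, the value, and two pairs of upper/lower uncertainties; the names `parse_rows`, `kinematic_range` and `bins_are_contiguous` are made up for this illustration.

```python
TABLE_TEXT = """
4.0 6.0 0.968 +0.021 -0.021 +0.029 -0.029
6.0 8.0 0.97 +0.012 -0.012 +0.031 -0.031
8.0 10.0 0.951 +0.0090 -0.0090 +0.03 -0.03
10.0 12.0 0.958 +0.0080 -0.0080 +0.031 -0.031
12.0 14.0 0.962 +0.0080 -0.0080 +0.031 -0.031
14.0 16.0 0.969 +0.0080 -0.0080 +0.03 -0.03
16.0 18.0 0.968 +0.0090 -0.0090 +0.029 -0.029
18.0 20.0 0.964 +0.011 -0.011 +0.028 -0.028
20.0 23.5 0.973 +0.012 -0.012 +0.028 -0.028
"""


def parse_rows(text):
    """Parse each non-empty line into a dict with bin edges, value, errors."""
    rows = []
    for line in text.strip().splitlines():
        low, high, y, up1, down1, up2, down2 = (float(t) for t in line.split())
        rows.append({
            "low": low,
            "high": high,
            "y": y,
            "err1": (up1, down1),   # first uncertainty column (upper, lower)
            "err2": (up2, down2),   # second uncertainty column (upper, lower)
        })
    return rows


def kinematic_range(rows):
    """Lower edge of the first bin and upper edge of the last bin."""
    return rows[0]["low"], rows[-1]["high"]


def bins_are_contiguous(rows):
    """True if every bin starts where the previous one ends."""
    return all(a["high"] == b["low"] for a, b in zip(rows, rows[1:]))
```

For the rows above, `kinematic_range(parse_rows(TABLE_TEXT))` gives 4.0 to 23.5 GeV, and the contiguity check is a cheap way to catch a missing or duplicated bin at submission time.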

Topic revision: r1 - 2009-12-08 - ZavenAkopovExCern
