--
ZavenAkopov - 08-Dec-2009
This topic is a short compilation of general thoughts on how the preservation of secondary data could be incorporated into INSPIRE and what structures should be used.
- The first part is a general /public/ introduction to the issue and is included only for the sake of completeness.
- The second part covers the implementation structure and definitions for the additional paper-related information (secondary data), followed by a specific example (to be enriched by a theory paper example).
- Still to be added are options for storing experiments' internal notes, knowledge bases and wikis, as mentioned in Travis' talk at DPHEP.
A typical HEP experimental paper contains results from an analysis based on
data collected over long periods of time, and presents these results
concisely, according to the analysis capabilities of the collaboration at
the moment of publication. However, continuous improvement in theory,
experiment, simulation and data analysis, as well as the introduction of new
ideas, interpretations or unexpected discoveries, may reveal the need to
re-analyze the data. The data therefore need to remain available for long
periods of time, even after the corresponding experiment has ceased to
exist. On a large scale, this is the challenge of preserving primary data
(so-called raw data) through a set of measures ranging from a storage
strategy, classification and indexing to the persistency and long-term
preservation of the scientific data collected over the lifetime of an
experiment. The tools and programming environment needed to re-use and
re-process the stored data must of course be preserved as well. This is a
complex issue that requires combined efforts from major HEP institutions and
IT centers, which have formed a *Data Preservation* task force to address
these issues systematically and to clarify the objectives and means of data
persistency in HEP.
At the same time, the problem can be approached from another side, in line
with the concerns above: the level of so-called "secondary data". This
includes partly analyzed data in the form of, e.g., additional plots,
supporting arguments, information on corrections and cuts, as well as data
tables and Root files. Such data can be used either to reconstruct the
presented results in different kinematic ranges (different binning), or to
apply different conditions ("cuts") according to current developments in
theory or experiment, and possibly obtain alternative results. This can be
implemented within the framework of tools that already exist today or are
being developed.
Currently, articles submitted to e.g. arXiv or SPIRES mostly consist of
files with the actual text of the article, usually in PDF format. Some also
include links to the corresponding plots stored in online electronic
databases such as Durham, but beyond that there is usually little
information on the supporting or additional data used to produce the
article. Implementing additional features that allow the storage of
secondary data in INSPIRE, the successor of the SPIRES database currently
under development, seems a feasible and practical way to tackle the
above-mentioned issue and to provide another important building block of a
homogeneous HEP analysis environment for physicists. A typical entry in the
database would therefore contain not only the actual paper, its citation and
relevant information, but also additional fields such as Tables, Scripts and
Plots that support the data and can be used to re-analyze it or compare it
with theoretical models. These could either be stored as a whole or
incorporated as persistent links to the relevant sources; in the latter
case, the sustainability of these links is obviously a very important
condition.
Such expanded entries would contain:
- Tables that correspond to the plots presented in the paper and beyond, plus supporting tables. These should contain the numerical data as used in the analysis, as well as the detailed binning and kinematic ranges.
- Additional plots and distributions, e.g. phase space and kinematic dependencies, where applicable.
- Files containing code, Root scripts (provided the simplicity recommendations from the Root team are fulfilled) or Mathematica notebooks.
Along with the storage of such additional information, a suitable indexing
scheme should be implemented so that it can be reached via the search
function, together with proper metadata assignment. This means that keywords
can be assigned not only to the paper itself, as is currently implemented,
but also to tables and plots, which could then be returned as full entries.
The proposed metadata structure is to be defined for each record and
consists of a main part that is the same for all records, and a more
specific part that varies with the type of entry. The main part should
contain:
- Specific type of entry/Experimental or theory paper;
- Type of reaction;
- Experiment/Institute/Facility, whichever applies;
- Type of accelerator, if applicable;
In addition, entries of type "Plots" and "Tables" should also contain
information such as:
- Axis labels, definition of axis variables or tabled entries;
- Kinematic range of each plot/table;
- Name/type of the distribution function, if applicable;
- Whether the data represented is experimental or theoretical, or a combination of both;
- Date / time or data taking period of the obtained results;
Entries of type "Scripts", in turn, should contain:
- Version of software (Root, etc.) / production date;
- Production environment / Platform / Libraries;
- List of conditions (cuts), kinematic variables involved, and whether the cuts are hardcoded or not;
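The metadata structure described above (a main part shared by all records plus a type-specific part) could be sketched roughly as follows. This is only an illustration, not an INSPIRE schema: all class and field names are hypothetical, and the example values are taken from the paper in Appendix A.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class CommonMetadata:
    """Main part, identical in structure for all records (hypothetical names)."""
    entry_type: str                     # e.g. "experimental paper" or "theory paper"
    reaction: str                       # type of reaction
    experiment: Optional[str] = None    # experiment / institute / facility
    accelerator: Optional[str] = None   # type of accelerator, if applicable


@dataclass
class PlotOrTableMetadata:
    """Specific part for entries of type "Plots" or "Tables"."""
    kind: str                               # "plot" or "table"
    axis_labels: List[str]                  # axis labels or tabled quantities
    kinematic_range: Tuple[float, float]    # range covered by the plot/table
    data_origin: str                        # "experimental", "theoretical" or "both"
    distribution: Optional[str] = None      # name/type of distribution function
    data_taking_period: Optional[str] = None


@dataclass
class ScriptMetadata:
    """Specific part for entries of type "Scripts"."""
    software_version: str                   # e.g. version of Root
    platform: str                           # production environment / platform
    libraries: List[str] = field(default_factory=list)
    cuts: List[str] = field(default_factory=list)
    cuts_hardcoded: bool = False            # whether the cuts are hardcoded


@dataclass
class Entry:
    """One record: the common part plus one type-specific part."""
    common: CommonMetadata
    specific: Union[PlotOrTableMetadata, ScriptMetadata]
    keywords: List[str] = field(default_factory=list)


# Example entry for one table of the paper in Appendix A.
table_entry = Entry(
    common=CommonMetadata(
        entry_type="experimental paper",
        reaction="semi-inclusive deep-inelastic scattering",
        experiment="HERMES",
    ),
    specific=PlotOrTableMetadata(
        kind="table",
        axis_labels=["NU IN GEV", "MULT(HE)/MULT(DEUT)"],
        kinematic_range=(4.0, 23.5),
        data_origin="experimental",
    ),
    keywords=["hadronization", "multiplicity ratio"],
)
```

Keeping the common part in its own structure means the same search fields apply to papers, plots, tables and scripts alike, while each entry type only carries the extra fields that make sense for it.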
It should be mentioned here that the submission of user-friendly scripts
will eventually have to be encouraged: scripts with self-explanatory
documentation or headers that clearly outline which variations are possible
and which variables can be modified.
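As an illustration of what such a user-friendly script might look like, here is a minimal sketch in Python. The file name, the event layout and the cut on z are hypothetical; only the nu range is borrowed from the table in Appendix A. The point is the header that states what can be modified, and the cuts collected in one place rather than hardcoded in the logic.

```python
"""
select_events.py (hypothetical example)

Purpose : apply kinematic cuts to a list of events and return the survivors.
Inputs  : events as dictionaries with keys "nu" (in GeV) and "z".
Modify  : only the values in CUTS below; the selection logic stays untouched.
"""

# Adjustable cuts, collected in one place instead of being hardcoded below.
CUTS = {
    "nu_min": 4.0,   # lower bound on nu in GeV
    "nu_max": 23.5,  # upper bound on nu in GeV
    "z_min": 0.2,    # lower bound on z (illustrative value, not from the paper)
}


def passes(event, cuts=CUTS):
    """Return True if the event satisfies all cuts."""
    return (cuts["nu_min"] <= event["nu"] < cuts["nu_max"]
            and event["z"] >= cuts["z_min"])


def select(events, cuts=CUTS):
    """Return the list of events that pass the cuts."""
    return [event for event in events if passes(event, cuts)]


if __name__ == "__main__":
    toy_events = [
        {"nu": 5.0, "z": 0.5},
        {"nu": 30.0, "z": 0.5},
        {"nu": 10.0, "z": 0.1},
    ]
    print(len(select(toy_events)), "of", len(toy_events), "events selected")
```

A later re-analysis can then pass a modified copy of `CUTS` to `select` without editing the selection code itself.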
To demonstrate how the entries would be stored, let us take an example
article (see Appendix A) and form an entry. The paper is an experimental
article reporting on hadronization in the nuclear environment. It has
multiple tables as additional / secondary data, which can be stored in
INSPIRE along with the text of the article. The record would then have the
following information:
General:
Specific entries for each table - e.g. Table #1:
The basic principle is that a record as a whole, a paper with all of its
additional information and data, is always available to the scientist. This
is what makes such an implementation favorable compared to the typical
situation, in which a user searches for a paper at one site, then has to
look for the tables on another site, and may then have to ask the authors
for additional information or numbers via e-mail, or search the
experiment's or institute's website.
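How keyword search over tables could still lead back to the record as a whole can be sketched with a small in-memory catalogue. This is a toy model under stated assumptions, not INSPIRE's search implementation: the class, its methods and the entry identifiers are hypothetical.

```python
from collections import defaultdict


class Catalogue:
    """Toy keyword index in which papers and their tables are both searchable."""

    def __init__(self):
        self._entries = {}               # entry id -> entry dictionary
        self._index = defaultdict(set)   # lower-cased keyword -> set of entry ids

    def add(self, entry_id, kind, title, keywords, parent=None):
        """Register an entry; 'parent' links a table or plot to its paper."""
        self._entries[entry_id] = {
            "id": entry_id, "kind": kind, "title": title, "parent": parent,
        }
        for keyword in keywords:
            self._index[keyword.lower()].add(entry_id)

    def search(self, keyword):
        """Return all entries (papers, tables, plots) tagged with the keyword."""
        ids = sorted(self._index.get(keyword.lower(), set()))
        return [self._entries[i] for i in ids]

    def full_record(self, entry_id):
        """Return the paper an entry belongs to, with everything attached to it."""
        entry = self._entries[entry_id]
        paper_id = entry["parent"] or entry_id
        attached = [e for e in self._entries.values() if e["parent"] == paper_id]
        return {"paper": self._entries[paper_id], "attached": attached}


catalogue = Catalogue()
catalogue.add(
    "paper-1", "paper",
    "Hadronization in semi-inclusive deep-inelastic scattering on nuclei",
    ["hadronization", "HERMES"],
)
catalogue.add(
    "paper-1/table-1", "table",
    "PI+ multiplicity ratio (Helium/Deuterium) as a function of NU",
    ["multiplicity ratio", "helium"],
    parent="paper-1",
)

hits = catalogue.search("helium")                 # finds the table directly
record = catalogue.full_record(hits[0]["id"])     # and from it the whole record
```

A search that matches only the table still gives access to the paper and everything else stored with it, without a detour to other sites.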
(There are obviously many more advantages and/or motivations for INSPIRE to host additional data; these should be mentioned and added.)
APPENDIX A.
Hadronization in semi-inclusive deep-inelastic scattering on nuclei
Authors: The HERMES Collaboration (A. Airapetian et al.)
(Submitted on 24 Apr 2007)
Abstract: A series of semi-inclusive deep-inelastic scattering measurements
on deuterium, helium, neon, krypton, and xenon targets has been performed
in order to study hadronization. The data were collected with the HERMES
detector at the DESY laboratory using a 27.6 GeV positron or electron beam.
Hadron multiplicities on nucleus A relative to those on the deuteron,
R_A^h, are presented for various hadrons (\pi^+, \pi^-, \pi^0, K^+, K^-, p,
and \bar{p}) as a function of the virtual-photon energy \nu, the fraction z
of this energy transferred to the hadron, the photon virtuality Q^2, and
the hadron transverse momentum squared p_t^2. The data reveal a systematic
decrease of R_A^h with the mass number A for each hadron type h.
Furthermore, R_A^h increases (decreases) with increasing values of \nu (z),
increases slightly with increasing Q^2, and is almost independent of p_t^2,
except at large values of p_t^2. For pions two-dimensional distributions
also are presented. These indicate that the dependences of R_A^{\pi} on \nu
and z can largely be described as a dependence on a single variable L_c,
which is a combination of \nu and z. The dependence on L_c suggests in
which kinematic conditions partonic and hadronic mechanisms may be
dominant. The behaviour of R_A^{\pi} at large p_t^2 constitutes tentative
evidence for a partonic energy-loss mechanism. The A-dependence of R_A^h is
investigated as a function of \nu, z, and of L_c. It approximately follows
an A^{\alpha} form with \alpha \approx 0.5 - 0.6.
Full text at:
http://arxiv.org/pdf/0704.3270v1
For demonstration purposes, one of the tables (corresponding to a plot
included in the paper) is shown here.
Table 1: PI+ multiplicity ratio (Helium/Deuterium) as a function of NU
x-axis header: NU IN GEV
y-axis header: MULT(HE)/MULT(DEUT)
| x | y | dy | dy |
| 4.0 - 6.0 | 0.968 | +0.021 -0.021 | +0.029 -0.029 |
| 6.0 - 8.0 | 0.97 | +0.012 -0.012 | +0.031 -0.031 |
| 8.0 - 10.0 | 0.951 | +0.0090 -0.0090 | +0.03 -0.03 |
| 10.0 - 12.0 | 0.958 | +0.0080 -0.0080 | +0.031 -0.031 |
| 12.0 - 14.0 | 0.962 | +0.0080 -0.0080 | +0.031 -0.031 |
| 14.0 - 16.0 | 0.969 | +0.0080 -0.0080 | +0.03 -0.03 |
| 16.0 - 18.0 | 0.968 | +0.0090 -0.0090 | +0.029 -0.029 |
| 18.0 - 20.0 | 0.964 | +0.011 -0.011 | +0.028 -0.028 |
| 20.0 - 23.5 | 0.973 | +0.012 -0.012 | +0.028 -0.028 |
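To be searchable and re-usable, a table like this one would need to be stored in a machine-readable form. The following is a minimal sketch of one possible JSON serialization. The field names are hypothetical, and the two dy columns are kept as an unlabelled pair because the table itself does not say which uncertainty each one represents. The kinematic range is derived from the bin edges rather than typed in by hand.

```python
import json

# Rows of Table 1 as (x_low, x_high, y, dy_first, dy_second).
# All uncertainties in the table are symmetric, so one number per column suffices.
ROWS = [
    (4.0, 6.0, 0.968, 0.021, 0.029),
    (6.0, 8.0, 0.97, 0.012, 0.031),
    (8.0, 10.0, 0.951, 0.0090, 0.03),
    (10.0, 12.0, 0.958, 0.0080, 0.031),
    (12.0, 14.0, 0.962, 0.0080, 0.031),
    (14.0, 16.0, 0.969, 0.0080, 0.03),
    (16.0, 18.0, 0.968, 0.0090, 0.029),
    (18.0, 20.0, 0.964, 0.011, 0.028),
    (20.0, 23.5, 0.973, 0.012, 0.028),
]


def to_record(rows):
    """Build a dictionary holding the table data together with its metadata."""
    return {
        "title": "PI+ multiplicity ratio (Helium/Deuterium) as a function of NU",
        "x_header": "NU IN GEV",
        "y_header": "MULT(HE)/MULT(DEUT)",
        # Overall range covered by the bins, taken from the outermost edges.
        "kinematic_range": [min(r[0] for r in rows), max(r[1] for r in rows)],
        "points": [
            {"x_low": lo, "x_high": hi, "y": y, "dy": [d1, d2]}
            for lo, hi, y, d1, d2 in rows
        ],
    }


record = to_record(ROWS)
text = json.dumps(record, indent=2)   # form in which the table could be stored
restored = json.loads(text)           # reading it back yields the same content
```

Storing the bin edges explicitly, rather than only bin centres, keeps the detailed binning that the proposal asks for.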