arXiv Harvesting
This page describes how the metadata harvesting workflow from arXiv, including plot extraction, is configured on INSPIRE. The harvest is scheduled to run daily, before other metadata updates (such as SPIRES updates).
Setting up OAI Harvest
Metadata is harvested from arXiv using the OAI-PMH protocol, so the harvest must be set up through the Invenio OAI Harvest Admin Interface. When adding a new OAI source, set the following parameters:
- Base URL: http://export.arxiv.org/oai2
- Metadata prefix: arXiv
- Postprocess: harvest, convert, extract plots/references, attach full-text, filter, upload
- BibConvert configuration file: /opt/invenio/etc/bibconvert/config/oaiarXiv2inspire.xsl
- BibFilter program: /opt/invenio/bin/bibfilter_oaiarXiv2inspire.py
As the list above shows, two files are needed for the harvesting to function properly. First, an XSLT stylesheet is required to transform the OAI-style XML into proper Inspire MARCXML. Second, a Python script is required to filter the incoming record updates. Both files are installed when you install the Inspire sources.
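To make the Base URL and Metadata prefix settings concrete, the following is a minimal sketch of the kind of OAI-PMH ListRecords request that corresponds to them. It is not the Invenio harvester code: the function name list_records_url and its arguments are hypothetical, and only the verb and the metadataPrefix, set, from and until arguments come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode

# Values taken from the OAI source parameters listed above.
BASE_URL = "http://export.arxiv.org/oai2"
METADATA_PREFIX = "arXiv"


def list_records_url(set_spec, from_date=None, until_date=None):
    """Build an OAI-PMH ListRecords request URL for one set.

    Hypothetical helper for illustration only; dates are YYYY-MM-DD strings.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": METADATA_PREFIX,
        "set": set_spec,
    }
    # The date window is optional in OAI-PMH; add it only when given.
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Example: a request for one day's worth of hep-th records.
    print(list_records_url("physics:hep-th", from_date="2011-01-20"))
```

A daily harvest would presumably issue one such request per selected set, with the date window covering the period since the previous run.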
Selected categories
This section describes which sets/categories are currently harvested from arXiv.
Main sets where all records are accepted:
physics:astro-ph (Astrophysics)
physics:gr-qc (General Relativity and Quantum Cosmology)
physics:hep-ex (High Energy Physics - Experiment)
physics:hep-lat (High Energy Physics - Lattice)
physics:hep-ph (High Energy Physics - Phenomenology)
physics:hep-th (High Energy Physics - Theory)
physics:nucl-ex (Nuclear Experiment)
physics:nucl-th (Nuclear Theory)
Secondary sets, where only some records are accepted:
??
Update sets, where only updates are accepted (outside of the main ones):
physics:astro-ph
physics:cond-mat
physics:math-ph
physics:nlin
physics (Physics)
physics:quant-ph
math
cs
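The acceptance policy implied by the main and update sets listed above can be sketched as follows. This is only an illustration of the policy as listed, not the logic of the actual BibFilter script: the function name accept and the already_in_inspire flag are hypothetical, "update" is assumed to mean a change to a record that already exists in INSPIRE, and main sets are assumed to take precedence where a set appears in both lists. Secondary sets are left out because they are not specified.

```python
# Sets copied from the lists above.
MAIN_SETS = {
    "physics:astro-ph", "physics:gr-qc", "physics:hep-ex", "physics:hep-lat",
    "physics:hep-ph", "physics:hep-th", "physics:nucl-ex", "physics:nucl-th",
}
UPDATE_SETS = {
    "physics:astro-ph", "physics:cond-mat", "physics:math-ph", "physics:nlin",
    "physics", "physics:quant-ph", "math", "cs",
}


def accept(set_spec, already_in_inspire):
    """Decide whether a harvested record should be kept (hypothetical helper).

    set_spec: the arXiv set the record was harvested from.
    already_in_inspire: assumed flag, True if the record already exists.
    """
    if set_spec in MAIN_SETS:
        # Main sets: every record is accepted, new or updated.
        return True
    if set_spec in UPDATE_SETS:
        # Update sets: only changes to existing records are accepted.
        return already_in_inspire
    # Anything else is not harvested.
    return False


if __name__ == "__main__":
    for spec, known in [("physics:hep-th", False), ("math", False), ("math", True)]:
        print(spec, known, accept(spec, known))
```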
Sub-Categories in the future
Including cross-listings:
physics:astro-ph.HE (High Energy Astrophysical Phenomena)
physics:physics.acc-ph (Accelerator Physics)
physics:physics.ins-det (Instrumentation and Detectors)
If it is possible to select by main category only (otherwise DESY selects manually):
physics:physics.data-an (Data Analysis, Statistics and Probability)
Convert step: XSLT stylesheet
The conversion step needs the following files:
- oaiarXiv2inspire.xsl: the main stylesheet that transforms arXiv OAI XML into Inspire MARCXML. This is the path you put in the configuration.
- oaiarXiv2inspire_categories.xml: the file the stylesheet uses to map the various category names.
Both files are installed by running make install from the bibconvert folder of the Inspire repo.
Filtering step: BibFilter script
The bibfilter step also consists of two files:
- bibfilter_oaiarXiv2inspire.py: the main script run by BibHarvest to filter records. This is the path you put in the configuration.
- oaiarXiv_bibfilter_actions.cfg: the configuration file for the filtering process.
Both files are installed by running make install from the bibharvest folder of the Inspire repo.
--
JanLavik - 21-Jan-2011