Collections in Inspire

This is a proposal for the various collections that will be used in Inspire. Some of them will come from workflow tools, others will come from DevelopmentRecordLoading from SPIRES.

There are 4 major types of collections in Inspire:

Status

There will be collections that are based on the status of a record within the workflow:

  • Invisible: Records that have a high probability of being duplicates, or are so rough that they should not be shown at all until some cleanup has been performed - or are likely to be discarded as irrelevant (AH)
  • Visible Temp: Records that have just been added and have not yet been cleaned (i.e. we don't vouch for their accuracy). These are visible to users, but still in the enrichment workflow.
  • Complete: Records that have had everything done to them that we need to do. The meaning of complete (in terms of what tasks have been done) may depend on other factors (source, topic collection, etc)

For more information about the holding pen see:

SystemOperationsWorkflow

and

SystemDesignBibCatalogue

Core/Non-core

There will be collections based on the topic/content of the article. There could be sub-collections for these, but the major collections are:

  • Core: Defined by
    • FC: T, E, P, L....other things DESY selects
  • Non-Core: Defined by
    • FC: A... Things DESY does not select

Core will have more enrichment. Things may move from non-Core to Core as we find out more about them, or vice versa, and that may affect the workflow. However, we may decide to enrich non-core papers in some (many) ways (e.g. nucl, astro, etc.) without moving them into core. This would enable us to create "COREHEP", "CORENUCL" and "COREASTRO".

Import from SPIRES

On import from SPIRES, CORE tags are assigned as follows:

  • if DRN exists ---> core
  • if XDRN is Z, X, e, h ---> core
  • if XDRN is w and (cnum > 0 or TC = C or cpbn > 0 or mn > 0):
    • if FC = T, E, P, L or primarch = hep-th/ex/ph/lat ---> CORE
(is the conf condition even necessary for this statement to be true?)
    • if XDRN is prefix d and not XSRN = Detector ---> noncore
  • else unknown (treated as noncore, but tagged for investigation)
    • probably the conf restriction is not needed so: FC = T, E, P, L or primarch = hep-th/ex/ph/lat ---> CORE (fits here)
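
The import rules above can be read as a small decision function. The Python below is a minimal sketch, not the actual import logic. It assumes a record is a plain dict keyed by lower-cased SPIRES element names, that XDRN is tested by its first character, and that the `d`-prefix rule is a top-level rule (one possible reading of the nesting above). The conference restriction is kept as written, even though the notes suggest it may not be needed. The function name and record layout are hypothetical.

```python
CORE_FC = {"T", "E", "P", "L"}
CORE_ARCHIVES = {"hep-th", "hep-ex", "hep-ph", "hep-lat"}


def classify_core(rec):
    """Return 'core', 'noncore' or 'unknown' for one record (a dict)."""
    prefix = rec.get("xdrn", "")[:1]  # assumption: only the XDRN prefix matters

    if rec.get("drn"):
        return "core"
    if prefix in {"Z", "X", "e", "h"}:
        return "core"

    # "cnum > 0" etc. is read here as "the element is present in the record".
    is_conf = any(rec.get(k) for k in ("cnum", "cpbn", "mn")) or rec.get("tc") == "C"
    hep_topic = (
        bool(CORE_FC & set(rec.get("fc", ())))
        or rec.get("primarch") in CORE_ARCHIVES
    )
    if prefix == "w" and is_conf and hep_topic:
        return "core"
    if prefix == "d" and rec.get("xsrn") != "Detector":
        return "noncore"

    # Treated as noncore downstream, but tagged for investigation.
    return "unknown"
```

For example, `classify_core({"xdrn": "w123", "cnum": "C08-01-01", "fc": ["T"]})` returns `"core"`, while an empty record falls through to `"unknown"`.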

Going forward

SystemDesignBibCategorize should configure most core/non-core with minimal manual assistance.

Subject

Subject categories are not exactly subcategories of Core/non-core, as there are many core papers (by the above logic) that are not actually in the core subjects. The right way to understand this distinction is that CORE/NONCORE is about the reader and their likelihood of finding the information valuable, while Subject is about the paper itself. (Does this make sense?)

Subject collections are based on the FC value, which is in turn aligned with the arXiv category (for papers that happen to be arXived):

  • A astrophysics (astro-ph)
  • B accel phys (phys.accel-phys +)
  • C computing (cs +)
  • E (hep-ex) ---> automatic core
  • G (gr-qc)
  • I inst (NIM type) (phys.ins-det + )
  • L Lattice (hep-lat) ---> automatic core
  • M math and math phys (math, math-ph)
  • N (nucl-th) (and some atomic theory)
  • O other (Not PPA at all- often SSRL, or Library, etc.)
  • P (hep-ph) ---> automatic core
  • Q gen. phys (cond-mat, quant-ph, physics, as well as some crackpot stuff)
  • T hep-th ---> automatic core
  • X nucl-ex

(There is, as far as I can see, no need to have these letter codes. I think all names are straightforward to expand into English.)

Like arXiv, these are not mutually exclusive, though we should decide whether, like arXiv, we'd like a special "primary" topic. I think not.
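
The FC list above amounts to a lookup table. The following is a rough sketch of how it might be held in Python. The category strings are copied as spelled in the list (they are not checked against arXiv's official identifiers), a trailing "+" in the list is dropped, and the names `FC_TO_ARXIV`, `AUTOMATIC_CORE`, `fc_for_arxiv` and `is_automatic_core` are hypothetical.

```python
# FC code -> arXiv categories named in the list above (source spellings kept).
FC_TO_ARXIV = {
    "A": ("astro-ph",),
    "B": ("phys.accel-phys",),
    "C": ("cs",),
    "E": ("hep-ex",),
    "G": ("gr-qc",),
    "I": ("phys.ins-det",),
    "L": ("hep-lat",),
    "M": ("math", "math-ph"),
    "N": ("nucl-th",),
    "O": (),  # "other": no arXiv category is given
    "P": ("hep-ph",),
    "Q": ("cond-mat", "quant-ph", "physics"),
    "T": ("hep-th",),
    "X": ("nucl-ex",),
}

# Codes marked "automatic core" in the list.
AUTOMATIC_CORE = {"E", "L", "P", "T"}


def fc_for_arxiv(category):
    """Return the FC codes whose listed arXiv categories include `category`."""
    return sorted(fc for fc, cats in FC_TO_ARXIV.items() if category in cats)


def is_automatic_core(fc_codes):
    """A record may carry several FC codes; any automatic-core code suffices."""
    return any(fc in AUTOMATIC_CORE for fc in fc_codes)
```

Because subjects are not mutually exclusive, a record carries a list of codes rather than a single value, which is why `is_automatic_core` takes an iterable.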

Import from SPIRES

We need to define mappings from existing alternative codes into FC above. This includes mapping from SCL, FC, DOCTYPE, xdrn, primarch, archive. Currently SPIRES is importing FC, SCL, DOCTYPE to 694__a and noting the source in $$9.

TODO: XDRN should be added (the prefix only? where should the rest go?), but there should be a mapping of the non-FC codes.

Going forward

SystemDesignBibClassify will, for the most part, be able to assign FC (694__a) automagically for those that don't come preassigned from arXiv. There will be manual additions as well.

Form

Collections are defined by formal criteria, i.e. the "type" of the article.

  • theses
  • arXiv
  • peer-reviewed
  • conference
  • citeable
  • review
  • book
  • lecture
  • introductory

Again, I suggest that these be in English, not a letter code.

Import From SPIRES

The import below is currently implemented

  • If CNUM or TC=C ---> ConferencePaper collection
  • If SCL = S or TC = P ---> published
    • I think this is better than explicitly identifying non-peer-reviewed journals (Couldn't we create a peer-review subcollection, where talks, news, etc. don't get in? (AH))
  • if lanl > 0 ---> arXiv collection
  • if pbn > 0 or lanl > 0 ---> citeable collection
  • if Published collection -> citeable
    • want to extend this to conferences...
  • dk=thesis/note=thesis/TC=T --> theses collection
  • tc/scl = R/dk = review ---> review
  • dk=book/TC=B --> book collection
  • dk=lectures/TC=L --> lectures collection
  • dk=introductory/TC=I --> introductory (or popular or some other name) collection

These are stored in 694__a along with the subject classification, but have a different source (currently SPIRES TC)
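
As a rough illustration of the import rules listed above, the sketch below assigns form collections to a single record. It is not the contents of decl.collection. It assumes each record is a dict mapping lower-cased element names to lists of string values, that "x > 0" means the element is present, and that `note=thesis` is an exact match. The function name is hypothetical.

```python
def form_collections(rec):
    """Return the set of form collections for one record (dict of lists)."""

    def has(field, value=None):
        # With no value: is the element present? With a value: does it occur?
        values = rec.get(field, [])
        return bool(values) if value is None else value in values

    found = set()
    if has("cnum") or has("tc", "C"):
        found.add("ConferencePaper")
    if has("scl", "S") or has("tc", "P"):
        found.add("published")
    if has("lanl"):
        found.add("arXiv")
    # Citeable: journal reference, arXiv number, or already in published.
    if has("pbn") or has("lanl") or "published" in found:
        found.add("citeable")
    if has("dk", "thesis") or has("note", "thesis") or has("tc", "T"):
        found.add("theses")
    if has("tc", "R") or has("scl", "R") or has("dk", "review"):
        found.add("review")
    if has("dk", "book") or has("tc", "B"):
        found.add("book")
    if has("dk", "lectures") or has("tc", "L"):
        found.add("lectures")
    if has("dk", "introductory") or has("tc", "I"):
        found.add("introductory")
    return found
```

A record can land in several collections at once: `form_collections({"lanl": ["0803.0001"], "dk": ["thesis"]})` gives the arXiv, citeable and theses collections.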

Mechanics of import from SPIRES to Inspire

The above collection rules are contained in a declared element (dynamic elem) at: /afs/slac/u/li/sul/decl.collection

On the slaclib account, two jobs run:

  • 6:30 pm: /u/li/slaclib/bin/daily_sul.csh (stores a stack of defq in ..heptrans)
  • 10:30 pm: /u/li/slaclib/bin/daily_hepg.csh runs ..inspire.load

    • converts stack.today to result.today and in result.MMDD
    • adds result.append to result.today to load records not in defq for that day
    • xeq decl.collection (see Extract from HEP below)
    • use result.today to generate file of collection data.
    • sel hep.collect --> addupd hep.collect from this file
    • use result.today to generate file of keys --> /u2/spires/slaclib/inspire/result.today
    • and add to /u2/spires/slaclib/inspire/result.compilation
    • IRNs removed that day are added to /u2/spires/slaclib/inspire/removes.today

Note on result.append: this is a 'stored result' file that we can create in the slaclib account out of hep to load records from that day without updating them in hep. To create it, simply find or stack the records (then 'generate result'), then 'store result.append'.

You search it with find @result.append

The file of keys is used by a job run from Inspire to generate the updates to Inspire. The format xmlinspire uses data in hep.collect to create the collection XML.

Going forward

  • Conference: Can be assigned if CNUM or MN or CPBN exist in record (also dk=talk or dk=lectures, place date or dk=review, place date (AH) )
  • Published/Peer-Reviewed: Assigned if the journal of publication is marked as peer-reviewed in CODEN file.
    • There is an ambiguity in that some conference talks are formally published in special volumes of peer-reviewed journals. We currently count this as published, but admit that this is wrong. Would we get closer to the truth if we exclude all journal articles identified as talks? (AH)
  • arXiv: existence of bull element (i.e. paper submitted to arXiv)
  • other: not all documents fall into the above categories.
  • cite-eligible: Any document of any type is cite-eligible if we can track citations to it. Currently it must possess either a report number, a journal reference (not necessarily peer-reviewed) or an arXiv number. This may eventually be extended to conferences.
  • Thesis: (dk=thesis/note=thesis/TC=T)
  • Lectures: (dk=lectures/TC=L)
  • Books: (dk=book/TC=B)
  • introductory: (dk=introductory/TC=I for popular articles)

The logic above will often support the assignment of a formal classification. However, this needs to be placed in the workflow carefully, since some of the criteria above change as a document is enriched.

published/peer reviewed:

In future, "published" in Inspire terminology will designate only peer-reviewed journal articles and possibly book chapters. By default, journal articles with a cnum do not count as published. Exceptions to these rules will be handled on a case-by-case basis.

To do:

  • list of journals that never publish conf proceedings - their articles count always as published
    • a conf paper in these journals passes through normal peer review
  • list of book series that count as published, e.g. Lect. Notes Phys.
  • list of journals without peer review, e.g. CERN Courier
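
Once the lists above exist, the default "published" rule could be expressed roughly as follows. This is only a sketch under assumptions: the three sets stand in for the lists still to be compiled, journals are identified by a plain title string, and case-by-case exceptions are not modelled. All names are hypothetical.

```python
def is_published(journal, has_cnum, peer_reviewed, never_conference, published_series):
    """Decide whether a record counts as 'published'.

    journal          -- journal or series title, or None if there is none
    has_cnum         -- True if the record carries a conference number
    peer_reviewed    -- set of journals marked as peer-reviewed
    never_conference -- set of journals that never publish conf proceedings
    published_series -- set of book series that count as published
    """
    if journal is None:
        return False
    # Articles in these always count, even when a cnum is present.
    if journal in never_conference or journal in published_series:
        return True
    # Journals without peer review (e.g. CERN Courier) never count.
    if journal not in peer_reviewed:
        return False
    # Default: a peer-reviewed journal article with a cnum is not published.
    return not has_cnum
```

Keeping the lists as arguments rather than globals makes the rule easy to re-run as the curated lists grow.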

Use in citesummary

These formal-criteria collections will be useful for searching, as some people will want to exclude or include certain collections. They are also very useful for the citation summary:

We should present 3 columns:

  • Published
  • Published + arXiv
  • all cite eligible

In presenting average statistics, etc., each column should of course use only the papers in its own set. Thus one usually finds that published sets have higher averages than the others. For the first 2 columns we are confident in our ability to track citations fairly accurately. The third column is less certain, but could be useful to show as it improves in accuracy.
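
To make the per-column averaging concrete, here is a minimal sketch. It assumes each paper is a dict with a set of collection names and a citation count, and that "Published + arXiv" means the union of the two collections. The collection names and the function name are hypothetical.

```python
def cite_summary(papers):
    """Summarise citations for the three columns.

    papers -- list of dicts with 'collections' (set of str) and 'citations' (int)
    """
    columns = {
        "Published": lambda c: "published" in c,
        "Published + arXiv": lambda c: "published" in c or "arXiv" in c,
        "All cite-eligible": lambda c: "citeable" in c,
    }
    summary = {}
    for name, belongs in columns.items():
        # Each column averages over its own set of papers only.
        cites = [p["citations"] for p in papers if belongs(p["collections"])]
        summary[name] = {
            "papers": len(cites),
            "citations": sum(cites),
            "average": sum(cites) / len(cites) if cites else 0.0,
        }
    return summary


papers = [
    {"collections": {"published", "citeable"}, "citations": 30},
    {"collections": {"arXiv", "citeable"}, "citations": 10},
    {"collections": {"citeable"}, "citations": 2},
]
summary = cite_summary(papers)
```

With this toy data the averages are 30.0, 20.0 and 14.0 for the three columns, showing how the published set can have a higher average than the broader sets.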

Non-Literature

In addition to the literature database, we will eventually have authority-type files like HEPnames, Inst, Conf in Inspire. These are naturally collections.

Initially they will continue to be maintained in SPIRES, and dumped into Invenio collections if needed, otherwise linked to via HTTP. At some point, however, they will be maintained and live as collections within Inspire with very different data structures. The ones used as authority should move sooner rather than later, and Inst is already there:

  • institutions
  • conferences
  • names
  • experiments

INST and CONF are both nearly working, with imports/mapping via XSLT committed in Travis' git. Only the formatting for the confs and insts is still needed.

It should be noted that many of the non-literature collections might also share collections like subject codes (Heath has used this to good effect in Jobs and names).

Extract from HEP

Procedures to load data from HEP to Invenio

  • cp ~sul/decl.collection (myactive tmp file)
  • use (my active temp file)
  • sel hep// xeq
  • set elem irn collection
  • set recsep 'addupd;'
  • for subf // in act cle dis all
  • spibild//establish hepcollect//input batch
  • set for hepcollect.xml
  • for subf//in act cle dis all (did this to create separate occurrences of collection without doing filedef stuff in HEP)

-- TravisBrooks - 05 Mar 2008

Topic revision: r19 - 2011-12-08 - MikeSullivan
 