Record ingestion from CERN Document Server (CDS)

Introduction

This page describes the various methods of ingesting relevant content from the CERN Document Server (CDS) into INSPIRE, how this content is transformed to proper INSPIRE formats, and how the records are matched with BibMatch to avoid duplicate entries. It is a step-by-step explanation of the workflow, with special emphasis on using the BibMatch tool. It is written specifically for migrating from CDS to Inspire, but can be used as a template for similar operations in other Invenio installations.

In the following I will try to explain what you can and cannot do with BibMatch. Some of these things might seem obvious, but they might save you some time.

The basic idea is to migrate parts of the CERN-specific collections found on CDS to Inspire. This is complicated by the publication patterns in the discipline: as documents feed into Inspire, there is a possibility that the records you try to upload already exist there. This is why we match the uploads against Inspire, to make sure that we aren't creating duplicates. For some of our collections, e.g. things we know for sure haven't been published elsewhere, it is fairly simple: we get the records, transform them to Inspire format and upload them. For other records, where we know that both sets carry identical unique identifiers (e.g. arXiv numbers), we simply match on that one ID. If the very same string appears in the relevant field on both CDS and Inspire, the records are matched.

The tricky part is when we aren't sure whether (or how many of) the records already exist, and when unique IDs aren't consistent across the two systems. As cataloguing practice can differ, this is not wrong, but it is fairly inconvenient: the main author (100 field) might be recorded as two different people, and the title, page number and publication year might differ slightly between two records that describe the same document. So we cannot use exact field-by-field matching, which is why we have BibMatch.

The basic workflow follows seven steps:

Step 1. Harvest the records from CDS
Step 2. Transform them with xslt
Step 3. Match them against Inspire
Step 4. Validation
Step 5. Upload new records
Step 6. Exchange recids between the systems - and maybe more metadata, like repnrs
Step 7. Document the upload

Prerequisites for performing the task are:

The current files are attached to this twiki, but you might want to ask your local friendly system librarian for updated files. Due to security settings on the twiki, the .py-files have an extra .txt-extension, so you will have to delete that when they are downloaded.

Step 1. Harvest the records from CDS

The first task is to get the records we need to put on Inspire from CDS. The CDS server is fairly susceptible to timeout errors, so we are currently using a workaround to harvest the records we need for matching. First run a query in CDS for the records that need to be harvested, e.g. for all the ATLAS Notes, which will produce this url:

https://cds.cern.ch/search?ln=en&cc=ATLAS+Notes

The url can be changed to produce different output formats. In this case you want all the recids, so you add '&of=id' at the end:

https://cds.cern.ch/search?ln=en&cc=ATLAS+Notes&of=id

The output will be a comma-separated list of the recids of the records in your query. Copy-paste them into a .txt file, search and replace all commas with line breaks, and save the file in the same folder as the script harvest.py.
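If you prefer not to do the url editing and the search-and-replace by hand, the two manual steps above can be sketched in Python. This is only an illustration and not part of the attached scripts: the function names build_id_query_url and recids_to_lines are made up for this example, and the 'wl' parameter is only needed in the "Search term too generic" case described further down.

```python
import re
from urllib.parse import urlencode

CDS_SEARCH = "https://cds.cern.ch/search"


def build_id_query_url(collection, wildcard_limit=None):
    """Build a CDS search url that returns only recids (of=id).

    Hypothetical helper: it simply reproduces the url shown above.
    """
    params = {"ln": "en", "cc": collection, "of": "id"}
    if wildcard_limit is not None:
        # wl=0 lifts the wildcard limit (requires being logged into CDS)
        params["wl"] = wildcard_limit
    return CDS_SEARCH + "?" + urlencode(params)


def recids_to_lines(raw):
    """Turn the comma-separated recid output into one recid per line."""
    recids = re.findall(r"\d+", raw)
    return "\n".join(recids) + "\n" if recids else ""


if __name__ == "__main__":
    print(build_id_query_url("ATLAS Notes"))
    # Paste the output of the url between the quotes below
    print(recids_to_lines("123, 456, 789"), end="")
```

The result of recids_to_lines can be saved as the .txt file that harvest.py reads with the -r option.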

Then run the harvest script on the command-line.

$ python harvest.py -r whatever.txt -s http://cdsweb.cern.ch

Note: when running the CDS query you might in some cases encounter "Search term too generic, displaying only partial results...". In that case you have to be logged into CDS and append '&wl=0' at the end of the query url. This changes the wildcard limit to 0.

This harvests the records and puts them into your tmp-folder as a file called /tmp/datastampsomething.dat

Step 2. Transform them with xslt

Once the records are harvested, use the cds2inspire3.xsl stylesheet, which is based on the mapping between CDS and Inspire, to change e.g. the field 'translated title' (242 in CDS, 246 in Inspire). As the stylesheet references external files for translating the CDS categories to Inspire categories, it is important to keep these files in the same folder as the stylesheet.

Go to the folder where you keep the xsl-file and write:

$ xsltproc cds2inspire3.xsl /tmp/datastampsomething.dat > inspireformat.xml

The generated file inspireformat.xml will contain the transformed records.

Always check the output of the transformation with someone else (like Annette) before uploading the final records to Inspire, in case we need to add some new information or something needs to be changed. If you need to add/remove/change some of the fields, you can either change the xslt or - if it is a one-time upload (like a specific conference) - use the text-editor Sublime to make the changes.

Step 3. Match them against Inspire

BibMatch for dummies

BibMatch is a tool for matching bibliographic meta-data records against a local or remote Invenio repository and serves to identify and thereby avoid duplicate entries when migrating records from CDS to Inspire. Essentially it works in two distinct steps: first a query, then matching. As errors might occur, in the form of false positives, in which two similar, but not identical, records are matched, the output should always undergo a degree of human validation before creating links or uploading records between the two systems.

This isn't a comprehensive guide to all the possible settings in BibMatch, which can be found in the BibMatch documentation, but rather an experience-based guide to the aspects needed for this particular task.

Before we get started, a brief primer on the reason we need BibMatch in the first place. Consider the two records below:

[Figures: BibMatchX.png (the record in CDS, left) and BibMatchX2.png (the same record in Inspire, right)]

The actual document is identical across both systems, but the author name is different, and the report numbers and keywords differ as well. In CDS (on the left) the conference is stated, but in Inspire this isn't noted anywhere in the record. This is just an example. Most HEP records contain multiple authors, and the main entry (100 field) can change from system to system ... just like cataloguing practice can change over time. Before doing the matching (or after the first run-through), look at the records and try to see if there is some pattern to the errors. A common problem is that an author publishes a document and then presents his/her findings at a conference in a shorter format or as slides. The title and author are the same, but some changes have been made to the document, so it can be considered another edition of the same work. The number of pages for the same document can also differ, depending on whether tables, appendices, figures and references are counted. That is the great thing about standards: everybody has one.

Step 3.1 The BibMatch Query

The first step takes the records harvested from CDS and generates one or more queries based on each record. This returns a set of records in Inspire to be matched in the second step. This is done to avoid matching every CDS record against the entire Inspire installation, and it means that BibMatch only attempts to match with records found by this initial query. Think of the process as the one you would go through yourself if you had to do the same thing by hand. You have a record with various pieces of information, and you need to see whether it is already in a database. You wouldn't compare the record with every single record in the database, but rather save time by narrowing your options down first, e.g. by only going through records by the same author or with similar titles. That is exactly what BibMatch does in the query.

The most basic command looks like this

$ python bibmatch -r "http://inspirehep.net" -b dummies < inspireformat.xml

This launches a query on Inspire, matching the records in the file 'inspireformat.xml' against Inspire. Doing it like this only takes the value from 245a and copy-pastes it into the search box in Inspire. This might not suffice if the title is too generic, if there are inconsistencies across the systems, or if special characters in the title mess up the search engine in Inspire. By their nature, titles and names are messy, so BibMatch can use the BibConvert syntax to extract 'cleaned' queries. Below is an example of this.

The following syntax in the BibMatch queries is taken from BibConvert, but you do not need to pay too much attention to the details. By typing in the command below:

$ python bibmatch -q "[245__a::REP(-, )::SUP(PUNCT, )::SHAPE::REP( , title:)::REP(EOL,)::ADD(title:,)]" -r "http://inspirehep.net" -b dummies < inspireformat.xml

we break up the query into field searches in the title field by adding 'title:' before each separate word, and we also ignore special characters, such as the $. As some titles might not be specific enough, qualifying with an author name is the go-to in most retrieval situations, and BibMatch is no different. The issue here is that author names are often treated differently: sometimes they are indexed with their full names and sometimes with surname, comma, initial(s). The query below accounts for this by truncating the first names:

$ python bibmatch -q "[100__a::REP(COMMA,;;)::LIMW(;;,R)::REP(;;,)::REP( , author:)::SHAPE::ADD(author:,)]*" -r "http://inspirehep.net" -b dummies < inspireformat.xml
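To make the effect of the title and author format strings a bit more concrete, here is a rough Python approximation of the queries they are described as producing. This is not how BibConvert is implemented and it does not reproduce every formatting function exactly. The function names are made up for this example, and the author variant assumes that only the surname is kept and a trailing wildcard is added.

```python
import string


def title_query(title):
    """One 'title:' term per word, with hyphens and punctuation dropped."""
    text = title.replace("-", " ")
    text = "".join(" " if ch in string.punctuation else ch for ch in text)
    return " ".join("title:" + word for word in text.split())


def author_query(name):
    """Surname only, followed by '*' to truncate first names/initials.

    Assumption: the name is stored as 'Surname, First names'.
    """
    surname_words = name.split(",", 1)[0].split()
    if not surname_words:
        return ""
    return " ".join("author:" + word for word in surname_words) + "*"


if __name__ == "__main__":
    print(title_query("Search for the Higgs-boson: $H \\to bb$"))
    print(author_query("Ellis, John"))
```

Comparing the output of a sketch like this with what BibMatch prints in verbose mode is a quick way to spot a missing blank space in a copy-pasted query.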

As we are doing known-item search, a combination of author and title is usually the most feasible approach, but there are of course others we might consider. Report numbers can serve as unique IDs, but they far from always appear in both systems. It still makes sense to include them in your query, because it doesn't do any harm: if there is no 037 field in Inspire, or the value doesn't match, BibMatch simply moves on to the next query.

$ python bibmatch -q "[037__a]" -r "http://inspirehep.net" -b dummies < inspireformat.xml

This takes the value in 037 and puts it in the search field in Inspire. The combined go-to query when all this is put together looks like this:

$ python bibmatch -q "[037__a]" -q "[100__a::REP(COMMA,;;)::LIMW(;;,R)::REP(;;,)::REP( , author:)::SHAPE::ADD(author:,)]*" -q "[100__a::REP(COMMA,;;)::LIMW(;;,R)::REP(;;,)::REP( , author:)::SHAPE::ADD(author:,)]* [245__a::REP(-, )::SUP(PUNCT, )::SHAPE::REP( , title:)::REP(EOL,)::ADD(title:,)]" -v9 -n -r "http://inspirehep.net" -b dummies --ascii < inspireformat.xml

Which will typically do the trick.

The '-v9' in the string above means you will get verbose output, i.e. you will be able to see what is going on while BibMatch is running, which can be very helpful. In case you see some oddities in the queries, e.g. two author names 'melted' together, remember that the output you see in BibMatch is exactly what Inspire sees as well. So if the query seems off, a blank space is probably missing somewhere (this has been known to happen due to erroneous copy-pasting). The '-n' suppresses the printing of records to stdout.

BibMatch does not let you specify rules of the form if-this-value-appears-in-this-field-do-this. To avoid false positive matches for a known group of records, you can instead append an exclusion term (a value prefixed with '-', with '*' as wildcard) to each query, which omits any records containing that value. Consider, for example, these two records:

http://inspirehep.net/record/1204275?ln=en

http://inspirehep.net/record/1119557?ln=en

While these two documents are very much alike, they have two different report numbers on CDS and they are, in fact, not entirely the same article. If you want to match the ATLAS-CONF notes (which you won't need to do, since they are put on Inspire automatically... but this is just for the sake of argument), any time experimental physics (EP) has published something, it will be a false positive. After having bumped into the same type of false positive during validation (step 4) enough times, you can decide that, rather than sorting it out afterwards, you will ensure that these records are never even found by the BibMatch query by adding a NOT operator at the end of each query string, like so:

$ python bibmatch -q "[037__a] -CERN-PH-EP*" -q "[100__a::REP(COMMA,;;)::LIMW(;;,R)::REP(;;,)::REP( , author:)::SHAPE::ADD(author:,)]* -CERN-PH-EP*" -q "[100__a::REP(COMMA,;;)::LIMW(;;,R)::REP(;;,)::REP( , author:)::SHAPE::ADD(author:,)]* [245__a::REP(-, )::SUP(PUNCT, )::SHAPE::REP( , title:)::REP(EOL,)::ADD(title:,)] -CERN-PH-EP*" -v9 -n -r "http://inspirehep.net" -b dummies --ascii < harvest.xml

Which fixes that problem.

The most important thing here is to know which fields to search for. If some of the records on CDS have DOIs, then it makes sense to include that in the query. If you try to match some records with special characteristics, you can include that in the query.

Step 3.2 BibMatching... ftw

Once a set of records has been returned, the matching commences. This is the real meat and potatoes of BibMatch: each record is matched against the corresponding set of records found in the query phase. How, and how much, the CDS record has to resemble the retrieved Inspire records before BibMatch considers it a match depends entirely on your local configuration, the matching strategy, which is found in your local-conf file.

A standard configuration looks like this:

{ 'tags' : '245__%,242__%',
'threshold' : 0.9,
'compare_mode' : 'lazy',
'match_mode' : 'title',
'result_mode' : 'normal' },
{ 'tags' : '037__a,088__a',
'threshold' : 1.0,
'compare_mode' : 'lazy',
'match_mode' : 'identifier',
'result_mode' : 'final' },
{ 'tags' : '100__a,700__a',
'threshold' : 0.8,
'compare_mode' : 'lazy',
'match_mode' : 'author',
'result_mode' : 'normal' },
{ 'tags' : '773__a',
'threshold' : 1.0,
'compare_mode' : 'lazy',
'match_mode' : 'title',
'result_mode' : 'joker' }

Each statement between { } contains several things which can be tweaked for the specific purpose. Basically the matching strategy can be broken into five sub-rules. I find it easiest to think about them in the order seen below, albeit they appear in a slightly different order in the local-conf file.

* tags
* match_mode
* compare_mode
* threshold
* result_mode

First you choose to look at one or more marc-fields. Either at specific sub-fields or all of them by using '%'. In the example above, 245__% and 242__% are put in there together, in order not to miss translated titles (the 242 field).

Then you decide how to 'look' at the string in the field. BibMatch can do some handy modification of strings, like taking initials into account when we are looking at authornames or stripping punctuation from doi, issn or other identifiers. This is conceptually exactly the same as you did in the query with the BibConvert syntax.

So now BibMatch has a bunch of fields and subfields, and you have to decide how to compare them. Let's say you want to match authors in both the 100 and 700 fields: lack of a consistent main entry, i.e. two different authors being listed in the 100 field in the two systems, is a fairly common occurrence. The same can be said for primary (037) and secondary (088) report numbers. This is why these tags are listed together with 'compare_mode' set to lazy in the configuration above. If you are certain that the main entry is indeed the main entry, e.g. if you are matching a collection of theses, you can set the 'compare_mode' to normal.

Now BibMatch has a list of fields on both sides of the CDS-Inspire fence and a rule for which field should be compared to which field. The next tweak you can do is to decide 'how similar' the fields have to be to constitute a match. The threshold is a value for string similarity and denotes how similar the strings in the fields must be before they are considered a match. In some cases, like titles and author names, a value less than 1 (100%) makes sense, as spelling variations might occur. For unique IDs (like report numbers, arXiv numbers or DOIs) the value should be 1.0. In some special cases you might want to set the threshold really low, e.g. if you want to match conferences between CDS and Inspire and do so by matching on starting date (threshold 1.0), place (threshold 0.5) and title (threshold 0.5). As there is rarely more than a handful of conferences with the exact same starting date, you can safely lower your threshold to get as many matches as possible. Note that the threshold also works for numbers, which means that you can do 'almost-matches' for dates and page numbers. This is useful, as these things can vary for the same document (often one cataloguer will count all the pages and the next one will stop when the bibliography or plots start).
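To get a feeling for what a threshold of 0.9 versus 0.5 means, the sketch below compares two strings with Python's standard difflib module. This is only an illustration of the idea of a similarity threshold: BibMatch's own similarity measure and string normalisation may well differ, and the function name is_similar is made up for this example.

```python
from difflib import SequenceMatcher


def is_similar(left, right, threshold):
    """Return True if the two strings are at least 'threshold' similar.

    The ratio is 1.0 for identical strings and approaches 0.0 for
    strings with nothing in common. Case is ignored here, which is an
    assumption of this sketch and not a documented BibMatch behaviour.
    """
    ratio = SequenceMatcher(None, left.lower(), right.lower()).ratio()
    return ratio >= threshold


if __name__ == "__main__":
    cds_title = "Search for the Higgs boson in the diphoton channel"
    inspire_title = "Search for the Higgs boson in the di-photon channel"
    for threshold in (1.0, 0.9, 0.5):
        print(threshold, is_similar(cds_title, inspire_title, threshold))
```

Running a few known duplicates through a check like this can help when deciding how far a threshold can safely be lowered for a given collection.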

And then... you have used parts of the CDS record to construct a query in Inspire. All the records found are then compared string-by-string for a number of fields of your choice and by a certain degree of similarity. The final setting is how BibMatch interprets the result of these comparisons. Imagine yourself going through the records again. You have your original document in hand and you go through each piece of bibliographic information ... BibMatch does the same.

It goes through each of the statements between { } and the result_mode decides how it reacts to what it finds.

- 'normal' : a failed match will cause the validation to immediately exit as a failure. A successful match will cause the validation to continue on other rules (if any).
- 'final' : a failed match will cause the validation to immediately exit as a failure. A successful match will cause validation to immediately exit as a success.
- 'joker' : a failed match will cause the validation to continue on other rules (if any). A successful match will cause validation to immediately exit as a success.

If the result_mode is normal, it will continue to the next rule. The joker setting is used for semi-unique identifiers (for lack of a better word) like report numbers and ISBNs, while the final setting applies to something like DOIs. It is worth noting that if the field isn't present in both records, BibMatch skips the rule. So don't worry about missing matches when setting DOIs to 'final', as the rule only comes into play when both records have the 0247_ field.
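The way the three result modes interact can be sketched as a small Python loop. This is a simplified model of the rules described above and not BibMatch's actual code. The function name validate and the outcome values are made up for this example, and the behaviour when no rule exits early (the record counts as a match only if at least one rule succeeded) is an assumption.

```python
def validate(rules, outcomes):
    """Walk through the rules in order and apply their result_mode.

    rules:    list of dicts with a 'result_mode' key, as in local-conf
    outcomes: one value per rule: True (fields matched), False (fields
              did not match) or None (field missing in one of the
              records, so the rule is skipped)
    """
    matched_any = False
    for rule, outcome in zip(rules, outcomes):
        if outcome is None:
            continue  # field not present in both records: skip rule
        mode = rule["result_mode"]
        if outcome:
            if mode in ("final", "joker"):
                return True  # success exits immediately
            matched_any = True  # 'normal': carry on with next rule
        elif mode in ("normal", "final"):
            return False  # failure exits immediately
        # failed 'joker': carry on with next rule
    # Assumption: with no early exit, at least one rule must have matched
    return matched_any


if __name__ == "__main__":
    rules = [
        {"tags": "245__%,242__%", "result_mode": "normal"},
        {"tags": "037__a,088__a", "result_mode": "final"},
        {"tags": "100__a,700__a", "result_mode": "normal"},
        {"tags": "773__a", "result_mode": "joker"},
    ]
    # Title matches, no report number on one side, authors match
    print(validate(rules, [True, None, True, False]))
```

Note how the order of the rules matters: a 'final' rule placed early decides the outcome before any later rule is looked at.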

And there you have it.

Step 3.3 Local-conf and the log-file

Aside from the settings mentioned in the previous section, you might want to give some of the other settings a glance as well. It is possible to adjust things like the sleep time between requests to a remote server, which is very handy when you don't want to overburden a system. If you set it too low you will eventually get a timeout; if you set it too high, matching will take forever (at least for larger collections).

To save time, there is also an upper limit of allowed hits for each individual query built into BibMatch. The default is 15, meaning that if a query is too generic (e.g. Ellis*) and returns more than 15 hits, the results are ignored entirely. This limit can be changed in your local.conf, and if you discover some records in CDS that already exist in Inspire but aren't matched, you might want to bump this up to somewhere in the 25-35 range.

Step 4. Validation

You end up with four files called dummies.new.xml, dummies.matched.xml, dummies.fuzzy.xml and dummies.ambiguous.xml. You can name these files whatever you fancy with the '-b' option in the bibmatch command.

* dummies.matched.xml are the ones that did match. The file will contain each record, preceded by comments with a url pointing to its location on Inspire and info on the matching criteria.

Rather than going through the xml, you can generate a text-file with a field-by-field comparison of the matches by using generate_output.py like this:

$ python generate_output.py dummies.matched.xml > pretty_matches.txt

There are lines at the end of the script that let you easily change which fields are included in the pretty output - the fewer relevant fields you have - the faster it is to skim through.

* dummies.new.xml are all the CDS-records that didn't match anything - which are the ones you will want to upload.

In addition to this you will have dummies.fuzzy.xml and dummies.ambiguous.xml. The former contains the records that didn't quite match, but could still be relevant, and the latter contains the records for which more than one match was found. This is typically the case for records with very generic titles, but it can also be a case of duplicate entries on Inspire, which sometimes happens because we have duplicates on CDS that have been uploaded at an earlier date. In this case you should talk to someone (like Cath) and ask her to merge the records.

As far as the fuzzy and ambiguous matches go: check them and copy-paste them into either dummies.matched.xml or dummies.new.xml. Remember to remove the incorrect Inspire urls from the ambiguous matches if you move them to dummies.matched.xml, and only leave the one correct url (this is needed for step 6).
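Before and after moving records between the files, it can be useful to check how many records each file contains, so that nothing gets lost along the way. A minimal sketch, assuming the files are MARCXML with one 'record' element per record (the function names are made up for this example):

```python
import os
import xml.etree.ElementTree as ET

SUFFIXES = ("new", "matched", "fuzzy", "ambiguous")


def count_records(path):
    """Count 'record' elements in an XML file, with or without namespace."""
    tree = ET.parse(path)
    return sum(
        1 for element in tree.iter()
        if element.tag.rsplit("}", 1)[-1] == "record"
    )


def summarize(basename):
    """Return {suffix: record count} for the BibMatch output files found."""
    counts = {}
    for suffix in SUFFIXES:
        path = "%s.%s.xml" % (basename, suffix)
        if os.path.exists(path):
            counts[suffix] = count_records(path)
    return counts


if __name__ == "__main__":
    for suffix, count in summarize("dummies").items():
        print("%-10s %d" % (suffix, count))
```

The sum of the four counts should equal the number of records you fed into BibMatch, both before and after the manual sorting.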

Step 5. Upload new records.

Always use the xmllint command to confirm that your xml is well-formed before trying to upload it. You then use the batch uploader (ask Jan or Javier for an account with cataloguer-rights) and insert the records on Inspire.

If you have a large set, you will have to 'skip the upload simulation' in the batch uploader, because it tends to freeze if you overload it.

You will then need the newly created recids (001-fields) on Inspire for the next step, so make sure to make a note of them. You can either ask for them (again, Jan or Javier) or simply wait until they are up and 'find the newest one and count backwards' as they will be uploaded successively. There are rumors of a new feature, where you will simply get an email with the log-file of the upload, which will contain the recids as well.

Go through a couple to make sure they look correct.

One problem we have encountered many times relates to the harvesting of full-text from CDS. In the 856 field (or in this case the FFT field) you will typically find a url pointing to somewhere on CDS. Much like BibMatch itself, the uploader might cause a timeout if you try to upload too many records from CDS at once... especially if there are multiple plots associated with the record. This can be solved by downloading the files to a local drive and then pointing to that location, rather than to CDS.

Step 6. Crossreferencing

After the upload of the new records is complete, you should map the connection in both CDS and INSPIRE, so that in the future we know (for CDS) the recid of the INSPIRE record and vice versa. This is done by putting the value from 001 in 035$a (+035$9CDS/Inspire). To do this, first put your dummies.matched.xml in the folder where you keep recidmap.py and do the following:

Add matched record IDs to CDS:

$ python recidmap.py -i Inspire dummies.matched.xml

Add matched record IDs to INSPIRE:

$ python recidmap.py -i CDS --reverse dummies.matched.xml

Remember to do this right afterwards. I had some problems when someone (for whatever reason) deleted one in a couple of thousand uploaded records over a weekend. The script expects a one-to-one relation between your list of matched records and will give an error if this is not the case. In which case you have to identify the deleted record and remove it from your dummies.matched.xml... save yourself the trouble and do it right away.

CDS IDs will be automatically added to new INSPIRE records. But as soon as these new records are inserted into the system, the new record IDs generated have to flow back into the CDS records. This is where the recids you noted down in step 5 come in.

Add Inspire uploaded record IDs to CDS with IDs in a file:

$ recidmap.py -i Inspire --bare --recids=recids.txt --cern bibmatch_records.insert.xml

or a range of IDs

$ recidmap.py -i Inspire --bare --recids='123->204' --cern bibmatch_records.insert.xml

Remember to append this time, not correct or insert!

Future wishlist. To append more than just CDS recids to the matched Inspire records e.g. reportnumbers and abstracts (in case they are missing). But this is for when Spires is shut down... which now has happened - so we should tackle this feature! [AH]

Step 7. Keeping track

Add the information about the update to:

https://twiki.cern.ch/twiki/bin/view/Library/CDSInspireUploads

Rinse and repeat next week.

As the content we are moving from CDS to Inspire isn't static, the collections will have to be re-harvested from time to time. This is a list of these dynamic collections.

CERN-THESES - based on the query:

65017:'Particle' or 693:'LHC' or 693:'LEP' or 693:'clic' or 693:'SPS' -arxiv -035:inspire -035:spires -502:'internship' -980:hidden

A lot of these are already uploaded to Inspire with report numbers, so always use joker for the 037 field in the local-conf.

LHCb Notes - the entire collection (with the query -035:inspire)

ATLAS Notes - the entire collection (with the query -035:inspire). Add -CERN-PH-EP* to the query string to avoid false positives for the ATLAS-CONF notes.

CDS -> Inspire Mapping

This section covers input of record metadata from CDS to Inspire and details the various MARC-field and format differences between the systems. Here is a table detailing how the record transformation is done, field by field.

| CDS | INSPIRE | INSPIRE Name | Notes |
| 001 | 035 | external key | Copy of CDS record ID as: 035 $$9CDS $$a RECID |
| 035 | 035 | external key | Copy. Exclude: fields with only $$9; when $$a/$$9 contains cercer, inspire, cern annual report, cds, cmscms, wai01, xx |
| 037 | 037 | primary_report_number | Copy. Except: if $$a contains arXiv: add $$9arXiv, add $$c taken from 695 $$a |
| 041 $$a | 041 $$a | language | Only if not English, from abbreviation to full language name, e.g. fre -> French |
| 088 $$a | 037 $$a (595 $$b) | primary_report_number | Copy. Except: if $$9 starts with CM-P0 or P00 -> 595 $$b (barcode) |
| 242 | 246 | translated title | Copy. |
| 245 | 245 | title | Copy. |
| 260 | 260 | place of publication | Copy, only if 980 $$aTHESIS is NOT found. |
| 269 / 961 $$x | 269 | date | 269 $$c is taken from either 269 or 961 and reformatted. 961 $$x: 20020614 -> 2002-06-14 or 269 $$c: 15 Oct 2010 -> 2010-10-15 |
| 300 | 300 | pages | Only number, stripped from pre-/postfixes. 'mult p.' is ignored. |
| 100 | 100 | first author | Copy. Add punctuation to name initials. |
| 700 | 700 (701) | additional authors | Copy. Add punctuation to name initials. If 980 $$a THESIS, add in 701 instead and ignore $$e. |
| 500 | 500 | note | Copy. |
| 502 | 502 | degree_info | Copy. Except $$a -> $$b and $$b -> $$c and $$c -> $$d |
| 520 | 520 | abstract | Copy. |
| 65017 | 65017 | content classification | Translated to INSPIRE categories. |
| 6531_ | 6531_ | free-keyword | Copy. Add $$9 author |
| 690C | 690C | formal classification | Only if $$a INTNOTE -> NOTE and $$a THESIS |
| 693 | 693 | experiment | Copy. Except: $$a 'Not applicable' is ignored |
| 695 | 695 | thesaurus terms | Copy. |
| 710 | 710 | collaboration | Copy. Except: $$5 and $$aCERN. Geneva* is ignored |
| 773 | 773 | publication_info | Copy. $$p is translated to INSPIRE journal names |
| 8564 | 8564 (FFT) | attached_files / URL | Copy. Except: if $$u contains http://cdsweb.cern.ch -> FFT upload. Ignored if $$u contains any of: cmsdoc.cern.ch, documents.cern.ch, preprints.cern.ch. If 088 contains 'CMS' add 8564_ $$u http://weblib.cern.ch/abstract?REPORTNUMBER |
| | 980 | collections | Add $$aHEP. If thesis add $$aThesis, if published add $$aPublished and $$aCiteable, if arXiv add $$aArxiv |
| anything else | | * | Ignored. |
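As a worked example of one row in the table, the date reformatting for 961 $$x and 269 $$c can be illustrated in Python. In the actual workflow this transformation is done by the XSLT stylesheet; the sketch below only shows the intended input and output formats, and the function name normalize_date is made up for this example.

```python
from datetime import datetime

# Input formats taken from the two examples in the mapping table
INPUT_FORMATS = ("%Y%m%d", "%d %b %Y")


def normalize_date(value):
    """Return the date as YYYY-MM-DD, or None if the format is unknown.

    Note: '%b' matches English month abbreviations only when Python
    runs with an English (or C) locale.
    """
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
        return parsed.strftime("%Y-%m-%d")
    return None


if __name__ == "__main__":
    print(normalize_date("20020614"))     # 961 $$x style
    print(normalize_date("15 Oct 2010"))  # 269 $$c style
```

A check like this can be handy when spot-checking the 269 fields in the transformed file before upload.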

-- JanLavik - 28-Mar-2011

Topic attachments
| Attachment | History | Size | Date | Who | Comment |
| BibMatchX.png | r1 | 70.7 K | 2012-08-24 - 09:07 | RasmusThogersen | Example image of records to illustrate difference between same object in CDS and Inspire |
| BibMatchX2.png | r1 | 29.2 K | 2012-08-24 - 09:08 | RasmusThogersen | Example image of records to illustrate difference between same object in CDS and Inspire |
| cds2inspire3.xsl | r1 | 27.7 K | 2013-01-24 - 19:09 | RasmusThogersen | |
| cds_inspire_693.xml | r1 | 0.7 K | 2013-01-24 - 19:09 | RasmusThogersen | |
| cds_inspire_categories_65017.xml | r1 | 2.4 K | 2013-01-24 - 19:09 | RasmusThogersen | |
| cds_inspire_journal_abbreviations.xml | r1 | 364.3 K | 2013-01-24 - 19:09 | RasmusThogersen | |
| cds_inspire_languages.xml | r1 | 3.1 K | 2013-01-24 - 19:09 | RasmusThogersen | |
| generate_output.py.txt | r1 | 4.4 K | 2013-01-24 - 19:09 | RasmusThogersen | |
| harvest.py.txt | r1 | 6.4 K | 2013-01-24 - 19:09 | RasmusThogersen | |
| recidmap.py.txt | r1 | 13.6 K | 2013-01-24 - 19:09 | RasmusThogersen | |
Topic revision: r31 - 2013-07-15 - AnnetteHoltkamp