Comparison of SLAC and CERN reference extraction

Introduction

Note that SPIRES runs 4 levels of extraction:

  1. LaTeX parser - first cut; picks up embedded SPIRES tags and most cites
  2. PDF parser - used only if LaTeX fails (failure is determined by a human comparing the paper against the extracted cites, or by the extractor returning 0 citations)
  3. Cut&paste parser - used if the PDF parser fails: the PDF is opened on a user's screen and the references are cut and pasted into the parser
  4. Human checks - a person runs through all the extracted references, checking them by hand (a difficult job that may not be that accurate, depending on time of day, etc.) and entering missed references. Note that certain failures are unlikely to be caught by humans (e.g. one missed reference in a group of many that were caught).
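As a rough illustration of this fallback order, here is a minimal Python sketch; the extractor and checking functions are hypothetical stand-ins for the real SPIRES tools, and the last two levels are human-driven:

def extract_references(paper, latex_parse, pdf_parse, cut_and_paste, human_check):
    # Level 1: LaTeX parser; in practice "failure" is also judged by a human.
    refs = latex_parse(paper)
    if not refs:
        # Level 2: PDF parser, tried only when LaTeX fails.
        refs = pdf_parse(paper)
    if not refs:
        # Level 3: a user cut-and-pastes the references from the PDF.
        refs = cut_and_paste(paper)
    # Level 4: hand checking of extracted refs and entry of missed ones.
    return human_check(paper, refs)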

Initial Comparison

  • Use data from archives hep-ph and hep-th from 0603
  • Invenio (tony) provides
    • XML with the arXiv id of the record, followed by their extracted references
  • SPIRES (tcb) provides
    • extracted data from the LaTeX-based extractor, pre-human checks
    • citations post-human checks (these should be the most accurate available, currently estimated at 95% or so)
    • in some cases I can provide results of PDF extraction, but we only do that when LaTeX fails...
      • Could re-run this set through pdf to get a 3rd dataset

Reports to generate

  • Fraction that Invenio gets that SPIRES LaTeX misses
  • Vice-versa
  • Fraction that each miss compared to human checked results
  • Ones that both miss (look for patterns)
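A minimal sketch of how these reports could be computed, assuming each extractor's output has been normalized to a Python set of cite keys; the argument names invenio, spi_tex, and human are illustrative:

def reports(invenio, spi_tex, human):
    # Missed fractions are measured against the human-checked set,
    # taken here as the best available ground truth.
    return {
        "invenio_got_tex_missed": invenio - spi_tex,
        "tex_got_invenio_missed": spi_tex - invenio,
        "invenio_missed_fraction": len(human - invenio) / len(human),
        "tex_missed_fraction": len(human - spi_tex) / len(human),
        "both_missed": human - invenio - spi_tex,  # inspect for patterns
    }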

CDS Data for the comparison

Here is an example of the output of our reference extraction:

<record>
<controlfield tag="001">932695</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">[1]</subfield>
<subfield code="m">S. Godfray and J. Napolitano,</subfield>
<subfield code="s">Rev. Mod. Phys. 71 (1999) 1411</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">[1]</subfield>
<subfield code="m">F. E. Close and N. A. Tornqvist,</subfield>
<subfield code="s">J. Phys. G 28 (2002) R249</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">[2]</subfield>
<subfield code="m">BABAR Collaboration,</subfield>
<subfield code="s">Phys. Rev. Lett. 90 (2003) 242001</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">[3]</subfield>
<subfield code="m">S. Godfrey and N. Isgur,</subfield>
<subfield code="s">Phys. Rev. D 32 (1985) 189</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">[3]</subfield>
<subfield code="m">S. Godfrey and R. Kokoshi,</subfield>
<subfield code="s">Phys. Rev. D 43 (1991) 1679</subfield>
</datafield>
...
...
...
</record>

The controlfield tag contains the record ID, which is cross-referenced to the standard arXiv name, e.g. 932689 = hep-ph/0603001.

The relevant tags are:

  • "m" names or miscellaneous
  • "s" recognized journal abbreviation
  • "r" recognized report abbreviation
  • "o" Ordinal number of reference line
  • "z" url
  • "u" url descriptor
The relevant files are attached below.
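For orientation, here is a minimal Python sketch of pulling these subfields out of a record like the one above. It assumes plain, non-namespaced MARCXML as shown; a real CDS export may carry the MARC21 namespace:

import xml.etree.ElementTree as ET

def extract_references(xml_text):
    # Yield (ordinal, names/misc, journal) for each 999C5 datafield.
    record = ET.fromstring(xml_text)
    for field in record.findall("datafield"):
        if (field.get("tag"), field.get("ind1"), field.get("ind2")) != ("999", "C", "5"):
            continue
        subfields = {sf.get("code"): sf.text for sf in field.findall("subfield")}
        yield subfields.get("o"), subfields.get("m"), subfields.get("s")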

Comparison Results

As of 8/16 I have redone this and replaced the files and information here.

OK, I've finally built a comparison program for these cite extractors. I've posted some raw SPIRES data as well, in case someone wants to double-check. I have two types of SPIRES data: the raw LaTeX-based parser output (which I call SPI Raw or TeX in the comparisons) and the final SPIRES version (which I call SPIRES), i.e. the raw output supplemented with PDF extraction if needed and human work if needed, as described above.

I noted that one of the papers in the CDS sample:

hep-th/0603148

has been withdrawn from arXiv, so I excluded that from the comparison.

I have excluded from the comparison any paper that completely failed extraction by any method. This is correct if we are essentially considering the order in which to try these extractors (i.e. their reliability when they report a successful extraction). This eliminated many papers where CDS failed to parse due to pdf errors. It also excluded two papers, hep-th/0603068 and hep-th/0603221, that SPIRES failed to extract via TeX.

Note that the numbers for CDS are lower than it self-reports, because I am excluding all "m" cites, since these are not attributable to any paper by current means, and the SPIRES extractors do not even attempt to extract them. These m cites are still useful, though, especially for getting conference-proceedings cites, so we will want to use them in the future; they are a significant feature of the CDS extractor. Note also that I am counting double cites as only one. These are cases where CDS or SPI TeX extracted an eprint and a journal as separate cites even though they are actually the same; such cases are listed at the top of the output file. There are other cases too, where CDS extracted the same cite several times, but I am not sure why.

I took the output of each extractor, checked each paper against SPIRES for possible eprint/journal info, stored only the eprint number if there was one (and the journal info only if there was no eprint), and then compared the results on that basis. Thus the output may show that CDS beat SPIRES to an eprint cite when it actually extracted a journal cite (that happened to match an eprint). This extra step avoids miscounting in cases where one extractor got the eprint and the other got the journal. A sketch of this normalization follows.
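A minimal Python sketch of the normalization just described; the lookup_eprint helper is hypothetical, standing in for the SPIRES eprint/journal lookup:

def normalize(cites, lookup_eprint):
    # Map each cite to its eprint number if SPIRES knows one,
    # otherwise fall back to the journal string.
    keys = set()
    for cite in cites:
        eprint = cite.get("eprint")
        if not eprint and cite.get("journal"):
            eprint = lookup_eprint(cite["journal"])  # may return None
        keys.add(eprint or cite.get("journal"))
    return keys

def compare(cds_cites, spi_cites, lookup_eprint):
    cds = normalize(cds_cites, lookup_eprint)
    spi = normalize(spi_cites, lookup_eprint)
    return len(cds & spi), len(cds - spi), len(spi - cds)  # matches, only CDS, only SPI

Using sets here also collapses the double cites mentioned above to a single key.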

Now that we are taking only papers for which an extraction did not report an error, the CDS extractor outperforms both SPIRES Raw TeX extractor, and SPIRES final checked version:

------------------------- Overall -------------------------

Total papers: 228    Papers w/match: 225
CDS/SPI matches: 8334    CDS: 8829    SPIRES: 8801    SPI raw: 8641

Only in CDS: 607    Only in SPI raw: 427
Only in CDS: 450    Only in SPI final: 428
Only in SPI final: 568    Only in SPI raw: 407

This suggests that, at the very least, the CDS extractor should be run in place of the LaTeX extractor as soon as possible, with the LaTeX extractor serving as a backup for when the pdf is unreadable.

We should compare other archives, since various archives may have different citing patterns, but these results are encouraging in that we can see some immediate improvement.

However, before we get too excited, we need to look by hand at the 450 cites that differ between CDS and SPIRES, as I'm not yet sure what these are.

Comments on the comparison (Tony O)

  • I forgot to tell you that there is information in the "a" tag. If you split this data on whitespace and then split the third word ([2] in Python) on hyphens, you get statistics per input file, including an error code, i.e. (x,y,error,#rep,#jour,#url,#misc) = word[2].split('-'), where (a parsing sketch appears after this list):
    • error = 0 (ok), 2 or 3 (pdftotext gave nothing), 4 (we could not find the start of the reference section). We use pdftotext to create a .txt file for the reference analysis.
    • Looking through the 256 cases for hep-th we can see:
      • Cases (error 2) where pdftotext returns nothing - it claims the pdf file is corrupt. I suggest that these cases be removed from the comparison - there is no reason to think they would introduce a bias in the comparison results. They are: hep-th/0603002 hep-th/0603003 hep-th/0603005 hep-th/0603006 hep-th/0603007 hep-th/0603009 hep-th/0603025 hep-th/0603048 hep-th/0603098 hep-th/0603107 hep-th/0603108 hep-th/0603109 hep-th/0603113 hep-th/0603114 hep-th/0603143 hep-th/0603146 hep-th/0603148 hep-th/0603150 hep-th/0603151 hep-th/0603154 hep-th/0603155 hep-th/0603156.
      • Cases (error 4) where there was a start to a reference section but we did not find it - hep-th/0603136 hep-th/0603101 - these should be kept in the comparison.
      • Cases (error 4) where we could not find a reference section because the txt file is incomprehensible - e.g. xpdf claims the file is corrupt, although pdftotext does not actually complain. These cases are hep-th/0603171 hep-th/0603201 hep-th/0603214 hep-th/0603220 hep-th/0603238 and should be removed from the comparison - a human could not understand the output.
  • Turning now to compare-hep-th-0603.txt, I often do not understand the number of citations attributed to CDS. Here is a small list of the first few cases (where the error is 0), comparing your numbers to what I think is contained both in the "a" tag info and in the xml I submitted.
  • hep-th/0603001 CDS/SPI Matches: 33 CDS: 33 SPIRES: 32 SPI raw: 61 ; CDS did find 57 citations
  • hep-th/0603004 CDS/SPI Matches: 66 CDS: 68 SPIRES: 73 SPI raw: 76 ; CDS did find 77 citations
  • hep-th/0603008 CDS/SPI Matches: 71 CDS: 71 SPIRES: 69 SPI raw: 125; CDS did find 127 citations
  • hep-th/0603010 CDS/SPI Matches: 52 CDS: 53 SPIRES: 54 SPI raw: 83 ; CDS did find 82 citations
  • hep-th/0603011 CDS/SPI Matches: 66 CDS: 71 SPIRES: 67 SPI raw: 111; CDS did find 114 citations
  • hep-th/0603012 CDS/SPI Matches: 45 CDS: 47 SPIRES: 45 SPI raw: 80 ; CDS did find 81 citations
  • hep-th/0603013 CDS/SPI Matches: 27 CDS: 27 SPIRES: 26 SPI raw: 46 ; CDS did find 47 citations
  • hep-th/0603014 CDS/SPI Matches: 28 CDS: 28 SPIRES: 29 SPI raw: 50 ; CDS did find 49 citations
  • hep-th/0603015 CDS/SPI Matches: 70 CDS: 70 SPIRES: 70 SPI raw: 163; CDS did find 107 citations
  • hep-th/0603016 CDS/SPI Matches: 35 CDS: 39 SPIRES: 37 SPI raw: 66 ; CDS did find 67 citations
So in general I think CDS found significantly more citations than you quote - in fact a number usually close to SPI raw. I guess this could be due to some combining of references.
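For concreteness, here is a minimal Python sketch of the "a" tag parsing described above. The field layout is taken from the list; the meaning of x and y is not spelled out here, so they are passed through untouched:

def parse_a_tag(a_value):
    # Third whitespace-separated word, split on hyphens:
    # (x, y, error, #rep, #jour, #url, #misc)
    x, y, error, n_rep, n_jour, n_url, n_misc = a_value.split()[2].split("-")
    return {"error": int(error), "reports": int(n_rep),
            "journals": int(n_jour), "urls": int(n_url), "misc": int(n_misc)}

Per the error codes above: 0 = ok, 2 or 3 = pdftotext gave nothing, 4 = the start of the reference section was not found.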

Topic attachments

Attachment                   Size     Date        Who           Comment
all_ids.hep-ph-0603          6.0 K    2007-07-25  TonyO         Cross-reference list of CDS IDs to arXiv filenames
all_ids.hep-th-0603          5.8 K    2007-07-25  TonyO         Same as above for hep-th
compare.hep-th.0603.txt      143.7 K  2007-08-17  TravisBrooks  Full comparison output
hep-ph-0603.xml.gz           375.7 K  2007-07-25  TonyO         Zipped XML output for hep-ph-0603
hep-th-0603.xml.gz           393.9 K  2007-07-25  TonyO         Zipped XML output for hep-th-0603
hep-th.0603.latex.txt        207.6 K  2007-08-05  TravisBrooks  SPIRES raw LaTeX extractor output
hep-th.0603.proofed.txt.gz   83.6 K   2007-08-05  TravisBrooks  SPIRES final output for 0603 hep-th (raw+pdf+manual)