Comparison of SLAC and CERN reference extraction

Introduction

Note that SPIRES runs 4 levels of extraction:

  1. LaTeX parser - first cut; picks up embedded SPIRES tags and most cites
  2. PDF parser - used only if LaTeX fails (failure is determined by a human comparing the paper against the extracted cites, or by the extractor returning 0 citations)
  3. Cut&paste parser - used if the PDF parser fails: the PDF is opened on a user's screen and the references are cut and pasted into the parser
  4. Human checks - a person runs through all the extracted references, checking them by hand (a difficult job that may not be that accurate, depending on time of day, etc.) and entering missed references. Note that certain failures are unlikely to be caught by humans (e.g. one missed reference in a group of many that were caught).
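As a rough illustration of this fallback order, here is a minimal Python sketch; the extractor and checking functions are hypothetical stand-ins for the real SPIRES tools, and the last two levels are human-driven:

def extract_references(paper, latex_parse, pdf_parse, cut_and_paste, human_check):
    # Level 1: LaTeX parser; in practice "failure" is also judged by a human.
    refs = latex_parse(paper)
    if not refs:
        # Level 2: PDF parser, tried only when LaTeX fails.
        refs = pdf_parse(paper)
    if not refs:
        # Level 3: a user cut-and-pastes the references from the PDF.
        refs = cut_and_paste(paper)
    # Level 4: hand checking of extracted refs and entry of missed ones.
    return human_check(paper, refs)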

Initial Comparison

  • Use data from archives hep-ph and hep-th from 0603
  • Invenio (tony) provides
    • XML with the arXiv id of the record, followed by their extracted references
  • SPIRES (tcb) provides
    • extracted data from the LaTeX-based extractor, pre-human checks
    • citations post-human checks (these should be the most accurate available, currently estimated at 95% or so)
    • in some cases I can provide results of PDF extraction, but we only do that when LaTeX fails...
      • Could re-run this set through pdf to get a 3rd dataset

Reports to generate

  • Fraction that Invenio gets that SPIRES LaTeX misses
  • Vice-versa
  • Fraction that each miss compared to human checked results
  • Ones that both miss (look for patterns)
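A minimal sketch of how these reports could be computed, assuming each extractor's output has been normalized to a Python set of cite keys; the argument names invenio, spi_tex, and human are illustrative:

def reports(invenio, spi_tex, human):
    # Missed fractions are measured against the human-checked set,
    # taken here as the best available ground truth.
    return {
        "invenio_got_tex_missed": invenio - spi_tex,
        "tex_got_invenio_missed": spi_tex - invenio,
        "invenio_missed_fraction": len(human - invenio) / len(human),
        "tex_missed_fraction": len(human - spi_tex) / len(human),
        "both_missed": human - invenio - spi_tex,  # inspect for patterns
    }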

CDS Data for the comparison

Here is an example of the output of our reference extraction:

<record>
<controlfield tag="001">932695</controlfield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">[1]</subfield>
<subfield code="m">S. Godfray and J. Napolitano,</subfield>
<subfield code="s">Rev. Mod. Phys. 71 (1999) 1411</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">[1]</subfield>
<subfield code="m">F. E. Close and N. A. Tornqvist,</subfield>
<subfield code="s">J. Phys. G 28 (2002) R249</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">[2]</subfield>
<subfield code="m">BABAR Collaboration,</subfield>
<subfield code="s">Phys. Rev. Lett. 90 (2003) 242001</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">[3]</subfield>
<subfield code="m">S. Godfrey and N. Isgur,</subfield>
<subfield code="s">Phys. Rev. D 32 (1985) 189</subfield>
</datafield>
<datafield tag="999" ind1="C" ind2="5">
<subfield code="o">[3]</subfield>
<subfield code="m">S. Godfrey and R. Kokoshi,</subfield>
<subfield code="s">Phys. Rev. D 43 (1991) 1679</subfield>
</datafield>
...
...
...
</record>

The controlfield tag contains the record ID, which is cross-referenced to the standard arXiv name, e.g. 932689 = hep-ph/0603001.

The relevant tags are:

  • "m" names or miscellaneous
  • "s" recognized journal abbreviation
  • "r" recognized report abbreviation
  • "o" Ordinal number of reference line
  • "z" url
  • "u" url descriptor
The relevant files are attached below.
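For orientation, here is a minimal Python sketch of pulling these subfields out of a record like the one above. It assumes plain, non-namespaced MARCXML as shown; a real CDS export may carry the MARC21 namespace:

import xml.etree.ElementTree as ET

def extract_references(xml_text):
    # Yield (ordinal, names/misc, journal) for each 999C5 datafield.
    record = ET.fromstring(xml_text)
    for field in record.findall("datafield"):
        if (field.get("tag"), field.get("ind1"), field.get("ind2")) != ("999", "C", "5"):
            continue
        subfields = {sf.get("code"): sf.text for sf in field.findall("subfield")}
        yield subfields.get("o"), subfields.get("m"), subfields.get("s")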

Comparison Results

As of 8/16 I have redone this and replaced the files and information here.

OK, I've finally built a comparison program for these cite extractors. I've posted some raw SPIRES data as well, in case someone wants to double-check. I have two types of SPIRES data: the raw LaTeX-based parser output (which I call SPI Raw or TeX in the comparisons) and the final SPIRES version (which I call SPIRES), i.e. the raw output supplemented with PDF extraction if needed and human work if needed, as described above.

I noted that one of the papers in the CDS sample:

hep-th/0603148

has been withdrawn from arXiv, so I excluded that from the comparison.

I have excluded from the comparison any paper that completely failed extraction by any method. This is correct if we are essentially considering the order in which to try these extractors (i.e. their reliability when they report a successful extraction). This eliminated many papers where CDS failed to parse due to pdf errors. It also excluded two papers, hep-th/0603068 and hep-th/0603221, that SPIRES failed to extract via TeX.

Note that the numbers for CDS are lower than it self-reports, because I am excluding all "m" cites, since these are not attributable to any paper by current means, and the SPIRES extractors do not even attempt to extract them. These m cites are still useful, though, especially for getting conference-proceedings cites, so we will want to use them in the future; they are a significant feature of the CDS extractor. Note also that I am counting double cites as only one. These are cases where CDS or SPI TeX extracted an eprint and a journal as separate cites even though they are actually the same; such cases are listed at the top of the output file. There are other cases too, where CDS extracted the same cite several times, but I am not sure why.

I took the output of each extractor, checked each paper against SPIRES for possible eprint/journal info, stored only the eprint number if there was one (and the journal info only if there was no eprint), and then compared the results on that basis. Thus the output may show that CDS beat SPIRES to an eprint cite when it actually extracted a journal cite (that happened to match an eprint). This extra step avoids miscounting in cases where one extractor got the eprint and the other got the journal. A sketch of this normalization follows.
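A minimal Python sketch of the normalization just described; the lookup_eprint helper is hypothetical, standing in for the SPIRES eprint/journal lookup:

def normalize(cites, lookup_eprint):
    # Map each cite to its eprint number if SPIRES knows one,
    # otherwise fall back to the journal string.
    keys = set()
    for cite in cites:
        eprint = cite.get("eprint")
        if not eprint and cite.get("journal"):
            eprint = lookup_eprint(cite["journal"])  # may return None
        keys.add(eprint or cite.get("journal"))
    return keys

def compare(cds_cites, spi_cites, lookup_eprint):
    cds = normalize(cds_cites, lookup_eprint)
    spi = normalize(spi_cites, lookup_eprint)
    return len(cds & spi), len(cds - spi), len(spi - cds)  # matches, only CDS, only SPI

Using sets here also collapses the double cites mentioned above to a single key.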

Now that we are taking only papers for which an extraction did not report an error, the CDS extractor outperforms both SPIRES Raw TeX extractor, and SPIRES final checked version:

------------------------- Overall -------------------------

Total papers: 228    Papers w/match: 225
CDS/SPI matches: 8334    CDS: 8829    SPIRES: 8801    SPI raw: 8641

Only in CDS: 607    Only in SPI raw: 427
Only in CDS: 450    Only in SPI final: 428
Only in SPI final: 568    Only in SPI raw: 407

This suggests that, at the very least, the CDS extractor should be run in place of the LaTeX extractor as soon as possible, with the LaTeX extractor serving as a backup for when the pdf is unreadable.

We should compare other archives, since various archives may have different citing patterns, but these results are encouraging in that we can see some immediate improvement.

However, before we get too excited, we need to look by hand at the 450 cites that differ between CDS and SPIRES, as I'm not yet sure what these are.

Comments on the comparison (Tony O)

  • I forgot to tell you that there is information in the "a" tag. If you split this data on whitespace and then split the third word ([2] in Python) on hyphens, you get statistics per input file, including an error code, i.e. (x,y,error,#rep,#jour,#url,#misc) = word[2].split('-'), where (a parsing sketch appears after this list):
    • error = 0 (ok), 2 or 3 (pdftotext gave nothing), 4 (we could not find the start of the reference section). We use pdftotext to create a .txt file for the reference analysis.
    • Looking through the 256 cases for hep-th we can see:
      • Cases (error 2) where pdftotext returns nothing - it claims the pdf file is corrupt. I suggest that these cases be removed from the comparison - there is no reason to think they would introduce a bias in the comparison results. They are: hep-th/0603002 hep-th/0603003 hep-th/0603005 hep-th/0603006 hep-th/0603007 hep-th/0603009 hep-th/0603025 hep-th/0603048 hep-th/0603098 hep-th/0603107 hep-th/0603108 hep-th/0603109 hep-th/0603113 hep-th/0603114 hep-th/0603143 hep-th/0603146 hep-th/0603148 hep-th/0603150 hep-th/0603151 hep-th/0603154 hep-th/0603155 hep-th/0603156.
      • Cases (error 4) where there was a start to a reference section but we did not find it - hep-th/0603136 hep-th/0603101 - these should be kept in the comparison.
      • Cases (error 4) where we could not find a reference section because the txt file is incomprehensible - e.g. xpdf claims the file is corrupt, although pdftotext does not actually complain. These cases are hep-th/0603171 hep-th/0603201 hep-th/0603214 hep-th/0603220 hep-th/0603238 and should be removed from the comparison - a human could not understand the output.
  • Turning now to compare-hep-th-0603.txt, I often do not understand the number of citations attributed to CDS. Here is a small list of the first few cases (where the error is 0), comparing your numbers to what I think is contained both in the "a" tag info and in the xml I submitted.
  • hep-th/0603001 CDS/SPI Matches: 33 CDS: 33 SPIRES: 32 SPI raw: 61 ; CDS did find 57 citations
  • hep-th/0603004 CDS/SPI Matches: 66 CDS: 68 SPIRES: 73 SPI raw: 76 ; CDS did find 77 citations
  • hep-th/0603008 CDS/SPI Matches: 71 CDS: 71 SPIRES: 69 SPI raw: 125; CDS did find 127 citations
  • hep-th/0603010 CDS/SPI Matches: 52 CDS: 53 SPIRES: 54 SPI raw: 83 ; CDS did find 82 citations
  • hep-th/0603011 CDS/SPI Matches: 66 CDS: 71 SPIRES: 67 SPI raw: 111; CDS did find 114 citations
  • hep-th/0603012 CDS/SPI Matches: 45 CDS: 47 SPIRES: 45 SPI raw: 80 ; CDS did find 81 citations
  • hep-th/0603013 CDS/SPI Matches: 27 CDS: 27 SPIRES: 26 SPI raw: 46 ; CDS did find 47 citations
  • hep-th/0603014 CDS/SPI Matches: 28 CDS: 28 SPIRES: 29 SPI raw: 50 ; CDS did find 49 citations
  • hep-th/0603015 CDS/SPI Matches: 70 CDS: 70 SPIRES: 70 SPI raw: 163; CDS did find 107 citations
  • hep-th/0603016 CDS/SPI Matches: 35 CDS: 39 SPIRES: 37 SPI raw: 66 ; CDS did find 67 citations
So in general I think CDS found significantly more citations than you quote - in fact a number usually close to SPI raw. I guess this could be due to some combining of references.
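For concreteness, here is a minimal Python sketch of the "a" tag parsing described above. The field layout is taken from the list; the meaning of x and y is not spelled out here, so they are passed through untouched:

def parse_a_tag(a_value):
    # Third whitespace-separated word, split on hyphens:
    # (x, y, error, #rep, #jour, #url, #misc)
    x, y, error, n_rep, n_jour, n_url, n_misc = a_value.split()[2].split("-")
    return {"error": int(error), "reports": int(n_rep),
            "journals": int(n_jour), "urls": int(n_url), "misc": int(n_misc)}

Per the error codes above: 0 = ok, 2 or 3 = pdftotext gave nothing, 4 = the start of the reference section was not found.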

Topic attachments

Attachment                   Size     Date        Who           Comment
all_ids.hep-ph-0603          6.0 K    2007-07-25  TonyO         Cross-reference list of CDS IDs to arXiv filenames
all_ids.hep-th-0603          5.8 K    2007-07-25  TonyO         Same as above for hep-th
compare.hep-th.0603.txt      143.7 K  2007-08-17  TravisBrooks  Full comparison output
hep-ph-0603.xml.gz           375.7 K  2007-07-25  TonyO         Zipped XML output for hep-ph-0603
hep-th-0603.xml.gz           393.9 K  2007-07-25  TonyO         Zipped XML output for hep-th-0603
hep-th.0603.latex.txt        207.6 K  2007-08-05  TravisBrooks  SPIRES raw LaTeX extractor output
hep-th.0603.proofed.txt.gz   83.6 K   2007-08-05  TravisBrooks  SPIRES final output for 0603 hep-th (raw+pdf+manual)