DevelopmentIndexes < Inspire

Inspire Web>DevelopmentIndexes (2011-03-04, AnnetteHoltkamp)

Comments about the indexes that will need to be created explicitly in INSPIRE.

Logical field definitions

field name	marc tag
abstract	520__a
affiliation	100__u 700__u 902__a
anyfield	035__a 035__z 037__a 037__c 041__a 100__a 100__q 100__u 210__a 245__a 246__a 500__a 520__a 6531_a 693__e 695__a 700__a 700__q 700__u 710__a 710__g 773__a 773__c 773__p 773__t 773__v 773__w 773__x 773__y 902__a 242__a 242__y 269__c 65017a
author	100__a 100__q 700__a 700__q
caption	8564
citedby
collaboration	710__g
collection	980__a
datecreated
datemodified
doi	773__a
exactauthor	100__a 100__q 700__a 700__q
experiment	693__e
fulltext	8564_u
journal	773__%
keyword	6531_a 695__a
recid	001
reference	999C5r 999C5s
refersto
reportnumber	037__a
subject	65017a
title	210__a 245__a 246__a 242__a
year	773__y 269__c
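As a rough illustration of how such a table could be used, the sketch below keeps a few of the rows above in a Python dictionary and expands a logical field into one search term per MARC tag. The names `LOGICAL_FIELDS`, `tags_for` and `expand` are hypothetical and not part of Invenio; this only shows the idea of the mapping, under the assumption that a logical field is a plain alias for its list of tags.

```python
# A few rows copied from the table above; the dictionary itself is an
# illustrative assumption, not the way Invenio stores this information.
LOGICAL_FIELDS = {
    "author": ["100__a", "100__q", "700__a", "700__q"],
    "collaboration": ["710__g"],
    "keyword": ["6531_a", "695__a"],
    "title": ["210__a", "245__a", "246__a", "242__a"],
    "year": ["773__y", "269__c"],
}


def tags_for(field):
    """Return the MARC tags behind a logical field.

    Anything that is not a known logical field is assumed to be a
    MARC tag already and is returned unchanged.
    """
    return LOGICAL_FIELDS.get(field, [field])


def expand(field, phrase):
    """Build one exact-phrase term (tag:"phrase") per MARC tag."""
    return ['%s:"%s"' % (tag, phrase) for tag in tags_for(field)]


if __name__ == "__main__":
    for term in expand("title", "higgs"):
        print(term)
```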

Strategy

(From Tibor) We have basically four options:

1) We can introduce a new word index for 'collaboration' that will match for words in 710__g and do what is needed. But is such a new word index worth it for every existing SPIRES index? Let's try to find out.

2) We can search for exact phrase match, 710__g:"cms" behind the scenes, but this will not give what we want, since CMS is only a part of the collaboration name value.

3) We can search for partial phrase match, 710__g:'cms' behind the scenes, which will find what we want. But this also has the disadvantage of finding, say, an 'acmso' collaboration, because we would be matching substrings here. So we can get false positives.

4) We can search silently for regexp and word boundaries, that is

 710__g:/[[:<:]]cms[[:>:]]/

which will find what we need as in #3, and which will also eliminate false positives. This is good, but it has the drawback of being a "poor man's word index" emulation. Moreover, the search is done by full table scans, which is very inefficient and is reasonable only if there are not too many distinct values of 710__g.
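To make the difference between options 2 to 4 concrete, here is a minimal Python sketch that applies the three matching styles to a handful of values. The collaboration values are made up for illustration (only 'acmso' comes from the text above), case-insensitive matching is an assumption, and Python's `\b` stands in for the `[[:<:]]` and `[[:>:]]` word-boundary markers. Like options 2 to 4, every function here has to look at every value.

```python
import re

# Made-up 710__g values, for illustration only.
VALUES = ["CMS", "CMS Collaboration", "ACMSO Collaboration"]


def exact_phrase(term, value):
    """Option 2: the whole value must equal the term."""
    return value.lower() == term.lower()


def partial_phrase(term, value):
    """Option 3: the term may occur anywhere, even inside a word."""
    return term.lower() in value.lower()


def word_boundary(term, value):
    """Option 4: the term must occur as a whole word."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, value, re.IGNORECASE) is not None


if __name__ == "__main__":
    for matcher in (exact_phrase, partial_phrase, word_boundary):
        hits = [v for v in VALUES if matcher("cms", v)]
        print(matcher.__name__, hits)
```

Exact phrase matching misses values where the term is only part of the name, partial phrase matching also returns the 'ACMSO' value, and the word-boundary version returns only the values that contain the term as a whole word.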

So let's browse for data stored in 710 to see what kind of information and values we have there:

http://hep-inspire.net/search?p=aaa&f=710__g&rg=200&action_browse=Browse

This indicates that solution #1 is probably wanted.

BTW, the speed of #3 seems to be okay:

http://hep-inspire.net/search?p=710__g%3A%27cms%27

but the speed of #4 seems to be quite slow:

http://hep-inspire.net/search?as=1&m1=r&p1=%5B%5B%3A%3C%3A%5D%5Dcms%5B%5B%3A%3E%3A%5D%5D&f1=title

So I'd propose to simply go for #1 for the CN emulation, unless we are happy with the false positives obtained with #3...
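The two test links above carry their queries in URL-encoded form. As a small aid for reading them, the following sketch decodes the query strings with Python's standard library; the helper name `query_params` is hypothetical, and the URLs are copied verbatim from the text.

```python
from urllib.parse import parse_qs, urlsplit

PARTIAL_URL = "http://hep-inspire.net/search?p=710__g%3A%27cms%27"
REGEXP_URL = (
    "http://hep-inspire.net/search?as=1&m1=r"
    "&p1=%5B%5B%3A%3C%3A%5D%5Dcms%5B%5B%3A%3E%3A%5D%5D&f1=title"
)


def query_params(url):
    """Return the decoded query parameters of a URL, one value per key."""
    parsed = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in parsed.items()}


if __name__ == "__main__":
    print(query_params(PARTIAL_URL))
    print(query_params(REGEXP_URL))
```

Decoded, the first link is the partial phrase search on 710__g from #3, while the second carries the word-boundary pattern from #4 with `f1=title`, so that timing was taken on the title field rather than on 710__g.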

The above musings should be applied to all existing SPIRES indexes so that we can decide when we have to introduce a new word index into Inspire and when we can simply use exact phrase matching ("foo"), partial phrase matching ('foo'), regexp matching (/foo/), or word-boundary regexp matching (/[[:<:]]foo[[:>:]]/) on a given MARC tag.

If a new word index is needed, then just tell me and I can add it as necessary into Makefiles and friends.
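For comparison with the scan-based options, a word index in the sense of option 1 can be pictured as an inverted index from words to record IDs. The sketch below is a toy version under stated assumptions: the record IDs and values are invented, words are taken to be runs of letters and digits, and the function names are hypothetical. It is not how Invenio builds its indexes.

```python
import re
from collections import defaultdict


def build_word_index(records):
    """Map each lower-cased word to the set of record IDs containing it.

    `records` is a dict of {recid: field value}.
    """
    index = defaultdict(set)
    for recid, value in records.items():
        for word in re.findall(r"\w+", value.lower()):
            index[word].add(recid)
    return index


def search(index, word):
    """Look a single word up in the index; no scan over the values."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    records = {
        1: "CMS Collaboration",
        2: "ACMSO Collaboration",
        3: "ATLAS and CMS",
    }
    index = build_word_index(records)
    print(search(index, "cms"))
```

The lookup touches only the entry for the requested word, which is why it avoids both the false positives of substring matching and the full scan of the regexp approach.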

SPIRES stats

Taking a random sample of SPIRES searches (month of Sept 2007, 1409383 searches from humans). These are the first indexes used in the search; any index with > 300 uses is shown. There are some caveats: Mike's list of searches using booleans should be examined to see if some indexes are more commonly used in second place (this is true, for example, of 'date').

See also SlacSpiresSearch for similar stats for a smaller time period with all indexes listed. Note that there appear to be some significant discrepancies in the cases of x, journal-year and hidden-note (FIXME).

   HEP - AUTHOR                       552354
   HEP - KEY                          176172
   HEP - CITATION                     134632
   HEP - JOURNAL-YEAR                  78846
   HEP - EPRINT                        75077
   HEP - EXACT-AUTHOR                  74874
   HEP - TITLE                         70453
   HEP - REPORT                        44207
   HEP - CONF-NUMBER                   34270
   HEP - KEYWORD                       26718
   HEP - X                             15181
   HEP - DK                             8892
   HEP - AFFILIATION                    6311
   HEP - COLLABORATION                  5489
   HEP - FC                             2771
   HEP - FIRST-AUTHOR                   2629
   HEP - J                              2570
   HEP - KEK                            2040
   HEP - ARX                            1975
   HEP - WHOIS                          1850
   HEP - TOPCIT                         1416
   HEP - EXP                            1353
   HEP - PPFA                           1343
   HEP - EE                             1335
   HEP - TEXKEY                         1016
   HEP -                                 979
   HEP - COUNTRY                         912
   HEP - UNKNOWNREMOTEQU                 896
   HEP - FIND                            788
   HEP - DA                              633
   HEP - DU                              447
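To put the raw counts in perspective, the following sketch converts a count into a share of all sampled searches and into an average number of uses per day. Only a few rows of the listing are copied in; the total and the counts come from the text, the 30-day length is that of September, and the function names are hypothetical.

```python
TOTAL_SEARCHES = 1409383  # human searches in the Sept 2007 sample
DAYS_IN_SAMPLE = 30       # September has 30 days

# A few rows copied from the listing above.
FIRST_INDEX_COUNTS = {
    "AUTHOR": 552354,
    "KEY": 176172,
    "CITATION": 134632,
    "COLLABORATION": 5489,
    "COUNTRY": 912,
    "DU": 447,
}


def share_percent(count, total=TOTAL_SEARCHES):
    """Share of all sampled searches, in percent."""
    return 100.0 * count / total


def per_day(count, days=DAYS_IN_SAMPLE):
    """Average number of uses per day over the sample period."""
    return count / days


if __name__ == "__main__":
    for name, count in FIRST_INDEX_COUNTS.items():
        print("%-14s %6.2f%% %9.1f per day"
              % (name, share_percent(count), per_day(count)))
```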

Let us deal with these one at a time.

Solutions

author : this has been (or will be) handled in the SearchEngineQueryParser. Once author searching is working well, do we want to consider applying it to all searches in "author" using the regular Invenio search box? Not sure which is more usable.
key : recID - should enable direct access to exact record ...many of these are links from formats
citation : handled already
journal-year : surprised that this is so high; possibly it is known as an index preferable to "date", which cannot be searched alone, and/or as a shortcut to identify published papers. This should be easy to produce by searching 773__y. However, if people are using this as a shortcut to
eprint : clearly used a lot, no need to word index anything. Simply searching 037 and 088 should cause no problem with false positives unless people are searching substrings like '0804', which would match some non-arXiv report numbers; in that case we need to check 037__9 as well. To start with, ignore this possibility.
exact-author : handled in SearchEngineQueryParser
title : word index clearly needed, and presumably already exists
report : same as eprint...though we split report on "-" or "/" which allows things like "find report DESY and author Holtkamp" and might be significantly used...check this FIXME
conf-number : No word indexing needed 111__g
keyword : this should be discussed with Annette. Currently keyword is a word index, but it hasn't always been. There are good arguments both ways. FIXME
- Pro-word : easier searching, matches with free keywords better
- Con-word : better discrimination and matching of exact phrases
- Note that this keyword index includes DESY keywords, author supplied keywords, as well as words from title and abstract (to enable keyword searching before anyone has keyworded the paper)
x : exact-citation. I have no idea why this would ever be used by a user; should transition to citation above. This must be linking from somewhere? FIXME
dk : desy-keyword - can we add a condition on 653__9:DESY to do this? or should we just lump it with keyword above?
affiliation : There should really be some sort of indirection here (search the inst collection to find the key, then search for the key in 100/700__i). However, that is more than we currently do. Affiliation is currently not a word index. AW is, but is rarely used. I think a word index is not needed (in fact it creates odd hits in cases with similar inst names...)
collaboration : word index is needed
fc : codes should become collections
first-author : please note this, should transition as part of the SPIRES Search Syntax, but using only 100__a
j : journal info no word index needed
kek : no word index needed, however, this info has not been imported yet to INSPIRE, it is in a secondary database....FIXME
arx : archive category, needed ever since the new arXiv numbers; 694__a, no word index needed
topcit : replace with new citation range searches
exp : these are controlled codes in 693__e, broken on '-' so a word index breaking the value on "-" would be useful to enable, say, "find exp BABAR"
ppfa : this needs further investigation to explain these searches - FIXME
EE : exact version of exp, directly search 693__e
texkey : no word index needed, search 035__a very distinctive string, no worry about false positives
country : this is an indirect index, through the institutions file. Unable to reproduce currently, without importing institutions file, save for later FIXME
da/du : dates added/updated. No need for word indexing; however, should we consider universal date searching procedures?
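The report and exp entries above both rely on breaking a value on "-" (and, for report, on "/") so that a single component can be searched. A minimal sketch of that tokenisation follows. The function names are hypothetical, the sample values are illustrative and not taken from the database, and keeping the full value alongside its parts is an assumption made so that exact searches still work.

```python
import re


def split_tokens(value):
    """Lower-case a value and split it on '-' and '/'.

    The full value is kept as the first token so that an exact
    search for the whole string still matches.
    """
    value = value.lower()
    parts = [part for part in re.split(r"[-/]", value) if part]
    return [value] + [part for part in parts if part != value]


def matches(term, value):
    """True if the term equals the whole value or one of its parts."""
    return term.lower() in split_tokens(value)


if __name__ == "__main__":
    # Illustrative values only.
    print(split_tokens("DESY-95-001"))
    print(matches("BABAR", "SLAC-PEP2-BABAR"))
```

With this kind of splitting, a search such as "find exp BABAR" or "find report DESY" can match on one component without falling back to substring matching.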

Conclusions

Few word indexes are actually needed. Most FIXMEs occur at the low end of the stats (a few % or less); note, however, that due to volume, a few percent is still several times a day.

Topic revision: r3 - 2011-03-04 - AnnetteHoltkamp
