Comments on the indexes that will need to be created explicitly in INSPIRE.
Logical field definitions
field name | marc tag |
abstract | 520__a |
affiliation | 100__u 700__u 902__a |
anyfield | 035__a 035__z 037__a 037__c 041__a 100__a 100__q 100__u 210__a 245__a 246__a 500__a 520__a 6531_a 693__e 695__a 700__a 700__q 700__u 710__a 710__g 773__a 773__c 773__p 773__t 773__v 773__w 773__x 773__y 902__a 242__a 242__y 269__c 65017a |
author | 100__a 100__q 700__a 700__q |
caption | 8564 |
citedby | |
collaboration | 710__g |
collection | 980__a |
datecreated | |
datemodified | |
doi | 773__a |
exactauthor | 100__a 100__q 700__a 700__q |
experiment | 693__e |
fulltext | 8564_u |
journal | 773__% |
keyword | 6531_a 695__a |
recid | 001 |
reference | 999C5r 999C5s |
refersto | |
reportnumber | 037__a |
subject | 65017a |
title | 210__a 245__a 246__a 242__a |
year | 773__y 269__c |
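As a rough illustration of how the table can be read, the sketch below stores a few of its rows as a Python mapping and looks up which logical fields cover a given MARC tag. The names `LOGICAL_FIELDS`, `tag_matches` and `fields_for_tag` are hypothetical, only a subset of rows is copied, and the reading of `%` as "any characters" (as in SQL LIKE) is an assumption rather than something stated on this page.

```python
import re

# Partial copy of the logical-field table above; the full table has more rows.
LOGICAL_FIELDS = {
    "abstract": ["520__a"],
    "affiliation": ["100__u", "700__u", "902__a"],
    "author": ["100__a", "100__q", "700__a", "700__q"],
    "collaboration": ["710__g"],
    "journal": ["773__%"],
    "keyword": ["6531_a", "695__a"],
    "citedby": [],  # no MARC tag is listed for this field
}


def tag_matches(pattern, tag):
    """Return True if a concrete MARC tag is covered by a table entry.

    Assumption: '%' stands for "any characters", as in SQL LIKE.
    """
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, tag) is not None


def fields_for_tag(tag):
    """List the logical fields whose tag patterns cover the given tag."""
    return sorted(
        field
        for field, patterns in LOGICAL_FIELDS.items()
        if any(tag_matches(pattern, tag) for pattern in patterns)
    )


if __name__ == "__main__":
    for tag in ("773__y", "100__u", "999C5r"):
        print(tag, fields_for_tag(tag))
```

With only these rows copied, a tag such as 773__y is covered by `journal` alone; in the full table it also appears under `anyfield` and `year`.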
Strategy
(From Tibor)
We have basically four options:
1) We can introduce a new word index for 'collaboration' that will match
for words in 710__g and do what is needed. But is such a new word
index worth it for every existing SPIRES index? Let's try to find
out.
2) We can search for exact phrase match, 710__g:"cms" behind the scenes,
but this will not give what we want, since CMS is only a part of the
collaboration name value.
3) We can search for partial phrase match, 710__g:'cms' behind the
scenes, which will find what we want. But this also has the
disadvantage of finding, say, an 'acmso' collaboration, because we would
be matching substrings here. So we can get false positives.
4) We can search silently for a regexp with word boundaries, that is
710__g:/[[:<:]]cms[[:>:]]/
which will find what we need as in #3, and
which will also eliminate false positives. This is good, but it also
has the drawback of being a "poor man's word index" emulation.
Moreover, the search is done by doing full table scans, which is very
inefficient and is reasonable only if there are not too
many distinct values of 710__g.
So let's browse for data stored in 710 to see what kind of information
and values we have there:
http://hep-inspire.net/search?p=aaa&f=710__g&rg=200&action_browse=Browse
This indicates that solution #1 is probably the one we want.
BTW, the speed of #3 seems to be okay:
http://hep-inspire.net/search?p=710__g%3A%27cms%27
but the speed of #4 seems to be quite slow:
http://hep-inspire.net/search?as=1&m1=r&p1=%5B%5B%3A%3C%3A%5D%5Dcms%5B%5B%3A%3E%3A%5D%5D&f1=title
So I'd propose to simply go for #1 for the CN emulation, unless we are
happy with the false positives obtained with #3...
The above musings should be applied to all existing SPIRES indexes so
that we can decide when we have to introduce a new word index into
Inspire and when we can simply use exact phrase matching ("foo") or
partial phrase matching ('foo') or regexp matching (/foo/) or
word-boundary regexp matching (/[[:<:]]foo[[:>:]]/) on a given MARC tag.
If a new word index is needed, then just tell me and I can add it as
necessary into Makefiles and friends.
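To make the trade-offs between the four options concrete, here is a minimal Python sketch that runs each matching mode over a handful of made-up 710__g values. It is not how the search engine implements these modes: the function names and the sample values are hypothetical, matching is assumed to be case-insensitive, and Python's `\b` stands in for the `[[:<:]]` / `[[:>:]]` word-boundary markers used in the regexp above.

```python
import re

# Made-up 710__g values, for illustration only.
VALUES = ["CMS", "CMS Collaboration", "ACMSO Collaboration", "ATLAS and CMS"]


def exact_phrase(term, values):
    """Option 2: the whole value must equal the term."""
    return [v for v in values if v.lower() == term.lower()]


def partial_phrase(term, values):
    """Option 3: the term may occur anywhere, even inside a word."""
    return [v for v in values if term.lower() in v.lower()]


def word_boundary_regexp(term, values):
    """Option 4: the term must be delimited by word boundaries.

    Every value is scanned, which mirrors the full-table-scan cost.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [v for v in values if pattern.search(v)]


def build_word_index(values):
    """Option 1: map each lower-cased word to the values containing it."""
    index = {}
    for value in values:
        for word in re.findall(r"\w+", value.lower()):
            bucket = index.setdefault(word, [])
            if value not in bucket:
                bucket.append(value)
    return index


def word_index_lookup(term, index):
    """Option 1 lookup: a single dictionary access, no scan."""
    return index.get(term.lower(), [])


if __name__ == "__main__":
    index = build_word_index(VALUES)
    print("exact:  ", exact_phrase("cms", VALUES))
    print("partial:", partial_phrase("cms", VALUES))
    print("regexp: ", word_boundary_regexp("cms", VALUES))
    print("index:  ", word_index_lookup("cms", index))
```

On these sample values the exact match misses the longer collaboration names, the partial match also returns the 'ACMSO' false positive, and the regexp and word-index variants return the same three values; the difference between those two is that the index is built once and then looked up, while the regexp scans every value on each query.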
SPIRES stats
Taking a random sample of SPIRES searches (the month of September 2007: 1409383 searches from humans).
These are the first indexes used in each search; any index with > 300 uses is shown. There are some caveats: Mike's list of searches using booleans should be examined to see whether some indexes are more commonly used in second place (this is true, for example, of 'date').
See also SlacSpiresSearch for similar stats for a smaller time period with all indexes listed. Note that there appear to be some significant discrepancies in the cases of x, journal-year and hidden-note (FIXME).
HEP - AUTHOR 552354
HEP - KEY 176172
HEP - CITATION 134632
HEP - JOURNAL-YEAR 78846
HEP - EPRINT 75077
HEP - EXACT-AUTHOR 74874
HEP - TITLE 70453
HEP - REPORT 44207
HEP - CONF-NUMBER 34270
HEP - KEYWORD 26718
HEP - X 15181
HEP - DK 8892
HEP - AFFILIATION 6311
HEP - COLLABORATION 5489
HEP - FC 2771
HEP - FIRST-AUTHOR 2629
HEP - J 2570
HEP - KEK 2040
HEP - ARX 1975
HEP - WHOIS 1850
HEP - TOPCIT 1416
HEP - EXP 1353
HEP - PPFA 1343
HEP - EE 1335
HEP - TEXKEY 1016
HEP - 979
HEP - COUNTRY 912
HEP - UNKNOWNREMOTEQU 896
HEP - FIND 788
HEP - DA 633
HEP - DU 447
Let us deal with these one at a time.
Solutions
- author: this has been (or will be) handled in the SearchEngineQueryParser. Once author searching is working well, do we want to consider applying it to all searches in "author" made through the regular Invenio search box? Not sure which is more usable.
- key: recID. Should enable direct access to the exact record; many of these are links from formats.
- citation: handled already
- journal-year: surprised that this is so high; possibly it is known as an index preferable to "date", which cannot be searched alone, and/or as a shortcut to identify published papers. This should be easy to produce by searching 773__y. However, if people are using this as a shortcut to
- eprint: clearly used a lot; no need to word index anything. Simply searching 037 and 088 should cause no problems with false positives unless people are searching substrings like '0804', which would match some non-arXiv report numbers, in which case we need to check 037__9 as well. To start with, ignore this possibility.
- exact-author: handled in SearchEngineQueryParser
- title: word index clearly needed, and presumably already exists
- report: same as eprint, though we split report on "-" or "/", which allows things like "find report DESY and author Holtkamp" and might be significantly used. Check this. FIXME
- conf-number: no word indexing needed; search 111__g
- keyword: this should be discussed with Annette. Currently keyword is a word index, but it hasn't always been. There are good arguments both ways. FIXME
  - Pro-word: easier searching, matches free keywords better
  - Con-word: better discrimination and matching of exact phrases
  - Note that this keyword index includes DESY keywords, author-supplied keywords, as well as words from title and abstract (to enable keyword searching before anyone has keyworded the paper)
- x: exact-citation. I have no idea why this would ever be used by a user; it should transition to citation above. This must be linked from somewhere? FIXME
- dk: desy-keyword. Can we add a condition on 653__9:DESY to do this, or should we just lump it with keyword above?
- affiliation: there should really be some sort of indirection here (search the inst collection, find the key, then search for the key in 100/700__i). However, that is more than we currently do. Affiliation is currently not a word index. AW is, but is rarely used. I think a word index is not needed (in fact it creates odd hits in cases with similar inst names...).
- collaboration: word index is needed
- fc: codes should become collections
- first-author: please note this; it should transition as part of the SPIRES Search Syntax, but using only 100__a
- j: journal info; no word index needed
- kek: no word index needed; however, this info has not been imported into INSPIRE yet, it is in a secondary database. FIXME
- arx: archive category, needed ever since the new arXiv numbers; 694__a, no word index needed
- topcit: replace with new citation range searches
- exp: these are controlled codes in 693__e, broken on '-', so a word index breaking the value on "-" would be useful to enable, say, "find exp BABAR"
- ppfa: this needs further investigation to explain these searches. FIXME
- EE: exact version of exp; directly search 693__e
- texkey: no word index needed; search 035__a. Very distinctive string, no worry about false positives.
- country: this is an indirect index, through the institutions file. Unable to reproduce currently without importing the institutions file; save for later. FIXME
- da/du: dates added/updated. No need for word indexing; however, should we consider universal date searching procedures?
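The report and exp items above both rely on breaking a value on "-" (and, for report, "/") so that a single component such as DESY or BABAR can be searched. A minimal sketch of that tokenization follows; the record ids, the sample values and the names `split_code` and `build_index` are invented for illustration, and lower-casing the tokens is an assumption.

```python
import re


def split_code(value):
    """Break a report number or experiment code on '-' and '/'."""
    return [part.lower() for part in re.split(r"[-/]", value) if part]


def build_index(records):
    """Map each token to the sorted list of record ids that contain it."""
    index = {}
    for recid, value in records.items():
        for token in split_code(value):
            index.setdefault(token, set()).add(recid)
    return {token: sorted(ids) for token, ids in index.items()}


# Hypothetical record ids and values, for illustration only.
EXPERIMENTS = {1: "SLAC-PEP2-BABAR", 2: "CERN-LHC-CMS", 3: "CERN-LHC-ATLAS"}
REPORTS = {1: "DESY-08-001", 2: "SLAC-PUB-1234", 3: "DESY/T-08-02"}

if __name__ == "__main__":
    exp_index = build_index(EXPERIMENTS)
    report_index = build_index(REPORTS)
    print("find exp BABAR   ->", exp_index.get("babar", []))
    print("find report DESY ->", report_index.get("desy", []))
```

An exact search such as EE would bypass this index and compare the whole stored value instead.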
Conclusions
Few word indexes are actually needed. Most FIXMEs occur at the low end of the stats (a few percent or less); note, however, that given the volume, even these indexes are still searched several times a day.
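The volume argument can be checked with a little arithmetic on the counts listed under SPIRES stats. The sketch below uses the sample total and a few of the per-index counts from that list; treating the sample as covering 30 days is an assumption, and the helper names are hypothetical.

```python
TOTAL_SEARCHES = 1409383  # human searches in the sample month
DAYS_IN_SAMPLE = 30       # assumption: September has 30 days

# A few first-index counts copied from the stats list.
FIRST_INDEX_COUNTS = {
    "AUTHOR": 552354,
    "COLLABORATION": 5489,
    "EXP": 1353,
    "COUNTRY": 912,
    "DU": 447,
}


def share_percent(count, total=TOTAL_SEARCHES):
    """Share of all sampled searches, in percent."""
    return 100.0 * count / total


def per_day(count, days=DAYS_IN_SAMPLE):
    """Average number of searches per day over the sample period."""
    return count / days


if __name__ == "__main__":
    for name, count in FIRST_INDEX_COUNTS.items():
        print(f"{name:15s} {share_percent(count):6.2f}%  {per_day(count):8.1f}/day")
```

Even DU, the least used index in the list with 447 searches, works out to roughly 15 searches a day under this assumption.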