Inspire Web>WebTopicList>PythonHelp (2021-01-18, KirstenSachs)

HELP page for INSPIRE use with IPython

Started by KS, to be completed by someone who knows more.

ssh to
inspire2.desy.de
inspire-old-01.cern.ch (prod), inspire-old-02.cern.ch, inspire-old-03.cern.ch, inspire-old-solr.cern.ch with #!/opt/cds-invenio/bin/python
inspirevm16.cern.ch (inspiretest June 2015)
....

Search & retrieve

Search_Engine

in the python program

from invenio.search_engine import FUNCTION

All functions are in https://github.com/inspirehep/invenio/blob/prod/modules/websearch/lib/search_engine.py

http://invenio-software.org/code-browser/invenio.search_engine-module.html OBSOLETE

to search

perform_request_search(req=None, cc='HEP', c=None, p='', f='', rg=25, sf='', so='d', sp='', rm='', of='id', ot='', aas=0, p1='', f1='', m1='', op1='', p2='', f2='', m2='', op2='', p3='', f3='', m3='', sc=0, jrec=0, recid=-1, recidb=-1, sysno='', id=-1, idb=-1, sysnb='', action='', d1='', d1y=0, d1m=0, d1d=0, d2='', d2y=0, d2m=0, d2d=0, dt='', verbose=0, ap=0, ln='en', ec=None, tab='', wl=50000)

This is the engine behind the web-interface.
Perform search or browse request. Return list of recIDs found, if of=id, otherwise create web page.

For the lengthy list of arguments see the attachment prs.txt. It is also useful for editing the search URL.

search_pattern(req=None, p=None, f=None, m=None, ap=0, of='id', verbose=0, ln='en', display_nearest_terms_box=True)

Search for complex pattern 'p' within field 'f' according to matching type 'm'.
'm' can have values: 'a'='all of the words', 'o'='any of the words', 'p'='phrase/substring', 'r'='regular expression', 'e'='exact value'.
Return hitset of recIDs as intbitset (default for 'of').

pattern can include 'field:pattern' - ATTENTION: no brackets!!!

SPIRES syntax is not supported.

It is possible to combine single searches using intersection / union

Add "not 980__c:DELETED" to exclude deleted records and 980__a:HEP to restrict to HEP
or intersect with corresponding collection

Indexes exclude deleted records, marctags include deleted records (e.g. title vs 245__a)

get_collection_reclist(coll)

The name of a collection is not necessarily equal to the tag in 980!

collections exclude DELETED records

There is a list for admins on the web incl. how they are defined - no access for ordinary people

in Python: collection_reclist_cache.cache
Alternative: from invenio.dbquery import run_sql; run_sql("select name from collection")

search_unit(p, f=None, m=None, wl=0)

Search for basic search unit defined by pattern 'p' and field 'f' and matching type 'm'. Return hitset of recIDs.

Experts only. This function is suitable as a low-level API.

to retrieve content

get_record(recid)

full record as dictionary with marc-tag as key.
E.g. record['700'][i][0] is a list for i-th author like [('a', 'Colferai, D.'), ('u', 'Florence U.'), ('u', 'INFN, Florence')]

record.get(k[,d])

returns a list record[k] if k is a key of record (a 3-digit marc field that has some value), otherwise d. d defaults to None.
E.g. references = record.get('999',[]) results in an empty list or a list of tuples where
([('r', 'arXiv:0812.2665'), ('s', 'AIP Conf.Proc.,1105,28')],'C','5','',58)
represents the 58th item:
999C5 $$rarXiv:0812.2665$$sAIP Conf.Proc.,1105,28
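
The structure described above can be tried out without a live INSPIRE instance. The following is only a sketch on a hand-built dictionary in the same shape as the examples given for get_record. The sample record, the blank indicators for the 700 field and the helper names subfield_values and to_marc_line are made up for illustration and are not part of Invenio.

```python
# Hand-built sample in the shape described for get_record(recid):
# tag -> list of (subfields, ind1, ind2, controlfield_value, position).
# The blank indicators (' ') and the position 12 are assumptions.
record = {
    '700': [
        ([('a', 'Colferai, D.'), ('u', 'Florence U.'), ('u', 'INFN, Florence')],
         ' ', ' ', '', 12),
    ],
    '999': [
        ([('r', 'arXiv:0812.2665'), ('s', 'AIP Conf.Proc.,1105,28')],
         'C', '5', '', 58),
    ],
}


def subfield_values(record, tag, code):
    """Collect all values of subfield `code` in every field with this tag."""
    values = []
    for field in record.get(tag, []):
        subfields = field[0]
        values.extend(value for key, value in subfields if key == code)
    return values


def to_marc_line(tag, field):
    """Render one field tuple in the form '999C5 $$r...$$s...'."""
    subfields, ind1, ind2 = field[0], field[1], field[2]
    # Blank indicators are shown as underscores, as in '700__a'.
    ind1 = ind1.strip() or '_'
    ind2 = ind2.strip() or '_'
    body = ''.join('$$%s%s' % (key, value) for key, value in subfields)
    return '%s%s%s %s' % (tag, ind1, ind2, body)


affiliations = subfield_values(record, '700', 'u')
references = record.get('999', [])
missing = record.get('245', [])      # tag not present, so the default [] is returned
marc_line = to_marc_line('999', references[0])
```

Using record.get(tag, []) instead of record[tag] avoids a KeyError for records that lack the field.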

get_fieldvalues(recid, marcfield)

marcfield is something like '700__a'

get_all_field_values(marcfield)

marcfield is something like '700__a'

dates

get_creation_date(recid, fmt='%Y-%m-%d')

DADD , datecreated

get_modification_date(recid, fmt='%Y-%m-%d')

DUPB , datemodified

get_earliest_date(recid, fmt='%Y-%m-%d')

DEARL, earliest-date : earliest bibliographic date with 01 for unknown day or month

other

get_tag_name(MARC)

E.g. '700__a' -> 'additional author name'

get_field_tags(field)

E.g. 'author' -> ['100__a', '100__q', '700__a', '700__q']

https://inspirehep.net/admin/bibindex/bibindexadmin.py?ln=en
List of all indices, access restricted

collection_reclist_cache.cache

Dictionary of collections, incl. hidden ones (e.g. collaboration notes)

run_sql

allows running MySQL queries from within IPython. E.g.
run_sql("show tables")
run_sql("SELECT creation_date FROM bibrec WHERE id=1318300")

or go to mysql from shell: [sachs@inspire] ~ > /opt/invenio/bin/dbexec -i

run_sql("SELECT name FROM field")

run_sql("select name from idxINDEX")

returns a list of all indices

some methods for intbitsets

|  add(...)
|      Add an element to a set.
|      This has no effect if the element is already present.
|  
|  clear(...)
|  
|  copy(...)
|      Return a shallow copy of a set.
|  
|  difference(...)
|      Return the difference of two intbitsets as a new set.
|      (i.e. all elements that are in this intbitset but not the other.)
|
|  discard(...)
|      Remove an element from a intbitset if it is a member.
|      If the element is not a member, do nothing.
|  
|  intersection(...)
|      Return the intersection of two intbitsets as a new set.
|      (i.e. all elements that are in both intbitsets.)
|  
|  union(...)
|      Return the union of two intbitsets as a new set.
|      (i.e. all elements that are in either intbitsets.)
|
|  union_update(...)
|      Update a intbitset with the union of itself and another.
|  
|  issubset(...)
|      Report whether another set contains this set.
|  
|  issuperset(...)
|      Report whether this set contains another set.
|

To create an empty intbitset

from invenio.intbitset import intbitset
myset = intbitset()
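
The methods listed above carry the same names as those of Python's built-in set. The sketch below therefore uses plain sets as stand-ins so that it runs anywhere. On an INSPIRE machine the hitsets would be intbitsets (e.g. from search_pattern or get_collection_reclist). All recids here are invented.

```python
# Stand-ins for hitsets; on INSPIRE these would be intbitsets.
hits_author = set([1001, 1002, 1003, 1004])     # e.g. result of one search
hits_title = set([1003, 1004, 1005])            # e.g. result of another search
collection_hep = set([1001, 1003, 1004, 1005])  # e.g. a collection reclist
deleted = set([1004])                           # e.g. records marked DELETED

# Records found by both searches / by at least one of them.
both = hits_author.intersection(hits_title)
either = hits_author.union(hits_title)

# Restrict to the collection and drop deleted records.
in_hep_not_deleted = either.intersection(collection_hep).difference(deleted)

# Build up a result set element by element.
result = set()
for recid in sorted(in_hep_not_deleted):
    result.add(recid)
result.discard(9999)    # not a member: nothing happens, no error
```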

Change record

BibRecord

in the python program

from invenio.bibrecord import FUNCTION

All functions are in https://github.com/inveniosoftware/invenio/blob/master/modules/bibrecord/lib/bibrecord.py

http://invenio-software.org/code-browser/invenio.bibrecord-module.html OBSOLETE

BibCheck

An alternative way to update and manipulate records in INSPIRE.
http://invenio-software.org/wiki/Development/Modules/BibCheck

http://invenio-software.org/wiki/Development/Modules/BibCheck/plugins

still correct??

bibcheck rules are defined in a little routine check_record(record) in a file name_of_file.py
and a block in https://github.com/inspirehep/inspire/blob/master/bibcheck/rules.cfg
specifying the name of the rule, the name of the file containing the corresponding check_record routine, optionally a list of arguments and a collection, as well as the holdingpen flag. If holdingpen is true, the record is not changed directly; instead an update is created in the Holding Pen (HP) and an RT ticket is created:

[name_of_rule]
check = name_of_file
check.argument = "100"
filter_pattern =  710__g:/for the/ or 710__g:/on behalf of/
filter_collection = HEP
holdingpen = true

other tags:

filter_field
filter_limit

These other tags might be ignored!!!

The arguments are parsed via json:
only lists are accepted, not tuples: [] instead of ()
true, false, null instead of True, False, None
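
What the JSON rules above mean in practice can be checked with Python's standard json module. This is only a sketch of the parsing step in isolation. The helper is_valid_json is hypothetical, and how bibcheck itself reacts to an invalid value is not shown here.

```python
import json


def is_valid_json(text):
    """Return True if `text` can be parsed as JSON, False otherwise."""
    try:
        json.loads(text)
    except ValueError:
        return False
    return True


# Accepted: JSON list, JSON literals, double-quoted string.
fields = json.loads('["100", "700"]')
flags = json.loads('[true, false, null]')
single = json.loads('"100"')

# Rejected: Python tuple syntax, Python literals, single-quoted string.
tuple_ok = is_valid_json("('100', '700')")
python_literal_ok = is_valid_json('True')
single_quote_ok = is_valid_json("'100'")
```

Note that JSON strings need double quotes; single-quoted strings are not valid JSON.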

check_record(record)

The record passed to check_record is an instance of the class AmendableRecord(dict) from
https://github.com/inveniosoftware/invenio/blob/master/modules/bibcheck/lib/bibcheck_task.py

which allows the usage of functions that are simpler than BibRecord.
In addition the record can be flagged.

MORE ....

Write & test:

Since existing code is not modified, it is possible to test bibcheck rules on any machine running INSPIRE. Test machines are preferred to avoid accidental modification of the database.

Put in one directory e.g.
debug_check_record.py: Stand-alone program for testing bibcheck rules
mvtexkey_z2a.py: Example file for check_record

In debug_check_record import check_record from the file you want to test.
Run debug_check_record query [collection] (e.g. debug_check_record.py "035__z:Yamamoto:201*" HEP)
debug_check_record tries to display the change in the record.

While testing, some functions do not work, e.g. record.set_amended(message).

Run

regular job via bibcheck/rules.cfg
one-time: sudo -u apache /opt/cds-invenio/bin/bibcheck -c /tmp/my-rules.cfg
using a local rules.cfg and possibly additional flags:
--notimechange : avoid changing the timeflag and re-indexing
--no-tickets : no RT tickets
--dry-run : no xml for upload, no tickets, just log files
--ticket-creation-policy=per-rule : one ticket for all records
--id=ids : run only on the specified record ids or ranges (comma separated), ignoring all other filters
--email-logs-to=EMAILS

Miscellaneous

Authors

AuthorID

from invenio.bibauthorid_dbinterface import get_authors_by_canonical_name_regexp
get_authors_by_canonical_name_regexp('S.Mele.1')
((273672L, 'S.Mele.1'),)

List of external publications by AuthorID
from invenio.webauthorprofile_corefunctions import get_external_publications
get_external_publications(273672)

SPIRES syntax conversion

 from invenio.search_engine_query_parser import SpiresToInvenioSyntaxConverter
 SpiresToInvenioSyntaxConverter().convert_query("fin a sachs, kirsten")

or from git clone /afs/cern.ch/project/inspire/repo/inspire-scripts.git

inspire-scripts/utility/spires-syntax-converter.py "fin a sachs, kirsten"

RT

from invenio.bibcatalog import BIBCATALOG_SYSTEM
BIBCATALOG_SYSTEM.ticket_search

Or faster, but requires login:

import rt
rt_inspirehep = rt.Rt('https://rt.inspirehep.net/REST/1.0/', user_name, password) 
rt_inspirehep.login()
rt_inspirehep.search(Queue=queue, Resolved__gt=from_date)

Knowledge bases

from invenio.bibknowledge import ...

get_kb_mappings(kb_name="", key="", value="", match_type="s")

returns a list of {'kbid': 8, 'value': 'Accelerators', 'kbname': 'Subjects', 'id': 29575, 'key': 'b'}
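
Such a list of dictionaries is easily turned into a lookup table. A minimal sketch on hand-written data in the shape shown above: the first entry is the one from the example, the second entry is invented for illustration.

```python
# Sample data in the shape returned by get_kb_mappings.
mappings = [
    {'kbid': 8, 'value': 'Accelerators', 'kbname': 'Subjects', 'id': 29575, 'key': 'b'},
    # Invented second entry, only to have more than one mapping.
    {'kbid': 8, 'value': 'Astrophysics', 'kbname': 'Subjects', 'id': 29576, 'key': 'a'},
]

# key -> value, e.g. to expand a code found in a record.
key_to_value = dict((m['key'], m['value']) for m in mappings)

# value -> key, for the opposite direction.
value_to_key = dict((m['value'], m['key']) for m in mappings)

subject = key_to_value.get('b')
unknown = key_to_value.get('zz', 'UNKNOWN')   # default for keys not in the KB
```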

Managing files

Deleting / managing of attached files within python

on inspire05.cern.ch
from invenio.bibdocfile import BibRecDocs
bdoc = BibRecDocs(recid)
bdoc.bibdocs                          # a sort-of dictionary
doc_names = bdoc.get_bibdoc_names()   # returns the dictionary keys
doc_list = bdoc.list_bibdocs()        # returns some sort of document
doc_list[0].get_type()                # the type: INSPIRE-PUBLIC, arXiv, ...
bdoc.delete_bibdoc(doc_name)          # deletes the corresponding document (doc_name: one of doc_names)

To synchronize the MARC of the record call

bibdocfile --fix-marc -r <recid>

which prepares an xml file and sends it to BibUpload.

User interface - see bibdocfilecli.py

Select records by

--recid=12345
--recids=12345,12346,12347
--pattern=
--collection=
--md_rec== (modification date)
--cd_rec== (creation date)

Actions

--get-info
--append=this.file
--delete --with-docname="*foo*" (delete all attached files matching docname)

Examples:

bibdocfile --get-info --recids=12345,12346,12347 | grep "total file attached"
bibdocfile --append=1212.6789.pdf --recid=12345 --set-doctype=INSPIRE-PUBLIC --with-format=pdf
bibdocfile --delete --recids=12345,12346,12347

Claimed articles

from invenio.bibauthorid_webapi import get_person_id_from_canonical_id
from invenio.bibauthorid_dbinterface import get_canonical_name_of_author
from invenio.bibauthorid_dbinterface import get_claimed_papers_of_author

bai = get_canonical_name_of_author(pid)
pid = get_person_id_from_canonical_id(bai)
claims = [c[2] for c in get_claimed_papers_of_author(pid)]

To get the full information of the table:

run_sql("select * from aidPERSONIDPAPERS where bibrec=%s", (recid,))

flag = 1: recid is in cluster
flag = 2: recid is claimed

Mixed bag git repo

We have collected data cleanup and all kinds of other scripts in a git repo on AFS, which is accessible from inside CERN or via ssh

$ git remote -v add origin  ssh://hschwand@lxplus.cern.ch/afs/cern.ch/project/inspire/repo/inspire-scripts
$ git fetch origin
$ git checkout master
?? origin  ssh://hschwand@lxplus.cern.ch/afs/cern.ch/project/inspire/repo/inspire-scripts
(push)

[T@lap inspire-scripts (master)]$ /usr/bin/tree -d
.
- cron
- datacleaning
- examples
- git-hooks
- greasemonkey_scripts
- import
- maintenance
- sublime_plugins
   - Git
      - syntax
   - Invenio
   - Launcher
- utility

Since this is a somewhat unofficial repo/grab-bag of useful things to share, it's important to have good docstrings and to keep re-usability in mind when creating tools.

To fix broken bibxxx tables:

from fix_broken_bibxxx import check_record_consistency, fix_broken_record

How to programmatically submit tasks to bibsched (by Thorsten)

There are some utility classes for common tasks which take care of the chunking of long lists of recids and submit the tasks to bibsched one chunk at a time, waiting for the task processing the previous chunk to finish before submitting the next.

Look at the base class

https://github.com/inspirehep/inspire/blob/master/miscutil/lib/bibtaskutils.py#L84-L114

In [1]: from invenio.bibtaskutils import ChunkedBibRank

# instantiate the ChunkedBibRank task with methods = 'citation' and user = schwande

In [2]: cbr = ChunkedBibRank('citation', 'schwande')

# take a list of 10k recids

In [3]: idlist = range(100000,110000)

In [4]: len(idlist)
Out[4]: 10000

# add the ids to cbr

In [5]: for i in idlist: cbr.add(i)

This will kick off a bibrank task once 500 recids (the default chunk size, which can be changed via args) have been added and wait for it to finish, then do the next 500, etc. See

https://github.com/inspirehep/inspire/blob/master/miscutil/lib/bibtaskutils.py#L97-L98 and https://github.com/inspirehep/inspire/blob/master/miscutil/lib/bibtaskutils.py#L24-L28

There are several other args and options, look at the code. The code also shows you how to directly run task_low_level_submission() without chunking or waiting if you need that.
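
The chunking idea itself can be sketched in a few lines of plain Python. This is not the code from bibtaskutils.py: the class ChunkedSubmitter and the function fake_submit are hypothetical, and instead of submitting a bibsched task and waiting for it, the sketch only records each chunk.

```python
class ChunkedSubmitter(object):
    """Collect recids and hand them to `submit` one chunk at a time."""

    def __init__(self, submit, chunk_size=500):
        self.submit = submit
        self.chunk_size = chunk_size
        self.pending = []

    def add(self, recid):
        self.pending.append(recid)
        # A full chunk is submitted as soon as it is complete.
        if len(self.pending) >= self.chunk_size:
            self.flush()

    def flush(self):
        """Submit whatever is pending (needed for a final partial chunk)."""
        if self.pending:
            chunk, self.pending = self.pending, []
            self.submit(chunk)


submitted = []   # records the chunks instead of starting real tasks


def fake_submit(chunk):
    # A real implementation would submit a task here and block until it is done.
    submitted.append(list(chunk))


submitter = ChunkedSubmitter(fake_submit, chunk_size=500)
for recid in range(100000, 110000):
    submitter.add(recid)
submitter.add(110000)   # one extra recid that does not fill a chunk
submitter.flush()       # without this the last recid would never be submitted

chunk_sizes = [len(chunk) for chunk in submitted]
```

The explicit flush at the end is the point to remember: a chunk that never reaches the chunk size is not submitted on its own in this sketch.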

-- KirstenSachs - 13-Feb-2012

Attachments

Attachment	History	Size	Date	Who	Comment
debug_check_record.py.txt	r1	3.0 K	2014-10-23 - 14:41	KirstenSachs	Stand-alone program for testing bibcheck rules
mvtexkey_z2a.py.txt	r1	1.6 K	2014-10-23 - 14:42	KirstenSachs	Example file for check_record
prs.txt	r1	7.7 K	2013-09-25 - 16:22	KirstenSachs	HELP perform_request_search

Topic revision: r41 - 2021-01-18 - KirstenSachs
