Inspire Web>SystemDesign>SystemDesignBibHoldingPen (2015-02-18, JanLavik)

LEGACY WARNING: THIS PAGE IS OUTDATED

Holding Pen and BibWorkflow design overview

Holding Pen (HP): Pre-importation staging area. Refers to the place where catalogers manually inspect new record updates/inserts that will (or will not) go into the database. This also includes a more elaborate record selection process (done by physicists). Uses BibWorkflow to achieve this goal.

BibWorkflow: Generic and automatic workflow module. This module allows a client to execute a given programmable set of tasks on a set of objects such as meta-data records, files etc. These tasks can be anything, such as taking the objects through the typical pre-ingestion steps like matching/filtering/classification etc.

Holding Pen

The holding pen will be a common interface for all records/files (read: objects) that need to go through certain pre-ingestion steps before being uploaded into the system (or not*). Various decision aids will be made available to curators through this interface, such as record approvals, meta-data/full-text analysis and record disambiguation.

* there will be ways of circumventing the HP as before.

Use cases

Some of the use-cases thought of for the holding pen as part of the ingestion workflow:

Match resolution (merge matched records, select outcome for ambiguous and fuzzy matched records)

Classification (should certain records or objects be accepted or rejected?)

Various automatic transformations/filters (conversion of formats, adding of fields etc.)

Interface

The interface for the holding-pen will be a layer built on top of BibWorkflow that will display all records/objects that need some human intervention.

For every action that is needed, a special interface is displayed to the curator. For example: special interface for match resolution.
The whole history of each record, including the tasks/workflows applied to it, will be viewable, and there will be special searchable/filterable fields such as source, category, publisher etc.

Early mockups: https://docs.google.com/folder/d/0BzhOER3a3OHtTXNMM3ZwVU1ubWs/edit

Actions

After a record is put through the Holding Pen, certain actions may be flagged for manual curation/oversight. Curators can then trigger these actions on the records, after which the records are either inserted into the main database or discarded. When this happens, the record in the holding pen is considered removed or inserted and may be deleted, since the HP is by definition just a temporary staging area. This means that in the ideal case, the holding pen would be empty.

Merging records

One action type would be to allow easy merging of records that already exist in the repository through the existing Holding Pen integration in BibEdit (http://goo.gl/v9zfT).

Resolving matches

N/A

Implementation

Use a new table for the Holding Pen, linked to the BibWorkflow object tables. Data is stored there as a blob. Maybe Solr indexing of record data? Also store status (what has been done to this HP record and what needs to be done with it) and workflow ID (a common identifier for records in the same update job) in new columns.

Status could be a different table storing various technical metadata related to HP records. To display HP records using Invenio BibFormats, we need to subclass BibFormatObject, overriding self.get_record() to read from the HP recstruct.
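As a rough illustration of the table layout discussed above, the following sketch uses an in-memory SQLite database. The column names, the status values and the helper functions are hypothetical and only serve the example; the table name is borrowed from the bwlHOLDINGPEN table mentioned further down this page. The elements taken from the design are the blob payload, the status, the workflow ID and the link to a BibWorkflow object.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE bwlHOLDINGPEN (
        id          INTEGER PRIMARY KEY,
        object_id   INTEGER NOT NULL,  -- link to the BibWorkflow object table
        workflow_id TEXT NOT NULL,     -- shared by all records of one update job
        status      TEXT NOT NULL,     -- what was done / what is still pending
        data        BLOB NOT NULL      -- record payload stored as a blob
    )
    """
)


def add_hp_record(object_id, workflow_id, status, marcxml):
    """Insert one Holding Pen row and return its id (hypothetical helper)."""
    cur = conn.execute(
        "INSERT INTO bwlHOLDINGPEN (object_id, workflow_id, status, data) "
        "VALUES (?, ?, ?, ?)",
        (object_id, workflow_id, status, marcxml.encode("utf-8")),
    )
    return cur.lastrowid


def records_for_job(workflow_id):
    """List (id, status) for every HP record belonging to one update job."""
    rows = conn.execute(
        "SELECT id, status FROM bwlHOLDINGPEN WHERE workflow_id = ? ORDER BY id",
        (workflow_id,),
    )
    return rows.fetchall()


first = add_hp_record(1, "job-42", "needs_match_resolution", "<record/>")
second = add_hp_record(2, "job-42", "needs_approval", "<record/>")
```

Keeping the status in an ordinary column (rather than inside the blob) is what makes the Holding Pen listing filterable with a plain query.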

BibWorkflow

A new module for creating various generic and configurable workflows for objects. This module is wrapped around the Python module "workflow engine". This offers high customizability for administrators and programmers to create workflows that will execute a list (or list of lists) of tasks (python functions) on a given range of objects one-by-one.

Workflows

The workflow definitions are just lists of python functions taking two arguments (the current workflow engine object and the current object to operate on). We store these definitions in python files - one definition per file.

from invenio.bibworkflow.tasks import convert_record, filter_record
workflow = [convert_record("/path/to/stylesheet"), filter_record("/path/to/filterfile")]

These Python functions are called tasks, and their goal is to manipulate the given objects, called BibWorkflowObjects, in the desired way. Since each task must take the current engine and the current object as its function parameters, any other parameters are supplied through an enclosing function that returns the task:

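def filter_record(filepath):
    # The outer function captures the extra parameter (filepath) in a closure.
    def _filter_record(engine, object):
        ...  # task body elided in the original

    # The returned inner function is the actual task run by the engine.
    return _filter_record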
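To make the closure pattern concrete, here is a minimal, self-contained sketch. It does not use the real BibWorkflowEngine: `TinyEngine`, `add_field` and `uppercase_field` are hypothetical names invented for this example, and plain dictionaries stand in for workflow objects.

```python
def add_field(name, value):
    """Task factory: returns a task that sets a field on the object."""
    def _add_field(engine, obj):
        obj[name] = value
    return _add_field


def uppercase_field(name):
    """Task factory: returns a task that upper-cases an existing field."""
    def _uppercase_field(engine, obj):
        obj[name] = obj[name].upper()
    return _uppercase_field


class TinyEngine:
    """Hypothetical stand-in for the engine; it only runs tasks in order."""

    def __init__(self, workflow):
        self.workflow = workflow
        self.log = []

    def process(self, objects):
        # Objects are handled one by one; every task sees the engine
        # and the current object, exactly two arguments.
        for obj in objects:
            for task in self.workflow:
                task(self, obj)
                self.log.append(task.__name__)
        return objects


workflow = [add_field("source", "arxiv"), uppercase_field("source")]
engine = TinyEngine(workflow)
records = engine.process([{"title": "a"}, {"title": "b"}])
```

Because the extra parameters live in the closure, the workflow definition stays a flat list of ready-to-call tasks.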

Workflow engine

The engine in the above example is called BibWorkflowEngine and it subclasses the GenericWorkflowEngine in "workflow-engine" to add new functionality such as logging and saving of objects/workflows. We keep track of the running workflows by saving them into a database table (bwlWORKFLOW), storing the data necessary to recreate the engine if needed, for example when restarting or continuing a workflow.

The fields we store:

workflow id
workflow_name
etc.

Workflow objects

The object in the above example is called BibWorkflowObject and it is always wrapped around any object sent through a BibWorkflow. This object gives us the functions needed to store the objects. Similar to how we store the workflows, we store the objects in a table (bwlOBJECT) with the data needed to recreate them, in addition to a reference to the stored workflow (workflow id).

We save the objects in different versions:

Version 0 (initial state): before the workflow starts, all objects are stored in the database in their original form.
Version 1 (finished state): when all tasks in the workflow complete successfully on an object, the "finished" object is stored in the database in its new form. References Version 0 (parent id).
Version 2 (unfinished state): if an error happens or user input is needed (raise HaltProcessing) the object is again stored in its "unfinished" state. References Version 0 (parent id).

Usually, version 2 objects are the interesting ones, and they are the ones that show up in the Holding Pen (by default).
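The versioning scheme can be sketched as follows. This is an illustration under stated assumptions, not the module's code: an in-memory list stands in for the object table, `save` and `run` are hypothetical helpers, the engine argument is passed as None, and only the exception name HaltProcessing and the three version numbers come from the text above.

```python
import copy
import itertools

INITIAL, FINISHED, UNFINISHED = 0, 1, 2


class HaltProcessing(Exception):
    """Raised by a task when user input is needed."""


store = []                    # in-memory stand-in for the object table
_ids = itertools.count(1)


def save(data, version, workflow_id, parent_id=None):
    """Store a snapshot of the object and return its object id."""
    row = {"id": next(_ids), "parent_id": parent_id,
           "workflow_id": workflow_id, "version": version,
           "data": copy.deepcopy(data)}
    store.append(row)
    return row["id"]


def run(workflow, data, workflow_id):
    # Version 0: the original form, saved before any task runs.
    parent_id = save(data, INITIAL, workflow_id)
    try:
        for task in workflow:
            task(None, data)
    except Exception:
        # Version 2: an error occurred or a task asked for user input.
        return save(data, UNFINISHED, workflow_id, parent_id)
    # Version 1: every task completed successfully.
    return save(data, FINISHED, workflow_id, parent_id)


def tag(engine, obj):
    obj["tagged"] = True


def needs_curator(engine, obj):
    if obj.get("ambiguous"):
        raise HaltProcessing("match resolution needed")


run([tag, needs_curator], {"title": "ok"}, workflow_id="wf-1")
run([tag, needs_curator], {"title": "unclear", "ambiguous": True}, workflow_id="wf-1")

# What the Holding Pen would list by default: the unfinished objects.
holding_pen = [row for row in store if row["version"] == UNFINISHED]
```

Since the version 0 snapshot is copied before any task runs, a curator can always compare the halted object with its original form through the parent id.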

We decided against serialization of the objects for various reasons, such as issues with code updates and the difficulty of querying efficiently.

The fields we store:

object id
parent id
workflow id
object data
extra data

In addition, for now, we are storing the extra data needed for the holding pen in a separate table (bwlHOLDINGPEN) referencing object id as relation.

Workers

In BibWorkflow, workflows are executed by a certain "worker". A worker in this case is an implementation (more like a plugin) of a process that executes the workflow in the background. This is necessary, for example, when launching jobs from the web interface. It also decouples the execution of the task from the client, allowing the client to go on with other work.

Celery worker: currently we have a Celery integration that adds a launched workflow to the Celery job queue, where it is then executed. Read about it here
Python worker: standard python plugin that just runs the workflow directly in the current process
BibSched worker: will wrap the workflow execution in a BibWorkflow bibtask to be run through BibSched
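The worker idea can be sketched as a small plugin registry. Everything here is an assumption made for illustration: the class names, the `submit` method and the `WORKERS` registry are hypothetical, and a thread pool stands in for a real job queue such as Celery or BibSched.

```python
from concurrent.futures import ThreadPoolExecutor

# One shared pool; a stand-in for an external job queue.
_pool = ThreadPoolExecutor(max_workers=1)


def run_workflow(workflow, objects):
    """Apply every task to every object (the engine is passed as None here)."""
    for obj in objects:
        for task in workflow:
            task(None, obj)
    return objects


class PythonWorker:
    """Runs the workflow directly in the current process."""

    def submit(self, workflow, objects):
        return run_workflow(workflow, objects)


class BackgroundWorker:
    """Hands the workflow to the pool and returns a future immediately."""

    def submit(self, workflow, objects):
        return _pool.submit(run_workflow, workflow, objects)


# Workers are looked up by name, so new ones can be plugged in.
WORKERS = {"python": PythonWorker, "background": BackgroundWorker}


def start(worker_name, workflow, objects):
    return WORKERS[worker_name]().submit(workflow, objects)


def mark_seen(engine, obj):
    obj["seen"] = True


direct = start("python", [mark_seen], [{"id": 1}])
future = start("background", [mark_seen], [{"id": 2}])
background = future.result()   # the client decides when to wait for the result
```

The client code is identical for both workers; only the name passed to `start` changes, which is the decoupling the worker plugins are meant to provide.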

Record ingestion sources

Here are some examples of record ingestion sources that may use BibWorkflow to decide whether or not a record update is put into the holding pen.

OAI harvesting (BibHarvest): OAI harvested XML comes from BibHarvest jobs. This generates MARCXML to be run through the pre-ingestion processing step (BibWorkflow), then uploaded into the record database or HP.

BibUpload: When uploading new content, a simple (or advanced) matching workflow is performed. Ambiguous results should be fed into HP for analysis.

Batchuploader web interface: MARCXML coming from the web interface of batchuploader. Right now it will go directly into BibUpload - assumed to be well-formed. This could start a selected BibWorkflow process before ingestion as well.

Robot web upload or command line interface: Flat MARCXML to be uploaded into the database using BibUpload. Can use BibWorkflows before/after ingestion.

User submissions: Filter user submissions and have various approval/review workflows

BibEdit: Usually a record update coming from BibEdit will directly replace the existing record. An option would be to run changes through BibWorkflow in order to filter the changes - for example in case of non-fully trusted updates.

Interface

A workflow can be initiated by a call to the BibWorkflow interface, either by a web call, API call or CLI call. BibWorkflow will read configuration from the workflow definition (file).
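One possible way to read a workflow definition from its file is sketched below using the standard library's `runpy` module. This is an assumption about how loading could work, not a description of the actual BibWorkflow loader; `load_workflow`, the file name and the `set_status` task are hypothetical. The only convention taken from this page is that a definition file exposes a list named `workflow`.

```python
import os
import runpy
import tempfile


def load_workflow(path):
    """Execute a definition file and return its `workflow` list."""
    namespace = runpy.run_path(path)
    workflow = namespace["workflow"]
    if not all(callable(task) for task in workflow):
        raise TypeError("every workflow entry must be a callable task")
    return workflow


# A tiny definition file, written to disk only for the sake of the example.
definition = '''
def set_status(value):
    def _set_status(engine, obj):
        obj["status"] = value
    return _set_status

workflow = [set_status("converted"), set_status("filtered")]
'''

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_workflow.py")
    with open(path, "w") as handle:
        handle.write(definition)
    loaded = load_workflow(path)

record = {}
for task in loaded:
    task(None, record)   # engine argument omitted in this sketch
```

A web, API or CLI entry point could all funnel into the same loader, differing only in how they obtain the definition name.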

Currently we have a basic web interface where administrators can launch workflows and monitor the running (and completed) workflows.

Summary

Module needs the following:

basic tasks (plugin based)
CLI, API, WEB to configure and run workflows
Workflow worker plugins (bibsched, python)
BibUpload needs to be able to upload new records, not only existing DB records, into the Holding Pen.

Project task list on Trello

The project will primarily be implemented by Wojciech (W), Daniel (D) and Jan Åge (J). We will aim to make heavy use of round robin code review and pair programming. Another goal is to have full test coverage.

For our internal task list we will be trying out trello.com

Here is a preliminary list of tasks (the Trello board will be kept more up to date over time):

~~Create a working basic interface - BibHarvest: fix holding pen display (W)~~
~~Create module bibworkflow to wrap around WorkflowEngine and Celery/BibSched~~
Redesign HP backend to use the chosen record structure (e.g. PinkBibRecord etc.) used by BibWorkflow components.
BibUpload should not require 001 mapping (recid) when inserting into holding pen
Improve basic HP interface with BibFormat hooks (with WTForms integration)
BibMatch API functions check
BibConvert API functions check
BibClassify API functions check
DocExtract API functions check
Improve BibMerge/BibEdit/Holding Pen integration when applying/merging changes
Change BibHarvest post-processes to use BibWorkflow
Allow Batchuploads to go through BibWorkflow

Celery

We can allow the use of the popular Celery task queue as our workflow engine, as a nicer alternative to BibSched, since we will (on INSPIRE) probably be using Celery in other parts of the system (BAI/DocExtract). Celery has the option to chain tasks in a way that fits our requirements for a workflow.

Read about it here

PinkElephant

Since we will not be using BibIndex or the record database for storage, some of the implementation in this branch is not applicable. We can use the PinkBibRecord object and its ideas as a basis for the BibWorkflowRecord object, modifying any reference to BibIndex or the record database where necessary, and have it be the common record structure between workflow tasks.

Related Trac Tickets

Some problems/issues to solve

Sync issue - matching should run again after x time in the HP? -> BibUpload needs a matching workflow -> results are fed to HP or DB
What runs in BibUpdate should be runnable by the HP (match, filter etc.)
Are we able to store snippets of record updates? Do we need to?
Technical metadata storage? Right now, BibMatch puts XML comments in front of records with such data.

-- JanLavik - 08-Aug-2012

Topic revision: r11 - 2015-02-18 - JanLavik
