Generic SRM prestaging script

The script serves as an actual prestaging tool, as a reference implementation and as a testing tool. What follows describes the desired behaviour, not the current behaviour of the script.

Where can I find it?

Download the following files:
Run source setup.sh
Run ./my_prestage.py [OPTIONS] -s surls | -i inputfile, where the options are:
- -s surls: to give a comma-separated list of surls, or
- -i inputfile: to give a file containing a list of surls (one per line)
- -t spacetokendesc: to set the descriptor of the space token in which to stage the files (default: none)
- -l pinlifetime: to set the pin lifetime for the files (default: 7200 s)
- -r requestlifetime: to set the time after which an SRM request should time out (default: 86400 s)
- -L: if used, the script does not use srmLs to poll the file locality
- -p polltime: to set the time between two consecutive status polls (default: 60 s)
- -n runname: to set the name of the .txt files produced (see below)
- -b: to query the BDII to use "short" SURLs (default: use "full" SURLs)
- -O: to run srmLs at every poll also on files already known to be ONLINE (default: srmLs only on files not yet ONLINE at the previous poll)
- -R: to release all files before the script exits (default: files are not released)
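
The option list above can be mirrored in a small argument parser. This is only a sketch of the documented interface, not the script's actual parsing code: the use of argparse, the function names build_parser and read_surls, and the destination names (no_ls, use_bdii, ls_online_too, release) are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented options and defaults."""
    parser = argparse.ArgumentParser(prog="my_prestage.py")
    # Exactly one of -s / -i must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-s", dest="surls", help="comma-separated list of SURLs")
    source.add_argument("-i", dest="inputfile", help="file with one SURL per line")
    parser.add_argument("-t", dest="spacetokendesc", default=None)
    parser.add_argument("-l", dest="pinlifetime", type=int, default=7200)
    parser.add_argument("-r", dest="requestlifetime", type=int, default=86400)
    parser.add_argument("-L", dest="no_ls", action="store_true")
    parser.add_argument("-p", dest="polltime", type=int, default=60)
    parser.add_argument("-n", dest="runname", default=None)
    parser.add_argument("-b", dest="use_bdii", action="store_true")
    parser.add_argument("-O", dest="ls_online_too", action="store_true")
    parser.add_argument("-R", dest="release", action="store_true")
    return parser


def read_surls(args):
    """Return the SURL list from -s (comma-separated) or -i (one per line)."""
    if args.surls is not None:
        return [s.strip() for s in args.surls.split(",") if s.strip()]
    with open(args.inputfile) as handle:
        return [line.strip() for line in handle if line.strip()]
```

For example, build_parser().parse_args(["-s", "srm://a/1,srm://a/2", "-p", "600", "-L"]) would select a 10-minute polling cycle with srmLs switched off.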

How is it used?

Given a list of SURLs, the script is executed and runs until one of these conditions is fulfilled:

- all files are staged (the GFAL status is 1 and the locality is ONLINE or ONLINE_AND_NEARLINE)
- a timeout is reached, equal to requestlifetime + 7200
- the script is interrupted by the user via CTRL-C
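
The first two conditions can be written as a small check. The sketch below is illustrative only: the names is_staged and should_stop and the representation of each file as a (gfal_status, locality) pair are assumptions, and the CTRL-C case is left to the caller (in Python it would surface as a KeyboardInterrupt).

```python
ONLINE_LOCALITIES = {"ONLINE", "ONLINE_AND_NEARLINE"}
TIMEOUT_MARGIN = 7200  # seconds added to requestlifetime, as stated above


def is_staged(gfal_status, locality):
    """A file counts as staged when GFAL reports done and srmLs reports it online."""
    return gfal_status == 1 and locality in ONLINE_LOCALITIES


def should_stop(files, elapsed, requestlifetime=86400):
    """Return "staged", "timeout" or None for a list of (gfal_status, locality) pairs.

    elapsed is the time in seconds since the request was submitted.
    """
    if all(is_staged(status, locality) for status, locality in files):
        return "staged"
    if elapsed >= requestlifetime + TIMEOUT_MARGIN:
        return "timeout"
    return None
```

Note that this check requires a known locality, so it does not cover runs with -L, where srmLs is not used.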

What does it produce?

Every time the script polls the status of the request, it generates three files:

fileStatusDump_<request token>.txt: a text file with one line per file, each line containing these fields:
```
time, surl, gfal_status, explanation, locality, pinlifetime
```
where gfal_status is the GFAL status (-1=error, 0=pending, 1=done).

request_status_<request token>.txt: a text file summarizing the status of the request with this format:

TOKEN=<request token>
TIME_TOT=<time passed since request submission>
TIME_STATUSOF=<time spent on the StatusOf call>
TIME_LS=<time spent on the Ls call>
ERRORS=<number of files with status = -1>
PENDING=<number of files with status = 0>
DONE=<number of files with status = 1>
NONE=<number of files NONE>
LOST=<number of files LOST>
UNAVAIL=<number of files UNAVAILABLE>
NEARLINE=<number of files NEARLINE>
ONLINE=<number of files ONLINE or ONLINE_AND_NEARLINE>
STAGED_GB=<amount of staged data according to srmStatusOfBringOnline>
ONLINE_GB=<amount of online data according to srmLs>

All the fields can be written on a single line, with a new line appended to the same file at each poll.
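
For later analysis, such a file can be read back into one dictionary per poll. The sketch below assumes that the KEY=VALUE fields on a line are separated by whitespace and that values contain no spaces; neither is specified above, and the function names are hypothetical. Values are kept as strings, since the units and number formats are not defined here.

```python
def parse_status_line(line):
    """Split one line of KEY=VALUE fields into a dictionary of strings."""
    record = {}
    for field in line.split():
        key, sep, value = field.partition("=")
        if sep:  # ignore tokens that are not KEY=VALUE pairs
            record[key] = value
    return record


def load_request_status(path):
    """Return one record per non-empty line, i.e. one per poll."""
    with open(path) as handle:
        return [parse_status_line(line) for line in handle if line.strip()]
```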

How does it poll?

While gfal_prestagestatus returns 0 (pending), the request status is polled at regular intervals. When it returns -1 (error), the time between polls increases exponentially.
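
One way to express this policy is a function that returns the delay before the next poll. This is a sketch under assumptions: the doubling factor, the cap max_delay and the name next_delay are not specified above and are chosen only for illustration.

```python
def next_delay(gfal_status, polltime=60, consecutive_errors=0, max_delay=3600):
    """Return the number of seconds to wait before the next status poll.

    consecutive_errors counts the -1 results seen in a row, including the
    current one. The factor of 2 and the cap are illustrative assumptions.
    """
    if gfal_status == -1:
        return min(polltime * 2 ** consecutive_errors, max_delay)
    return polltime
```

With the default polltime of 60 s, the first error leads to a 120 s wait, the second to 240 s, and so on up to the cap.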

StatusOf or Ls?

The GFAL status of the files is obtained from gfal_prestagestatus, while the locality can be determined only via gfal_ls. By default both functions are used, but it is possible to use StatusOf alone. Using both makes it possible to spot inconsistencies (such as the locality being ONLINE while the GFAL status is 0).
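
A cross-check of the two sources of information could look like the sketch below. The function name and the input layout (a dictionary from SURL to a (gfal_status, locality) pair) are assumptions. The first rule is the example given above; the second (status 1 but not online) is an additional assumption about what might count as inconsistent.

```python
ONLINE_LOCALITIES = {"ONLINE", "ONLINE_AND_NEARLINE"}


def find_inconsistencies(files):
    """Return (surl, reason) pairs where StatusOf and Ls disagree.

    files maps each SURL to a (gfal_status, locality) pair.
    """
    suspicious = []
    for surl, (status, locality) in sorted(files.items()):
        online = locality in ONLINE_LOCALITIES
        if online and status == 0:
            suspicious.append((surl, "ONLINE but still pending"))
        elif status == 1 and not online:
            # Assumed rule, not stated in the text above.
            suspicious.append((surl, "done but not ONLINE"))
    return suspicious
```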

Cleaning up

When the script finishes (including the case where it is killed by the user), it aborts the request and optionally releases all the files. The files should be released when the script is used for testing, and not released when it is used for real prestaging.
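
The cleanup behaviour can be sketched with a try/finally block. The three callables are hypothetical placeholders for the polling loop, the request abort and the file release; they do not correspond to actual GFAL function names.

```python
def run_with_cleanup(poll_until_done, abort_request, release_files, release=False):
    """Run the polling loop and always clean up, even after CTRL-C.

    All three callables are placeholders supplied by the caller.
    """
    try:
        poll_until_done()
    except KeyboardInterrupt:
        pass  # interrupted by the user; fall through to the cleanup
    finally:
        abort_request()
        if release:  # corresponds to the -R option
            release_files()
```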

Optimizations

Ls is done only on the files which had not reached a final state at the previous polling.
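
As a sketch, the selection of files for the next Ls call could be written as follows. It assumes that "final state" means ONLINE or ONLINE_AND_NEARLINE, in line with the description of the -O option; the function name and the dictionary layout are hypothetical.

```python
ONLINE_LOCALITIES = {"ONLINE", "ONLINE_AND_NEARLINE"}


def surls_to_ls(last_locality, always_all=False):
    """Return the SURLs to query with Ls at the next poll.

    last_locality maps each SURL to the locality seen at the previous poll
    (None if the file has not been queried yet). always_all corresponds to
    the -O option.
    """
    if always_all:
        return sorted(last_locality)
    return sorted(
        surl for surl, locality in last_locality.items()
        if locality not in ONLINE_LOCALITIES
    )
```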

Data analysis

The output data defined above is sufficient to perform a detailed data analysis. Some examples:

- compare the performance of Ls and StatusOf
- study and compare the time evolution of the number of ONLINE files and of DONE files
- measure the performance of the underlying tape system
- look for stale files (e.g. files which never become ONLINE)

Operation as a pre-staging tool (notes by J. Hernandez)

By default the script determines the prestaging status of files via a bulk gfal call (very fast) and the file locality individually via srmLs (costly, ~1 sec/file). For dCache sites it should be fine to switch off the srmLs check (with the -L option). For Castor sites, a bug in gfal with CASTOR 2.1.7/CASTOR SRM 2.7 affects the prestaging status check, and therefore srmLs should be used. It is not clear which T1s are currently affected.
The default polling cycle for checking the pre-staging status is 60 seconds. A longer polling cycle of at least 10 minutes (-p option) should be sufficient.
The input to the prestaging script is a list (or a file with the list) of srm surls. It should be easy to write a little script using the PhEDEx data service to return a list of srm surls corresponding to a dataset/block at a given site that can be used as the input of the prestaging script.

To-do list

In addition to the things marked above, the following must be done:

- add an optional call to gfal_pin to extend pin lifetimes
- add a retry loop when the BringOnline request fails
- print to request_status_<request token>.txt error messages with the format:

TIME_TOT ERROR <error message>

For example:

67.362 ERROR [SE][BringOnline] httpg://srmcms.pic.es:8443/srm/managerv2: CGSI-gSOAP: Error reading token data: Connection reset by peer
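
A line in this format could be produced as in the sketch below. The function name is hypothetical, and the three decimal places and the collapsing of whitespace (to keep each message on one line) are assumptions based on the example above.

```python
def format_error_line(time_tot, message):
    """Format an error entry as "<TIME_TOT> ERROR <message>" on a single line."""
    one_line = " ".join(message.split())  # collapse newlines and repeated spaces
    return "%.3f ERROR %s" % (time_tot, one_line)
```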

-- AndreaSciaba - 28 May 2009 -- JoseHernandez - 19-Oct-2009

Topic revision: r4 - 2009-10-19 - JoseHernandezExternal3
