Storage elements dumps and consistency checks versus file catalogues

This page aims to coordinate requests for SE dumps to sites in a common way for all the experiments that might need them. It also summarizes the status of consistency checks between Storage Elements (SEs) and file catalogues, to give the context in which the dumps are used.

Many people contributed to the information on this page: Pablo Saiz, Alessandro Di Girolamo, Nicolo Magini, Cedric Serfon, Natalia Ratnikova and others. Thanks! Feel free to send me feedback to correct any information or to add other relevant details.

Overview of the situation

(Updated in spring 2011, on the basis of a presentation given at the T1 service coordination meeting of 21.04.2011.)

Grid storage elements (SEs) are decoupled from the file catalogue (FC), so some inconsistencies inevitably arise:

  • Data in the SEs, but not in the FC ('dark data') -> not an operational problem, but considerable waste of disk space
  • Data in the FC, but not in the SEs (lost/corrupted files) -> serious operational problems (jobs failing etc...)
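
The two kinds of inconsistency are the two set differences between the SE content and the catalogue content. A minimal sketch, assuming both listings are already available as collections of file names (the function name and the sample paths are illustrative and not part of any experiment's tooling):

```python
def compare_listings(se_files, fc_files):
    """Return (dark, lost): files only in the SE, files only in the FC."""
    se, fc = set(se_files), set(fc_files)
    dark = sorted(se - fc)   # on storage, unknown to the catalogue
    lost = sorted(fc - se)   # in the catalogue, missing from storage
    return dark, lost


# Hypothetical sample listings
se_dump = ["/vo/data/a.root", "/vo/data/b.root", "/vo/data/orphan.root"]
fc_dump = ["/vo/data/a.root", "/vo/data/b.root", "/vo/data/missing.root"]

dark, lost = compare_listings(se_dump, fc_dump)
print("dark data:", dark)
print("lost files:", lost)
```

In practice the SE listing is only a snapshot, so files written after the dump time stamp should be excluded before anything is declared lost or dark.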

ALICE

The coupling between SE and FC prevents most inconsistencies. No systematic, periodic check of SE/FC consistency is performed.

Security envelopes, based on the AliEn FC, are used to access the SE. Any write/delete operation goes through the FC, which prevents most inconsistencies. For example, a write operation follows this sequence:

  • 1. The client asks the FC
  • 2. The FC books the entry and returns the envelope
  • 3. The client writes the file to the SE using the envelope
  • 4. The client notifies the FC that the file has been added

If for some reason the connection breaks after step 3, the FC knows about the orphan file and should be able to remove it. Still, inconsistencies can arise in some cases: delete operations not correctly propagated, SE failures, etc.
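
The reason the FC can track orphans is that the entry is booked before the data are written. The toy model below illustrates the idea only; it is an in-memory stand-in written for this page, not the AliEn API, and all class and method names are hypothetical:

```python
import uuid


class ToyCatalogue:
    """In-memory stand-in for a file catalogue (hypothetical, not AliEn)."""

    def __init__(self):
        self.booked = {}         # envelope -> LFN, booked but not confirmed
        self.registered = set()  # LFNs whose write was confirmed

    def book(self, lfn):
        # Steps 1-2: book the entry and hand out an envelope
        envelope = uuid.uuid4().hex
        self.booked[envelope] = lfn
        return envelope

    def commit(self, envelope):
        # Step 4: the client confirms that the file has been added
        self.registered.add(self.booked.pop(envelope))

    def pending(self):
        """LFNs booked but never confirmed: possible orphans on the SE."""
        return sorted(self.booked.values())


fc = ToyCatalogue()
storage = {}

env = fc.book("/alice/run1/file1.root")
storage["/alice/run1/file1.root"] = b"data"   # step 3
fc.commit(env)

env = fc.book("/alice/run1/file2.root")
storage["/alice/run1/file2.root"] = b"data"   # step 3, then the connection breaks

print(fc.pending())
```

The second file never reaches step 4, so it stays in the booked list and can be cleaned up from the SE later.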

ATLAS

Inconsistencies, of both types, are checked in a systematic way.

Checks to spot and remove dark data:

Checks are done on demand (manually) when a divergence arises between DQ2, LFC and SRM (on the basis of this kind of plot). The Cloud squad is responsible. The frequency of checks varies from cloud to cloud. Tools and documentation are provided in a twiki describing how to produce the SE dump and the output file format.

  • 1-LFC vs SE: generate a dump of the SE and, on the fly, a dump of the LFC. The comparison produces a list of files in the SE but not registered in the LFC => remove them from the SE
  • 2-DQ2 vs LFC: a script produces a list of files in the LFC but not registered in DQ2 => remove them from LFC + SE

Checks on lost/corrupted files:

Automatically reported: jobs report to the Dashboard any failure due to an unavailable input file. Periodically the Dashboard sends the list of problematic files to the Cloud Squad. Site admins then investigate why the files are not available.

CMS

Consistency checks are done between the PhEDEx DB (which provides replica locations) and local SEs. Both types of inconsistency can happen.

Lost/corrupted data (files registered in PhEDEx DB but not in the SEs):

Every site, as part of its local PhEDEx installation, runs a local consistency agent, which gets the list of files at the site from the PhEDEx DB and checks size/checksum against the local storage content. The SE content is obtained through a locally customized plug-in (either a local storage dump, or talking to the SE through a local protocol) based on a common framework. Consistency checks can be triggered remotely, and the results are logged in the DB. Lost or corrupted files are frequent at Tier2s, where they are the main reason for persistent transfer errors. They are not critical, since files can be re-transferred or re-generated, but they are time consuming to fix. They rarely occur at Tier1s (e.g. damaged tapes).
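
A size/checksum comparison of this kind can be sketched as follows. This is not the PhEDEx agent; it assumes the replica is readable as a local file and that the catalogue stores the adler32 checksum as a hex string (the function names are illustrative):

```python
import os
import tempfile
import zlib


def adler32_of(path, chunk_size=1024 * 1024):
    """Compute the adler32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # adler32 starting value
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)


def check_replica(path, expected_size, expected_adler32):
    """Compare a local replica with the size and checksum from the catalogue."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) != expected_size:
        return "size mismatch"
    if adler32_of(path) != expected_adler32.lower():
        return "checksum mismatch"
    return "ok"


# Small self-contained demonstration on a temporary file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Wikipedia")
print(check_replica(tmp.name, 9, "11E60398"))
os.remove(tmp.name)
```

The size is checked first because it is cheap; the checksum requires reading the whole file.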

Checks to identify and remove dark data (files in the SE but not registered in PhEDEx):

Procedure to identify 'dark data':

  • The site generates a dump of the SEs and compares it with the central PhEDEx DB
  • The list of 'dark files' is sent to central operations, asking whether it is OK to remove them
  • Tier1s regularly report discrepancies of the order of ~10% between the PhEDEx DB and local SE usage

Main source of the inconsistency: leftovers from removals

A summary of the activity for CMS here

LHCb

Systematic checks started recently (beginning of 2011).

A new agent has been implemented to scan SE dumps and cross-check their consistency with the LFC. The agent first ran on SE dumps from SARA, where a huge inconsistency had been noticed (it was due to a dCache bug). This was reported to the site and promptly fixed by them (Apr 2011). The agent is currently in production and ready to process SE dumps from other sites as soon as they are available.

The situation of consistency for LHCb space tokens is summarized in this LHCb page

Format of SE dumps

It has been agreed among the experiments that the required information is:

  • LFN (or PFN)
  • file creation time
  • file size
  • space token
  • checksum (adler32) (optional)
  • exact time stamp of the creation of the dump
The exact format of the files is described below in this section.

Asking sites to provide a common format of SE dumps for all the interested experiments would streamline operations on their side. We are coordinating with experiments and sites in order to agree on common formats and procedures.

The approach is to ask sites which method/format they prefer, in order to make the task as easy as possible on their side. In general, one of two options is proposed: XML format or plain text format.

General (both for text and XML format)

Desired format for the storage dumps:
  • one file per space token
  • file name should be: SpaceToken.creationTimeInEpochSec.txt e.g. for LHCb-Disk space token: LHCb-Disk.1312201138.txt (or .xml)
  • the N files should be packed in a tar archive and published somewhere (it is up to the VO to decide the best way):
    • LHCb: the tarball should be published on a web server and the URL should be provided to the experiment
    • CMS: the tarball should be made available on the site's VO-box
    • ATLAS:

Finally, the tarball should be available to the VO with this layout:

> bunzip2 LHCb.tar.bz2
> tar xvf LHCb.tar
LHCb/
LHCb/LHCb-Disk.1314742759.txt
LHCb/LHCb-Tape.1314742759.txt
LHCb/LHCb_USER.1314742759.txt

Each file contains one row per stored file, in either text or XML format (see the respective sections below).
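
The naming and packing convention above can be sketched with the Python standard library. The helper names are illustrative, and the dump files here are empty placeholders standing in for real dumps:

```python
import os
import tarfile
import tempfile


def dump_file_name(space_token, created_epoch, ext="txt"):
    """Build SpaceToken.creationTimeInEpochSec.ext"""
    return "%s.%d.%s" % (space_token, int(created_epoch), ext)


def pack_dumps(vo, dump_paths, archive_path):
    """Pack the dump files into a bzip2-compressed tarball under <vo>/."""
    with tarfile.open(archive_path, "w:bz2") as tar:
        for path in dump_paths:
            tar.add(path, arcname="%s/%s" % (vo, os.path.basename(path)))
    return archive_path


workdir = tempfile.mkdtemp()
created = 1314742759  # time stamp of the dump, in epoch seconds
paths = []
for token in ("LHCb-Disk", "LHCb-Tape", "LHCb_USER"):
    path = os.path.join(workdir, dump_file_name(token, created))
    with open(path, "w") as f:
        f.write("")  # placeholder: the dump rows would be written here
    paths.append(path)

archive = pack_dumps("LHCb", paths, os.path.join(workdir, "LHCb.tar.bz2"))
with tarfile.open(archive) as tar:
    names = tar.getnames()
print(names)
```

All files of one tarball share the same time stamp, so the whole set can be compared against a single catalogue snapshot.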

XML format

For dCache

The XML format has been agreed with dCache sites, as they can produce the dump using the pnfs-dump or chimera-dump tools, which support XML output. More details here: http://www.desy.de/~paul/SynCat/syncat-1.0.tar.gz

Link to Chimera-dump tool

For DPM

ATLAS sites produce xml format storage dumps for DPM following procedures explained in this DDM operations twiki

Plain text format

CASTOR and StoRM have preferred this option. Details:

Content of each file: one line per file with:

LFN (or PFN) | file size (bytes) | file creation date (epoch s) | checksum
e.g.
[lxplus441] $ head -1 LHCb-Disk.1312201138.txt
/lhcb/MC/MC10/ALLSTREAMS.DST/00010680/0000/00010680_00000028_1.allstreams.dst|4046740455|1305988966000|N/A

If a field is not available, write 'na' (e.g. the checksum will not be available for some storage implementations; this is not a problem).
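
A reader for this format can be sketched as below. It is a minimal example, not an agreed tool: the names are illustrative, it accepts both 'na' and the 'N/A' seen in the sample row as missing values, and it keeps the creation time as the raw integer without assuming its unit.

```python
from collections import namedtuple

DumpEntry = namedtuple("DumpEntry", "name size created checksum")

# Spellings treated as "field not available" (assumption: case-insensitive)
MISSING = {"na", "n/a", ""}


def parse_dump_line(line):
    """Parse one 'name|size|creation time|checksum' row of a text dump."""
    name, size, created, checksum = [
        field.strip() for field in line.rstrip("\n").split("|")
    ]
    return DumpEntry(
        name=name,
        size=None if size.lower() in MISSING else int(size),
        created=None if created.lower() in MISSING else int(created),
        checksum=None if checksum.lower() in MISSING else checksum.lower(),
    )


sample = (
    "/lhcb/MC/MC10/ALLSTREAMS.DST/00010680/0000/"
    "00010680_00000028_1.allstreams.dst|4046740455|1305988966000|N/A"
)
entry = parse_dump_line(sample)
print(entry.size, entry.checksum)
```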

Script to get plain text format storage dumps for dCache querying the SRM DB

Onno Zweers from SARA very kindly provided the script he developed to produce the storage dumps from dCache. (SARA_space_token_files.py, available at the bottom of this page).

Text format with chimera-dump

chimera-dump produces plain text files with the "-a" (ASCII) option. One file per space token can be created by calling the script once per token with the "-s" option (e.g. "-s atlasdatadisk" for ATLASDATADISK). The output file name is set with the "-o" option, e.g. "-o SpaceToken.creationTimeInEpochSec.txt". For example:
python chimera-dump.py -a -c fulldump -o /tmp/testfile -s atlashotdisk
writes a file /tmp/testfile with the columns: pnfsid pfn size state pool time checksum
e.g.
0000001A3A70CEF94F2EB5460539DB975C35 /pnfs/ifh.de/data/atlas/atlashotdisk/cond09_data/000001/lar/cond09_data.000001.lar.COND/cond09_data.000001.lar.COND._0637.pool.root 57964708 1 ssu05-0 1258571461.0 c1c863f1
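
Rows in this layout can be converted to the pipe-separated plain text format with a few lines of Python. This is a sketch under two assumptions that are not confirmed on this page: that the PFN contains no whitespace, and that the time column can stand in for the file creation date.

```python
def parse_chimera_line(line):
    """Split one chimera-dump text row into its seven columns."""
    pnfsid, pfn, size, state, pool, time_stamp, checksum = line.split()
    return {
        "pnfsid": pnfsid,
        "pfn": pfn,
        "size": int(size),
        "state": state,
        "pool": pool,
        "time": float(time_stamp),
        "checksum": checksum,
    }


def to_plain_text_row(row):
    """Format as: PFN | size (bytes) | time (epoch s) | checksum"""
    return "%s|%d|%d|%s" % (row["pfn"], row["size"], int(row["time"]), row["checksum"])


sample = (
    "0000001A3A70CEF94F2EB5460539DB975C35 "
    "/pnfs/ifh.de/data/atlas/atlashotdisk/cond09_data/000001/lar/"
    "cond09_data.000001.lar.COND/cond09_data.000001.lar.COND._0637.pool.root "
    "57964708 1 ssu05-0 1258571461.0 c1c863f1"
)
row = parse_chimera_line(sample)
print(to_plain_text_row(row))
```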

Some notes on particular storage implementations

CASTOR

Update of March 2012, from the CERN CASTOR operations team: from 24th March new CASTOR diskpooldumps will be available for testing under /afs/cern.ch/project/castor/www/DiskPoolDump/tmp/

The format is still compatible with the old one, e.g.:

/castor/cern.ch/../something.root (9859562) size 36577829 status STAGED available YES requester username,groupid (numeric_uid,numeric_gid)  datastaged 10 Jan 2012 16:32.01 lastaccess 10 Jan 2012 16:31.50 fileclass temp nbaccesses 0

All the files provided will be gzipped. The latest one is at /afs/cern.ch/project/castor/www/DiskPoolDump/tmp/lhcb.NAME_OF_THE_POOL.last.gz, e.g. lhcb.lhcbraw.last.gz

We will switch to production in 2 weeks, under /afs/cern.ch/project/castor/www/DiskPoolDump/
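
A gzipped diskpooldump in the row format quoted above could be read as sketched here. The regular expression covers only the leading fields of the sample row; the meaning of the number in parentheses is not stated on this page, so it is kept under the neutral name "number". Function names and the temporary file are illustrative.

```python
import gzip
import os
import re
import tempfile

ROW = re.compile(
    r"^(?P<path>\S+) \((?P<number>\d+)\) size (?P<size>\d+) "
    r"status (?P<status>\S+) available (?P<available>\S+)"
)


def parse_diskpool_row(line):
    """Return the leading fields of one row, or None if it does not match."""
    match = ROW.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["size"] = int(row["size"])
    return row


def read_diskpool_dump(path):
    """Read a gzipped dump and return the rows that could be parsed."""
    with gzip.open(path, "rt") as f:
        return [row for row in map(parse_diskpool_row, f) if row is not None]


sample = (
    "/castor/cern.ch/../something.root (9859562) size 36577829 status STAGED "
    "available YES requester username,groupid (numeric_uid,numeric_gid)  "
    "datastaged 10 Jan 2012 16:32.01 lastaccess 10 Jan 2012 16:31.50 "
    "fileclass temp nbaccesses 0\n"
)

# Write the sample row to a temporary .gz file and read it back
dump_path = os.path.join(tempfile.mkdtemp(), "lhcb.example.last.gz")
with gzip.open(dump_path, "wt") as f:
    f.write(sample)
rows = read_diskpool_dump(dump_path)
print(rows[0]["path"], rows[0]["size"], rows[0]["status"])
```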

August 2011, after speaking with CASTOR developers at CERN: the first proposal from the CASTOR team was to use the daily stager dumps produced for each experiment and each service class. This turned out to be a bad solution, as the stager dumps only contain files that are currently in the disk cache. With this option ruled out, the second proposal is that the storage dumps are produced directly on the experiment side, by developing a tool based on nsls. In this way the full content of the namespace is obtained, but no information about service class (or space token) is available.

A similar tool has been developed at RAL by Shaun de Wit and Raja Nandakumar. The first storage dump sample from RAL arrived on 16.09.2011. The dump is produced weekly (every Monday night). Format: a single file for the whole nameserver dump, with file name, size and other metadata:

-rw-r--r--      1       lhcb001 lhcb    5015920356      Jun     15      19:36   AD      ce918b28        /castor/ads.rl.ac.uk/prod/lhcb/LHCb/Collision10/BHADRON.DST/00010920/0000/00010920_00000001_1.bhadron.dst
-rw-r--r--      1       lhcb001 lhcb    5039633586      Jun     18      17:29   AD      ff9e543e        /castor/ads.rl.ac.uk/prod/lhcb/LHCb/Collision10/BHADRON.DST/00010920/0000/00010920_00000002_1.bhadron.dst
-rw-r--r--      1       lhcb001 lhcb    5270857100      Jun     18      17:20   AD      a2f27b34        /castor/ads.rl.ac.uk/prod/lhcb/LHCb/Collision10/BHADRON.DST/00010920/0000/00010920_00000004_1.bhadron.dst
-rw-r--r--      1       lhcb001 lhcb    4975439267      Jun     18      17:21   AD      db3aed2b        /castor/ads.rl.ac.uk/prod/lhcb/LHCb/Collision10/BHADRON.DST/00010920/0000/00010920_00000005_1.bhadron.dst
-

Raja says the script can easily be run for CMS and ATLAS as well. The load on the nameserver is reasonable and running it once a week is not a problem. He is happy to be contacted by other experiments and sites in case they want to use the script. CMS and ATLAS confirmed that this format is fine for them too.
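
The RAL nameserver dump rows shown above can be read with a whitespace split. In this sketch the column names are guesses from the sample rows (in particular, "AD" is assumed to mark the checksum type), and rows that do not have the expected eleven columns are skipped:

```python
def parse_ral_row(line):
    """Parse one ls-style row of the RAL nameserver dump, or return None."""
    fields = line.split(None, 10)
    if len(fields) != 11:
        return None
    (perms, nlink, owner, group, size,
     month, day, clock, checksum_type, checksum, path) = fields
    return {
        "path": path.strip(),
        "size": int(size),
        "owner": owner,
        "group": group,
        "checksum_type": checksum_type,  # assumption: "AD" marks adler32
        "checksum": checksum,
    }


sample_rows = [
    "-rw-r--r--      1       lhcb001 lhcb    5015920356      Jun     15      19:36   AD      ce918b28        /castor/ads.rl.ac.uk/prod/lhcb/LHCb/Collision10/BHADRON.DST/00010920/0000/00010920_00000001_1.bhadron.dst",
    "-rw-r--r--      1       lhcb001 lhcb    5039633586      Jun     18      17:29   AD      ff9e543e        /castor/ads.rl.ac.uk/prod/lhcb/LHCb/Collision10/BHADRON.DST/00010920/0000/00010920_00000002_1.bhadron.dst",
    "-",
]
parsed = [row for row in map(parse_ral_row, sample_rows) if row is not None]
total_bytes = sum(row["size"] for row in parsed)
print(len(parsed), total_bytes)
```

Since the rows carry a month and day but no year, they cannot be converted to an epoch time stamp without extra information.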

The usage of this script at CERN Castor is problematic, due to scalability issues (discussed at T1 service coordination meeting of 03.11.2011).

StoRM at CNAF

Contacted Vincenzo Vagnoni (July 2011). His answer: both formats (text and XML) are possible, with a slight preference for text. The checksum might not be available: CNAF does not calculate the checksum for every file on the SE, only for files that arrive at CNAF via a gridftp transfer. For files written to the SE from the worker nodes, for example, this is not possible for the moment (calculating the checksum for every file would be very resource consuming).

How storage dumps are created: the dump is taken from GPFS. StoRM does not have a separate database for file names, which avoids any possible lack of synchronization between file system and database.

dCache

A storage dump can be obtained in two ways:
  • 1- file system dump (Chimera or pnfs)
  • 2- query to the SRM Postgres DB, which stores the names of all files known to the system and managed by SRM, with other metadata such as the space token
Which method to choose? Even though LHCb (and other experiments) now only write data into space tokens, some older data (from before space tokens were implemented) are still at sites. These can be retrieved ONLY with a FS dump. It has also happened that, after space tokens were implemented, some files meant to be written inside space tokens ended up outside (for still unknown reasons). On the other hand, a FS dump (if done with pnfs-dump) will NOT provide the space token. The preferred method has to be decided by each experiment, as it has implications for their data management model. To be decided for LHCb.

CASTOR cannot return the 'used' space for a given space token through SRM

There is a problem in getting the used space from CASTOR. SRM returns 3 fields, e.g. using the lcg_util API:
>>> cernEp = 'httpg://srm-lhcb.cern.ch:8443/srm/managerv2'
>>> srm = lcg_util.lcg_stmd('LHCb_M-DST', cernEp, True, 0)
>>> srm
(0, [{'guaranteedsize': 352787820797952L, 'totalsize': 352787820797952L, 'lifetimeassigned': 0, 'spacetoken': 'lhcb:LHCb_M-DST', 'lifetimeleft': -1, 'unusedsize': 90425875414218L, 'retentionpolicy': 'replica', 'owner': None, 'accesslatency': 'unknown'}], '')

  • Guaranteed size: all allocated space, including on line and off line disks.
  • Total size = Guaranteed size
  • Unused: free space in on line disks.
Computing 'used space' = total size - unused gives a wrong value if some disks are temporarily off line, because the free space on the off line disks is then counted as used! This happens quite often for big pools. CASTOR developers are aware of this problem (already reported by ATLAS at the Sept 2010 GDB) but there is no easy solution: 4 values should be reported (free and total, for on line and for all disks) but only 3 fields are available. In the next CASTOR release, the total space will refer to the total on line space, but this still does not solve the issue. Alternative solutions:
  • Proposal to publish the used space in the BDII
  • Ask for periodic dumps of the SEs and compute the used space summing the size of the files listed therein
The latter is probably easier to put in place. ONGOING
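
The two ways of estimating the used space can be compared in a short sketch. The SRM values are the ones quoted above; the dump rows are hypothetical and follow the plain text format (size in the second field); the function names are illustrative.

```python
def used_from_srm(space_info):
    """'Used' space as total minus unused; unreliable if disks are off line."""
    return space_info["totalsize"] - space_info["unusedsize"]


def used_from_dump(lines):
    """Sum the file sizes (second field) of a plain text storage dump."""
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split("|")
        total += int(fields[1])
    return total


# Values taken from the SRM output shown above
srm_info = {"totalsize": 352787820797952, "unusedsize": 90425875414218}

# Hypothetical dump rows: name|size|creation time|checksum
dump_rows = [
    "/lhcb/data/file1.dst|4046740455|1305988966|na\n",
    "/lhcb/data/file2.dst|1000000000|1305988970|na\n",
]

print(used_from_srm(srm_info))
print(used_from_dump(dump_rows))
```

The dump-based figure refers to the time stamp of the dump, so it is an accounting snapshot and not a live value.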

Proposed abstract for CHEP2012 conference

Title: Data storage accounting and verification in LHC experiments. Abstract here

Logbook of the activity

The following request has been sent to some sites:

PFN, file creation time, file size, space token, checksum (adler32), and the exact time stamp of the creation of the dump.

  • 13.07.2011 Asked Castor at CERN. Waiting for reply.
  • August 2011: after being asked, Castor suggested using the stager dumps produced daily (for purposes other than consistency checks). Fine
  • 14.07.2011 Asked StoRM at CNAF. Replied: yes, they can provide the dumps in plain text format. No checksum, but the other data yes. Will be done soon (remind them if no news before 1st Aug)
  • August 2011: agreed on a precise format for StoRM for LHCb, CMS, ATLAS. Provided on 30.08.2011. OK
  • 25.08.2011: noticed that many LHCb files are missing from the storage dumps. Opened a GGUS ticket to CERN. On 30.08.2011 understood: it does not make sense to do consistency checks on the basis of stager dumps, as the stager only has the files that are currently in the disk cache, but NOT the files on tape! They proposed that I produce the storage dumps myself by developing a tool based on nsls. Ongoing
  • 30.08.2011: got storage dumps from CNAF. Right format. OK
  • 16.09.2011: RAL provided a nameserver storage dump. To be analysed
  • 24.10.2011: CMS and ATLAS confirmed that the RAL storage dumps format is fine for them too
  • 25.10.2011: sent out results of consistency checks for LHCb sites: CNAF, GRIDKA, SARA
  • 03.11.2011: reported the status of storage dumps provided by sites and status of LHCb consistency checks SE vs LFC at the T1 service coordination meeting

Presentations

New perspective to improve the SE - FC synchronization: SEMsg

SEMsg is a messaging system that lets SEs and FCs talk to each other and keeps them synchronized. More info here. Both ATLAS and LHCb have tried the system with prototypes and intend to integrate it in their computing models, to be notified automatically about files not available on the SEs. (add some more info about the validation tests).

-- ElisaLanciotti - 20-Jul-2011

Topic attachments
  • SARA_space_token_files.py.txt (r1, 2.9 K, 2011-08-31): example of a script to produce storage dumps for dCache.
Topic revision: r18 - 2012-03-22 - unknown