A summary of the results of prestaging tests at different CMS sites

Tests run so far:

| Run name | Date | SURL list | Storage | Ls optimization | Space token | Total data (GB) | Polling time (min) |
| CNAF-1 | May 21 2009 | CNAF/surl_1_1TB.txt | CASTOR | no | - | 1100 | |
| CNAF-2 | May 22 2009 | CNAF/surl_1_1TB.txt | CASTOR | no | - | 1100 | |
| CNAF-3 | May 28 2009 | surl_day00_CNAF.txt | CASTOR | yes | - | 7253 | 20 |
| IN2P3-4 | May 27 2009 | surl_1_1TB-in2p3.txt | dCache | yes | - | 1129 | 5 |
| FZK-1 | May 27 2009 | surl_1_1TB-fzk.txt | dCache | yes | - | 1100 | 5 |
| IN2P3-5 | May 29 2009 | surl_1_1TB-in2p3.txt | dCache | yes | - | 1129 | 5 |

run CNAF-1

First official run at CNAF in the framework of STEP09 testing activity.

stagedGB vs time.CNAF-1.png rateOfStagedData vs time.CNAF-1.png

The run completed without problems. The left plot shows the amount of data brought online vs time. The staging rate (right plot) is good, and no errors were observed.

One comment: the bug affecting StatusOfBringOnline was not observed (see the plot below). CNAF runs CASTOR 2.1.7-17 and CASTOR SRM 2.7-15, a combination that is supposed to be affected by the bug. This was to be checked with the CASTOR developers: Andrea has checked with Giuseppe Lo Presti, who says that the bug is present in the CASTOR and SRM versions installed at CNAF and that it is only by chance that we did not observe it. Due to the nature of this bug, if the first poll works correctly, polling keeps working correctly until the end of the request. Consequently, at CNAF the only safe way to check whether a file is actually staged is via srmLs, which puts a heavier load on the SRM server.
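The srmLs-based check described above can be sketched as a polling loop. This is a minimal sketch, not the script used in these tests: `ls_locality` is a hypothetical callable standing in for whatever srmLs client is used, and treating both ONLINE and ONLINE_AND_NEARLINE as staged is an assumption. Files already seen online are dropped from later polls to limit the load on the SRM server.

```python
import time
from typing import Callable, Dict, Iterable

# Assumption: both of these localities mean that a disk copy exists.
STAGED_LOCALITIES = {"ONLINE", "ONLINE_AND_NEARLINE"}


def poll_until_staged(
    surls: Iterable[str],
    ls_locality: Callable[[str], str],  # hypothetical wrapper around an srmLs call
    interval_s: float = 300.0,          # illustrative polling interval
    max_polls: int = 100,               # illustrative upper bound
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, int]:
    """Return {surl: index of the poll at which it was first seen staged}."""
    pending = set(surls)
    staged_at: Dict[str, int] = {}
    for poll in range(max_polls):
        for surl in sorted(pending):
            if ls_locality(surl) in STAGED_LOCALITIES:
                staged_at[surl] = poll
        # Only files that are still not staged are queried at the next poll.
        pending.difference_update(staged_at)
        if not pending:
            break
        sleep(interval_s)
    return staged_at
```

Passing `sleep` as a parameter keeps the loop testable without waiting for real polling intervals.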

Done and online vs time.CNAF-1.png

run CNAF-2

Same test repeated the day after. Results are confirmed.

stagedGB vs time.CNAF-2.png rateOfStagedData vs time.CNAF-2.png

run IN2P3-4

Test run on 342 files. The staging rate is very high, up to 500 MB/s (see plot below).
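For reference, a staging rate of this kind can be derived from the cumulative amount of staged data recorded at each poll. The sketch below assumes 1 GB = 1000 MB and a list of (time in seconds, staged GB) samples; the sample values in the usage line are invented for illustration and are not measurements from this run.

```python
from typing import List, Tuple


def staging_rates_mb_s(samples: List[Tuple[float, float]]) -> List[float]:
    """Rate in MB/s for each interval between consecutive cumulative samples."""
    rates = []
    for (t0, gb0), (t1, gb1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        # Assumption: 1 GB = 1000 MB.
        rates.append((gb1 - gb0) * 1000.0 / (t1 - t0))
    return rates


# Invented samples at a 5-minute polling interval: 150 GB staged in the first 300 s.
example = staging_rates_mb_s([(0, 0.0), (300, 150.0), (600, 150.0)])
```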

Important comment about srmLs: it reports the locality of files incorrectly. It reported 73 files as NEARLINE and 269 as ONLINE_AND_NEARLINE from the beginning of the request until the end (18 hours later). This is clearly wrong: we know that all files were NEARLINE at the beginning of the test and that all were staged by the end, when the staging request finished successfully.
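A symptom like this one, a locality report that never changes during a staging request, can be detected automatically from the per-poll srmLs results. The following is a sketch under assumptions: each snapshot is taken to be a dictionary mapping SURL to the locality string, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def locality_counts(snapshot: Dict[str, str]) -> Counter:
    """Number of files per reported locality in one srmLs snapshot."""
    return Counter(snapshot.values())


def report_never_changed(snapshots: List[Dict[str, str]]) -> bool:
    """True if every snapshot is identical to the first one."""
    return all(s == snapshots[0] for s in snapshots[1:])
```

With 342 files, counts of 73 NEARLINE and 269 ONLINE_AND_NEARLINE that stay identical from the first poll to the last would make `report_never_changed` return True, which is suspicious for a request that completes successfully.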

This is the information we got from the site:

The NEARLINE files are present on the same poolgroup (pgroup-cms-xferout) which is used for outside transfers. The ONLINE_AND_NEARLINE files are present on 2 poolgroups: pgroup-cms-xferout and pgroup-cms-analysis which is used for local file access.

I think that we hit the "LHCb file access" bug. They reported the same problem and we applied a configuration work-around for them.

Some other minor comments:

  • A minor problem was observed due to connections being refused at the gSOAP level. Sporadically the srmLs request is rejected with this error:
GSI-gSOAP: Error reading token data: Connection reset by peer

Cannot submit Ls request
[SE][Ls] httpg://ccsrm.in2p3.fr:8443/srm/managerv2: CGSI-gSOAP: Error reading token data: Connection reset by peer

  • srmLs typically takes about 15 seconds, except for one call that took 4026 seconds (visible as the gap from 22000 s to 26000 s in the plot 'Locality of files vs time').
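Since the "Connection reset by peer" rejections are sporadic, one way to tolerate them is to retry the srmLs call a few times before giving up. This is only a sketch: `run_ls` is a hypothetical callable that performs one srmLs request and raises `RuntimeError` carrying the client's error text, and the number of attempts and the delay are illustrative values, not settings used in these tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Error text observed in the srmLs failures reported above.
TRANSIENT_MARKERS = ("Connection reset by peer",)


def ls_with_retry(
    run_ls: Callable[[], T],   # hypothetical: one srmLs request, raises RuntimeError on failure
    max_attempts: int = 3,     # illustrative value
    delay_s: float = 10.0,     # illustrative value
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry run_ls on transient errors; re-raise anything else immediately."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return run_ls()
        except RuntimeError as exc:
            if not any(marker in str(exc) for marker in TRANSIENT_MARKERS):
                raise
            last_error = exc
            if attempt < max_attempts:
                sleep(delay_s)
    raise last_error
```

A retry of this kind does not help with the single 4026-second call; that would need a timeout in the client itself.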

stagedGB vs time.IN2P3-4.png rateOfStagedData vs time.IN2P3-4.png

Done and online vs time.IN2P3-4.png

run FZK-1

The staging rate was extremely low. The first file was staged 51 minutes after the beginning of the staging request. After 3.9 hours only 3 files had been staged and the others were reported as 'pending'. At this point the request probably timed out, and the pending files (339 out of 342) changed their status from 'pending' to 'error'. We then stopped the test.

The behavior of srmLs is consistent with srmStatusOfBringOnline (the files reported as 'done' by srmStatusOfBringOnline appear to have a locality 'ONLINE'). No problem here.
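The consistency check described above can be sketched as a comparison of the two reports file by file. The dictionaries and the function name are hypothetical, and counting ONLINE_AND_NEARLINE as online alongside ONLINE is an assumption.

```python
from typing import Dict, List

# Assumption: both localities count as "online" for this comparison.
ONLINE_LOCALITIES = {"ONLINE", "ONLINE_AND_NEARLINE"}


def inconsistent_files(status: Dict[str, str], locality: Dict[str, str]) -> List[str]:
    """SURLs where 'done' in the bring-online status disagrees with the srmLs locality."""
    mismatches = []
    for surl, state in status.items():
        is_online = locality.get(surl) in ONLINE_LOCALITIES
        if (state == "done") != is_online:
            mismatches.append(surl)
    return sorted(mismatches)
```

An empty result corresponds to the situation seen in this run, where the two sources agree.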

run CNAF-3

Test run on 3051 files, total data about 7.3 TB. Very high staging rate, about 380 MB/s.

One error occurred: almost 3 hours after the beginning of the test (around 7 pm on May 28th), when half of the SURLs had been staged, one SURL changed from the 'pending' state to 'error'. The request token is 145803.
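As a rough cross-check of these numbers, the time needed to stage the sample at a constant rate can be estimated. The sketch assumes 1 GB = 1000 MB, that about 380 MB/s was sustained for the whole run, and that the files have similar sizes; none of this is stated explicitly on this page.

```python
def hours_to_stage(total_gb: float, rate_mb_s: float) -> float:
    """Hours needed to stage total_gb at a constant rate (assumes 1 GB = 1000 MB)."""
    if rate_mb_s <= 0:
        raise ValueError("rate must be positive")
    return total_gb * 1000.0 / rate_mb_s / 3600.0


full_run_h = hours_to_stage(7253, 380)      # whole CNAF-3 sample, total from the table
half_run_h = hours_to_stage(7253 / 2, 380)  # half of the data
```

Under these assumptions the full sample takes about 5.3 hours and half of it about 2.65 hours, which is roughly in line with half of the SURLs being staged after almost 3 hours.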

stagedGB vs time.CNAF-3.png rateOfStagedData vs time.CNAF-3.png

run IN2P3-5

Same conditions as run IN2P3-4, after applying the fix for the bug affecting srmLs. The fix did not work, so the test was stopped.

-- ElisaLanciotti - 28 May 2009

Topic revision: r8 - 2009-05-29 - ElisaLanciotti
 