A summary of the results of prestaging tests at different CMS sites
Tests run so far:
| Run name | Date | SURL list | Storage | Ls optimization | Space token | Total data (GB) | Polling time (min) |
| CNAF-1 | May 21 2009 | CNAF/surl_1_1TB.txt | CASTOR | no | - | 1100 | |
| CNAF-2 | May 22 2009 | CNAF/surl_1_1TB.txt | CASTOR | no | - | 1100 | |
| CNAF-3 | May 28 2009 | surl_day00_CNAF.txt | CASTOR | yes | - | 7253 | 20 |
| IN2P3-4 | May 27 2009 | surl_1_1TB-in2p3.txt | dCache | yes | - | 1129 | 5 |
| FZK-1 | May 27 2009 | surl_1_1TB-fzk.txt | dCache | yes | - | 1100 | 5 |
| IN2P3-5 | May 29 2009 | surl_1_1TB-in2p3.txt | dCache | yes | - | 1129 | 5 |
run CNAF-1
First official run at CNAF in the framework of the STEP09 testing activity.
Everything went fine. The amount of data brought online vs time is shown in the left picture. The staging rate (right figure) is good, and no errors were observed.
One comment: the bug affecting StatusOfBringOnline was not observed (see picture below). CNAF has CASTOR 2.1.7-17 and CASTOR SRM 2.7-15, a combination that is supposed to be affected by the bug. To be checked with the CASTOR developers.
Andrea has checked with Giuseppe Lo Presti: he says that with the versions of CASTOR and SRM installed at CNAF the bug is present, and it is only by chance that we have not observed it. In fact, due to the nature of this bug, if the status is reported correctly at the first polling, it will be reported correctly until the end of the request. Consequently, at CNAF the only safe way to check whether a file is actually staged is via srmLs, which puts a heavier load on the SRM server.
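The srmLs-based check described above can be sketched as a simple polling loop. This is only an illustration, not the script used in these tests: `get_locality` is a hypothetical callable standing in for whatever srmLs client is actually used, and the default interval of 300 s is an assumption chosen to match the 5-minute polling time listed in the table for some runs.

```python
import time

# File localities that mean a disk copy exists.
STAGED_LOCALITIES = {"ONLINE", "ONLINE_AND_NEARLINE"}


def poll_until_staged(surls, get_locality, interval_s=300, max_polls=100,
                      sleep=time.sleep):
    """Poll file locality until every SURL is on disk or max_polls is reached.

    get_locality is a hypothetical callable (surl -> locality string);
    it is not part of any real SRM client library.
    Returns (staged, pending) as two sets of SURLs.
    """
    pending = set(surls)
    staged = set()
    for poll in range(max_polls):
        # Only ask about files that are still pending, to limit SRM load.
        for surl in sorted(pending):
            if get_locality(surl) in STAGED_LOCALITIES:
                staged.add(surl)
        pending -= staged
        if not pending:
            break
        if poll < max_polls - 1:
            sleep(interval_s)
    return staged, pending
```

Querying only the still-pending SURLs at each poll is one way to reduce the extra load on the SRM server mentioned above.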
run CNAF-2
Same test repeated the day after.
Results are confirmed.
run IN2P3-4
Test run on 342 files.
The staging rate is very high, up to 500 MB/s (see plot below).
Important comment about srmLs:
A problem was observed with srmLs: the locality of files is reported incorrectly. It reports 73 files as NEARLINE and 269 as ONLINE_AND_NEARLINE from the beginning of the request until the end (18 hours later). This is clearly wrong, as we know that all the files were NEARLINE at the beginning of the test and all were staged at the end, when the staging request finished successfully.
This is the information we got from the site:
The NEARLINE files are present on the same poolgroup (pgroup-cms-xferout) which is used for outside transfers. The ONLINE_AND_NEARLINE files are present on 2 poolgroups: pgroup-cms-xferout and pgroup-cms-analysis which is used for local file access.
I think that we hit the "LHCb file access" bug. They reported the same problem and we applied a configuration work-around for them.
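The symptom reported for srmLs (locality counts that never change over the whole request) can be detected automatically. The sketch below is illustrative only: the function names are hypothetical, and the two snapshots are built from the counts quoted above (73 NEARLINE, 269 ONLINE_AND_NEARLINE), not from the actual srmLs output.

```python
from collections import Counter


def locality_counts(localities):
    """Count how many files are reported in each locality."""
    return Counter(localities)


def is_frozen(snapshots):
    """True if there are at least two snapshots and all have identical counts."""
    counts = [locality_counts(s) for s in snapshots]
    return len(counts) > 1 and all(c == counts[0] for c in counts)


# Illustrative data built from the counts quoted in the text.
first_poll = ["NEARLINE"] * 73 + ["ONLINE_AND_NEARLINE"] * 269
last_poll = list(first_poll)  # same report 18 hours later

print(sum(locality_counts(first_poll).values()))  # total files reported
print(is_frozen([first_poll, last_poll]))
```

The two counts add up to the 342 files of the test, so no file is missing from the report; only the locality values are suspect.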
Some other minor comments:
- A minor problem was observed due to refused connections at the gSOAP level. Sporadically, the srmLs request is rejected with this error:
GSI-gSOAP: Error reading token data: Connection reset by peer
Cannot submit Ls request
[SE][Ls] httpg://ccsrm.in2p3.fr:8443/srm/managerv2: CGSI-gSOAP: Error reading token data: Connection reset by peer
- The time taken by srmLs is about 15 seconds, except for one call that took 4026 seconds (you can see the gap in the plot 'Locality of files vs time' from 22000 s to 26000 s).
run FZK-1
The staging rate was extremely slow. The first file was staged 51 minutes after the beginning of the staging request. After 3.9 hours only 3 files had been staged and the others were reported as 'pending'. At this point the request probably timed out, and the pending requests (339 out of 342) changed their status from 'pending' to 'error'.
We stopped the test then.
The behavior of srmLs is consistent with srmStatusOfBringOnline (the files reported as 'done' by srmStatusOfBringOnline appear to have a locality 'ONLINE'). No problem here.
run CNAF-3
Test run on 3051 files, total data about 7.3 TB. Very high staging rate, about 380 MB/s.
One error occurred: almost 3 hours after the beginning of the test (around 7 pm on May 28th), when half of the SURLs had been staged, the status of one SURL changed from 'pending' to 'error'. Request token is 145803.
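As a rough cross-check of the CNAF-3 numbers, the time needed to stage the data at the quoted rate can be estimated. This is a back-of-the-envelope sketch that assumes a constant rate of 380 MB/s, 1 GB = 1000 MB, and files of roughly equal size (so that half of the SURLs corresponds to half of the data); none of these assumptions come from the test logs.

```python
def transfer_hours(total_gb, rate_mb_s):
    """Hours needed to move total_gb at a constant rate (1 GB = 1000 MB assumed)."""
    return total_gb * 1000 / rate_mb_s / 3600


full = transfer_hours(7253, 380)      # whole CNAF-3 sample
half = transfer_hours(7253 / 2, 380)  # half of the sample

print(f"full: {full:.1f} h, half: {half:.1f} h")  # full: 5.3 h, half: 2.7 h
```

Under these assumptions half of the data takes about 2.7 hours, which is compatible with half of the SURLs being staged almost 3 hours into the test.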
run IN2P3-5
Same conditions as run IN2P3-4, after applying the fix for the bug affecting srmLs. The fix did not work, so the test was stopped.
--
ElisaLanciotti - 28 May 2009