JC Daily Logs

Friday 22nd July

11:30

Transfers seemed to have stopped on some channels - turned out my myproxy cert had expired - oops! Now working again.
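Note for next time - a minimal check before suspecting the channels (assuming the standard Globus/MyProxy client tools are on the box; the server name and account below are placeholders):

grid-proxy-info -timeleft                 # seconds left on the local delegated proxy
myproxy-info -s myproxy.cern.ch -l sc3    # lifetime of the credential stored in MyProxy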

Thursday 21st July

18:30

Olof found the problem:
James, the sudden increase of jobs was due to the cleanup of the old ftpd processes this afternoon.
Unfortunately in parallel there was another problem with updating the stager catalogue DB with 
monitoring information because of an oracle session holding a lock on the filesystem table. The job 
scheduling requires access to that table because each new job is assigned an anticipated load. It
took us some time but we finally found the session and killed it. Jobs have now started to flow
again and the queue is slowly being emptied.

17:00

castor2 seems to have stopped sending files - SRM negotiation is OK, but gridftp hangs! Not good!

14:00

The Castor team understands the problem - some files are "stuck", and they have a manual cleanup procedure to make them work again.

11:19

Hmm, seems to have gotten worse - lots of FTPDs everywhere - many more than the queue length!!!
11:13:40[load-gen]lxplus004.$wassh -f ~/hosts/sc-wan "ps -ef | grep -c ftpd"
lxshare034d: 4
oplapro56: 351
oplapro57: 158
oplapro58: 8
oplapro59: 141
oplapro60: 295
oplapro61: 26
oplapro62: 13
oplapro63: 10
oplapro64: 10
oplapro71: 24
oplapro72: 104
oplapro73: 273
oplapro74: 14
oplapro75: 147
oplapro76: 10
oplapro77: 3
oplapro78: 17
oplapro79: 526

11:03

We always seem to have one or two machines (usually 61, 79) full of jobs. Investigating:
[root@castor2srv01 root]# bhosts -R wan
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
oplapro64          ok             30     21      7      7      0      0      0
oplapro72          ok             30     21      5      5      0      0      0
oplapro79          ok             30     21      1      1      0      0      0
oplapro59          ok             30     21      5      5      0      0      0
oplapro60          ok             30     21      7      7      0      0      0
oplapro58          ok             30     21      7      7      0      0      0
oplapro57          ok             30     21     11     11      0      0      0
lxshare034d        ok             30     10      3      3      0      0      0
oplapro74          ok             30     21     16     16      0      0      0
oplapro77          ok             30     21      4      4      0      0      0
oplapro76          ok             30     21     10     10      0      0      0
oplapro56          ok             30     21      8      8      0      0      0
oplapro78          ok             30     21     20     20      0      0      0
oplapro73          ok             30     21     17     17      0      0      0
oplapro62          ok             30     21      7      7      0      0      0
oplapro75          ok             30     21     10     10      0      0      0
oplapro63          ok             30     21      9      9      0      0      0
oplapro61          closed         30     21     21     21      0      0      0
oplapro71          closed         30     21     21     21      0      0      0

10:30

Started prep for tape test - cleared up all channels, and restarted FTS. Tested all channels (BNL, FNAL have problems). Started load-generators.

Current channel config in terms of number of concurrent files:

CHANNEL_NAME   NOFILES
BNL3-BNL2   10
CERN-ASCC   30
CERN-BNL   75
CERN-CERN   0
CERN-DESY   0
CERN-FNAL   0
CERN-GRIDKA   30
CERN-IN2P3   20
CERN-INFN   20
CERN-NDGF   35
CERN-PIC   24
CERN-RAL   24
CERN-SARA   20
CERN-TRIUMF   50

Wednesday 20th July

16:30

Upped FTS per-channel limit to 75 for Reda to test.

13:30

Ron sees that it is a problem with running out of concurrent channels in the gridftp server. Reported to Michael.
07/20 13:33:31
Cell(GFTP-rembrandt5-Unknown-287@gridftpdoor-rembrandt5Domain) :
CellAdapter: maximum number of data channels is reached!!! closing server socket

and

07/20 13:44:10
Cell(GFTP-rembrandt5-Unknown-315@gridftpdoor-rembrandt5Domain) : EOF Exception in read line : java.net.SocketException: Socket closed

11:30

Working with Ron to debug problems with SARA. Seems to be a problem on the dCache gridftp.

10:00

GDB...

Tuesday 19th July

16:30

Problem at PIC was due to an expiring CRL. Gonzalo fixed it at his end, and transfers started flowing again.

15:33

Problem on castor2 resolved:
A misconfiguration of a monitor sensor led to (false) alarms for all the diskservers of the phedex queues 
at Cern. As a consequence, these nodes were automatically disabled, at ~ 11:40 this morning.

The configuration was fixed at 14:10, the alarms disappeared, the machines automatically became available
 again, and jobs started running on them  at ~ 14:20.

13:56

Still have very low rate out of castor2 - checking for bad channels.
  • CERN-BNL : Set to 20 files from 75 (?!) - v. high rate of failure/timeouts at far end
  • CERN-INFN: Set to 10 files from 21 - high rate of put timeout
  • CERN-RAL: Shut down - 100% failure all morning on big files (odd - got a small file across no problem earlier)
  • CERN-PIC : high failure rate at dest end. No transfers in the last 2 hours - shutting down
Failed on SRM put: Failed to put File. Error in srm__put: SOAP-ENV:Cli
ent - CGSI-gSOAP: Error reading token data: Connection reset by peer

13:30

Looks like all the cms001 jobs are stopping in PENDING in the LSF queue. Notified castor-deployment.

11:30

Ben tracked the 'Success' failures down to the random number generator for the request ID not being random enough, so we were getting request ID clashes. And RFIO wasn't reporting the correct error.
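For illustration only (not the CASTOR code): IDs drawn from a small or poorly seeded space collide birthday-paradox fast. Even 1000 draws from bash's 15-bit $RANDOM already contain duplicates:

# count how many "request IDs" are duplicated among 1000 draws from a 32768-value space
for i in $(seq 1 1000); do echo $RANDOM; done | sort -n | uniq -d | wc -l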

10:00

Working with Gavin to get the FTS log analysis into cron. Some modifications to the scripts needed.

09:30

High RAL failure rate - they were doing a network fix - turned off the channel. Debugged ASCC - seems there is a problem with ntpd on their end. Mailed Hung-Che.

Monday 18th July

16:00

Tracking down some castor2 src failures. We often see failures like:
Failed to get File. Error in srm__get: SOAP-ENV:Client - Success
Debugged with Ben, passed on to castor-dev.

12:50

Cleared data from castor1 stager. Mapped all users to castor2 pools.

11:59

Added all level 2 support members to manager mapfile. Filled in all missing files in the 001 directory - should now be a complete set

11:52

GRIDVIEW has stopped - filled up a quota on a db account. The team is working to recover the data from the period missed.

10:42

Tracking down problems in the logs. NDGF have high rate (860/110) of errors like:
CERN-NDGFtransXX1DICEN:2005-07-18 08:35:17,472 [ERROR] - FINAL:TRANSPORT:Transfer failed. [globus_gass_copy_get
_url_mode]: globus_url_parse returned error code: -7 for url: gsiftp://dpmftp-nbi001.ndgf.org nbi.dk/dpmftp-nbi 
001:/data2/dteam001/2005-07-18/file-555-e7500563-271a-46de-b2ac-ff3d717f08ed.dat.44091.0
Source failures (file does not exist) :
file-043
file-155
file-249
file-316
file-367
file-374
file-377
file-382
file-413
file-416
file-439
file-486
file-546
file-592
file-713
file-717
file-874
file-888
file-892
file-902
file-914
file-918

RAL have errors (213/430) all on gftp0442:

FINAL:TRANSPORT:Transfer failed. the server sent an error response: 530 530 U
ser Name for GSI Identity/C=CH/O=CERN/OU=GRID/CN=James Casey 8576 not found.

SARA's SRM endpoint seems to be down (since last night) (2151/2220 failures)

[ERROR] - FINAL:SRM_DEST: Failed on SRM put: Failed to put File. Error in srm__put: SOAP-ENV:Client - CGSI-gSOAP:
 Could not open connection !; also failing to do 'advisoryDelete' on target.%

Saturday 16th July

17:40

Had a two hour problem with castor2 at ~1PM CET (due to a recurrence of the LSF plugin bug - bug fixed, but version not deployed yet).

Upped the per-channel file limit to 50, and changed the default num streams to 1 (from 2)

Friday 15 July

15:22

One node (gftp441) looks much slower (1-2MB/s) than the rest at RAL (~20-35MB/s). Reported to Derek.

14:10

Sonvane fixed the GRIDVIEW problem:
Hi James,
    There was a problem with gridview, the data which arrives late
    was not getting summarized and hence there was a fall in the
    throughput figures.

    I have regenerated the summaries, and figures are corrected.
    Have a look at gridview now.
    I will fix this today itself, immediately.

    The problem was known, we were fixing it in a systematic way,
    but the impact it was causing was underestimated.

Thanks,
-sonvane

13:57

Looks like the problems this morning were due to GRIDVIEW. With some reprocessing, we actually ran at 500MB/s. If only we could "reprocess" up to 1G :-)

13:05

We had an outstanding alarm for a full /tmp on oplapro64. Turned out it was full of proxies for cms002 - I cleaned up all the ones older than 60 minutes and also all the zero-length ones.

gridftp/srm processes were failing since they couldn't write a new proxy to disk. No bytes have been transferred on this node since midnight.
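For the record, the cleanup boils down to something like this (a sketch - I'm assuming the proxies follow the usual /tmp/x509up_* naming; adjust the pattern to whatever the SRM actually writes):

find /tmp -maxdepth 1 -name 'x509up_*' -mmin +60 -exec rm -f {} \;   # delegated proxies older than 60 minutes
find /tmp -maxdepth 1 -name 'x509up_*' -size 0 -exec rm -f {} \;     # zero-length ones left by failed writes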

12:30

Reda reports gridview is wrong! This is true - the output from the network display doesn't correlate...

08:00

Rate went down last night - but odd - external traffic in the cluster showed 400MB/s but the gridftp monitoring shows ~150MB/s! Perhaps a very high amount of traffic timing out?

Thursday 14 July

17:30

We had a problem with one of the SRMs (oplapro79). Not detected in LEMON - turned out that Jan had missed reconfiguring the monitoring on the node when updating the metrics. Now fixed.

Rate seems to have stabilized again at ~550-600MB/s from the WAN cluster.

16:06

Started GRIDKA again - down to 12 files, since they only have 2 pool nodes working.

15:50

Gonzalo reports that he had a full pool, and only 2 of the servers were still getting data. That probably explains why the rate was low! Setting back down to what should be optimal - 12 streams.

BNL are looking into their dCache problem ...

14:51

GRIDKA is going down to fix their network problem - requires a reboot to get at the BIOS menu. Shut down the channel.

14:30

Jan reset the sysctl settings - think we need to restart gridftp to pick them up.

Jean-Damien upped the number of slots per node in LSF to 18, since the eexec changes last night mean we're a bit more efficient in job starting, so we should have some more headroom.

Played around with number of parallel streams. Moving down to 2 streams helps NDGF, RAL and INFN (now > 10MB/s ) - a bit better for IN2P3 too. Bad for SARA, TRIUMF and FZK - leave them at 10.

Looks like it's not good for PIC too - should leave them at 10.

Think the bug with a high number of streams producing "data_write() failed: Handle not in the proper state" errors might be a glite-u-c thing. Will bug it!

12:43

Seems to be a discrepancy on some of our nodes for the TCP buffer size settings - although /etc/sysctl.conf says 2MB default, we're getting 4MB - FIO investigating:
[root@oplapro57 root]# sysctl -a | grep net| grep tcp_[rw]mem
net.ipv4.tcp_rmem = 262144      4194304 8388608
net.ipv4.tcp_wmem = 262144      4194304 8388608
[root@oplapro57 root]# grep tcp_[rw]mem /etc/sysctl.conf
net.ipv4.tcp_rmem = 262144 2097152 4194304
net.ipv4.tcp_wmem = 262144 2097152 4194304
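To reapply /etc/sysctl.conf and re-check across the cluster (a minimal sketch; note that processes with sockets already open keep their old buffer sizes, hence the gridftp restart mentioned in the 14:30 entry):

sysctl -p /etc/sysctl.conf                    # reload the values from the file
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem    # should now show the 2MB default
wassh -f ~/hosts/sc-wan "sysctl -p /etc/sysctl.conf && sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem"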

12:23

We have a large number of malloc errors - this is probably due to num_streams * num_allowed_transfers * buffer_size exceeding memory. I'm going to set the default number of streams down to 2 now. It can still be tuned up per channel in the load-generator.
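Back-of-the-envelope check (a sketch: 21 is the LSF MAX slots per node from the bhosts output above, 4MB is the observed TCP buffer default; whether the worst case actually exceeds what a node can take depends on its RAM):

echo "10 streams: $((10 * 21 * 4)) MB of socket buffers per node"   # = 840 MB
echo " 2 streams: $((2 * 21 * 4)) MB of socket buffers per node"    # = 168 MB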

11:45

Did some channel tuning according to some remarks from phonecon last night.
Here's what's in the FTS DB, and in gridview:

CHANNEL_NAME   NOFILES   NODES  BWIDTH           BY NODE   BY BWIDTH
CERN-BNL       80        5      622MB shared     40         8 (60MB/s)
CERN-INFN      10        9      1Gb ded          18        12 (100MB/s)
CERN-IN2P3     10        4      1Gb shared       24         8 (60MB/s)
CERN-GRIDKA    10        6      10Gb ded         24        24 (200MB/s)
CERN-NDGF      23        4      >3Gb ded (?)     24        18 (150MB/s)
CERN-PIC       30        4      1Gb shared       12        12 (100MB/s)
CERN-RAL       30        7      2Gb ded          28        24 (200MB/s)
CERN-SARA      32        5      10Gb ded         30        24 (200MB/s)
CERN-ASCC       5        1 (4)  2Gb ded           3 (12)   12 (100MB/s)
CERN-TRIUMF    45        4      1Gb ded          16        12 (100MB/s)

For dCache sites, we could start with 6 * (no. of nodes) with 10 streams, and for CASTOR sites, 3 * (no. of nodes) with 1 stream.

Obviously this doesn't concur with the max bandwidth on the channels, so the other approach is to assume ~10MB/s flows on average (this is achievable, at least for sites in Europe) and calculate like that - a rough helper along those lines is sketched below.
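A sizing helper combining the two rules above (a sketch only, not what FTS does; the per-node multipliers and the 10MB/s per-flow figure are the assumptions stated above):

# suggested concurrent files for a channel, given pool nodes, link capacity in MB/s and site type
suggest_nofiles() {
    nodes=$1; bwidth_mb=$2; type=$3                   # type: dcache or castor
    if [ "$type" = "dcache" ]; then by_node=$((6 * nodes)); else by_node=$((3 * nodes)); fi
    by_bw=$((bwidth_mb / 10))                         # assume ~10MB/s per flow
    if [ "$by_node" -lt "$by_bw" ]; then echo "$by_node"; else echo "$by_bw"; fi
}
suggest_nofiles 4 120 dcache    # e.g. a 4-node dCache site on a ~1Gb link -> min(24, 12) = 12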

10:30

Rate picked up a bit. Now at 500MB/s but not uniform across sites.

08:30

Rate went down during the night - from 700MB/s at 8pm to 350MB/s at 7am. Need to investigate

Wednesday 13 July

17:30

IN2P3 problems sorted. Now running again with 10 concurrent files. In total running at ~600MB/s

13:25

All seems to be ok - gridftp monitoring showing 700MB/s. And FTS showing good error rates for many sites
CHANNEL        COMPLETED   FAILED
CERN-ASCC             80        0
CERN-BNL             852       46
CERN-GRIDKA          403        2
CERN-INFN            358      238
CERN-NDGF           1756      433
CERN-PIC             657      146
CERN-RAL            1184       98
CERN-SARA           1931      273
CERN-TRIUMF          575       35

12:20

INFN still seem to have problems - phedex sees them too - lots of timeouts/failures on SRM operations - accounting for a ~30% error rate. Sent mail to the INFN SC team. Also pinged Jon Bakken about the problem I have writing to FNAL.

11:00

GRIDKA solved their problem with the pool nodes not being accessible - data started flowing again.

09:30

Did test of sites again - nearly all work.

Tuesday 12 July

18:00

Cleanup in progress of files from castor. This should allow traffic through again.

16:00

Problem with the buildup of processes understood - due to 'bad' files in the system sending castor to tape looking for them, but they're disk-only!

13:30

Saw a large number of gridftp processes building up on oplapro nodes (>130 in some cases). Linked to timeouts/bad files perhaps? Forwarded to castor-dev

13:00

Gavin discovered the timeout problem - we need to manually add the timeouts to the FTS config file, since the gLite config won't let us specify them - he rebuilt the file to fix a different problem, and we lost the timeouts ;(

12:00

Turned off BNL since all transfers were timing out.

10:50

Tested all the channels with small files, and then started up the system for working channels

8:25

Transfers drained off during the night. oplapro* still in a bad way after yesterday - many alarms still active, some machines not possible to log in to. Vlado will reboot all.

Monday 11 July

19:50

CMS people have been told to restart their transfers ... Set all the FTS channels to Active and that's it for this evening !!!

19:20

Jean-Damien finds the problem:
I am repairing the trouble introduced (a change of castorscheduler:/etc/castor/castor.conf
that appeared to be fatal for the current code - I will make the scheduler plugin more
fault tolerant).

16:20

Looks like a problem with castor2. Leaving FTS shut down until it's resolved.

17:30

Transfer cluster going crazy ! Load all over 2, but no transfers. Shut down FTS completely on lxshare026d.

17:00

Shut down BNL - it's been failing now for an hour...

15:23

Moving oplapro76 out of the cluster in order to repair its disk. J-D is doing a disk2disk copy of those files which will be lost (need to destructively rebuild the partition).

14:20

Gavin, while trying to fix the password file problem on lxshare026d, killed all the transfers - finally worked out the directory permissions. Should be OK now.

12:47

GRIDKA up and running again - Doris fixed their problem. Turned them on again in the FTS.

Moved Gonzalo over to castor2. Michael has problems when mapped as cms002 doing srmCp's. Looking into it.

10:50

Added Michael to the grid-mapfile as cms002 to test that he can now access both WAN and PHEDEX files. If this works, I'll switch all users.

Problem with FTS - seems like the sc3 user got deleted from the passwd file, and the daemon dumped core.

I still can't transfer to IN2P3 - Lionel is able to do srmcp ok - he's looking into it.

09:00

Rate went back up at ~10:30 last night to 600MB/s with no extra sites added. Problem at ~11PM with FTS stopping - down to a leftover log file which stopped the watchdog from starting it again. David/Roberto are now tracking down why the gridftp logs didn't work, and getting the procedures on how to restart it from the dev team.
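That failure mode (the watchdog refusing to restart because a stale file still exists) is the classic stale-lockfile problem; the usual fix is to key the decision on a pidfile plus a liveness check rather than on the mere existence of a file. A sketch of that pattern, not the actual FTS watchdog (paths and script names are placeholders):

PIDFILE=/var/run/fts-agent.pid
if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    echo "agent still running, nothing to do"
else
    rm -f "$PIDFILE"                  # stale - clear it instead of refusing to start
    /path/to/start-fts-agent &        # hypothetical start script
    echo $! > "$PIDFILE"
fi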

Sunday 10 July

17:30

ASCC stopped accepting data at 14:30. Turned the channel off.

12:33

Released new version of FtsLoadGenerator (v1.0.3). Fixed up some missing files that were causing lots of errors at SARA (file-055.dat - 75/130 errors). Updated the veto list in the load-gen.
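The veto check itself is just a set difference; a minimal sketch of the idea (file names are placeholders - this is not the FtsLoadGenerator code):

# drop any candidate file that appears (as an exact line) in the veto list
grep -vFxf veto-list.txt candidate-files.txt > usable-files.txt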

11:30

Site diagnostics:

RAL has a problem with accepting gsiftp protocol:

2005-07-10 09:16:33,230 [DEBUG] - Performing Call to method requestTurlFromSurl
2005-07-10 09:16:33,230 [DEBUG] - Performing Call to method srm__put
2005-07-10 09:17:07,174 [INFO ] - SRM put request ID: -2147013950
2005-07-10 09:17:07,174 [DEBUG] - Call completed to srm__put
...
2005-07-10 09:17:36,507 [DEBUG] - Performing Call to method finishRequest
2005-07-10 09:17:36,507 [ERROR] - Failed SRM put call. Error is
RequestFileStatus#-2147013949 failed with error:[ can not prepare to put : org.dcache.srm.SRMExcept
ion: can not find sutable put protocol]

GRIDKA: looks like the one 'working' gridftp server is now no longer running:

11:35:06[plots]lxshare026$edg-gridftp-ls gsiftp://f01-015-103-e.gridka.de:2811//pnfs/gridka.de/sc3/dteam/2005-07-10
error a system call failed (Connection refused)
error a system call failed (Connection refused)
11:35:17[plots]lxshare026$edg-gridftp-ls gsiftp://f01-015-102-e.gridka.de:2811//pnfs/gridka.de/sc3/dteam/2005-07-10
error the server sent an error response: 451 451 //pnfs/gridka.de/sc3/dteam/2005-07-10  not found

error the server sent an error response: 451 451 //pnfs/gridka.de/sc3/dteam/2005-07-10  not found

11:35:21[plots]lxshare026$edg-gridftp-ls gsiftp://f01-015-101-e.gridka.de:2811//pnfs/gridka.de/sc3/dteam/2005-07-10
error the server sent an error response: 451 451 //pnfs/gridka.de/sc3/dteam/2005-07-10  not found

error the server sent an error response: 451 451 //pnfs/gridka.de/sc3/dteam/2005-07-10  not found

PIC's SRM has gone away:

2005-07-10 09:16:34,116 [DEBUG] - Performing Call to method requestTurlFromSurl
2005-07-10 09:16:34,117 [DEBUG] - Performing Call to method srm__put
2005-07-10 09:16:49,538 [ERROR] - Failed to put File srm://castorsrmsc.pic.es:8443/srm/managerv1?SF
N=/castor/pic.es/sc3/scratch/2005-07-09/file-051-211c3841-56b4-4f1a-a900-168ca79cef4e.dat
2005-07-10 09:16:49,538 [ERROR] - Failed to put File. Error in srm__put: SOAP-ENV:Client - CGSI-gSO
AP: Could not open connection !

11:15

Shut down RAL, PIC and GRIDKA - 100% failure rates for the last 3 hours ;( Now running at ~550MB/s with ASCC, BNL, NDGF, SARA, TRIUMF.

11:28:04[bin]lxshare026$./show-channel-state
   CERN-ASCC   Active    106
    CERN-BNL   Active    214
   CERN-CERN   Active      0
   CERN-DESY Inactive      0
   CERN-FNAL Inactive      0
 CERN-GRIDKA Inactive      0
  CERN-IN2P3 Inactive      0
   CERN-INFN Inactive      0
   CERN-NDGF   Active      3
    CERN-PIC Inactive     68
    CERN-RAL Inactive     40
   CERN-SARA   Active     61
 CERN-TRIUMF   Active     35

CHANNEL        COMPLETED   FAILED
CERN-ASCC            322        5
CERN-BNL              80       25
CERN-CERN              0        0
CERN-GRIDKA            0      505
CERN-IN2P3             0        0
CERN-INFN              0        0
CERN-NDGF             69        3
CERN-PIC               0     1702
CERN-RAL               0      586
CERN-SARA           1192      106
CERN-TRIUMF           99       39

10:00

RAL, TRIUMF, GRIDKA exhausted their queues (the refill isn't running in crontab - need an acrontab set up on lxshare026d)... One disk has an I/O error - oplapro76 on file-617.
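For reference, something like this in the acrontab should keep the queues topped up (a sketch only - the refill command is a placeholder, and I'm assuming acrontab takes the usual crontab time fields plus a target-host field):

# min hour dom mon dow  host          command
0,30  *    *   *   *    lxshare026d   /path/to/fts-load-gen-refill >> /tmp/load-gen-refill.log 2>&1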

Saturday 09 July

23:30

Filled up the queues for the overnight run... should be enough to keep all working channels loaded 'til morning.

20:00

Rate started dropping to ~500MB/s - no idea why! Took the opportunity to expand the load-gen to read from 1000 files instead of 70. Should balance stuff. Added a veto list to the load-gen to avoid bad files.

Ron found the problem with SARA - one pool node wasn't doing so well. Restarted and upped the logging level. Turned the channel back on...

19:20

Still a good rate - ~700MB/s. Getting gridftp's stacked up on machines, mostly for files that are causing trouble since they were badly written, e.g. oplapro62/file-043. Manually killed those processes - need a way to monitor/automate this cleanup.
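A rough automation of that cleanup (a sketch; the 2-hour threshold is a guess, and the PID list should be reviewed before piping it to kill or wiring it into wassh/cron):

# print PIDs of ftpd processes that have been running for more than 2 hours
ps -eo pid,etime,comm | awk '$3 ~ /ftpd/ {
    n = split($2, t, /[-:]/)                          # etime is [[dd-]hh:]mm:ss
    if      (n == 4) s = ((t[1]*24 + t[2])*60 + t[3])*60 + t[4]
    else if (n == 3) s = (t[1]*60 + t[2])*60 + t[3]
    else             s = t[1]*60 + t[2]
    if (s > 7200) print $1
}'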

Turned off CERN-NDGF due to a high failure rate. Lots of SRM put errors and also errors in the gridftp transfer:

FINAL:TRANSPORT:Transfer failed. the server sent an error response: 553 55
3 dpmftp-nsc001:/storage4/dteam001/2005-07-09/file-053-c71164c9-bdcb-4037-8b0b-ddef46288c31.dat.1204.0: Host
 not known.

High error rate on GRIDKA - looks like the PNFS namespace isn't coherent across the three pool nodes so we get 60% failures:

19:29:34[jamesc]lxplus064.$edg-gridftp-ls gsiftp://f01-015-102-e.gridka.de:2811/pnfs/gridka.de/
19:29:38[jamesc]lxplus064.$edg-gridftp-ls gsiftp://f01-015-101-e.gridka.de:2811/pnfs/gridka.de/
19:29:42[jamesc]lxplus064.$edg-gridftp-ls gsiftp://f01-015-103-e.gridka.de:2811/pnfs/gridka.de/
gridka.de
desy
sc3
atlas001
test1
alice
babar001
cms001
{LOG}
dteam001
cms
cdf001
dzero001
data
alice001
compass
lhcb001
cdf
usr

18:30

Some tuning of the channels - mostly upping the transatlantic ones to 20 concurrent files! Now we've been running at ~700MB/s out of the WAN cluster for 2 hours. Also put on the rest of the 1TB data set. Some errors - need to work through which ones are actually bad and try and recover/remove them.

Note, we've also got ~200MB/s out of the castor1 cluster, mostly to FNAL/PIC via phedex. And there's about 400MB/s on the display for incoming traffic - some will be the upload of the 1TB data set, and some is internode traffic (~50MB/s for PIC's direct gridftp downloads from castorgridsc) - the rest seems correlated with the outgoing rate.

16:04

Started testing all the sites again - INFN, SARA, FNAL & IN2P3 not working through FTS, but FNAL is probably doing Phedex stuff (more details on site status in ServiceChallengeThreeProgress). Still a problem with auto-directory creation on GRIDKA via SRM, but it works via gridftp, so I used that as a workaround. Started transfers to the rest of the sites (ASCC, BNL, GRIDKA, NDGF, PIC, RAL, TRIUMF). 580MB/s out of the cluster - some is bounced castor1 traffic to PIC via phedex. There's an additional 100MB/s to FNAL from castor1 too.

12:00

J-D says that there are some problems with the stager DB. Waiting on him to clean up. Flushed all the jobs in the channels and set them all to Inactive. Will test each when it comes up again, and then set off a sample load with the load generator.

Good news is that PIC and NDGF both managed some transfers with FTS last night! NDGF is running LCG DPM.

Friday 08 Jul

22:20

Had to move DESY back to the WAN serviceclass - SRM copy doesn't work (error in dest SURL format), and direct gridftp is causing transfers to go through the WAN nodes, not the phedex nodes. Turned Michael back onto the WAN pools until we sort it. J-D also saw jobs getting stuck in the scheduler. He killed them - I killed the corresponding castor-gridftp processes.

castor1 pools are not doing well - way too many transfers:

21:33:02[jamesc]lxplus058.$wassh -f hosts/sc-all "ps -ef | grep -c ftpd"
lxshare025d: 3
lxshare027d: 6
lxshare028d: 16
lxshare030d: 18
lxshare031d: 19
lxshare032d: 11
lxshare033d: 11
lxshare034d: 17
lxshare214d: 3
lxshare215d: 3
lxshare216d: 3
lxshare217d: 3
lxshare218d: 3
lxshare219d: 3
lxshare220d: 3
lxfsrm464: 4
FAIL lxfsrm505: SSH daemon not responding
lxfsrm522: 13
oplapro56: 7
oplapro57: 5
oplapro58: 3
oplapro59: 7
oplapro60: 10
oplapro61: 10
oplapro62: 9
oplapro63: 6
oplapro64: 6
oplapro71: 6
oplapro72: 8
oplapro73: 4
oplapro74: 11
oplapro75: 4
oplapro76: 7
oplapro77: 8
oplapro78: 4
oplapro79: 5

19:29

Moved NDGF to castor2. Moved DESY to the cms pools on castor2. Very high load on the castor1 pools - 3 out of the 9 nodes don't even respond! The rest have > 15 ftpds going on!
19:28:02[jamesc]lxplus059.$wassh -f hosts/sc-all "ps -ef |grep -c ftpd"
lxshare025d: 3
lxshare027d: 6
lxshare028d: 22
lxshare030d: 39
FAIL lxshare031d: SSH daemon not responding
lxshare032d: 20
FAIL lxshare033d: SSH daemon not responding
lxshare034d: 16
lxshare214d: 3
lxshare215d: 3
lxshare216d: 3
lxshare217d: 3
lxshare218d: 3
lxshare219d: 3
lxshare220d: 3
lxfsrm464: 19
lxfsrm505: 27
lxfsrm522: 20
oplapro56: 6
oplapro57: 6
oplapro58: 3
oplapro59: 4
oplapro60: 14
oplapro61: 8
oplapro62: 8
oplapro63: 5
oplapro64: 10
oplapro71: 6
oplapro72: 8
oplapro73: 7
oplapro74: 10
oplapro75: 5
oplapro76: 6
oplapro77: 10
oplapro78: 3
oplapro79: 8
The IA64's seem to have no problems with similar loads:

19:26:33[jamesc]lxplus059.$wassh -f hosts/sc-all uptime
lxshare025d:  19:27:02  up 2 days,  1:23,  1 user,  load average: 0.03, 0.01,  0.00
lxshare027d:  19:27:01  up 14 days,  7:22,  3 users,  load average: 0.04, 0.03, 0.00
lxshare028d:  19:27:04  up 17 days,  2:04,  0 users,  load average: 2.63, 1.24, 1.06
FAIL lxshare030d: timeout
FAIL lxshare031d: timeout
lxshare032d:  19:27:03  up 17 days,  2:04,  0 users,  load average: 5.42, 7.80, 8.39
FAIL lxshare033d: timeout
lxshare034d:  19:27:03  up 17 days,  2:04,  0 users,  load average: 1.55, 1.27, 1.72
lxshare214d:  19:27:03  up 7 days, 58 min,  1 user,  load average: 0.00, 0.00, 0.00
lxshare215d:  19:27:02  up 1 day,  2:38,  1 user,  load average: 0.00, 0.00, 0.00
lxshare216d:  19:27:03  up 7 days, 57 min,  1 user,  load average: 0.09, 0.05, 0.01
lxshare217d:  19:27:03  up 7 days, 57 min,  1 user,  load average: 0.00, 0.00, 0.00
lxshare218d:  19:27:02  up 7 days, 57 min,  1 user,  load average: 0.02, 0.01, 0.00
lxshare219d:  19:27:03  up 7 days, 57 min,  2 users,  load average: 0.03, 0.02, 0.00
lxshare220d:  19:27:03  up 7 days, 57 min,  2 users,  load average: 0.03, 0.01, 0.00
lxfsrm464:  19:27:04  up 17 days,  2:04,  0 users,  load average: 0.50, 0.72, 0.83
lxfsrm505:  19:27:05  up 17 days,  2:10,  0 users,  load average: 2.13, 3.38, 3.87
lxfsrm522:  19:27:04  up 17 days,  2:11,  0 users,  load average: 1.50, 2.95, 2.20
oplapro56:  19:27:03  up 4 days,  2:49,  1 user,  load average: 0.88, 0.95, 0.79
oplapro57:  19:27:03  up 4 days,  2:49,  1 user,  load average: 1.07, 0.75, 0.61
oplapro58:  19:27:03  up 4 days,  2:49,  1 user,  load average: 0.34, 0.45, 0.49
oplapro59:  19:27:03  up 4 days,  2:49,  1 user,  load average: 0.52, 0.48, 0.43
oplapro60:  19:27:02  up 4 days,  2:49,  1 user,  load average: 1.54, 1.23, 1.00
oplapro61:  19:27:03  up 4 days,  2:49,  2 users,  load average: 0.34, 0.41, 0.33
oplapro62:  19:27:03  up 4 days,  2:49,  1 user,  load average: 0.83, 0.53, 0.48
oplapro63:  19:27:03  up 4 days,  2:49,  1 user,  load average: 1.08, 0.60, 0.40
oplapro64:  19:27:02  up 3 days,  9:12,  1 user,  load average: 0.42, 0.72, 0.93
oplapro71:  19:27:02  up 2 days, 21:30,  1 user,  load average: 0.93, 0.74, 0.45
oplapro72:  19:27:03  up 2 days, 21:30,  1 user,  load average: 0.80, 0.79, 0.73
oplapro73:  19:27:03  up 2 days, 21:30,  1 user,  load average: 0.82, 0.56, 0.44
oplapro74:  19:27:03  up 2 days, 21:30,  1 user,  load average: 0.77, 0.60, 0.43
oplapro75:  19:27:04  up 2 days, 21:30,  1 user,  load average: 1.66, 1.29, 0.80
oplapro76:  19:27:02  up 2 days, 21:30,  1 user,  load average: 0.65, 0.61, 0.39
oplapro77:  19:27:03  up 2 days, 21:30,  1 user,  load average: 1.82, 1.23, 0.94
oplapro78:  19:27:03  up 2 days, 21:30,  2 users,  load average: 0.78, 0.70, 0.53
oplapro79:  19:27:03  up 2 days, 21:30,  3 users,  load average: 0.68, 0.86, 0.81

17:30

Ben found the problem with the request spool - it was due to writing a data structure to disk for the request file, and the struct layout is different on IA32/IA64. Worked around it by moving all the SRMs to IA64 :) Will have a fix next week.

Fixed some broken files for Michael (file-014).

14:00

While investigating high levels of failures at INFN, noticed a problem with our SRM (probably the IA64 port). We sometimes see entries like:
2005-07-08 09:28:41,197 [DEBUG] - Performing Call to method srm__setFileStatus
2005-07-08 09:28:41,596 [ERROR] - Cannot Contact SRM Service. Error in srm__setFileStatus: SOAP-ENV:Clie
nt - RequestRegister.c:468 internal error when parsing request record from lxfsrm528.cern.ch:/tmp/0x1893
32a0 (errno=0, serrno=0)
2005-07-08 09:28:41,597 [ERROR] - Cannot Contact SRM Service. Error in srm__setFileStatus: SOAP-ENV:Clie
nt - RequestRegister.c:468 internal error when parsing request record from lxfsrm528.cern.ch:/tmp/0x1893
32a0 (errno=0, serrno=0)

This involves corrupted request files:

[root@lxshare030d root]#  ~jamesc/bin/castorSrmDumpRequest lxfsrm528.cern.ch:/tmp 0x189332a0
Check request: 0x189332a0
----> RequestStatus:
      Request ID: 0x189332a0
      Request type: (null)
      Request state: @
      Request submitted:
      Request started:
      Request finished:
      Request error message: (null)
      Request estimate to start: 87936
Passed on to castor-dev for investigation.

12:12

Tracked down the RAL problem - my bad in the load-generator - oops. Now running some transfers again.

Moved DESY to the castor2 setup. Jean-Damien turned on replication for the WAN serviceclass, set to 3.

11:20

Transfers running better now. oplapro cluster serving data at ~700MB/s, but some is probably iperf? Maybe not. Gridftp logs still broken, so hard to correlate.

Tested GRIDKA - still have a problem with no pool available

%FINAL:SRM_DONE_DEST: failing to do 'advisoryDelete' on target SRM
FINAL:TRANSPORT:Transfer failed. the server sent an error response: 425 425 Cannot open port: java.lang.Exception: Pool manager error: No write pools configured for <dc_sc3:SC3@osm>

Derek is looking into RAL TURL problem - he doesn't have a gridftp on the head node.

Rewriting the three broken files in the new data set.

08:50

Died last night with a castor internal error loading a .so. Jean-Damien fixed it this morning, and performance jumped up to > ~2Gb/s. Moved SARA over to castor2 (added Ron's DN to the local mapfile).

00:30

Left running overnight - not particularly optimal transfers, but a good range of transfers to a good range of sites are running. Will debug in the morning (my morning).

00:00

Configured and turned on the gridftp monitors on oplapro71-79. Seeing some strange entries in R-GMA:
| oplapro74.cern.ch | Rembrandt8.tier1.sara.nl | 10       | 12.7773       | 2097152             | 22:01:58        |
| oplapro71.cern.ch | oplapro71.cern.ch        | 1        | 46.5378       | 6917529027643179008 | 22:01:58        |
| oplapro72.cern.ch | srm1.tier1.sara.nl       | 10       | 9.493         | 2097152             | 22:01:58        |
Need to investigate...

Added in more admin users for each site. Currently it's

# CERN
"/C=CH/O=CERN/OU=GRID/CN=Gavin Mccance 1838" CERN
"/C=CH/O=CERN/OU=GRID/CN=James Casey 8576" CERN
"/C=CH/O=CERN/OU=GRID/CN=host/lxshare026d.cern.ch" null

# ASCC
"/C=TW/O=AS/OU=PHYS/CN=Di Qing/Email=Di.Qing@cern.ch" ASCC
"/C=TW/O=AS/OU=CC/CN=Chih-Chiang Chang/Email=ccchang@beta.wsl.sinica.edu.tw" ASCC

# BNL
"/DC=org/DC=doegrids/OU=People/CN=Wensheng Deng 90806" BNL
"/DC=org/DC=doegrids/OU=People/CN=Dantong Yu 127718" BNL
"/DC=org/DC=doegrids/OU=People/CN=Dantong Yu 542086" BNL

# DESY
"/DC=org/DC=doegrids/OU=People/CN=Michael Ernst 487976" DESY

# FNAL
"/DC=org/DC=doegrids/OU=People/CN=Ian Fisk 839452" FNAL
"/DC=org/DC=doegrids/OU=People/CN=Jon A Bakken 644181" FNAL

# GRIDKA
"/O=GermanGrid/OU=FZK/CN=Doris Ressmann" GRIDKA
"/O=GermanGrid/OU=FZK/CN=Bruno Hoeft" GRIDKA

# IN2P3
"/C=FR/O=CNRS/OU=CC-LYON/CN=Lionel Schwarz/Email=lionel.schwarz@in2p3.fr" IN2P3

# INFN
"/C=IT/O=INFN/OU=Personal Certificate/L=CNAF/CN=Giuseppe Lo Re/Email=giuseppe.lore@cnaf.infn.it" INFN

# NDGF
"/O=Grid/O=NorduGrid/OU=nsc.liu.se/CN=Leif Nixon" NDGF
"/O=Grid/O=NorduGrid/OU=nbi.dk/CN=Anders Waananen" NDGF
"/O=Grid/O=NorduGrid/OU=pdc.kth.se/CN=lars malinowsky" NDGF

# PIC
"/C=ES/O=DATAGRID-ES/O=PIC/CN=Francisco Martinez" PIC
"/C=ES/O=DATAGRID-ES/O=PIC/CN=Gonzalo Merino" PIC
"/C=ES/O=DATAGRID-ES/O=PIC/CN=Gonzalo Merino DTEAM" PIC

# RAL
"/C=UK/O=eScience/OU=CLRC/L=RAL/CN=derek ross" RAL
"/C=UK/O=eScience/OU=CLRC/L=RAL/CN=martin bly" RAL
"/C=UK/O=eScience/OU=CLRC/L=RAL/CN=steve traylen" RAL

# SARA
"/O=dutchgrid/O=users/O=sara/CN=Alexander Theodorus Verkooijen" SARA
"/O=dutchgrid/O=users/O=sara/CN=Fokke Dijkstra" SARA
"/O=dutchgrid/O=users/O=sara/CN=Jules Wolfrat" SARA
"/O=dutchgrid/O=users/O=sara/CN=Mark van de Sanden" SARA
"/O=dutchgrid/O=users/O=sara/CN=Ron Trompert" SARA
"/O=dutchgrid/O=users/O=nikhef/CN=Kors Bos (dteam)" SARA
"/O=dutchgrid/O=users/O=nikhef/CN=Sander Klous" SARA

# TRIUMF
"/C=CA/O=Grid/OU=westgrid.ca/CN=Reda Tafirout_13/Email=tafirout@triumf.ca" TRIUMF

Thursday 07 Jul

23:20

BNL works too. Transfers generally seem to be flowing - if slowly. Have backed off the number of concurrent transfers to see if congestion is the problem. Transfers ongoing to 7 sites @ ~ 1.3Gb/s

A common problem popping up in the logs is :


2005-07-07 21:22:34,441 [ERROR] - FINAL:TRANSPORT:Transfer failed. globus_ftp_control_connect: globus_libc_gethostbyname_r failed%
FINAL:TRANSPORT:Transfer failed. globus_ftp_control_connect: globus_libc_gethostbyname_r failed%

22:00

PIC also works - started scheduling some transfers there. Now have SARA, RAL, PIC and TRIUMF using SRM. ASCC works using gridftp. INFN seems to refuse many SRM connections, but works with gridftp. GRIDKA still down for maintenance.

19:00

New bugfix version for SRM IA64 seems to work ok.

18:00

Asked for all sites to shut down transfers. Fleshed out the dataset (2 bad files - 31 & 33). Started transfers to RAL and SARA - 1Gb/s each straight off. Noticed a problem with castor-gridftp and an error that occurs when using 20 streams. Solved by using 10 streams instead:
2005-07-07 21:24:18,856 [ERROR] - FINAL:TRANSPORT:Transfer failed. the server sent an error response: 426 426 Data connection. data_write() failed: Handle not in the proper state%
FINAL:TRANSPORT:Transfer failed. the server sent an error response: 426 426 Data connection. data_write() failed: Handle not in the proper state%

17:27

IA64 problem solved by Ben - it's building

Peter has fixed the FTS put problem too. Won't QF it, since it's a corner case. Still to look at why it kept retrying forever...

16:45

Copy of 290 2KB files (castor2/s00). Much better - all 290 finished at the end, with 4 intermittent errors:

All errors were on the dest SRM put...

 Performing Call to method srm__put
2005-07-07 14:44:27,948 [ERROR] - Failed to put File srm://castorgridsc.cern.ch:8443/srm/managerv1?SFN=/castor/cern.ch/grid/d
team/storage/transfer-test/castor2/s00/file-099.dat
2005-07-07 14:44:27,948 [ERROR] - Failed to put File. Error in srm__put: SOAP-ENV:Client - Success

SRM Req Id's of failed jobs for investigation: 405596606, 405606940, 405573117, 405579506

15:47

FTS has run amok! It seems to be copying files a lot of times - ~10 - and some files it keeps retransmitting!
[root@lxshare026d CERN-CERNfailed]#  grep 001/file-046.dat *|cut -f 1 -d:|sort -u|xargs ls -lst
   8 -rw-------    1 sc3      localboy     5223 Jul  7 15:30 CERN-CERNtransXXjVzzJK
   8 -rw-------    1 sc3      localboy     5871 Jul  7 15:20 CERN-CERNtransXXnlZfzU
   8 -rw-------    1 sc3      localboy     6240 Jul  7 15:09 CERN-CERNtransXX1FcdnL
   8 -rw-------    1 sc3      localboy     5901 Jul  7 15:09 CERN-CERNtransXXz1G0WV
   8 -rw-------    1 sc3      localboy     6240 Jul  7 15:02 CERN-CERNtransXXvBDMFl
   8 -rw-------    1 sc3      localboy     6240 Jul  7 14:59 CERN-CERNtransXXTEa18E
   8 -rw-------    1 sc3      localboy     5902 Jul  7 14:58 CERN-CERNtransXXRV1C4B
   8 -rw-------    1 sc3      localboy     6240 Jul  7 14:51 CERN-CERNtransXXfRn9cV
   8 -rw-------    1 sc3      localboy     5901 Jul  7 14:47 CERN-CERNtransXXThFuQT
   8 -rw-------    1 sc3      localboy     6240 Jul  7 14:47 CERN-CERNtransXXPFmx0y
   8 -rw-------    1 sc3      localboy     6240 Jul  7 14:40 CERN-CERNtransXXf8tMJO
   8 -rw-------    1 sc3      localboy     6240 Jul  7 14:38 CERN-CERNtransXXhuzUTk
   4 -rw-------    1 sc3      localboy     1249 Jul  7 14:30 CERN-CERNtransXXdKnJHr

Now going to try and copy 300 very small files in - at least it will fail faster ;(

14:51

Transfers now not going so well - some problem with g-u-c and SRM where it gets a TURL to put, but then fails - the cleanup doesn't seem to work well, and I end up with lots of 0 byte files and EBUSYs!!!!

14:40

SRM transfers going OK - a bit of loss, but it seems to happen when the SRM times out at an inopportune operation that g-u-c cannot/does not recover from.

Ran into the lockf() bug on IA64 again! Mailed castor-dev with the solution:

14:37:55[jamesc]lxshare030$srm_testget srm://oplapro56.cern.ch:8443/srm/managerv1?SFN=/castor/cern.ch/grid/dteam/storage/transfer-test/castor2/001/file-005.dat
SOAP FAULT: SOAP-ENV:Client
"RequestRegister.c:150 cannot lock request record, lockf() failed: Bad file descriptor
 (errno=9, serrno=0)"
HTTP/1.1 500 Internal Server Error
Server: gSOAP/2.3
Content-Type: text/xml; charset=utf-8
Content-Length: 747
Connection: close

<?xml version="1.0" encoding="UTF-8"?><SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:ns11="http://www.themindelectric.com/package/diskCacheV111.srm/" xmlns:tns="http://tempuri.org/diskCacheV111.srm.server.SRMServerV1"><SOAP-ENV:Body SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" id="_0"><SOAP-ENV:Fault><faultcode>SOAP-ENV:Client</faultcode><faultstring>RequestRegister.c:150 cannot lock request record, lockf() failed: Bad file descriptor&#xA; (errno=9, serrno=0)</faultstring></SOAP-ENV:Fault></SOAP-ENV:Body></SOAP-ENV:Envelope>
** HERE **

14:05

Pool entries understood - CMS files that need GC'ing. The first 10 files went in - and the next 10 via FTS in a batch worked too. Submitted the final 280 files.

13:50

Started up again to transfer files into the castor2 pools. Had to kill off all the stuck dteam002 SRM processes first. Old entries with 0 bytes still give EBUSY. Seems the disk pools on the oplapro nodes are already filled up:
oplapro56       0             /shift/oplapro56/data01/        Available           28.00k          68.33G          0.00% free
oplapro56       0             /shift/oplapro56/data02/        Available           40.00k          68.33G          0.00% free
oplapro57       0             /shift/oplapro57/data01/        Available           36.00k          68.33G          0.00% free
oplapro57       0             /shift/oplapro57/data02/        Available            6.64G          68.33G          9.72% free
oplapro58       0             /shift/oplapro58/data01/        Available            2.43G          68.33G          3.56% free
oplapro58       0             /shift/oplapro58/data02/        Available           21.26G          68.33G         31.11% free
oplapro59       0             /shift/oplapro59/data01/        Available          604.81M          68.33G          0.86% free
oplapro59       0             /shift/oplapro59/data02/        Available           12.37G          68.33G         18.10% free
oplapro60       0             /shift/oplapro60/data01/        Available           44.00k          68.33G          0.00% free
oplapro60       0             /shift/oplapro60/data02/        Available           24.00k          68.33G          0.00% free
oplapro61       0             /shift/oplapro61/data01/        Available           44.00k          68.33G          0.00% free
oplapro61       0             /shift/oplapro61/data02/        Available            3.41G          68.33G          4.99% free
oplapro62       0             /shift/oplapro62/data01/        Available          317.56M          68.33G          0.45% free
oplapro62       0             /shift/oplapro62/data02/        Available            9.40G          68.33G         13.76% free
oplapro63       0             /shift/oplapro63/data01/        Available            6.42G          68.33G          9.40% free
oplapro63       0             /shift/oplapro63/data02/        Available           19.20G          68.33G         28.11% free
oplapro64       0             /shift/oplapro64/data01/        Available           19.98G          68.33G         29.24% free
oplapro64       0             /shift/oplapro64/data02/        Available           21.34G          68.33G         31.23% free

-- JamesCasey - 07 Jul 2005
