Saturday April 2, 12PM

Network outage to the cluster on the management interface (came back ~1AM Sunday). Did not restart services.


Friday April 1, 2005

I've stopped all SC2 transfers from oplapro nodes to Fermilab mass storage. We are continuing with files from castorgrid and distributing them to the USCMS Tier2 universities. I stopped primarily for two reasons: we've demonstrated the 40-60 MB/s to tape for most of the challenge, and we've just had requests to restore more than 20,000 user files from tape to the dCache, which requires the use of our tape drives.

I've prepared all of our pool nodes for the high-rate test to FNAL disk, and scheduled it for Monday starting at 10AM. I expect to use somewhere between 150 and 200 nodes for this test.

Jon Bakken, FNAL


Fri 3PM Jobs stuck in the queue after the DB problems - purged all Active jobs in all channels JC

Fri 1PM DB load problems on pdb01 stopped new transfers from being initiated. The rate went down to about 400MB/s. In total, the problem lasted about 1.5 hours. JC

Fri 12AM RAL had stopped transfers since old jobs were stuck in the queue. Cleaned up the queue. JC

Thur 6PM Service died for all transfers except IN2P3. Lots of DB connection problems - a problem with the connection to the CERN LAN? Started again at 8.30PM. JC

Thur 3PM 30 minutes of reduced throughput (2.5Gb/s) due to a myproxy server crash. JC

Thur 11AM

1-hour outage due to network problems at CERN. JC

Wed 12PM Turned GridKa up to 10 transfers (from 5) and RAL up to 16 (from 12) - they're suffering badly due to the increase by FNAL. JC

Wed 10AM Problem at INFN resolved. Turned transfers back on. Giuseppe upped it to 10 to try and get back to 950Mb/s. JC

Wed 9AM Turned off transfers to INFN. FNAL have ramped up to ~350MB/s - asked them to tune down to 100MB/s. JC

Tue 12AM INFN lost connectivity to their central DNS servers - all transfers started to fail. JC

Tue 3.30PM

Current configuration:

#Chan : State  :    Last Active   :Bwidth: Files:   From   :   To
IN2P35:Active  :05/03/29 15:16:32 :1024  :1     :cern.ch   :ccxfer05.in2p3.fr
IN2P33:Active  :05/03/29 15:20:30 :1024  :1     :cern.ch   :ccxfer03.in2p3.fr
INFN  :Active  :05/03/29 15:23:07 :1024  :8     :cern.ch   :cr.cnaf.infn.it
IN2P32:Active  :05/03/29 15:21:46 :1024  :1     :cern.ch   :ccxfer02.in2p3.fr
FNAL  :Inactive:Unknown           :10240 :0     :cern.ch   :fnal.gov
IN2P3 :Active  :05/03/29 15:23:40 :1024  :1     :cern.ch   :ccxfer01.in2p3.fr
IN2P34:Active  :05/03/29 15:18:00 :1024  :1     :cern.ch   :ccxfer04.in2p3.fr
BNL   :Inactive:Unknown           :622   :4     :cern.ch   :bnl.gov
NL    :Active  :05/03/29 15:22:54 :3072  :10    :cern.ch   :tier1.sara.nl
RAL   :Active  :05/03/29 15:23:09 :2048  :12    :cern.ch   :gridpp.rl.ac.uk
FZK   :Active  :05/03/29 15:23:47 :10240 :5     :cern.ch   :gridka.de
JC

Tue 11AM

INFN set up to 8 transfers - only running at 760Mb/s with 6 transfers. JC

Mon 12PM

INFN now has host resolution working again - started them up. Set to 6 concurrent transfers JC

Sun 7.50PM

Dirk got the tablespaces created, and daemons were started up. JC

Sun 5PM

Database filled up at 3PM. Noticed by Andreas at 3, by James at 5. After debugging, called physics-db support to increase the tablespace size. JC

Sat 7PM

Had made a mistake on Thur 2.30PM when setting IN2P35 to 1 - it was actually set at 4; now set to 1. JC

Thur 5PM

INFN put in DNS load-balancing to remove the most loaded host every 10 minutes.
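Roughly the idea (a minimal sketch only; the load probe and the DNS update are hypothetical placeholders, not INFN's actual tooling):

  import time

  POOL = ["gftp1.cnaf.infn.it", "gftp2.cnaf.infn.it", "gftp3.cnaf.infn.it"]   # illustrative host names

  def host_load(host):
      # placeholder: in practice this would come from monitoring (e.g. per-node load average)
      return 0.0

  def publish_dns(hosts):
      # placeholder for the actual DNS update of the round-robin alias
      print("alias now serves:", ", ".join(hosts))

  while True:
      busiest = max(POOL, key=host_load)
      publish_dns([h for h in POOL if h != busiest])   # drop the most loaded host from the alias
      time.sleep(600)                                  # re-evaluate every 10 minutes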

Thur 3:45 PM

Set concurrent files for RAL to 1 again, since upping it didn't make much difference. JC


Thur 2:30 PM

Set nofiles for IN2P35 back to 1, to see if it picks up more load again

JC


Thur 13:47

myproxy stopped running. Restarted at 13:55. JC


Thur 11:52

RAL bandwidth test completed. Looks like we do have the full 2Gb/s available. Peaked the local links at 122MB/sec each. Martin.


Thur 11:40AM

Fixed problem on oplapro63/64 where the SRM was exposing the internal, not the external interface. This could be part of the problem affecting FNAL throughput performance.

JC


Thur 11:25AM

Upped INFN to 10 concurrent transfers, since moving them to 7 just dropped their share.

JC


RAL are doing some tests to see what the link will take - using the test cluster at the CERN end and hoping to avoid the radiant transfers. Tests will run for sets of 20 minutes, starting at 10:20UT. Martin.

Thur 9AM

Ran at ~550MB/s overnight. INFN seems to have dropped by 20MB/s with the increase. Turned them back down to 7 transfers. Set RAL to 5 streams per transfer (16 transfers).

IN2P35 (ccxfer05) didn't do well when we turned it up to 2 transfers - the overall rate went up to 85-90MB/s but the individual host rate went down. ccxfer01 went up briefly when set to two streams, but then dropped again. Turned ccxfer01 back down to 1 stream, but kept ccxfer05 on two.

Current config:

#Chan : State  :    Last Active   :Bwidth: Files:   From   :   To
IN2P35:Active  :05/03/24 09:20:44 :1024  :2     :cern.ch   :ccxfer05.in2p3.fr
IN2P33:Active  :05/03/24 09:17:57 :1024  :1     :cern.ch   :ccxfer03.in2p3.fr
INFN  :Active  :05/03/24 09:23:36 :1024  :7     :cern.ch   :cr.cnaf.infn.it
IN2P32:Active  :05/03/24 09:20:18 :1024  :1     :cern.ch   :ccxfer02.in2p3.fr
CERN  :Inactive:05/03/22 13:55:17 :1024  :1     :cern.ch   :cern.ch
FNAL  :Inactive:Unknown           :10240 :16    :cern.ch   :fnal.gov
IN2P3 :Active  :05/03/24 09:14:00 :1024  :1     :cern.ch   :ccxfer01.in2p3.fr
IN2P34:Active  :05/03/24 09:15:52 :1024  :1     :cern.ch   :ccxfer04.in2p3.fr
BNL   :Inactive:Unknown           :622   :4     :cern.ch   :bnl.gov
NL    :Active  :05/03/24 09:23:21 :3072  :16    :cern.ch   :tier1.sara.nl
RAL   :Active  :05/03/24 09:21:40 :2048  :12    :cern.ch   :gridpp.rl.ac.uk
FZK   :Active  :05/03/24 09:23:46 :10240 :5     :cern.ch   :gridka.de

Wed 10PM

Running at ~520MB/s.

Turned IN2P3 ccxfer01 up to 2 transfers to see if it increases bandwidth to the node. ccxfer05 is dealing much better with the transfers than the other nodes.

Turned RAL up to 16 (from 12) transfers and INFN up to 8 (from 7). Turned FZK up to 5 (from 4).


Wed 6PM

IN2P3 tuned their system, and reported better performance now with 5 parallel streams to each node. Tested briefly and got ~60MB/s to ccxfer05. Turned them back on with 1 file per system, 5 parallel streams.


Wed 5.45PM

Turned INFN back on after network tests. From Tiziana:

I have good news, it looks like throughput on the path cnaf -> cern is now back to its usual threshold of 950 Mb/s, even if it still suffers from sporadic packet loss. The same applies to throughput in the opposite direction, cern -> cnaf.

Also upped FZK to 4 streams, since on 3 they drop to ~100MB/s on a loaded radiant cluster (it was ~150MB/s on a lightly loaded radiant cluster).


Wed 4.15PM

Turned RAL back on again (12 streams) JC


Wed 2.30PM

Debugging the IN2P3 connection - single-stream gridftp gives ~20MB/s throughput to ccxfer03. Single-stream iperf goes to 850Mb/s, but also drops a lot, down to ~100Mb/s. 5-stream iperf gives a stable 850-900Mb/s, while gridftp still gives poor performance. We also see processes building up in QUIT state (and load going up) when there is no gap between transfers to the same host. JC
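For reference, a minimal sketch of that single-stream vs multi-stream iperf comparison (assuming a standard iperf 2 client with the -c/-P/-t/-f flags and an iperf server already listening on the target; the hostname is only illustrative):

  import subprocess

  HOST = "ccxfer03.in2p3.fr"   # illustrative target; needs an iperf server listening on it

  def iperf_report(streams, secs=30):
      # -c client mode, -P parallel streams, -t duration in seconds, -f m report in Mbits/sec
      out = subprocess.run(["iperf", "-c", HOST, "-P", str(streams), "-t", str(secs), "-f", "m"],
                           capture_output=True, text=True, check=True).stdout
      # with -P > 1 the aggregate is on the final [SUM] line; with 1 stream the single report line is last
      return [l for l in out.splitlines() if "Mbits/sec" in l][-1]

  print("1 stream :", iperf_report(1))
  print("5 streams:", iperf_report(5))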


Wed 2.15PM

Turned the number of transfers to FZK down to 3 in order to reduce throughput to the 100MB/s target. JC


Wed 10.45AM

Started to debug the slow IN2P3 performance. Went to one channel with one or two files to see what the speed is, trying to eliminate network crosstalk issues. Also changed FZK to 1 parallel stream instead of 2. JC


Wed 10.20AM

Turned off RAL to allow for network testing. With 12 concurrent transfers, we have ~70MB/s to RAL. JC


Wed 23rd March 8.30AM

We ran for the last 14 hours at 500MB/s. Turned INFN off again for network testing. Upped RAL to 12 concurrent files JC


Tue 22nd 7PM

Added INFN back in overnight, and tuned IN2P3 up to 2 files per stream. JC


Tue 22nd 5PM

Configured to talk to RAL via gridftp doors on their 11 nodes. dCache needs the -nodcau option set on gridftp! Set them running with 8 parallel streams. JC
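For reference, a minimal sketch of the kind of copy this implies (not the exact radiant job; the source and destination URLs are placeholders), with -nodcau disabling data-channel authentication and -p 8 giving 8 parallel streams:

  import subprocess

  # placeholder URLs - real transfers use the file locations handed out by the transfer service
  SRC = "gsiftp://castorgrid.cern.ch/castor/cern.ch/path/to/file"
  DST = "gsiftp://gridftp.gridpp.rl.ac.uk/pnfs/path/to/file"

  # -nodcau: no data-channel authentication (needed by these dCache gridftp doors)
  # -p 8   : 8 parallel streams, as configured for the RAL channel
  subprocess.run(["globus-url-copy", "-nodcau", "-p", "8", SRC, DST], check=True)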


Tue 22nd 4PM

Kernel panic on oplapro80. Was caught straight away, and the node rebooted. Downtime ~15 minutes. JC


Tue 22nd March 1PM

Myproxy server locked up again. This seems to be due to the known race condition when it is run in verbose mode (writing to syslog inside a signal handler caused a deadlock under some conditions). myproxy was restarted without -verbose at ~1PM. We had ~30 minutes of downtime. JC
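The failure mode in general terms (a minimal sketch in Python rather than myproxy's actual C code): the signal handler re-enters logging code that tries to take a lock the interrupted code already holds.

  import signal, threading, time

  log_lock = threading.Lock()          # stands in for the internal lock that syslog() takes

  def write_log(msg):
      with log_lock:                   # normal path: take the lock, write, release
          time.sleep(2)                # window in which a signal can arrive
          print("log:", msg)

  def handler(signum, frame):
      # The handler interrupts write_log() while it still holds log_lock.  A real
      # handler calling syslog() would block here forever; a timeout is used so the
      # sketch terminates instead of hanging.
      if log_lock.acquire(timeout=1):
          log_lock.release()
      else:
          print("handler: lock already held by the interrupted code -> deadlock")

  signal.signal(signal.SIGALRM, handler)
  signal.alarm(1)                      # fires while write_log() holds the lock
  write_log("processing a request")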


Tue 22nd March 2005

Ran overnight at ~250MB/s. Added INFN back in. Reconfigured IN2P3 to have a channel per host - this doubled their rate to 30MB/s. JC


Monday 21st March 2005

Started all transfer daemons in the morning. Got up to 4Gb/s at ~11AM, including:

  • SARA
  • INFN
  • IN2P3
  • FNAL
  • FZK

At 12PM, performance dropped to 2Gb/s for ~1hr, then rose sharply again for 30 minutes before dropping back to 2Gb/s. JC

-- JamesCasey - 16 Jun 2005
