-- GavinMcCance - 16 Jan 2006
- Present: Jan, Olof, Maarten, Harry, James, Gavin, Jamie (via phone)
- Sites called in: INFN, ASCC, FNAL, GRIDKA, IN2P3, BNL, NDGF, PIC, DESY, TRIUMF, SARA/NIKHEF, GSI.
- Expts called in: LHCb, CMS, Alice
- Absent: Atlas
Outlook for SC3 Disk-Disk re-run
Goal: A stable rate of at least 800MB/s with the current software, sustained for 3-4 days.
If we can do that, then try SRM copy to see how that affects the rate.
Main work for next 24 hours is optimising the rates to obtain stable running.
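For a sense of scale, the data volume implied by the goal can be estimated with a short calculation. This is only a rough sketch: it assumes decimal units (1 TB = 1,000,000 MB) and a perfectly constant rate, and the function name is illustrative.

```python
def volume_tb(rate_mb_per_s, days):
    """Data moved at a constant rate, in decimal terabytes."""
    seconds = days * 86_400  # seconds per day
    return rate_mb_per_s * seconds / 1_000_000

# Goal from the minutes: 800 MB/s held for 3-4 days.
three_days = volume_tb(800, 3)
four_days = volume_tb(800, 4)
print(f"{three_days:.2f} TB to {four_days:.2f} TB")
```

At this rate, destination space fills quickly, which is why file deletion comes up under the issues.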
Issues
- For deleting files (to avoid filling up space), the recommendation is to do this locally (via some script or other).
- Would be nice to see all the parameters for all the channels somewhere, e.g. on a web page (made dynamically by FTS).
- Would be good to document changes/problems in the wiki to keep track of all the changes.
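A local clean-up script of the kind recommended above could look roughly like the following. It is a minimal sketch, not what any site actually runs: it assumes the transferred files sit on a locally mounted directory and that file age (modification time) is an acceptable criterion for deletion. The function name and parameters are hypothetical.

```python
import time
from pathlib import Path

def delete_old_files(directory, max_age_seconds, now=None, dry_run=True):
    """Remove regular files under `directory` older than `max_age_seconds`.

    Returns the paths that were removed (or would be, when dry_run is True).
    """
    now = time.time() if now is None else now
    removed = []
    # Collect the listing first so deletion does not disturb the directory walk.
    for path in list(Path(directory).rglob("*")):
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            if not dry_run:
                path.unlink()
            removed.append(path)
    return removed
```

Running it with `dry_run=True` first shows what would be deleted; it could then be run periodically, e.g. from cron.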
Reports from Sites
CERN
Since this afternoon, the rate has been 50% of what it should be - scaled equally across sites. CERN is investigating.
Threading problems on new DLF server caused break in service. Problem is only seen at production loads.
Castor team downgraded version of DLF server. A fix is available and will be deployed soon.
Noted that only a small subset of test-files are being used (only 370 of 8000), causing poor load distribution.
--> problem in test-load generator.
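One way a load generator can avoid concentrating on a small subset is to cycle through the whole file list in shuffled passes, so that every file is used once before any file repeats. The sketch below illustrates that idea only; it is not the actual test-load generator, and the file names are made up.

```python
import itertools
import random

def balanced_file_stream(files, seed=None):
    """Yield files indefinitely, using each file exactly once per shuffled pass."""
    pool = list(files)
    if not pool:
        raise ValueError("no test files given")
    rng = random.Random(seed)
    while True:
        rng.shuffle(pool)  # new order for each pass
        yield from pool

# Hypothetical names standing in for the 8000 test files.
files = [f"testfile_{i:04d}" for i in range(8000)]
batch = list(itertools.islice(balanced_file_stream(files, seed=1), 8000))
distinct = len(set(batch))  # every file appears once in the first pass
```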
ASCC
Running well.
BNL
Doing 50 files with 10 streams - still need to find optimal values for this.
BNL have two people on call for monitoring.
CNAF
Out of memory problems on Castor1 - probably too many files / streams.
We will reduce this number and start to tune it.
2 streams should be OK (previous tuning).
Currently running with asymmetric routing - CNAF->CERN using the new 10 gig link, CERN->CNAF using the previous 2 x 1 gig network.
If the tuning doesn't work, maybe we should change the routing to fully symmetric 2 x 1 gig.
DESY
Working well then hit by a firewall issue.
Fix underway - hopefully engineers will have it fixed tonight.
Michael: Changing to SRM copy -> we should get at least 50% more.
FNAL
Running at 80 MB/s. Increasing the number of transfers should increase rate linearly.
Unbalanced queuing issue on the pools - this is being looked at by experts.
Recommend 20 streams per transfer with 2MB TCP buffer.
50 / 15 transfers issue: FTS was asked to run 50 concurrent transfers, but FNAL only sees 15 concurrent - CERN will check.
If we do SRM copy, FNAL can monitor what's going on.
GRIDKA
Pool nodes got totally filled up. Files were deleted and the pool nodes had to be restarted.
Fix for this is in 1.6.6-4 dCache release (currently running 1.6.6-1 - will deploy after rerun to maintain stability of system).
There is a cron job running to clean files now.
80MB/s - good given 1-gig link.
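The arithmetic behind "good given 1-gig link" can be checked quickly. The sketch assumes decimal units (1 Gbit/s = 125 MB/s) and ignores protocol overhead, so the real usable capacity is somewhat lower and the true utilisation somewhat higher.

```python
def link_utilisation(rate_mb_per_s, link_gbit_per_s):
    """Fraction of raw link capacity used, ignoring protocol overhead."""
    capacity_mb_per_s = link_gbit_per_s * 1000 / 8  # Gbit/s -> MB/s
    return rate_mb_per_s / capacity_mb_per_s

gridka = link_utilisation(80, 1)
print(f"{gridka:.0%}")  # prints 64%
```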
NDGF
Firewall problems being looked at. CERN + NDGF network experts investigating.
PIC
Running well.
Reconfiguring: Increased the number of disk servers 4 -> 9.
gridFTP memory problems: too many streams?
Best effort at weekend, 2 people keeping an eye on cluster.
SARA/NIKHEF
Reconfiguring. Almost done. Added 4 more pool nodes.
Ready to start with more transfers.
There was a DNS problem, now resolved (propagation delay).
TRIUMF
Working well. 80MB/s.
Can increase number of files.
Suggest buffer size increase, and reduce # streams.
Reports from Experiments
LHCb
Want to schedule phase-1 rerun. The SC3-rerun hardware setup isn't appropriate for this, so we will return to the production castorgridsc cluster and give LHCb priority.
CMS
No issues.
Alice
No issues.
AOB
An experiment / tier-1 software / deployment priorities document is underway to give realistic planning for SC4.