-- JamieShiers - 23 Oct 2006

Debugging of ALICE Transfers

Diary

From November 23, 2006

  • Scheduled downtime to update the ALICE software to alien2.12
  • This operation affects both the production and the transfers
  • RAL transfers will be a key point when the transfers are restarted

November 21, 2006 (Tuesday)

  • Problems with FTS transfers to CASTOR@RAL
  • No transfers are working, because of the following problems:
  • Transfer failed. no local mapping for Globus ID
  • Failed on SRM put. Error in CastorStagerInterface.c:2457 Device or resource busy
  • Not able to retrieve the TURL from the destination
  • Address already in use

November 20, 2006 (Monday)

  • SARA VOBOX still down
  • CNAF and CCIN2P3 VOBOXes not accessible
  • RAL provided the SRM endpoint for accessing CASTOR at the site. As this is considered a pre-production phase, the ALICE server will be used for FTS exercises only. It has been configured as a disk0tape0 SE.

November 18&19, 2006 (Weekend)

  • SARA VOBOX is still down
  • CNAF and CCIN2P3 are not accessible via gsissh; transfers to these sites have been failing since Friday night
  • FZK : 44MB/s
  • RAL : 22.5MB/s

November 17, 2006 (Friday)

  • A firewall intervention at CERN caused two hours of service unavailability in the morning
  • The SARA VOBOX cannot access the shared network storage; services are down
  • Daily average speed: 120MB/s

November 16, 2006 (Thursday)

  • FZK SE Service down, restarted in the afternoon
  • Load generator failed for several minutes around 17:00; back in production and transferring after the restart
  • 270MB/s after the load generator startup

November 15, 2006 (Wednesday)

  • Average transfer speed of 237MB/s at the end of the day
  • 5 Tier1 in operation
  • Worse results for FZK (the SE service is probably down)

November 14, 2006 (Tuesday)

  • Castor2 crash at CERN affected the FTS production from 8:30 to 10:00
  • SARA transfers might be affected. There are issues with the VOBOX at the site: the software area is not accessible, and the service startup scripts reside in that area.
  • The overload of the RAL VOBOX was solved this morning by rebooting the machine.
  • An update of the SE service is necessary on all VOBOXes

November 13, 2006 (Monday and weekend)

  • Central services down during the weekend (beginning last night) stopped the transfers until this morning
  • The load generator also had to be restarted
  • Several issues delayed the full restart of the FTD services for several hours
  • Once restarted, CNAF is back in production
  • SARA instabilities. The destination SE is showing a large number of connection problems. Reported to the site via GGUS
  • At 16:30 the central AliEn proxy got overloaded and crashed. This situation lasted for 1 hour
  • Source of overloading found and eliminated

November 9, 2006 (Thursday)

  • One of the central machines hung and needed a power cycle. Transfers were therefore stopped for one hour.
  • Two sites didn't recover after this and FTD had to be restarted there
  • CNAF still not in production: no access to the SRM endpoint. Waiting for the green light from the site to resume production
  • Average speed during the day: 185.4 MB/s

November 8, 2006 (Wednesday)

  • Since about 12:00 the central FTS services appear to be down; no query to them ever finishes. When the CASTOR problems were announced there was still activity and the commands (glite-transfer-status, for example) returned something; now they just hang (see the timeout-wrapper sketch after this list)
  • The problem has been reported to the FTS experts
  • There have been two alarms (on fts103 and fts104) indicating that the web-services had hung - with the known problem of the DB connection pool getting stuck. The computer centre operators ran the procedure and restored the daemons.
  • The FTS agents continued to run the existing jobs unaffected during the period
  • The FTS patch to fix the DB connection pool problem is being prepared and should be released in a couple of weeks
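
Hung status queries like the ones above can block a submission loop indefinitely. Below is a minimal, hypothetical sketch (not the actual FTD code) of wrapping the glite-transfer-status call with a timeout; the endpoint URL and job id in the usage comment are placeholders, and the "-s <endpoint>" option is assumed to select the FTS service instance.

    # Hypothetical wrapper around glite-transfer-status so that a hung FTS
    # web service does not block the caller forever.
    import subprocess

    def query_transfer_status(job_id, fts_endpoint, timeout_s=60):
        """Return the raw status output for an FTS job, or None on error/hang."""
        cmd = ["glite-transfer-status", "-s", fts_endpoint, job_id]
        try:
            result = subprocess.run(cmd, capture_output=True, text=True,
                                    timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return None        # treat a hang as "service unavailable"
        if result.returncode != 0:
            return None
        return result.stdout.strip()

    # Example (placeholder endpoint and job id):
    # query_transfer_status("JOB-ID", "https://fts.example.cern.ch:8443/...")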

November 7, 2006 (Tuesday)

  • Castor2 downtime at CERN
  • Castor2 DB restart at CERN
  • Castor2 problems at CNAF. Announced downtime
  • Efficiencies:
  • SARA: 96%
  • CCIN2P3: 97%
  • RAL: 99%
  • FZK: 97%
  • Total transfer speed (average): 94 MB/s

November 6, 2006 (Monday. Including weekend report)

  • Since Friday at 22:00 till Sat at 14:00 a very good rate was achieved (337MB/sec)
  • From Sat 14:00 to Sun 20:00 the rate dropped to 133MB/sec; there is no clear explanation for this. It was raised at the daily meeting at 9:00, but there is still no clear explanation.
  • Since yesterday at 20:00 the RFCPs have been failing with the status "Device or resource busy". The CASTOR2 team is looking into the problem. Not only the MDC is affected but the whole ALICE activity on this stager, including the WAN transfers.
  • Last night at 00:30 the central AliEn Proxy service crashed. It was restarted this morning at 08:30. No transfers were therefore submitted during the night
  • Instabilities affecting SRM put were observed on Friday at FZK and CCIN2P3. These issues decreased the efficiency at those sites to 43% and 56% respectively. The instabilities disappeared during Saturday the 4th
  • On the 5th, instabilities at CCIN2P3 were again observed with the error: The FTS transfer transferid failed ( Transfer failed. ERROR an end-of-file was reached)

November 3, 2006 (Friday)

  • Transfers restarted in the morning
  • SARA is not in production for FTS transfers
  • Bad performance of CCIN2P3 and FZK, with efficiencies close to 10% and 14% respectively.
  • CCIN2P3 error message: ( Failed on SRM put: Empty request status gotten. No protocol support
  • FZK error message: ( Failed on SRM put: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Error reading token data: Success
  • RAL is not in production: the VOBOX was not accessible (kernel problems). It should be back now (16:00)

November 2, 2006 (Thursday)

  • Strange behavior of ALICE transfers between 16:00 (1 November) and 9:30 (2 November): regular peaks and dips occurring simultaneously every 30 minutes for all sites. The problem has been reported to and studied with the FTS and CASTOR experts
  • It seems not to be related to FTS, since the same problem does not appear for example for dteam
  • It seems to be a problem related to CASTOR at CERN after the update of the system
  • The peaks and dips during this period are now fully understood by the experts, who were able to find the problem
  • However ALICE kept seeing these variations from 9:30 until 17:00. The problem was also reported
  • The CASTOR WAN traffic plots do not show such a problem; this new variation has not been fully understood
  • During the night (from 2 November to 3 November) ALICE transfers were stopped for software updates
  • The results in terms of site efficiencies were the following:
  • SARA: 0%
  • RAL: 90% --> however very low rate
  • CNAF: 95%
  • CCIN2P3: 93%
  • FZK: 93%
  • Total average speed: 146MB/s

November 1, 2006 (Wednesday)

  • After the CASTOR2 update at CERN this morning (from 11:30 to 14:30) and the gridftp restart, the transfer speed went back up to a good level
  • It went down again for one hour with the usual symptoms; after another half hour the activity resumed.
  • Transfer speed at 150MB/s at 16:45
  • SARA scheduled downtime; transfers to this site are not counted today
Results at the end of the day: (in terms of efficiency)
  • FZK: 66%. Source of error: Transfer failed. ERROR a system call failed (Connection refused)
  • RAL: 80%
  • CCIN2P3: 91%
  • CNAF: 94%
  • Total transfer rate (in average): 129MB/s

October 31, 2006 (Tuesday)

General comments: Global FTS problems seem to have come back at about 19:00

Average transfer rates:

  • CCIN2P3 48MB/sec
  • SARA 16MB/sec, down since 08:30 because of network changes
  • CNAF 16MB/sec, very few transfers, lots of gridftp errors
  • GridKa 50MB/sec, stable
  • RAL 11MB/sec, as usual

  • Average rate 140 MB/sec

October 30, 2006 (Monday)

General comments: Global instability in the system. Several major problems affected almost the entire day.

Average transfer rates:

  • CCIN2P3 62MB/sec, no site problems
  • SARA 19MB/sec, affected by SRM problems for 12 hours (06:00 - 17:50)
  • CNAF 8MB/sec, 45% efficiency, very few transfers, several different gridftp errors
  • GridKa 30MB/sec, 50% efficiency
  • RAL 7MB/sec

  • Average rate 126 MB/sec

Global problems for today:

  • yesterday's problems with the central FTS services persisted until 10:30
  • 12:30 - 17:00 : one of the ALICE services crashed due to overload from many failing transfers. The transfer submission system was changed to back off when there are too many faults on a certain transfer channel or on all channels (general FTS failure); a sketch of this back-off follows this list.
  • 21:20 - 22:00 : networking problems, monitoring data missing for the period
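
A minimal sketch of the back-off idea described above (not the actual ALICE/FTD code); the threshold and cool-down values are invented for illustration:

    # Illustrative per-channel back-off: after too many consecutive failures on
    # a channel (or on the pseudo-channel "ALL" for a general FTS failure),
    # stop submitting to it until a cool-down period has passed.
    import time

    FAILURE_THRESHOLD = 10     # consecutive failures before backing off (assumed)
    BACKOFF_SECONDS = 1800     # cool-down before trying the channel again (assumed)

    class ChannelBackoff:
        def __init__(self):
            self.failures = {}       # channel -> consecutive failure count
            self.blocked_until = {}  # channel -> time when submissions may resume

        def record_result(self, channel, success):
            if success:
                self.failures[channel] = 0
                return
            self.failures[channel] = self.failures.get(channel, 0) + 1
            if self.failures[channel] >= FAILURE_THRESHOLD:
                self.blocked_until[channel] = time.time() + BACKOFF_SECONDS

        def may_submit(self, channel):
            now = time.time()
            return all(now >= self.blocked_until.get(key, 0)
                       for key in (channel, "ALL"))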

October 29, 2006 (Sunday)

General comments: Stable transfer rates up to 17:30. At 17:30 all rates dropped to a few MB/sec (CERN FTS failure?). Partially recovered to CCIN2P3 at 19:30; all other channels below 20MB/sec, many transfer failures.

Average transfer rates:

  • CCIN2P3 100 MB/sec, stable
  • SARA 60 MB/sec, down to 15 MB/sec after 17:30
  • CNAF 17 MB/sec, down to 7 MB/sec after 17:30
  • GridKa 27 MB/sec, unstable rate, down to 12 MB/sec after 17:30
  • RAL 9 MB/sec, stable, but low.

  • Average rate 210 MB/sec

Predominant error conditions:

  • GridKA The FTS transfer transferid failed ( Transfer failed. ERROR the server sent an error response: 553 553 No such file or directory.)
  • RAL The FTS transfer transferid failed ( Transfer failed. ERROR the server sent an error response: 425 425 Cannot open port: java.lang.Exception: Pool request timed out
  • CNAF The FTS transfer transferid failed ( Operation was aborted (the gridFTP transfer timed out).)

October 28, 2006 (Saturday)

General comments: from 0:00 to 13:50 FTS load generator failure - no transfers submitted. Generator restarted at 13:50 - all channels OK and recovered quickly

Average transfer rates:

  • CCIN2P3 115 MB/sec, stable
  • SARA 74 MB/sec, stable (dropped to zero between 19:30 and 20:45, no reason found)
  • CNAF 43 MB/sec, stable
  • GridKa 30 MB/sec, unstable rate,
  • RAL 15 MB/sec, stable, but low.

  • Average rate 280 MB/sec

Predominant error conditions:

  • GridKA The FTS transfer transferid failed ( Transfer failed. ERROR the server sent an error response: 553 553 No such file or directory.)
  • RAL The FTS transfer transferid failed ( Transfer failed. ERROR the server sent an error response: 425 425 Cannot open port: java.lang.Exception: Pool request timed out :

October 27, 2006 (Friday)

  • Transfer speeds dropped to zero for almost all destinations. Problem report sent to FTS experts.
  • FZK recovered at 8:30
  • Slowly ramping transfers up (at 12:00)
  • CNAF in scheduled downtime for Oracle DB updates
  • Instabilities for CCIN2P3 and SARA
  • Coming back again to 450MB/s (14:00)
  • CCIN2P3 and SARA ramping up again

October 26, 2006 (Thursday)

  • At 10:00 the transfer speed has achieved 325 MB/s
  • Instabilities at FZK. Efficiency of 23%. Error source: The FTS transfer transferid failed ( Transfer failed. ERROR a system call failed (Connection refused)).
  • FZK back in good production
  • Instabilities at FZK again; transfers are failing because the connection is refused
  • From 00:00 until 17:00 the efficiencies with the rest of the sites are the following:
  • RAL: 66%
  • SARA: 97%
  • CNAF: 99%
  • CCIN2P3: 99%

October 25, 2006 (Wednesday)

  • From 14:30 to 16:30 instabilities at SARA did not allow any transfer. The SE was not found in the BDII
  • In the last 24h ALICE has been transferring to all T1 sites, achieving a peak of 400MB/s on the 24th at 23:00
  • During the last hour (13:00) ALICE has been transferring at a speed of 225 MB/s
  • From the 25th at 00:00 (following dashboard statistics), these are the ALICE efficiencies with all T1 sites:
  • RAL: 65.74%
  • FZK: 85.40%
  • CNAF: 86.24%
  • SARA: 92.78%
  • CCIN2P3: 98.65%

October 24, 2006 (Tuesday)

  • The problems observed at SARA are being reported via GGUS
  • The MonaLisa transfer reports are now similar to the results shown by GridView
  • FZK - still in scheduled downtime
  • CNAF - the services were not running, no activity there until 09:30
  • RAL - low number of transfers (as usual), 66% success rate (pool errors)
  • SARA - ok until 08:00, now all transfers are failing with errors like: Failed on SRM put: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Error reading token data: Success
  • CCIN2P3 - the only one working 100%
  • Good transfer rates overnight to Lyon (max 80MB/s) and Sara (max >100MB/s). Primary causes (top 2) of failures to Sara still in ALICE stack - see http://dboard-gr.cern.ch/dashboard/data/fts/24-Oct-06.html
  • CNAF not yet back in the game, no response from RAL re migration of ALICE to CASTOR2.
  • Dip in GridView plots overnight probably due to:
    • At 20:00 GMT I then switched off the hourly tomcat restart, but that was not a good idea, because the plots immediately crumbled again.
    • Since 23:00 GMT the restarts are back on.

October 23, 2006 (Monday)

  • lxgate38 is being reconfigured by FIO (and this is the machine running the SE). Transfers are therefore stopped from 13:00
  • Stable transfers from CERN-IN2P3 at 60-80MB/s with very low error rate.
  • Many failures to CNAF (557), out of which:
    • adding the file to the LVM at /opt/exp_software/alice/alien.v2-11-pm/lib/perl5/site_perl/5.8.7/AliEn/Service/SE.pm line 1119. (335)
      • Related to LFC & NULL comment field - will be disabled
    • contacting the Broker/Transfer (135)
      • VO box overload
  • Transfers at lower rate (~10-15MB/s - in any case above pp and HI requirement) also to RAL but high failure rate (>80%)
  • SARA upgrading to dCache 1.7 (see below) but transfers close to 80MB/s prior to this.

Participating Tier1 Sites

Targets are in MB/s.

Tier1 Site   SC4 Target   pp Target   HI Target
IN2P3            60           30          50
CNAF             60           35          50
FZK              60           30          80
RAL              30           10           5
SARA             30           15          30

Site Status

On Monday October 23rd srm.grid.sara.nl and ant1.grid.sara.nl will be down for maintenance from 12:00-18:00 CET for a dCache upgrade to version 1.7.0.

FZK-LCG2 Scheduled Downtime 24 Oct. Published on 2006-10-13 by Sven Gabriel (FZK-LCG2)

Due to an update of our d-Cache System and common maintenance FZK-LCG2 will not be available from 09:00 till 21:00.

Links

ARDA transfer breakdown

FTS report index

ALICE MonaLisa monitoring

December 12, 2006 (Tuesday) Multi-VO transfers to CCIN2P3

  • Source: CERN CASTOR2, Target: CCIN2P3 disk (/pnfs/in2p3.fr/data/alice/disk/multi_vo_tests/)
  • Parameters: 25 concurrent FTS transfers; filesize 1.9GB
  • Summary over past 24 hours:
    • Average rate : 101.6MB/s; total transferred 6.9TB
    • Total number of transfers: 5965; Failed: 2149; Successful: 3816 (64%)
  • Breakdown of errors (from ARDA dashboard - http://dboard-gr.cern.ch/dashboard/data/fts/)
    • 2050: The FTS transfer transferid failed ( No Channel found or VO not authorized for transferring between CERN-PROD and IN2P3-CC)
    • 17: The FTS transfer transferid failed ( Failed on SRM put: SRM getRequestStatus timed out on put
    • 20: The FTS transfer transferid failed ( Failed on SRM put: Failed To Put SURL. Error in srm__put: service timeout
    • 6: The FTS transfer transferid failed ( Failed on SRM put: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Could not open connection
    • 9: The FTS transfer transferid failed ( Failed on SRM put: Failed SRM put on httpg://ccsrm.in2p3.fr:8443/srm/managerv1
    • 4: The FTS transfer transferid failed ( Failed on SRM put: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Error reading token data: Success)
    • 12: The FTS transfer transferid failed ( Failed on SRM put: SRM getRequestStatus timed out on put)
    • 4: The file has size size and should have size
    • 27: Error contacting the SE_ALICE::CCIN2P3::LCG

December 13, 2006 (Wednesday) Multi-VO transfers to CCIN2P3, last 24 hours summary

  • Source: CERN CASTOR2
  • Parameters: 25 concurrent FTS transfers; filesize 1.9GB
  • Target: CCIN2P3 disk (/pnfs/in2p3.fr/data/alice/disk/multi_vo_tests/)
    • Average rate: 26.6MB/s
    • Total number of transfers: 2030, 1222 successful (60%), 808 failed (40%)
    • Predominant errors:
      • 768 (95%) : No Channel found or VO not authorized for transferring between CERN-PROD and IN2P3-CC
      • 37 (4.5%) : Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Could not open connection !
  • Target: CCIN2P3 tape (/pnfs/in2p3.fr/data/alice/tape/multi_vo_tests/)
    • Average rate: 21.8MB/s
    • Total number of transfers: 1620, 985 successful (61%), 625 failed (39%)
    • Predominant errors:
      • 560 (88%) : No Channel found or VO not authorized for transferring between CERN-PROD and IN2P3-CC
      • 61 (9.5%) : Failed on SRM get: Cannot Contact SRM Service. Error in srm__ping: SOAP-ENV:Client - CGSI-gSOAP: Could not open connection !
  • Observations
    • The "No Channel found..." errors appear when the transfers are submitted to the CCIN2P3 FTS endpoint
    • Big dip to almost (but not quite) zero between 12/12 @ 21:00 - 13/12 @ 01:00
    • Slow recovery until 13/12 @ 05:00
    • Dip to zero between 13/12 @ 09:00 - 13/12 @ 10:00

December 14, 2006 (Thursday) Multi-VO transfers to CCIN2P3, last 24 hours summary

  • Source: CERN CASTOR2
  • Parameters: 25 concurrent FTS transfers; filesize 1.9GB
  • Target: CCIN2P3 disk (/pnfs/in2p3.fr/data/alice/disk/multi_vo_tests/)
    • Average rate: 51MB/s
    • Total number of transfers: 3642, 2301 successful (63%), 1341 failed (37%)
    • Predominant errors:
      • 1323 (98.5%) : No Channel found or VO not authorized for transferring between CERN-PROD and IN2P3-CC
  • Target: CCIN2P3 tape (/pnfs/in2p3.fr/data/alice/tape/multi_vo_tests/)
    • Average rate: 49.5MB/s
    • Total number of transfers: 3517, 2249 successful (64%), 1268 failed (36%)
    • Predominant errors:
      • 1218 (96%) : No Channel found or VO not authorized for transferring between CERN-PROD and IN2P3-CC
  • Observations:
    • We are looking for a way to always use the CERN endpoint even when the information in the BDII is not correct.

December 15, 2006 (Friday) Multi-VO transfers to CCIN2P3 and FZK, last 24 hours summary

  • Source: CERN CASTOR2
  • Parameters: 25 concurrent FTS transfers; filesize 1.9GB
  • Target : CCIN2P3
    • Average rates : 60MB/s to tape + 60MB/s to disk = 120MB/s
    • Total number of transfers : 5579, 4042 successful (72.5%), 1537 failed
    • Observations:
      • A caching mechanism for the FTS endpoints was put in place; this greatly reduced the errors caused by BDII timeouts (a sketch follows this list)
      • The exercise to CCIN2P3 finished at 15.12 @ 04:30
  • Target : FZK (/pnfs/gridka.de/alice/AliEn/)
    • Load generator was started on 15.12 @ 07:30
    • Average rate until now : 52MB/s
    • The same problem appeared for FZK (the BDII does not respond for long periods), so the caching mechanism was applied to the FZK VOBOX as well
    • So far we have observed some periodic dips in the transfer rates, even though the queue always has enough waiting transfers.
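
The endpoint cache mentioned above could look roughly like the following. This is an illustrative sketch, not the AliEn FTD code, and the cache lifetime is an assumed value (the December 18 entry notes the lifetime was later to be increased):

    # Illustrative TTL cache for FTS endpoint lookups: reuse the last successful
    # BDII answer for a fixed lifetime, and fall back to a stale entry when the
    # BDII does not respond.  The lifetime value is assumed, not the production one.
    import time

    CACHE_LIFETIME = 3600          # seconds an entry stays fresh (assumed)
    _cache = {}                    # site name -> (endpoint, expiry time)

    def get_fts_endpoint(site, lookup_bdii):
        """Return the FTS endpoint for `site`; `lookup_bdii` does the real
        LDAP query and may return None or raise when the BDII times out."""
        endpoint, expiry = _cache.get(site, (None, 0.0))
        if endpoint and time.time() < expiry:
            return endpoint                      # fresh cached answer
        try:
            fresh = lookup_bdii(site)
        except Exception:
            fresh = None
        if fresh:
            _cache[site] = (fresh, time.time() + CACHE_LIFETIME)
            return fresh
        return endpoint                          # stale entry (or None) as last resort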

December 18, 2006 (Monday) Multi-VO transfers to FZK, weekend summary

  • Source: CERN CASTOR2
  • Target : FZK (/pnfs/gridka.de/alice/AliEn/)
  • Parameters: 25 concurrent FTS transfers; filesize 1.9GB
  • Period : 15.12 @ 11:00 - 18.12 @ 11:00
  • Average rate : 60.3MB/s
  • Total number of transfers : 9284, 7829 successful (84.33%), 1455 failed (15.67%)
  • Observations
    • Almost all of the failures are because the endpoint cannot be found in the BDII, even with the caching mechanism in place. We will try to increase the lifetime of the entries in the cache.

March 26, 2007 (Monday) Start of Dress Rehearsal

  • CERN: the SE no longer publishes the field 'GlueSEType=srm'; FTD cannot locate the correct SRM endpoint from the BDII via LDAP, so no transfer can start (a query sketch follows this list)
  • CCIN2P3: the SE ccsrm.in2p3.fr is no longer registered in the BDII; the AliEn SE and FTD services cannot start
  • RAL: VOBOX down, machine was rebooted
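
To illustrate the kind of check involved, here is a hedged sketch of querying a top-level BDII for the GlueSEType attribute of an SE. The BDII host and the use of the python-ldap module are assumptions of this example, not part of the original procedure:

    # Assumed diagnostic: ask a top-level BDII (Glue 1.x schema, LDAP port 2170)
    # whether a given SE still publishes GlueSEType=srm.
    import ldap   # python-ldap

    def se_publishes_srm(se_host, bdii_url="ldap://lcg-bdii.cern.ch:2170"):
        conn = ldap.initialize(bdii_url)
        conn.simple_bind_s()                       # BDIIs allow anonymous binds
        filt = "(&(objectClass=GlueSE)(GlueSEUniqueID=%s))" % se_host
        entries = conn.search_s("o=grid", ldap.SCOPE_SUBTREE, filt, ["GlueSEType"])
        for _dn, attrs in entries:
            values = [v.lower() for v in attrs.get("GlueSEType", [])]
            if b"srm" in values:
                return True
        return False

    # Example: se_publishes_srm("ccsrm.in2p3.fr")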