LCGSCM Data Management Status - 2005/2006

Older reports are moved here. Please look at LcgScmStatusDms for the current reports.

December 13th 2006

Transfer service

  • Multi-VO tests ongoing to Lyon. About to start to GRIDKA.
  • Planning to deploy pilot version of FTS 2.0 (with delegation) this week.
  • Start FTS SRM 2.2 code testing today.

LFC

  • No issues reported.

Castor

  • One-hour interruptions on CASTORCMS and CASTORATLAS to fix Oracle corruptions. No data loss.
  • Castor intervention today.

Issues

  • Follow-up on issue of Castor SRM 2.2 only working with VOMS certificates

December 6th 2006

Transfer service

  • Patch 912 deployed on CERN-PROD. Tier-1's to upgrade as soon as patch comes out of pre-production.
  • Multi-VO test delayed (problems on Lyon SRM). Planning to run with GridKA first and Lyon next week.
  • ATLAS continue to give bad SURLs (100% failure rates on some channels): follow up with them.

LFC

  • LHCb ACL change complete and verified.

Castor

  • two service interruptions on CASTORPUBLIC caused by new problems with LSF scheduler
  • database deadlocks on CASTORALICE interrupted service yesterday
  • Castor delta review today and tomorrow

Issues

  • Serious issue noted for Castor SRM 2.2: it only seems to support use of VOMS certificates and does not work with plain proxy certificates. Do we want to couple SRM 2.2 to VOMS in the 2007 Q1 schedule?

November 29th 2006

Transfer service

  • Multi-VO test underway to Lyon
    • Problems on Lyon channel! (dest SRM)
  • ATLAS continue to give bad SURLs (100% failure rates on some channels): follow up with them.
  • dteam transfers restarted at lower rate (they were overloading Castor public since other expt. activity is very low)
  • Patch 912 in preproduction: planning rolling intervention to upgrade asap (< 30 minutes). Scheduled for Thursday.
  • Certificate upgrade to new CERN CA on Thursday.

LFC

  • No major issues.

Castor

  • Oracle patch to be applied to the nameserver database machines on Dec 13. This implies a 30 minute service interruption, and we will have to stop batch as well. A good opportunity for maintenance work...
  • CASTORCMS is being upgraded today to 2.1.1-4 (Nov 29)

AOB

  • a new version of the host-certificate-manager script is in production on lxadm. This script can be used for bulk requests of host certificates, which are retrieved in real time. The certificates are uploaded into SINDES (by default). It has been used successfully for the FTS and Gridview boxes.

November 15th 2006

Transfer service

  • dteam transfers stopped on 14th for a while due to Castor problem. No major issues.
  • FTS patch (for connection pool lock-up) given to integration on 13th. See: https://savannah.cern.ch/patch/index.php?912
  • Setting up new hardware and preparing intervention plan for reorganisation / cleanup of servers.
  • Planning multi-VO test

LFC

  • No major issues.

SRM 2.2

  • GFAL / lcg-util client available for testing by experiments against SRM 2.2 test systems.
  • FTS client integration in progress

Castor

  • new version of castor-lib-compat deployed on Computer Center nodes
  • CASTORALICE: still investigating LSF scheduler problem
  • CASTORLHCB, CASTORATLAS to be upgraded on Thursday
  • CASTORPUBLIC suffered from several problems on Tuesday, all of them understood and fixed :-)

November 8th 2006

Transfer service

  • No major issues
  • Setting up new hardware
  • Patch for FTS lock-up in validation. Alarm in place.
  • DB query slowing down is understood as a missing constraint on our validation cluster confusing the optimiser (still not clear why it didn't affect us before)

LFC

  • No major issues
  • LHCb streams move complete
  • Another mass ACL update pending

Castor

  • problems with expt applications linked against libshift.so.2.0 forced us to downgrade Atlas and LHCb instances. The Castor development team have produced a new version of the castor-lib-compat compatibility RPM, that will allow us to upgrade again... Schedule is being agreed with the experiments
  • Atlas pools to be reshuffled this week, as agreed with experiment
  • SRM v2 endpoint problems being actively debugged
  • LSF-related problems on Alice stopping service (twice...). Actuator put in place.

November 1st 2006

LFC

  • LFC moved to new hardware.
  • Move srm-durable entries to srm.cern.ch for ATLAS
  • ACLs updated for LHCb and ATLAS
  • LHCb intervention for streams planned for Thursday.
  • LFC validation done for new Oracle patch.

Transfer service

  • Another problem found at RAL in FTS transfer component (between FTS and SRM). Patch under validation.
  • Connection pool problem under investigation (possible problem with OCI driver) - causing some of the web-service lockups. Alarm for this in preparation.
  • (separate) DB TAF problem being looked at by DB team.
  • New hardware planning underway.
  • FTS 2.0 delays continue.
  • ncm component in testing.
  • Oracle validation: the fetch query seems to have increased by a huge amount and we don't understand why.

Castor

  • upgrades of Castor instances well underway:
    • LHCb and Alice: today
    • CMS: later...
    • Atlas, public: done last week
  • cleaned up published information for SRM endpoints, and added support for Unosat, Gear, Geant VO's on srm.cern.ch. They can (should!) stop using castorgrid classic SE.
  • SRM v22 pre-production service is being set up today :-)

October 25th 2006

LFC

  • Planning move to new hardware which is now available.
  • Intervention plan for LHCb streams replication in preparation.

Transfer service

  • Deployed new patches in production. T1 sites should upgrade.
  • Planning addition of new hardware and Tier-2 channel split.
  • Understanding recent service issues on web-service.

Castor

  • Oracle patches, Linux kernel upgrade and Castor S/W upgrades being rolled out. Castorpublic done today, Atlas, LHCb, Alice next week, CMS and Castor nameservers later.
  • Castor client S/W to be updated to 2.1.1-4 on Monday Oct 30
  • SRM v22 pre-production status: no update. November 1st begins to look tight...
  • Castor @ SLC4 validation:
    • client software: OK (i386 + x86_64)
    • diskservers: i386 OK, x86_64 under test
    • head nodes, tapeservers: later...
    • note: no support on ia64

October 11th 2006

LFC

  • Tracking release of 1.5.10 (certified for now)

Transfer service

  • Profiling and tuning to understand memory problems on FTS web-service.
  • Failover issues to understand with DB - FTS got 'stuck' this week.
  • New monitoring tools added - to be integrated with main FTS release and daily service procedures.
  • Plan to split tier-2 traffic to a distinct FTS service.

Castor

  • SRM: srm.cern.ch was upgraded to 2.2.9-4.slc3 on Oct 10. This version fixes a problem in stat() for files larger than 2 GB. It is also linked against libshift.so.2.1, which is needed to run against stagers 2.1.1. Endpoints in CNAF and ASGC have also been upgraded.
  • CASTORPUBLIC: upgrade postponed again. No new date known yet.
  • We have started to set up a SRM v22 pre-production service, to be available on Nov 1
  • We have also started to verify the Castor SW stack (Castor, Gridftp, RGMA, etc) on SLC4 diskservers (x86_64 architecture) with XFS filesystems. So far, so good...

AOB

As of Oct 10, the CERN firewall configuration is semi-automatic for certain Landb-sets. In particular, the Data Management production servers (Gridftp, SRM, LFC, FTS) are benefiting from this - when machines are added/removed to these sets, the firewall configurations will be updated without the need for security scans. It is currently still necessary to ask Computer Security to update the firewall configuration from the Landb sets, but this should become fully automatic in the future.

The ELFms Wiki provides more details about how Landb-sets are updated for Quattor-managed machines.

October 4th 2006

Oracle patch

  • FTS + LFC: Basic functional tests were OK for Oracle patch (security + cursor problem)

September 27th 2006

LFC

  • Nothing to report.

Transfer service

  • Firewall (tcp/8443) was closed on new FTS web-service node - this has been opened after security scan (still to confirm).
    • Update FtsWlcg internal hardware plan (which nodes are which) - it's outdated.
  • New info provider and FTS nodes seem to be handling the load well.
    • Will update to the 'official' RPM as soon as it's released.
  • Need to split 'general' FTS service (CERN/Tier-2) from Tier-0 export service - will review hardware requirements for this.

Castor

  • new Castor client s/w version 2.1.1-1 (fixing 'request mixing' bug) under test on selected lxplus/lxbatch nodes (and on cmscsa06 WNs), to be deployed site-wide on Monday, Oct 2.
  • upgrades of servers:
    • CASTORCMS upgraded to v 2.1.0-6 on Mon Sep 25
    • CASTORATLAS upgraded to v 2.1.0-6 on Tue Sep 26.
    • CASTORLHCB scheduled for Mon Oct 2
    • CASTORALICE upgrade to be decided
    • new version 2.1.1-2 (server-side fix of 'request mixing' bug...) under test. Also needed for repack utility. Upgrade of CASTORPUBLIC to be planned, hopefully early next week...
  • Actively following up CASTOR issues with Atlas, details in our logbook

September 21st 2006

LFC

  • Actively following up LFC performance issues with Atlas.

Transfer service

  • Problems seen on web-service (tomcat lock-up). Putting alarm in place for this.
  • Added a new load-balanced alias to cope with load, but its configuration was bad. This led to serious service degradation for 24+ hours.
  • Working out best mechanism for getting FTS monitoring into the COD monitoring.

Castor

  • CASTORPUBLIC delivered, DTEAM and OPS transfers have started. Gridftp timeouts for DTEAM transfers are possibly caused by too high a number of concurrent transfer requests for the number of servers in the pool. Too much bandwidth available for DTEAM background transfers :-)
  • CASTORITDC aka CASTORCSA06 delivered. Standby stager for CMS CSA06 with a t0export pool.
  • CASTORATLAS: discussed problems on durable diskpool with Atlas, agreed to try several workarounds (selective garbage collection, avoid writing from non-production users)
  • Will upgrade CASTORATLAS and CASTORCMS instances to 2.1.0-6 early next week.

September 13th 2006

LFC

  • DB corruption: Still a couple of bad file entries being chased with LHCb. Still to check the other catalogs.

Transfer service

  • Patch for lock-up problem progressing through release process - See DMFtsPatchStatus
  • T2 to CERN channels opened for Atlas
  • Problems seen on web-service (tomcat lock-up). Putting alarm in place for this.

Castor

  • CASTORPUBLIC is being set up, using latest Castor version 2.1.1-0. The first customers will be dteam, ops, unosat VO's
  • Following read timeout problems, we removed iptables from the diskservers. This seems to remedy the situation; we are following this up.
  • CASTORGRID is being renumbered and downsized. No service interruptions are foreseen.
  • Reminder: castorgrid.cern.ch SRM endpoint to be removed on October 1st.

September 6th 2006

LFC

  • All corrupted catalog entries that were found have been put back into LHCb catalog.

Transfer service

  • Patch for lock-up problem in preparation.
  • ATLAS requesting to use tier-2 to CERN channels.

Castor

  • deployed named on all database servers, to minimize exposure in case of DNS trouble
  • deployed several gLite updates, including yesterday's security updates.
  • working with Atlas and CMS to prepare their upcoming challenges
  • next week, we will create castorpublic out of the c2sc4 instance.
  • we have announced that the castorgrid.cern.ch SRM endpoint will vanish on Oct 1. Castorgrid will continue as a Classic SE, as it is still being used by non-LHC collaborations.

August 30th 2006

LFC

  • Intervention scheduled for DB account split on LFC (Atlas and CMS).

Transfer service

  • Working on exposing transfer statistics via GridView.
  • Service failure (50% failure) on 29th under investigation (contention in database).
  • High load on web-service noted from Alice (contacted to make their status polls more efficient).

Castor

  • v 2.1.0-8 (fixing a bug in rfcp affecting castor1 users) was deployed last Thursday/Friday on the Computer Center machines. The Linux desktops were updated today.
  • The durable diskpool atldata filled up last weekend (no garbage collection on durable pools...), but Castor continued to schedule write requests to the pool. Requests piled up on the instance, as well as on the SRM endpoints that issued Castor requests. Fixed by adding more diskspace to the pool, and by removing 10K files (that are on tape anyway)
  • On Sunday, a problem with VOMS led to a change of grid-mappings. In particular, Maarten Litmaath's mapping changed from dteam001 to dteamsgm, and dteamsgm was not "castor mapped". We now added castor mappings for all SGM accounts, and have deployed a new version of edg-mkgrid that should handle the underlying problem in a better way.

Issues to follow

  • Debugging tier-2 to CERN channels still takes much effort from the FTS and Castor teams.

August 23rd 2006

LFC

  • Data corruption caused by Oracle problem. Oracle workaround now in place. Requests to clean up the namespace.

Transfer service

  • Upgraded to FTS patch 737 + 787 (various transfer logic fixes, service discovery fixes).
  • Working on exposing transfer statistics via GridView.

Castor

  • Castor S/W version 2.1.0-7 (containing the hotfix for the 'request mixing' bug) has been available for testing on 2 dedicated nodes since the end of last week. No problems have been reported, and we are deploying it on the Computer Center client nodes (lxplus, lxbatch) today. Linux desktops will follow shortly.

Issues to follow

  • Setting up dedicated tier-2 to CERN channels. This is a temporary solution for CMS' immediate problems and is absorbing some effort. If this is to last longer term, we need a better solution (T1 to manage T2 to CERN channels?)

  • Maarten will provide a new lcg-mon-gridftp RPM to include an hourly restart of the RGMA producers. This should help to make the Gridview plots more correct.

August 16th 2006

LFC and DPM

  • ROWId problem under investigation ("Internal error" to clients) - James / Jean-Philippe.

Transfer service

  • Handover to FIO: Tutorials to follow. Still to test ncm component
  • New patches to deploy - rolling upgrade.

Castor

  • Load problems on the CMS stager last week significantly slowed down the t0export tests run by Tony W. The request load from the ongoing tier-0 - tier-1 data export together with the normal user activity resulted in slow response (0.5 - 1 minute) for an rfcp transfer that would normally only take 6-7 seconds to complete. The solution for this problem is to modify the current function of the CASTOR2 LSF plugin. The castor development team is working on a new LSF plugin with high priority.
  • In the context of the above problem a CASTOR2 client API bug was found. If the stager is under high load and the clients start to interrupt their requests with Ctrl-C, there is a high risk of request mixing, where subsequent requests on the same client node may receive the callback from the interrupted request. This bug was introduced with the implementation of a callback port-range (ports 30,000 - 30,100) that was released and deployed in the beginning of July. A hot fix doing a random port selection was deployed last week in AFS and announced to the experiments. The fixed release will soon be widely rolled out after an initial deployment on a test cluster for certification.
  • Monitoring (lemon) problems Saturday morning 2am - 10am resulted in all CASTOR2 diskservers being automatically removed from production.
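
The hot fix for the request-mixing bug above is described only as "a random port selection". A minimal Python sketch of that idea follows. It is not the CASTOR2 client code: the function name choose_callback_port and the recently_used set are hypothetical, and it assumes the random choice stays inside the 30,000 - 30,100 range, which the report does not state.

```python
import random

# Callback port range quoted in the report (inclusive).
PORT_MIN, PORT_MAX = 30000, 30100


def choose_callback_port(recently_used, rng=random):
    """Pick a random port in the range, avoiding recently used ones.

    `recently_used` is a hypothetical set of ports whose requests may
    still have callbacks in flight (for example after an interrupt).
    Avoiding them is an assumption added for illustration.
    """
    candidates = [port for port in range(PORT_MIN, PORT_MAX + 1)
                  if port not in recently_used]
    if not candidates:
        raise RuntimeError("no callback port available in range")
    return rng.choice(candidates)
```

With a random choice, a new request on the same client node is unlikely to listen on the port that an interrupted request was using, so a late callback for the old request is less likely to reach the new one.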

Issues to follow

  • CMS seeing a variety of problems transferring data into CERN
    • Slow / choppy speed transfer from T2 over GPN
    • Slow transfers from some T1 sites over OPN (e.g. ASCC)
    • Large variety of non-CERN problems: bad SRM, odd errors from gridFTP
  • Email requests from CMS to have a separate central LFC database account (i.e. separate from Atlas' namespace) to avoid seeing Atlas' entries.

August 2nd 2006

LFC and DPM

  • No major issues.
  • Handover to FIO complete

Transfer service

  • Handover to FIO: Tutorials to follow. Still to test ncm component
  • Outage from expired hostcert on myproxy-fts.cern.ch
  • Patches in validation

Castor

  • the castorsrm.cern.ch loadbalanced DNS alias changed to point to srm.cern.ch yesterday afternoon. The plan is to leave the srm running for a while on castorgrid.cern.ch (to which castorsrm was pointing previously) and remove the VOs one by one after agreement with each experiment. The final aim is to completely stop the SRM service on castorgrid.cern.ch but leave it as a gridftp gateway for non-LHC experiments until it is not needed anymore.

  • together with ATLAS we are testing a different configuration for their Tier-0 load tests with a separate export buffer to better understand the resource requirements for each step.

  • I followed some slow LHCb transfers live yesterday using strace. No conclusive results so far but there are no indications of problems on the castor side.

Issues

  • Still to understand slow/irregular transfer rate into CERN (LHCb).

July 26th 2006

LFC and DPM

  • No major issues.
  • Patch working its way through certification (for python API).
  • Tutorials this week for sysadmins to complete handover.

Transfer service

  • No major issues.
  • Patch going through certification to fix a variety of issues.
  • Still understanding the issue of slow / irregular transfer rate into CERN.
  • Tutorials to follow.

Castor

  • We want to change castorsrm.cern.ch DNS alias to point to srm.cern.ch, to work around the pattern matching problem in gfal. Currently waiting for LHCb's green light...
  • LHCB reports of slow transfer rates from WN's to Castor @ Cern: investigating at low priority
  • Configuring CMS and Alice instances for data challenges
  • We configured UNOSAT access through castorgrid.cern.ch

July 19th 2006

Activities completed and work in progress.

  • updated EGEE.BDII to publish root as supported access protocols on 4 endpoints {srm,castorgridsc,srm-{lhcb,atlas}-durable}.cern.ch
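
The brace notation in the item above is shell-style shorthand for four host names. As a sketch, the expansion can be reproduced in Python; the helper expand_braces is written here for illustration and is not part of any BDII tooling.

```python
def expand_braces(pattern):
    """Expand shell-style {a,b} alternatives, including nested ones."""
    start = pattern.find("{")
    if start == -1:
        return [pattern]
    # Find the brace that matches the first "{".
    depth = 0
    for i in range(start, len(pattern)):
        if pattern[i] == "{":
            depth += 1
        elif pattern[i] == "}":
            depth -= 1
            if depth == 0:
                end = i
                break
    else:
        raise ValueError("unbalanced braces in pattern")
    # Split the group on top-level commas only.
    options, current, depth = [], "", 0
    for ch in pattern[start + 1:end]:
        if ch == "," and depth == 0:
            options.append(current)
            current = ""
        else:
            depth += ch == "{"
            depth -= ch == "}"
            current += ch
    options.append(current)
    prefix, suffix = pattern[:start], pattern[end + 1:]
    results = []
    for option in options:
        # Recurse so that nested groups in the option are expanded too.
        results.extend(expand_braces(prefix + option + suffix))
    return results


endpoints = expand_braces("{srm,castorgridsc,srm-{lhcb,atlas}-durable}.cern.ch")
for name in endpoints:
    print(name)
```

This lists srm.cern.ch, castorgridsc.cern.ch, srm-lhcb-durable.cern.ch and srm-atlas-durable.cern.ch.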

LFC and DPM

  • inconsistencies in Atlas LFC catalogue (host field not containing correct host), being investigated
  • Atlas nodes overloaded, causing alarms. The CPU consumption is too high.

Transfer service

  • ...

Castor

  • main activity: debugging/running services, especially for LHCb
  • new rootd release 5.12.00-1 is deployed on all diskservers

July 12th 2006

Activities completed and work in progress.

  • Very slow gridFTP traffic inbound to CERN understood, solution is being deployed.

LFC and DPM

  • ...

Transfer service

  • ...

Castor

  • client software upgraded to 2.1.0-3
  • several problems with Atlas, CMS, LHCB stagers

Issues

  • ...

July 6th 2006

Activities completed and work in progress.

LFC and DPM

  • Production ATLAS local catalog had IP address problem (different IP from LANDB). Rebooted, and machine was accidentally reinstalled from scratch. Service RPMs and most configuration was re-installed automatically but a few things still had to be done by hand; it was documented but not done fully automatically. Catalog was down until the next day due to not being published correctly in BDII.
  • This may be the cause of ATLAS's LFC problems recently.
  • Security problem: audit logs were lost when system was reinstalled. Suggest backing up logs into tape system.

Transfer service

  • SRM copy issues .. new error from FNAL SRM that we haven't seen before - has it changed again? Working on solution, awaiting reply from FNAL.
  • Solution for cleanly aborting gridFTP going to certification very soon.
  • Variety of other fixes coming along with this patch.

Castor

  • ...

Issues

  1. No effective validation process in LCG between dCache and LCG DM components; changes in dCache that affect FTS / GFAL are found on the production system first.
  2. Very slow gridFTP traffic inbound to CERN (from dCache sites) reported by CMS (~250 kilobytes per second). This issue is still not understood. CMS is about to start on these channels again. (FTS gridFTP abort bug being fixed for next patch).
  3. ROCs still have a tendency to re-assign too quickly to software support in GGUS. For operational problems, they should involve the software support rather than re-assign. That way, they will learn the solution for the next time.
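
To put the ~250 kilobytes per second reported by CMS in perspective, the following Python sketch computes how long a single file would take at that rate. Only the rate comes from the report; the 2 GB file size is a hypothetical value chosen for illustration.

```python
def transfer_time_seconds(size_bytes, rate_bytes_per_s):
    """Time needed to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s


RATE = 250 * 1000         # ~250 kilobytes per second, from the report
FILE_SIZE = 2 * 1000**3   # hypothetical 2 GB file, not from the report

seconds = transfer_time_seconds(FILE_SIZE, RATE)
hours = seconds / 3600
print(f"{seconds:.0f} s, about {hours:.1f} hours")
```

Under these assumptions one file takes 8000 seconds, a little over two hours.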

June 22nd 2006

Activities completed and work in progress.

LFC and DPM

  • No major issues.

Transfer service

  • Service failures due to T1 sites publishing bad information are mostly resolved now. Chasing up remaining ones from SAM with GGUS.

  • Issue with SRM copy at FNAL understood: the new SRM copy version of dCache deployed there has a changed behaviour. If it has a non-fatal error message, it sends an error (as advisory) but continues with the operation. The FTS assumes the error was fatal, and the consequent cleanup attempt and retries cause a variety of problems. There is a fix on SRM at FNAL (that FNAL don't want to keep), and a 'fix' pending on FTS to accommodate the new behaviour. It is unclear what effect this 'fix' will have on other SRMs that do not follow the new dCache behaviour.

Castor

  • rollout of v 2.1.0-3:
    • alice,itdc,lhcb: done, atlas,cms,sc4: later
    • clients: later, still needs to be planned
  • work (still) ongoing to rationalize grid-mapfile management on srm, castorgrid, diskservers...
  • move of diskservers to LHC OPN completed
  • several hick-ups in recent weeks: oracle corruption affecting atlas, LSF licensing problem affecting cms, bugs in DB schema affecting Alice
  • Castor Readiness Review has reported
  • durable v1.1 endpoints are available, will be added to information systems.

Issues

  1. Subtle behavioural changes in dCache are causing real operational problems and service outages while the changes are understood for any clients not using the dCache srmcp commandline tool.
  2. Very slow gridFTP traffic inbound to CERN (from dCache sites) reported by CMS (~250 kilobytes per second). This issue is not understood. Not clear who is following it up.
  3. As a result of this, FTS gridFTP timeout triggers. This has exposed a bug in FTS - we do not send a clean gridFTP protocol abort on timeout, causing the Castor gridFTP server to hold onto some SRM resource. This causes the SRM cleanup to fail, which causes the subsequent retries to fail. Issue understood and fix pending in FTS.
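
The failure chain in the last issue (no clean abort on timeout, resource still held, cleanup fails, retries fail) can be illustrated with a toy model in Python. Every name here is hypothetical; this is not FTS, SRM or Castor code, only a sketch of why sending the abort before cleanup matters.

```python
class FakeGridFtpServer:
    """Toy stand-in for a gridFTP server holding an SRM resource."""

    def __init__(self):
        self.holds_resource = False

    def start_transfer(self):
        self.holds_resource = True

    def abort(self):
        # A clean protocol abort lets the server release the resource.
        self.holds_resource = False


def srm_cleanup(server):
    """Cleanup succeeds only if the server has released the resource."""
    return not server.holds_resource


def handle_timeout(server, send_abort):
    """Return True if cleanup (and hence a later retry) can succeed."""
    if send_abort:
        server.abort()
    return srm_cleanup(server)


server = FakeGridFtpServer()
server.start_transfer()
without_abort = handle_timeout(server, send_abort=False)  # cleanup fails
with_abort = handle_timeout(server, send_abort=True)      # cleanup succeeds
```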

June 8th 2006

Activities completed and work in progress.

LFC and DPM

  • Setup continues for LFC replication work - streams setup underway.
  • Problem with Python interface in gLite 3.0 (swig version) - being looked at.

FTS

  • ncm-yaim for FTS in testing
  • SURE alarms installed for FTA agent daemons
  • Working on getting FTS transfer data into GridView

Castor

  • Busy with Castor review
  • Preparing castoralice for the Atlas TDAQ tests
  • LHCb have (just) announced that they will migrate their users to Castor-2 on June 8
  • We still manage many grid mappings through modifications in the local grid-mapfile, and need to see how to change that...
  • Following up Phedex+FTS (CMS) and srmcp (ATLAS) problems reported on GGUS for 'put' requests on srm.cern.ch
  • Reminder: a big intervention on Mon 12!

Issues

  • Serious service failures on many transfer channels. Various causes:
    • Information system issues
      • Sites not publishing any information about their SC4 SRMs
      • Sites publishing the incorrect site name (c.f. the site name in GOC DB)
      • Sites publishing their SRMs but not for all the relevant VOs
      • Transient nature of the site infrastructure information in BDII (information about sites will disappear for minutes at a time). FTS information system cache will need to be rewritten to cope with this. Workaround is to use static file, update by hand, and distribute this to all sites.
    • SRM problems
      • Various SRM problems at many sites
      • CMS report problems writing into CERN-PROD (FTS/Castor looking into it)
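
The item above on transient BDII information says the FTS information system cache will need to be rewritten. One possible approach, sketched here in Python under stated assumptions, is to keep serving the last non-empty answer for a grace period when a site briefly disappears. The class name, the lookup callable and the grace period are all hypothetical; this is not the FTS design.

```python
import time


class LastKnownGoodCache:
    """Keep the last non-empty lookup result for a grace period.

    `lookup` is a hypothetical callable returning a list of endpoints
    for a site (empty when the information system has lost the site).
    """

    def __init__(self, lookup, grace_seconds=3600, clock=time.time):
        self._lookup = lookup
        self._grace = grace_seconds
        self._clock = clock
        self._entries = {}  # site -> (timestamp, endpoints)

    def get(self, site):
        now = self._clock()
        fresh = self._lookup(site)
        if fresh:
            # Normal case: remember the fresh answer.
            self._entries[site] = (now, fresh)
            return fresh
        cached = self._entries.get(site)
        if cached and now - cached[0] <= self._grace:
            # Site vanished briefly: serve the remembered answer.
            return cached[1]
        # Nothing cached, or the cached answer is too old to trust.
        return []
```

The grace period bounds how long stale information is used, so a site that is really gone is eventually dropped, while gaps of a few minutes no longer take a channel down.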

May 31st 2006

Activities completed and work in progress.

DB validation

  • LFC validated against 10gR2
  • FTS validated against 10gR2
  • Must use 10gR2 InstantClient.
  • Intervention planned for 12th at same time as Castor Intervention.

LFC and DPM

  • Setup continues for LFC replication work - populating DB going slow.
  • Crashing problem solved as version mismatch between Oracle server and client (use 10gR2).
  • LFC Dashboard filled in - ready to hand over to production service manager.

FTS

  • One alarm to install to monitor FTA agent daemons.
  • Work to do ncm-yaim for FTS underway
  • FTS Dashboard filled in - will be ready soon to hand over to production service manager.
  • Working on getting FTS transfer data into GridView

Castor

  • Diskservers have been prepared and distributed across experiments instances.
  • Castor atlas DB had problems because of an Oracle procedure. It was a known bug (whose side effects had never reached these proportions). It was already fixed in more recent releases; the fix was backported and deployed.
  • Preparation for Atlas service challenge: 10 diskservers moved into production, 2 new service classes (t0merge and analysis).
  • LHCB has been running for a month now. Thanks to their constant writing to disk/tape (40 terabytes) a problem was detected on the compression factor being reported by the tape drives they were using.

Issues

  • Names provided for 3rd level support - need to sort out details, training, etc.
  • Many pending requests in GGUS for CERN / Tier-2 channels from CMS.

May 23rd 2006

Activities completed and work in progress.

LFC and DPM

  • Upgrade to 3.0 done.
  • Setup continues for LFC replication work.
  • LFC setup for Biomed demo.. went well.
  • Issues being investigated with crashes: looks like a problem with RPMs built against 10gR1 client being used against 10gR2 server.
  • LFC Dashboard filled in - ready to hand over to production service manager.

FTS

  • Procedures updated. Working on improving alarms.
  • Working on getting FTS transfer data into GridView.
  • Oracle 10gR2 validation underway - looking good.
  • Intervention made to update channel sources and destination to use correct GOCDB names for sites.
  • No names for 3rd level support

Castor

  • We are planning the migration of ~200 diskservers to the OPN on June 20
  • People are busy preparing the Castor review of June 6 - 9, rollout of 2.1.0-0 is stalled
  • We are adding diskservers to the LHC Castor instances, in accordance with the resource planning. We are preparing for the Atlas Tier-0 test as well.
  • We now use Yaim to manage the grid-mapfile on srm.cern.ch. Local grid-mapfile has been cleaned as well.
  • Last Friday's network intervention caused castorgrid.cern.ch to disappear, between 8:30 and 11:30

Issues

  • Need better simulation of tier-0 for transfer tests - work progressing here with Atlas.
  • 3 sites are not publishing SRM information correctly (BNL, NDGF, SARA) so their channels are down - work in progress.

May 17th 2006

Activities completed and work in progress.

LFC and DPM

  • Upgrade to 3.0 almost done.
  • LFC_SLOW_READDIR alarm threshold increased. Issue that caused it is being understood.
  • Setup continues for LFC replication work.

FTS

  • Procedures and SFT underway.
  • Upgrade to gLite 3.0 done.
  • Looking at ways of providing transfer monitoring information into GridView.

Castor

  • Castor s/w version 2.0.4-1 is not backwards compatible, and will not be further deployed at Cern. A new version is being prepared
  • SRM and gridftp access to CERN diskservers on the OPN is being configured for non-T1 networks. (Control data will go through the firewall, bulk data will go via HTAR.)
  • all Castor-2 diskservers have been gridftp-enabled.
  • we need to migrate the Castor nameserver database to different hardware. Tentative date: June 12. This means several hours of downtime, which we hope to use to move as many IP services as possible to the OPN.

Issues

  • There seems to be a need for an FTS service supporting Tier-0 to anywhere transfers. The request keeps popping up and it looks like the PPS is being (ab)used for this purpose. If done, it should be a distinct service from the Tier-0 to Tier-1 export (for FTS). To discuss.
  • Need better simulation of tier-0 for transfer tests.

May 10th 2006

Activities completed and work in progress.

LFC and DPM

  • Working on replication - setting up sites, understanding problems.
  • Soon to upgrade production with glite3.0 version (1.5.7).
  • LFC_SLOW_READDIR alarms on several machines with very little load. Some trivial calls seem to take a long time - maybe hardware issues?

FTS

  • Nothing new. Slowly improving procedures.
  • SFT underway (Piotr).
  • Upgrade to 3.0: half done (web-service).

Castor

  • new Castor version 2.0.4-1 (incl support for TURLs in RFIO) is being released. Roll out over the coming week(s).
  • "durable endpoints": GOCDB + EGEE.BDII configurations prepared by James.
  • stageatlas and stagecms Castor-1 stagers have been stopped
  • continuous operational improvements (monitoring, procedures, configurations, ...)

Issues

  • Need publication of all T1 SRMs in EGEE.BDII: 3 sites still missing.
  • Need better simulation of tier-0 for transfer tests.
  • Need a new VO for interop testing. Initially SRM.

May 3rd 2006

Activities completed and work in progress.

LFC and DPM

  • 1.5.7 added into 3.0 release.

FTS

  • DB cleanup procedure is OK; looking for cleaner solution.
  • fts001 to fts005 test machines now a subcluster of gridfts.
  • Improving service procedures for FTS. Yaim install guide, etc.
  • Various fixes of FTS went into 3.0rc5.

GFAL/lcg-util

  • Nothing new.

Castor

  • CMS switched to use CASTOR2 on May 2nd. The first day of running has been smooth with an average of more than 250 concurrent streams
  • started to move half (20) of the disk servers used for the SC4 throughput phase to experiment production CASTOR2 instances
  • lhcbprod was wrongly mapped to their CASTOR1 stager (stagelhcb) on castorgrid, which resulted in serious overload of stagelhcb over the Weekend. Also the gridmap mapping was wrong for some of the LHCb production managers. Both the stager mapping and gridmapping were fixed May 2nd in the morning and the LHCb production is now using CASTOR2. The reason why the mapping was wrong on castorgrid has not yet been understood.

Issues

  • Need publication of all T1 SRMs in EGEE.BDII.
  • Need better simulation of tier-0 for transfer tests.
  • Need a new VO for interop testing. Initially SRM.

April 26th 2006

Activities completed and work in progress.

LFC

  • Finishing service procedures for LFC.

FTS

  • Continued work to clean up DB; this caused some problems on the SC4 setup.
  • Our FTS test setup is not of sufficient scale to catch all problems; working to improve it. Will use existing fts001 to fts005 for this.
  • Improving service procedures for FTS. Yaim install guide, etc.
  • Misc fixes for FTS in certification waiting to be deployed: file descriptor leak in SRM copy.

DPM

  • Nothing new.

GFAL/lcg-util

  • Nothing new.

Castor

  • new Castor release including support for TURLs in RFIO (for "durable" SRM endpoints) is being tested.
  • durable SRM endpoints are being prepared. To be added to GOCDB and BDII. Endpoint names will be srm-durable-{lhcb,atlas,cms,alice}.cern.ch
  • lcg-mon-gridftp now also reports non-SC4 gridftp transfers from experiment instance. We hope to add it to castorgrid as well. The data ends up in the Gridview plots!
  • We are ready to use yaim to manage gridmapfile creation on SRM + Castor-2 nodes, but wait for a newer yaim version with a silenced cron job.
  • Automated downloads of host certificates: certificates are being put into Sindes, an additional Lemon sensor is being prepared as well.
  • Experiment migrations:
    • Alice: done
    • Atlas: migrated last Thursday
    • CMS: user migration now foreseen for May 2nd
    • LHCb: getting ready for user migration, no date given, but should be Real Soon Now

Issues

  • Need publication of all T1 SRMs in EGEE.BDII.
  • Need better simulation of tier-0 for transfer tests.
  • Need a new VO for interop testing. Initially SRM.

April 12th 2006

Activities completed since last time.

LFC

  • RLS shutdown complete.

FTS

  • SRM copy tests running to FNAL. Still tuning.
  • Upgraded FTS to 3.0rc2 release for SC4 throughput test.

DPM

  • Test version 1.5.6 available with replication/pool-draining functionality.
  • Progress on SRM v2 status code convergence.

GFAL/lcg-util

  • New version of GFAL/lcg-util fixing all outstanding issues ready for testing.

Castor

  • RGMA and lcg-mon-gridftp configuration on Castor diskservers is now fully automated
  • srm.cern.ch is now published in the site-EGEE.BDII
  • We have switched off the oplapro boxes!
  • Migrations of CMS, Atlas, LHCb: next week most activities to be migrated to Castor-2!

Work in progress

  • LFC: Starting tests with streams for replication (CERN+CNAF)
  • Improving service procedures for LFC and FTS.
  • A couple of FTS problems were found in throughput testing: these are being addressed.
  • DB cleanup in FTS needed: the cleanup is affecting the rate.

Issues

March 29th 2006

Activities completed since last time.

LFC

  • RLS shutdown proceeding for 31st. No issue with migration reported.

FTS

  • No new issues.
  • Running transfers now. SRM copy test transfers will run to FNAL and BNL.

DPM

  • No new issues.

Castor

  • SC4 instance now has 43 diskservers, spread over 4 IP services. This should be the configuration for the throughput phase
  • tape writing will be enabled for the SC4 instance
  • the castorgridsc.cern.ch endpoint is now using the same nodes as srm.cern.ch. We hope to switch off the oplapro boxes some time next week
  • Atlas now have their Castor-2 instance as well

Work in progress

  • Improving service procedures for LFC and FTS.
  • Upgrade of FTS to current 3.0 release.

Issues

March 20th 2006

Activities completed since last time.

LFC

  • RLS migration validation still not confirmed. End of month for complete shutdown of RLS.
  • Upgraded central LFC catalog for Atlas and CMS to version 1.5.4 to fix operational problem.
  • ncm-yaim component needs minor update to cope with gLite 3.0 yaim.

FTS

  • Packaged 3.0 release.
  • Background dteam transfers have started. More sites needed!
  • FTS channel plan in preparation for sites.

DPM

  • No major changes since last week. 1.5.4 being tested.

Castor

  • Production castor instance for Atlas is being set up
  • ~60 new diskservers on the LCG OPN have been received, 36 (split over 3 IP services) are being set up for SC4

Work in progress

  • Remaining LFCs will be upgraded to 1.5.4 tomorrow.

  • FTS: monitoring, SRM copy fixes. SRM v2 addition (and support for mixed SRM v1 / v2).
  • FTS: basic SFT done - ready for integration with SFT framework. End-to-end service functional tests being migrated from current EIS tests.

  • Castor: new stager_qry had some bugs. The fixes may have introduced another problem, which may explain the current SRM problems. Being investigated with high priority
  • Castor: the castorsrm DNS alias is currently pointing to the public castorgrid service, and cannot be moved until Atlas stop using it. We will discuss with them on Friday. Until then, srm.cern.ch can be used.

  • SRM v2 interoperability work proceeding.

Issues

  • For SC4, LHCb have asked for a replicated read-only copy of their global LFC catalog. This is being investigated now. Understanding the requirements with the experiments.
  • Agreement on SRM storage class for SRM v1. Should be aware that any changes in Castor SRM endpoint will result in a significant LFC SURL migration - this will involve catalog downtime of many hours.

March 15th 2006

Activities completed since last time.

LFC

  • RLS migration validation still not confirmed. Suggested end of month for complete shutdown of RLS. No comments on this yet.
  • Upgraded central LFC catalog for Atlas and CMS to version 1.5.4 to fix operational problem.
  • Other LFCs to be upgraded: this will be done by new LFC operations team.

FTS

  • YAIM scripts done for transfer agents.

DPM

  • 1.5.4 released and being tested.
  • Interoperability testing is ongoing.
  • Pool draining and internal replication code done - in testing.

Castor

  • Agreement between Castor and dCache on how to implement storage type for SRMv1 is now ~final.
  • Production castor instances for LHCb and CMS have been delivered.
  • SC4 instance with first 8 diskservers has been delivered
  • SRMv1 endpoint for SC4 throughput phase is ready as well

Work in progress

  • More diskservers for the SC4 Castor instance should be installed soon. We ran into SLC4 installation issues on some of the designated servers; next week we should receive and install a batch of ~60 new servers on 3 IP services.
  • Dismantling castorgridsc:
    • LHCB use the castorgridsc endpoint, but are mapped to their Castor-2 instance. We will re-assign their ~10 diskservers.
    • CMS have agreed to stop as well, we will reclaim their ~20 diskservers
    • Alice no longer use castorgridsc
    • Atlas: their Castor-2 instance is not ready yet...
  • Alice would like xrootd support, LHCb would like rootd support: these are not compatible. This is being investigated.

  • FTS: monitoring, SRM copy fixes. SRM v2 addition (and support for mixed SRM v1 / v2).
  • FTS: basic SFT done - ready for integration with SFT framework. End-to-end service functional tests being migrated from current EIS tests.

  • Starting background dteam transfers.
  • SRM v2 interoperability work proceeding.

Issues

  • For SC4, LHCb have asked for a replicated read-only copy of their global LFC catalog. This is being investigated now. Understanding the requirements with the experiments.

March 7th 2006

Activities completed since last time.

LFC

  • RLS migration validation still ongoing.

FTS

  • Script written to move 'dead/done' jobs in FTS to a history table (this should reduce some of our database problems and substantially reduce the query load on the database). This is a 'quick fix', pending doing it properly using Oracle partitioning.
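
A minimal sketch of this kind of archival step, assuming SQLite in place of Oracle so that it runs standalone. The table names (t_job, t_job_history), the columns, and the set of terminal states are hypothetical and are not taken from the actual FTS schema or script.

```python
import sqlite3

# Hypothetical terminal states; the real FTS state names are not given in this report.
TERMINAL_STATES = ("Done", "Failed", "Canceled")


def archive_finished_jobs(conn):
    """Move jobs in a terminal state from t_job to t_job_history.

    Returns the number of rows removed from the live table.
    """
    placeholders = ",".join("?" for _ in TERMINAL_STATES)
    # One transaction: the copy and the delete commit together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO t_job_history "
            f"SELECT * FROM t_job WHERE state IN ({placeholders})",
            TERMINAL_STATES,
        )
        cur = conn.execute(
            f"DELETE FROM t_job WHERE state IN ({placeholders})",
            TERMINAL_STATES,
        )
    return cur.rowcount


def make_demo_db():
    """Build an in-memory database with a live table and a history table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t_job (job_id TEXT PRIMARY KEY, state TEXT)")
    conn.execute("CREATE TABLE t_job_history (job_id TEXT PRIMARY KEY, state TEXT)")
    conn.executemany(
        "INSERT INTO t_job VALUES (?, ?)",
        [("a", "Done"), ("b", "Active"), ("c", "Failed")],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    db = make_demo_db()
    print("archived:", archive_finished_jobs(db))
```

Keeping the live table small is what reduces the query load; partitioning, as mentioned above, would achieve the same separation inside the database rather than with a periodic script.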

DPM

  • 1.5.4 released and being tested. Interoperability testing underway.

Castor

  • SC4 castor instances for experiments being prepared.

Work in progress

  • FTS: monitoring, SRM copy fixes. SRM v2 addition (and support for mixed SRM v1 / v2).
  • FTS: basic SFT done - ready for integration with SFT framework. End-to-end service functional tests being migrated from current EIS tests.

  • Alice would like xrootd support, LHCb would like rootd support: these are not compatible. This is being investigated.

  • Starting background dteam transfers.

  • SRM v2 interoperability work proceeding. Castor/DPM interop almost done. Slow progress testing interop with dCache: Axis/gSoap problems.

Issues

  • Agreement between Castor and dCache has not yet been reached on the mechanism for choosing durable or permanent storage.

  • For SC4, LHCb have asked for a replicated read-only copy of their global LFC catalog. This is being investigated now. Understanding the requirements with the experiments.

  • Staging support needed for FTS.

  • Difficulty of debugging intermittent problems (degraded service) when using SRM copy: the T1 has all the relevant logs.

March 1st 2006

Activities completed since last time.

LFC

  • RLS migration done. Expt. validation underway.

FTS

  • Moved FTS entirely to gridfts cluster. Load-balancing on webservice almost complete. Yaiming underway to help auto-configuration.

DPM

  • 1.5.4 released.

Castor

Work in progress

  • FTS: monitoring, SRM copy fixes. SRM v2 addition (and support for mixed SRM v1 / v2).
  • FTS: basic SFT done - ready for integration with SFT framework. End-to-end service functional tests will be migrated from current EIS tests.

Issues

  • For SC4, LHCb have asked for a replicated read-only copy of their global LFC catalog. This is for purposes of scalability of queries and also for some level of reliability against LFC server failure. They would like it at a Tier-1 site (perhaps Lyon). We currently believe we can't run a service distributed across two sites, and suggested that perhaps having a replica database located on-site (perhaps at Prevessin) would provide a first solution. This could be re-evaluated once distributed database services were in production. Will look at this after the RLS migration is complete.

  • Staging support needed for FTS.

  • Difficulty of debugging intermittent problems (degraded service) when using SRM copy: the T1 has all the relevant logs.

February 22nd 2006

Activities completed since last time.

  • Mumbai was fun. Several issues resolved.

LFC

  • RLS migration is almost complete.
  • Upgrade of LFC to 1.4.5 (a few new functions requested by experiments) will happen as soon as migration is over.

FTS

  • Agreed to leave data scheduler. Transfers will be made point to point. Minimal changes needed to allow 'grouping' of sites for channel definitions.
  • Agreed to support mixed SRM v1 / v2.
  • Closer contact with SRM working group agreed.
  • Plan for SRM v2 interoperability testing agreed.

DPM

  • Some remaining issues with VOMS integration need to be addressed before it can be released.

Castor

  • New release 2.0.3-0:
    • stager_qry with regexp / directory querying
    • improvements for SLC4
  • installations of experiment instances ongoing
  • SRM endpoints for SC4: installations have started

Work in progress

  • FTS: monitoring, SRM copy fixes. SRM v2 addition (and support for mixed SRM v1 / v2).
  • FTS: basic site functional tests will be provided by end February. End-to-end service functional tests will be migrated from current EIS tests.
  • Load balanced FTS web-service being deployed on 2 nodes of the gridfts cluster. VO agents moving to gridfts cluster. Should be done by end of week.

Issues

  • For SC4, LHCb have asked for a replicated read-only copy of their global LFC catalog. This is for purposes of scalability of queries and also for some level of reliability against LFC server failure. They would like it at a Tier-1 site (perhaps Lyon). We currently believe we can't run a service distributed across two sites, and suggested that perhaps having a replica database located on-site (perhaps at Prevessin) would provide a first solution. This could be re-evaluated once distributed database services were in production. Will look at this after the RLS migration is complete.

  • Staging support needed for FTS.

  • Difficulty of debugging intermittent problems (degraded service) when using SRM copy: the T1 has all the relevant logs.

February 1st 2006

Activities completed since last time.

LFC

  • CMS RLS migration completed. CMS checking data consistency.

FTS

  • Moved FTS to new mid-range servers. Still not in final configuration.
  • Data scheduler prototyping work tailing off - general feeling that we can cover most of the production use-cases with the existing software (to discuss in Mumbai).
  • Operational requirements document available and prioritised by service team. Available in EDMS https://edms.cern.ch/document/700073/1.

DPM

  • DPM: Development and testing for VOMS/virtual ID enabled DPM done. Not deployed yet.

Castor

  • CASTOR2 cluster for ALICE being set up. Was waiting for a new CASTOR2 minor release solving some installation/configuration issues. The release was generated yesterday evening and the installation is now ongoing
  • The 'wan' pool in the old SC3 production instance is being gradually dismantled. Yesterday we removed the ia32 nodes, leaving only the ia64 (oplapro) nodes, which should be sufficient for the LHCb tests.
  • The SC3 re-run instance (c2sc3) will be used for LSF stress-testing for a couple of days starting on Monday next week. After that the instance will be transformed for SC4 usage and SRM v2.1 interoperability testing before SC4.

Work in progress

  • FTS: monitoring, fixes.
  • LFC: RLS ATLAS migration requires RLS to stay up for two more weeks.

Issues

  • For SC4, LHCb have asked for a replicated read-only copy of their global LFC catalog. This is for purposes of scalability of queries and also for some level of reliability against LFC server failure. They would like it at a Tier-1 site (perhaps Lyon). We currently believe we can't run a service distributed across two sites, and suggested that perhaps having a replica database located on-site (perhaps at Prevessin) would provide a first solution. This could be re-evaluated once distributed database services were in production.

  • 1 machine of 9 is broken in new FTS cluster (fts104).

  • Staging support needed for FTS - this has been deprioritised at the FTS workshop.

  • SRM v2 support needed for FTS. Need deployed SRM v2 instances to do this plus some time.

  • Difficulty of debugging intermittent problems (degraded service) when using SRM copy: the T1 has all the relevant logs.

  • Need service functional tests.

January 17th 2006

Activities completed since last time.

LFC

  • New release (1.4.4 for LCG2.7.0) - VOMS and virtual IDs enabled - Automatic DB reconnection in case of problems.

FTS

  • Installed mid-range servers with FTS. Still to configure them. Plan to move production system to this asap.
  • Work underway on data placement component requested by expts.

DPM

  • New release (1.4.4 for LCG2_7_0) - Multi-domain support - Automatic DB reconnection (DPNS) in case of problems.
  • Important: the issue of poor performance of dCache -> DPM transfers has been resolved by David/Jean-Philippe.

Castor

  • SC3 re-run cluster set up in time for the ramp up last week
  • Crashes in the dlfserver caused several problems during the weekend because the stager refuses to start when the dlfserver is down. Both problems are being worked on and in the meantime we have downgraded the dlfserver to the same version that ran OK on the previous SC3 instance. We have also enabled a monitoring actuator for automatic restarts of the dlfserver
  • Progress on the stager_qry on directories (list all staged castorfiles in a directory) using Oracle regexps. This support is required by all four LHC experiments before they can agree on the general user migration.
  • SRM v2.1 full-feature release on track for end January.
  • New CASTOR name server with ACL support (based on the DPM/LFC name server) being heavily tested. Upgrade planned for Monday 30/1.
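
The directory query mentioned above (listing all staged castorfiles in a directory via a regular expression) can be illustrated with a small sketch. It uses Python's re module rather than Oracle regexps, and the function name, the example paths, and the choice to match only direct children of the directory are assumptions made for illustration, not the stager_qry implementation.

```python
import re


def staged_in_directory(staged_paths, directory):
    """Return the staged paths that are direct children of `directory`.

    The directory name is escaped so that characters such as '.' in
    'cern.ch' are matched literally rather than as regexp wildcards.
    """
    prefix = re.escape(directory.rstrip("/"))
    pattern = re.compile(rf"^{prefix}/[^/]+$")
    return [path for path in staged_paths if pattern.match(path)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    staged = [
        "/castor/cern.ch/cms/a.root",
        "/castor/cern.ch/cms/sub/b.root",
        "/castor/cern.ch/cmsx/d.root",
    ]
    for path in staged_in_directory(staged, "/castor/cern.ch/cms/"):
        print(path)
```

Dropping the `[^/]+$` tail in favour of `.*` would turn this into a recursive listing; which of the two behaviours the real tool provides is not stated in this report.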

Work in progress

  • FTS: data scheduler, monitoring, fixes.
  • LFC: RLS ATLAS migration.
  • DPM: Development and testing for VOMS/virtual ID enabled DPM (almost done).

Issues

  • For SC4, LHCb have asked for a replicated read-only copy of their global LFC catalog. This is for purposes of scalability of queries and also for some level of reliability against LFC server failure. They would like it at a Tier-1 site (perhaps Lyon). We currently believe we can't run a service distributed across two sites, and suggested that perhaps having a replica database located on-site (perhaps at Prevessin) would provide a first solution. This could be re-evaluated once distributed database services were in production.

  • Staging support needed for FTS - this has been deprioritised at the FTS workshop.

  • SRM v2 support needed for FTS. Need deployed SRM v2 instances to do this plus some time.

  • Difficulty of debugging intermittent problems (degraded service) when using SRM copy: the T1 has all the relevant logs.

  • Need service functional tests.

November 30th 2005

Activities completed since last week

LFC

  • Set up a test LFC server using the Physics Database Validation Service, so that LHCb can test all the new features/improvement (this LFC runs version 1.4.1).

FTS

  • Star-star (anywhere to anywhere) channel added to pilot for Atlas to play with

DPM

  • DPM component added for tracking in WLCG meeting

Castor

  • SRM v2.1 release foreseen by end of this week: support for pinning and directory functions except 'ls'. RAL people will visit CERN to upgrade the test system at srm://lxb2079.cern.ch:9000/ .
  • Working on a test-suite for new CASTOR name server with ACL support.
  • Fix double delete bug in castor2 client. This was reported by an atlas user loading the castor libraries dynamically (see http://savannah.cern.ch/bugs/?func=detailitem&item_id=11800)

Work in progress

  • FTS fix for memory leak to be deployed this week
  • Restarted tests on LFC 1.4.1 (with virtual ids/VOMS)

Issues

  • For SC4, LHCb have asked for a replicated read-only copy of their global LFC catalog. This is for purposes of scalability of queries and also for some level of reliability against LFC server failure. They would like it at a Tier-1 site (perhaps Lyon). We currently believe we can't run a service distributed across two sites, and suggested that perhaps having a replica database located at Prevessin would provide a first solution. This could be re-evaluated once distributed database services were in production.

  • FTS workshop (http://agenda.cern.ch/fullAgenda.php?ida=a056842) has identified a new software component as top priority for FTS developers - data planner/scheduler, to give one place to submit transfer jobs to, and handle any multiple hops required. Initial pilot scheduled for end-January (not a chance of it being production quality by then) but it will need hardware to run on. We also need to consider the wider implications of who manages the intermediate caches and whether something needs to be deployed at T1 sites to do this.

  • Staging support needed for FTS - this has been deprioritised at the FTS workshop.

  • SRM v2 support needed for FTS - this isn't even on the priority list from the workshop. There is an issue here that SRM copy support may not be provided in the first version of dCache SRM v2 [minutes of BSWG]. This does not sit well with the requirement that we need SRM copy to obtain high throughput when using dCache.

  • Difficulty of debugging intermittent problems (degraded service) when using SRM copy: the T1 has all the relevant logs.

  • Need service functional tests.

November 16th 2005

Activities completed since last week

LFC

FTS

  • Lyon moved to use SRM-copy

Castor

  • Logging data base (DLF) moved to new more powerful hardware.
  • SRM v2.1 test instance set up and announced to the baseline services WG. The endpoint is: srm://lxb2079.cern.ch:9000/
  • New CASTOR2 client has been deployed on all lxplus and lxbatch. The new client comes with improved stager_qry functionality requested by CMS and other users.

Work in progress

  • LFC: ongoing cleanup of the split LFCs (= removal of the useless directories in lfc-lhcb, lfc-local and lfc-shared); testing of the virtual uids/gids/VOMS LFC.
  • Still awaiting memory-leak patch for FTS
  • General HA definition for services.

Issues

  • Staging support needed for FTS.

  • SRM v2 support needed for FTS.

  • Difficulty of debugging intermittent problems (degraded service) when using SRM copy: the T1 has all the relevant logs.

  • Need service functional tests.

November 1st 2005

Activities completed since last week

LFC

DB moved to RAC. Single production LFC split into distinct global VO LFCs (for the VOs that requested them) and local VO LFCs (for the VOs that requested them).

FTS

Upgraded to new version on production and pilot. DB moved to RAC.

Castor

Upgraded to new version.

Work in progress

  • General HA definition for services.

Issues

  • Staging support needed for FTS.

  • SRM v2 support needed for FTS.

  • Difficulty of debugging intermittent problems (degraded service) when using SRM copy: the T1 has all the relevant logs.

  • Need service functional tests.

October 18th 2005

Activities completed since last week

LFC

New version released and deployed on production (1.3.8): configurable DB name, 'df' support, fixes.

FTS

New maintenance version released on 1.3 branch (gLite 1.3QF24): bug fixes, user-selectable MyProxy.

New version released on 1.4 branch (gLite 1.4.1): SRM copy, better BDII integration, better monitoring.

Castor

Downtime due to unoptimised queries on DB and high stager load.

New version of Castor stager deployed: SQL queries optimised.

New version of Castor SRM deployed: less polling to reduce load on stager DB.

Work in progress

  • Reboot of LFC and FTS nodes scheduled for Thursday morning (kernel upgrade).

  • To deploy FTS 1.3QF24 on production nodes, and at T1 sites.

  • To deploy FTS 1.4.1 on pilot nodes (upgrading from current 1.3 configuration).

  • Move of LFC and FTS databases to RAC. Schedule with next Oracle patch.

  • Further tuning of DB queries on Castor stager

  • General HA definition for services.

Issues

  • Staging support needed for FTS.

-- GavinMcCance - 07 Mar 2006

Topic revision: r8 - 2008-07-09 - GavinMcCance