

LCGSCM Data Management Status

Oct 21, 2009

Castor

Currently running 2.1.8-12 on all VO instances
DB hardware moves ongoing.
Nameserver currently downgraded to 2.1.8. Waiting for a fixed version of 2.1.9, planning to deploy this before startup.
Startup: 2.1.9-2 will be deployed on public and c2cernt3. Possible upgrade of LHC experiment instances to be discussed per experiment.

xroot for Atlas Grid jobs (gssklog converting x509 cert to kerberos) - to be discussed.
Issue with checksums reported by CMS (CDR) - to discuss at Castor meeting.

SRM

All SRM instances at 2.8 (LHCb is still at 2.8.0 rather than 2.8.1; the upgrade is transparent).

FTS

Should aim for the fixed FTS 2.2 (using the old FTS/SRM interaction). New version in testing on PPS. ATLAS will already move. To discuss with other VOs.
Leave FTS 2.1 service around for a while.

LFC

Running version 1.7.2-4.
Hardware and DB moves done.
Plan to move support for CMS to prod-lfc-shared-central, and to decommission prod-lfc-cms-central. Important: no data import/export, so all CMS data will be "lost". No date fixed yet.

Oct 7, 2009

Castor

Currently running 2.1.8-12 on all VO instances
Plans: DB hardware move this month on all instances - shutdown 1 hour - to be scheduled.
Startup: 2.1.9-2 will be deployed on public and c2cernt3. Possible upgrade of LHC experiment instances to be discussed per experiment.

SRM

CMS, ATLAS and ALICE currently running SRM 2.7-19. LHCb running 2.8.0 (better support for TURLs, checksums and better operational logging).
- 2.8.1 to be deployed on PPS, LHCb, ALICE asap. CMS and ATLAS to schedule (still testing on PPS). (2.8.1 fixes some checksum issues). Aim to have everyone at >=2.8 for startup.
- 2.9.0 is in the pipeline (for deployment with Castor 2.1.9, adding syslog logging).

FTS

FTS 2.2 being stress-tested on large PPS service with ATLAS.
- First version to provide checksum support.
- It changes the way it interacts with the SRMs for the first time in 4 years, and therefore requires a significant scale test.
- To schedule tests with other VOs.
- Status of T1 testing? (some sites had volunteered).
- Startup: dependent on test results. Plan B: backport checksums and use FTS 2.1.
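
As an illustration of the checksum support mentioned above, here is a minimal sketch of an end-to-end check: compute a checksum over the source and destination copies and compare the two. The choice of Adler-32 and the helper names `file_adler32` and `copies_match` are assumptions made for this example; they are not taken from the FTS code.

```python
import zlib


def file_adler32(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so large files are read in pieces.
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def copies_match(source_path, destination_path):
    """True when both local copies have the same checksum."""
    return file_adler32(source_path) == file_adler32(destination_path)
```

In a real transfer the two checksums would come from the source and destination storage systems rather than from local files; the comparison step is the same.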

LFC

Running version 1.7.2-4.
Plan to move support for CMS to prod-lfc-shared-central, and to decommission prod-lfc-cms-central. Important: no data import/export, so all CMS data will be "lost". No date fixed yet.
All nodes are going out of warranty and will be replaced by new ones (except CMS). No date fixed yet.

Apr 29, 2009

Castor

CASTORCERNT3 running 2.1.8-7. To be patched on Wednesday 27th to 2.1.8-8 (latest version)
CASTORALICE running 2.1.8-7. To be patched on Wednesday 27th to 2.1.8-8 (latest version)
CASTORATLAS running 2.1.8-7. To be patched on Wednesday 27th to 2.1.8-8 (latest version)
CASTORCMS running 2.1.8-7. To be patched on Tuesday 2nd to 2.1.8-8 (latest version)
CASTORLHCB to be upgraded on Wednesday 27th to 2.1.8-7. To be patched on Tuesday 2nd to 2.1.8-8 (latest version)

CASTOR-xroot redirectors running 1.0.6-9 (latest version).

Castor central services VMGR, CUPV, VDQM running 2.1.8-3. To be patched on Thursday 28th to 2.1.8-8 (latest version).

SRM

All production instances still running version 2.7-15 which has some memory leak and stability problems.
- 2.7-18 (the fixed version of 2.7-17) will be released today and put on srm-pps to be tested. Time is short for upgrade before STEP'09.
- 2.8 will be put on the stresstest SRM PPS instance once it is released.

FTS

No changes before STEP

LFC

No changes before STEP
Waiting for LFC version 1.7.2-4 (with methods for ATLAS) to upgrade after STEP

Apr 29, 2009

Castor

Deployment of 2.1.8-7 on Tier-0 production instances will start next week with CASTORALICE. The 2.1.7 to 2.1.8 upgrade was successfully exercised on the CASTOR PPS setup last week.
Database switch intervention scheduled for Wednesday morning: the CERN-PROD site will be flagged 'At risk'.

SRM

A minor release of SRM v2.2 2.7, fixing a socket leak, will be deployed in the coming weeks.

Jan 21, 2009

Castor

production services running stably
2.1.8 testing still ongoing on (internal) repack setup

FTS

Proxy delegation bug hit ATLAS last Saturday; a software fix is promised in the next few weeks.

SRM

production services running stably
maintenance release 2.7-14 deployed this week

LFC

Reboots of all LFC nodes to get the latest kernel on Friday afternoon:
- Transparent on all nodes.
- Except on the standalone read-only LHCb LFC (prod-lfc-lhcb-ro.cern.ch):
  - To avoid having to schedule a reboot with LHCb, we configured another machine to serve as read-only LFC.
  - The configuration wasn't complete: the LHCb VOBOXES were not trusted for 7 minutes (between 18:27 and 18:34).
  - LHCb assured that it did NOT affect them, as their code was robust against this.

Dec 10, 2008

Castor

Stager database upgrades to 10.2.0.4 finishing today
Testing of 2.1.8 on (internal) repack instance.

SRM

production deployment of v 2.7 finishing today
- biggest problem: BoL behaviour. Degraded srm-atlas on Friday and Monday.
- expecting a maintenance release addressing most issues discovered since upgrades started Dec 1
plan to stop SLC3 nodes next week.

Nov 19, 2008

Castor / SRM

Disklevel checksumming is now finally working and put in action on both CASTORATLAS and CASTORCMS and soon also CASTORALICE, CASTORLHCB and CASTORPUBLIC.
2.1.8 is being production tested on the internal c2repack instance. Several bugs have been found and must be fixed before any deployment on the Tier-0 production instances can start.

FTS

CERN FTS SLC4 upgraded to latest released patch on Monday. Migration to SLC4 services good.
- Target for end-of-life of SLC3 FTS services is end of this year. Discuss.
Status of new release for T1s - today?
Deployment and test of FTS monitoring tool progressing on FTS PPS.

LFC

Nothing to report.

Nov 5, 2008

Castor / SRM

Production instances:
- Deploying bug-fix release 2.1.7-21
  - Allows for checksumming of RFIO v3 transfers
  - Atlas yesterday, others to be agreed (hopefully next week)
- Deploying Castor 2.1.8
  - Planning is starting...
- Deploying Castor SRM 2.7
  - PPS endpoints are being handed over
  - Atlas tests successful, and will be intensified
  - Production deployments: over the next weeks (barring problems)
- Upgrade databases to Oracle 10.2.0.4
  - Upgraded Atlas and Alice SRM databases
  - others to be done together with upgrade to Castor 2.1.8

DPM / dCache

No report

FTS

No report

LFC

Smooth running.

Oct 15, 2008

Castor / SRM

Production instances: bug-fix release 2.1.7-19-2 has been deployed (ATLAS,CMS already, LHCB/ALICE/Public today)
Still testing SRM 2.7 on PPS, setting up pre-prod endpoints for the experiments
All SRM v1 instances have now been stopped.
Data corruption on one filesystem of one CMS diskserver https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20081008
- 31 corrupted files saved
- Improved monitoring being deployed
- Castor RFIO checksumming found not to work in the case of file updates, so we cannot currently configure checksumming

DPM / dCache

[from now on]

FTS

Testing continues at CERN on FTS SLC4 services (20 TB CMS, 20 TB Atlas, 0.5 TB LHCb)
Two issues holding up FTS SLC4 release:
- log4cpp library bug (segfault for >1k messages) - library rereleased, 1 FTS RPM needs rerelease to fix dependency. Fix deployed at CERN.
- Error message problem (FTS' error categorisation omitted from front of error string) - this is a pain for operations since it loses the information about where the error actually is. https://savannah.cern.ch/bugs/?32942 (January 2008). Fix now deployed at CERN. Postmortem todo.

LFC

Intervention on LHCb database content at CERN and Tier-1s was successful. 24h delay at RAL (database update did not finish whereas it took 15 minutes maximum at other sites, and the database content had to be imported from CERN again). RAL says they've tuned the database parameters since then, so that the intervention would work next time.

Oct 1, 2008

Castor / SRM

Production instances: maintenance release 2.1.7-19 to be deployed (transparently).
"T3 instance": depends on the first official 2.1.8 release.
SRM: version 2.7 (first official SLC4 release) is out, will test it over the next weeks. Deployment plan to be developed with experiments.
SRM v1: to be stopped once LHCB LFC entries have been updated.

FTS

FTS 2.1 (gLite 3.1 SLC4) installed at CERN for T0-export and T2-service (in parallel to existing service). Experiments are asked to test the new service with a view to switching production to it soon.
Problem found in FTS 2.1 release (patch 2048): hold for now. We'll patch the CERN FTS services here.

LFC

LHCb SRM update: intervention date not fixed yet. Probably next week
LFC on SLC5 before LHC restart under discussion.

Sep 17, 2008

Castor / SRM

Production Castor services running v 2.1.7-17, SRM service at v 1.3-28
SRM frontend node moves done, planning the move of the backend servers. This will involve a small service interruption (~15 minutes).

FTS

FTS 2.1 (SLC4) installed for T0-EXPORT and T2-SERVICE, awaiting tests.

LFC

Staying with 1.6.8 for now (problem reported in newer version, under investigation).

Sep 3, 2008

Castor / SRM

22 August - network switch issue in computer centre caused various problems accessing Castor services over that weekend (including access to the first beam data from LHCb).
6 hour downtime on August 28 on SRM-CMS due to DB tablespace filling up
- DB monitoring problem to follow up
- SRM1.3 software should 'clean' its request tables (not cleaning up also exposes us to the risk of SRM core queries degrading over time).
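
The request-table cleanup described above can be sketched generically as a periodic purge of finished requests older than a retention window. This is only an illustration using an in-memory SQLite database: the table name `requests`, its columns, the `DONE` status value and the 7-day retention are all hypothetical and do not reflect the actual SRM 1.3 schema.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with a hypothetical request table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE requests ("
        "id INTEGER PRIMARY KEY, status TEXT, finished_at REAL)"
    )
    return conn


def purge_old_requests(conn, now, retention_seconds=7 * 24 * 3600):
    """Delete finished requests older than the retention window.

    Returns the number of rows removed.
    """
    cutoff = now - retention_seconds
    cursor = conn.execute(
        "DELETE FROM requests WHERE status = 'DONE' AND finished_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cursor.rowcount
```

Keeping the table bounded in this way addresses both concerns listed above: the tablespace stops growing without limit, and queries over the table do not slow down as history accumulates.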
Short interruption today on SRM service to upgrade firmware on NAS infrastructure behind the SRM DBs.
SRM node moves taking place this week - recovering space in the computer centre.

FTS

CERN FTS 2.1 installation proceeding (will run in parallel to existing SLC3 based production services).
Some node moves to ensure reasonable warranty coverage for existing SLC3 FTS services.

LFC

No issues.

Aug 20, 2008

Castor / SRM

All Castor instances are now upgraded to 2.1.7-14.

FTS

Planning FTS 2.1 rollout. Still in discussion.

LFC

On request from Nick, testing the 1.6.11-3 LFC 64 bits against Oracle.
Waiting for a confirmation from Akos/David whether the difference of version for glite-security-voms RPMs is intentional.

Aug 6, 2008

Castor / SRM

All SRM endpoints were upgraded to 1.3-28 last week
CASTORATLAS upgraded to 2.1.7-14 last Thursday
CASTORCMS upgrade scheduled for next Tuesday, August 12th, 09h00-11h30 CERN time

Jul 7, 2008

Castor / SRM

We will propose to upgrade Castor ATLAS next week to 2.1.7-10 (allows different GC policy, various fixes)

LFC

All LFCs are now running with 60 threads, not only the LHCb ones.
Support for vo.sixt.cern.ch added to prod-lfc-shared-central.

Transfer service

Testing ongoing for SL4 FTS version. Couple of issues reported from CMS and Atlas - being understood (1 is Castor, the other looks like a config mistake on one of the channels). Software certified and now in Preproduction. New machines at CERN being prepared for installation. Want to wait a bit more to see the results of the pilot before scheduling upgrade. Again, we'd like to run for a couple of weeks at CERN before pushing to T1 sites (though time is short).

Following issue of 'missing' FTM data - we'll contact a likely site and see if we can debug there.

Jun 24, 2008

Castor / SRM

SRM running stably
- SRM endpoints now split over dedicated hardware (RACs, frontend, backend servers)
- bug-fix release 1.3-27 deployed for CMS today, other VOs tomorrow
Castor
- no issues

LFC

nothing much

Transfer service

nothing much

June 3, 2008

Castor / SRM

Q1: we want to configure the use of the Castor "internal gridftp" implementation. This requires an FTS upgrade first. As the tURL format changes, we will contact the VO's to coordinate this change.

Transfer

nothing much
extracting data from the service for the CCRC post-mortem

21 May 2008

Transfer

Various degradations to T0-export as a result of Castor SRM issues.
Found out that the version of FTS deployed for CCRC'08 to permit the preferred Castorint gridFTP setup only fixes half of the problem (mis-tag).

CASTOR

Various service degradations to the SRM service, summarised at https://prod-grid-logger.cern.ch/elog/Data+Operations/2
- LHCb tablespace problem on SRM database (ran out of space)
- Issue of long stager operations - putDones taking a long time (20 mins). Understood: solved in latest Castor version. Upgrading ATLAS today (21 May), CMS tomorrow (22 May).
- Issue of SRM server getting stuck ("CGSI: no header returned", i.e. the server accepted the TCP connection but didn't respond), causing serious degradation to ATLAS: not understood. Putting an actuator in place to restart the daemon in case of no response. This in itself is a really bad workaround, since the SRM doesn't deal gracefully with its threads on restart, but it's better than leaving it stuck.
- Misc. application deadlocks in the SRM database - some resolved by new SRM version. Upgrading ATLAS and CMS tomorrow (22 May).
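
The "accepted the TCP connection but didn't respond" symptom above can be detected with a simple probe. The sketch below is an assumption-laden illustration, not the actuator that was deployed: the function name `probe_server`, the one-byte dummy request and the timeout are hypothetical, and a real probe would have to speak the CGSI/GSI handshake, which is not attempted here.

```python
import socket


def probe_server(host, port, request=b"\n", timeout=5.0):
    """Classify a server as 'ok', 'stuck' or 'down'.

    'stuck' means the TCP connection was accepted but no reply
    arrived within the timeout (or the peer closed without replying).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            conn.sendall(request)
            try:
                reply = conn.recv(1)
            except socket.timeout:
                return "stuck"
            return "ok" if reply else "stuck"
    except OSError:
        # Connection refused, unreachable host, reset, and similar errors.
        return "down"
```

An actuator could call such a probe periodically and restart the daemon only when it returns `"stuck"`, ideally after several consecutive failures to avoid restarting on a transient slowdown.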

LFC

No issue.

30 April 2008

Transfer

FTS upgrade to patch 1740 (supporting Castorint gridFTP configuration) on all CERN-PROD services.
DESY channels moved to FTS-T2-SERVICE at CMS' request.

CASTOR

Deployed versions:

VO	Castor	SRM2	SRM1	Comments
Atlas	2.1.7	srm-atlas, v 1.3-21	-
CMS	2.1.7	srm-cms, v 1.3-21	srm.cern.ch
Alice	2.1.6	srm-alice, v 1.3-21	-
LHCb	2.1.6	srm-lhcb, v 1.3-21	srm.cern.ch, srm-durable-lhcb.cern.ch
Ops, Dteam	2.1.7	srm-{ops,dteam}	srm.cern.ch

Disk caches:
- required diskspace for CCRC provided for Atlas, CMS, Alice, will add 50 TB to LHCb today

16 April 2008

CASTOR

Preparing for CCRC: Software upgrades, space token creations, access control lists update, ...
Castor version 2.1.7 was deployed on Castoratlas on April 8, and is stabilising. We are planning an upgrade of Castorpublic next week. Other instances will start CCRC on the latest release of 2.1.6.
Decommissioning SRM v1 for Atlas and Castorgrid Classic SE for all LHC VO's

LFC

All LFC servers, except the LHCb ones: 3 hours downtime on Thursday afternoon (14:00 to 17:00) to rename SURLs.
Unplanned 2 hours downtime of the LHCb LFCs today from 15:00 to 17:00. The LHCb DB was marked down for scheduled intervention, but the associated service wasn't (it wasn't clear that this database intervention was going to affect the service). Communication problems?

Transfer

Plan to reconfigure shares on T0 export service as outlined in previous mail and as discussed at CCRC workshop. ASAP.
Discuss status of https://savannah.cern.ch/patch/?func=detailitem&item_id=1740 and upgrade slots.

27 February 2008

CASTOR

Current deployed versions:
- Castoratlas, Castorcms, Castorpublic (as of today): v 2.1.6-10
- Castoralice, Castorlhcb: v 2.1.4-10 (upgrades early in March?)
- SRM2 endpoints: v 1.3-14
CCRC report
- Stable running, apart from Feb 14 -- 16

13 February 2008

Transfer Service

A fair number of issues have been discovered in FTS
- Issue with proxy delegation: sometimes the delegated proxy is bad [the private key is corrupt]. (Not understood, workaround in progress). [CMS]
- Issue with proxy delegation: on the FTS the proxy is indexed by the hash of its DN (it should be by the hash of DN AND VOMS roles). This means you can end up not getting a VOMS role when you expect to, which affects the mapping on some SRMs (reported at NIKHEF). [ATLAS]
- Failed to process transfers in 5 seconds (not understood, busy servers?). This reduces the efficiency of the jobs since they fail and have to be retried. [ATLAS]
- CERN-ASGC agent stuck for a day on CERN-PROD FTS-T0-EXPORT: monitoring and lemon actuator didn't restart it (Not yet understood). [ATLAS]
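
The proxy indexing issue above can be illustrated with a small sketch. It assumes SHA-1 and a simple newline-joined serialisation purely for the example; the function names, the DN and the VOMS attribute strings are hypothetical and do not reproduce the actual FTS delegation code.

```python
import hashlib


def key_by_dn(dn):
    """Index a delegated proxy by the hash of its DN only (the reported behaviour)."""
    return hashlib.sha1(dn.encode("utf-8")).hexdigest()


def key_by_dn_and_roles(dn, voms_fqans):
    """Index by the hash of the DN together with the VOMS attributes.

    The attribute order is preserved, so the same attributes in a
    different order yield a different key.
    """
    material = "\n".join([dn, *voms_fqans])
    return hashlib.sha1(material.encode("utf-8")).hexdigest()
```

With the DN-only key, two delegations from the same user with different roles map to the same cache entry, so whichever proxy is stored under that key is reused regardless of the role requested. Including the VOMS attributes in the key keeps the two proxies apart.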

Other issues:
- Storm: 'someone' doing SRM.Ls is creating excessive load on CNAF Storm - not clear where Ls requests come from. We have confirmed FTS sends a compliant SRM.Ls(numLevels=0).
- SRM.mkdir issue on dCache understood: in the "directory already exists" case, dCache should return the correct code. Reported to dCache.

LFC

All quiet on the western front. No issues.

CASTOR

Current deployed versions:
- Castoratlas, Castorcms: v 2.1.6-10
- Castoralice, Castorlhcb, Castorpublic: v 2.1.4-10
- SRM2 endpoints: v 1.3-13
CCRC report
- Stable running, mainly
- Several bugs and configuration problems found and fixed in the first week

16 January 2008

Transfer Service

CCRC'08 preparations:
- Awaiting patch 1589
- Deploying Lyon's transfer service monitor on fts102.

CASTOR

Preparations for CCRC ongoing:
- Creation of SRM v2 space tokens well underway (only LHCB configuration to be finalized)
- Upgrade of CASTOR central services next Wednesday (downtime during morning)
- CASTOR s/w v 2.1.6 still under test.

Previous reports

Older reports are moved to:

Attachments

Topic attachments
Attachment	Size	Date	Who	Comment
CERN_site_report_-_Jan_24.ppt	222.0 K	2007-01-23 - 23:33	JanVanEldik	CERN site report for WLCG Collaboration Workshop
Physics_services_at_Christmas.ppt	398.0 K	2007-02-02 - 00:32	JanVanEldik	post-C5: Physics service at X-mas

Topic revision: r223 - 2009-10-21 - GavinMcCance
