

WLCG Tier1 Service Coordination Minutes - 1 September 2011

Attendance

Local:

Remote:

Action list review

Release update

Data Management & Other Tier1 Service Issues

Site	Status	Recent changes	Planned changes
CERN	CASTOR 2.1.11-2 (SL5); SRM 2.10-x (SL4); xrootd: 2.1.11-1 FTS SL4 3.2.1 i.e old EOS -0.1.0/xrootd-3.0.4	All instances are now on the latest CASTOR version. All of them use the Transfer Manager for internal scheduling (LSF replacement). The end-of-August technical stop was used to perform several DB maintenance interventions (security patches on all databases, including the Name Server; CMS stager DB moved to new, better-performing hardware; ATLAS stager DB defragmented).	Complete the migration to new hardware (all instances except CMS, and the Name Server). A resource move (disk space) from CASTOR to EOS is being planned (time scale: October 2011).
ASGC	CASTOR 2.1.11-2 SRM 2.11-0 DPM 1.8.0-1	29-30/8: downtime due to CASTOR upgrade	None
BNL	dCache 1.9.5-23 (PNFS, Postgres 9)	None	Transition to Chimera during next TS (Nov)
CNAF	StoRM 1.7.0
FNAL	dCache 1.9.5-23 (PNFS) httpd=1.9.5.-25 Scalla xrootd 2.9.1/1.4.2-4 Oracle Lustre 1.8.3
IN2P3	dCache 1.9.5-26 (Chimera) on core servers. Mix of 1.9.5-24 and 1.9.5-26 on pool nodes
KIT	dCache (admin nodes): 1.9.5-25 (ATLAS, Chimera), 1.9.5-26 (CMS, Chimera) 1.9.5-26 (LHCb, PNFS) dCache (pool nodes): 1.9.5-9 through 1.9.5-26
NDGF	dCache 1.9.12
NL-T1	dCache 1.9.5-23 (Chimera) (SARA), DPM 1.7.3 (NIKHEF)
PIC	dCache 1.9.12-8; PNFS, Postgres 9.0		13/9: upgrade to dCache 1.9.12-10
RAL	CASTOR 2.1.10-2 2.1.10-0 (tape servers) SRM 2.10-0		7/9: will apply DB patches to 3D ATLAS and LHCb, FTS and LFC. Services will be "at risk"
TRIUMF	dCache 1.9.5-21 with Chimera namespace	None	None

Other site news

CERN: the CERN-T2 FTS had its agents upgraded to SLC5 (IT SSB post).

CASTOR news

CERN operations and development

The Transfer Manager is now in production on all CERN instances. It shows all the expected advantages compared to the LSF-based scheduling (lightweight and efficient scheduling, less overall complexity). Residual issues are reasonably under control and are worked around satisfactorily.

Resource movement (disk space) from CASTOR to EOS is being planned (time scale: October 2011)

EOS news

xrootd news

dCache news

StoRM news

EMI-1_Update-6 contains an updated version of StoRM (see EMI1-Update6 link) and should be available tomorrow. After that, StoRM v1.7.1 can be installed by EA (EGI).

FTS news

FTS 2.2.5 in gLite Staged Rollout: http://glite.cern.ch/staged_rollout
FTS 2.2.6 in development: FTS 2.2.6 patch (see list of bugs at the end)

DPM news

LFC news

LFC deployment

Site	Version	OS, n-bit	Backend	Upgrade plans
ASGC	1.8.0-1	SLC5 64-bit	Oracle	None
BNL	1.8.0-1	SL5, 64-bit	Oracle	None
CERN	1.7.3	SLC4 64-bit	Oracle	Upgrade to SLC5 64-bit pending
CNAF	1.7.4-7 (ATLAS, to be decommissioned); 1.8.0-1 (LHCb, recently updated)	SL5 64-bit	Oracle
FNAL	N/A			Not deployed at Fermilab
IN2P3	1.8.0-1	SL5 64-bit	Oracle 11g	Oracle DB migrated to 11g on Feb. 8th
KIT	1.7.4-7	SL5 64-bit	Oracle	Oracle backend migration pending
NDGF	1.7.4.7-1	Ubuntu 9.10 64-bit	MySQL	None
NL-T1	1.7.4-7	CentOS5 64-bit	Oracle
PIC	1.7.4-7	SL5 64-bit	Oracle
RAL	1.7.4-7	SL5 64-bit	Oracle
TRIUMF	1.7.3-1	SL5 64-bit	MySQL

Experiment issues

WLCG Baseline Versions

Release report: deployment status wiki page
WLCG Baseline versions: table

GGUS issues

Two tickets concern VOMS bugs revealed after a recent, supposedly transparent, upgrade. These bugs were causing serious problems with the Oracle middleman process and/or with obtaining VOMS proxies. After two weeks of idle time at the DMSU, the tickets are now being closely followed up by the developers, starting today.

Review of recent / open SIRs and other open service issues

Conditions data access and related services

Database services

Generic reports:
- Many different small problems related to storage and Streams over the last 2 months. All are resolved by now.
- All 10.2.0.5 CERN DBs upgraded with Oracle July CPU including downstream capture boxes. No issues found except for the ATLAS downstream capture issue (see below) - not patch related.
- After an intervention (applying security patches) on the ATLAS downstream database (ATLDSC) on 31 August, replication of ATLAS conditions to the T1 sites aborted. The root cause was most probably misbehaviour of the Streams processes during the replication restart, which caused some transactions to be lost and consequently a Streams inconsistency. All Streams setups were manually resynchronized by reconfiguring the capture process and recreating the apply process. Replication of ATLAS conditions was fully restored by 16:30.

Site reports:

Site	Status, recent changes, incidents, ...	Planned interventions
BNL	-- Oracle database services at BNL were turned off and the related hardware powered off last Saturday as part of the preventive shutdown of the BNL Tier 1 due to critical weather conditions. ATLAS Conditions database: Streams propagation, the apply process and the database service were disabled on Saturday at ~11:47 AM EDT and re-enabled on Monday 29 at ~10:34 AM EDT. BNL LFC and FTS, US ATLAS Tier 3 LFC databases: disabled at ~11:15 AM EDT Saturday and re-enabled at 10:00 AM EDT Monday. Standby database for BNL LFC and FTS, US ATLAS Tier 3 LFC databases: disabled at ~11:30 EDT Saturday and re-enabled at 11:00 AM EDT Monday. VOMS and Priority Stager: disabled at ~11:15 AM EDT Saturday and re-enabled at 10:05 AM EDT Monday. So far no problems (hardware, database or replication) have been observed after re-enabling the database services. Activities: -- Migrated VOMS and Priority Stager to newer disk storage. -- In Service Requests 3-3535183751 and 3-3396390491, Oracle diagnosed a possible cause of the contention between the apply process and the gather_stats job reported on 04/13/11 and 06/15/11. Oracle claims that this problem is due to bug 6011045, which affects DBMS_STATS and causes a deadlock between 'cursor: pin S wait on X' and 'library cache lock'. Oracle also provided a patch (P6011045) and recommends applying it to fix this bug. The patch appears to be rolling.	-- Apply CPU patches, initially to the Conditions database, and the patch proposed by Oracle (P6011045). -- Relocate the LFC and FTS cluster to a new datacenter room; this intervention has been postponed until the next Technical Stop (November) and will be scheduled in conjunction with maintenance of other services (dCache) to minimize overall disruption of site services.
CNAF
KIT	Nothing to report	None
IN2P3	On 18 July 2011, hard disks failed, causing corruption at several levels (control file, undo, redo logs, archived logs and datafiles). The only way to restore the databases was to perform an incomplete restore of all of them. Streams replication was restored together with the CERN DBAs. The storage team is in contact with Pillar support to understand the root cause of the problem. Note that all LUNs presented to ASM are in RAID5, so hard disk failures should not have impacted Oracle.	CPU patching schedule: - DBLHCB on 6 Sept 2011 - DBATL on 7 Sept 2011 - DBAMI on 8 Sept 2011. We may foresee a downtime on 20 Sept for a microcode upgrade on the FC switches and disk array. This intervention requires shutting down all 3D databases for around 12 hours. We are waiting for confirmation from ATLAS France before fixing a date. To be scheduled.
NDGF	Nothing to report	None
PIC		- Complete stop of the ATLAS database. - Apply the July CPU on the remaining instances (maybe next week)
RAL
SARA	Nothing to report (not attending)	No interventions
TRIUMF	Nothing to report	No interventions

AOB

-- AndreaSciaba - 30-Aug-2011

Topic revision: r9 - 2011-09-01 - MariaDimou
