WLCG Tier1 Service Coordination Minutes - 07/10/2010
Attendance
Gonzalo Merino (PIC), Elizabeth Chism (OSG), Luca dell'Agnello (CNAF), Patrick (dCache.org), Elena Planas (PIC), Felix Lee (ASGC), Jon Bakken (FNAL), Pierre Girard (IN2P3), Alessandro Cavalli (INFN), John DeStefano, Carlos (BNL), Jhen-Wei Huang (ASGC), Ron Trompert (NL-T1), Andrew (TRIUMF), Andreas (GridKA), Elizabeth Gallas
Lola, Andrea Valassi, Alessandro, Manuel, Jean-Philippe, Andrea Sciaba, Jamie, Maria, Roberto, Maria Alandes, Maria DZ, Massimo, Alexei, Luca
Release Update
WLCG Baseline Versions
Data Management & Other Tier1 Service Issues
Site | Status | Recent changes | Planned changes
CERN | CASTOR 2.1.9-8 (all), SRM 2.9-4 (all), xrootd 2.1.9-7 | End of August: CASTOR upgrade; 5-6/10: xrootd upgrade for ATLAS, CMS and LHCb | CASTOR 2.1.9-9 is being tested and will be gradually rolled out (in agreement with the experiments) in order to have a combined test prior to the HI data taking (hence before 1 November)
ASGC | CASTOR 2.1.7-19 (stager, nameserver), CASTOR 2.1.8-14 (tape server), SRM 2.8-2 | 24/11: unscheduled downtime for 5 hours due to a power outage | Setting up a 2.1.9 testbed for a future upgrade
BNL | dCache 1.9.4-3 (PNFS) | None | None
CNAF | StoRM 1.5.4-5 (ATLAS, CMS, LHCb, ALICE) | |
FNAL | dCache 1.9.5-10 (admin nodes) (PNFS), dCache 1.9.5-12 (pool nodes) | None | Investigating upgrade to 1.9.5-22 during the Nov 2 downtime
IN2P3 | dCache 1.9.5-22 (Chimera) | Upgrade to 1.9.5-22 (core servers + pool nodes); ALICE removed from the dCache configuration; tape system protection enabled | None
KIT | dCache 1.9.5-15 (admin nodes) (Chimera), dCache 1.9.5-5 to 1.9.5-15 (pool nodes) | |
NDGF | dCache 1.9.7 (head nodes) (Chimera), dCache 1.9.5 and 1.9.6 (pool nodes) | |
NL-T1 | dCache 1.9.5-21 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) | |
PIC | dCache 1.9.5-21 (PNFS) | None | None
RAL | CASTOR 2.1.7-27 and 2.1.9-7 (stagers), CASTOR 2.1.8-3 (nameserver central node), CASTOR 2.1.8-17 (nameserver local node on SRM machines), CASTOR 2.1.8-8, 2.1.8-14 and 2.1.9-1 (tape servers), SRM 2.8-2 | Upgraded LHCb to CASTOR 2.1.9 during September | Will upgrade the other VOs to 2.1.9 during Oct/Nov
TRIUMF | dCache 1.9.5-21 (Chimera) | |
CASTOR news
- CASTOR 2.1.9-9 is being tested (on preproduction and other internal instances), in particular to improve control over tape drive usage.
2.1.9-9 contains the changes needed to cope with the increased requirements of the heavy-ion run (especially for ALICE and CMS).
2.1.9-9 will be gradually rolled out (in agreement with the experiments) in order to have a combined test prior to the HI data taking.
- An improved version of the xrootd-CASTOR plug-in (2.1.9-7) is available. It was installed on the CERN instances on October 5 and 6 (ATLAS, CMS and LHCb). This version should cure some problems observed over the last couple of months.
- A campaign to install more capacity for the heavy-ion run for CASTOR at CERN is well underway. Due to the high peak rates expected from ALICE and CMS (adding up to more than 4 GB/s), their data export will take place mostly after the run.
xrootd news
dCache news
Nothing to report.
StoRM news
FTS news
FTS 2.2.5 was tagged this morning and certification will start tomorrow; its main new feature is support for sites without SRM.
Work has started on 2.2.6, which will add the possibility to resume failed transfers (this will work with all SEs except CASTOR when it uses its "internal" GridFTP daemon).
DPM news
Certification of 1.8 should be completed early next week; the migration to the new database schema is still to be certified.
This release introduces the possibility to ban users and VOs.
LFC news
Certification of 1.8 should be completed early next week; the migration to the new database schema is still to be certified.
This release introduces the possibility to ban users and VOs. Another change affects a bulk method, as requested by LHCb.
There is a memory leak in the VOMS libraries used in LFC 1.7.4. This is not a sufficient reason to downgrade, but sites should keep an eye on the memory used by the LFC daemon: the size increases in proportion to usage, and the largest value observed so far was about 1.3 GB. New VOMS libraries for SLC5 64-bit have just been released and will be used with 1.8.
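Sites that want to watch for this leak could use a minimal cron-able sketch along the following lines. The process name `lfcdaemon` and the 1.3 GB threshold (the largest value observed so far) are assumptions; adjust both to the local setup.

```shell
#!/bin/sh
# Sketch: warn if the LFC daemon's resident memory exceeds a threshold.
# Assumptions: process name "lfcdaemon"; threshold 1.3 GB (largest leak seen so far).
THRESH_KB=$((1300 * 1024))

check_rss() {
  # $1: resident set size in kB; prints WARN above the threshold, OK otherwise.
  if [ "$1" -gt "$THRESH_KB" ]; then
    echo "WARN"
  else
    echo "OK"
  fi
}

# Sum RSS (in kB) over all lfcdaemon processes; 0 if none is running.
rss_kb=$(ps -C lfcdaemon -o rss= 2>/dev/null | awk '{s+=$1} END {print s+0}')
check_rss "$rss_kb"
```

Run from cron and alert on any WARN output; restarting the daemon reclaims the leaked memory until the fixed VOMS libraries arrive with 1.8.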
LFC deployment
Site | Version | OS, n-bit | Backend | Upgrade plans
ASGC | 1.7.2-4 | SLC4 64-bit | Oracle |
BNL | 1.7.2-4 | SL4 | Oracle | 1.7.4 on SL5 in October
CERN | 1.7.3 | SLC4 64-bit | Oracle | Will upgrade to SLC5 64-bit by the end of the year
CNAF | 1.7.2-4 | SLC4 32-bit | Oracle | 1.7.4 on SL5 64-bit in October
FNAL | N/A | | | Not deployed at Fermilab
IN2P3 | 1.7.4-7 | SL5 64-bit | Oracle |
KIT | 1.7.4 | SL5 64-bit | Oracle |
NDGF | | | |
NL-T1 | 1.7.4-7 | CentOS5 64-bit | Oracle |
PIC | 1.7.4-7 | SL5 64-bit | Oracle |
RAL | 1.7.4-7 | SL5 64-bit | Oracle |
TRIUMF | 1.7.2-5 | SL5 64-bit | MySQL |
GGUS Issues
Ticket / Experiment | Creation date | Assigned to | Last update | Status | Comment
GGUS:61440 ATLAS CNAF-BNL transfers | 2010/08/23 | ROC_Italy | 2010/10/06 | In progress | Alternative network routes (now GARR) being tested (successfully!?). This ticket appears for the 3rd time in the T1SCM report!
GGUS:62696 CMS xrootd errors | 2010/10/02 | ROC_CERN | 2010/10/05 | In progress | Investigation now moved to the CASTOR Support Savannah tracker
GGUS:62401 CMS wrong settings for FNAL in the CERN FTS | 2010/09/24 | ROC_CERN | 2010/09/29 | In progress | Possibly forgotten
GGUS:61706 CMS Maradona error in the CERN CEs | 2010/08/31 | ROC_CERN | 2010/10/06 | On hold | An LSF patch is planned but no date has been decided yet. This ticket appears for the 2nd time in the T1SCM report!
Outstanding SIRs
From the last MB report, outstanding were:
- LHCb LFC replication (from Sep 14 – GGUS:62078) - now done, SIR
- New ASGC DB problem (from Sep 25) - not done
- NDGF-RAL networking issues (from Aug 19 – GGUS:61306) - see LHC OPN discussion
- BNL-CNAF networking issues (from Aug 23 – GGUS:61440) - ditto
LHC OPN meeting
These issues of involvement and ownership are being discussed at the ongoing LHC OPN meeting - agenda.
Conditions data access and related services
COOL, CORAL and POOL
- CMS has prepared a new release using the CORAL_2_3_12 tag. This is confirmed to successfully solve the crash previously observed with optimized gcc 4.3 builds.
- Work is ongoing to reproduce and fix the problems observed by several users (in all of ATLAS, CMS, LHCb) in the CORAL handling of network or database server glitches.
Frontier
Database services
- Topics of general discussion (and CERN report)
- We have started validation of 10.2.0.5 with the experiments and the main applications. A summary of the testing will be presented at the Distributed DB workshop in November. The final decision on the upgrade, and its schedule, will be published in December. The current general plan is to upgrade in January.
- An extended maintenance contract will be necessary for Oracle support of 10.2.0.x from August 2011. CERN plans to negotiate this with Oracle.
- An update on the non-rolling behaviour of the April and July PSUs:
- Tests have reproduced the issue on several occasions. Oracle Support is analyzing our logs.
- The distributed DB workshop has been scheduled for November 16th and 17th. The draft agenda is open for comments and feedback:
https://twiki.cern.ch/twiki/bin/view/PDBService/3DWorkshopPlanning
- Experiment reports:
- ALICE:
- ALICE plans to replace the current DB hardware, which is going out of warranty, during the extended technical stop. Discussions with IT-DB on which hardware to buy are ongoing.
- ATLAS:
- ATLAS has requested the split of the offline DB into conditions and ADC parts. Tests are in progress; this will be discussed in more detail at the ATLAS DB meeting on Monday.
- CMS:
- CMS_LUMI_PROD will be moved from CMS PVSS replication to CMS Conditions replication (scheduled for 7/10)
- LHCb:
Site | Status, recent changes, incidents, ... | Planned interventions
ASGC | 3D Streams: due to a problem during power generator maintenance, all ASGC services were shut down by an unexpected power cut. The 3D database and its backup are corrupted; the 3D database is still under recovery (thanks to Marcin and Eter). The transportable tablespace transfer shows very poor performance due to a network issue, which the ASGC network team is now investigating. The RAC testbeds were patched to the April 2010 level. | None
BNL | Patch 6196748 plus OS and ASM kernel bug fixes were deployed on the Conditions DB. | None
CNAF | X | X
IN2P3 | X | X
KIT | Nothing to report | None
NDGF | Nothing to report | None
PIC | Applied the latest security kernel updates on all systems | None
RAL | X | X
SARA | SARA is still running on the alternate server. The data corruption is no longer present after a re-format of some LUNs (surprisingly, not the ones in use at the time of the incident). We are currently running storage stress tests; so far 3 of the 80 disks have had to be replaced due to disk errors. It is unlikely that we will find the real cause of the incident, because many things have changed since then (re-formats and a storage firmware upgrade). | No interventions
TRIUMF | Applied Linux kernel security updates to our Oracle servers | None
AOB
- WLCG Service Coordination BOF at CHEP - agenda