WLCG Tier1 Service Coordination Minutes - 11/11/2010
Attendance
Local: Ricardo, Oliver, Dirk, Kors, Gavin, Alexei, Flavia, Roberto, Massimo, Andrea V, Maarten, Miguel, Manuel, Alessandro, Nicolo, Simone, Jamie, Maria A, Maria D, Maria G, Huang.
Connected: Michael, Carlos, Jon B, Jon NDGF, John DS, John K, Felix, Elena, Xavier, Gonzalo, Andrew, Ron, Rolf, Carmine, Andrew S, Andreas, Alexander, Jeremy, Foued, Joel, Ian, Paul, Jhen-Wei, Dave.
Release Update
WLCG Baseline Versions
Data Management & Other Tier1 Service Issues
| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.9-8 (ATLAS); CASTOR 2.1.9-9 (ALICE, CMS and LHCb); SRM 2.9-4 (all); xrootd 2.1.9-7 | LHCb upgraded on 8-NOV-2010 | |
| ASGC | CASTOR 2.1.7-19 (stager, nameserver); CASTOR 2.1.8-14 (tapeserver); SRM 2.8-2 | 11/11 00:00-06:00 UTC: scheduled downtime for network maintenance | none |
| BNL | dCache 1.9.4-3 (PNFS) | | |
| CNAF | StoRM 1.5.4-5 (ATLAS, CMS, LHCb, ALICE) | | |
| FNAL | dCache 1.9.5-23 (PNFS); Scalla xrootd 2.9.1/1.4.2-4 | Upgraded dCache on Nov 8. xrootd read access opened to CMS VO. WARNING: the new (gsi)dcap client is stricter about correct use of slashes in URLs! | |
| IN2P3 | dCache 1.9.5-22 (Chimera) | The recent performance and stability problems with the storage seem to be due to problems with Solaris on the disk servers, i.e. not due to dCache or its configuration; the issues can be reproduced with "iperf". Paul: PIC have seen similar issues with Solaris, getting better performance using Linux instead. | |
| KIT | dCache 1.9.5-15 (admin nodes) (Chimera); dCache 1.9.5-5 to 1.9.5-15 (pool nodes) | | |
| NDGF | dCache 1.9.7 (head nodes) (Chimera); dCache 1.9.5, 1.9.6 (pool nodes) | | |
| NL-T1 | dCache 1.9.5-19 (Chimera) (SARA); DPM 1.7.3 (NIKHEF) | | |
| PIC | dCache 1.9.5-21 (PNFS) | | |
| RAL | CASTOR 2.1.7-27 and 2.1.9-6 (stagers), 2.1.9-1 (tape servers); SRM 2.8-2 and SRM 2.8-6 | | CMS upgrade to 2.1.9-6 on 16-18/11/10 and ATLAS to 2.1.9-6 on 6-8/12/10. Upgrades to 2.1.9-10 early 2011 |
| TRIUMF | dCache 1.9.5-21 with Chimera namespace | none | none |
CASTOR news
CERN operations
LHCb has been upgraded. We are following up on an alarm ticket (not necessarily linked to the upgrade). A post-mortem (upgrade + ticket) is planned for the end of the week.
We are preparing for the DB upgrades on the stagers (after the HI run but before Christmas).
The DB upgrade related to the NS is definitely for January 2011. It affects all instances (hence all VOs) and will require some downtime.
Development
Release 2.1.9-10 has been produced, which mainly targets issues found at RAL following their upgrade to 2.1.9. Full release notes and upgrade instructions are available.
xrootd news
dCache news
- Installed an experimental RSS feed for the different dCache downloads. It is linked from the dCache download pages and the feeds are organized per target (e.g. clients, server versions, etc.)
- New Golden Release, 1.9.5-23 (see the release notes). Highlights:
- Space Manager Database access was optimized to make better use of indexes.
- Fixed read from pool that was disabled with the -rdonly flag of the pool disable command.
- Fixed xrootd mover TCP port allocation to avoid reusing a port until previous transfers on the port have finished.
- Fixed implementation of the "rep ls -l=c" command.
- Recommended feature release: 1.9.10-2 (release notes). Highlights:
- Mostly speed-ups of the SRM, xrootd and NFS 4.1 protocols.
StoRM news
- GGUS:64107 was opened about a possible incompatibility of gLite 3.2 VOMS proxies with StoRM, but the problem turned out to be due to an incorrect configuration of the StoRM server.
FTS news
DPM news
- DPM 1.8.0-1 has been certified
- fixing memory leaks and thread safety issues in VOMS library (used e.g. in SRM v2.2 daemon)
- allowing admin to ban users (DN or VOMS FQAN)
- dpm-xrootd 2.2.0-1 has been certified
- Working on DPM 1.8.1:
- faster dpm-drain and replication
- refactoring of Name Server code for better performance of SRM, xroot and NFS 4.1
LFC news
- LFC 1.8.0-1 has been certified
- fixing memory leaks and thread safety issues in VOMS library (used in LFC daemon)
- allowing admin to ban users (DN or VOMS FQAN)
LFC deployment
| Site | Version | OS, n-bit | Backend | Upgrade plans |
| ASGC | 1.7.2-4 | SLC4 64-bit | Oracle | Testing on a database dump, upgrade will be scheduled after tests, no date for now |
| BNL | 1.7.2-4 | SL4 | Oracle | 1.7.4 on SL5 in November |
| CERN | 1.7.3 | SLC4 64-bit | Oracle | Will upgrade to SLC5 64-bit by the end of the year |
| CNAF | 1.7.2-4 | SLC4 32-bit | Oracle | 1.7.4 on SL5 64-bit in November |
| FNAL | N/A | | | Not deployed at Fermilab |
| IN2P3 | 1.7.4-7 | SL5 64-bit | Oracle | |
| KIT | 1.7.4 | SL5 64-bit | Oracle | |
| NDGF | | | | |
| NL-T1 | | | | |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle | |
| RAL | 1.7.4-7 | SL5 64-bit | Oracle | |
| TRIUMF | 1.7.2-5 | SL5 64-bit | MySQL | |
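As a rough aid for reading the LFC deployment table, the sketch below lists the sites whose reported version is older than a chosen target. The target of 1.7.4 is an assumption taken from the upgrade plans quoted by BNL and CNAF, not an agreed baseline, and the helper names parse_version and sites_below are hypothetical. Sites with no reported version (FNAL, NDGF, NL-T1) are left out.

```python
import re

# Versions as reported in the LFC deployment table above.
LFC_VERSIONS = {
    "ASGC": "1.7.2-4",
    "BNL": "1.7.2-4",
    "CERN": "1.7.3",
    "CNAF": "1.7.2-4",
    "IN2P3": "1.7.4-7",
    "KIT": "1.7.4",
    "PIC": "1.7.4-7",
    "RAL": "1.7.4-7",
    "TRIUMF": "1.7.2-5",
}


def parse_version(text):
    """Turn a string such as '1.7.2-4' into a tuple of integers (1, 7, 2, 4)."""
    return tuple(int(number) for number in re.findall(r"\d+", text))


def sites_below(versions, target):
    """Return the sorted site names whose version compares lower than target."""
    wanted = parse_version(target)
    return sorted(
        site for site, version in versions.items()
        if parse_version(version) < wanted
    )


if __name__ == "__main__":
    # Assumed target: 1.7.4 (hypothetical threshold, see the note above).
    print(sites_below(LFC_VERSIONS, "1.7.4"))
```

Tuples compare element by element, so a version with a release suffix such as 1.7.4-7 is treated as newer than plain 1.7.4.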
Experiment issues
GGUS Issues
Maria D. will update the CMS ticket on IN2P3 to MIT transfers to clarify that the investigation is expected from IN2P3. She will also re-assign the Finnish T2 ticket from Gstat to NDGF to prompt them to change configuration parameters so that the Finnish T2 is attributed its real resources in the WLCG monitoring.
KIT was reminded to finish configuring CE and batch nodes with the new CMS VOMS Roles on hiproduction.
Outstanding SIRs
Three reports were discussed (see agenda):
- The full report from RAL on the storage degradation for LHCb.
- An interim report from IN2P3 about shared area problems (see attachments below).
- The need for a report addressing the problems seen by LHCb "since the recent CASTOR upgrade" (although it has since been confirmed that the specific problem reported pre-dated this upgrade).
Conditions data access and related services
COOL, CORAL and POOL
Frontier/Squid
- Squid service deployment discussion
- Kors Bos (ATLAS) announced that ATLAS had taken the decision to not share Squids with other experiments. Discussion closed.
Database services
- Topics of general discussion
- Distributed Database Operations Workshop - please register if attending the social dinner:
http://indico.cern.ch/conferenceDisplay.py?confId=111194
- Experiment reports:
- ALICE:
- ATLAS:
- Conditions replication to SARA fixed.
- ATLAS PanDA applications suffered from transaction locking issues on Wednesday (3rd Nov) afternoon. DBAs had to intervene to kill user sessions to unblock the application. We are following up with ATLAS on the issue.
- CMS:
- On Tuesday morning (9th Nov) CMS PVSS replication from the online to the offline database was unexpectedly disabled for 2 hours due to the failure of one of the weekly automatic maintenance procedures (shrinking of the LogMiner table). The procedure has been modified to inform DBAs via email whenever its execution is unsuccessful.
- On Tuesday (9th Nov) CMS PVSS replication was affected once again for 30 minutes because of a user error (using a table without a primary key), which caused the apply process to abort.
- LHCb:
- Conditions replication to SARA fixed.
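The CMS report above mentions that the maintenance procedure now informs DBAs by email when it fails. The minutes do not describe how this was implemented; the following is only a minimal sketch of the general pattern, assuming the procedure can be launched as an external command. The function name run_and_report and the example.org addresses are hypothetical.

```python
import subprocess
import sys
from email.message import EmailMessage


def run_and_report(command, recipients, sender="dba-cron@example.org"):
    """Run a maintenance command.

    Returns None if the command exits with status 0, otherwise an
    EmailMessage describing the failure (not yet sent).
    """
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return None

    message = EmailMessage()
    message["Subject"] = f"Maintenance job failed with exit code {result.returncode}"
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    # Prefer the error stream; fall back to standard output if it is empty.
    message.set_content(result.stderr or result.stdout or "no output captured")
    return message


if __name__ == "__main__":
    # Stand-in for the real procedure: a child process that fails on purpose.
    failing_job = [sys.executable, "-c", "import sys; sys.exit(1)"]
    report = run_and_report(failing_job, ["dba@example.org"])
    if report is not None:
        print(report["Subject"])
```

The message is built but not sent, which keeps the sketch free of side effects; in a real setup it could be handed to smtplib.SMTP(...).send_message(report) with the site's own mail relay.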
| Site | Status, recent changes, incidents, ... | Planned interventions |
| ASGC | Nothing to report | Install and set up TSM; Data Guard studies, testbed creation and implementation plans. |
| BNL | Nothing to report | Deployment of PSU OCT 2010 in TAGS test cluster. Performance tests of data replication using Transportable Tablespaces between TRIUMF and BNL for TAGS database. |
| CNAF | Nothing to report | None |
| KIT | Nothing to report | None |
| IN2P3 | Nothing to report | None |
| NDGF | Nothing to report | None |
| PIC | Nothing to report | None |
| RAL | Nothing to report | Planning to apply October PSU on 3D DBs at the next CERN technical stop. |
| SARA | Nothing to report | November 16th, 7:00 UTC until 17:00 UTC: intervention due to maintenance on network infrastructure. |
| TRIUMF | Nothing to report | None |
AOB
Action List
| Action number | Description | Announced | Due | Last Update | Status |
| 20101028_01 | RAL out of production due to ATLAS upgrade. Being an announcement and not an action, it will not be further followed up. | 20101028 | 20101206-08 | 20101119 | Open |
| 20101028_02 | Configure new ASGC T2 channels. Done at: ASGC, CERN, IN2P3. KIT: tomorrow | 20101028 | 20101104 | 20101111 | Open |
| 20101028_03 | CMS to decide on redirector fix of GGUS:62696 | 20101028 | a.s.a.p. | 20101028 | Open |
| 20101028_04 | IN2P3 (P.Girard)-CERN (H.Renshall) WG to address AFS issues of LHCb shared area GGUS:59880, GGUS:62800. The WG is active (thanks to Harry) and an intermediate report is attached. | 20101028 | a.s.a.p. | 20101111 | Closed |
| 20101028_05 | Invite Dave Dykstra to discuss FroNTier/squid sharing by ATLAS and CMS sites | 20101028 | 20101111 | 20101028 | Open |
--
AndreaSciaba - 10-Nov-2010