WLCG Tier1 Service Coordination Minutes - 07/10/2010
Attendance
Gonzalo Merino (PIC), Elizabeth Chism (OSG), Luca dell'Agnello (CNAF), Patrick (dCache.org), Elena Planas (PIC), Felix Lee (ASGC), Jon Bakken (FNAL), Pierre Girard (IN2P3), Alessandro Cavalli (INFN), John DeStefano, Carlos (BNL), Jhen-Wei Huang (ASGC), Ron Trompert (NL-T1), Andrew (TRIUMF), Andreas (GridKA), Elizabeth Gallas
Lola, Andrea Valassi, Alessandro, Manuel, Jean-Philippe, Andrea Sciaba, Jamie, Maria, Roberto, Maria Alandes, Maria DZ, Massimo, Alexei, Luca
Release Update
WLCG Baseline Versions
Data Management & Other Tier1 Service Issues
Site | Status | Recent changes | Planned changes
CERN | CASTOR 2.1.9-8 (all), SRM 2.9-4 (all), xrootd 2.1.9-7 | End of August: CASTOR upgrade; 5-6/10: xrootd upgrade for ATLAS, CMS and LHCb | CASTOR 2.1.9-9 is being tested and will be gradually rolled out (in agreement with the experiments) in order to have a combined test prior to the HI data taking (hence before 1 November)
ASGC | CASTOR 2.1.7-19 (stager, nameserver), CASTOR 2.1.8-14 (tape server), SRM 2.8-2 | 24/11: unscheduled downtime for 5 hours due to a power outage | Setting up a 2.1.9 testbed for a future upgrade
BNL | dCache 1.9.4-3 (PNFS) | None | None
CNAF | StoRM 1.5.4-5 (ATLAS, CMS, LHCb, ALICE) | |
FNAL | dCache 1.9.5-10 (admin nodes) (PNFS), dCache 1.9.5-12 (pool nodes) | None | Investigating upgrade to 1.9.5-22 during the Nov 2 downtime
IN2P3 | dCache 1.9.5-22 (Chimera) | Upgrade to 1.9.5-22 (core servers + pool nodes); ALICE removed from the dCache configuration; tape system protection enabled | None
KIT | dCache 1.9.5-15 (admin nodes) (Chimera), dCache 1.9.5-5 to 1.9.5-15 (pool nodes) | |
NDGF | dCache 1.9.7 (head nodes) (Chimera), dCache 1.9.5 and 1.9.6 (pool nodes) | |
NL-T1 | dCache 1.9.5-21 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) | |
PIC | dCache 1.9.5-21 (PNFS) | None | None
RAL | CASTOR 2.1.7-27 and 2.1.9-7 (stagers), CASTOR 2.1.8-3 (nameserver central node), CASTOR 2.1.8-17 (nameserver local node on SRM machines), CASTOR 2.1.8-8, 2.1.8-14 and 2.1.9-1 (tape servers), SRM 2.8-2 | Upgraded LHCb to CASTOR 2.1.9 during September | Will upgrade the other VOs to 2.1.9 during Oct/Nov
TRIUMF | dCache 1.9.5-21 (Chimera) | |
CASTOR news
- CASTOR 2.1.9-9 is being tested (on preproduction and other internal instances), in particular to improve control over tape drive usage.
2.1.9-9 contains the changes needed to cope with the increased requirements of the heavy-ion run (especially for ALICE and CMS).
2.1.9-9 will be gradually rolled out (in agreement with the experiments) in order to have a combined test prior to the HI data taking.
- An improved version of the xrootd-CASTOR plug-in (2.1.9-7) is available. It was installed on the CERN instances on October 5 and 6 (ATLAS, CMS and LHCb). This version should cure some problems observed over the last couple of months.
- A campaign to install more capacity for the heavy-ion run for CASTOR at CERN is well underway. Due to the high peak rates expected from ALICE and CMS (adding up to more than 4 GB/s), their data export will take place mostly after the run.
xrootd news
dCache news
Nothing to report.
StoRM news
FTS news
FTS 2.2.5 was tagged this morning and certification will start tomorrow; its main new feature is support for sites without SRM.
Work has started on 2.2.6, which will add the possibility to resume failed transfers (this will work with all SEs except CASTOR when it uses its "internal" GridFTP daemon).
DPM news
Certification of 1.8 should be completed early next week; the migration to the new database schema is still to be certified.
This release introduces the possibility to ban users and VOs.
LFC news
Certification of 1.8 should be completed early next week; the migration to the new database schema is still to be certified.
This release introduces the possibility to ban users and VOs. Another change affects a bulk method, as requested by LHCb.
There is a memory leak in the VOMS libraries used in LFC 1.7.4. This is not a sufficient reason to downgrade, but sites should keep an eye on the memory used by the LFC daemon: the size increases in proportion to usage, and the largest value observed so far was about 1.3 GB. New VOMS libraries for SLC5 64-bit have just been released and will be used with 1.8.
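Sites that want to watch for this leak could use a minimal cron-able sketch along the following lines. The process name `lfcdaemon` and the 1.3 GB threshold (the largest value observed so far) are assumptions; adjust both to the local setup.

```shell
#!/bin/sh
# Sketch: warn if the LFC daemon's resident memory exceeds a threshold.
# Assumptions: process name "lfcdaemon"; threshold 1.3 GB (largest leak seen so far).
THRESH_KB=$((1300 * 1024))

check_rss() {
  # $1: resident set size in kB; prints WARN above the threshold, OK otherwise.
  if [ "$1" -gt "$THRESH_KB" ]; then
    echo "WARN"
  else
    echo "OK"
  fi
}

# Sum RSS (in kB) over all lfcdaemon processes; 0 if none is running.
rss_kb=$(ps -C lfcdaemon -o rss= 2>/dev/null | awk '{s+=$1} END {print s+0}')
check_rss "$rss_kb"
```

Run from cron and alert on any WARN output; restarting the daemon reclaims the leaked memory until the fixed VOMS libraries arrive with 1.8.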
LFC deployment
Site | Version | OS, n-bit | Backend | Upgrade plans
ASGC | 1.7.2-4 | SLC4 64-bit | Oracle |
BNL | 1.7.2-4 | SL4 | Oracle | 1.7.4 on SL5 in October
CERN | 1.7.3 | SLC4 64-bit | Oracle | Will upgrade to SLC5 64-bit by the end of the year
CNAF | 1.7.2-4 | SLC4 32-bit | Oracle | 1.7.4 on SL5 64-bit in October
FNAL | N/A | | | Not deployed at Fermilab
IN2P3 | 1.7.4-7 | SL5 64-bit | Oracle |
KIT | 1.7.4 | SL5 64-bit | Oracle |
NDGF | | | |
NL-T1 | 1.7.4-7 | CentOS5 64-bit | Oracle |
PIC | 1.7.4-7 | SL5 64-bit | Oracle |
RAL | 1.7.4-7 | SL5 64-bit | Oracle |
TRIUMF | 1.7.2-5 | SL5 64-bit | MySQL |
GGUS Issues
Ticket / Experiment | Creation date | Assigned to | Last update | Status | Comment
GGUS:61440 ATLAS CNAF-BNL transfers | 2010/08/23 | ROC_Italy | 2010/10/06 | In progress | Alternative network routes (now GARR) being tested (successfully!?). This ticket appears for the 3rd time in the T1SCM report!
GGUS:62696 CMS xrootd errors | 2010/10/02 | ROC_CERN | 2010/10/05 | In progress | Investigation now moved to the CASTOR Support Savannah tracker
GGUS:62401 CMS wrong settings for FNAL in the CERN FTS | 2010/09/24 | ROC_CERN | 2010/09/29 | In progress | Possibly forgotten
GGUS:61706 CMS Maradona error in the CERN CEs | 2010/08/31 | ROC_CERN | 2010/10/06 | On hold | An LSF patch is planned but no date has been decided yet. This ticket appears for the 2nd time in the T1SCM report!
Outstanding SIRs
From the last MB report, outstanding were:
- LHCb LFC replication (from Sep 14 – GGUS:62078) - now done, SIR
- New ASGC DB problem (from Sep 25) - not done
- NDGF-RAL networking issues (from Aug 19 – GGUS:61306) - see LHC OPN discussion
- BNL-CNAF networking issues (from Aug 23 – GGUS:61440) - ditto
LHC OPN meeting
These issues of involvement and ownership are being discussed at the ongoing LHC OPN meeting - agenda.
Conditions data access and related services
COOL, CORAL and POOL
- CMS has prepared a new release using the CORAL_2_3_12 tag. This is confirmed to successfully solve the crash previously observed with optimized gcc 4.3 builds.
- Work is ongoing to reproduce and fix the problems observed by several users (in all of ATLAS, CMS, LHCb) in the CORAL handling of network or database server glitches.
Frontier
Database services
- Topics of general discussion (and CERN report)
- We have started validation of 10.2.0.5 with the experiments and the main applications. A summary of the testing will be presented at the Distributed DB workshop in November. The final decision on the upgrade, and its schedule, will be published in December. The current general plan is to upgrade in January.
- An extended maintenance contract will be necessary for Oracle support of 10.2.0.x from August 2011. CERN plans to negotiate this with Oracle.
- An update on the non-rolling behaviour of the April and July PSUs:
- Tests have reproduced the issue on several occasions. Oracle Support is analyzing our logs.
- The distributed DB workshop has been scheduled for November 16th and 17th. The draft agenda is open for comments and feedback:
https://twiki.cern.ch/twiki/bin/view/PDBService/3DWorkshopPlanning
- Experiment reports:
- ALICE:
- ALICE plans to replace the current DB hardware, which is going out of warranty, during the extended technical stop. Discussions with IT-DB on which hardware to buy are ongoing.
- ATLAS:
- ATLAS has requested the split of the offline DB into conditions and ADC parts. Tests are in progress; this will be discussed in more detail at the ATLAS DB meeting on Monday.
- CMS:
- CMS_LUMI_PROD will be moved from CMS PVSS replication to CMS Conditions replication (scheduled for 7/10)
- LHCb:
Site | Status, recent changes, incidents, ... | Planned interventions
ASGC | 3D Streams: due to a problem during power generator maintenance, all ASGC services were shut down by an unexpected power cut. The 3D database and its backup are corrupted; the 3D database is still under recovery (thanks to Marcin and Eter). The transportable tablespace transfer shows very poor performance due to a network issue, which the ASGC network team is now investigating. The RAC testbeds were patched to the April 2010 level. | None
BNL | Patch 6196748 plus OS and ASM kernel bug fixes were deployed on the Conditions DB. | None
CNAF | X | X
IN2P3 | X | X
KIT | Nothing to report | None
NDGF | Nothing to report | None
PIC | Applied the latest security kernel updates on all systems | None
RAL | X | X
SARA | SARA is still running on the alternate server. The data corruption is no longer present after a re-format of some LUNs (surprisingly, not the ones in use at the time of the incident). We are currently running storage stress tests; so far 3 of the 80 disks have had to be replaced due to disk errors. It is unlikely that we will find the real cause of the incident, because many things have changed since then (re-formats and a storage firmware upgrade). | No interventions
TRIUMF | Applied Linux kernel security updates to our Oracle servers | None
AOB
- WLCG Service Coordination BOF at CHEP - agenda