WLCG Tier1 Service Coordination Minutes - 25 March 2010
Attendance
Local(Patricia, Andrea Valassi+Sciaba, Jamie, Flavia, Miguel, Julia, Maria G + DZ, Rosa, Eva, Dirk, Simone, Stephane, Alessandro, Jean-Philippe, Gavin, Roberto, Sasha, Luca, Jacek, Zbyszek, Ian F)
Site | Name(s)
ASGC | Jhen-Wei, Shu-Ting
BNL | Carlos, JD
CNAF | Alessandro
FNAL | Jon
KIT | Angela, Andreas
IN2P3 | Osman
NDGF | Jon
NL-T1 | Alexander
PIC | Gonzalo, Josep Flix
RAL | Carmine, Gareth, Andrew, Derek
TRIUMF | Andrew
Minutes of the last meeting and matters arising
- Follow-up on alarm tickets / critical services and related:
- Clarification of GGUS ticketing to the Tier0: when the site CERN_PROD is selected, the GGUS TPM is bypassed; this is true for ALARM, TEAM and 'normal' tickets.
- A test of alarms for the Tier0 critical services will be set up and announced. (The purpose is to test the response to the alarms within the Tier0, not the delivery of the tickets to the site.) (GGUS alarm testing rules)
- ALICE critical services: the criticality of services at Tier1s is now lower thanks to changes in the deployment model. (AliEn v2.18 establishes a failover mechanism; in addition, most sites provide at least two VOBOXes.)
- Further details in presentations on the agenda page.
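The failover mechanism described above can be illustrated with a minimal sketch. This is not the actual AliEn implementation; the endpoint names and the `submit` callable are hypothetical, and the point is only that with two or more VOBOXes per site a single machine failure no longer makes the service critical.

```python
# Illustrative sketch only (hypothetical names, not AliEn code):
# try each VOBOX endpoint in turn and fall back on connection failure.

class AllEndpointsFailed(Exception):
    """Raised when every configured endpoint was tried without success."""


def submit_with_failover(endpoints, submit):
    """Call submit(endpoint) on each endpoint in order; return the first
    successful result, or raise AllEndpointsFailed with the collected errors."""
    errors = []
    for endpoint in endpoints:
        try:
            return submit(endpoint)
        except ConnectionError as exc:  # this VOBOX is down: try the next one
            errors.append((endpoint, exc))
    raise AllEndpointsFailed(errors)
```

With at least two VOBOXes configured, the loop simply moves to the second endpoint when the first is unreachable, which is why per-site support criticality could be lowered.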
Data Management & Other Tier1 Service Issues
- Summary: stability - few recent or planned changes.
Storage systems: status, recent and planned changes
Site | Status | Recent changes | Planned changes
CERN | CASTOR 2.1.9-4 (all); SRM 2.8-6 (ALICE, CMS, LHCb); SRM 2.9-2 (ATLAS) | xroot plugin update | none
ASGC | CASTOR 2.1.7-19 (stager, nameserver); CASTOR 2.1.8-14 (tapeserver); SRM 2.8-2 | Created tape pool for ATLAS RAW data (can be extended on request) |
BNL | dCache 1.9.4-3 | none | none
CNAF | CASTOR 2.1.7-27 (ALICE); SRM 2.8-5 (ALICE); StoRM 1.5.1-2 (ATLAS, CMS, LHCb) | none | none
FNAL | dCache 1.9.5-10 (admin nodes); dCache 1.9.5-12 (pool nodes) | none | none
IN2P3 | dCache 1.9.5-11 with Chimera | | 13/4: configuration update and tests on tape robot, integrate new drives; Q3-Q4: one-week downtime to upgrade the MSS
KIT | dCache 1.9.5-15 (admin nodes); dCache 1.9.5-5 to 1.9.5-15 (pool nodes) | none | none
NDGF | dCache 1.9.7 | none | none
NL-T1 | dCache 1.9.5-16 (SARA); DPM 1.7.3 (NIKHEF) | New disk servers at NIKHEF run in read-only mode due to hardware issues; new data can be written to older disk servers where sufficient space is left | Next LHC stop: migrate three dCache admin nodes to new hardware
PIC | dCache 1.9.5-15 | | 30/3: full-day scheduled downtime for periodic electrical system intervention
RAL | CASTOR 2.1.7-27 (stagers); CASTOR 2.1.8-3 (nameserver central node); CASTOR 2.1.8-17 (nameserver local node on SRM machines); CASTOR 2.1.8-8, 2.1.8-14 and 2.1.9-1 (tape servers); SRM 2.8-2 | none | none
TRIUMF | 1.9.5-11 | |
Other Tier-0/1 issues
CASTOR news
CASTOR 2.1.9-5 was released on 23/3 and is available in the Savannah release area. Full release notes and upgrade instructions are available.
dCache news
Nothing to report.
StoRM news
LFC news
Nothing to report.
FTS
FTS 2.2.3 is finally installed at all sites; this issue is now closed.
Experiment issues
WLCG Baseline Versions
Conditions data access and related services
- Oracle client for Persistency and COOL: tests are being performed to validate a patch received from Oracle Support as a possible fix for the 11g client bug affecting ATLAS sites running AMD Opteron quad-core nodes. Preliminary tests indicate that the patch does not solve the issue. In any case, the priority of this issue has decreased since ATLAS moved back to the 10g client as a workaround for the bug.
- Problems in ATLAS reprocessing at SARA (see attached slides by Sasha). The cause of the incident was Oracle access from TAG production jobs. After investigation by the SARA and ATLAS teams it was found that the problem was caused by the firewall configuration, which has since been fixed. No problems were found in the behaviour of the Persistency software in this incident.
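A firewall misconfiguration of this kind typically manifests as connection timeouts or refusals before any Oracle-level error is seen, so a plain TCP probe of the listener port is a quick first diagnostic. The sketch below is a generic connectivity check, not part of any ATLAS or SARA tooling; it assumes the conventional Oracle listener port 1521, and any host name passed to it would be site-specific.

```python
import socket

def listener_reachable(host, port=1521, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    If new cluster nodes were omitted from the firewall allow list, this
    probe fails (timeout or connection refused) from the affected worker
    nodes while succeeding from hosts that are on the list.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, DNS failures
        return False
```

Running such a probe from a sample of worker nodes against each database node would have distinguished a firewall problem from a fault in the Persistency software.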
Database services
- Presentation by Sasha regarding the connection to Oracle problems observed at SARA during the last ATLAS reprocessing campaign.
- Nodes added to the cluster were not added to the firewall
- Firewall usage at the Tier1 sites
- Experiments reports:
- ALICE: ntr
- ATLAS: the ATLAS archive database is running out of resources; it will be migrated to new hardware around the end of April.
- CMS: an apply handler was configured for the CMS PVSS replication to avoid the frequent apply failures caused by grants to/from users that do not exist at the replica database. Discrepancies in a few schemas of the CMS PVSS replication setup, caused by a wrong rule definition, have been fixed.
- LHCB: ntr
- Tier0 Streams: ntr
- Tier0 Castor: ntr
- Sites status:
- RAL (Carmine): Jan PSU patch applied on LHCb and ATLAS.
- CASTOR: plan to buy new hardware; an email with the plan and architecture was sent around, awaiting feedback.
- New licenses are required by April or May. Gordon has sent an email with the details.
- Gridka (Andreas): ntr
- SARA (Alexander): Jan PSU patch applied.
- CNAF (Alessandro): LHCb-LFC db migrated to new hw. LHCb and ATLAS dbs pending. Experiments will contact CNAF with the best timescale.
- TRIUMF (Andrew): Jan PSU patch applied. Andrew will send an email with the license requirements.
- ASGC: ntr
- NDGF (Jon): ntr
- Question about licenses for partitioning: partitioning is not currently used by COOL; it is being investigated but there are no immediate plans.
- IN2P3 (Osman): ntr
- PIC (input from Elena): TAGs database is ready. The yearly power cut is scheduled for the beginning of next week: all services will be down (see the intervention announcement for more details).
- BNL (Carlos): Jan PSU patch applied to all databases, plus OS patches.
- Database weekly reports presentation by Zbigniew (attached to the agenda):
- Question on distribution of the reports: will be discussed internally by ATLAS.
Action List
Action | Status
FTS 2.2.3 deployment | Completed
Oracle 11g client bug | Patch delivered; tests ongoing
Tier0 alarm tests | To be scheduled
Tier1 Oracle licenses | Requests submitted to IT-DB (affects only RAL)
Tier1 DB monitoring | Experiment (ATLAS) feedback on access to monitoring data
Prolonged site downtimes | Follow up on status and input from experiments other than CMS
LHCb jamboree | Review actions to be followed up
AOB