WLCG Tier1 Service Coordination Minutes - 25 March 2010
Attendance
Local(Patricia, Andrea Valassi+Sciaba, Jamie, Flavia, Miguel, Julia, Maria G + DZ, Rosa, Eva, Dirk, Simone, Stephane, Alessandro, Jean-Philippe, Gavin, Roberto, Sasha, Luca, Jacek, Zbyszek, Ian F)
Site | Name(s)
ASGC | Jhen-Wei, Shu-Ting
BNL | Carlos, JD
CNAF | Alessandro
FNAL | Jon
KIT | Angela, Andreas
IN2P3 | Osman
NDGF | Jon
NL-T1 | Alexander
PIC | Gonzalo, Josep Flix
RAL | Carmine, Gareth, Andrew, Derek
TRIUMF | Andrew
Minutes of the last meeting and matters arising
- Follow-up on alarm tickets / critical services and related:
- Clarification of GGUS ticketing to the Tier0: when the site CERN_PROD is selected, the GGUS TPM is bypassed; this is true for ALARM, TEAM and 'normal' tickets.
- A test of alarms for the Tier0 critical services will be set up and announced. (The purpose is to test the response to the alarms within the Tier0, not the delivery of the tickets to the site.) (GGUS alarm testing rules)
- ALICE critical services: the criticality of services at Tier1s is now lower thanks to changes in the deployment model. (AliEn v2.18 establishes a failover mechanism; in addition, most sites provide at least two VOBOXes.)
- Further details in presentations on the agenda page.
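The failover mechanism described above can be illustrated with a minimal sketch. This is not the actual AliEn implementation; the endpoint names and the `submit` callable are hypothetical, and the point is only that with two or more VOBOXes per site a single machine failure no longer makes the service critical.

```python
# Illustrative sketch only (hypothetical names, not AliEn code):
# try each VOBOX endpoint in turn and fall back on connection failure.

class AllEndpointsFailed(Exception):
    """Raised when every configured endpoint was tried without success."""


def submit_with_failover(endpoints, submit):
    """Call submit(endpoint) on each endpoint in order; return the first
    successful result, or raise AllEndpointsFailed with the collected errors."""
    errors = []
    for endpoint in endpoints:
        try:
            return submit(endpoint)
        except ConnectionError as exc:  # this VOBOX is down: try the next one
            errors.append((endpoint, exc))
    raise AllEndpointsFailed(errors)
```

With at least two VOBOXes configured, the loop simply moves to the second endpoint when the first is unreachable, which is why per-site support criticality could be lowered.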
Data Management & Other Tier1 Service Issues
- Summary: stability - few recent or planned changes.
Storage systems: status, recent and planned changes
Site | Status | Recent changes | Planned changes
CERN | CASTOR 2.1.9-4 (all); SRM 2.8-6 (ALICE, CMS, LHCb); SRM 2.9-2 (ATLAS) | xroot plugin update | none
ASGC | CASTOR 2.1.7-19 (stager, nameserver); CASTOR 2.1.8-14 (tapeserver); SRM 2.8-2 | Created tape pool for ATLAS RAW data (can be extended on request) |
BNL | dCache 1.9.4-3 | none | none
CNAF | CASTOR 2.1.7-27 (ALICE); SRM 2.8-5 (ALICE); StoRM 1.5.1-2 (ATLAS, CMS, LHCb) | none | none
FNAL | dCache 1.9.5-10 (admin nodes); dCache 1.9.5-12 (pool nodes) | none | none
IN2P3 | dCache 1.9.5-11 with Chimera | | 13/4: configuration update and tests on tape robot, integrate new drives; Q3-Q4: one-week downtime to upgrade the MSS
KIT | dCache 1.9.5-15 (admin nodes); dCache 1.9.5-5 to 1.9.5-15 (pool nodes) | none | none
NDGF | dCache 1.9.7 | none | none
NL-T1 | dCache 1.9.5-16 (SARA); DPM 1.7.3 (NIKHEF) | New disk servers at NIKHEF run in read-only mode due to hardware issues; new data can be written to older disk servers where sufficient space is left | Next LHC stop: migrate three dCache admin nodes to new hardware
PIC | dCache 1.9.5-15 | | 30/3: full-day scheduled downtime for periodic electrical system intervention
RAL | CASTOR 2.1.7-27 (stagers); CASTOR 2.1.8-3 (nameserver central node); CASTOR 2.1.8-17 (nameserver local node on SRM machines); CASTOR 2.1.8-8, 2.1.8-14 and 2.1.9-1 (tape servers); SRM 2.8-2 | none | none
TRIUMF | 1.9.5-11 | |
Other Tier-0/1 issues
CASTOR news
CASTOR 2.1.9-5 was released on 23/3 and is available in the Savannah release area. Full release notes and upgrade instructions are available.
dCache news
Nothing to report.
StoRM news
LFC news
Nothing to report.
FTS
FTS 2.2.3 is finally installed at all sites; this issue is now closed.
Experiment issues
WLCG Baseline Versions
Conditions data access and related services
- Oracle client for Persistency and COOL: tests are being performed to validate a patch received from Oracle Support as a possible fix for the 11g client bug affecting ATLAS sites running AMD Opteron quad-core nodes. Preliminary tests indicate that the patch does not solve the issue. In any case, the priority of this issue has decreased since ATLAS moved back to the 10g client as a workaround for the bug.
- Problems in ATLAS reprocessing at SARA (see attached slides by Sasha). The cause of the incident was Oracle access from TAG production jobs. After investigation by the SARA and ATLAS teams it was found that the problem was caused by the firewall configuration, which has since been fixed. No problems were found in the behaviour of the Persistency software in this incident.
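A firewall misconfiguration of this kind typically manifests as connection timeouts or refusals before any Oracle-level error is seen, so a plain TCP probe of the listener port is a quick first diagnostic. The sketch below is a generic connectivity check, not part of any ATLAS or SARA tooling; it assumes the conventional Oracle listener port 1521, and any host name passed to it would be site-specific.

```python
import socket

def listener_reachable(host, port=1521, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    If new cluster nodes were omitted from the firewall allow list, this
    probe fails (timeout or connection refused) from the affected worker
    nodes while succeeding from hosts that are on the list.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, DNS failures
        return False
```

Running such a probe from a sample of worker nodes against each database node would have distinguished a firewall problem from a fault in the Persistency software.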
Database services
- Presentation by Sasha regarding the connection to Oracle problems observed at SARA during the last ATLAS reprocessing campaign.
- Nodes added to the cluster were not added to the firewall
- Firewall usage at the Tier1 sites
- Experiments reports:
- ALICE: ntr
- ATLAS: the ATLAS archive database is running out of resources; it will be migrated to new hardware around the end of April.
- CMS: an apply handler was configured for the CMS PVSS replication to avoid the frequent apply failures caused by grants to/from users that do not exist at the replica database. Discrepancies in a few schemas of the CMS PVSS replication setup, caused by a wrong rule definition, have been fixed.
- LHCB: ntr
- Tier0 Streams: ntr
- Tier0 Castor: ntr
- Sites status:
- RAL (Carmine): Jan PSU patch applied on LHCb and ATLAS.
- CASTOR: plan to buy new hardware; an email with the plan and architecture was sent around, awaiting feedback.
- New licenses are required by April or May. Gordon has sent an email with the details.
- Gridka (Andreas): ntr
- SARA (Alexander): Jan PSU patch applied.
- CNAF (Alessandro): LHCb-LFC db migrated to new hw. LHCb and ATLAS dbs pending. Experiments will contact CNAF with the best timescale.
- TRIUMF (Andrew): Jan PSU patch applied. Andrew will send an email with the license requirements.
- ASGC: ntr
- NDGF (Jon): ntr
- Question about licenses for partitioning: partitioning is not currently used by COOL; it is being investigated but there are no immediate plans.
- IN2P3 (Osman): ntr
- PIC (input from Elena): TAGs database is ready. The yearly power cut is scheduled for the beginning of next week: all services will be down (see the intervention announcement for more details).
- BNL (Carlos): Jan PSU patch applied to all databases, plus OS patches.
- Database weekly reports presentation by Zbigniew (attached to the agenda):
- Question on distribution of the reports: will be discussed internally by ATLAS.
Action List
Action | Status
FTS 2.2.3 deployment | Completed
Oracle 11g client bug | Patch delivered; tests ongoing
Tier0 alarm tests | To be scheduled
Tier1 Oracle licenses | Requests submitted to IT-DB (affects only RAL)
Tier1 DB monitoring | Experiment (ATLAS) feedback on access to monitoring data
Prolonged site downtimes | Follow up on status and input from experiments other than CMS
LHCb jamboree | Review actions to be followed up
AOB