WLCG Tier1 Service Coordination Minutes - 25 March 2010

Attendance

Local: Patricia, Andrea Valassi + Sciaba, Jamie, Flavia, Miguel, Julia, Maria G + DZ, Rosa, Eva, Dirk, Simone, Stephane, Alessandro, Jean-Philippe, Gavin, Roberto, Sasha, Luca, Jacek, Zbyszek, Ian F

Remote:
  • ASGC: Jhen-Wei, Shu-Ting
  • BNL: Carlos, JD
  • CNAF: Alessandro
  • FNAL: Jon
  • KIT: Angela, Andreas
  • IN2P3: Osman
  • NDGF: Jon
  • NL-T1: Alexander
  • PIC: Gonzalo, Josep Flix
  • RAL: Carmine, Gareth, Andrew, Derek
  • TRIUMF: Andrew

Minutes of the last meeting and matters arising

  • Follow-up on alarm tickets / critical services and related:
    • Clarification of GGUS ticketing to the Tier0: when the site CERN_PROD is selected, the GGUS TPM is bypassed; this is true for ALARM, TEAM and 'normal' tickets alike.
    • A test of alarms for the Tier0 critical services will be set up and announced. (The purpose is to test the response to the alarms within the Tier0, not the delivery of the tickets to the site; see the GGUS alarm testing rules.)
    • ALICE critical services: the support level required for services at Tier1 sites is now lower thanks to changes in the deployment model (AliEn v2.18 establishes a failover mechanism, and in addition most sites provide at least two VOBOXes; see the sketch after this list).
    • Further details in presentations on the agenda page.
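As a rough illustration of the failover behaviour described above, the sketch below tries each of a site's VOBOXes in turn and uses the first one that answers. The host names, port and liveness check are hypothetical placeholders, not the actual AliEn v2.18 implementation.

```python
# Hypothetical sketch of VOBOX failover: try each configured VOBOX in order
# and use the first one that accepts a connection. Host names and port are
# made up; the real failover logic lives inside AliEn v2.18.
import socket

VOBOXES = ["vobox1.example.org", "vobox2.example.org"]  # hypothetical hosts
PORT = 8084                                             # hypothetical port

def pick_vobox(hosts, port, timeout=5.0):
    """Return the first VOBOX that accepts a TCP connection, else None."""
    for host in hosts:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return host
        except OSError:
            continue  # this VOBOX is down or unreachable, try the next one
    return None

if __name__ == "__main__":
    vobox = pick_vobox(VOBOXES, PORT)
    print("using VOBOX:", vobox or "none reachable")
```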

Data Management & Other Tier1 Service Issues

  • Summary: stable operation, with few recent or planned changes.

Storage systems: status, recent and planned changes

  • CERN: CASTOR 2.1.9-4 (all); SRM 2.8-6 (ALICE, CMS, LHCb); SRM 2.9-2 (ATLAS). Recent changes: xroot plugin update. Planned changes: none.
  • ASGC: CASTOR 2.1.7-19 (stager, nameserver); CASTOR 2.1.8-14 (tape server); SRM 2.8-2. Recent changes: created a tape pool for ATLAS RAW data (can be extended on request).
  • BNL: dCache 1.9.4-3. Recent changes: none. Planned changes: none.
  • CNAF: CASTOR 2.1.7-27 (ALICE); SRM 2.8-5 (ALICE); StoRM 1.5.1-2 (ATLAS, CMS, LHCb). Recent changes: none. Planned changes: none.
  • FNAL: dCache 1.9.5-10 (admin nodes); dCache 1.9.5-12 (pool nodes). Recent changes: none. Planned changes: none.
  • IN2P3: dCache 1.9.5-11 with Chimera. Planned changes: 13/4, configuration update and tests on the tape robot, integration of new drives; Q3-Q4, one week of downtime to upgrade the MSS.
  • KIT: dCache 1.9.5-15 (admin nodes); dCache 1.9.5-5 to 1.9.5-15 (pool nodes). Recent changes: none. Planned changes: none.
  • NDGF: dCache 1.9.7. Recent changes: none. Planned changes: none.
  • NL-T1: dCache 1.9.5-16 (SARA); DPM 1.7.3 (NIKHEF). Recent changes: new disk servers at NIKHEF run in read-only mode due to hardware issues; new data can be written to older disk servers where sufficient space is left. Planned changes: migrate three dCache admin nodes to new hardware during the next LHC stop.
  • PIC: dCache 1.9.5-15. Planned changes: 30/3, full-day scheduled downtime for a periodic electrical system intervention.
  • RAL: CASTOR 2.1.7-27 (stagers); CASTOR 2.1.8-3 (nameserver central node); CASTOR 2.1.8-17 (nameserver local node on the SRM machines); CASTOR 2.1.8-8, 2.1.8-14 and 2.1.9-1 (tape servers); SRM 2.8-2. Recent changes: none. Planned changes: none.
  • TRIUMF: dCache 1.9.5-11.

Other Tier-0/1 issues

  • KIT: ntr

CASTOR news

CASTOR 2.1.9-5 was released on 23/3 and is available in the Savannah release area; full release notes and upgrade instructions are available.

dCache news

Nothing to report.

StoRM news

LFC news

Nothing to report.

FTS

FTS 2.2.3 is finally installed at all sites; the issue is closed.
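For reference, a minimal sketch of driving an FTS 2.2 server from a script through the standard glite-transfer-* command line tools; the service endpoint and SURLs below are placeholders, not real sites.

```python
# Minimal sketch: submit a transfer job to an FTS 2.2 server and poll its
# state via the glite-transfer-* CLI. Endpoint and SURLs are placeholders.
import subprocess

FTS = "https://fts.example.org:8443/glite-data-transfer-fts/services/FileTransfer"
SRC = "srm://se1.example.org/dpm/example.org/home/vo/file"
DST = "srm://se2.example.org/dpm/example.org/home/vo/file"

# Submit the job; the CLI prints the job identifier on stdout.
job_id = subprocess.check_output(
    ["glite-transfer-submit", "-s", FTS, SRC, DST], text=True).strip()

# Query the job state (Submitted, Active, Done, Failed, ...).
state = subprocess.check_output(
    ["glite-transfer-status", "-s", FTS, job_id], text=True).strip()
print(job_id, state)
```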

Experiment issues

WLCG Baseline Versions

Conditions data access and related services

  • Oracle client for Persistency and COOL: tests are being performed to validate a patch received from Oracle Support as a possible fix for the 11g client bug affecting ATLAS sites that run AMD Opteron quad-core nodes. Preliminary tests indicate that the patch does not solve the issue. The priority of this fix has in any case decreased, since ATLAS moved back to the 10g client as a workaround (see the sketch at the end of this section).

  • Problems in ATLAS reprocessing at SARA (see the attached slides by Sasha): the cause of the incident was Oracle access from TAG production jobs. After investigation by the SARA and ATLAS teams it was found that the problem was caused by the firewall configuration, which has since been fixed. No problems were found in the behaviour of the Persistency software in this incident.
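As an illustration of the 10g client workaround mentioned above, a job can control which Oracle client libraries the application loads by putting the corresponding instant client first on LD_LIBRARY_PATH before the application starts. The installation path and application command below are hypothetical.

```python
# Hypothetical sketch of the 10g-client workaround: make the dynamic linker
# resolve the Oracle client libraries from a 10g instant client install
# before launching the application. Path and command are placeholders.
import os
import subprocess

ORACLE_10G = "/opt/oracle/instantclient_10_2"  # hypothetical install location

env = dict(os.environ)
env["LD_LIBRARY_PATH"] = ORACLE_10G + ":" + env.get("LD_LIBRARY_PATH", "")

# The child process now picks up libclntsh from the 10g client instead of 11g.
subprocess.check_call(["myapp", "--run-job"], env=env)
```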

Database services

  • Presentation by Sasha on the Oracle connection problems observed at SARA during the last ATLAS reprocessing campaign (a connectivity sketch follows this list):
    • Nodes added to the cluster were not added to the firewall configuration.
    • Firewall usage at the Tier1 sites.
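The failure mode is easy to demonstrate with a simple connectivity probe: a node that was added to the cluster but not to the firewall rules shows up as a timeout on the listener port, while the older nodes still answer. A minimal sketch, with placeholder host names and the default Oracle listener port:

```python
# Minimal sketch: probe the Oracle listener port on every node of a RAC
# cluster. A node added to the cluster but missing from the firewall rules
# appears here as a timeout. Host names are placeholders.
import socket

CLUSTER_NODES = ["db-node1.example.org",   # hypothetical RAC nodes
                 "db-node2.example.org",
                 "db-node3.example.org"]
LISTENER_PORT = 1521                       # default Oracle listener port

for node in CLUSTER_NODES:
    try:
        with socket.create_connection((node, LISTENER_PORT), timeout=5.0):
            print(node, "OK")
    except OSError as err:
        print(node, "UNREACHABLE:", err)   # consistent with a firewall block
```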

  • Experiments reports:
    • ALICE: ntr
    • ATLAS: the ATLAS archive database is running out of resources; it will be migrated to new hardware around the end of April.
    • CMS: an apply handler has been configured for the CMS PVSS replication to avoid the frequent apply failures caused by grants to/from users that do not exist at the replica database (see the sketch after this list). Discrepancies in a few schemas of the CMS PVSS replication setup, caused by a wrong rule definition, have been fixed.
    • LHCb: ntr
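Conceptually, the CMS apply handler skips replicated statements that cannot succeed at the replica and applies everything else. The production handler is PL/SQL attached to the Streams apply process; the Python sketch below, with made-up statements and a made-up user list, only illustrates the filtering rule.

```python
# Conceptual sketch of the apply-handler rule: skip GRANT/REVOKE statements
# that reference users missing at the replica, let everything else through.
# The real handler is PL/SQL registered with the Streams apply process.

def should_apply(statement, replica_users):
    """Return False for grants to/from users that do not exist at the replica."""
    tokens = statement.upper().split()
    if tokens and tokens[0] in ("GRANT", "REVOKE"):
        # Simplification: assume the user is the last token of the statement.
        user = tokens[-1].strip(";").strip('"')
        return user in replica_users
    return True

# Usage example with made-up data:
users = {"CMS_READER", "CMS_WRITER"}
for stmt in ("GRANT SELECT ON pvss.t1 TO CMS_READER;",
             "GRANT SELECT ON pvss.t1 TO OLD_USER;",
             "INSERT INTO pvss.t1 VALUES (1);"):
    print(should_apply(stmt, users), stmt)
```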

  • Tier0 Streams: ntr
  • Tier0 Castor: ntr

  • Sites status:
    • RAL (Carmine): Jan PSU patch applied on LHCb and ATLAS.
      • CASTOR: plan to buy new hardware; an email with the plan and architecture has been sent around, awaiting feedback.
      • New licenses required by April or May. Gordon has sent an email with the details.
    • GridKa (Andreas): ntr
    • SARA (Alexander): Jan PSU patch applied.
    • CNAF (Alessandro): LHCb LFC database migrated to new hardware; the LHCb and ATLAS databases are pending. The experiments will contact CNAF with the best timescale.
    • TRIUMF (Andrew): Jan PSU patch applied. Andrew will send an email with the license requirements.
    • ASGC: ntr
    • NDGF (Jon): ntr
      • Question about licenses for partitioning: partitioning is not currently used by COOL; it is being investigated but there are no immediate plans.
    • IN2P3 (Osman): ntr
    • PIC (input from Elena): the TAGs database is ready. The yearly power cut is scheduled for the beginning of next week: all services will be down (see the intervention announcement for more details).
    • BNL (Carlos): Jan PSU patch applied on all databases, together with OS patches.

  • Database weekly reports presentation by Zbigniew (attached to the agenda):
    • Question on distribution of the reports: will be discussed internally by ATLAS.

Action List

  • FTS 2.2.3 deployment: Completed.
  • Oracle 11g client bug: patch delivered, tests ongoing.
  • Tier0 alarm tests: to be scheduled.
  • Tier1 Oracle licenses: requests submitted to IT-DB (affects only RAL).
  • Tier1 DB monitoring: experiment (ATLAS) feedback on access to monitoring data.
  • Prolonged site downtimes: follow-up on status and input from experiments other than CMS.
  • LHCb jamboree: review actions to be followed up.

AOB
