WLCG Tier1 Service Coordination Minutes - 07/10/2010

Attendance

Gonzalo Merino (PIC), Elizabeth Chism (OSG), Luca dell'Agnello (CNAF), Patrick (dCache.org), Elena Planas (PIC), Felix Lee (ASGC), Jon Bakken (FNAL), Pierre Girard (IN2P3), Alessandro Cavalli (INFN), John DeStefano, Carlos (BNL), Jhen-Wei Huang (ASGC), Ron Trompert (NL-T1), Andrew (TRIUMF), Andreas (GridKA), Elizabeth Gallas

Lola, Andrea Valassi, Alessandro, Manuel, Jean-Philippe, Andrea Sciaba, Jamie, Maria, Roberto, Maria Allandes, Maria DZ, Massimo, Alexei, Luca

Release Update

WLCG Baseline Versions

Data Management & Other Tier1 Service Issues

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.9-8 (all); SRM 2.9-4 (all); xrootd 2.1.9-7 | End of August: CASTOR upgrade; 5-6/10: xrootd upgrade for ATLAS, CMS and LHCb | CASTOR 2.1.9-9 is being tested and will be gradually rolled out (in agreement with the experiments) in order to have a combined test prior to the HI data taking (hence before 1 November) |
| ASGC | CASTOR 2.1.7-19 (stager, nameserver); CASTOR 2.1.8-14 (tapeserver); SRM 2.8-2 | 24/11: unscheduled downtime for 5 hours due to a power outage | Setting up a 2.1.9 testbed for a future upgrade |
| BNL | dCache 1.9.4-3 (PNFS) | None | None |
| CNAF | StoRM 1.5.4-5 (ATLAS, CMS, LHCb, ALICE) | | |
| FNAL | dCache 1.9.5-10 (admin nodes) (PNFS); dCache 1.9.5-12 (pool nodes) | None | Investigating upgrade to 1.9.5-22 during the Nov 2 downtime |
| IN2P3 | dCache 1.9.5-22 (Chimera) | Upgrade to 1.9.5-22 (core servers + pool nodes); ALICE removed from the dCache configuration; tape system protection enabled | None |
| KIT | dCache 1.9.5-15 (admin nodes) (Chimera); dCache 1.9.5-5 to 1.9.5-15 (pool nodes) | | |
| NDGF | dCache 1.9.7 (head nodes) (Chimera); dCache 1.9.5, 1.9.6 (pool nodes) | | |
| NL-T1 | dCache 1.9.5-21 (Chimera) (SARA); DPM 1.7.3 (NIKHEF) | | |
| PIC | dCache 1.9.5-21 (PNFS) | None | None |
| RAL | CASTOR 2.1.7-27 and 2.1.9-7 (stagers); CASTOR 2.1.8-3 (nameserver central node); CASTOR 2.1.8-17 (nameserver local node on SRM machines); CASTOR 2.1.8-8, 2.1.8-14 and 2.1.9-1 (tape servers); SRM 2.8-2 | Upgraded LHCb to CASTOR 2.1.9 during September | Will be updating the other VOs to 2.1.9 during Oct/Nov |
| TRIUMF | dCache 1.9.5-21 with Chimera namespace | | |

CASTOR news

  • CASTOR 2.1.9-9 is being tested (on preproduction and other internal instances), with a particular focus on improving control of tape drive usage.
    2.1.9-9 contains the changes necessary to cope with the increased requirements of the heavy-ion run (especially from ALICE and CMS).
    2.1.9-9 will be gradually rolled out (in agreement with the experiments) in order to have a combined test prior to the HI data taking.
  • An improved version of the xrootd-CASTOR plug-in (2.1.9-7) is available. It was installed on the CERN instances on October 5 and 6 (ATLAS, CMS and LHCb). This version should cure some of the problems observed over the last couple of months.
  • A campaign to install more capacity for the heavy-ion run for CASTOR at CERN is well underway. Due to the high peak rates expected from ALICE and CMS (adding up to more than 4 GB/s), their data export will take place mostly after the run (see the rough estimate below).
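
As a rough, purely illustrative estimate of why the export has to wait: the only figure taken from the minutes is the combined >4 GB/s peak rate; the fill length and the sustained export rate in the sketch below are assumptions for illustration.

    # Back-of-the-envelope sketch only; the run length and export rate are
    # illustrative assumptions, not figures from the minutes.
    peak_rate_gb_s = 4.0            # combined ALICE + CMS peak rate (from the minutes)
    hours_of_data_taking = 1.0      # assumed length of one HI fill
    export_rate_gb_s = 0.5          # assumed sustained WAN export rate

    data_volume_tb = peak_rate_gb_s * hours_of_data_taking * 3600 / 1024.0
    export_hours = data_volume_tb * 1024.0 / export_rate_gb_s / 3600

    print("~%.1f TB written per hour of data taking" % data_volume_tb)
    print("~%.1f hours to export at %.1f GB/s" % (export_hours, export_rate_gb_s))
    # Writing outruns export by roughly peak_rate/export_rate, which is why
    # CASTOR needs extra disk buffer and the export happens mostly after the run.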

xrootd news

  • See CASTOR news

dCache news

Nothing to report.

StoRM news

FTS news

FTS 2.2.5 was tagged this morning; certification will start tomorrow. Its main new feature is support for sites without SRM.

Work has started on 2.2.6, which will add the possibility to resume failed transfers (this will work with all SEs except CASTOR when it uses the "internal" GridFTP daemon).

DPM news

Certification of 1.8 should be completed early next week; the migration to the new database schema is still to be certified. The release introduces the possibility to ban users and VOs.

LFC news

Certification of 1.8 should be completed early next week; the migration to the new database schema is still to be certified. The release introduces the possibility to ban users and VOs. Another change affects a bulk method, as requested by LHCb.

There is a memory leak in the VOMS libraries used in LFC 1.7.4. This is not a sufficient reason to downgrade, but sites should keep an eye on the memory used by the LFC daemon. The size increases in proportion to the usage, but the largest value observed so far was about 1.3 GB. New VOMS libraries for SLC5 64-bit have just been released and they will be used with 1.8.
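
As a minimal sketch of the kind of monitoring suggested above (not an official LFC tool): a small Python script that reads the resident memory of the LFC daemon from /proc and warns above a threshold. The process name "lfcdaemon" and the 1.3 GB threshold are assumptions, the latter based on the largest value reported so far.

    # Hedged sketch only: a cron-able check of the LFC daemon's resident memory.
    # The process name and the warning threshold are assumptions, not official values.
    import os, sys

    PROCESS_NAME = "lfcdaemon"   # assumed daemon process name
    WARN_THRESHOLD_MB = 1300     # roughly the largest value observed so far

    def proc_info(pid):
        """Return (name, rss_mb) for a pid, read from /proc/<pid>/status."""
        name, rss = None, 0.0
        with open("/proc/%s/status" % pid) as f:
            for line in f:
                if line.startswith("Name:"):
                    name = line.split()[1]
                elif line.startswith("VmRSS:"):
                    rss = int(line.split()[1]) / 1024.0   # kB -> MB
        return name, rss

    warned = False
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            name, rss = proc_info(pid)
        except IOError:
            continue   # process exited while we were scanning
        if name != PROCESS_NAME:
            continue
        print("%s pid=%s rss=%.0f MB" % (name, pid, rss))
        if rss > WARN_THRESHOLD_MB:
            print("WARNING: above %d MB, a daemon restart may be advisable" % WARN_THRESHOLD_MB)
            warned = True
    sys.exit(1 if warned else 0)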

LFC deployment

| Site | Version | OS, n-bit | Backend | Upgrade plans |
| ASGC | 1.7.2-4 | SLC4 64-bit | Oracle | |
| BNL | 1.7.2-4 | SL4 | Oracle | 1.7.4 on SL5 in October |
| CERN | 1.7.3 | SLC4 64-bit | Oracle | Will upgrade to SLC5 64-bit by the end of the year |
| CNAF | 1.7.2-4 | SLC4 32-bit | Oracle | 1.7.4 on SL5 64-bit in October |
| FNAL | N/A | | | Not deployed at Fermilab |
| IN2P3 | 1.7.4-7 | SL5 64-bit | Oracle | |
| KIT | 1.7.4 | SL5 64-bit | Oracle | |
| NDGF | | | | |
| NL-T1 | 1.7.4-7 | CentOS5 64-bit | Oracle | |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle | |
| RAL | 1.7.4-7 | SL5 64-bit | Oracle | |
| TRIUMF | 1.7.2-5 | SL5 64-bit | MySQL | |

GGUS Issues

| Ticket / Experiment | Creation date | Assigned to | Last update | Status | Comment |
| GGUS:61440 ATLAS CNAF-BNL transfers | 2010/08/23 | ROC_Italy | 2010/10/06 | In progress | Alternative network routes (now GARR) being tested (successfully!?). This ticket appears for the 3rd time in the T1SCM report! |
| GGUS:62696 CMS xrootd errors | 2010/10/02 | ROC_CERN | 2010/10/05 | In progress | Investigation now moved to the CASTOR Support Savannah tracker |
| GGUS:62401 CMS wrong settings for FNAL in the CERN FTS | 2010/09/24 | ROC_CERN | 2010/09/29 | In progress | Possibly forgotten |
| GGUS:61706 CMS Maradona error on the CERN CEs | 2010/08/31 | ROC_CERN | 2010/10/06 | On hold | An LSF patch is planned but no date has been decided yet. This ticket appears for the 2nd time in the T1SCM report! |

Outstanding SIRs

From the last MB report, outstanding were:

  • LHCb LFC replication (from Sep 14 – GGUS:62078) - now done, SIR
  • New ASGC DB problem (from Sep 25) - not done
  • NDGF-RAL networking issues (from Aug 19 – GGUS:61306) - see LHC OPN discussion
  • BNL-CNAF networking issues (from Aug 23 – GGUS:61440) - ditto

LHC OPN meeting

These issues of involvement and ownership are being discussed at the ongoing LHC OPN meeting - agenda.

Conditions data access and related services

COOL, CORAL and POOL

  • CMS has prepared a new release using the CORAL_2_3_12 tag. This is confirmed to successfully solve the crash previously observed with optimized gcc 4.3 builds.
  • Work is ongoing to reproduce and fix the problems observed by several users (in all of ATLAS, CMS, LHCb) in the CORAL handling of network or database server glitches.

Frontier

Database services

  • Topics of general discussion (and CERN report)
    • Validation of 10.2.0.5 with the experiments and the main applications has started. A summary of the testing will be given at the Distributed DB workshop in November; the final decision on the upgrade and its schedule will be published in December. The current general plan is to upgrade in January.
    • An extended maintenance contract will be necessary for Oracle support of 10.2.0.x from August 2011; CERN plans to negotiate this with Oracle.
    • An update on the April and July PSU non-rolling testing:
      • Tests have reproduced the issue on several occasions. Oracle support is analyzing our logs.
    • The distributed DB workshop has been organized for November 16th and 17th. The draft agenda is open for comments and feedback:
https://twiki.cern.ch/twiki/bin/view/PDBService/3DWorkshopPlanning

  • Experiment reports:
    • ALICE:
      • ALICE plans to replace the current DB hardware, which is going out of warranty, during the extended technical stop. Discussions with IT-DB on which hardware to buy are ongoing.
    • ATLAS:
      • ATLAS has requested splitting the offline DB into conditions and ADC. Tests are in progress; this will be discussed in more detail at the ATLAS DB meeting on Monday.
    • CMS:
      • CMS_LUMI_PROD will be moved from CMS PVSS replication to CMS Conditions replication (scheduled for 7/10)
    • LHCb:
      • Nothing to report

  • Site reports:
| Site | Status, recent changes, incidents, ... | Planned interventions |
| ASGC | 3D Streams: due to a problem during power generator maintenance, all ASGC services were shut down by an unexpected power cut. The 3D database and its backup are corrupted; the 3D database is still under recovery (thanks to Marcin and Eter). The transportable tablespace gets very poor performance due to a network issue; the ASGC network team is investigating this now. The RAC testbeds were patched to April 2010. | None |
| BNL | Patch 6196748, OS and ASM kernel bug fixes were deployed on the Conditions DB. | None |
| CNAF | X | X |
| IN2P3 | X | X |
| KIT | Nothing to report | None |
| NDGF | Nothing to report | None |
| PIC | Applied the latest security kernel updates on all systems | None |
| RAL | X | X |
| SARA | SARA is still running on the alternate server. The data corruption is no longer present after a re-format of some LUNs (surprisingly, not the ones in use at the time of the incident). At the moment we are performing "storage" stress tests (a rough illustration of such a test is sketched after this table); so far 3 of the 80 disks have had to be replaced due to disk errors. It is unlikely that we will find the real cause of the incident because many things have changed since then (re-formats and a storage firmware upgrade). | No interventions |
| TRIUMF | Applied Linux kernel security updates to our Oracle servers | None |
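
Purely as an illustration of the kind of "storage" stress test mentioned in the SARA report (write pseudo-random data, read it back, compare checksums), a minimal Python sketch follows; the test directory, file size and file count are assumptions, and this is not SARA's actual procedure.

    # Minimal write/read-back/verify stress sketch; paths and sizes are assumptions.
    import hashlib, os

    TEST_DIR = "/srv/stress"        # assumed mount point of the storage under test
    FILE_SIZE_MB = 256              # assumed size per test file
    N_FILES = 10                    # assumed number of files per pass

    def write_and_checksum(path, size_mb):
        """Write pseudo-random data to path and return its SHA-1."""
        h = hashlib.sha1()
        with open(path, "wb") as f:
            for _ in range(size_mb):
                chunk = os.urandom(1024 * 1024)
                h.update(chunk)
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())
        return h.hexdigest()

    def read_and_checksum(path):
        """Read path back and return its SHA-1."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        return h.hexdigest()

    for i in range(N_FILES):
        name = os.path.join(TEST_DIR, "stress_%03d.dat" % i)
        written = write_and_checksum(name, FILE_SIZE_MB)
        read_back = read_and_checksum(name)
        print("%s %s" % (name, "OK" if written == read_back else "CORRUPT"))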

AOB

  • WLCG Service Coordination BOF at CHEP - agenda