WLCG Tier1 Service Coordination Minutes - 13 October 2011

Attendance

(CERN: Massimo, Dario, Alexei, Eliszbeth, Simone, Alessandro, Stefan, MariaD, MariaA, MariaG, Maite, Zsolt, Oli, Dirk, Maarten, Eva, AndreaS, Jamie, Wojcek, Nicolo) (Remote: AndreaV, Carlos, Burt, Carmine, Ron, AlessandroC)

Action list review

Release update

Data Management & Other Tier1 Service Issues

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.11-6 for ATLAS and CMS, 2.1.11-5 for PUBLIC, others are on 2.1.11-2; SRM 2.10-x (SL4); xrootd: 2.1.11-1; FTS: 5 nodes in SLC5 3.7.0-3, 7 nodes in SLC4 3.2.1; EOS -0.1.0/xrootd-3.0.4 | CASTOR 2.1.11-* to 2.1.11.6 for ATLAS and CMS | CASTOR: bring all instances to the same level (presumably 2.1.11.8); the CASTOR CERNT3 will be retired as soon as possible (capacity goes to EOS) |
| ASGC | CASTOR 2.1.11-5; SRM 2.11-0; DPM 1.8.0-1 | None | Upgrade CASTOR to 2.1.11-6; date TBD depending on CMS reprocessing schedule |
| BNL | dCache 1.9.5-23 (PNFS, Postgres 9) | None | Transition from PNFS to Chimera at next LHC TS |
| CNAF | StoRM 1.7.0 (Atlas); StoRM 1.5.0 (other endpoints) |  |  |
| FNAL | dCache 1.9.5-23 (PNFS) httpd=1.9.5.-25; Scalla xrootd 2.9.1/1.4.2-4; Oracle Lustre 1.8.3 | None | Scheduled downtime 15-Oct 0300-1800 CDT due to power cut |
| IN2P3 | dCache 1.9.5-29 (Chimera) on core servers and pool nodes | Upgrade to 1.9.5.29 last week |  |
| KIT | dCache (admin nodes): 1.9.5-27 (ATLAS, Chimera), 1.9.5-26 (CMS, Chimera), 1.9.5-26 (LHCb, PNFS); dCache (pool nodes): 1.9.5-6 through 1.9.5-27 |  |  |
| NDGF | dCache 1.9.14 (Chimera) on core servers. Mix of 1.9.13 and 2.0.0 on pool nodes. |  |  |
| NL-T1 | dCache 1.9.12-10 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) |  |  |
| PIC | dCache 1.9.12-10 (last upgrade to patch release on 13-Sep); PNFS on Postgres 9.0 |  |  |
| RAL | CASTOR 2.1.10-1; 2.1.10-0 (tape servers); SRM 2.10-2 | Started using T10KC tape media for ATLAS on 28/9/11 | None |
| TRIUMF | dCache 1.9.5-28 with Chimera namespace | 5/10: updated dCache |  |

Other site news

CASTOR news

CERN operations and development

EOS news

xrootd news

dCache news

StoRM news

FTS news

  • FTS 2.2.5 in gLite Staged Rollout: http://glite.cern.ch/staged_rollout
  • FTS 2.2.6 released in EMI-1 Update 6 on Sep 1
    • restart/partial resume of failed transfers
  • FTS 2.2.7 cancelled, next version will be FTS 2.2.8

DPM news

  • DPM 1.8.2-2 - a problem was found in certification, fixed, and the release rebuilt
  • DPM 1.8.2-3 in staged rollout
  • Monthly releases of new unstable components can be followed on the blog: https://svnweb.cern.ch/trac/lcgdm/blog
    • This covers NFSv4.1, WebDAV, Nagios, Catalogue synchronisation & 'perfsuite'.

LFC news

LFC deployment

| Site | Version | OS, n-bit | Backend | Upgrade plans |
| ASGC | 1.8.0-1 | SLC5 64-bit | Oracle | None |
| BNL | 1.8.0-1 | SL5 64-bit | Oracle | None |
| CERN | 1.8.2-0 | SLC5 64-bit | Oracle | Upgrade to SLC5 64-bit only pending for lfcshared1/2 |
| CNAF | 1.7.4-7 (ATLAS, to be dismissed); 1.8.0-1 (LHCb, recently updated) | SL5 64-bit | Oracle |  |
| FNAL | N/A |  |  | Not deployed at Fermilab |
| IN2P3 | 1.8.0-1 | SL5 64-bit | Oracle 11g | Plan to migrate to 1.8.0-2 asap |
| KIT | 1.7.4-7 | SL5 64-bit | Oracle | Oracle backend migration pending |
| NDGF | 1.7.4.7-1 | Ubuntu 10.04 64-bit | MySQL | None |
| NL-T1 | 1.7.4-7 | CentOS5 64-bit | Oracle |  |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle |  |
| RAL | 1.7.4-7 | SL5 64-bit | Oracle |  |
| TRIUMF | 1.7.3-1 | SL5 64-bit | MySQL | None |

Experiment issues

WLCG Baseline Versions

  • WLCG Baseline versions: table

Status of open GGUS tickets

(MariaDZ):

  • The operators' instructions should be further narrowed down to avoid calling the wrong service (MariaDZ).
  • The DB group should have members in the e-group notified by GGUS and SNOW in case of ALARMs (MariaDZ working with Maite and Eva on appropriate names).
  • Experiments should not submit the most recent tickets for these reports but only those which don't get adequate support (subset of tickets extracted from today's CMS email).
  • The LHCb GGUS:74775 on LFC has indeed been open for 2 weeks, but supporters have been actively working on narrowing down the right Support Unit. CERN LFC supporters received it yesterday and will act a.s.a.p.

Review of recent / open SIRs and other open service issues

  • MariaDZ will rephrase the recent GGUS SIR to make the impact on users obvious.
  • 3 SIRs related to recent DB tickets (CMS, Atlas and ALARMs' notifications in general) should come out of the DB group.

Conditions data access and related services

Database services

  • Experiment reports:
    • ATLAS:
      • Atlas offline DB for Atlas ADC was impacted by network issues to Safehost last Monday.
      • Atlas offline DB, ATLR, suffered high load and instance reboots during the nights of the 11th and 12th. The high load comes as spikes from Tier0 jobs and has been observed for a couple of months. We are aware that Atlas is actively testing the use of Frontier to reduce this load. We are also discussing with Atlas an action plan to mitigate the impact of this high load. Meeting with Atlas on the morning of 13-10 to discuss this; a couple more nodes will be added to the cluster as a temporary fix, and hopefully Atlas will propose more application-level solutions too.
    • CMS:
      • Node reboots affecting CMS offline production database. Root cause being investigated. Additional logging proposed by Oracle has been enabled.
    • WLCG: ntr

  • Site reports:
| Site | Status, recent changes, incidents, ... | Planned interventions |
| BNL | Conditions database: patch (6011045) applied as recommended by MOS; it fixes BUG 6011045, which affects DBMS_STATS, causing a deadlock between 'cursor: pin S wait on X' and 'library cache lock' | Relocate the LFC and FTS cluster to a new datacenter room; this intervention is planned for November 7th |
| CNAF | ntr | ntr |
| KIT |  |  |
| IN2P3 | LHCb-LFC corruption?? |  |
| PIC | ntr | ntr |
| RAL | Castor: hit the duplicate sub-request problem again. The problem has been solved by deleting all sub-requests. Root cause is still being investigated | ntr |
| SARA | ntr | ntr |
| TRIUMF | On Oct 5th, applied July 2011 CPU Patch | ntr |

AOB

-- AndreaSciaba - 12-Oct-2011

Topic revision: r14 - 2011-10-14 - MaartenLitmaath
 