WLCG Tier1 Service Coordination Minutes - 10 February 2011

Attendance

LHC run in 2011 / 2012

Action list review

Release update

Data Management & Other Tier1 Service Issues

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.10 (all); SRM 2.9-4 (all); xrootd 2.1.9-7 | | |
| ASGC | CASTOR 2.1.7-19 (stager, nameserver); CASTOR 2.1.8-14 (tapeserver); SRM 2.8-2 | 28/1: 30 minutes of unscheduled downtime for CASTOR due to blade firmware upgrade, core servers had to be rebooted | Upgrade tests ongoing |
| BNL | dCache 1.9.5-23 (PNFS, Postgres 9) | None | In the process of adding disk space |
| CNAF | StoRM 1.5.6-3 (ATLAS, CMS, LHCb, ALICE) | | Certification terminated on Feb 09; OS upgrade to SL5 is slightly delayed |
| FNAL | dCache 1.9.5-23 (PNFS); Scalla xrootd 2.9.1/1.4.2-4; Oracle Lustre 1.8.3 | None | Moving unmerged pools from dCache to Lustre; deploying scalable SRM servers with DNS load balancing |
| IN2P3 | dCache 1.9.5-24 (Chimera) | Upgraded to version 1.9.5-24 on 2011-02-08 | 4-day intervention starting Feb 20 |
| KIT | dCache (admin nodes): 1.9.5-15 (Chimera), 1.9.5-24 (PNFS); dCache (pool nodes): 1.9.5-9 through 1.9.5-24 | Updated part of the dCache setup to 1.9.5-24 during GridKa downtime on 26.01.2011 | |
| NDGF | dCache 1.9.11 | None | None |
| NL-T1 | dCache 1.9.5-23 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) | | On March 8th a core router will be replaced at NIKHEF. Services at NIKHEF will be unavailable during the replacement and at risk for two days afterwards. |
| PIC | dCache 1.9.5-23 (PNFS) | | |
| RAL | CASTOR 2.1.9-6 (stagers); CASTOR 2.1.9-1 (tape servers); SRM 2.8-6 | CMS disk server upgraded to SL5 64-bit on 31/1/11 | ALICE disk server upgrades to SL5 64-bit on 15/2/11; plans for CASTOR upgrade to 2.1.10 in March |
| TRIUMF | dCache 1.9.5-21 with Chimera namespace | None | None |

CASTOR news

CERN operations

Development

xrootd news

dCache news

  • Jon: apparently the new Golden Release will be 1.9.12, but FNAL are happy with 1.9.5 and have serious concerns about such a significant upgrade during the 2011-2012 LHC run!
  • Maarten: we will discuss this offline with the dCache developers and report in the next meeting

StoRM news

FTS news

  • FTS compatibility with Oracle 11g servers will be tested at CERN in a few weeks
    • problems may show up in the DB usage optimization
    • Matt Hodges: this was tested at RAL and the performance was indeed poor
      • e-mail after the meeting: Richard Sinclair, the DBA at RAL-LCG2, investigated the problem; a deprecated parameter (commit_write) was suspected of causing the performance problems seen there.

DPM news

  • DPM 1.8.0-1 for gLite 3.2 was released to Production on Feb 9.
  • For gLite 3.1 it remains in the Staged Rollout and the memory leak appears not to be fixed.

LFC news

  • LFC 1.8.0-1 for gLite 3.2 was released to Production on Feb 9.
  • For gLite 3.1 it remains in the Staged Rollout and the memory leak appears not to be fixed.

LFC deployment

| Site | Version | OS, n-bit | Backend | Upgrade plans |
| ASGC | 1.7.4-7 | SLC5 64-bit | Oracle | None |
| BNL | 1.8.0-1 | SL5, 64-bit | Oracle | None |
| CERN | 1.7.3 | 64-bit SLC4 | Oracle | Will upgrade to SLC5 64-bit by the end of Jan or begin of Feb. |
| CNAF | 1.7.4-7 | SL5 64-bit | Oracle | |
| FNAL | N/A | | | Not deployed at Fermilab |
| IN2P3 | 1.8.0-1 | SL4 64-bit | Oracle 11g | Oracle DB migrated to 11g on Feb. 8th |
| KIT | 1.7.4 | SL5 64-bit | Oracle | |
| NDGF | 1.7.4.7-1 | Ubuntu 9.10 64-bit | MySQL | None |
| NL-T1 | 1.7.4-7 | CentOS5 64-bit | Oracle | |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle | |
| RAL | 1.7.4-7 | SL5 64-bit | Oracle | |
| TRIUMF | 1.7.3-1 | SL5 64-bit | MySQL | |

Experiment issues

WLCG Baseline Versions

Status of open GGUS tickets

GGUS - Service Now interface: update

Review of recent / open SIRs and other open service issues

Conditions data access and related services

Database services


  • 10.2.0.5 patching status: all databases are now running 10.2.0.5. No major issues were found; minor issues included:
    • OEM agents not reading the 10.2.0.5 DB alert logs properly - bug 10170020
    • Oracle bug 9184754, triggered by a workload very specific to ATLAS PANDA

  • Experiment reports:
    • ALICE:
      • Nothing to report
    • ATLAS:
      • Applied the patch for bug 9184754 on the ADCR production DB. The bug affected only the PANDA application and caused single-instance crashes every few days.
      • On Friday morning (4th of Feb), ATLAS PVSS replication aborted because five transactions violated a foreign key constraint on the target database (ATLAS offline). The offending transactions had earlier been applied on the source database (ATLAS online) without problems, in violation of the constraint, which should never happen. The constraint inconsistency in the PVSS schema is being investigated, but the root cause is not yet known. To restart replication and keep the replica consistent, the problematic transactions were applied without constraint validation.
    • CMS:
      • Nothing to report
    • LHCb:
      • Nothing to report

  • Site reports:
| Site | Status, recent changes, incidents, ... | Planned interventions |
| ASGC | | |
| BNL | Conditions database successfully upgraded to 10.2.0.5, with no issues during the upgrade; former LFC_FTS cluster reconfigured as a physical standby database (upgrades included OS RHEL5, cluster/database server 10.2.0.5, storage firmware; initially enabled on only one of the production clusters, as part of integrating Data Guard into the Oracle database operations) | Enable IPMI on all Oracle production clusters; enable Data Guard for the LFC database; decommission the TAGS database service |
| CNAF | | 16 Feb: LHCb cluster upgrade to 10.2.0.5; 2 Mar (to be confirmed): FTS DB upgrade to 10.2.0.5, plus purge of old FTS DB data and setup of the periodic cleaning job that was previously missing |
| KIT | Jan 26: Upgrade of 3D RACs (ATLAS, LHCb) to 10.2.0.5 | None |
| IN2P3 | | |
| NDGF | Nothing to report | None |
| PIC | 8th Feb: FTS database upgraded | Planning to upgrade all other databases before the end of February, but there is no exact date yet |
| RAL | The 3D, LFC, FTS and CASTOR databases have been upgraded to 10.2.0.5; in a few days we should receive the new hardware, ready for us to install Oracle and start our testing (this is the hardware that will be used for Data Guard for CASTOR and FTS/LFC) | |
| SARA | Nothing to report | No interventions |
| TRIUMF | Upgraded Oracle 3D RAC to Oracle 10.2.0.5 | None |

AOB

-- JamieShiers - 03-Feb-2011

Topic revision: r16 - 2011-02-11 - XavierMol
 