WLCG Tier1 Service Coordination Minutes - 5 May 2011

Attendance

Action list review

Release update

CondorG & CREAM - issues

Data Management & Other Tier1 Service Issues

| Site | Status | Recent changes | Planned changes |
| CERN | CASTOR 2.1.10 (all); SRM 2.10-x; xrootd: ALICE 2.1.10 update 1, others: 2.1.9-7 | | Planned move to SLC5 nodes for the Name Server (May 10). Some Oracle security patches will also be deployed in the coming days |
| ASGC | CASTOR 2.1.10-0; SRM 2.10-2; DPM 1.8.0-1 | CASTOR upgraded for ATLAS and CMS | 7/5: 8-hour downtime for DPM DB optimisation; move DPM from T2 to T1, after which all transfers from/to ASGC DPM will go via the FTS T1 channel (shared with CASTOR) |
| BNL | dCache 1.9.5-23 (PNFS, Postgres 9) | None | Transition to Chimera planned in summer 2011 |
| CNAF | StoRM 1.5.6-3 SL4 (CMS, LHCb, ALICE); StoRM 1.6 SL5 (ATLAS) | | |
| FNAL | dCache 1.9.5-23 (PNFS) httpd=1.9.5.-25; Scalla xrootd 2.9.1/1.4.2-4; Oracle Lustre 1.8.3 | none | none |
| IN2P3 | dCache 1.9.5-24 (Chimera) on all core servers and pool nodes | | |
| KIT | dCache (admin nodes): 1.9.5-15 (Chimera), 1.9.5-24 (PNFS); dCache (pool nodes): 1.9.5-9 through 1.9.5-24 | | |
| NDGF | dCache 1.9.12 | | |
| NL-T1 | dCache 1.9.5-23 (Chimera) (SARA), DPM 1.7.3 (NIKHEF) | | |
| PIC | dCache 1.9.5-25 (PNFS, Postgres 9) | | |
| RAL | CASTOR 2.1.10-0; 2.1.9-1 (tape servers); SRM 2.10-2, 2.8-6 | none | ALICE SRM upgrade to 2.10-2 on 10/5/11. Update to support T10KC tape media during May |
| TRIUMF | dCache 1.9.5-21 with Chimera namespace | | |

CASTOR news

CERN operations

Development

xrootd news

dCache news

  • SRM overwrite for dCache at SARA: dCache provides the feature, but Ron has to decide whether or not to enable it. The dCache team is currently documenting how overwrite works in dCache with respect to the tape backend. We expect this documentation to be available within days.

StoRM news

  • Checksum and recommended versions: for SL5 the only version is 1.6.2. For SL4, checksum support has been available since 1.5.*. The checksum can be calculated in two ways, which can coexist:
    • via GridFTP, on the fly
    • via the Checksummer service (on demand)
The checksum is returned whenever it is defined as a file attribute; either of the two options above can calculate it.
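The two checksum paths described above can be illustrated with a minimal Python sketch. It is not StoRM code: the Adler-32 algorithm, the attribute name `checksum.adler32`, and the functions `copy_with_checksum` and `get_checksum` are assumptions introduced only for illustration. The first function computes the checksum while the data is being written (the "on the fly" case); the second returns the stored attribute if it is defined and otherwise computes it on demand.

```python
import zlib
from typing import Callable, Dict, Iterable

# Hypothetical attribute name; the real attribute used by StoRM is not given in these minutes.
ATTR = "checksum.adler32"


def adler32_hex(data: bytes) -> str:
    """Return the Adler-32 checksum of data as 8 lowercase hex digits."""
    return f"{zlib.adler32(data) & 0xFFFFFFFF:08x}"


def copy_with_checksum(chunks: Iterable[bytes], sink: bytearray,
                       attrs: Dict[str, str]) -> None:
    """On-the-fly path: update a running checksum while each chunk is written."""
    running = 1  # Adler-32 starting value
    for chunk in chunks:
        sink.extend(chunk)
        running = zlib.adler32(chunk, running)
    attrs[ATTR] = f"{running & 0xFFFFFFFF:08x}"


def get_checksum(attrs: Dict[str, str], read_file: Callable[[], bytes]) -> str:
    """On-demand path: return the stored attribute, or compute and store it."""
    stored = attrs.get(ATTR)
    if stored is not None:
        return stored
    value = adler32_hex(read_file())
    attrs[ATTR] = value
    return value
```

Because both paths write the same attribute, they can coexist: a file transferred with on-the-fly checksumming never triggers the on-demand calculation, while a file that arrived by other means gets its checksum the first time it is requested.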

FTS news

DPM news

LFC news

  • LFC 1.8.0-1 for gLite 3.1: waiting for rebuild of the meta package with the correct VOMS libraries (1.9.10-14)

LFC deployment

| Site | Version | OS, n-bit | Backend | Upgrade plans |
| ASGC | 1.8.0-1 | SLC5 64-bit | Oracle | None |
| BNL | 1.8.0-1 | SL5, 64-bit | Oracle | None |
| CERN | 1.7.3 | 64-bit SLC4 | Oracle | Upgrade to SLC5 64-bit pending |
| CNAF | 1.7.4-7 | SL5 64-bit | Oracle | |
| FNAL | N/A | | | Not deployed at Fermilab |
| IN2P3 | 1.8.0-1 | SL5 64-bit | Oracle 11g | Oracle DB migrated to 11g on Feb. 8th |
| KIT | 1.7.4-7 | SL5 64-bit | Oracle | Oracle backend migration pending |
| NDGF | 1.7.4.7-1 | Ubuntu 9.10 64-bit | MySQL | None |
| NL-T1 | 1.7.4-7 | CentOS5 64-bit | Oracle | |
| PIC | 1.7.4-7 | SL5 64-bit | Oracle | |
| RAL | 1.7.4-7 | SL5 64-bit | Oracle | |
| TRIUMF | 1.7.3-1 | SL5 64-bit | MySQL | |

Experiment issues

WLCG Baseline Versions

FTS & overwrite mode 10'

Consistency of storage elements & LFC 15'

Status of open GGUS tickets

GGUS - Service Now interface: update

Review of recent / open SIRs and other open service issues

VOMS SL4->5 migration

Conditions data access and related services

Database services

  • Patching recommendations: the following patches have been or will be applied at CERN on top of vanilla 10.2.0.5, and are also recommended for the T1 databases:
    • 9232517 PROPAGATION MISSING MESSAGES AFTER DESTINATION QUEUE OWNERSHIP SHIFT ON RAC - highly recommended for T1s DBs
    • 9184754 SGA corruption / ORA-600 [ktcccenxt] / dump using Lobs
    • 9577583 FALSE ORA-942 OR OTHER ERRORS WITH MULTIPLE SCHEMAS HAVING IDENTICAL OBJECTS
    • 7612454 DSS PERF REGRESSIONS IN SERIAL DIRECT READS
    • 8684595 GETTING ORA-01115, 27069, WHILE RUNNING PQ, WHEN AUTOEXTEND IS ON
    • 8970313 STALE FILE CACHE IN RAC ENV AFTER TABLESPACE DROP AND RECREATE
    • 9586877 THE FIX FOR BUG 7526851 (& BUG 8494071) NEEDS REWORKING TO AVOID ORA-904 ERRORS
  • Additionally here are some important fixes for OEM agents that monitor 10.2.0.5 dbs:
    • 9282414 - 10.2.0.5 PSU 2 for OEM agent
    • 10170020 - alert_log fix for 10.2.0.5 DBs
  • Patching status:
    • All test and integration DBs have been patched
    • Production DBs will be patched next week during the technical stop.
    • One test DB and one integration DB (int11r) have already been upgraded to Oracle 11.2.0.2
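A site wanting to compare its own installation against the recommended patch list above could do so with a small script along these lines. This is only a sketch: the function `missing_patches` is hypothetical, the `SAMPLE` text is invented for illustration, and the assumption that the inventory output (for example from `opatch lsinventory`) contains lines of the form `Patch <number> : applied on ...` should be checked against the actual output at each site.

```python
import re
from typing import Iterable, List

# Database patch numbers taken from the recommendation list in these minutes.
RECOMMENDED = [
    "9232517", "9184754", "9577583", "7612454",
    "8684595", "8970313", "9586877",
]


def missing_patches(inventory_text: str, required: Iterable[str]) -> List[str]:
    """Return the required patch numbers that do not appear in the inventory text.

    Assumes each installed patch is reported on a line that starts with
    'Patch <number> :' (optionally indented).
    """
    installed = set(
        re.findall(r"^\s*Patch\s+(\d+)\s*:", inventory_text, flags=re.MULTILINE)
    )
    return [patch for patch in required if patch not in installed]


# Invented sample text, not real output from any WLCG database.
SAMPLE = """\
Patch  9232517      : applied on Tue May 03 10:00:00 CEST 2011
Patch  9184754      : applied on Tue May 03 10:05:00 CEST 2011
"""

if __name__ == "__main__":
    for patch in missing_patches(SAMPLE, RECOMMENDED):
        print(f"Recommended patch {patch} is not installed")
```

Keeping the recommended list as data separate from the parsing function makes it straightforward to extend the check to the OEM agent patches, which live in a different Oracle home and would need their own inventory text.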

  • Experiment reports:
    • ALICE:
    • ATLAS:
      • ATLASDD, which holds the ATLAS geometry data, was added to the ATLAS conditions replication to the Tier1s on Thursday 14.04 at 10:00. The operation required 2 hours of replication downtime.
      • On Monday morning (18.04), the ATLAS and LHCb replications of conditions data to the T1s were affected by a Streams process deadlock that occurred after the short weekly maintenance stop of the replication service. Clearing the lock required a restart of the downstream instances.
    • CMS:
      • The CMS offline production database got stuck on Thursday April 28th at around 23:30. Investigation showed that the hang was related to a library cache issue on node number 4 of the cluster. Restarting the node (at 0:50) resolved the situation. The root cause of the problem is still not understood. The issue affected all CMS offline applications, which could not process the load for 1 hour and 20 minutes.
    • LHCb:
    • Other:
      • A problem with the LCG integration database (int6r) started after a disk failure on 26th of April at 20:38. One hour later this normally transparent failure caused the int6r integration DB to hang for an unknown reason, following SCSI errors reported by the OS. Monitoring systems reported the problem, but no immediate reaction was required as this is not a production DB. Further analysis and recovery were performed the following morning. A database restart was required to clear the locking conditions that caused the freeze. From the user point of view, the DB was unavailable between April 26 21:45 and April 27 9 a.m.

  • Site reports:
| Site | Status, recent changes, incidents, ... | Planned interventions |
| ASGC | Castor 2.1.10 upgrade done. | None |
| BNL | Successful migration of the RAC hosting the Conditions Database hardware, the BNL Oracle Enterprise Manager hardware and the LFC-FTS Standby database cluster to a new datacenter room (04/27/11). Follow-up on the apply process and gather_stats_jobs contention observed on 04/13/11: after analyzing the log files and collected trace files and reviewing the timeline of events involving the two contending processes, it appears that Bug 3642294 is affecting gather_stats_jobs. This is being followed up with Oracle via an SR. | Apply quarterly Oracle security patches on database services. |
| CNAF | Updated a local DB (Lemon) to the latest PSUs for 10.2.0.5 without any problem. Waiting with WLCG DBs. | None |
| KIT | | Looking for a new date for the migration of the FTS/LFC RAC to new hardware. |
| IN2P3 | | |
| NDGF | | |
| PIC | Nothing to report | None |
| RAL | Nothing to report | 10th May 2011 - 09:00 and 11:00 BST (UTC+1) - Patching ATLAS conditions, LHCb conditions and LHCb LFC |
| SARA | Nothing to report (Liberation Day in the Netherlands) | On June 24 expected to upgrade to 10.2.0.5 |
| TRIUMF | Nothing to report (not participating this Thursday) | None |

AOB

-- JamieShiers - 29-Apr-2011

Topic revision: r10 - 2011-05-05 - DawidWojcik
 