WLCG Tier1 Service Coordination Minutes - 1 September 2011

Attendance

  • Local:

  • Remote:

Action list review

Release update

Data Management & Other Tier1 Service Issues

Site Status Recent changes Planned changes
CERN CASTOR 2.1.11-2 (SL5); SRM 2.10-x (SL4); xrootd: 2.1.11-1
FTS SL4 3.2.1 (i.e. old)
EOS -0.1.0/xrootd-3.0.4
All instances are now on the latest CASTOR version. All of them use the Transfer Manager for internal scheduling (LSF replacement). The end-of-August technical stop was used to perform several DB maintenance interventions (security patches applied on all, including the Name Server; CMS stager DB moved to new, better-performing hardware; ATLAS stager DB defragmented).

Complete the migration to new hardware (all instances but CMS, and the Name Server)

Resource move (disk space) from CASTOR to EOS is being planned (time scale: October 2011)

ASGC CASTOR 2.1.11-2
SRM 2.11-0
DPM 1.8.0-1
29-30/8: downtime due to CASTOR upgrade None
BNL dCache 1.9.5-23 (PNFS, Postgres 9) None Transition to Chimera during next TS (Nov)
CNAF StoRM 1.7.0    
FNAL dCache 1.9.5-23 (PNFS) httpd=1.9.5-25
Scalla xrootd 2.9.1/1.4.2-4
Oracle Lustre 1.8.3
   
IN2P3 dCache 1.9.5-26 (Chimera) on core servers. Mix of 1.9.5-24 and 1.9.5-26 on pool nodes    
KIT dCache (admin nodes): 1.9.5-25 (ATLAS, Chimera), 1.9.5-26 (CMS, Chimera), 1.9.5-26 (LHCb, PNFS)
dCache (pool nodes): 1.9.5-9 through 1.9.5-26
   
NDGF dCache 1.9.12    
NL-T1 dCache 1.9.5-23 (Chimera) (SARA), DPM 1.7.3 (NIKHEF)    
PIC dCache 1.9.12-8; PNFS, Postgres 9.0   13/9: upgrade to dCache 1.9.12-10
RAL CASTOR 2.1.10-2
2.1.10-0 (tape servers)
SRM 2.10-0
  7/9: will apply DB patches to 3D ATLAS and LHCb, FTS and LFC. Services will be "at risk"
TRIUMF dCache 1.9.5-21 with Chimera namespace None None

Other site news

  • CERN: the CERN-T2 FTS had its agents upgraded to SLC5 (IT SSB post).

CASTOR news

CERN operations and development

The Transfer Manager is now in production on all CERN instances. It shows all the expected advantages compared to the LSF-based scheduling (lightweight and efficient scheduling, less overall complexity). Residual issues are reasonably under control and are worked around satisfactorily.

Resource movement (disk space) from CASTOR to EOS is being planned (time scale: October 2011)

EOS news

xrootd news

dCache news

StoRM news

EMI-1_Update-6 contains an updated version of StoRM (see EMI1-Update6 link) and should be available tomorrow. After that, StoRM v1.7.1 can be installed by EA (EGI).

FTS news

DPM news

LFC news

LFC deployment

Site Version OS, n-bit Backend Upgrade plans
ASGC 1.8.0-1 SLC5 64-bit Oracle None
BNL 1.8.0-1 SL5, 64-bit Oracle None
CERN 1.7.3 64-bit SLC4 Oracle Upgrade to SLC5 64-bit pending
CNAF 1.7.4-7 (ATLAS, to be decommissioned)
1.8.0-1 (LHCb, recently updated)
SL5 64-bit Oracle  
FNAL N/A     Not deployed at Fermilab
IN2P3 1.8.0-1 SL5 64-bit Oracle 11g Oracle DB migrated to 11g on Feb. 8th
KIT 1.7.4-7 SL5 64-bit Oracle Oracle backend migration pending
NDGF 1.7.4.7-1 Ubuntu 9.10 64-bit MySQL None
NL-T1 1.7.4-7 CentOS5 64-bit Oracle  
PIC 1.7.4-7 SL5 64-bit Oracle  
RAL 1.7.4-7 SL5 64-bit Oracle  
TRIUMF 1.7.3-1 SL5 64-bit MySQL  

Experiment issues

WLCG Baseline Versions

GGUS issues

Two tickets concern VOMS bugs that were revealed after a recent, supposedly transparent upgrade and that cause serious problems with the Oracle middleman process and/or with obtaining VOMS proxies. After 2 weeks of idle time at the DMSU, the tickets are being closely followed up by the developers, starting today.

Review of recent / open SIRs and other open service issues

Conditions data access and related services

Database services

  • Generic reports:
    • Many different small problems related to storage and Streams over the last 2 months. All resolved by now.
    • All 10.2.0.5 CERN DBs upgraded with the Oracle July CPU, including the downstream capture boxes. No issues found except for the ATLAS downstream capture issue (see below), which is not patch related.
    • After an intervention (applying security patches) on the ATLAS downstream database (ATLDSC) on 31 August, replication of ATLAS conditions to the T1 sites aborted. The root cause was most probably misbehaviour of the Streams processes during the replication restart, which caused some transactions to be lost and consequently a Streams inconsistency. All Streams setups were manually resynchronized by reconfiguring the capture process and recreating the apply processes. Replication of ATLAS conditions was fully restored by 16:30.

  • Site reports:
Site Status, recent changes, incidents, ... Planned interventions
BNL -- Oracle database services at BNL were turned off and the related hardware powered off last Saturday as part of the preventive shutdown of the BNL Tier 1 due to critical weather conditions.
ATLAS Conditions database - Streams propagation and apply processes and the database service were disabled on Saturday at ~11:47 AM EDT and re-enabled on Monday 29 at ~10:34 AM EDT.
BNL LFC and FTS, US ATLAS Tier 3 LFC databases - disabled at ~11:15 AM EDT Saturday and re-enabled at 10:00 AM EDT Monday.
Standby database for BNL LFC and FTS, US ATLAS Tier 3 LFC databases - disabled at ~11:30 EDT Saturday and re-enabled at 11:00 AM EDT Monday.
VOMS and Priority Stager - disabled at ~11:15 AM EDT Saturday and re-enabled at 10:05 AM EDT Monday.
So far no problems (hardware, database or replication) have been observed after re-enabling the database services.
Activities:
-- Migrated VOMS and Priority Stager to newer disk storage.
-- Oracle diagnosed, in Service Requests 3-3535183751 and 3-3396390491, a possible cause of the contention between the apply process and the gather_stats job reported on 04/13/11 and 06/15/11. Oracle attributes the problem to bug 6011045, which affects DBMS_STATS and causes a deadlock between 'cursor: pin S wait on X' and 'library cache lock'. Oracle provided a patch (P6011045) and recommends applying it to fix the bug. The patch appears to be rolling.
-- To apply CPU patches, initially to the Conditions Database, together with the patch proposed by Oracle (P6011045).
-- To relocate the LFC and FTS cluster to a new datacenter room. This intervention has been postponed until the next Technical Stop (November) and will be scheduled in conjunction with maintenance of other services (dCache) to minimize overall disruption of site services.
CNAF    
KIT Nothing to report None
IN2P3 On 18 July 2011, hard disks failed, causing corruption at several levels (control file, undo, redo logs, archived logs and datafiles). The only way to recover was to apply an incomplete restore to all databases. Streaming was restored together with the CERN DBAs. The storage team is in contact with Pillar support to understand the root cause of the problem. Note that all LUNs presented to ASM are in RAID5, so hard disk failures should not have impacted Oracle.
CPU patching schedule:
- DBLHCB on 6 Sept 2011
- DBATL on 7 Sept 2011
- DBAMI on 8 Sept 2011

We may foresee a downtime on 20 Sept for a microcode upgrade on the FC switches and the disk array. This intervention requires shutting down all 3D databases for around 12 hours. We are waiting for confirmation from ATLAS France before fixing a date. To be scheduled.
NDGF Nothing to report None
PIC   - Complete stop of ATLAS' database.
- Apply the July CPU on the remaining instances (possibly next week)
RAL    
SARA Nothing to report (not attending) No interventions
TRIUMF Nothing to report No interventions

AOB

-- AndreaSciaba - 30-Aug-2011

Topic revision: r9 - 2011-09-01 - MariaDimou
 