Worker Node testing for WLCG

  • Note: write access for external collaborators can be obtained here.

Introduction

As of mid-September 2012, most of the WLCG sites in EGI are still running the old gLite 3.2 WN version on their worker nodes, despite various issues:

  • The old GFAL/lcg_util code has known bugs that are only fixed in EMI/UMD releases of the WN.
  • New products like GFAL2 and features like Xrootd support and federation are not getting real exposure in the production environment.
  • Developers who implemented new features (often at our request) may become unavailable once the EMI project has ended.
  • It becomes hard to maintain the old build infrastructure and expertise for security patches, should they be needed.
  • Even though the old code may be "good enough" for current usage by ATLAS and CMS, it certainly is not for the many other VOs that most EGI sites need to support.

ALICE and LHCb are much less affected, at least on SL5, because their jobs bring essentially everything they need with them. For the port to SL6, LHCb will also benefit from corresponding test queues.

In the spring of 2012 an initiative was launched to get the EMI-1/UMD-1 WN validated by ATLAS and CMS on a set of sites that together cover all of the relevant SE types:

  • BeStMan (as part of EOS)
  • CASTOR
  • dCache
  • DPM
  • EOS
  • StoRM

Due to other activities with higher priority at that time, the validation was only partially completed, allowing e.g. CNAF and a few CMS T2 sites to move their WNs to the EMI-1/UMD-1 release.

We now need to restart this activity and keep testing further WN updates regularly, so that we discover early if a particular update breaks an experiment workflow.

The testing would be done through HammerCloud. Participating sites would set up small, essentially permanent test queues for the experiments they support and apply WN updates (automatically?) as they appear in the EMI-2 testing repository.

Meanwhile the EMI-2/UMD-2 WN has been released, and it has a much longer lifetime than the release tested earlier, so we should concentrate on it now.

The OS will be mainly SL5 for the time being.

Sites are welcome to join this effort!

Participating sites and queues

| SE type | VOs | Site | CE + queue name | WN version | ATLAS status | CMS status | LHCb status |
|---|---|---|---|---|---|---|---|
| CASTOR | atlas, cms | RAL | lcgce03.gridpp.rl.ac.uk:8443/cream-pbs-gridTest | EMI-WN 2.0.0 | | | |
| | | | lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-gridTest | | | | |
| | | | lcgce07.gridpp.rl.ac.uk:8443/cream-pbs-gridTest | | | | |
| | | | lcgce08.gridpp.rl.ac.uk:8443/cream-pbs-gridTest | | | | |
| | | | lcgce09.gridpp.rl.ac.uk:8443/cream-pbs-gridTest | | | | |
| dCache | atlas, cms | DESY | grid-cr2.desy.de:8443/cream-pbs-emi2-sl6 | EMI-WN 2.0.0 (SL6) | | | |
| dCache | atlas | TRIUMF | ce1.triumf.ca:8443/cream-pbs-test | EMI-WN 2.0.0 | | | |
| DPM | atlas, cms, lhcb | Brunel | dc2-grid-65.brunel.ac.uk:8443/cream-pbs-atlas | EMI-WN 2.0.0 (SL6) | | | |
| | | | dc2-grid-65.brunel.ac.uk:8443/cream-pbs-cms | | | | |
| | | | dc2-grid-65.brunel.ac.uk:8443/cream-pbs-lhcb | | | | |
| DPM | atlas, lhcb | Liverpool | hepgrid5.ph.liv.ac.uk:8443/cream-pbs-long | EMI-WN 2.0.0 | | | |
| DPM | atlas, lhcb | Manchester | vm3.tier2.hep.manchester.ac.uk:8443/cream-pbs-long | EMI-WN 2.2.0 | | | |
| DPM | atlas, cms | Oxford | t2ce02.physics.ox.ac.uk:8443/cream-pbs-shortfive | EMI-WN 2.0.0 | | | |
| | | | t2ce02.physics.ox.ac.uk:8443/cream-pbs-mediumfive | | | | |
| | | | t2ce02.physics.ox.ac.uk:8443/cream-pbs-longfive | | | | |
| StoRM | atlas, cms | CNAF | ce03-lcg.cr.cnaf.infn.it:8443/cream-lsf-emitest | EMI-WN 2.0.0 | | | |
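
The "CE + queue name" entries above all have the shape host:port/cream-<batch system>-<queue>. For anyone scripting against this list, the following is a minimal sketch of splitting such a string into its parts. It assumes that every entry follows this shape and that the batch-system name contains no hyphen; parse_ce_id and CreamQueue are hypothetical names introduced here for illustration and are not part of any grid middleware.

```python
import re
from typing import NamedTuple


class CreamQueue(NamedTuple):
    """Parts of a CE endpoint string as listed in the table (illustrative type)."""
    host: str
    port: int
    lrms: str   # batch system, e.g. "pbs" or "lsf"
    queue: str


# Assumed shape: <host>:<port>/cream-<lrms>-<queue>
# The queue part may itself contain hyphens (e.g. "emi2-sl6").
_CE_ID = re.compile(
    r"^(?P<host>[^:/\s]+):(?P<port>\d+)/cream-(?P<lrms>[^-\s]+)-(?P<queue>\S+)$"
)


def parse_ce_id(ce_id: str) -> CreamQueue:
    """Split a 'host:port/cream-lrms-queue' string; raise ValueError otherwise."""
    match = _CE_ID.match(ce_id.strip())
    if match is None:
        raise ValueError(f"unexpected CE endpoint format: {ce_id!r}")
    return CreamQueue(
        host=match["host"],
        port=int(match["port"]),
        lrms=match["lrms"],
        queue=match["queue"],
    )


if __name__ == "__main__":
    print(parse_ce_id("grid-cr2.desy.de:8443/cream-pbs-emi2-sl6"))
    print(parse_ce_id("ce03-lcg.cr.cnaf.infn.it:8443/cream-lsf-emitest"))
```

Only the text after "cream-" is split on the first hyphen, so host names that contain hyphens (such as ce03-lcg.cr.cnaf.infn.it) and queue names that contain hyphens (such as emi2-sl6) are kept intact.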

ATLAS test details

CMS test details

Summary of fixes to data management components

The latest EMI-2 update contains fixes for all known issues related to GFAL/lcg_util and the DPM/LFC clients.

Result tables (match your site here!)

| EMI-2 SL5 | ATLAS | CMS | LHCb | ALICE |
|---|---|---|---|---|
| CASTOR | OK | OK | OK | OK |
| dCache | NOTE | OK | OK | OK |
| DPM | OK | OK | OK | OK |
| EOS | OK | OK | OK | OK |
| StoRM | OK | OK | OK | OK |

  • NOTE: ATLAS found gsidcap access failing for limited (WN) proxies and opened GGUS:87065 for the dCache developers.
    • Fixed in EMI-2 Update 6 released Nov 26.
    • CMS have also seen this issue, but currently no CMS site is using that protocol.
    • For ATLAS sites where only plain dcap is used, the Oct release was already OK.
  • CMS workaround for DPM sites documented here.

| EMI-2 SL6 | ATLAS | CMS | LHCb | ALICE |
|---|---|---|---|---|
| CASTOR | | | OK | OK |
| dCache | | OK | OK | OK |
| DPM | | OK | OK | OK |
| EOS | | | OK | OK |
| StoRM | | OK | OK | OK |

  • See aforementioned CMS workaround for DPM sites.
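
To match a site against the EMI-2 SL6 table, look up the row for its SE type and read off the columns for the VOs it supports. The sketch below encodes that table as a dictionary and does the lookup. The values are copied from the table; reading a blank cell as "no result reported" is an assumption, and site_status is a hypothetical helper name used only for this illustration.

```python
# EMI-2 SL6 results copied from the table above.
# None marks a blank cell; treating it as "no result reported" is an assumption.
EMI2_SL6 = {
    "CASTOR": {"atlas": None, "cms": None, "lhcb": "OK", "alice": "OK"},
    "dCache": {"atlas": None, "cms": "OK", "lhcb": "OK", "alice": "OK"},
    "DPM":    {"atlas": None, "cms": "OK", "lhcb": "OK", "alice": "OK"},
    "EOS":    {"atlas": None, "cms": None, "lhcb": "OK", "alice": "OK"},
    "StoRM":  {"atlas": None, "cms": "OK", "lhcb": "OK", "alice": "OK"},
}


def site_status(se_type, vos, results=EMI2_SL6):
    """Return the table entry for each VO a site supports, given its SE type."""
    row = results[se_type]  # KeyError if the SE type is not in the table
    return {vo: row.get(vo) or "no result reported" for vo in vos}


if __name__ == "__main__":
    # Example: a DPM site supporting atlas, cms and lhcb (as Brunel does above).
    for vo, status in site_status("DPM", ["atlas", "cms", "lhcb"]).items():
        print(f"{vo}: {status}")
```

The same function can be pointed at the other result tables by passing a different dictionary as the results argument.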

| EMI-1 SL5 | ATLAS | CMS | LHCb | ALICE |
|---|---|---|---|---|
| CASTOR | | OK | OK | OK |
| dCache | | OK | OK | OK |
| DPM | OK | OK | OK | OK |
| StoRM | OK | OK | OK | OK |

  • Note: with EMI-1 an upgrade to lcg_util 1.13.9 may still be needed.
  • See aforementioned CMS workaround for DPM sites.
Topic revision: r34 - 2013-01-22 - ChristophWissing
 