WLCG Operations Coordination Minutes - VIRTUAL - January 16, 2014

Agenda

Attendance

  • Local:
  • Remote:

News

Middleware news and baseline versions

https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions

  • No change in baselines since last meeting.
  • OpenSSL issue update

Tier-1 Grid services

Storage deployment

Site | Status | Recent changes | Planned changes
CERN | CASTOR: v2.1.14-5 and SRM-2.11-2 on all instances. EOS: ALICE (EOS 0.3.4 / xrootd 3.3.4), ATLAS (EOS 0.3.4 / xrootd 3.3.4 / BeStMan2-2.2.2), CMS (EOS 0.3.2 / xrootd 3.3.4 / BeStMan2-2.2.2), LHCb (EOS 0.3.3 / xrootd 3.3.4 / BeStMan2-2.2.2) |  | 
ASGC | CASTOR 2.1.13-9, CASTOR SRM 2.11-2, DPM 1.8.7-3, xrootd 3.3.4-1 | none | none
BNL | dCache 2.6.18 (Chimera, Postgres 10 w/ hot backup); http (aria2c) and xrootd/Scalla on each pool | None | None
CNAF | StoRM 1.11.2 emi3 (ATLAS, CMS, LHCb) |  | Upgrade to StoRM 1.11.3 (ATLAS) today, Jan 16
FNAL | dCache 1.9.5-23 (PNFS, Postgres 8 with backup, distributed SRM), httpd=2.2.3; Scalla xrootd 2.9.7/3.2.7.slc; EOS 0.3.2-4 / xrootd 3.3.3-1.slc5 with BeStMan 2.2.2.0.10 |  | 
IN2P3 | dCache 2.6.18-1 (Chimera) on SL6 core servers and pool nodes, Postgres 9.2, xrootd 3.0.4 |  | 
KISTI | xrootd v3.2.6 on SL5 for disk pools, xrootd 20100510-1509_dbg on SL6 for tape pool, DPM 1.8.7-3 |  | 
KIT | dCache: atlassrm-fzk.gridka.de 2.6.17-1, cmssrm-kit.gridka.de 2.6.17-1, lhcbsrm-kit.gridka.de 2.6.17-1. xrootd: alice-tape-se.gridka.de 20100510-1509_dbg, alice-disk-se.gridka.de 3.2.6, ATLAS FAX xrootd proxy 3.3.3-1 |  | 
NDGF | dCache 2.3 (Chimera) on core servers; mix of 2.3 and 2.2 versions on pool nodes |  | 
NL-T1 | dCache 2.2.17 (Chimera) at SURFsara; DPM 1.8.7-3 at NIKHEF |  | 
PIC | dCache head nodes (Chimera) and doors at 2.2.23-1; xrootd door to VO servers (3.3.4) |  | 
RAL | CASTOR 2.1.13-9, 2.1.13-9 (tape servers), SRM 2.11-1 |  | 
TRIUMF | dCache 2.2.18 |  | 
JINR-T1 | dCache: srm-cms.jinr-t1.ru 2.6.19, srm-cms-mss.jinr-t1.ru 2.2.23 with Enstore; xrootd federation host for CMS: 3.3.3 |  | 

FTS deployment

Site | Version | Recent changes | Planned changes
CERN | 2.2.8 - transfer-fts-3.7.12-1 |  | 
ASGC | 2.2.8 - transfer-fts-3.7.12-1 |  | 
BNL | 2.2.8 - transfer-fts-3.7.10-1 | None | None
CNAF | 2.2.8 - transfer-fts-3.7.12-1 |  | 
FNAL | 2.2.8 - transfer-fts-3.7.12-1 |  | 
IN2P3 | 2.2.8 - transfer-fts-3.7.12-1 |  | 
KIT | 2.2.8 - transfer-fts-3.7.12-1 |  | 
NDGF | 2.2.8 - transfer-fts-3.7.12-1 |  | 
NL-T1 | 2.2.8 - transfer-fts-3.7.12-1 |  | 
PIC | 2.2.8 - transfer-fts-3.7.12-1 |  | 
RAL | 2.2.8 - transfer-fts-3.7.12-1 |  | 
TRIUMF | 2.2.8 - transfer-fts-3.7.12-1 |  | 
JINR-T1 | 2.2.8 - transfer-fts-3.7.12-1 |  | 

LFC deployment

Site | Version | OS, distribution | Backend | WLCG VOs | Upgrade plans
BNL | 1.8.3.1-1 (for T1 and US T2s) | SL6, gLite | Oracle 11gR2 | ATLAS | None
CERN | 1.8.7-3 | SLC6, EPEL | Oracle 11 | ATLAS, LHCb, OPS, ATLAS Xroot federations | 

Other site news

Experiments operations review and Plans

ALICE

  • besides a reprocessing campaign on Dec 27-30, there was not much activity during the break
  • central services moved to new HW on Jan 14-15
    • also more efficient power/UPS distribution and cooling
    • Torrent no longer available
  • CERN
    • high error rates after central services restart, to be debugged
    • SLC6 vs. SLC5 job failure rates and CPU/wall-time efficiencies
      • Wigner jobs have been running with Torrent since Dec 20 to test whether SL6 builds behave better than the SL5 builds used via CVMFS: no improvement
      • to be continued...
  • CVMFS
    • almost all sites using it in production
    • a few in various stages of preparation
    • sites: please ensure the WNs have version 2.1.15 (or higher); a simple version check is sketched after this list
  • SAM (reminder)
    • ALICE sites that care about the SAM Availability and Reliability reports should ensure they look OK in the ALICE SAM tests, which will be used instead of the Ops tests as of Jan 2014
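For illustration only (not part of the minutes): a minimal sketch of how a site could check the CVMFS client version against the 2.1.15 baseline requested above. It assumes an RPM-based worker node with the cvmfs package installed; tarball installations would need a different check.

    # Minimal sketch: compare the installed cvmfs RPM version against the
    # 2.1.15 baseline requested by ALICE. Assumes an RPM-based worker node.
    import subprocess

    REQUIRED = (2, 1, 15)

    def installed_cvmfs_version():
        # Query the version of the installed cvmfs package, e.g. "2.1.15".
        out = subprocess.check_output(["rpm", "-q", "--qf", "%{VERSION}", "cvmfs"])
        return tuple(int(part) for part in out.decode().strip().split("."))

    if __name__ == "__main__":
        version = installed_cvmfs_version()
        status = "OK" if version >= REQUIRED else "older than 2.1.15, please upgrade"
        print("CVMFS %s: %s" % (".".join(map(str, version)), status))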

ATLAS

  • Restarting the push for FTS3: adding more sites to the RAL FTS3 instance
  • The SAM CVMFS probe is being integrated: issues are now being sorted out in pre-production
  • MultiCore production has started on a limited number of sites, which are now working on their internal scheduling configuration (as part of the MultiCore task force)

CMS

  • Processing / Production over Xmas break
    • Went reasonably fine
    • Best effort from sites and experiment teams

  • Filling Tier-1s with real and backfill work
    • Try to run processing machinery under load
    • Identify bottlenecks
    • Want to extend backfill at Tier-2 level to generate full load
    • Needed to start exercising for next year

  • CMS file catalog to be migrated from DBS2 to DBS3
    • Preliminarily scheduled for the week of Feb 10
    • Will take several days
    • Production needs to be drained during the week before

  • Reminder: CMS intends to make the gLExec SAM test critical ~ end of January 2014

LHCb

  • Low production activity at the moment with simulation productions only
  • pA/Ap reprocessing campaign started before Xmas and finished in the last days of the Xmas break
  • Working on decommissioning the WMS for the last remaining sites (only small sites)

Ongoing Task Forces and Working Groups

FTS3 Deployment TF

  • Resumed performance tests after holiday break.

gLExec deployment TF

  • 69 tickets closed and verified, 26 still open
  • EMI gLExec probe (in use since SAM Update 22) crashes on sites that use the tar ball WN and do not have the Perl module Time/HiRes.pm installed (GGUS:98767)
    • installation of that dependency now looks sufficient to cure the problem
    • a proper fix is still to be decided
  • Deployment tracking page

IPv6 validation and deployment TF

  • Job submission to a CE via gLite WMS, with both sides communicating over IPv6, was successfully tested
  • A similar test via Condor-G failed, due to a bug in HTCondor, which uses a rather old version of CREAM (condor-8.1.3/externals/bundles/cream/1.12.1_14). Waiting for feedback from the developers
  • CMS tests
    • The next steps are to get a BeStMan SE and an OSG CE to run data transfer and job submission tests respectively. If successful, will proceed to test glideinWMS in dual stack to submit jobs to IPv6-only CEs and WNs. Most central services do not need to be tested on IPv6 for the time being (typically those services that do not directly interact with site resources).

Machine/Job features TF

  • the mjf.py client has been installed on SLC5 (bare metal) and SLC6 (virtualised) CERN worker nodes, and the machine/job features are being fed to those nodes (a minimal reader is sketched after this list)
  • an RPM for the client is available in the WLCG repository
  • VOs are asked to try these out and report any problems/issues discovered, within a month if possible
    • at a later stage the service and client will be rolled out to more sites
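For illustration only (not part of the minutes): under the HEPiX machine/job features convention, the MACHINEFEATURES and JOBFEATURES environment variables point to directories containing one small file per key. The sketch below shows how a VO payload could read whatever keys a node publishes; no particular key names are assumed.

    # Minimal sketch: read machine/job features on a node where they are
    # provided, assuming MACHINEFEATURES and JOBFEATURES point to directories
    # with one value per file (HEPiX machine/job features convention).
    import os

    def read_features(env_var):
        directory = os.environ.get(env_var)
        if not directory or not os.path.isdir(directory):
            return {}  # features not published on this node
        features = {}
        for key in os.listdir(directory):
            path = os.path.join(directory, key)
            if os.path.isfile(path):
                with open(path) as handle:
                    features[key] = handle.read().strip()
        return features

    if __name__ == "__main__":
        print("machine features:", read_features("MACHINEFEATURES"))
        print("job features:", read_features("JOBFEATURES"))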

Middleware readiness WG

  • the next meeting is planned for Feb 6 at 16:00 CET
    • it will discuss in particular how to involve experiments and sites

Multicore deployment

  • The call for the first meeting has been sent to TF participants. We will start next week; the date is yet to be confirmed (see the doodle)
    • we would like to dedicate the first meeting to a summary of the current situation, which is so far dominated by the ATLAS and CMS experiences
    • we will also agree on the organization of subsequent meetings, in particular their frequency: probably one meeting per week initially, given the fast pace of the ongoing activities, then one every two weeks

perfSONAR deployment TF

  • all remaining sites (indicated by the experiments) that have not deployed perfSONAR have been ticketed; progress is slow and there are many negative answers (hardware not available)
  • a perfSONAR-lite installation is being discussed with Internet2/ESnet
  • work ongoing to decide how to replace the perfSONAR modular dashboard
  • almost all sites opened access to the perfSONAR main page

SHA-2 Migration TF

  • very few services monitored by EGI are not yet ready for SHA-2 (a simple per-service check is sketched after this list)
  • FNAL intends to be ready by mid or late Feb
    • OSG CA switch to SHA-2 postponed to Feb 11 on request of CMS
  • EOS SRM instances
    • OK for ATLAS and CMS
    • not yet OK for LHCb: a patch will need to be ported to let the new SRM support the root protocol expected by LHCb jobs
      • OSG BeStMan support ticket opened
  • the CERN CA has delayed making SHA-2 the default to Feb 11 as well
    • SHA-1 remains available until June
  • the UK CA intends to switch mid March
  • the French CA has switched on Jan 14
  • VOMRS
    • the VOMS-Admin instability has been fixed (Java 7 --> 6)
      • many thanks to Valerio Venturi (RIP) for debugging this!
    • the VOMS-Admin test cluster is being reinstalled and should become available in the coming days
  • timelines
    • by mid Feb the WLCG infrastructure is expected to be essentially ready
    • the migration from VOMRS to VOMS-Admin may take longer
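As an aside (not part of the minutes), the sketch below shows one way to check whether a given service already presents a SHA-2-signed host certificate. It assumes the third-party Python "cryptography" package is available, and the host name is a placeholder.

    # Minimal sketch: report the signature hash algorithm of a service's host
    # certificate, to distinguish SHA-1 from SHA-2 (sha256/sha384/sha512).
    # Requires the third-party "cryptography" package; the host is a placeholder.
    import ssl
    from cryptography import x509

    def signature_hash(host, port=443):
        pem = ssl.get_server_certificate((host, port))
        cert = x509.load_pem_x509_certificate(pem.encode())
        return cert.signature_hash_algorithm.name  # e.g. "sha1" or "sha256"

    if __name__ == "__main__":
        algorithm = signature_hash("se.example.org")  # placeholder host name
        sha2 = algorithm in ("sha224", "sha256", "sha384", "sha512")
        print("signature hash: %s (%s)" % (algorithm, "SHA-2" if sha2 else "not SHA-2"))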

Tracking tools evolution TF

WMS decommissioning TF

  • usage of the CMS WMS at CERN has remained low in recent weeks
  • VOs using the "shared" WMS instances will be contacted for migration plans

xrootd deployment TF

Action list

  1. Investigate how to separate Disk and Tape services in GOCDB
    • proposal submitted via GGUS:93966
    • in progress - ticket updated
  2. Agree with IT-PES on a clear timeline to migrate OPS and the LHC VOs from VOMRS to VOMS-Admin
    • in progress
  3. Collect feedback from VOs about need for grid-cert-info and setting EMI-UI 2.0.3 as baseline.
    • closed

It was agreed that the last action item can just be closed.

-- AndreaSciaba - 13 Jan 2014
