WLCG Operations Planning - November 7, 2013 - minutes

Agenda

Attendance

  • Local: Simone Campana (chair), Andrea Sciabà, Stefan Roiser, Nicolò Magini, Michail Salichos, Oliver Keeble, Pablo Saiz, Xavier Espinal, Ikuo Ueda, Maite Barroso, Alessandro Di Girolamo, Andrea Valassi, Maria Dimou, Helge Meinhard, Alberto Aimar, Julia Andreeva, Alexandre Lossent, Domenico Giordano, Markus Schulz, Borut Kersevan, Massimo Lamanna, Ulrich Schwickerath, Maarten Litmaath
  • Remote: Massimo Sgaravatto, Alessandra Forti, Rob Quick, Ron Trompert, Christoph Wissing, Hung-Te Lee, Pepe Flix, Jeremy Coles, Di Qing, Frederique Chollet, Shawn Mc Kee, Alessandro Cavalli, Francesco Sciacca, Vanessa Hamar, Gareth Smith, Michel Jouvin, Andreas Petzold, Burt Holzman

Agenda items

News

For the details see the slides.

Simone proposes to remove the data access task force (mostly related to putting the Tier-1 WNs in the OPN) from our future plans, keeping only a small survey to understand where we stand. Similarly, the (future) task force on dynamic data placement should be removed from the list, as the functionality is either in production (ATLAS) or in very early development (CMS and LHCb); it might be rediscussed in the future if needed. Both proposals are accepted.

Experiments Plans

ALICE

  • CVMFS
    • 50 sites using it already, ~30 in various stages of preparation
    • revised deadline: end of this year!
  • CERN
    • SLC6 job efficiencies lower and failure rates higher than for SLC5
      • to be debugged very soon...
  • SAM
    • rationalization of the tests
      • discussed with SAM team
      • proof of concept tested - let MonALISA send selected test results to the message bus (a minimal messaging sketch follows after this list)
        • XRootD
        • VOBOX
      • will also allow NDGF to appear as a T1 in SUM etc.
    • to be continued
  • gLExec
    • adaptations of AliEn to support gLExec are planned
      • no estimates yet
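
The message-bus proof of concept mentioned above can, at its simplest, amount to publishing a test result to a broker topic. The sketch below uses raw STOMP 1.0 frames over a plain socket; the broker host, credentials, topic name and message format are all placeholders, not the actual MonALISA/SAM integration.

    #!/usr/bin/env python
    # Minimal illustration: publish one test result to a message broker topic
    # using raw STOMP 1.0 frames. All names below are placeholders.
    import socket

    BROKER, PORT = "broker.example.cern.ch", 61613
    USER, PASSWORD = "alice-probe", "secret"
    DESTINATION = "/topic/alice.sam.results"
    BODY = "site=CERN\nservice=XRootD\nmetric=read-test\nstatus=OK\n"

    conn = socket.create_connection((BROKER, PORT), timeout=30)
    conn.sendall("CONNECT\nlogin:%s\npasscode:%s\n\n\x00" % (USER, PASSWORD))
    reply = conn.recv(4096)              # expect a CONNECTED frame from the broker
    assert reply.startswith("CONNECTED"), reply
    conn.sendall("SEND\ndestination:%s\n\n%s\x00" % (DESTINATION, BODY))
    conn.sendall("DISCONNECT\n\n\x00")
    conn.close()
    print "published test result to", DESTINATION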

Helge says PES will put effort into understanding the job efficiency issues but will need help from the people affected. ATLAS will also contribute.

ATLAS

For details see the slides.

  • Alessandro presents the production activities foreseen for the next months: the main one will be a full reprocessing campaign starting in April 2014.
  • All ATLAS sites are encouraged to move to CVMFS 2.1. He also points out a recurrent problem with WNs having stale CVMFS data: it would be important to have automatic checks at sites to spot such problems, and he argues that this is relevant for all the experiments.
  • ATLAS is taking the final steps to get rid of the shared NFS area: it should take a few more weeks.
  • The Rucio renaming requires WebDAV to be enabled, but only 50% of the sites have it; the rest are being tracked via tickets by ATLAS (a minimal endpoint check is sketched at the end of this section)
  • Sites are welcome to volunteer for testing Rucio
  • A proposal to use HammerCloud to "benchmark" sites as a means to estimate the HS06 for public clouds: possibly of general interest in WLCG
  • A proposal to work together among experiments to converge on a multicore deployment strategy

Simone proposes to have dedicated discussions of the last two points at a future operations coordination meeting.
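
Regarding the WebDAV requirement for the Rucio renaming mentioned above, the sketch below shows how one could check whether a storage endpoint answers WebDAV requests with a grid proxy. It is only an illustration, not an official ATLAS tool: the host, port and path are placeholders, and the stock Python 2 of an SL5/SL6 node is assumed.

    #!/usr/bin/env python
    # Hypothetical check: does the storage endpoint answer WebDAV (PROPFIND)
    # requests with a grid proxy? Host, port and path are placeholders.
    import httplib
    import os

    HOST, PORT = "webdav-door.example-site.org", 443
    PATH = "/dpm/example-site.org/home/atlas/"
    PROXY = os.environ.get("X509_USER_PROXY", "/tmp/x509up_u%d" % os.getuid())

    conn = httplib.HTTPSConnection(HOST, PORT, key_file=PROXY, cert_file=PROXY)
    conn.request("PROPFIND", PATH, headers={"Depth": "1"})
    resp = conn.getresponse()
    # 207 Multi-Status means the WebDAV interface is usable;
    # 4xx/5xx or a connection error means it is not (yet) enabled.
    print resp.status, resp.reason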

CMS

  • CVMFS Migration - close to completion
    • Stopped sending installation jobs, with very few exceptions
    • Sites 'allowed' to run their own local deployment (cron-based)
      • In Q2/2014 CVMFS becomes mandatory - software might require/expect something in CVMFS
    • Remaining Sites:
      • CERN: Client ready, but not used yet by CMS
      • Tier-1: FNAL, cron-based installs - plans to migrate before the end of this year
      • Tier-2s: T2_IT_Pisa (half done), T2_IT_Bari (to be done with the SL6 migration, ~now), T2_UK_London_IC, T2_IN_India, T2_PK_NCP, 2 out of 6 Russian T2s
      • Tier-3s: ~80% done, getting in contact with the remaining ones

Christoph comments that CMS sees the same problems as ATLAS with CVMFS stale data on some WNs.
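
Since both ATLAS and CMS report worker nodes serving stale CVMFS data, an automatic site-level check could look roughly like the sketch below. This is only a sketch, assuming Python 2.7, the attr utility, the 'revision' extended attribute exposed by the CVMFS client and an example stratum-1 URL to be adapted to the site; it is not an agreed WLCG probe.

    #!/usr/bin/env python
    # Sketch of a staleness check: compare the CVMFS revision mounted on this
    # WN with the revision currently published by a stratum-1 server.
    import subprocess
    import urllib2

    REPO = "atlas.cern.ch"                          # repository to check
    STRATUM1 = "http://cvmfs-stratum-one.cern.ch"   # example stratum-1, adjust as needed

    def mounted_revision(repo):
        """Revision of the catalog currently mounted, via the CVMFS xattr."""
        out = subprocess.check_output(["attr", "-q", "-g", "revision", "/cvmfs/" + repo])
        return int(out.strip())

    def published_revision(repo):
        """Revision advertised in .cvmfspublished on the stratum-1 ('S' field)."""
        url = "%s/cvmfs/%s/.cvmfspublished" % (STRATUM1, repo)
        header = urllib2.urlopen(url).read().split("--")[0]
        for line in header.splitlines():
            if line.startswith("S"):
                return int(line[1:])
        raise RuntimeError("no revision field found in " + url)

    local, remote = mounted_revision(REPO), published_revision(REPO)
    print "mounted: %d, published: %d" % (local, remote)
    if remote - local > 2:   # tolerate a small catalog-TTL lag
        raise SystemExit("WARNING: CVMFS data on this WN looks stale")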

  • SL6 Migration
    • WLCG task force (almost) completed
    • CMS production still based on SL5 builds
      • SL6 native builds existing for recent releases
      • Switch of production architecture to be defined (likely after ~full CERN/FNAL migration)
    • CERN: fraction of resources on SL6
    • Tier-1s: Done except FNAL, planning to switch in November (tbc)
    • Tier-2s: 10 still on SL5, 6 with both (varying fraction)
    • Tier-3s: About 50/50

  • Multi-core
    • "Dynamic partitioning" ready and (small scale) tested
      • Pilots allocate multiple cores and execute several single-threaded jobs (a minimal sketch follows after this list)
    • Want to enlarge the usage
      • Sites are invited to offer multi-core queues
      • ~50 multi-core slots (corresponding to several hundred single-core slots) would be good for a start
      • No dedicated request yet
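
For illustration, the "dynamic partitioning" idea of one multi-core pilot keeping several single-threaded payloads running can be sketched as below. This is not the actual glideinWMS pilot code; the PILOT_CORES variable and the payload command are placeholders.

    #!/usr/bin/env python
    # Illustrative sketch only: a pilot that is handed N cores by the batch
    # system and keeps them busy with independent single-threaded payloads.
    import multiprocessing
    import os
    import subprocess

    def run_payload(job_id):
        """Run one single-threaded payload (placeholder command)."""
        return subprocess.call(["/bin/sh", "-c", "echo payload %d; sleep 10" % job_id])

    if __name__ == "__main__":
        # Number of cores given to the pilot, e.g. via a batch system variable
        ncores = int(os.environ.get("PILOT_CORES", multiprocessing.cpu_count()))
        pool = multiprocessing.Pool(processes=ncores)
        # Fetch more payloads than cores; the pool keeps one running per core
        results = pool.map(run_payload, range(4 * ncores))
        failures = sum(1 for rc in results if rc != 0)
        print "ran %d payloads, %d failed" % (len(results), failures)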

Christoph agrees with Alessandro's proposal to coordinate on multicore deployment.

  • Disk-Tape Separation
    • Three sites completed (CERN, CNAF, RAL)
    • The remaining sites are either in progress (KIT) or in concrete planning

  • Xrootd-Federation
    • Scale testing and hardening of redirectors in progress
    • SAM tests not critical yet
    • Fallback (get the file via the federation if it is not accessible locally; a toy sketch follows after this list)
      • Good deployment status
      • Used routinely for production (sending jobs to sites not hosting the data)
      • Two T1s and six T2s still missing
      • Firewall capacity is usually what prevents sites from enabling it
    • Xrootd-access
      • Enabled at the disk-tape separated T1s and ~80% of the T2s
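
The fallback mechanism can be illustrated with the toy sketch below: if the local replica cannot be found, the file is fetched through the federation instead. In CMS the real fallback is configured in the site-local configuration rather than hand-coded; the file name, local prefix and redirector here are examples only.

    #!/usr/bin/env python
    # Toy illustration of the Xrootd fallback: use the local replica if it
    # exists, otherwise copy the file through the federation redirector.
    import os
    import subprocess

    LFN = "/store/data/Run2012A/EXAMPLE/AOD/file.root"   # example file name
    LOCAL_PREFIX = "/pnfs/example-site.org/data/cms"     # example local prefix
    REDIRECTOR = "xrootd-cms.infn.it"                    # example federation redirector

    local_path = LOCAL_PREFIX + LFN
    if os.path.exists(local_path):
        print "reading local replica:", local_path
    else:
        url = "root://%s/%s" % (REDIRECTOR, LFN)
        print "local replica not found, falling back to", url
        subprocess.check_call(["xrdcp", url, os.path.basename(LFN)])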

  • HLT
    • Equipped with OpenStack Cloud
    • Still some limitations prevent using the full capacity
    • Successfully used for parts of 2011 reprocessing
    • Want to enlarge the usage
      • Better integration with existing production tools
      • Gain understanding of present limitations

  • Savannah-GGUS
    • Concerns only the computing infrastructure
    • (Slowly) progressing
      • Some input still to be provided by CMS
      • Functional prototype for Q1/2014 still realistic
      • Complete switch in Q2/2014 assuming no further issues

  • gLExec
    • Glidein infrastructure is ready to use it
    • Considering making the SAM test critical early next year (~January)
      • Asking for approval in next WLCG MB

Michel comments that, given that the WLCG MB had set October as a tentative deadline for gLExec deployment, the CMS plan to enforce it in January is compatible and hence would not need further approval.

  • gLiteWMS Decommissioning
    • Still used for a small fraction (~10%) of user analysis jobs
    • Proposal to decommission the CMS WMS cluster at CERN
      • O.k. for CMS
      • Can use general WMS service
      • Further decrease of gLite WMS usage expected

Maite asks if it is possible to remove the CERN WMSes from the CRAB configuration. Christoph says that CMS prefers to push users towards glideins rather than imposing a strict cutoff. Maarten would prefer a clear message on whether CERN IT can decommission the CMS WMS cluster in the near future. Simone invites CMS to give an official statement at the next meeting.

  • Site News
    • T1_RU_JINR
      • Did some scale testing last week - looks good
      • Will be included in MC production as first step
      • At least 10% of planned total capacity in place, growing in 2014
      • Stable 10Gb WAN connection expected early 2014
    • KISTI
      • Interested in supporting CMS as a Tier-3

LHCb

For details see the slides.

  • LHCb plans another incremental stripping campaign for spring 2014 (6-8 weeks at all Tier-1 sites)
  • Five Tier2D sites are now in production
  • FTS-3 is used (in FTS-2 mode) for all WAN transfers
  • LHCb is interested in consuming perfSONAR data
  • Only about 20 sites still use WMS submission and are being taken care of individually
  • Will soon switch to SL6 as default architecture for user analysis
  • Will soon consume WLCG monitoring information in DIRAC

WLCG monitoring consolidation: the site perspective

For details see the slides.

Pepe summarises the goals and the status of the WLCG monitoring consolidation project. The first phase, consisting of assessing the current situation and defining a plan, is over and from now till July effort will be devoted to implementation of the plan. The area of site and service monitoring was identified as the one with the most potential for improvement.

In particular for SAM, Nagios as the test submission tool will become optional, and job submission will go via Condor-G, the CREAM CE or the ARC client, to get rid of the gLite WMS. Experiments will be able to feed their own information into SAM. PIC will implement a Nagios plugin that feeds information gathered from the SSB (which is planned to be used to retrieve SAM test results) into a local site Nagios installation.
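
A Nagios plugin of the kind PIC proposes could look roughly like the sketch below: it fetches a site metric from the SSB and translates it into the standard Nagios exit codes. The URL and the JSON layout are placeholders, not the real SSB API.

    #!/usr/bin/env python
    # Hedged sketch of an SSB-to-Nagios plugin; URL and JSON fields are placeholders.
    import json
    import sys
    import urllib2

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3            # standard Nagios exit codes
    SSB_URL = "https://dashboard.example.cern.ch/ssb/site-status?site=pic"

    try:
        data = json.load(urllib2.urlopen(SSB_URL, timeout=30))
        status = data["status"]                             # e.g. "OK", "WARNING", "CRITICAL"
    except Exception as exc:
        print "UNKNOWN - could not query the SSB: %s" % exc
        sys.exit(UNKNOWN)

    print "%s - site status as published in the SSB" % status
    sys.exit({"OK": OK, "WARNING": WARNING}.get(status, CRITICAL))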

Some applications will soon be stopped (because they are not considered useful any more by the experiments).

As a possible next step, Pepe adds that implementing complex aggregation algorithms would be useful. Pablo answers that this will easily be possible, as CMS has been doing for years with the site readiness metric.

Ongoing Task Forces Review

Middleware readiness

SL6 migration

  • Total number of Tier1s Done: 14/16 (Alice 9/10, Atlas 12/13, CMS 6/8, LHCb 8/9)
  • Total number of Tier1s not Done: 2/16 (1 with a plan, 1 in progress)
    • CERN: at the last update, 30% of the resources had been moved; the plan is to move at least 80% before the end of the year. They are not only moving to SL6 but also virtualising the WNs.
      • More detailed answer from Helge: the CERN batch resources are running up to about 30% on SL6; according to our monitoring this is currently the right balance in view of user requests. However, we are in an active migration campaign (and stick to our objective of being at 80% by Christmas) even if it means that jobs submitted to SL5 will have to wait longer. If any LHC VO plans to move work at CERN to SL6 massively, we should be able to cope with this with 1...2 weeks advance warning.
    • FNAL: have 10 WNs in test and should finish the upgrade by the end of November.

  • Total number of Tier2s Done: 113/131 (Alice 32/40, Atlas 75/88, CMS 54/65, LHCb 40/46)
  • Total number of Tier2s not Done: 18/131 (12 with a plan, 6 in progress)

  • In terms of HS06, the resources still to be migrated are roughly
    • T1+T2: ~11% of the total (numbers taken from REBUS are variable)
    • Should be less by the end of November

  • 21 T2s and 2 T1s got a GGUS ticket on 1/11/2013
    • 3 tickets have been closed
    • 2 are on hold
    • 13 in progress
    • 5 assigned with no answer
    • Tickets should be followed by the experiments.

  • Positive outcomes of the TF
    • ~90% of resources moved to SL6 without much disruption for the experiments.
    • Creation of the WLCG repository
    • Cleaner HEP_OSlibs rpm
    • EMI-3 WNs tested and usable (even though it required ATLAS to use arcproxy as the primary proxy verification tool).

Helge comments that the 30% quoted for CERN is a bit misleading, as it refers to the entire batch capacity (which includes dedicated resources that will stay on SLC5 for yet some time, and non-LHC resources). The actual number is much higher if referred to the WLCG resources only.

FTS3

For details see the slides.

  • FTS-3 status: LHCb: all production transfers; ATLAS: 30% of all production transfers, plus functional tests at all sites; CMS: 30% of debug transfers
  • Stable service in the last month, after several bug fixes
  • Next tests: compare FTS-2 and FTS-3 using appropriate metrics and test new functionality (e.g. multihop, staging and VO shares)
  • The FTS-3 deployment scenario will be defined only after determining if a single instance can sustain all the experiment loads

A discussion on the deployment scenario follows. Oliver points out that in the past ATLAS argued that in any case a multisite deployment was desirable, regardless of scalability. Borut answers that if scalability with a single instance is demonstrated, it can be a good incentive to talk to the sites and reconsider the scenario.
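
For reference, a single test transfer can be submitted to an FTS-3 instance with the command-line client, roughly as sketched below. The endpoint and SURLs are placeholders, and the fts-transfer-submit/fts-transfer-status usage shown is an assumption based on the gLite-style client shipped with FTS-3.

    #!/usr/bin/env python
    # Hedged sketch: submit one test transfer via the FTS-3 command-line client.
    # Endpoint and SURLs are placeholders; adapt them to the instance under test.
    import subprocess

    FTS3_ENDPOINT = "https://fts3.example.org:8443"
    SRC = "srm://source-se.example.org/dpm/example.org/home/vo/testfile"
    DST = "srm://dest-se.example.org/dpm/example.org/home/vo/testfile"

    job_id = subprocess.check_output(
        ["fts-transfer-submit", "-s", FTS3_ENDPOINT, SRC, DST]).strip()
    print "submitted FTS-3 job", job_id
    # The job can then be polled with: fts-transfer-status -s <endpoint> <job_id>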

perfSONAR

For details see the slides.

  • All sites must have deployed perfSONAR 3.3.1 by April 1 at the latest, with the services registered in OIM/GOCDB and the mesh configuration applied. Documentation and support are provided by the task force
  • 67 pS instances (out of a total of 198 declared in OIM/GOCDB) show issues when queried (a naive reachability probe is sketched after this list)
  • Progress in the modular dashboard
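
A naive reachability probe for registered perfSONAR instances could look like the sketch below; it only checks that the toolkit web interface of each host answers, and does not replace the measurement-archive queries performed by the dashboard. The host names are placeholders to be taken from OIM/GOCDB.

    #!/usr/bin/env python
    # Naive probe: check that registered perfSONAR hosts answer on the
    # toolkit web interface. Host names below are placeholders.
    import socket
    import urllib2

    HOSTS = ["ps-latency.example-site.org", "ps-bandwidth.example-site.org"]

    for host in HOSTS:
        try:
            urllib2.urlopen("http://%s/toolkit/" % host, timeout=15)
            print host, "OK"
        except (urllib2.URLError, socket.error) as exc:
            print host, "FAILED:", exc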

SHA-2

  • Dec 1: some CAs may make SHA-2 the default for new certs
    • not yet clear how many WLCG site services will not be ready for SHA-2 by that time
      • storage elements are the main concern
      • EGI have opened tickets for sites and will follow up
    • CERN CA will switch when WLCG is ready, probably early next year
  • VOMS
    • VOMS-Admin test setup being worked on
      • will have EMI-3 Update 10, released Mon Nov 4
      • timeline?
  • experiments
    • ALICE looks ready
      • successful jobs at CERN using a SHA-2 cert!
    • ATLAS/CMS/LHCb have tested a lot and no real issues were found

Michel adds that OSG will start to issue SHA-2 certificates in mid January. Burt clarifies that this is likely to happen but it's not yet official.

About the migration to VOMS Admin, Helge adds that fresh manpower has been assigned to the task. As soon as a couple of technical problems are resolved, progress should be swift.
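
As a quick readiness aid for sites, the signature algorithm of any host or user certificate can be inspected with the stock openssl tool, for instance as in the sketch below (the certificate path is an example).

    #!/usr/bin/env python
    # Print the signature algorithm of a certificate to see whether it is
    # already SHA-2 signed (e.g. sha256WithRSAEncryption) or still SHA-1.
    import subprocess

    CERT = "/etc/grid-security/hostcert.pem"   # example path

    text = subprocess.check_output(["openssl", "x509", "-in", CERT, "-noout", "-text"])
    for line in text.splitlines():
        if "Signature Algorithm" in line:
            print line.strip()
            break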

WMS decommissioning

gLExec

  • 54 tickets closed and verified, 40 tickets still open
    • many sites tied gLExec deployment to their SL6 migration
  • deployment tracking page
  • special cases
    • USATLAS sites
    • ATLAS sites with ARC CE
      • rely on a future version of the ARC Control Tower
    • ALICE sites without CREAM
      • cannot be tested today
  • deadlines per experiment?
    • CMS are the most ready
    • LHCb
    • ATLAS and ALICE still need development (ongoing and foreseen, respectively)

Stefan adds that LHCb did some testing of a new DIRAC pilot with gLExec and it was working, but site issues were found.

Tracking tools evolution

  • New release of GGUS announced for the 27th of November, with several improvements.
  • The possibility to notify multiple sites at once is not yet scheduled for this release. When implemented, the 'Type Of Problem' will be limited to TEAM and ALARM tickets, and there will not be any special grouping for the notifications created in one bunch.
  • Of the 83 IT+Grid Savannah projects, the important ones have been identified. Projects that are not migrated will no longer be accessible once Savannah is decommissioned.

Simone points out that even in absence of a special grouping for a set of multiple tickets, it should be very easy to find them all via the GGUS query engine.

xrootd deployment

Continue supporting the deployment of XRootD at sites, acting as a liaison between the FAX and AAA projects

  • Focus on Monitoring and Information System

Monitoring:

  • Consolidate GLED Collector system and infrastructure
    • include filtering of VOs (crucial for multi-VO sites)
    • include connection to load balanced messaging brokers
  • Detailed monitoring of dCache sites
    • third-party plugins made available in the WLCG repo for dCache versions 1.9, 2.2 and 2.6
    • The incompatibility of this plugin for dCache versions <2.6 with the SHA-2 migration implies that the detailed monitoring of dCache sites will be missing until the full migration to dCache 2.6 takes place.
    • we will continue tracking the deployment of these plugins as sites move to dCache 2.6
  • Deploy FAX EU Collector at CERN
  • Finalize merging of Dashboard federation monitoring and data popularity monitoring

Information System

  • XRootD SE entry point and redirector should be declared in GOCDB/OIM for all sites.
    • goal to achieve in the next months.

IPv6

  • Status
    • New experiment contacts: Alastair Dewhurst (ATLAS), Tony Wildish (CMS), Ivan Glushkov (CMS). Pete Clark and Costin Grigoras represent LHCb and ALICE "ex officio" (as they do in the HEPix IPv6 WG)
    • Defined, tested and documented a recipe to enable dual stack on any CERN machine (apart from OpenStack nodes); a minimal verification sketch follows after this list
  • Plans
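
As a minimal complement to the dual-stack recipe mentioned above, the sketch below checks whether a host name resolves to both an IPv4 and an IPv6 address (the host name is an example).

    #!/usr/bin/env python
    # Check that a host resolves on both IP stacks from this node's resolver.
    import socket

    HOST = "lxplus.cern.ch"   # example host name; use one expected to be dual stack

    families = set(info[0] for info in socket.getaddrinfo(HOST, None))
    has_v4 = socket.AF_INET in families
    has_v6 = socket.AF_INET6 in families
    print "%s: IPv4 %s, IPv6 %s" % (HOST, has_v4, has_v6)
    if not (has_v4 and has_v6):
        print "not (yet) dual stack as seen from this machine"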

-- AndreaSciaba - 05 Nov 2013
