WLCG Operations Coordination Minutes, Dec 3, 2020

Highlights

Agenda

https://indico.cern.ch/event/980181/

Attendance

  • local:
  • remote: Alberto (monitoring), Chris (LHCb), Christoph (CMS), Concezio (LHCb), Dave M (FNAL), David Cameron (ATLAS), David Cohen (Technion), Eric F (IN2P3), Giuseppe (CMS), Julia (WLCG), Maarten (ALICE + WLCG), Marian (networks + monitoring), Matt (Lancaster), Nikolay (monitoring), Pedro (monitoring), Pepe (PIC), Ron (NLT1), Shawn (networks), Stephan (CMS), Thomas (DESY)
  • apologies:

Operations News

  • the next meeting is planned for Feb 4

Special topics

WLCG/FTS/Xrootd transfer dashboards update

see the presentation

Discussion

  • Pepe: how complete is the data? are the GLED collectors getting data from all
    Xrootd services in WLCG?
  • Nikolay: we recently needed to migrate the GLED collectors to CentOS 7 and
    contacted the experiments to verify things from their perspectives
  • Pedro: so far we have relied on experiments providing their sites with
    configuration instructions for the monitoring info to be sent to MONIT

  • Pepe: can local and external Xrootd accesses be distinguished?
  • Pedro: the info has a flag and we only use remote accesses for WLCG dashboards

  • Pepe: is there a distinction between TPC vs. job activities?
  • Pedro: no

  • Stephan: after the migration we found a large fraction of the sites no longer
    reporting and we checked the configurations with those sites
  • Pedro:
    • there was similar feedback from ATLAS
    • maybe these matters should be checked at the WLCG level?
    • the FTS dashboards can be trusted, the Xrootd ones are unclear
  • Pepe: CMS expects their Xrootd traffic to be reported accurately

  • Stephan:
    • there are way more Xrootd servers than FTS hosts
    • not so easy to get all of them to report correctly
    • US sites send their info to US collectors, not directly to MONIT
  • Julia:
    • US sites were supposed to use US collectors, other sites to use CERN
    • how is that nowadays?
  • Pedro:
    • we have all that under control
    • the trouble is that some servers do not report at all

  • Stephan:
    • the responsibility for the US collectors has shifted from UCSD to OSG
    • some bugs were discovered and the services are being revamped
  • Julia:
    • it would be good if US experience could be shared for deployments elsewhere
    • we also should think about how to check if the monitoring is working
  • Stephan:
    • this is work in progress
    • we will get back to these matters

  • Maarten: to decide how much effort to spend, how useful are these dashboards?
  • David Cameron: after the migration we found many ATLAS sites not reporting
    and we just decided to switch off reporting altogether!
  • Julia:
    • the FTS dashboards only show part of the traffic
    • that is why we decided to add the Xrootd dashboards
  • Pepe: either we provide sites with clear instructions or those dashboards cannot be trusted
  • Pedro:
    • different dashboards can serve different purposes
    • the FTS dashboards are useful for operations and accounting
    • the Xrootd dashboards have sites missing at least for ATLAS
    • can an absence of 10% of the Xrootd traffic be tolerated for the WLCG views?

  • Julia:
    • do the WLCG views avoid double counting of Xrootd transfers via FTS?
    • with the progress in DOMA TPC, those are steadily becoming more numerous
  • Pedro, Nikolay:
    • we need to check that
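
The double-counting concern raised above can be illustrated with a minimal sketch. It is not how MONIT handles this: the record layout, the field names and the matching key (source site, destination site, file name, size) are all hypothetical and only show the idea of dropping an Xrootd record when the same transfer was already reported via FTS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    # Hypothetical record layout; real monitoring records will differ.
    reported_by: str  # "fts" or "xrootd"
    src_site: str
    dst_site: str
    file_name: str
    size_bytes: int


def match_key(t):
    # Assumed matching key; a real check would likely also need timestamps.
    return (t.src_site, t.dst_site, t.file_name, t.size_bytes)


def total_bytes_without_double_counting(transfers):
    """Sum transferred bytes, skipping Xrootd records that match an FTS record."""
    fts_keys = {match_key(t) for t in transfers if t.reported_by == "fts"}
    return sum(
        t.size_bytes
        for t in transfers
        if t.reported_by == "fts" or match_key(t) not in fts_keys
    )


sample = [
    Transfer("fts", "SITE_A", "SITE_B", "file1", 100),
    Transfer("xrootd", "SITE_A", "SITE_B", "file1", 100),  # same transfer, seen twice
    Transfer("xrootd", "SITE_A", "SITE_C", "file2", 50),   # direct Xrootd access
]
print(total_bytes_without_double_counting(sample))
```

In this sketch a naive sum would give 250 bytes, while the deduplicated total is 150.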

  • Julia:
    • we would like to have a complete picture for WLCG
    • we need to get back to this next year
      • check which fraction is missing
      • determine proper configurations for sites and MONIT
  • Stephan:
    • let's wait for the new US collectors to get set up
    • added after the meeting: Stephan checked the timeline and confirmed that
      the change/upgrade of the collector in the US is scheduled for mid-2021

  • Dave M:
    • is the scope of these efforts correct?
    • we would like to understand how our networks are used
    • we may not want to have such monitoring depend on fragile site configurations
  • Pedro:
    • these are transfers dashboards: we need to monitor transfer activities
  • Maarten:
    • one day we might obtain our usage details from generic network monitoring
    • for now we can use such transfer dashboards and hopefully they add up to
      what we see on the networks
  • Julia:
    • those dashboards are rather for the global picture, less for operations
  • Thomas:
    • are sites supposed to look at them regularly?
  • Maarten:
    • not now, maybe when they have been much enhanced in the future

  • Julia:
    • the improvements that were presented look good
    • we will get back to these matters next year

  • Julia:
    • storage accounting is also using InfluxDB: do you advise moving away from it?
  • Nikolay:
    • we did that because ES + Spark perform much better for high-cardinality data
    • InfluxDB is OK for other use cases and we still use it indeed

WLCG Network Throughput WG update

see the presentation

Discussion

  • Julia, Maarten: consider using CRIC for your topology info?
  • Marian: CRIC looks OK for WLCG sites and experiments,
    but we also need correlations for other sites
  • Julia:
    • we can work with you to supply information that is currently missing
    • for the WLCG sites all network topology information will be enabled in the WLCG CRIC
    • for sites and VOs beyond WLCG we propose you deploy a dedicated CRIC instance.
      In this case someone should take ownership of this instance and its content.
    • you can already start with the WLCG sites
    • we need to understand how much additional info it would imply to cover non-WLCG sites.
      If it is not too much, it can also be hosted by the WLCG CRIC.
  • Marian: sounds OK
  • Shawn:
    • we need a single source of truth per entity
    • for WLCG sites the info should come from the WLCG CRIC
    • for other sites we are already working with Edoardo
  • Julia: we will set up a meeting to discuss the technical details

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • High activity on average.
  • No major problems.
  • Sites in the UK and possibly elsewhere were affected by a RAL Stratum-1 misconfiguration:
    • HTTP headers asked clients to cache CVMFS catalogs for 3 days instead of 61 sec.
    • The problem started when the service was upgraded to CentOS 7 on Nov 19.
    • The issue got cured on Dec 2, but previous effects may linger for 3 days more.
    • Other VOs presumably were affected too (GGUS:149736).
  • A small number of sites have not yet upgraded their VOBOX to CentOS 7.
    • They have been requested to do so ASAP.
  • Thanks to all sites and best wishes for 2021!
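
The Stratum-1 caching issue reported above (catalogs cacheable for 3 days instead of 61 seconds) could be spotted with a simple header check. The sketch below assumes the lifetime is expressed as a `Cache-Control: max-age=...` value, which the minutes do not specify; the function names are illustrative only.

```python
import re

EXPECTED_MAX_AGE = 61  # seconds, the intended catalog lifetime per the minutes


def max_age_seconds(cache_control):
    """Extract max-age (in seconds) from a Cache-Control header value, if present."""
    match = re.search(r"max-age=(\d+)", cache_control)
    return int(match.group(1)) if match else None


def caches_too_long(cache_control, expected=EXPECTED_MAX_AGE):
    """True if the header lets clients cache for longer than expected."""
    age = max_age_seconds(cache_control)
    return age is not None and age > expected


three_days = 3 * 24 * 3600  # the misconfigured lifetime, in seconds
print(caches_too_long(f"max-age={three_days}"))  # misconfigured case
print(caches_too_long("max-age=61"))             # intended case
```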

ATLAS

  • Stable Grid production in the past weeks with up to ~450k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~90k slots from the HLT/Sim@CERN-P1 farm and ~10k slots from BOINC.
  • Occasional additional peaks of ~100k job slots from HPCs.
  • Started a large production of MC DAOD_PHYS (Run 3 analysis format), including reading input AOD from tape through data carousel
  • Expect to continue filling resources as normal over Xmas break, nothing special planned
  • Started deletion from T1 MCTAPE, expected to take several weeks to complete
  • HTTP-TPC migration: 11 dCache, 16 DPM, 1 Storm
  • Thanks to all for keeping things running smoothly throughout a challenging year and all the best for 2021!

CMS

  • CMS collaboration meeting this week
  • running smoothly at around 300k cores
    • usual production/analysis split of 3:1
    • main processing activities:
      • Run 2 ultra-legacy Monte Carlo
      • Run 2 pre-UL Monte Carlo
    • on track or beyond on HPC allocation use
  • migration to Rucio complete
    • PhEDEx shutdown at all sites
    • LoadTest setup ongoing (tape endpoints)
  • Castor to CTA migration in progress
    • Castor access closed, CTA write to start Dec 7th
  • migration of CREAM-CEs incomplete
    • CERN factory end-of-life reached
    • 0/7/3 Tier-1/2/3 sites with CREAM-CE(s) remaining
      • concerned about one large Russian site
  • VM SL6 upgrade/migration almost done
  • HammerCloud monitoring outage due to late and failed SL6 upgrade attempt

LHCb

  • running smoothly at 100-120k cores
    • dominated (90%) by MC production with Run1 and Run2 conditions
    • HPC: using CINECA Marconi-A2 partition (KNL, ~4k cores), a few hundred cores at SDumont (Brazil)
  • Castor to CTA migration
    • Archive/recall tests successful
    • Tested XRootD TPC transfers from T0 to T1, with tape workflow
    • Need to resolve the T0 to StoRM (HTTP only) case
    • DAQ tests being scheduled
    • migration planned end of January 2021
  • Ongoing review of ETF tests
    • getting rid of outdated ones (e.g. MJF, CREAM)
  • Decommissioning of CREAM-CEs still ongoing

Discussion

  • Maarten: in the recent WLCG Storage workshop we concluded that
    HTTPS/WebDAV was going to be the default TPC protocol,
    which means StoRM does not need to support Xrootd TPC

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • NTR

Archival Storage WG

Containers WG

CREAM migration TF

Details here

Summary:

  • 90 tickets
  • 39 done: 21 ARC, 18 HTCondor
  • 11 sites plan for ARC, 10 are considering it
  • 18 sites plan for HTCondor, 9 are considering it, 6 consider using SIMPLE
  • 1 ticket without reply
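
As a quick consistency check of the figures above, the tally can be reproduced as follows. The numbers are taken from the summary; the interpretation of the surplus is an assumption, since the minutes do not say whether a site can appear in more than one category.

```python
tickets_total = 90
done = {"ARC": 21, "HTCondor": 18}
planned = {"ARC": 11, "HTCondor": 18}
considering = {"ARC": 10, "HTCondor": 9, "SIMPLE": 6}
no_reply = 1

done_total = sum(done.values())
done_percent = round(100 * done_total / tickets_total, 1)

# Sum over all categories as listed in the summary.
category_sum = (
    done_total + sum(planned.values()) + sum(considering.values()) + no_reply
)

# A positive surplus would be consistent with some sites being counted in
# more than one category (an assumption, not stated in the minutes).
surplus = category_sum - tickets_total

print(done_total, done_percent, category_sum, surplus)
```

The done count matches the 39 quoted in the summary, while the categories add up to 94 for 90 tickets.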

dCache upgrade TF

  • 37 sites out of 41 have been upgraded to 5.2.15 or higher

DPM upgrade TF

  • 27 out of 49 DPM sites have migrated to DPM 1.14 and enabled macaroons

StoRM upgrade TF

  • Still waiting for the release 1.11.19 in UMD
    • added after the meeting: it is there, not all components have that version number

Information System Evolution TF

  • Migration of AGIS to ATLAS CRIC is progressing
  • New functionality has been enabled in CMS CRIC to provide user info for CMS Rucio

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

see WLCG/FTS/Xrootd transfer dashboards update

MW Readiness WG

We finally close this WG after > 2.5 years of inactivity.
Also the MW Officer role is discontinued as of 2021.

Network Throughput WG

see WLCG Network Throughput WG update

Traceability WG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

  • THANKS for your help in making 2020 a successful year for WLCG!
    • Further challenges and opportunities await us in 2021...
Topic revision: r13 - 2020-12-07 - MaartenLitmaath