WLCG Operations Coordination Minutes, May 5, 2022

Highlights

Agenda

https://indico.cern.ch/event/1156317/

Attendance

  • local:
  • remote: Alessandra D (Napoli), Alessandra F (ATLAS + Manchester + WLCG), Andrea (WLCG), Andrew (TRIUMF), Borja (monitoring), Christophe (LHCb), David S (ATLAS), Derek (Nebraska + monitoring), Doug (ATLAS), Giuseppe (CMS), Jiri (ATLAS), Julia (WLCG), Maarten (ALICE + WLCG), Matt (Lancaster), Mihai (FTS), Miltiadis (ATLAS + WLCG), Panos (WLCG), Pau (CMS), Petr (ATLAS + Prague + WLCG), Selcuk (CMS), Shawn (AGLT2 + networks), Thomas (DESY)
  • apologies:

Operations News

  • ESnet are planning to upgrade their installations at CERN between June 7 and June 10, i.e. still before the planned start of stable beams on June 15. They have redundancy with equipment in B513 and B773, but each will be down for much of a day during the upgrade, leading to reduced capacity (and no redundancy).

  • the next meeting is planned for June 2

Special topics

Update on XRootD monitoring progress

see the presentation

Discussion

  • Julia:
    • the Shoveler is expected to become part of the XRootD distribution,
      but that may still take a while

  • Doug:
    • how do the dCache Xrootd services fit in?
  • Derek:
    • they do not generate the monitoring stream in the same way
  • Julia:
    • it has been agreed they will send the info that is needed,
      but the details have to be further looked into

  • Julia:
    • what might be the timeline for progress on the MONIT side?
  • Borja:
    • some delays are expected due to reorganization of the MONIT team
      as well as holidays
    • we expect to see the first sites appear by the end of June
    • over the summer we then hope to add many more

Campaign for enabling SRR on the dCache sites

see the presentation

Discussion

  • Christophe:
    • LHCb also relies on SRR files
    • we are struggling with an XRootD site that does not have it
  • Julia:
    • for XRootD, SRR details are still to be followed up

  • Doug:
    • is the SRR checker code available for sites to use?
  • Julia:
    • it is a cron job on CRIC that just checks the JSON structure
    • we will update the Twiki page with a link to the sources
    • Follow-up after the meeting:
      • CRIC team suggests using CRIC API to get information regarding SRR validity and accessibility rather than running another loop trying to access SRR files. The code of the SRR checker is simple, but would need some modifications (not writing back to CRIC). CRIC team will follow up with Doug to find the best solution.
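
As an illustration of the kind of structural check mentioned in the discussion above, here is a minimal Python sketch. It is not the CRIC checker code. The key names (storageservice, storageshares, totalsize, usedsize, vos) are assumed from the WLCG SRR JSON layout, and the choice of which keys are required is hypothetical. Unlike the CRIC cron job, it only returns its findings and writes nothing back.

```python
import json

# Hypothetical minimal set of required keys; the real checker's rules may differ.
REQUIRED_SERVICE_KEYS = ("name", "implementation", "storageshares")
REQUIRED_SHARE_KEYS = ("name", "totalsize", "usedsize", "vos")


def check_srr(text):
    """Return a list of problems found in an SRR JSON document (empty if none)."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]

    # The SRR content is assumed to live under a top-level "storageservice" object.
    service = doc.get("storageservice") if isinstance(doc, dict) else None
    if not isinstance(service, dict):
        return ["missing top-level 'storageservice' object"]

    problems = [f"storageservice: missing '{key}'"
                for key in REQUIRED_SERVICE_KEYS if key not in service]

    shares = service.get("storageshares", [])
    if not isinstance(shares, list):
        return problems + ["'storageshares' is not a list"]

    # Each storage share is checked independently so all problems are reported.
    for index, share in enumerate(shares):
        if not isinstance(share, dict):
            problems.append(f"storageshares[{index}]: not an object")
            continue
        for key in REQUIRED_SHARE_KEYS:
            if key not in share:
                problems.append(f"storageshares[{index}]: missing '{key}'")
    return problems
```

A site could run such a function on the content of its own SRR file before publishing it; an empty list means that the assumed minimal structure is present.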

Middleware News

  • Useful Links
  • Baselines/News
    • CERN Grid CA renewal saga
      • IGTF version 1.116 was released on Monday April 25 to
        make a new CERN Grid CA certificate available with
        its lifetime extended by another 10 years
      • The new certificate was put into production on Monday May 2
      • Despite careful testing of the new certificate beforehand,
        several incidents occurred with instances of Java services
        that make use of the canl-java security library:
        • dCache
        • StoRM
        • VOMS-Admin
        • Argus
      • Except when such a service uses the latest canl-java
        (e.g. the most recent dCache versions), such services
        all need to be restarted to get the updated CA
        certificate correctly loaded in memory
      • VOMS-Admin at CERN was restarted Thu evening April 28,
        earlier than foreseen, in response to tickets from ATLAS and CMS
      • Argus services at CERN were restarted on Monday May 2
    • myproxy.cern.ch (partly) broken from Apr 25 evening to May 3 morning (GGUS:157130)
      • caused by an update to myproxy 6.2.9-7 that is not backward-compatible
      • the service was rolled back to the previous version on May 3
      • Grid Community Forum experts released backward-compatible 6.2.9-8 later that day

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Mostly business as usual
  • Most VOboxes were affected by the myproxy.cern.ch incident reported above
  • VOboxes at a few sites are still using the AliEn legacy MW
  • Multicore jobs are run at CERN, 3 T1 and a few smaller sites
    • Together they amount to 40-50% of the capacity, though

ATLAS

  • Smooth running with 600-700k cores including 200k from HPC and bursts of up to 90k from Google
  • Another two rounds of dCache restarts with Slovak and CERN CA updates
  • A bug in the latest XRootD version caused significant job failures; we rolled back to the previous release
  • T1 tape deletion campaign ongoing, expected to delete ~30PB

CMS

  • running smoothly with 325-400k cores
    • usual production/analysis split of 3:1
    • 30-60k cores of non-pledged contribution
    • production activity mainly Run 2 ultra-legacy Monte Carlo
  • Tier-0 activities
    • splashes and cosmic ray data taking of Run 3 ongoing
  • waiting on python3 version/port of HC
  • WebDAV commissioning ongoing
    • Tier-0,1: all sites ready/in production
    • Tier-2: all but one site ready/in production
    • Tier-3: 11 sites ready/in production, 13 sites working on it, 3 sites on hold/no reply
  • Token commissioning for HTCondor CEs in progress
    • waiting for HTCondor interface for ARC CEs to be released
  • asking sites to upgrade xrootd to v5 before Run 3 data taking
  • issues with EOS Erasure Coding at CERN and Vienna unresolved, GGUS:156195

LHCb

  • Smooth running with 150-180k cores
  • Tape Challenges for NCBJ not successful
  • CERN CA updates had no effect

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

Enabling SRR on dCache sites

Information System Evolution TF

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

  • Task force update will be delivered during May 12th GDB
  • New XRootD Monitoring components
    • Components are already receiving data from some ALICE EOS servers, so integration with MONIT can be started
    • We will adapt the current documentation on how to install/run the Shoveler and check whether it needs to be packaged within WLCG
  • Agreed on the minimum required schema for transfers to be meaningful
    • Discussion held with the different interested developers and agreement on first draft for fields defined
      • We are still working on a document with the draft to be circulated to the experiments to reach a final agreement
  • There has been some internal discussion on how to acquire/integrate site network monitoring
    • More details will be delivered during GDB update

Network Throughput WG


WG for Transition to Tokens and Globus Retirement

Action list

Creation date | Description | Responsible | Status | Comments

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

Specific actions for sites

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

AOB

Topic revision: r13 - 2022-05-16 - MaartenLitmaath
 