WLCG Operations Coordination Minutes, Jan 27, 2022

Highlights

  • A pre-GDB on operational effort and possible optimizations will take place on the 24th of February, 3:00-5:30 PM CET.

Agenda

https://indico.cern.ch/event/1120678/

Attendance

  • local:
  • remote: Alberto (monitoring), Alessandra D (Napoli), Alessandra F (ATLAS + Manchester + WLCG), Andrea (WLCG), Borja (monitoring), Christoph (CMS), Concezio (LHCb), Danilo (CMS), David Cameron (ATLAS + ARC), David Cohen (Technion), Derek (Nebraska + monitoring), Enrico (CNAF), Eric (IN2P3), Federica (CNAF), Horst (Oklahoma), James (CMS), Julia (WLCG), Lucia (CNAF), Maarten (ALICE + WLCG), Marian (networks + monitoring), Matt (Lancaster), Max (KIT), Mihai (FTS), Miltiadis (WLCG), Nikolay (monitoring), Panos (WLCG), Petr (ATLAS + Prague), Shawn (MWT2 + networks), Stephan (CMS), Thomas (DESY), Xin (BNL + WLCG)
  • apologies:

Operations News

  • the next meeting is planned for March 3

  • IAM service support
    • IAM has been added to the table of critical services at CERN
    • Its urgency and impact numbers have been copied from the VOMS row
      • They are expected to change in the course of this year
    • The details for submitting a ticket have been added to these pages:
    • For the next few months, the support level is closer to 8/5 than 24/7
      • That is expected to improve in the course of this year
      • It implies we should not yet rely on very short-lived tokens
      • Rather, imitate what is done with VOMS proxies for now (see the token-request sketch below)
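  As a rough illustration of such token usage, the sketch below obtains an access token from an IAM instance via the standard OAuth2 client_credentials grant and prints its lifetime. It is a sketch, not an official recipe: the issuer URL, client credentials and scopes are placeholders, the token endpoint path is assumed to be {issuer}/token, the requests library is used, and the actual lifetime (e.g. the 96h discussed below) is determined by the IAM server / client configuration, not by this request.

    # Minimal sketch, assuming a placeholder IAM issuer and client registration.
    import requests

    IAM_ISSUER = "https://iam.example.org"   # placeholder issuer URL
    CLIENT_ID = "my-client-id"               # placeholder client credentials
    CLIENT_SECRET = "my-client-secret"

    def get_token():
        """Request a token with the client_credentials grant and report its lifetime."""
        resp = requests.post(
            f"{IAM_ISSUER}/token",
            data={"grant_type": "client_credentials",
                  "scope": "compute.read compute.create"},  # example WLCG scopes
            auth=(CLIENT_ID, CLIENT_SECRET),
            timeout=30,
        )
        resp.raise_for_status()
        payload = resp.json()
        # the lifetime comes back from the server; this code only reports it
        print(f"token lifetime: {payload.get('expires_in')} seconds")
        return payload["access_token"]

    if __name__ == "__main__":
        get_token()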

Discussion

  • Petr:
    • in ATLAS we will need to use 96h tokens for now
    • ETF will need to do the same

  • David Cameron:
    • is the current IAM service support level consistent with
      the phaseout of X509 proxies in OSG by the end of Feb?

  • Maarten:
    • that is for ATLAS to decide, but IMO it is OK to use 96h tokens,
      because they are only used for job submission to HTCondor CEs,
      they are not delegated to the jobs, i.e. not spread over the grid
    • to some extent this is better than what we have today with VOMS proxies
    • data management tokens used by jobs will need much shorter lifetimes,
      but we are not close to having those in any production workflows
    • developments in that area are discussed in the Authorization WG

  • Thomas:
    • is there documentation on how tokens are mapped, and based on which identifiers?

  • Maarten:
    • yes, but probably not quickly found via the Authorization WG docs
    • we will follow up
    • currently such mappings mostly matter to HTCondor CEs
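  To illustrate the principle only (this is not the HTCondor-CE mechanism, which uses its own mapfiles), the sketch below maps a token to a local account based on its issuer and group claims, using the claim names of the WLCG Common JWT Profile; the issuers, groups and accounts are made-up placeholders.

    # Illustrative only: map a token to a local account from its identifiers.
    import base64
    import json

    MAPPING_RULES = [
        # (issuer, required group)                        -> local account
        (("https://vo-auth.example.org/", "/vo/production"), "voprd"),
        (("https://vo-auth.example.org/", None),             "vouser"),
    ]

    def decode_claims(token: str) -> dict:
        """Decode the JWT payload without verifying the signature (demo only)."""
        payload_b64 = token.split(".")[1]
        payload_b64 += "=" * (-len(payload_b64) % 4)   # restore base64 padding
        return json.loads(base64.urlsafe_b64decode(payload_b64))

    def map_to_account(token: str):
        claims = decode_claims(token)
        issuer = claims.get("iss")
        groups = claims.get("wlcg.groups", [])
        for (rule_iss, rule_group), account in MAPPING_RULES:
            if issuer == rule_iss and (rule_group is None or rule_group in groups):
                return account
        return None   # no matching rule: reject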

Special topics

XRootD monitoring

see the presentation

Discussion

  • Danilo:
    • page 15: why are there loops between the Collectors and the MQ instances?
    • what is the expected timeline for implementing all this?

  • Derek:
    • the Shoveler uses the message bus to reliably send data to its Collector. These are raw messages, not yet aggregated per transfer event; the message content is exactly the same as the UDP packet content.
    • the Collector parses and aggregates the data and sends the results
      to its consumers via the same message bus, but on a different topic than the one used for the raw messages (a rough sketch of this flow is included after this discussion)
    • Shovelers are expected to work at ~all OSG sites by ~end of Feb

  • Borja:
    • most of the remaining development work for the EGI/WLCG side is
      expected to be finished by ~end of Feb
    • testing would then start as of March

  • Christoph:
    • can the Shoveler also be used by dCache instances?

  • Derek:
    • that depends on the message format
    • dCache can send XRootD-compatible messages, AFAIU
    • to be tested

  • Alessandra F:
    • the dCache devs will need to be contacted about this

  • Julia:
    • will do

  • Borja:
    • there also is MonALISA to be considered in this respect
    • all senders will have to stick to the schema that we agree on; dCache reports are not expected to be the same as the raw messages sent by the Shoveler.

  • Maarten:
    • when the various components are ready for deployment,
      we will need to get them into the right repositories

  • Julia:
    • the Shoveler would be part of XRootD releases
    • the dCache team have their own releases
    • the Collector could be added to the WLCG repository

  • David Cameron:
    • will we be able to see mappings of transfers to sites?

  • Borja:
    • yes, the usual Monit functionality for transfers

  • Alessandra F:
    • the aim is to have consistency in Monit for all transfers

  • Andrea:
    • can the use of IPv6 be tracked?

  • Derek:
    • yes, it is an attribute in the aggregated data
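  As referenced above, a rough sketch of the raw-to-aggregated message flow follows. It is not the actual Collector code: the broker host, credentials, topic names and message fields are placeholders, raw messages are assumed to arrive as JSON for simplicity, and stomp.py (version 8 or later) is assumed as the transport library.

    # Sketch only: consume raw Shoveler messages from one topic, aggregate them
    # per transfer, and republish the summaries on a different topic.
    import json
    import stomp

    BROKER = ("mq.example.org", 61613)              # placeholder broker
    RAW_TOPIC = "/topic/xrootd.shoveler.raw"        # placeholder topic names
    AGG_TOPIC = "/topic/xrootd.collector.summary"

    class Aggregator(stomp.ConnectionListener):
        def __init__(self, conn):
            self.conn = conn

        def on_message(self, frame):
            raw = json.loads(frame.body)            # assume JSON payloads for this sketch
            summary = {                             # toy per-transfer aggregation
                "site": raw.get("site"),
                "bytes": raw.get("read_bytes", 0) + raw.get("write_bytes", 0),
                "ipv6": raw.get("ipv6", False),     # IPv6 kept as an attribute
            }
            # the results go out on a *different* topic than the raw messages
            self.conn.send(destination=AGG_TOPIC, body=json.dumps(summary))

    conn = stomp.Connection([BROKER])
    conn.set_listener("aggregator", Aggregator(conn))
    conn.connect("user", "password", wait=True)     # placeholder credentials
    conn.subscribe(destination=RAW_TOPIC, id=1, ack="auto")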

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Normal activity levels on average in the last 8 weeks
  • No major issues
  • Site VOboxes are being switched from legacy AliEn to JAliEn
    • Most sites and ~85% of the resources are done
      • Progress can be tracked here
      • Issues at a few of the remaining sites are being followed up
    • JAliEn is needed for Run-3 multi-core jobs
    • Most sites should only see 8-core jobs eventually
      • Some already do, some others receive a mix
    • Such jobs can also run up to 8 single-core (legacy) tasks
    • For each task, Singularity is tried from CVMFS (see the sketch after this list)
      • If that fails, a local system installation is tried
      • If that fails, the task is run in the classic way
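  The sketch below only illustrates the fallback order described in the last bullets; it is not JAliEn's actual logic, and the CVMFS paths, container image and task command are made-up placeholders.

    # Illustrative fallback order: CVMFS Singularity -> local Singularity -> classic.
    import shutil
    import subprocess

    CVMFS_SINGULARITY = "/cvmfs/example.org/bin/singularity"   # placeholder path
    IMAGE = "/cvmfs/example.org/containers/default"            # placeholder image
    TASK = ["./run_task.sh"]                                   # placeholder payload

    def run_task() -> int:
        if shutil.which(CVMFS_SINGULARITY):          # 1) Singularity from CVMFS
            cmd = [CVMFS_SINGULARITY, "exec", IMAGE] + TASK
        elif shutil.which("singularity"):            # 2) local system installation
            cmd = ["singularity", "exec", IMAGE] + TASK
        else:                                        # 3) classic way: run the task directly
            cmd = TASK
        return subprocess.call(cmd)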

ATLAS

  • Smooth running over the Xmas break and in the last few weeks with 700-800k slots
    • Main activities: Run 2 data and MC reprocessing
    • Including running MC reprocessing on 50% of the HLT farm
  • Another CA update (SlovakGrid) means all dCache services need to be restarted (same as with Swiss and Brazilian updates last year)
  • The switch to the IAM VOMS server seemed to go smoothly; it required some cleanup of tools still using legacy (non-RFC) proxies
  • AGIS servers were shut down last week
  • Problems with slow transfers to and from RAL, hard to debug (GGUS:154436)
  • SRR storage reporting is shaky, especially at dCache sites: several times storage filled up because the SRR was not up to date (a freshness-check sketch follows after this list)
  • Planning a Run 3 commissioning data transfer test ~end Feb/begin March involving full T0 and export to T1 tapes
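  A minimal sketch of the kind of staleness check that would flag an out-of-date SRR is shown below. It assumes the site's SRR JSON exposes a storageservice.latestupdate Unix timestamp, as in the WLCG SRR schema; the URL and the 6-hour threshold are placeholders, and the requests library is used.

    # Sketch: warn when a site's SRR document has not been updated recently.
    import time
    import requests

    SRR_URL = "https://storage.example.org/srr/storagesummary.json"  # placeholder URL
    MAX_AGE = 6 * 3600                                               # placeholder threshold (s)

    def srr_is_fresh() -> bool:
        doc = requests.get(SRR_URL, timeout=30).json()
        latest = doc["storageservice"]["latestupdate"]   # epoch seconds
        age = time.time() - latest
        print(f"SRR last updated {age / 3600:.1f} h ago")
        return age < MAX_AGE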

CMS

  • running smoothly with 300-350k cores
    • no significant issues during the holidays
    • the transatlantic Fermilab-CERN link was down to the tertiary, 20 Gbit/s, link during most of the holidays
      • waiting for the last pieces of information about the chain of responsibility for the machine that failed at CERN and caused the issue
    • internal saturation for analysis jobs during the holidays
      • traced to a large number of jobs with a low number of sub-jobs
    • usual production/analysis split of 3:1
    • HPC allocations contributing up to 30k cores
      • 2021 allocations for machines in the US were all consumed well before the end of the allocation period
    • production activity mainly Run 2 ultra-legacy Monte Carlo
    • re-reconstruction of parked B data, 11B of 12B processed
  • SRM+WebDAV commissioning at Tier-1 sites started
  • accidental deletion of SAM/HC datasets in the middle of December
    • at about a third of the sites
    • all files restored by the middle of January
  • HammerCloud instabilities for several weeks
    • job status queries failed, causing multiple jobs and an empty status page; corrected
    • no new jobs being submitted for some series; still being investigated
    • working on updating HC jobs for Run 3 software/input datasets
  • a big thanks to all sites contributing above their pledges!
    • this is much appreciated while sites struggle to get new machines
    • a very welcome boost to the CMS physics program

LHCb

  • smooth running over Xmas break and in January
  • using 140-160k cores
    • the Run 2 re-processing campaign ended on Dec 31st, 2021 (!)
    • simulation jobs at 95%, user jobs at 5%
  • some data movements / replicas to
    • recover disk space at PIC
    • deal with long-term downtime of storage at CBPF
  • planning tape throughput tests at Tier0
    • a tape read test at the Tier 1 sites should also be planned

Task Forces and Working Groups

GDPR and WLCG services

  • Updated list of services
  • A review of the status of publishing the CERN RoPOs for WLCG services hosted at CERN, and of the WLCG Data Privacy Notice for the other WLCG services, was given at the January WLCG MB. We were asked to go ahead and to accelerate this process. WLCG Ops Coordination will follow up.

Accounting TF

dCache upgrade TF

Information System Evolution TF

  • WLCG CRIC has been bootstrapped with initial network topology information. Tickets will be submitted to the sites asking them to validate this data.

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

  • The kick-off meeting of the Monitoring Task Force took place on the 13th of January, where the main directions of work were agreed. The Jira project WLCGMONTF has been created to follow up on progress.

Network Throughput WG

  • perfSONAR infrastructure - 4.4.2 is the latest release
  • WLCG/OSG Network Monitoring Platform
    • Weekly meetings focusing on operations and infrastructure and alerts/alarms improvements
  • 100G perfSONAR meeting took place last week
    • We agreed to tune the performance in order to reach ~10% of the link capacity (currently ~3-5%)
  • Recent and upcoming WG updates:
  • WLCG Network Throughput Support Unit: see the twiki for a summary of recent activities.

WG for Transition to Tokens and Globus Retirement

  • Progressing via Authorization WG meetings

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

-- JuliaAndreeva - 2022-01-25
