WLCG Operations Coordination Minutes, Jan 27, 2022
Highlights
- A pre-GDB on operational effort and possible optimizations will take place on February 24, 3:00-5:30 PM CET.
Agenda
https://indico.cern.ch/event/1120678/
Attendance
- local:
- remote: Alberto (monitoring), Alessandra D (Napoli), Alessandra F (ATLAS + Manchester + WLCG), Andrea (WLCG), Borja (monitoring), Christoph (CMS), Concezio (LHCb), Danilo (CMS), David Cameron (ATLAS + ARC), David Cohen (Technion), Derek (Nebraska + monitoring), Enrico (CNAF), Eric (IN2P3), Federica (CNAF), Horst (Oklahoma), James (CMS), Julia (WLCG), Lucia (CNAF), Maarten (ALICE + WLCG), Marian (networks + monitoring), Matt (Lancaster), Max (KIT), Mihai (FTS), Miltiadis (WLCG), Nikolay (monitoring), Panos (WLCG), Petr (ATLAS + Prague), Shawn (MWT2 + networks), Stephan (CMS), Thomas (DESY), Xin (BNL + WLCG)
- apologies:
Operations News
- the next meeting is planned for March 3
- IAM service support
- IAM has been added to the table of critical services at CERN
- Its urgency and impact numbers have been copied from the VOMS row
- They are expected to change in the course of this year
- The details for submitting a ticket have been added to these pages:
- For the next few months, the support level is closer to 8/5 than 24/7
- That is expected to improve in the course of this year
- It implies we should not yet rely on very short-lived tokens
- Rather, imitate what is done with VOMS proxies for now
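Since very short-lived tokens cannot yet be relied on, clients will want to check how much lifetime a token has left before reusing it, much as is done today with VOMS proxy lifetimes. A minimal sketch (the function name and token handling are illustrative; only the `exp` claim, standard for JWTs per RFC 7519, is assumed):

```python
import base64
import json
import time

def token_seconds_left(jwt, now=None):
    """Seconds until the token's 'exp' claim; negative if already expired."""
    payload_b64 = jwt.split(".")[1]               # JWT = header.payload.signature
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["exp"] - (now if now is not None else time.time())
```

For a 96h token this would return up to ~345600 seconds at issue time; a submission tool could renew the token once the value drops below some threshold.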
Discussion
- Petr:
- in ATLAS we will need to use 96h tokens for now
- ETF will need to do the same
- David Cameron:
- is the current IAM service support level consistent with the phaseout of X509 proxies in OSG by the end of Feb?
- Maarten:
- that is for ATLAS to decide, but IMO it is OK to use 96h tokens, because they are only used for job submission to HTCondor CEs; they are not delegated to the jobs, i.e. not spread over the grid
- to some extent better than what we have today with VOMS proxies
- data management tokens used by jobs will need much shorter lifetimes, but we are not close to having those in any production workflows
- developments in that area are discussed in the Authorization WG
- Thomas:
- is there documentation on how to map tokens based on which identifiers?
- Maarten:
- yes, but probably not quickly found via the Authorization WG docs
- we will follow up
- currently such mappings mostly matter to HTCondor CEs
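Since such mappings currently mostly matter to HTCondor CEs, the typical place they are expressed is the CE's unified map file, which maps a token issuer (and optionally subject) to a local account. A hypothetical sketch; the file location, issuer URL, and account name are placeholders for what a site would actually configure:

```
# Example entry in an HTCondor-CE unified map file
# (e.g. a file under /etc/condor-ce/mapfiles.d/):
# method   /issuer,subject regex/                      local account
SCITOKENS  /^https:\/\/atlas-auth\.web\.cern\.ch\/,/   atlasprod
```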
Special topics
see the presentation
Discussion
- Danilo:
- page 15: why are there loops between the Collectors and the MQ instances?
- what is the expected timeline for implementing all this?
- Derek:
- the Shoveler uses the message bus to reliably send data to its Collector. These are raw messages, not aggregated per transfer event; the message content is exactly the same as the UDP packet content.
- the Collector parses and aggregates the data and sends the results to its consumers, also via that message bus, on a different topic than the one used for the raw messages.
- Shovelers are expected to work at ~all OSG sites by ~end of Feb
- Borja:
- most of the remaining development work for the EGI/WLCG side is
expected to be finished by ~end of Feb
- testing would then start as of March
- Christoph:
- can the Shoveler also be used by dCache instances?
- Derek:
- that depends on the message format
- dCache can send XRootD-compatible messages, AFAIU
- to be tested
- Alessandra F:
- the dCache devs will need to be contacted about this
- Borja:
- there also is MonALISA to be considered in this respect
- all senders have to stick to the schema that we will agree on; dCache reports are not expected to be the same as the raw messages sent by the Shoveler.
- Maarten:
- when the various components are ready for deployment,
we will need to get them into the right repositories
- Julia:
- the Shoveler would be part of XRootD releases
- dCache have their own releases
- the Collector could be added to the WLCG repository
- David Cameron:
- will we be able to see mappings of transfers to sites?
- Borja:
- yes, the usual Monit functionality for transfers
- Alessandra F:
- the aim is to have consistency in Monit for all transfers
- Andrea:
- can the use of IPv6 be tracked?
- Derek:
- yes, it is an attribute in the aggregated data
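The Shoveler/Collector flow discussed above (raw per-packet messages on one topic, aggregated per-transfer summaries published on another, with IPv6 kept as an attribute of the aggregate) can be sketched as follows. The field names and record shapes here are assumptions, since the actual schema is still to be agreed:

```python
from collections import defaultdict

def aggregate(raw_messages):
    """Collapse raw per-packet records (as relayed by a Shoveler) into one
    summary per transfer, as a Collector would do before publishing the
    results on a separate topic. Field names are hypothetical."""
    totals = defaultdict(lambda: {"bytes": 0, "packets": 0, "ipv6": False})
    for msg in raw_messages:
        rec = totals[msg["transfer_id"]]
        rec["bytes"] += msg["bytes"]          # accumulate transferred bytes
        rec["packets"] += 1                   # count raw messages seen
        rec["ipv6"] = rec["ipv6"] or msg.get("ipv6", False)
    return [{"transfer_id": tid, **rec} for tid, rec in totals.items()]
```

A consumer such as Monit would then read only the summary topic, never the raw one.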
Middleware News
- Useful Links
- Baselines/News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Normal activity levels on average in the last 8 weeks
- No major issues
- Site VOboxes are being switched from legacy AliEn to JAliEn
- Most sites and ~85% of the resources are done
- Progress can be tracked here
- Issues at a few of the remaining sites are being followed up
- JAliEn is needed for Run 3 multi-core jobs
- Most sites should only see 8-core jobs eventually
- Some already do, some others receive a mix
- Such jobs can also run up to 8 single-core (legacy) tasks
- For each task, Singularity is tried from CVMFS
- If that fails, a local system installation is tried
- If that fails, the task is run in the classic way
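The fallback order described above (Singularity from CVMFS, then a local system installation, then the classic way) amounts to a simple probe chain. A sketch with hypothetical paths and return labels:

```python
import os
import shutil

def pick_runtime(cvmfs_singularity, system_binary="singularity"):
    """Return which execution mode the fallback chain would select."""
    if os.access(cvmfs_singularity, os.X_OK):  # 1) Singularity from CVMFS
        return "cvmfs-singularity"
    if shutil.which(system_binary):            # 2) local system installation
        return "system-singularity"
    return "classic"                           # 3) run the task the classic way
```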
ATLAS
- Smooth running over the Xmas break and in the last few weeks with 700-800k slots
- Main activities: Run 2 data and MC reprocessing
- Including running MC reprocessing on 50% of the HLT farm
- Another CA update (SlovakGrid) means all dCache services need to be restarted (same as with Swiss and Brazilian updates last year)
- Switch to the IAM VOMS server seemed to go smoothly; it required some cleanup of tools still using legacy (non-RFC) proxies
- AGIS servers were shut down last week
- Problems with slow transfers to and from RAL, hard to debug (GGUS:154436)
- SRR storage reporting is shaky, especially at dCache sites; several times storage filled up because the SRR was not up to date.
- Planning a Run 3 commissioning data transfer test ~end Feb/begin March involving full T0 and export to T1 tapes
CMS
- running smoothly with 300-350k cores
- no significant issues during the holidays
- the transatlantic Fermilab-CERN link was down to the tertiary 20 Gbit/s link during most of the holidays
- waiting for the last pieces of information about the chain of responsibility for the machine at CERN whose failure caused the issue
- internal saturation for analysis jobs during the holidays
- traced to a large number of jobs with a low number of sub-jobs
- usual production/analysis split of 3:1
- HPC allocations contributing up to 30k cores
- 2021 allocations for machines in the US all consumed well before the end of the allocation period
- production activity mainly Run 2 ultra-legacy Monte Carlo
- re-reconstruction of parked B data, 11B of 12B processed
- SRM+WebDAV commissioning at Tier-1 sites started
- accidental deletion of SAM/HC datasets middle of December
- at about a third of sites
- all files restored by middle of January
- HammerCloud instabilities for several weeks
- job status queries failed, causing multiple jobs and an empty status page; corrected
- no new jobs being submitted for series; still being investigated
- working on updating HC jobs for Run 3 software/input datasets
- a big thanks to all sites contributing above the pledge!
- this is much appreciated while sites struggle to get new machines
- a very welcome boost of the CMS physics program
LHCb
- smooth running over Xmas break and in January
- using 140-160k cores
- the Run 2 re-processing campaign ended on Dec 31, 2021 (!)
- simulation jobs at 95%, user jobs at 5%
- some data movements / replicas to
- recover disk space at PIC
- deal with long-term downtime of storage at CBPF
- planning tape throughput tests at Tier0
- a tape read test at the Tier 1 sites should also be planned
Task Forces and Working Groups
GDPR and WLCG services
- Updated list of services
- A review of the status of publishing the CERN RoPOs for WLCG services hosted at CERN, and the WLCG Data Privacy Notice for other WLCG services, was given at the January WLCG MB. We were asked to go ahead and accelerate this process; WLCG Ops Coordination will follow up.
Accounting TF
dCache upgrade TF
Information System Evolution TF
- WLCG CRIC has been bootstrapped with initial information for the network topology. Tickets will be submitted to the sites asking them to validate this data.
IPv6 Validation and Deployment TF
Detailed status here.
Monitoring
- The kick-off meeting of the Monitoring Task Force took place on the 13th of January; the main directions of work were agreed. A Jira project, WLCGMONTF, has been created to follow up on the progress.
Network Throughput WG
- perfSONAR infrastructure - 4.4.2 is the latest release
- WLCG/OSG Network Monitoring Platform
- Weekly meetings focusing on operations and infrastructure and alerts/alarms improvements
- 100G perfSONAR meeting took place last week
- We have agreed to tune up performance in order to reach ~10% of capacity (right now it is ~3-5%)
- Recent and upcoming WG updates:
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
WG for Transition to Tokens and Globus Retirement
- Progressing via Authorization WG meetings
Action list
Specific actions for experiments
Specific actions for sites
AOB
--
JuliaAndreeva - 2022-01-25
Topic revision: r12 - 2022-01-28 - MaartenLitmaath