WLCG Operations Coordination Minutes, Jan 27, 2022
Highlights
- A pre-GDB on operational effort and possible optimizations will take place on February 24, 3:00-5:30 PM CET.
Agenda
https://indico.cern.ch/event/1120678/
Attendance
- local:
- remote: Alberto (monitoring), Alessandra D (Napoli), Alessandra F (ATLAS + Manchester + WLCG), Andrea (WLCG), Borja (monitoring), Christoph (CMS), Concezio (LHCb), Danilo (CMS), David Cameron (ATLAS + ARC), David Cohen (Technion), Derek (Nebraska + monitoring), Enrico (CNAF), Eric (IN2P3), Federica (CNAF), Horst (Oklahoma), James (CMS), Julia (WLCG), Lucia (CNAF), Maarten (ALICE + WLCG), Marian (networks + monitoring), Matt (Lancaster), Max (KIT), Mihai (FTS), Miltiadis (WLCG), Nikolay (monitoring), Panos (WLCG), Petr (ATLAS + Prague), Shawn (MWT2 + networks), Stephan (CMS), Thomas (DESY), Xin (BNL + WLCG)
- apologies:
Operations News
- the next meeting is planned for March 3
- IAM service support
- IAM has been added to the table of critical services at CERN
- Its urgency and impact numbers have been copied from the VOMS row
- They are expected to change in the course of this year
- The details for submitting a ticket have been added to these pages:
- For the next few months, the support level is closer to 8/5 than 24/7
- That is expected to improve in the course of this year
- It implies we should not yet rely on very short-lived tokens
- Rather, imitate what is done with VOMS proxies for now
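Since very short-lived tokens cannot yet be relied on, clients will want to check how much lifetime a token has left before reusing it, much as is done today with VOMS proxy lifetimes. A minimal sketch (the function name and token handling are illustrative; only the `exp` claim, standard for JWTs per RFC 7519, is assumed):

```python
import base64
import json
import time

def token_seconds_left(jwt, now=None):
    """Seconds until the token's 'exp' claim; negative if already expired."""
    payload_b64 = jwt.split(".")[1]               # JWT = header.payload.signature
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64url padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["exp"] - (now if now is not None else time.time())
```

For a 96h token this would return up to ~345600 seconds at issue time; a submission tool could renew the token once the value drops below some threshold.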
Discussion
- Petr:
- in ATLAS we will need to use 96h tokens for now
- ETF will need to do the same
- David Cameron:
- is the current IAM service support level consistent with the phaseout of X509 proxies in OSG by the end of Feb?
- Maarten:
- that is for ATLAS to decide, but IMO it is OK to use 96h tokens, because they are only used for job submission to HTCondor CEs; they are not delegated to the jobs, i.e. not spread over the grid
- to some extent better than what we have today with VOMS proxies
- data management tokens used by jobs will need much shorter lifetimes, but we are not close to having those in any production workflows
- developments in that area are discussed in the Authorization WG
- Thomas:
- is there documentation on how to map tokens based on which identifiers?
- Maarten:
- yes, but probably not quickly found via the Authorization WG docs
- we will follow up
- currently such mappings mostly matter to HTCondor CEs
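Since such mappings currently mostly matter to HTCondor CEs, the typical place they are expressed is the CE's unified map file, which maps a token issuer (and optionally subject) to a local account. A hypothetical sketch; the file location, issuer URL, and account name are placeholders for what a site would actually configure:

```
# Example entry in an HTCondor-CE unified map file
# (e.g. a file under /etc/condor-ce/mapfiles.d/):
# method   /issuer,subject regex/                      local account
SCITOKENS  /^https:\/\/atlas-auth\.web\.cern\.ch\/,/   atlasprod
```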
Special topics
see the presentation
Discussion
- Danilo:
- page 15: why are there loops between the Collectors and the MQ instances?
- what is the expected timeline for implementing all this?
- Derek:
- the Shoveler uses the message bus to reliably send data to its Collector. These are raw messages, not aggregated per transfer event; the message content is exactly the same as the UDP packet content.
- the Collector parses and aggregates the data and sends the results to its consumers, also via that message bus, on a different topic than the one used for the raw messages.
- Shovelers are expected to work at ~all OSG sites by ~end of Feb
- Borja:
- most of the remaining development work for the EGI/WLCG side is
expected to be finished by ~end of Feb
- testing would then start as of March
- Christoph:
- can the Shoveler also be used by dCache instances?
- Derek:
- that depends on the message format
- dCache can send XRootD-compatible messages, AFAIU
- to be tested
- Alessandra F:
- the dCache devs will need to be contacted about this
- Borja:
- there also is MonALISA to be considered in this respect
- all senders have to stick to the schema that we will agree on; dCache reports are not expected to be the same as the raw messages sent by the Shoveler.
- Maarten:
- when the various components are ready for deployment,
we will need to get them into the right repositories
- Julia:
- the Shoveler would be part of XRootD releases
- dCache have their own releases
- the Collector could be added to the WLCG repository
- David Cameron:
- will we be able to see mappings of transfers to sites?
- Borja:
- yes, the usual Monit functionality for transfers
- Alessandra F:
- the aim is to have consistency in Monit for all transfers
- Andrea:
- can the use of IPv6 be tracked?
- Derek:
- yes, it is an attribute in the aggregated data
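The Shoveler/Collector flow discussed above (raw per-packet messages on one topic, aggregated per-transfer summaries published on another, with IPv6 kept as an attribute of the aggregate) can be sketched as follows. The field names and record shapes here are assumptions, since the actual schema is still to be agreed:

```python
from collections import defaultdict

def aggregate(raw_messages):
    """Collapse raw per-packet records (as relayed by a Shoveler) into one
    summary per transfer, as a Collector would do before publishing the
    results on a separate topic. Field names are hypothetical."""
    totals = defaultdict(lambda: {"bytes": 0, "packets": 0, "ipv6": False})
    for msg in raw_messages:
        rec = totals[msg["transfer_id"]]
        rec["bytes"] += msg["bytes"]          # accumulate transferred bytes
        rec["packets"] += 1                   # count raw messages seen
        rec["ipv6"] = rec["ipv6"] or msg.get("ipv6", False)
    return [{"transfer_id": tid, **rec} for tid, rec in totals.items()]
```

A consumer such as Monit would then read only the summary topic, never the raw one.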
Middleware News
- Useful Links
- Baselines/News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Normal activity levels on average in the last 8 weeks
- No major issues
- Site VOboxes are being switched from legacy AliEn to JAliEn
- Most sites and ~85% of the resources are done
- Progress can be tracked here
- Issues at a few of the remaining sites are being followed up
- JAliEn is needed for Run 3 multi-core jobs
- Most sites should only see 8-core jobs eventually
- Some already do, some others receive a mix
- Such jobs can also run up to 8 single-core (legacy) tasks
- For each task, Singularity is tried from CVMFS
- If that fails, a local system installation is tried
- If that fails, the task is run in the classic way
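The fallback order described above (Singularity from CVMFS, then a local system installation, then the classic way) amounts to a simple probe chain. A sketch with hypothetical paths and return labels:

```python
import os
import shutil

def pick_runtime(cvmfs_singularity, system_binary="singularity"):
    """Return which execution mode the fallback chain would select."""
    if os.access(cvmfs_singularity, os.X_OK):  # 1) Singularity from CVMFS
        return "cvmfs-singularity"
    if shutil.which(system_binary):            # 2) local system installation
        return "system-singularity"
    return "classic"                           # 3) run the task the classic way
```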
ATLAS
- Smooth running over the Xmas break and in the last few weeks with 700-800k slots
- Main activities: Run 2 data and MC reprocessing
- Including running MC reprocessing on 50% of the HLT farm
- Another CA update (SlovakGrid) means all dCache services need to be restarted (same as with Swiss and Brazilian updates last year)
- Switch to the IAM VOMS server seemed to go smoothly; it required some cleanup of tools still using legacy (non-RFC) proxies
- AGIS servers were shut down last week
- Problems with slow transfers to and from RAL, hard to debug (GGUS:154436)
- SRR storage reporting is shaky, especially at dCache sites; several times storage filled up because the SRR was not up to date.
- Planning a Run 3 commissioning data transfer test ~end Feb/begin March involving full T0 and export to T1 tapes
CMS
- running smoothly with 300-350k cores
- no significant issues during the holidays
- the transatlantic Fermilab-CERN link was down to the tertiary 20 Gbit/s link during most of the holidays
- waiting for the last pieces of information about the chain of responsibility for the machine at CERN whose failure caused the issue
- internal saturation for analysis jobs during the holidays
- traced to a large number of jobs with a low number of sub-jobs
- usual production/analysis split of 3:1
- HPC allocations contributing up to 30k cores
- 2021 allocations for machines in the US all consumed well before the end of the allocation period
- production activity mainly Run 2 ultra-legacy Monte Carlo
- re-reconstruction of parked B data, 11B of 12B processed
- SRM+WebDAV commissioning at Tier-1 sites started
- accidental deletion of SAM/HC datasets middle of December
- at about a third of sites
- all files restored by middle of January
- HammerCloud instabilities for several weeks
- job status queries failed, causing multiple jobs and an empty status page; corrected
- no new jobs being submitted for series; still being investigated
- working on updating HC jobs for Run 3 software/input datasets
- a big thanks to all sites contributing above the pledge!
- this is much appreciated while sites struggle to get new machines
- a very welcome boost of the CMS physics program
LHCb
- smooth running over Xmas break and in January
- using 140-160k cores
- the Run 2 re-processing campaign ended on Dec 31, 2021 (!)
- simulation jobs at 95%, user jobs at 5%
- some data movements / replicas to
- recover disk space at PIC
- deal with long-term downtime of storage at CBPF
- planning tape throughput tests at Tier0
- a tape read test at the Tier 1 sites should also be planned
Task Forces and Working Groups
GDPR and WLCG services
- Updated list of services
- A review of the status of publishing the CERN RoPOs for WLCG services hosted at CERN, and the WLCG Data Privacy Notice for other WLCG services, was given at the January WLCG MB. We were asked to go ahead and accelerate this process; WLCG Ops Coordination will follow up.
Accounting TF
dCache upgrade TF
Information System Evolution TF
- WLCG CRIC has been bootstrapped with initial information for the network topology. Tickets will be submitted to the sites asking them to validate this data.
IPv6 Validation and Deployment TF
Detailed status here.
Monitoring
- The kick-off meeting of the Monitoring Task Force took place on the 13th of January; the main directions of work were agreed. A Jira project, WLCGMONTF, has been created to follow up on the progress.
Network Throughput WG
- perfSONAR infrastructure - 4.4.2 is the latest release
- WLCG/OSG Network Monitoring Platform
- Weekly meetings focusing on operations and infrastructure and alerts/alarms improvements
- 100G perfSONAR meeting took place last week
- We have agreed to tune up performance in order to reach ~10% of capacity (right now it is ~3-5%)
- Recent and upcoming WG updates:
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
WG for Transition to Tokens and Globus Retirement
- Progressing via Authorization WG meetings
Action list
Specific actions for experiments
Specific actions for sites
AOB
--
JuliaAndreeva - 2022-01-25
Topic revision: r12 - 2022-01-28 - MaartenLitmaath