WLCG Operations Coordination Minutes, March 3, 2022
Main points
Agenda
https://indico.cern.ch/event/1133785/
Attendance
- local:
- remote: Alessandra D (Napoli), Andrea (WLCG), Andrew (TRIUMF), Borja (monitoring), Christoph (CMS), Concezio (LHCb), David Cameron (ATLAS + ARC), David Cohen (Technion), Eric (IN2P3), Giuseppe (CMS), Julia (WLCG), Maarten (ALICE + WLCG), Matt (Lancaster), Max (KIT), Miltiadis (WLCG), Panos (WLCG), Pepe (PIC), Romain (WLCG), Shawn (AGLT2 + networks), Simone (WLCG), Stephan (CMS), Thomas (DESY)
- apologies:
Operations News
- the next meeting is planned for April 7
Special topics
Impact of the war in Ukraine on WLCG
see the presentation
Discussion
- Simone:
- the CERN Council will meet on March 8, more news after that
- Thomas:
- what will the mailing list be used for?
- Simone:
- the list is there for sites to inform us about policies by which they have to abide and for advice on how to implement them
- the list is not for sites to get news; other channels will be used for that
- Romain:
- matters can also be escalated via security contacts
Tokens & Globus update
see the presentation
Discussion
- Thomas:
- what about other VOs in EGI?
- might we need to set up ARC CEs to keep supporting X509 for them?
- Maarten:
- EGI have launched a survey for other VOs to consider these matters
- some may imitate WLCG and move to IAM
- others may switch to EGI Check-in
- grid services will need to support multiple token providers
- if those providers share a common basis, that should not be difficult
- after the meeting:
- the AuthZ WG has EGI Check-in reps
- the common basis is the AARC Blueprint Architecture
- Thomas:
- VOs like ILC and Belle-II use DIRAC and hence should be fine?
- Maarten:
- indeed, as DIRAC will be made to work for LHCb anyway
- DIRAC also is the default framework for small VOs in EGI
- Concezio:
- a DIRAC release supporting tokens is in the making
- David Cameron:
- HTCondor-G can still use X509 with the ARC CE REST interface
- after the meeting: the slides have been corrected
- will the development for the longer token lifetimes be ready on time?
- Maarten:
- to be followed up in the AuthZ WG
Middleware News
- Useful Links
- Baselines/News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Mostly business as usual, no major incidents
- Run-3 preparations continuing
- ~90% of the VOboxes switched from legacy AliEn to new JAliEn services
- Fraction of 8-core jobs to be ramped up for Run-3 workflows
- Most sites should only receive 8-core jobs during Run 3
ATLAS
- Mostly smooth running with an average of 700k cores
- This may go down soon with fewer opportunistic resources available (EuroHPCs and HLT farm)
- Run 2 data and MC reprocessing campaigns effectively done, just following up a few remaining problematic tasks
- Tape challenge starting on 14 March for two weeks
- Still issues with out-of-date storage reporting at dCache sites
Discussion
- Julia:
- the dCache SRR instabilities are being looked into
CMS
- running smoothly with 320-400k cores
- usual production/analysis split of 3:1
- up to 95k cores of non-pledged contribution (40k on average)
- utilization of US HPC allocations on track/ahead of schedule
- production activity mainly Run 2 ultra-legacy Monte Carlo
- new large pile-up library being made; I/O limits of site storage were reached, resulting in CPU inefficiencies
- Tier-0 activities
- successful large scale test (P5-->Meyrin, processing, writing to tape)
- 9.1 GB/s processing reached (48 hour average), enough even for HeavyIon
- cosmic ray data taking for Run 3 commissioning started
- upgrade of HammerCloud test jobs for Run 3 software/input datasets on hold
- need python3 version/port of HC, developer estimate: several weeks
- WebDAV commissioning ongoing
- SRM+WebDAV ready at all but one Tier-1 site
- endpoint check/commissioning at Tier-3 sites in progress
- CMSWeb service upgraded to accept tokens
- Token commissioning for HTCondor CEs in progress
- waiting for HTCondor interface for ARC CEs
- preparing for the tape challenge later in March
- successful transfer tests to PIC and FNAL
- Thanks to all sites who made their 2022 pledge already available!
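As a rough illustration of the core counts reported above, the 3:1 production/analysis split can be applied to both ends of the 320-400k range. This is a minimal sketch: the function name `split_cores` is hypothetical, and it assumes the ratio holds exactly across the whole range, which the minutes do not state.

```python
def split_cores(total_cores: int, production_parts: int = 3,
                analysis_parts: int = 1) -> tuple[int, int]:
    """Split a core count by a production:analysis ratio (3:1 by default)."""
    # Integer share for production; analysis gets the remainder.
    production = total_cores * production_parts // (production_parts + analysis_parts)
    return production, total_cores - production


# Both ends of the reported 320-400k core range.
for total in (320_000, 400_000):
    production, analysis = split_cores(total)
    print(f"{total} cores -> {production} production, {analysis} analysis")
```

Under that assumption, production would occupy roughly 240-300k cores and analysis roughly 80-100k.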
LHCb
- Running at 150-170k cores, no major issues
- lots of WebDAV transfer failures involving GridKa (GGUS:156238)
- side effect due to CMS "putting storage systems to their limits"
- Transfers from P8 to CERN Tier0 performed this week
- more than 2PB transferred over two days
- 1.6x nominal throughput sustained
- some further optimisations possible from LHCb online
- a couple of issues with CTA (imbalance between the nodes, and wrong archival reports) being followed up
- plan is to keep data on EOS and use them as input for the next data challenge (3rd and 4th week of March)
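For scale, the P8 to Tier-0 transfer volume quoted above can be converted into an average rate. This is a minimal sketch assuming decimal units (1 PB = 10^6 GB) and exactly 48 hours of transfer; since the minutes say "more than 2PB", the result is a lower bound. The function name `average_rate_gb_per_s` is hypothetical.

```python
def average_rate_gb_per_s(volume_pb: float, duration_hours: float) -> float:
    """Average transfer rate in GB/s, using decimal units (1 PB = 1e6 GB)."""
    volume_gb = volume_pb * 1_000_000
    return volume_gb / (duration_hours * 3600)


# 2 PB over two days (assumed to be exactly 48 hours).
rate = average_rate_gb_per_s(2.0, 48.0)
print(f"{rate:.1f} GB/s")
```

This gives an average of at least about 11.6 GB/s over the two days.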
Discussion
- Stephan:
- regarding the fallout from CMS activities:
- the production team were too optimistic about how many of those "heavy" jobs could be run concurrently, sorry!
- the last ones should finish by the weekend
- such productions will be controlled better from now on
Task Forces and Working Groups
GDPR and WLCG services
Accounting TF
- Meeting with the experts to discuss the status of preparation for the integration of the new benchmark in the accounting workflow. Will be presented at the GDB next week
Information System Evolution TF
- Validation of the network information in CRIC is progressing well, still ongoing.
IPv6 Validation and Deployment TF
Detailed status here.
Monitoring
- New XRootD Monitoring components
- XRootD Shoveler is ready to be used to send data to CERN AMQ from non-OSG sites
- XRootD Collector patches are being developed and tested
- Both components need to be ready before first test flows can be established at non-OSG sites
- Agreed on the minimum required schema for transfers to be meaningful
- Meeting with dCache developers held to discuss viability of required fields
- Agreement for some of these fields (activity and vo) to be discussed on a higher level as "scitags" since they are needed for other purposes as well
- Follow up discussions with other developers (XRootD, MonALISA, ...) to be planned/held
- Defined first "Network Monitoring" template draft, to be filled by T1s in the near future
- First iteration will be done with AGLT2 to check how complete it is and what needs improvement
Network Throughput WG
WG for Transition to Tokens and Globus Retirement
see the special topic
Action list
Specific actions for experiments
Specific actions for sites
AOB
Topic revision: r15 - 2022-03-08 - MaartenLitmaath