WLCG Operations Coordination Minutes, Sep 14, 2023
Highlights
Agenda
https://indico.cern.ch/event/1325406/
Attendance
- local: Ewoud (WLCG), Maarten (ALICE + WLCG), Panos (WLCG)
- remote: David B (IN2P3-CC), David C (Technion), Domenico (WLCG), Giuseppe (CMS), Ivan (ATLAS), Mark (LHCb + Birmingham), Matt (Lancaster), Petr (ATLAS + Prague), Stephan (CMS)
- apologies: Julia (WLCG)
Operations News
- Security
- Two named vulnerabilities affect popular data center CPUs
- Mitigations and/or updates may result in performance degradation
- No reports of serious effects so far
- EGI sites have been alerted by the EGI Software Vulnerability Group
- The advisories have also been forwarded to the OSG security team
- The next meeting is foreseen for Oct 12
- Please let us know if that date would be very inconvenient
Special topics
Progress report for the Unified Experiment Monitoring
please see the presentation
Discussion
- Stephan:
- how did the project come about?
- input comes from VO sources with possibly different definitions
- e.g. CPU efficiency, data transferred
- Ewoud:
- indeed, the sources may have different definitions
- however, there are common metrics across all experiments
- it has been laborious to collect those e.g. for RRB reports
- to make that easier, we unify those metrics in MONIT
- Stephan:
- it may be tricky to compare metrics with different definitions
- all RRB metrics should be agreed and vetted by the experiments
- Domenico:
- most of the data come from the detailed monitoring by the experiments,
but only high-level views of basic metrics are used for the reports
- to allow that to be done, we need to get the data into Grafana first
- ATLAS and CMS have already been there for years
- LHCb wanted to rely more on MONIT instead of DIRAC for monitoring
- ALICE will instead expose their relevant data sources
- aggregates for all experiments are then made available at RRB level
- Stephan:
- we may want to expose only a subset of the data for use outside the VO
- Domenico:
- the VO data is only accessible to the Grafana admins
- agreements were made with the experiments to extract only what is useful
- there is no interest in further details
- Stephan:
- the experiments may need to be more involved with such matters,
which also should be adequately documented
- mind there may be a different operations team in 5 years
- Domenico:
- these matters were driven by the LHCC, discussed in a pre-GDB and an MB
- Maarten:
- these RRB report processes have been ongoing for years
- since the advent of WLCG CRIC, we have been streamlining them more and more
- the reports should show the usage as perceived by the experiments
- the relevant data could be exported to another Grafana dashboard,
but it is much easier to give limited access to the sources
- Stephan:
- will check if there are any concerns on the CMS side
- we may like to have these workflows defined more formally
- Maarten:
- that would be good, thanks!
Token Trust and Traceability WG
please see the presentation
Discussion
- Maarten:
- the presented goals were deemed better handled in this new WG,
to allow the AuthZ WG to focus just on technical matters
Middleware News
- Useful Links
- Baselines/News
- Implications of the changing Linux landscape and potential
courses of action have been discussed in the Linux Future
Committee public meeting on 11 September
- And in the GDB on 13 September
- To be continued...
Tier 0 News
- ETF is a supported service with a temporary vacancy that is soon expected to be filled
- The previous experts have moved to other activities within the IT Department
- Development is on hold until the new service manager has got up to speed
Discussion
- Petr:
- mind we will need to add new probes for SE tests with tokens for DC24
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Moderate to normal activity levels, no major issues
- CVMFS 2.11.0 fixes an important issue affecting at least ALICE jobs
- The CVMFS daemons will reuse file descriptors for files already open
- Huge reductions in open file descriptors on big WNs!
- A new parameter needs to be set: CVMFS_CACHE_REFCOUNT=1
- Expected to become the default in a future version
- Volumes need to be remounted for full, instantaneous effect
- On earlier versions the workaround is to increase another parameter: CVMFS_NFILES=1048576 (for example)
- Resumed switching sites from single-core to multicore jobs
- ~90% of the resources already support 8-core or whole-node jobs
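Concretely, the CVMFS settings above would go into the local client configuration. A minimal sketch, assuming the usual client config file path; this is illustrative, not an official recipe:

```shell
# /etc/cvmfs/default.local -- client config sketch (CVMFS >= 2.11.0)
# Reuse file descriptors for files that are already open
# (expected to become the default in a future version):
CVMFS_CACHE_REFCOUNT=1

# On clients older than 2.11.0, the workaround is instead to raise
# the file descriptor limit, e.g.:
# CVMFS_NFILES=1048576
```

As noted above, the mounted repositories need to be remounted for the refcount setting to take full, instantaneous effect.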
ATLAS
- Smooth running with 500-850k slots on average
- Healthy job mix, with lots of Fast Simulation
- Big VEGA and Point1 contribution again
- BOINC has been resuscitated
- Smooth data transfers
- General Third-Party-Copy going at 50 GB/s, with a few peaks above 100 GB/s, mostly due to scheduled Data Consolidation activities
- Not much T0 Export in the last month
- In principle ready for Heavy Ion data taking and export
- Switch to x86_64 v2 complete
- New software releases switched to v2 compilation
- ARM Cluster in Glasgow up and running
- 1.5k slots running MC Sim 23 / physics validation
- Google project phase 2 coming to an end
- Site will be decommissioned this week
- Follow-up project under discussion, expected to start in early 2024
- Taiwan migration from T1 to US T2 being prepared
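Whether a worker node meets the x86_64-v2 baseline mentioned above can be checked against the CPU flags in /proc/cpuinfo. A minimal sketch; the flag set follows the x86-64 psABI microarchitecture levels, and the helper names are illustrative:

```python
# Sketch: check whether a Linux host's CPU meets the x86-64-v2 baseline.
# The flag set below is the v2 feature level from the x86-64 psABI;
# note that /proc/cpuinfo reports SSE3 as "pni".
V2_FLAGS = {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}


def supports_x86_64_v2(cpu_flags):
    """True if the given CPU flag collection covers the v2 baseline."""
    return V2_FLAGS <= set(cpu_flags)


def host_flags(path="/proc/cpuinfo"):
    """Parse the 'flags' line of the first CPU entry (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return line.split(":", 1)[1].split()
    return []


if __name__ == "__main__":
    print("x86-64-v2 supported:", supports_x86_64_v2(host_flags()))
```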
CMS
- preparing for Heavy-Ion data recording (high rate)
- plan is to use CERN tape as short-term buffer, i.e. recall data (for transfer to Tier-1s) shortly after writing
- overall smooth running, no major issues
- spiky core usage between 180k and 480k cores
- bursts of significant HPC/opportunistic contributions
- usual production/analysis split of about 3:1
- high/medium priority production requests worked down in early September, resulting in an unusual workflow mix
- main production activity Run 2 ultra-legacy and HL-LHC Monte Carlos
- waiting on python3 version/port of HammerCloud
- campaign to check that the CVMFS cache size at sites is sufficient continues
- working on SRM to REST migration for Tape endpoints
- work with our DPM sites to migrate to other storage technology continues
- token migration progressing steadily
- almost ready to start working with dCache sites
- looking forward to 24x7 production IAM support by CERN
- one of the legacy VOMS shutdown hurdles
- need a glideinWMS and ETF fix to move forward with our new SAM worker node tests
- what is the status with KISTI? Do CERN, EGI and WLCG agree on re-using the site?
Discussion
- Maarten:
- will follow up regarding the T2 site at KISTI
LHCb
- Significant data taking should hopefully start over the next few months, meaning a significant ramp-up of CPU and storage operations
- Continuing with mostly MC running until then
- There will be a repeat of the Data Challenge with CNAF in the first week of October
- Progress also being made on long-standing RAL issues (thanks for all the work here!)
Task Forces and Working Groups
GDPR and WLCG services
Accounting TF
Migration from DPM to the alternative solutions
Information System Evolution TF
IPv6 Validation and Deployment TF
A status report was presented in the GDB on 13 September. Detailed status here.
Monitoring
Network Throughput WG
WG for Transition to Tokens and Globus Retirement
Action list
Specific actions for experiments
Specific actions for sites
AOB