WLCG Operations Coordination Minutes, Sep 14, 2023

Highlights

Agenda

https://indico.cern.ch/event/1325406/

Attendance

  • local: Ewoud (WLCG), Maarten (ALICE + WLCG), Panos (WLCG)
  • remote: David B (IN2P3-CC), David C (Technion), Domenico (WLCG), Giuseppe (CMS), Ivan (ATLAS), Mark (LHCb + Birmingham), Matt (Lancaster), Petr (ATLAS + Prague), Stephan (CMS)
  • apologies: Julia (WLCG)

Operations News

  • Security
    • Two named vulnerabilities affect popular data center CPUs
    • Mitigations and/or updates may result in performance degradation
      • No reports of serious effects so far
    • EGI sites have been alerted by the EGI Software Vulnerability Group
      • The advisories have also been forwarded to the OSG security team

  • The next meeting is foreseen for Oct 12
    • Please let us know if that date would be very inconvenient

Special topics

Progress report for the Unified Experiment Monitoring

please see the presentation

Discussion

  • Stephan:
    • how did the project come about?
    • input comes from VO sources with possibly different definitions
      • e.g. CPU efficiency, data transferred

  • Ewoud:
    • indeed, the sources may have different definitions
    • however, there are common metrics across all experiments
    • it has been laborious to collect those e.g. for RRB reports
    • to make that easier, we unify those metrics in MONIT

  • Stephan:
    • it may be tricky to compare metrics with different definitions
    • all RRB metrics should be agreed and vetted by the experiments

  • Domenico:
    • most of the data come from the detailed monitoring by the experiments,
      but only high-level views of basic metrics are used for the reports
    • to allow that to be done, we need to get the data into Grafana first
    • ATLAS and CMS have already been there for years
    • LHCb wanted to rely more on MONIT instead of DIRAC for monitoring
    • ALICE will instead expose their relevant data sources
    • aggregates for all experiments are then made available at RRB level

  • Stephan:
    • we may want to expose only a subset of the data for use outside the VO

  • Domenico:
    • the VO data is only accessible to the Grafana admins
    • agreements were made with the experiments to extract only what is useful
    • there is no interest in further details

  • Stephan:
    • the experiments may need to be more involved with such matters,
      which also should be adequately documented
    • mind there may be a different operations team in 5 years

  • Domenico:
    • these matters were driven by the LHCC, discussed in a pre-GDB and an MB

  • Maarten:
    • these RRB report processes have been ongoing for years
    • since the advent of WLCG CRIC, we are streamlining them more and more
    • the reports should show the usage as perceived by the experiments
    • the relevant data could be exported to another Grafana dashboard,
      but it is much easier to give limited access to the sources

  • Stephan:
    • will check if there are any concerns on the CMS side
    • we may like to have these workflows defined more formally

  • Maarten:
    • that would be good, thanks!

Token Trust and Traceability WG

please see the presentation

Discussion

  • Maarten:
    • the presented goals were deemed better handled in this new WG,
      to allow the AuthZ WG to focus just on technical matters

Middleware News

  • Implications of the changing Linux landscape and potential
    courses of action have been discussed in the Linux Future
    Committee public meeting on 11 September
    • And in the GDB on 13 September
    • To be continued...

Tier 0 News

  • ETF is a supported service with a temporary vacancy that is expected to be filled soon
    • The previous experts have moved to other activities within the IT Department
    • Development is on hold until the new service manager has got up to speed

Discussion

  • Petr:
    • mind we will need to add new probes for SE tests with tokens for DC24

  • Maarten:
    • will point that out

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Moderate to normal activity levels, no major issues
    • CVMFS 2.11.0 fixes an important issue affecting at least ALICE jobs
      • The CVMFS daemons will reuse file descriptors for files already open
        • Huge reductions in open file descriptors on big WNs!
      • A new parameter needs to be set:
         CVMFS_CACHE_REFCOUNT=1 
        • Expected to become the default in a future version
      • Volumes need to be remounted for full, instantaneous effect
    • On earlier versions the workaround is to increase another parameter:
      • CVMFS_NFILES=1048576 (for example)
  • Resuming the switch of sites from single-core to multicore jobs
    • ~90% of the resources already support 8-core or whole-node jobs
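
A minimal sketch of how a site might check which of the two CVMFS settings mentioned above is in effect in a client configuration. It only parses shell-style KEY=VALUE text; the function names, the return labels and the use of 1048576 as a threshold (taken from the example value above) are assumptions for illustration, and the check does not verify that the client is actually at version 2.11.0 or later.

```python
# Hypothetical helper, not part of CVMFS: all names here are illustrative.

def parse_cvmfs_config(text):
    """Parse shell-style KEY=VALUE lines, ignoring comments and blank lines."""
    params = {}
    for line in text.splitlines():
        # Drop trailing comments (assumes values never contain '#').
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if sep:
            params[key.strip()] = value.strip().strip('"').strip("'")
    return params


def fd_mitigation(params, min_nfiles=1048576):
    """Report which file-descriptor mitigation the parameters provide.

    Returns "refcount" if CVMFS_CACHE_REFCOUNT=1 is set, "nfiles" if
    CVMFS_NFILES is at least min_nfiles (an assumed threshold), else "none".
    """
    if params.get("CVMFS_CACHE_REFCOUNT") == "1":
        return "refcount"
    try:
        nfiles = int(params.get("CVMFS_NFILES", "0"))
    except ValueError:
        nfiles = 0
    return "nfiles" if nfiles >= min_nfiles else "none"
```

For example, the text of a local configuration file could be read with open(...).read() and passed to parse_cvmfs_config; the result of fd_mitigation then tells whether either setting is present.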

ATLAS

  • Smooth running with 500-850k slots on average
    • Healthy job mix, with lots of Fast Simulation
    • Big VEGA and Point1 contribution again
    • BOINC has been resuscitated
  • Smooth data transfers
    • General Third-Party-Copy going at 50 GB/s, with a few peaks above 100 GB/s, mostly due to scheduled Data Consolidation activities
    • Not much T0 Export in the last month
    • In principle ready for Heavy Ion data taking and export
  • Switch to x86_64 v2 complete
    • New software releases switched to v2 compilation
  • ARM Cluster in Glasgow up and running
    • 1.5k slots running MC Sim 23 / physics validation
  • Google project phase 2 coming to an end
    • Site will be decommissioned this week
    • A follow-up project is under discussion and expected in early 2024
  • Taiwan migration from T1 to US T2 being prepared

CMS

  • preparing for Heavy-Ion data recording (high rate)
    • plan is to use CERN tape as short-term buffer, i.e. recall data (for transfer to Tier-1s) shortly after writing
  • overall smooth running, no major issues
    • spiky core usage between 180k and 480k cores
    • bursts of significant HPC/opportunistic contributions
    • usual production/analysis split of about 3:1
    • high/medium priority production requests worked down in early September, resulting in an unusual workflow mix
    • main production activities: Run 2 ultra-legacy and HL-LHC Monte Carlo
  • waiting on python3 version/port of HammerCloud
  • campaign to check that the CVMFS cache size at sites is sufficient continues
  • working on SRM to REST migration for Tape endpoints
  • work with our DPM sites to migrate to other storage technology continues
  • token migration progressing steadily
    • almost ready to start working with dCache sites
  • looking forward to 24x7 production IAM support by CERN
    • one of the legacy VOMS shutdown hurdles
  • need a GlideinWMS and ETF fix to move forward with our new SAM worker node tests
  • what is the status with KISTI, do CERN, EGI, WLCG agree on re-using the site?

Discussion

  • Maarten:
    • will follow up regarding the T2 site at KISTI

LHCb

  • Significant data taking should hopefully start over the next few months, meaning a significant ramp-up for CPU and storage operations
  • Continuing with mostly MC running until then
  • There will be a repeat of the Data Challenge with CNAF in the first week of October
  • Progress is also being made on long-standing RAL issues (thanks for all the work here!)

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

Migration from DPM to the alternative solutions

Information System Evolution TF

IPv6 Validation and Deployment TF

A status report was presented in the GDB on 13 September.

Detailed status here.

Monitoring

Network Throughput WG


WG for Transition to Tokens and Globus Retirement

  • The first CE token support campaign on EGI is done
  • A second campaign is to be launched in the next weeks
    • Version 9.0.19 will allow VOMS proxy identities to be mapped via the SSL scheme
      • No wildcards needed
      • VOMS attributes will be ignored
      • Clients relying on SSL mappings also need to use a recent HTCondor version
    • Sites will be asked to replace their legacy GSI mappings accordingly
      • GSI can still be used as a fallback while SSL is being tested
    • When SSL works OK for such legacy workflows,
      a given CE can be upgraded to v6 with HTCondor 10.7.x or higher
    • Those versions also support use of the EGI Check-in token plug-in
      • Configuration details will be provided
    • Mind this HTCondor CE setting for APEL:
       USE_VOMS_ATTRIBUTES = True 
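
A minimal sketch of the upgrade precondition described above: a CE is a candidate for the v6 upgrade only once SSL mappings work for its legacy workflows and the target HTCondor is 10.7.x or higher. The function names and the boolean flag are hypothetical; how a site establishes that SSL mappings "work OK" is outside the scope of this sketch.

```python
# Hypothetical helper, not part of HTCondor or HTCondor-CE.

MIN_HTCONDOR = (10, 7)  # "10.7.x or higher", as stated in the minutes


def parse_version(version):
    """Turn a purely numeric version such as '10.7.1' into (10, 7, 1)."""
    return tuple(int(part) for part in version.split("."))


def ready_for_ce_v6(htcondor_version, ssl_mapping_ok):
    """Return True if both conditions from the campaign plan are met."""
    return ssl_mapping_ok and parse_version(htcondor_version)[:2] >= MIN_HTCONDOR
```

For example, ready_for_ce_v6("10.0.9", True) is False because the version is below 10.7, while ready_for_ce_v6("10.7.0", True) is True.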

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

Topic revision: r8 - 2023-09-21 - MaartenLitmaath