WLCG Operations Coordination Minutes, May 4, 2023

Main points

Agenda

https://indico.cern.ch/event/1282064/

Attendance

  • local:
  • remote: Alastair (RAL), Alessandra (Napoli), Eric (IN2P3), Julia (WLCG), Maarten (ALICE + WLCG), Mario (ATLAS), Matt (Lancaster), Panos (WLCG), Stephan (CMS), Thomas (DESY)
  • apologies:

Operations News

  • on Sat April 22, the CERN SSO was unavailable for about 2.5h starting 13:20 CEST
    • at that time the old CERN Grid CA certificate expired
    • though the currently valid certificate had been available for a year,
      the old one was still used in a number of places
    • the SSO was one such example (OTG:0076975)
    • there was a lot of fall-out, some of which is reported below
    • a comprehensive service incident report is being worked on
    • a presentation about the matter is foreseen for our next meeting
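A recurring theme of the incident was hosts still holding the expired CA file. As a minimal sketch (not the actual CERN procedure), openssl can tell an old CA file from its reissued successor and flag imminent expiry; two throwaway self-signed certificates with the same subject stand in here for the expired and current CERN Grid CA files:

```shell
# Generate two throwaway self-signed certs with the same subject DN,
# standing in for the old and new copies of a CA file:
for n in old new; do
    openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=CERN Grid CA" \
        -days 365 -keyout /tmp/demo-$n.key -out /tmp/demo-$n.pem 2>/dev/null
done

for c in /tmp/demo-old.pem /tmp/demo-new.pem; do
    echo "== $c"
    # same subject, but serial differs per issuance, so the copies
    # can be told apart even when the file names match
    openssl x509 -noout -subject -serial -enddate -in "$c"
    # -checkend N exits non-zero if the cert expires within N seconds
    openssl x509 -checkend $((30*24*3600)) -noout -in "$c" >/dev/null \
        || echo "WARNING: expires within 30 days"
done
```

Run periodically (e.g. from cron) against the real CA files, such a check would have flagged hosts still carrying the old certificate well before it expired.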

  • our next meeting is planned for June 1st

Special topics

RAL-LCG2 Network Outage October 2022

see the presentation

Discussion

  • Stephan:
    • 3 out of 4 switches failed - what was the correlation?
  • Alastair:
    • the Z9100 going down triggered crashes of the 2 others
    • as that HW was old, such trouble could have happened any time
    • the DB switch was close to dying as well
    • we therefore decided to move all affected links to newer HW

  • Julia:
    • was it needed to make all changes in a single day?
  • Alastair:
    • originally, a plan in stages was foreseen, to minimize the risks
    • given the perilous state of the legacy network, however,
      it was decided to do all the operations in one go instead

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Moderate activity levels, no major issues
  • Tokens for normal jobs are already in use at most HTCondor CE sites
  • Continuing with switching sites from single- to multicore jobs
    • ~85% of the resources are already supporting 8-core or whole-node jobs

ATLAS

  • Smooth running with 500-750k slots on average with lots of Full Simulation
  • CERN Grid CA certificate expiration
    • Ticketing and communication left a lot to be desired
    • ATLAS operations people (WFMS & DDM) had to investigate and put workarounds in place during the weekend to stabilise the situation
    • Had to insist on releasing the new RPM on Monday rather than waiting another week
  • 2022 data reprocessing coming soon

CMS

  • recording collision data
  • overall smooth running, no major issues
    • good core usage between 350k and 450k cores
    • significant HPC/opportunistic contributions
    • usual production/analysis split of about 3:1
    • main production activity Run 2 ultra-legacy Monte Carlo
  • messy CERN Grid CA certificate expiration/update
    • various CMS services down for up to three and a half days
    • it seems the yum repos of old OSes were not updated and the RPM was missing the execution of a bundle update
  • waiting on python3 version/port of HammerCloud
  • working with our DPM sites to migrate to other storage technology
  • token migration progressing steadily
    • working with native xrootd sites to enable IAM-issued token support
  • looking forward to 24x7 production IAM support by CERN

Discussion

  • Stephan:
    • the InCommon CA will also expire this year
  • Maarten:
    • it is normal for a number of CAs to expire every few years
    • new certificates normally are made available at least 1 month in advance
    • some of our services then need to be restarted to pick them up
      • examples include versions of dCache, StoRM, Argus and VOMS-Admin
        • all using older versions of the canl-java library
    • the CERN Grid CA is used for T0 services and thus impacts a lot more
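Since, as noted above, some CAs expire every few years and affected services then need restarts, a simple periodic audit of the trust store (normally /etc/grid-security/certificates) helps with planning. The sketch below demonstrates the idea on a temporary directory with two generated stand-in CAs, flagging any that expire within a year:

```shell
# Sketch of a trust-store audit; in production the directory would be
# /etc/grid-security/certificates. Demo uses a temporary directory with
# two self-signed stand-ins (one short-lived, one long-lived).
d=$(mktemp -d)
for days in 10 3650; do
    openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo-ca-$days" \
        -days $days -keyout "$d/ca-$days.key" -out "$d/ca-$days.pem" 2>/dev/null
done

for c in "$d"/*.pem; do
    # -checkend exits non-zero if the cert expires within the window,
    # here 365 days; only such CAs are reported
    if ! openssl x509 -checkend $((365*24*3600)) -noout -in "$c" >/dev/null; then
        echo "expiring within a year: $c ($(openssl x509 -enddate -noout -in "$c"))"
    fi
done
```

Only the short-lived stand-in is reported; a year's notice leaves time to deploy the successor CA and schedule the service restarts mentioned above.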

  • Maarten:
    • what were those old OSes?
  • Stephan:
    • e.g. CentOS Stream 8
  • Maarten:
    • will check what can be done there

LHCb

Impact on experiment operations of the CERN Grid CA expiration incident

  • the related SSO outage is described above

ALICE

  • Data taking:
    • Only impacted by the SSO being dysfunctional:
      authentication for controls, monitoring, logging and documentation.
    • Last-resort workarounds needed to be applied.
    • If the beam had been lost, there might have been further fall-out.
    • Fortunately, Mattermost kept working, as well as unauthenticated Zoom.

  • Grid operations:
    • VOboxes at some sites had outdated CAs and failed in various ways.
    • CVMFS had outdated CAs in several places that were quickly fixed.
    • Users had outdated CAs in local analysis installations,
      for which recipes were subsequently provided by experts.

ATLAS

  • CERN Grid CA certificate expiration
    • Ticketing and communication left a lot to be desired
    • ATLAS operations people (WFMS & DDM) had to investigate and put workarounds in place during the weekend to stabilise the situation
    • Had to insist on releasing the new RPM on Monday rather than waiting another week

CMS

  • messy CERN Grid CA certificate expiration/update
    • various CMS services down for up to three and a half days
    • it seems the yum repos of old OSes were not updated and the RPM was missing the execution of a bundle update

LHCb

  • Online
    • SSO unavailability affected users who wanted to connect to web-services
    • no other major issues
  • Offline
    • little impact
    • some web pages started to fail, quickly fixed

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • APEL client supporting HEPScore benchmark should be ready for testing next week
  • APEL server with aggregation by benchmark name is expected to be released in the week of May 22

Information System Evolution TF

  • NTR

DPM migration to other solutions

  • 13 tickets out of 45 are solved. Some of the sites that migrated still need to enable SRR and fix their information in CRIC

HEPScore in production

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

Network Throughput WG


WG for Transition to Tokens and Globus Retirement

Discussion

  • Maarten:
    • in the next months we will need to do another campaign
      to get all HTCondor CEs upgraded to v6 using HTCondor 10.x
  • Thomas:
    • has APEL been made to work for those newer versions?
    • mind that it currently relies on X.509 variables being set for jobs
  • Maarten:
    • good point, we will need to see what can be done by when
    • unfortunately only X509 attributes are considered by default (updated May 11):
      VOMS attributes are no longer made available in 10.x by default (ditto)
    • in particular, the VO is no longer set by default (ditto)
  • Julia:
    • we will discuss this matter with the APEL team
  • Maarten:
    • will bring this up also in our meeting with the HTCondor devs and EGI tomorrow
  • Stephan:
    • also ARC CEs need to be considered in this respect
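The attribute loss discussed above can at least be probed in the CE configuration. The fragment below is a hypothetical sketch only: USE_VOMS_ATTRIBUTES is the knob that historically controlled VOMS FQAN extraction in HTCondor, and whether it still has any effect in 10.x is precisely what needs confirming with the developers in tomorrow's meeting.

```
# Hypothetical HTCondor(-CE) configuration sketch, NOT a verified fix.
# USE_VOMS_ATTRIBUTES historically told HTCondor to extract VOMS FQANs
# into the job ad (e.g. the x509* attributes that APEL accounting reads).
# Its behaviour under HTCondor 10.x must be confirmed with the developers.
USE_VOMS_ATTRIBUTES = True
```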

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

Topic revision: r11 - 2023-05-31 - ConcezioBozzi
 