WLCG Operations Coordination Minutes, Sep 1, 2022

Highlights

Agenda

https://indico.cern.ch/event/1188546/

Attendance

  • local:
  • remote: Alessandra (Napoli), Alessandro (EGI), Andrea (WLCG), Andrew (TRIUMF), Christoph (CMS), David Cameron (ATLAS + ARC), David Cohen (Technion), Doug (BNL), Giuseppe (CMS), Maarten (ALICE + WLCG), Marian (networks + ETF), Mark (LHCb + Birmingham), Panos (WLCG), Pavel (KIT), Shawn (networks + AGLT2), Stefano (CNAF), Stephan (CMS), Thomas (DESY), Xavier (KIT), ufa
  • apologies: Julia (WLCG)

Operations News

  • the next meeting is planned for Sep 29

Special topics

Future WLCG helpdesk

see the presentation

Discussion

  • Alessandro:
    • there also are requirements from EOSC-Future and other parties
    • we need to keep track of all requirements together somewhere
    • the EGI JIRA could be used for that
    • this presentation should also be given to the EGI OMB
  • Pavel:
    • we should indeed merge the requirements in some way
    • a presentation to the EGI OMB can be scheduled

  • David Cohen:
    • what will happen to the GGUS ticket history?
    • many tickets may provide a valuable knowledge base
  • Maarten:
    • we consider preserving them as static HTML pages for a while
    • however, there are e.g. GDPR issues to be concerned about
    • and the tickets' relevance will steadily decline over time
  • Pavel:
    • the new helpdesk has knowledge base functionality
    • some of its knowledge might be imported from GGUS

  • Maarten:
    • the new helpdesk will need to have several features for WLCG
    • we will not just copy everything that GGUS can do today
    • some things were never really used and can finally be dropped
    • we may also come up with new things to consider
    • we will organize followup meetings for experiment input etc.
  • Pavel:
    • we have a Google Doc and Sheet for first proposals etc.
    • from that we derive a prioritized list of features to be implemented
    • we expect to have a first version deployed early next year
  • Maarten:
    • we then will start trying it out with some support units
    • we need to have everything working well before the end of 2024,
      when the Remedy version used by GGUS becomes unsupported
    • GGUS will disable ticket submission when the new helpdesk is
      sufficiently ready to take over
    • old tickets are handled in GGUS until it has to be shut down

  • Alessandro:
    • the evolution of the new helpdesk and the transition from GGUS
      are funded through EGI-ACE etc.

KIT storage SIR

see the presentation

Discussion

  • Maarten:
    • the incident looks mostly due to human error?
  • Xavier:
    • well, it should also be noted that dCache had received
      return codes that suggested those files were safe on GPFS,
      when in fact they had only been successfully received by
      the GPFS buffer from which they are sent to disk servers
  • Maarten:
    • maybe that is due to configuration, but receiving confirmation
      only when the files are on the disk servers might well imply
      a big, possibly unacceptable impact on performance
    • we can live with such occasional, small-scale (!) incidents
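
The trade-off discussed above can be illustrated with a toy model. This is only a sketch under stated assumptions: it does not reproduce GPFS or dCache behaviour, and the class and function names (`BufferedStore`, `lost_after_crash`) are hypothetical. It contrasts acknowledging a write once the buffer has received it with acknowledging only after the data has reached the disk servers.

```python
class BufferedStore:
    """Toy model of a buffer in front of disk servers (hypothetical names)."""

    def __init__(self, ack_on_flush: bool):
        self.ack_on_flush = ack_on_flush
        self.buffer = {}  # files received but not yet on the disk servers
        self.disk = {}    # files that have reached the disk servers

    def write(self, name: str, data: bytes) -> bool:
        """Return True when the store reports the file as safe."""
        self.buffer[name] = data
        if self.ack_on_flush:
            # Safer but slower: the caller waits for the flush to finish.
            self.flush()
        return True

    def flush(self) -> None:
        """Move everything from the buffer to the disk servers."""
        self.disk.update(self.buffer)
        self.buffer.clear()

    def crash(self) -> None:
        """Lose whatever is still only in the buffer."""
        self.buffer.clear()


def lost_after_crash(ack_on_flush: bool) -> list:
    """Names of files that were acknowledged but are missing after a crash."""
    store = BufferedStore(ack_on_flush)
    acked = [name for name in ("a", "b") if store.write(name, b"x")]
    store.crash()
    return [name for name in acked if name not in store.disk]
```

In this model, acknowledging on buffer receipt loses both acknowledged files on a crash, while acknowledging after the flush loses none, at the price of making every write wait for the flush.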

Middleware News

  • Useful Links
  • Baselines/News
    • we intend to launch a short questionnaire about the
      experiences, if any, that sites have gathered on the
      various EL8- and possibly EL9-compatible distributions
      • results to be presented in Ops Coordination or the GDB

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Mostly business as usual

ATLAS

  • Smooth running with 500-700k cores, including the HLT farm since beam stopped last week
  • Removing GridFTP support completely from the BNL tape systems revealed some dependencies in FTS; these will be fixed in a release to be deployed next week
  • Robot cert used for all Rucio transfers will change owner and hence DN. We will do thorough testing and notify sites of any problems.
  • All ATLAS jobs have used Apptainer instead of Singularity since 4 August (ATLAS-maintained version in CVMFS with fallback to local install)
  • Some concerns from sites about enabling dCache SRR API, since it exposes a lot of admin information too

Discussion

  • Thomas:
    • not clear if the issue is with the dCache SRR or REST interface?
  • Xavier:
    • indeed
  • David Cameron:
    • to be followed up
  • Maarten:
    • maybe open an issue in the dCache tracker as needed?
    • will inform Julia of this potential concern

CMS

  • August was a rather quiet month, very few issues for CMS
  • running smoothly between 350k and 450k cores
    • usual production/analysis split of 75% and 25%
    • significant contribution from HPCs up to 70k cores
    • main production activity: Run 2 ultra-legacy Monte Carlo
  • VOC working with groups on CERN VM migration campaign
  • waiting on Python 3 version/port of HammerCloud

LHCb

  • Generally smooth running
  • No significant issues continuing from the summer
  • Significant progress on a couple of problematic and long-standing tickets (many thanks to those involved!):
    • GGUS:155120 Slow deletions at RAL - XRoot proxies found to serialise deletion requests. Config fix incoming
    • GGUS:153653 Socket timeouts at NL-T1. Found to be due to ROOT attempting to open all files at once and hitting limits on single storage pool nodes. Config changed to increase the number of connections, and a bug report was opened against hadd (https://github.com/root-project/root/issues/11276)
  • With some fluctuations, have been running nearly 200K jobs reliably all summer, mostly MC generation
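
As an illustration of the GGUS:153653 failure mode, opening all input files at once, the sketch below merges inputs in bounded batches so that no single step touches more than a fixed number of files. It is a generic sketch, not the fix applied at NL-T1 (which, per the item above, was a configuration change) and not how hadd works internally. The function name `merge_in_batches`, the `merge` callback, and the temporary file names are all hypothetical.

```python
from typing import Callable, List, Sequence


def merge_in_batches(
    inputs: Sequence[str],
    merge: Callable[[List[str], str], None],
    max_open: int,
    final_name: str = "merged.out",
) -> str:
    """Merge `inputs` into `final_name`, never passing more than
    `max_open` files to a single `merge` call.

    `merge(batch, out)` is a caller-supplied (hypothetical) function
    that combines the files in `batch` into the file `out`.
    """
    if max_open < 2:
        raise ValueError("max_open must be at least 2")
    if not inputs:
        raise ValueError("no inputs given")

    current = list(inputs)
    round_no = 0
    # Keep reducing until the remaining files fit into one merge call.
    while len(current) > max_open:
        next_round = []
        for i in range(0, len(current), max_open):
            batch = current[i:i + max_open]
            if len(batch) == 1:
                # A lone leftover file is carried over unchanged.
                next_round.append(batch[0])
                continue
            out = f"tmp_r{round_no}_{i // max_open}.out"
            merge(batch, out)
            next_round.append(out)
        current = next_round
        round_no += 1

    merge(current, final_name)
    return final_name
```

With 10 inputs and `max_open=4`, this performs three intermediate merges (of 4, 4 and 2 files) followed by one final merge of the 3 intermediate results. Cleaning up the temporary files is left out for brevity.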

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

dCache upgrade TF

Information System Evolution TF

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

  • We remind the experiments and the xrootd and dCache development teams to provide input on the draft minimum schema for transfer monitoring, which should be used for all WLCG transfers.
    • https://docs.google.com/document/d/1tBECfGHk4AybPoorpEe2WiBwYH9zodv-4shiW1RGUv4/edit
    • The deadline is the 30th of September

Network Throughput WG


  • perfSONAR infrastructure - 4.4.4 is the latest release
  • WLCG/OSG Network Monitoring Platform
    • New alarms detecting LHCOPN/LHCONE routing changes and sudden drops in throughput are now available
    • Each alarm comes with a link to a dashboard that shows all the details
    • On-going development and testing of the direct publishing of measurements from perfSONARs
  • Recent and upcoming WG updates:
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

WG for Transition to Tokens and Globus Retirement

  • on Aug 22, the AuthZ WG published v1.0 of the WLCG Token Transition Timeline
    • contains a summary of what was done in the last 2.5 years
    • describes an optimistic set of milestones to work towards
    • updates will be published from time to time as needed

Action list

Creation date | Description | Responsible | Status | Comments

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

Specific actions for sites

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

AOB

Topic revision: r9 - 2022-09-01 - MaartenLitmaath