WLCG Operations Coordination Minutes, April 6, 2023

Highlights

  • WLCG Operations Coordination is starting a review of operations activities related to accounting and pledge management. The systems involved are APEL, the EGI accounting portal, CRIC and WAU. We would like to review not just our tools, their functionality and UIs, but also procedures and workflows. We will ask for input from the VOs, sites, the WLCG project office, people dealing with pledge management, and funding agencies (via the experiments and the WLCG project office).

Agenda

https://indico.cern.ch/event/1272843/

Attendance

  • local:
  • remote: Alessandra D (Napoli), Alessandro P (EGI), Andrew (TRIUMF), David B (IN2P3-CC), David C (Technion), David M (FNAL), Domenico (WLCG), Henryk (LHCb + NCBJ-CIS), Julia (WLCG), Maarten (ALICE + WLCG), Mario (ATLAS), Matt D (Lancaster), Matthias (KIT), Renato (EGI), Shawn (networks + AGLT2), Stephan (CMS)
  • apologies:

Operations News

  • GGUS suffered several issues during and after the update on March 29
    • the service was down for 26 hours; a Service Incident Report
      has been added to the WLCG SIR archive
    • for a few days, e-mails were not sent for ticket updates,
      because a critical field was cleared whenever a ticket was updated
    • these issues took time to fix due to the absence of a key expert

  • the next meeting is planned for May 4

Special topics

Network monitoring campaign

see the presentation

Discussion

  • David B:
    • when might we see the UDP fireflies?
  • Shawn:
    • we are working on a new version of flowd for that
    • we expect to ramp up the fireflies by the end of June or early July

  • Julia:
    • after Easter we will open tickets for the target sites

  • David C:
    • what is the amount of effort expected of site admins?
  • Shawn:
    • it depends on local circumstances
    • at AGLT2 it took me about 1 hour
  • David C:
    • not sure if I have access to all the info that is requested
  • Shawn:
    • it is an opportunity for site admins to engage more with their network teams

Switch to HEPscore23: status

see the presentation and the Twiki page

Discussion

  • Stephan:
    • will the results and tables be made publicly available?
    • for sites to compare and make purchasing decisions
  • Domenico:
    • we are looking into which level of access can be given
    • some sites may not want their setups to be disclosed
  • Stephan:
    • it would be like what was done for HS06
    • a simple table about CPU numbers without site information
    • a snapshot could be published every few months e.g. on Twiki
  • Julia:
    • it would also allow sites to check if their own numbers make sense
  • Domenico:
    • such a procedure looks fine to me
    • note that sites should still run HS23 themselves on their own hardware
    • that will also help us discover anomalies
      • very different numbers for the same CPU at different sites

  • Matthias:
    • as the EGI portal shows everything in HS23,
      how can we see which benchmark was used for what?
  • Julia:
    • a new attribute with the name of the benchmark will be added.
      In the EGI accounting portal, there will be a new parameter in the
      drop-down menu for the name of the benchmark used,
      so it will be possible to group and sort by this attribute.
    • it depends on development work in APEL
    • as most sites will mix HS06 and HS23 results behind the same CE,
      the attribute will not be precise
    • but it will allow us to see which sites have started using HS23
  • Matthias:
    • what if we wanted to provide HS06 and HS23 results for all machines?
  • Julia:
    • there has never been strong support for having several benchmarks
      for the same resources
    • APEL currently expects only a single benchmark for a group of resources
    • if both HS06 and HS23 are provided, it will take the HS23 values
    • if multiple benchmarks were really deemed necessary,
      it would imply a lot of work for APEL etc.
  • Matthias:
    • if only KIT is interested, we can just drop the matter
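The grouping and sorting by benchmark name that Julia describes could look roughly like the following. This is an illustrative mock-up, not the actual APEL record schema; the field names, site names and scores are invented:

```python
# Hypothetical sketch of grouping accounting records by a benchmark-name
# attribute, as discussed for APEL and the EGI accounting portal.
# All record fields and values below are illustrative, not the real schema.
from collections import defaultdict

records = [
    {"site": "SITE-A", "cpu": "AMD EPYC 7543", "benchmark": "HEPscore23", "score": 21.5},
    {"site": "SITE-B", "cpu": "AMD EPYC 7543", "benchmark": "HEPscore23", "score": 21.1},
    {"site": "SITE-C", "cpu": "Intel Xeon Gold 6248", "benchmark": "HEPSPEC06", "score": 17.8},
]

# Group records by the (hypothetical) benchmark attribute.
by_benchmark = defaultdict(list)
for rec in records:
    by_benchmark[rec["benchmark"]].append(rec)

# Sites that have started reporting HEPscore23 results.
hs23_sites = sorted({r["site"] for r in by_benchmark["HEPscore23"]})
print(hs23_sites)  # ['SITE-A', 'SITE-B']
```

The same grouping, keyed by CPU model instead of benchmark, would also expose the anomalies Domenico mentions (very different scores for the same CPU at different sites).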

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • ASGC Stratum-1 incident affected 3 ALICE sites in Asia (GGUS:160909)
    • CVMFS out of date for more than 1 week
  • HTCondor CE sites have been informed of the ALICE token configuration details
    • Tokens for normal jobs are already in use at CERN and T1 sites
    • SAM ETF tests are soon expected to make use of tokens as well
  • Continuing with switching sites from single- to 8-core jobs
    • ~75% of the resources are already supporting 8-core or whole-node jobs

ATLAS

  • Smooth running with 500-750k slots on average with lots of Full Simulation
  • WebDAV based archive & recall (new HTTP TAPE REST API)
    • FZK-LCG2_LOCALGROUPTAPE now in production after dCache reconfiguration
    • Plan to move FZK-LCG2 TAPE fully mid next week
    • Plan to move CTA@CERN on April 17th
  • Accidental saturation of new Prague 100 Gbit link due to unrelated cloud transfer tests

CMS

  • recording cosmic and first beam data with the CMS detector
  • overall smooth running, no major issues
    • good core usage between 300k and 470k cores
    • significant HPC/opportunistic contributions
    • usual production/analysis split of about 3:1
    • main production activities: Run 2 ultra-legacy Monte Carlo and Run 3
  • preparing to use Rucio for user and group data
    • asking sites that did not yet have them to set up the two areas
  • waiting on python3 version/port of HammerCloud
  • waiting on HTCondor team for new version of client that fixes ARC-CE submission issues with tokens
  • working with our DPM sites to migrate to other storage technology
    • twelve Tier-2 sites need to migrate
  • token migration progressing steadily
    • clean-up of old, unused clients with excessive scopes in IAM
    • groups for permissions management set up and crontabs running
    • we are eager to explore IAM logging and traceability with the next version
    • first four native xrootd sites fully ready
    • first two dCache sites fully ready
  • looking forward to 24x7 production IAM support by CERN
  • working on WLCG 2023 pledge integration

LHCb

  • Full system utilization (around 140k jobs, mostly MC)
  • Issue with CERN HTCondor CEs (should already be solved, but we need to wait)
  • CNAF not working for some users: it requires OpenSSL 3 everywhere
  • Still ticketing sites regarding the use of "SingularityCE"
  • Run3 sprucing production should appear in the next few days

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • Main focus of work is introduction of HEPscore23

Information System Evolution TF

  • wlcg-cric and WAU have been upgraded to be compliant with the switch from HEPSPEC06 to HEPscore23
  • the doma-cric instance has been decommissioned
  • the script that synchronises CRIC with the production CMS IAM instance has been deployed

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

  • Julia:
    • there has also been progress with the Xrootd transfer monitoring
    • a bug was discovered:
      • many transfers were seen reported as having zero transfer time
      • could e.g. be due to the closing message getting lost
      • this bug needs to be fixed before we can ramp up further
    • for dCache Xrootd traffic, which is currently not monitored anywhere,
      an implementation has been agreed
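The zero-transfer-time symptom described above could be flagged in post-processing along these lines. This is a minimal sketch under invented assumptions; the record structure and field names are hypothetical, not the actual monitoring schema:

```python
# Hypothetical sketch: flag Xrootd transfer-monitoring records that report
# a zero transfer time, the bug symptom noted above (possibly caused by a
# lost closing message). Field names and values are illustrative only.
transfers = [
    {"id": "t1", "bytes": 4_000_000, "duration_s": 12.4},
    {"id": "t2", "bytes": 2_500_000, "duration_s": 0.0},   # suspicious record
    {"id": "t3", "bytes": 900_000,   "duration_s": 3.1},
]

# Separate suspicious zero-duration records from usable ones.
suspect = [t["id"] for t in transfers if t["duration_s"] == 0.0]
valid = [t for t in transfers if t["duration_s"] > 0.0]
print(suspect)  # ['t2']
```

Quantifying the fraction of such records per site would help decide when the ramp-up can safely continue.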

Network Throughput WG


  • perfSONAR infrastructure: version 5.0 will be released on April 17
    • All sites are encouraged to update
    • This release will be available only for CC7; sites with auto-updates enabled will be updated automatically (a reboot is recommended)
    • Support for Alma8/9 will come later in the year
  • WLCG/OSG Network Monitoring Platform
    • Proposed that all T1s subscribe to the perfSONAR alarms on bandwidth decreases
      • Still ongoing; we are waiting for the evaluation of and transition to the new subscription portal
    • perfSONAR Alarms Dashboard (psDash) is now in production (https://ps-dash.uc.ssl-hep.org/)
    • Good progress on migration to the direct publishing of measurements from perfSONARs (we now have 15 hosts publishing directly and working fine)
  • Recent and upcoming WG updates:
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

WG for Transition to Tokens and Globus Retirement

  • Further progress with the CE token support campaign on EGI
    • 125 of 133 tickets have already been solved
    • only a few sizable T2 sites remain; the rest are small sites

Action list

Creation date | Description | Responsible | Status | Comments

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

Specific actions for sites

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

AOB

Topic revision: r13 - 2023-04-11 - MaartenLitmaath
 