WLCG Operations Coordination Minutes, Dec 7, 2023

Highlights

This is the last meeting of the year. We would like to thank all our colleagues for their hard work and dedication in 2023. Enjoy your well-deserved vacations; we wish you new happiness, new goals, new achievements, and plenty of new inspiration in the coming year!

Agenda

https://indico.cern.ch/event/1354691/

Attendance

  • local: Maarten (ALICE + WLCG), Panos (WLCG)
  • remote: Andrea (WLCG), Christoph (CMS), David C (Technion), Evgeny (Weizmann), Frédérique (IN2P3-LAPP), Julia (WLCG), Marcin (PSNC), Mario (ATLAS), Max (KIT), Ofer (BNL), Sophie (GRIF-IRFU), Stephan (CMS), Thomas (DESY)
  • apologies:

Operations News

  • Julia:
    • in the weekly ops meeting there was a question:
      should sites upgrade to the latest dCache golden release, 9.2?
    • as CMS are already running a dCache upgrade campaign and
      9.2 brings good features for DC24, we recommend that sites
      look into upgrading, but decide for themselves whether they
      are comfortable doing so before DC24 (Feb 12-23)
  • Christoph:
    • CMS ask sites to upgrade either to 9.2 or a more recent 8.2 version,
      for the improved token support


  • the next meeting is foreseen for Feb 1

Special topics

HEPScore integration in the accounting workflow (postponed)

Middleware News

  • Useful Links
  • Baselines/News
    • New VOMS service for dteam available since Dec 7
      • It is the VOMS endpoint of the new IAM service for the VO
      • It should soon replace the current VOMS(-Admin) service of the VO
      • An EGI broadcast was sent asking sites to add the new LSC file
      • The VOMS client details will be broadcast in a few days
      • The new user registration details will be broadcast separately
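
For sites adding the new LSC file, the layout can be sketched as follows. This is only an illustration: an LSC file conventionally lives under /etc/grid-security/vomsdir/<VO>/<VOMS host>.lsc and holds the server certificate subject DN on the first line and the issuing CA DN on the second. The host name and DNs below are hypothetical placeholders, and the helper names are ours; the real values are the ones given in the EGI broadcast.

```python
from pathlib import PurePosixPath

# Conventional location of LSC files on grid services.
VOMSDIR = PurePosixPath("/etc/grid-security/vomsdir")


def lsc_path(vo, voms_host):
    """Return the path where the LSC file for a VO and VOMS host is expected."""
    return VOMSDIR / vo / f"{voms_host}.lsc"


def lsc_content(server_dn, ca_dn):
    """Return the LSC file body: server subject DN first, then the CA DN."""
    return f"{server_dn}\n{ca_dn}\n"


# Hypothetical placeholders, NOT the real dteam endpoint details.
path = lsc_path("dteam", "voms.example.org")
body = lsc_content(
    "/DC=org/DC=example/CN=voms.example.org",
    "/DC=org/DC=example/CN=Example CA",
)
print(path)
print(body, end="")
```

The sketch only builds the path and the text; writing the file into place is left to the site's own configuration management.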

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Lowish to normal job activity on average
    • May ramp up somewhat in the coming weeks and during the EOY break
    • Much raw data reconstruction
  • Almost all computing resources support multicore jobs
    • 8-core jobs on most sites
    • whole-node jobs on dedicated clusters
  • HI run raw data archiving should start soon and last for many weeks
    • A steady background activity
  • THANKS to all sites and best wishes for 2024!

ATLAS

  • Smooth running with 500-750k slots on average
    • Healthy job mix with mostly Full Sim
    • Recently some more Data Reprocessing and Evgen, fewer Reco
    • Big RDO reprocessing coming up, and a "Christmas backlog" is being prepared
  • DC24 pre-challenge very successful
  • Problems with VOMS adding new users: “User not known to VO”
    • Seems to be a problem with lcg-voms2.cern.ch
  • IPv6 dual stack everywhere for storage now being followed at https://its.cern.ch/jira/browse/ADCINFR-263
  • Australia-ATLAS situation needs to be clarified
  • New nucleus mode
    • Site with more than 4.5 PB of DATADISK
    • Nucleus flag can be removed based on operational problems
    • Added for the smallest freespace: LANCS, QMUL, NET2

Discussion

  • Andrea:
    • I also got cron job errors for voms-proxy-init
  • Mario:
    • only lcg-voms2.cern.ch looked affected
  • Maarten:
    • to be followed up with the service manager

CMS

  • overall smooth running, no major issues
    • Heavy-Ion data distribution progressing well
    • core usage between 330k and 480k cores
    • bursts of HPC/opportunistic contributions of up to 100k cores
    • higher production/analysis split of about 4:1 (HI processing)
    • main production activity Run 3
    • organizing production activities during holiday break
  • ATLAS-CMS joint pre-DC24 T0-T1 exercise at the end of last week went well on the CMS side
  • waiting on python3 version/port of HammerCloud
  • work on SRM to REST migration for Tape endpoints continues (dCache upgrades)
  • work with our DPM sites to migrate to other storage technology continues
  • token migration progressing steadily
    • working with dCache sites (odd configs at some sites from auto DPM migrations)
    • waiting on EOS to be fully token-ready (which is waiting on a SciToken library fix)
  • looking forward to 24x7 production IAM support by CERN
    • one of the legacy VOMS shutdown hurdles
  • waiting on a GlideinWMS fix to move forward with new SAM worker node tests
  • WLCG CE/WN IPv6 campaign progressing well at CMS sites
  • CMS is ready to move to larger, 16-core pilots. Where do the other experiments stand? Could this be a common, WLCG-wide approach?

Discussion

  • Maarten:
    • 16-core slots probably are OK for ALICE (confirmed)
  • Mario:
    • details to be checked, but probably OK also for ATLAS
    • there could be a concern for "overlay" (pile-up) jobs
  • Thomas:
    • multi-VO sites might lose significant CPU time while
      draining cores to carve out a 16-core slot
    • the more VOs want that, the better
    • backfill would be possible with correct run-time requests
  • Stephan:
    • for example, low-priority short single-core jobs
  • Max:
    • it could be easier to go for whole-node jobs instead
    • they are OK for ALICE and CMS
  • Stephan:
    • the downside of those could be a long turn-over time,
      when a VO was inactive and then submits new jobs
  • Maarten:
    • to be followed up early next year
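
The draining concern raised by Thomas can be illustrated with a small back-of-the-envelope model. It is a sketch under simplifying assumptions that were not discussed in the meeting: every running job occupies one core, the scheduler waits for the 16 jobs that finish soonest, and freed cores stay idle until the slot is complete (no backfill). The function name and the example numbers are hypothetical.

```python
def drained_core_hours(remaining_hours, slot_cores=16):
    """Core-hours left idle while carving out one slot of `slot_cores` cores.

    remaining_hours: remaining run time, in hours, of each single-core job
    currently running on the node. Without backfill, each freed core idles
    until the last of the chosen jobs has finished.
    """
    if len(remaining_hours) < slot_cores:
        raise ValueError("not enough busy cores to form the slot")
    soonest = sorted(remaining_hours)[:slot_cores]
    slot_ready = soonest[-1]  # the slot is complete when the slowest job ends
    return sum(slot_ready - t for t in soonest)


# Example: 16 jobs with 1, 2, ..., 16 hours left to run.
print(drained_core_hours(list(range(1, 17))))
```

In this model, short backfill jobs with correct run-time requests would recover part of the idle time, which is the point made by Thomas and Stephan.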

LHCb

Task Forces and Working Groups

Accounting TF

  • The main effort is dedicated to HEPScore integration in the accounting infrastructure. A new APEL server is being tested. The required changes in the EGI portal have been prototyped.

Migration from DPM to the alternative solutions

  • 30 out of 43 tickets solved.

Information System Evolution TF

  • NTR

IPv6 Validation and Deployment TF

The deployment on compute services is proceeding well, with about 18% of the sites having already finished (or had IPv6 already).

Detailed status here.

Monitoring

  • Julia:
    • there are preparations ongoing for DC24
    • site network monitoring is progressing well
    • regarding XRootD traffic monitoring:
      • the EOS services at CERN have been restarted after the HI run
      • the collector showed inefficiencies or bottlenecks to be debugged
      • the data from dCache at DESY need to be integrated

Network Throughput WG


WG for Transition to Tokens and Globus Retirement

  • The HTCondor CE campaign on EGI runs with condor v9.0.20 since Nov 17
    • 9 out of the 53 tickets are solved, many are in progress, some on hold
    • Many thanks to Jaime Frey of the HTCondor team for implementing
      a very non-trivial fix for a fatal issue encountered with v9.0.19!
    • Clients still presenting VOMS proxies can be configured to have
      the SSL method tried first, with fallback to GSI still working
      • Mind: those clients also need to run v9.0.20 or >= v10.7.0
    • When all clients of a given CE are mapped through tokens or SSL,
      the CE can be upgraded to HTCondor v23 (which does not support GSI)
      • Example configurations will be published in the coming months
      • Another campaign is foreseen in the early months of next year,
        to get all HTCondor CEs on >= v23
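
The client version rule above ("v9.0.20 or >= v10.7.0") can be checked with a few lines of Python. This is a minimal sketch that takes the rule literally, so that only 9.0.20 itself qualifies in the 9.0 series; that literal reading and the helper names are our assumptions. How the version string is obtained on a given host is left out.

```python
def parse_version(text):
    """Turn 'v9.0.20' or '10.7.0' into a tuple of integers such as (9, 0, 20)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def client_meets_requirement(version):
    """True if the version is exactly 9.0.20 or at least 10.7.0."""
    parsed = parse_version(version)
    return parsed == (9, 0, 20) or parsed >= (10, 7, 0)


for candidate in ("v9.0.19", "v9.0.20", "v10.6.0", "v10.7.0", "v23.0.0"):
    print(candidate, client_meets_requirement(candidate))
```

Comparing tuples of integers avoids the pitfall of string comparison, where "10.7.0" would sort before "9.0.20".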

Discussion

  • Thomas:
    • at DESY we prefer upgrading to Alma 9 and HTCondor v23 in one go
    • the problem is that accounting is only supported on CentOS 7 today
    • who is responsible for that?
  • Maarten:
    • this matter should have higher priority now: will follow up
    • clarified after the meeting:
      • the APEL team provide log parsers for supported batch systems
      • those parsers are written in Python 2.7, which is unavailable on EL 9
      • to be assessed how much work the port to 3.x will take
  • Julia:
    • will follow up in the weekly accounting meeting

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

  • Stephan:
    • what is the status of the UMD for EL 9?
  • Maarten:
    • the UMD support infrastructure got into an unusable state just before summer
    • any non-trivial development work in EGI is unfunded since July
    • the situation is expected to improve again as of January
    • for the time being, packages can be taken directly from EPEL 9 etc.
Topic revision: r9 - 2023-12-11 - MaartenLitmaath