WLCG Operations Coordination Minutes, March 7, 2024

Highlights

Agenda

https://indico.cern.ch/event/1390112/

Attendance

  • local: Maarten (ALICE + WLCG)
  • remote: Andreu (ATLAS + IFAE), Brian (RAL), Christoph (CMS), Concezio (LHCb), David C (Technion), David M (FNAL), Dimitrios (Rucio), Eric (IN2P3), Federico (LHCb), Gabriel (MyToken), Giuseppe (CMS), Henryk (NCBJ), Ivan (BNL), John (ATLAS + BNL), Julia (WLCG), Marcus (MyToken), Marian (networks), Mario (ATLAS), Matt (Lancaster + GridPP), Ofer (BNL), Onno (SURF), Panos (WLCG), Petr (ATLAS + Prague), Stephan (CMS), Xavier (KIT)
  • apologies:

Operations News

  • the next meeting is planned for April 4

Special topics

Grid service monitoring with tokens at KIT

see the presentation

Discussion

  • Xavier:
    • we use MyToken to facilitate monitoring of our SEs
      • tokens are used for uploading and downloading test files

  • Stephan:
    • IIUC, MyToken authorizes access to refresh tokens (RT)
  • Marcus:
    • indeed
  • Stephan:
    • how are those tokens refreshed themselves?
  • Marcus:
    • the MyToken service is a client of the OP (OIDC Provider), i.e. the VO
  • Stephan:
    • is a new RT obtained for every access?
  • Marcus:
    • not yet, we plan to implement RT recycling as an option
  • Stephan:
    • what is the difference between MyToken and the Vault approach used on OSG?
  • Marcus:
    • the extra configurability of MyToken, as shown on this page:
    • more fine-grained restrictions can be configured for improved security

  • Maarten:
    • mytok.eu can be considered like a MyProxy service for the whole grid?
  • Marcus:
    • indeed
    • sites also can do their own deployments, if they wish

  • Maarten:
    • I would not mind seeing how MyToken is used exactly at KIT
      • example commands, configurations, screenshots, ...
    • it could be very useful to other site admins
    • can be presented in an Ops Coordination and/or AuthZ WG meeting
  • Xavier:
    • will look into it

  • Stephan:
    • are there plans for integration with Rucio, HTCondor etc?
  • Marcus:
    • there is a project at KIT on integration with HTCondor
    • pull requests would be submitted upstream as needed

  • Maarten:
    • mind we are still discovering how to use tokens, depending on the workflow
      • and possibly depending on the VO
    • short-lived tokens generally are better from a security perspective,
      but longer-lived tokens may be perfectly fine, depending on their usage
      • how concerned we are that tokens for particular workflows might be stolen
      • how much damage might be done with such stolen tokens
    • also mind that today, we use a much less fine-grained system in production
      • long-lived VOMS proxies!
    • we therefore do not need to aim for perfection right from the start
    • it is good that we have services like MyToken and Vault to help us
    • we will steadily gain experience and improve how tokens are used
    • that will take many months, possibly a few years still

Transition timeline from VOMS-Admin to IAM

see the presentation

Discussion

  • Stephan:
    • has a shared DB between OpenShift and K8s been tested?
  • Maarten:
    • only on a small scale
  • Stephan:
    • there could be locks, crashes, ...
  • Maarten:
    • indeed, we need to see if it can work at scale

  • Stephan:
    • the new hosts should be set up before sites start trusting them
  • Maarten:
    • good point, we will reserve the hostnames already (done)

  • Stephan:
    • as we come to rely more on tokens, we need a better SLA for IAM
    • ETF does not renew tokens continuously, but only some time before expiry
  • Maarten:
    • we need to improve the IAM code, configuration and SLA in the next months
    • ETF should renew tokens more frequently; to be followed up
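
The renewal point raised above can be sketched as a simple check on the remaining token lifetime. This is a minimal illustration only, not how ETF works: the function names and the 600-second threshold are hypothetical. It assumes the token is a JWT carrying the standard `exp` claim. The payload is decoded without verifying the signature, which is acceptable only for reading the expiry time, never for authorization decisions.

```python
import base64
import json
import time
from typing import Optional


def token_expiry(jwt: str) -> int:
    """Return the 'exp' claim (Unix seconds) of a JWT, without signature verification."""
    payload_b64 = jwt.split(".")[1]
    # JWT segments are base64url-encoded without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return int(claims["exp"])


def should_renew(jwt: str, min_remaining: int = 600, now: Optional[float] = None) -> bool:
    """True when fewer than min_remaining seconds are left (hypothetical policy)."""
    current = time.time() if now is None else now
    return token_expiry(jwt) - current < min_remaining
```

A monitoring probe could call `should_renew` before each test cycle and request a fresh access token whenever it returns True, so that a short outage of the token issuer does not immediately lead to expired tokens.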

  • Petr:
    • are you in touch with the CNAF deployment team?
  • Maarten:
    • yes, there will be a meeting with them, the devs and our experts next week
    • we will discuss in particular the possible HA options
  • Petr:
    • we should also plan stress tests of the K8s instances
  • Maarten:
    • good point

2024 pledge deployment status

  • Julia:
    • ATM we have ~50% of answers from the federations
    • for ~30% of the total, deployments are reported to be OK by April 1st
    • some of the sites report CPU to be OK, while storage needs more time, or vice versa
    • the majority are optimistic about deployments being OK by late April or early May
    • we will continue pinging the remaining sites

  • Stephan:
    • pledges should be reset in CRIC if they are not deployed
  • Julia:
    • we trust the sites and can check the accounting as needed
  • Stephan:
    • some sites that pledged last year no longer provide grid services
  • Julia:
    • pledges typically are done by federations, not individual sites
    • we can check offline if there is some cleanup to be done

Preparation for the WLCG WS: Ops and Facilities session

  • Julia:
    • the initial list of proposed potential topics is linked to the agenda
    • please let us know if you have more suggestions
  • Maarten:
    • we have 2 slots of 1.5 hours each
    • we probably can do only a subset of the potential topics
    • it would be good to know which ones have the highest priority

Middleware News

  • We got a question whether anyone is using the following libraries which have been developed in the context of messaging services. If they are not used, they will be frozen:
    • C
      • c-dirq
    • Java
      • java-dirq
      • java-dirq-test
    • Perl
      • Authen-Credential
      • Config-Generator
      • Config-Validator
      • Directory-Queue
      • Messaging-Message
      • Net-STOMP-Client
      • No-Worries
      • stompclt
    • Python
      • python-auth-credential
      • python-dirq
      • python-messaging
      • python-simplevisor

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Normal activity levels on average
  • No major problems
  • Successful participation in DC24 with raw heavy-ion data replication
    to CERN CTA and the T1 sites, to be finished by the end of March

ATLAS

  • Smooth running in the last month
    • Average 620k slots, peaks up to 950k slots in the last 30 days.
    • 6.75 PB transferred per day on average (2.6M files/day on average) in the last 30 days
  • CSCS is delivering work but is shown with 0 reliability and 0 availability in the Feb 2024 WLCG report
  • DC24 finished; a new ATLAS CERN - T1s DC is to be planned (see slides)
    • Overall success! Generating that solid purple wall of increasing rates without affecting production much was a success!
    • Getting to 2.5 Tb/s for ~9 hours was a success!
    • T0 exports test will need to be rerun before the WLCG/HSF workshop at DESY

Discussion

  • Maarten:
    • to resolve the CSCS mystery, a ticket should be opened for WLCG Grid Monitoring

CMS

  • overall smooth running, no major issues
    • core usage between 350k and 510k cores
    • bursts of HPC/opportunistic contributions between 12k and 133k cores
    • higher production/analysis split of about 4:1
    • main production activities Run 3 and ultra-legacy Run 2
  • data challenge was useful and successful from CMS point of view
    • continued to exercise transfers via token, just switched back to x509
  • waiting on python3 version/port of HammerCloud; HC jobs are an important "standard candle" to monitor not only site health but also measure site performance
  • work on SRM to REST migration for tape endpoints continues
  • work with the remaining DPM sites on migration to other storage technologies continues
  • token migration progressing steadily
    • two-thirds of the dCache sites ready (plus all StoRM and all but one native xrootd sites)
    • waiting on EOS (scitoken library) to be fully token-ready

LHCb

  • smooth running
  • DC24 exercise globally successful (slides here)
    • new NCBJ Tier1 joined, targets reached
    • will do some more tests with INFN-CNAF and the Beijing proto-Tier1
  • ETF tests updated in pre-production, still fixing tests for a couple of sites before going into production
  • there is a discrepancy between WLCG and DIRAC accounting; in particular, a decrease in CPU efficiency was seen in WLCG accounting in 2023H2 at the Tier0; this is not confirmed by DIRAC
  • we noticed underusage at a couple of sites in 2023, we are trying to understand why

Discussion

  • Julia:
    • there does not seem to be a problem with the accounting workflow
    • the CERN batch team found the Accounting Portal values to be OK
    • however, an issue was also seen e.g. for CMS
    • there was a decrease in CPU efficiency from Oct through Dec
    • at the start of 2024 the issue disappeared for unknown reasons
    • Domenico Giordano and the batch team are looking into the matter
      • it is not related to the benchmark

  • Julia:
    • do you have a timeline for Beijing to become a T1 officially?
  • Concezio:
    • for that, a blessing is needed from the Overview Board
    • currently the site only has a shared link to CERN
    • we also need to do more tests

Task Forces and Working Groups

Accounting TF

  • NTR

Migration from DPM to the alternative solutions

  • Not many changes in the last month. The status was reported at the GDB in February.

Information System Evolution TF

  • NTR

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

  • Quite some firefighting during data challenges
    • New XRootD:
      • Not extracting domains correctly for hostnames with more than one sub-domain (Fixed)
      • Not always resolving FQDNs (so domain couldn't be extracted) (Fixed)
      • Remote access flag swapped (Fixed)
    • dCache:
      • No complaints, so assuming it's working as expected
    • FTS:
      • Remote access was wrong as it was always hardcoded to "true" (Fixed)
      • Some LHCb endpoints not correctly enriched because their usage was missing in CRIC (Fixed)
      • Some LHCb endpoints not correctly enriched due to some MONIT issue (being investigated)
    • WLCG Sites Network monitoring:
      • Mainly working as expected; from time to time some sites overflowed and sent weird numbers that had to be removed manually
  • Received a request to freeze the raw data used for the Data Challenges; this data should not be deleted until explicitly requested
    • DDM, FTS, Rucio Events
  • Still working on aggregation jobs for new flows (XRootD NG and dCache) so we can keep data forever (similar to FTS)
    • These jobs will recompute all data since first of February

Discussion

  • these findings will be reported and discussed in the DOMA meeting on March 13

Network Throughput WG


  • perfSONAR infrastructure - 5.0.8 is the latest release
    • All sites are encouraged to upgrade to Alma8/9 (or other compatible distribution) in case they haven't done it already
    • Our updated documentation is at https://osg-htc.org/networking/
      • Deployment of one physical server with two network cards is sufficient (and there is no need to migrate the existing storage)
      • There is also an alternative deployment via docker, please contact perfsonar-support (at cern.ch) if you have any questions (more details in the docs)
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

WG for Transition to Tokens and Globus Retirement

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

Topic revision: r16 - 2024-04-04 - JuliaAndreeva
 