WLCG Operations Coordination Minutes, March 7, 2024

Highlights

Grid service monitoring with tokens at KIT

Transition timeline from VOMS-Admin to IAM

2024 pledge deployment status

Preparation for the WLCG WS. Ops and Facilities session

Sites should upgrade perfSONAR to 5.0.8 on EL8 / 9 or compatible

Agenda

https://indico.cern.ch/event/1390112/

Attendance

local: Maarten (ALICE + WLCG)
remote: Andreu (ATLAS + IFAE), Brian (RAL), Christoph (CMS), Concezio (LHCb), David C (Technion), David M (FNAL), Dimitrios (Rucio), Eric (IN2P3), Federico (LHCb), Gabriel (MyToken), Giuseppe (CMS), Henryk (NCBJ), Ivan (BNL), John (ATLAS + BNL), Julia (WLCG), Marcus (MyToken), Marian (networks), Mario (ATLAS), Matt (Lancaster + GridPP), Ofer (BNL), Onno (SURF), Panos (WLCG), Petr (ATLAS + Prague), Stephan (CMS), Xavier (KIT)
apologies:

Operations News

the next meeting is planned for April 4

Special topics

Grid service monitoring with tokens at KIT

see the presentation

Discussion

Xavier:
- we use MyToken to facilitate monitoring of our SEs
  - tokens are used for uploading and downloading test files

Stephan:
- IIUC, MyToken authorizes access to refresh tokens (RT)
Marcus:
- indeed
Stephan:
- how are those tokens refreshed themselves?
Marcus:
- the MyToken service is a client of the OP (OIDC Provider), i.e. the VO
Stephan:
- is a new RT obtained for every access?
Marcus:
- not yet, we plan to implement RT recycling as an option
Stephan:
- what is the difference between MyToken and the Vault approach used on OSG?
Marcus:
- the extra configurability of MyToken, as shown on this page:
  - https://mytoken.data.kit.edu/#mt
- more fine-grained restrictions can be configured for improved security

Maarten:
- mytok.eu can be considered like a MyProxy service for the whole grid?
Marcus:
- indeed
- sites also can do their own deployments, if they wish

Maarten:
- I would not mind seeing how MyToken is used exactly at KIT
  - example commands, configurations, screenshots, ...
- it could be very useful to other site admins
- can be presented in an Ops Coordination and/or AuthZ WG meeting
Xavier:
- will look into it

Stephan:
- are there plans for integration with Rucio, HTCondor etc?
Marcus:
- there is a project at KIT on integration with HTCondor
- pull requests would be submitted upstream as needed

Maarten:
- mind we are still discovering how to use tokens, depending on the workflow
  - and possibly depending on the VO
- short-lived tokens generally are better from a security perspective,
  but longer-lived tokens may be perfectly fine, depending on their usage
  - how concerned we are that tokens for particular workflows might be stolen
  - how much damage might be done with such stolen tokens
- also mind that today, we use a much less fine-grained system in production
  - long-lived VOMS proxies!
- we therefore do not need to aim for perfection right from the start
- it is good that we have services like MyToken and Vault to help us
- we will steadily gain experience and improve how tokens are used
- that will take many months, possibly a few years still

Transition timeline from VOMS-Admin to IAM

see the presentation

Discussion

Stephan:
- has a shared DB between OpenShift and K8s been tested?
Maarten:
- only on a small scale
Stephan:
- there could be locks, crashes, ...
Maarten:
- indeed, we need to see if it can work at scale

Stephan:
- the new hosts should be set up before sites start trusting them
Maarten:
- good point, we will reserve the hostnames already (done)

Stephan:
- as we come to rely more on tokens, we need a better SLA for IAM
- ETF does not renew tokens continuously, but only some time before expiry
Maarten:
- we need to improve the IAM code, configuration and SLA in the next months
- ETF should renew tokens more frequently; to be followed up
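
As an illustration of the two renewal policies contrasted above, here is a minimal sketch. It is not ETF code: the function names, the one-hour lifetime, the 5-minute margin and the 15-minute interval are all assumptions chosen for the example. Renewing at a fixed interval leaves more remaining lifetime to ride out a temporary IAM outage than renewing only shortly before expiry.

```python
def should_renew_before_expiry(now: float, expires_at: float, margin: float) -> bool:
    """Renew only once the remaining lifetime drops to the safety margin or below."""
    return expires_at - now <= margin


def should_renew_periodically(now: float, issued_at: float, interval: float) -> bool:
    """Renew whenever the token is at least `interval` seconds old."""
    return now - issued_at >= interval


# Hypothetical numbers (seconds): a token issued at t=0 that is valid for one hour.
ISSUED_AT = 0.0
EXPIRES_AT = 3600.0
MARGIN = 300.0      # assumed: renew in the last 5 minutes only
INTERVAL = 900.0    # assumed: renew every 15 minutes

if __name__ == "__main__":
    for now in (600.0, 1800.0, 3400.0):
        print(
            now,
            should_renew_before_expiry(now, EXPIRES_AT, MARGIN),
            should_renew_periodically(now, ISSUED_AT, INTERVAL),
        )
```

With these assumed values, the "before expiry" policy makes its first renewal attempt with at most 5 minutes of validity left, while the periodic policy makes its first attempt with 45 minutes left.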

Petr:
- are you in touch with the CNAF deployment team?
Maarten:
- yes, there will be a meeting with them, the devs and our experts next week
- we will discuss in particular the possible HA options
Petr:
- we should also plan stress tests of the K8s instances
Maarten:
- good point

2024 pledge deployment status

Julia:
- ATM we have ~50% of answers from the federations
- for ~30% of the total, the deployments are reported to be OK by April 1st
- some of the sites report CPU to be OK, while storage needs more time, or vice versa
- the majority are optimistic about deployments being OK by late April or early May
- we will continue pinging the remaining sites

Stephan:
- pledges should be reset in CRIC if they are not deployed
Julia:
- we trust the sites and can check the accounting as needed
Stephan:
- some sites that pledged last year no longer provide grid services
Julia:
- pledges typically are done by federations, not individual sites
- we can check offline if there is some cleanup to be done

Preparation for the WLCG WS. Ops and Facilities session

Julia:
- the initial list of proposed potential topics is linked to the agenda
- please let us know if you have more suggestions
Maarten:
- we have 2 slots of 1.5 hours each
- we probably can do only a subset of the potential topics
- it would be good to know which ones have the highest priority

Middleware News

We got a question whether anyone is using the following libraries which have been developed in the context of messaging services. If they are not used, they will be frozen:
- C
  - c-dirq
- Java
  - java-dirq
  - java-dirq-test
- Perl
  - Authen-Credential
  - Config-Generator
  - Config-Validator
  - Directory-Queue
  - Messaging-Message
  - Net-STOMP-Client
  - No-Worries
  - stompclt
- Python
  - python-auth-credential
  - python-dirq
  - python-messaging
  - python-simplevisor

Useful Links
- WLCGBaselineTable
Baselines/News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

Normal activity levels on average
No major problems
Successful participation in DC24 with raw heavy-ion data replication
to CERN CTA and the T1 sites, to be finished by the end of March
- Details in yesterday's DOMA presentation

ATLAS

Smooth running in the last month
- Average 620k slots, peaks up to 950k slots in the last 30 days.
- 6.75 PB average per day transferred (2.6M files avg/day) in the last 30 days
CSCS is delivering work but is shown with 0 reliability and 0 availability in the Feb 2024 WLCG report
DC24 finished. A new ATLAS CERN - T1s DC is to be planned; see slides
- Overall success! Generating that solid purple wall of increasing rates without affecting production much was a success!
- Getting to 2.5 Tb/s for ~9 hours was a success!
- T0 exports test will need to be rerun before the WLCG/HSF workshop at DESY

Discussion

Maarten:
- to resolve the CSCS mystery, a ticket should be opened for WLCG Grid Monitoring

CMS

overall smooth running, no major issues
- core usage between 350k and 510k cores
- bursts of HPC/opportunistic contributions between 12k and 133k cores
- higher production/analysis split of about 4:1
- main production activities Run 3 and ultra-legacy Run 2
data challenge was useful and successful from CMS point of view
- continued to exercise transfers via token, just switched back to x509
waiting on python3 version/port of HammerCloud; HC jobs are an important "standard candle" to monitor not only site health but also measure site performance
work on SRM to REST migration for tape endpoints continues
work with the remaining DPM sites on migration to other storage technologies continues
token migration progressing steadily
- two-thirds of the dCache sites ready (plus all StoRM and all but one native xrootd sites)
- waiting on EOS (scitoken library) to be fully token-ready

LHCb

smooth running
DC24 exercise globally successful (slides here)
- new NCBJ Tier1 joined, targets reached
- will do some more tests with INFN-CNAF and the Beijing proto-Tier1
ETF tests updated in pre-production, still fixing tests for a couple of sites before going into production
there is a discrepancy between WLCG and DIRAC accounting; in particular, a decrease in CPU efficiency was seen in WLCG accounting in 2023H2 at the Tier0; this is not confirmed by DIRAC
we noticed underusage at a couple of sites in 2023, we are trying to understand why

Discussion

Julia:
- there does not seem to be a problem with the accounting workflow
- the CERN batch team found the Accounting Portal values to be OK
- however, an issue was also seen e.g. for CMS
- there was a decrease in CPU efficiency from Oct through Dec
- at the start of 2024 the issue disappeared for unknown reasons
- Domenico Giordano and the batch team are looking into the matter
  - it is not related to the benchmark

Julia:
- do you have a timeline for Beijing to become a T1 officially?
Concezio:
- for that, a blessing is needed from the Overview Board
- currently the site only has a shared link to CERN
- we also need to do more tests

Task Forces and Working Groups

Accounting TF

Migration from DPM to the alternative solutions

Not many changes last month. The status was reported at the GDB in February.

Information System Evolution TF

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

Quite some firefighting during data challenges
- New XRootD:
  - Not extracting domains correctly for hostnames with more than one sub-domain (Fixed)
  - Not always resolving FQDNs (so domain couldn't be extracted) (Fixed)
  - Remote access flag swapped (Fixed)
- dCache:
  - No complaints received, so assuming it's working as expected
- FTS:
  - Remote access was wrong as it was always hardcoded to "true" (Fixed)
  - Some LHCb endpoints not correctly enriched as missing usage in CRIC (Fixed)
  - Some LHCb endpoints not correctly enriched due to some MONIT issue (being investigated)
- WLCG Sites Network monitoring:
  - Mainly working as expected; from time to time some sites overflowed and sent weird numbers that had to be manually removed
Received a request to freeze raw data used for Data Challenges; this data should not be deleted until explicitly requested
- DDM, FTS, Rucio Events
Still working on aggregation jobs for new flows (XRootD NG and dCache) so we can keep data forever (similar to FTS)
- These jobs will recompute all data since first of February
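
To illustrate how a domain-extraction problem of the kind listed for the new XRootD flow can arise for hostnames with more than one sub-domain, here is a minimal sketch. It is not the MONIT code, which is not shown in these minutes: the function names and the example hostnames are hypothetical, and the number of labels in the public suffix is passed in by hand as an assumption (a production system would more likely consult the Public Suffix List).

```python
def naive_domain(hostname: str) -> str:
    """Drop only the first label.

    Gives the expected result for 'host.example.org' but keeps the extra
    sub-domain for 'host.grid.example.org'.
    """
    return hostname.split(".", 1)[1]


def registered_domain(hostname: str, suffix_labels: int = 1) -> str:
    """Keep the public suffix plus the one label to its left.

    `suffix_labels` is the number of labels in the public suffix
    (1 for 'org', 2 for 'ac.uk'); it is supplied by the caller here.
    """
    labels = hostname.rstrip(".").lower().split(".")
    keep = suffix_labels + 1
    if len(labels) < keep:
        raise ValueError(f"not enough labels in hostname: {hostname!r}")
    return ".".join(labels[-keep:])


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    print(naive_domain("xrootd01.grid.example.org"))
    print(registered_domain("xrootd01.grid.example.org"))
    print(registered_domain("se01.physics.example.ac.uk", suffix_labels=2))
```

Counting labels from the right rather than stripping one from the left makes the result independent of how many sub-domains a site uses.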

Discussion

these findings will be reported and discussed in the DOMA meeting on March 13

Network Throughput WG

perfSONAR infrastructure - 5.0.8 is the latest release
- All sites are encouraged to upgrade to Alma8/9 (or another compatible distribution) if they have not done so already
- Our updated documentation is at https://osg-htc.org/networking/
  - Deployment of one physical server with two network cards is sufficient (and there is no need to migrate the existing storage)
  - There is also an alternative deployment via docker, please contact perfsonar-support (at cern.ch) if you have any questions (more details in the docs)
WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

WG for Transition to Tokens and Globus Retirement

Action list

Creation date	Description	Responsible	Status	Comments

Specific actions for experiments

Creation date	Description	Affected VO	Affected TF/WG	Deadline	Completion	Comments

Specific actions for sites

Creation date	Description	Affected VO	Affected TF/WG	Deadline	Completion	Comments

AOB

Topic revision: r16 - 2024-04-04 - JuliaAndreeva
