WLCG Operations Coordination Minutes, June 8, 2023

Highlights

  • We kindly remind sites and people dealing with pledge management to provide input to the surveys about accounting and pledge management tools and procedures. Thanks a lot to those who have already answered. The surveys will be closed at the end of June. The same request is addressed to the experiments: thanks a lot to ALICE and ATLAS, who have already answered.

Agenda

https://indico.cern.ch/event/1293820/

Attendance

  • local: Berk (CERN IAM), Christoph (CMS), Eric G (CERN IT-PW GL), Giuseppe (CMS), Julia (WLCG), Maarten (ALICE + WLCG), Panos (WLCG), Paolo (CERN Grid CA), Stephan (CMS)
  • remote: Adrian (APEL), Alessandra D (Napoli), Alessandro DG (ATLAS), Alessandro P (EGI), Andrew (TRIUMF), Asier (CERN SSO), Borja (monitoring), Concezio (LHCb), David C (Technion), David M (FNAL), David S (ATLAS), Domenico (WLCG), Leona (computing), Maria (computing), Mario (ATLAS), Matt D (Lancaster), Ofer (BNL), Petr (ATLAS + Prague), Sebastian (CERN SSO), Thomas (DESY)
  • apologies:

Operations News

  • the next meeting is planned for July 13

Special topics

Review of the CERN Grid CA expiration incident from the WLCG perspective

Links for collected material

please see the presentations

Discussion

  • Mario:
    • what might be the impact of the move to the new CA?
    • mind that old clients will be there at least throughout Run 3!
  • Paolo:
    • the new CA initiative is still under development
    • we will have a pilot service to collect feedback
    • the new CA should be much better w.r.t. host certificates:
      • the CA is in IGTF, yet trusted by default in browsers
      • it is already being used at other sites on the grid
  • Thomas:
    • we have been using Sectigo for about a year, but here is a word of caution:
      some clients ran into problems depending on the PEM format used
      • some libraries have issues with chains of host + issuer certificates
        (a minimal check sketch follows after this exchange)
  • Paolo:
    • we will need to see what to do about that
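
  A minimal sketch (an illustration, not an official tool) of the kind of check the caution above suggests: it lists the certificates found in a host + issuer PEM bundle in file order, so that chain-ordering or format surprises can be spotted before a client library trips over them. The default file name is only an example, and the third-party Python "cryptography" package is assumed to be available.

      #!/usr/bin/env python3
      # Illustration only: list the certificates found in a PEM bundle in file
      # order, so that chain-ordering or format problems of the kind reported
      # above can be spotted. Requires the third-party "cryptography" package;
      # the default file name is just an example.
      import re
      import sys
      from cryptography import x509

      PEM_BLOCK = re.compile(
          rb"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----", re.DOTALL)

      def load_chain(path):
          """Return the certificates contained in a PEM bundle, in file order."""
          with open(path, "rb") as f:
              data = f.read()
          return [x509.load_pem_x509_certificate(blob) for blob in PEM_BLOCK.findall(data)]

      if __name__ == "__main__":
          path = sys.argv[1] if len(sys.argv) > 1 else "hostcert-chain.pem"
          for i, cert in enumerate(load_chain(path)):
              print(f"[{i}] subject: {cert.subject.rfc4514_string()}")
              print(f"    issuer : {cert.issuer.rfc4514_string()}")
              print(f"    expires: {cert.not_valid_after}")

  For a well-ordered chain, the issuer of entry [0] should match the subject of entry [1], and so on; a mismatch or an unexpected number of entries is a hint of the kind of problem reported above.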

  • Julia:
    • does ATLAS have a suggestion for how the communication could be improved?
  • Mario:
    • once we managed to open a ticket about the dysfunctional SSO,
      the reply was not helpful ("it works for us")
    • emotions ran high because data taking was in jeopardy
    • in the end, some things cannot be controlled, which is acceptable
    • as ancient SW still needs to be supported, maybe test for that in the future?

  • Sebastian:
    • a service provider needs to know the dependencies on their service
    • is data taking affected when the SSO is down?
  • Mario:
    • we could not log into SNow
    • the main problem for ATLAS turned out to be the Python client issue on CC7,
      but things were not clear, there were several reports and escalations
  • David S:
    • data can be buffered, but many services depend on SSO
    • the bottom line is that SSO needs to be considered critical for data taking
  • Eric G:
    • do we need to do something special for that?
    • SSO sits on the GPN and has dependencies on other services in the data center
  • Maarten:
    • to create a separate SSO for the experiments would be very expensive
    • we should not go overboard here: even though the incident was unexpected and
      happened at quite an awkward time, it was mostly resolved within 3 hours
    • we apply the lessons learned and that should already help a lot
  • Paolo:
    • agreed
  • Eric G:
    • the next major SSO change will happen during a Technical Stop,
      which may be fine for the experiments, but not for the accelerators
  • Sebastian:
    • interventions are carefully planned and usually go OK

  • Mario:
    • what is the relationship between Kerberos and SSO?
    • data taking needs Kerberos to work
  • Paolo + Sebastian:
    • SSO depends on Active Directory (AD) services: Kerberos and LDAP
    • AD is provided by a sizable number of Domain Controller machines,
      some of which are for special networks
      • ATLAS, CMS, Technical Network, ...
    • AD is one of the most reliable services in the Data Center
    • if SSO is down, AD is not affected
    • conversely, Kerberos and LDAP can be used to authenticate to SSO

  • Concezio:
    • LHCb agrees with the conclusions from the discussion so far
    • LHCb was less affected, as it was not taking data

  • Paolo:
    • for the record, the SSO was recovered in particular thanks to Vincent Brillault

  • Paolo:
    • please check OTG:0077330 linked to the agenda, about a
      new release of the CERN-CA-certs package (not the Grid CA)
    • the old certificate will expire on Sat July 29
  • Maarten:
    • is it used by Puppet?
  • Paolo:
    • not AFAIK
    • it may be used by XLDAP, but its clients typically use the ldap protocol, not ldaps
      • hence should not care about any certificate change
    • eduroam may well be the biggest user
  • Maarten:
    • a reminder in the week just before the expiration would be good
      • to be followed up
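
  As an illustration of the reminder idea above, the following sketch reports how many days remain before each certificate in a PEM bundle expires and warns when one is close to expiring. The bundle path and the 7-day warning threshold are assumptions made for illustration; the third-party Python "cryptography" package is assumed to be available.

      #!/usr/bin/env python3
      # Illustration only: report how many days remain before each certificate
      # in a PEM bundle expires, to support a reminder shortly before a date
      # such as the July 29 expiration mentioned above. The bundle path and the
      # 7-day threshold are assumptions; requires the "cryptography" package.
      import re
      import sys
      from datetime import datetime, timezone
      from cryptography import x509

      PEM_BLOCK = re.compile(
          rb"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----", re.DOTALL)

      def check_bundle(path, warn_days=7):
          with open(path, "rb") as f:
              data = f.read()
          now = datetime.now(timezone.utc)
          for blob in PEM_BLOCK.findall(data):
              cert = x509.load_pem_x509_certificate(blob)
              expiry = cert.not_valid_after.replace(tzinfo=timezone.utc)
              days_left = (expiry - now).days
              print(f"{days_left:5d} days left  {cert.subject.rfc4514_string()}")
              if days_left <= warn_days:
                  print(f"      WARNING: expires within {warn_days} days")

      if __name__ == "__main__":
          check_bundle(sys.argv[1] if len(sys.argv) > 1 else "CERN-CA-certs.pem")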

HEPScore status update

please see the presentation

Discussion

  • Stephan:
    • are there plans for making a WN's HS23 rating available on the WN?
    • it would allow pilots to have a better idea of what can still be run in a slot
  • Maarten + Julia:
    • we tried that for HS06 in the Machine Job Features TF
    • only LHCb was interested and after a few years we gave up and closed it
  • Stephan:
    • it does not look too hard?
  • Domenico:
    • the bare-metal HS23 value could be published, but there may well be
      discrepancies with the performance of a grid job slot
      • see page 3 in the presentation
    • LHCb preferred letting their jobs run a fast benchmark at the beginning
      • obviously with less precision
  • Stephan:
    • overclocking could be an issue
    • the value at the start of a job might not be valid N hours later
  • Alessandro DG:
    • if we set up a central DB with records for most of the machines,
      we could have a simple lookup table in CVMFS
      (a minimal sketch follows at the end of this discussion)
    • if we ask each site admin to supply numbers, that will not work,
      we tried it several times
    • if a precision of 20-30% is good enough, a table should be sufficient
  • Domenico:
    • we have endeavored to make it easy to publish results centrally
  • David M:
    • mind that we have declared HS06 ratings of old HW to be on a par with HS23,
      which would imply those values cannot be trusted for such purposes
  • Stephan:
    • we could publish the values as HS06 for old HW
  • Domenico:
    • HS06 for old HW is equivalent to HS23 only in matters of accounting
    • the performance table would be filled with new measurements
  • Thomas:
    • can we stop relying on certificates for uploading those values?
  • Domenico:
    • that depends on when ActiveMQ can be made to support tokens
  • Julia:
    • token support has been requested also for monitoring workflows
    • it is on the plate of the Authorization WG
  • Maarten:
    • how would cloud resources be handled?
    • one will not be able to obtain bare-metal HS23 values for those
  • Stephan:
    • a CE flag would indicate that
    • it is not only CPUs that are concerned, but also GPUs etc.
  • Maarten:
    • my concern would be that another big effort gets spent on this,
      only for a conclusion to be reached that the numbers being
      published on the WNs are too unreliable in practice
  • Julia:
    • can we first look into a table?
    • it would offload the sites
    • to be followed up
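
  A minimal sketch of the lookup-table idea discussed above, under the assumption that per-CPU-model HS23 measurements were published as a JSON table on CVMFS (the path and the table layout below are hypothetical) and that a slot's rating can be approximated by scaling the bare-metal value linearly with the number of allocated cores, which is where the 20-30% precision caveat would come from.

      #!/usr/bin/env python3
      # Illustration only: a pilot reads a hypothetical JSON table of
      # per-CPU-model HS23 measurements from CVMFS and estimates the rating of
      # its job slot. The CVMFS path, the table layout and the linear per-core
      # scaling are all assumptions made for illustration.
      import json

      TABLE_PATH = "/cvmfs/some-repo.cern.ch/hepscore/hs23-by-cpu-model.json"  # hypothetical

      def cpu_model():
          """Return the CPU model string from /proc/cpuinfo (Linux only)."""
          with open("/proc/cpuinfo") as f:
              for line in f:
                  if line.startswith("model name"):
                      return line.split(":", 1)[1].strip()
          return "unknown"

      def slot_hs23(allocated_cores):
          """Estimate the HS23 of a job slot from a whole-node measurement.

          Assumes entries of the form {"<model>": {"hs23": <score>, "cores": <n>}}.
          Returns None if the model is not in the table, in which case a pilot
          could fall back to a local fast benchmark, as LHCb prefers.
          """
          with open(TABLE_PATH) as f:
              table = json.load(f)
          entry = table.get(cpu_model())
          if entry is None:
              return None
          return entry["hs23"] * allocated_cores / entry["cores"]

      if __name__ == "__main__":
          print("Estimated HS23 for an 8-core slot:", slot_hs23(8))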

  • Julia:
    • thanks very much to Domenico and colleagues for all the work done!

APEL progress on HEPScore integration

Postponed till the next meeting.

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Moderate to normal activity levels, no major issues
  • Tokens for normal jobs in use at most of the HTCondor CE sites
  • Continuing with switching sites from single- to multicore jobs
    • ~90% of the resources already support 8-core or whole-node jobs

ATLAS

  • Smooth running with 500-750k slots on average, mostly Full Simulation
    • Two peaks beyond 850k slots, one drop to 300k due to low simulation
  • Smooth data taking
  • TAPE REST API
    • CERN, FZK, and RAL migrated.
    • Ticketing campaign to other sites, next potential site: PIC
  • LHCOPN cloud peering initiative with ESnet
  • Transfers with tokens and HTTP protocol enabled on one storage for each SE implementation
    • Will now advertise and push
  • Waiting for large-scale mc23 campaigns to start, once reco releases are available

CMS

  • conflicting CMS O&C meeting, we'll connect late
  • smooth collision data recording and processing
  • brief factory no-new-pilot submission period but otherwise no major issues
    • good core usage between 180k and 430k cores
    • usual production/analysis split of about 3:1
    • up to 32k cores from HPC/opportunistic sources
    • main production activity still Run 2 ultra-legacy Monte Carlo
  • waiting on python3 version/port of HammerCloud
  • planning SRM to REST migration for Tape endpoints
  • working with our DPM sites to migrate to other storage technologies
  • token migration progressing steadily
    • storage at 11/62 sites ready
  • looking forward to 24x7 production IAM support by CERN
    • several services at sites running a version without public key caching
  • in discussion with Andrea Valassi about the HEP_OSlibs RPM; thanks for connecting us, Maarten

Discussion

  • Julia:
    • next time we will invite the new team responsible for HammerCloud

LHCb

  • usual activities, no major issues
    • dominated by simulation production
    • detector commissioning at Point 8, with very small samples transferred to the Tier 0 and distributed to Tier 1 disk and tape
  • planning for a reprocessing of Run 2 data during the 2023/2024 EYETS
    • massive tape recalls expected

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • The main effort is on HEPScore integration; see Adrian's presentation.

Information System Evolution TF

  • NTR

DPM migration to other solutions

  • 17 out of 45 tickets solved.

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

  • Held a meeting on XRootD server monitoring with the main developers of the OSG shoveler and collector, and with test users
    • Helped identify some issues and plan tasks
    • Working in close collaboration with OSG to sort out the issues as soon as possible
      • Discussing the possibility of a "hack-a-thon" to speed things up
  • No news from dCache team about the integration with MONIT

Network Throughput WG


  • perfSONAR infrastructure - 5.0.2 bugfix was released this week
    • All sites are encouraged to update
    • This release is available only for CC7; sites on auto-updates will be updated automatically (a reboot is recommended)
    • Support for Alma8/9 is coming soon (next couple of weeks)
  • WLCG/OSG Network Monitoring Platform
    • We're currently working on updating our documentation to reflect changes in 5.0 and new deployment models
      • We plan to send a broadcast with summary information to all sites (at the end of June)
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

WG for Transition to Tokens and Globus Retirement

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

-- JuliaAndreeva - 2023-06-06
