WLCG Operations Coordination Minutes, June 8, 2023

Highlights

  • We kindly remind sites and people dealing with pledge management to provide input to the surveys about accounting and pledge management tools and procedures. Thanks a lot to those who have already answered. The surveys will be closed at the end of June. The same request is addressed to the experiments: thanks a lot to ALICE and ATLAS, who have already answered.

Agenda

https://indico.cern.ch/event/1293820/

Attendance

  • local: Berk (CERN IAM), Christoph (CMS), Eric G (CERN IT-PW GL), Giuseppe (CMS), Julia (WLCG), Maarten (ALICE + WLCG), Panos (WLCG), Paolo (CERN Grid CA), Stephan (CMS)
  • remote: Adrian (APEL), Alessandra D (Napoli), Alessandro DG (ATLAS), Alessandro P (EGI), Andrew (TRIUMF), Asier (CERN SSO), Borja (monitoring), Concezio (LHCb), David C (Technion), David M (FNAL), David S (ATLAS), Domenico (WLCG), Leona (computing), Maria (computing), Mario (ATLAS), Matt D (Lancaster), Ofer (BNL), Petr (ATLAS + Prague), Sebastian (CERN SSO), Thomas (DESY)
  • apologies:

Operations News

  • the next meeting is planned for July 13

Special topics

Review of the CERN Grid CA expiration incident from the WLCG perspective

Links for collected material

please see the presentations

Discussion

  • Mario:
    • what might be the impact of the move to the new CA?
    • mind that old clients will be there at least throughout Run 3!
  • Paolo:
    • the new CA initiative is still under development
    • we will have a pilot service to collect feedback
    • the new CA should be much better w.r.t. host certificates:
      • the CA is in IGTF, yet trusted by default in browsers
      • it is already being used at other sites on the grid
  • Thomas:
    • we have been using Sectigo for about a year, but here is a word of caution:
      some clients ran into problems depending on the PEM format used
      • some libraries have issues with chains of host + issuer certificates
        (a minimal check sketch follows after this exchange)
  • Paolo:
    • we will need to see what to do about that
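
  A minimal sketch (an illustration, not an official tool) of the kind of check the caution above suggests: it lists the certificates found in a host + issuer PEM bundle in file order, so that chain-ordering or format surprises can be spotted before a client library trips over them. The default file name is only an example, and the third-party Python "cryptography" package is assumed to be available.

      #!/usr/bin/env python3
      # Illustration only: list the certificates found in a PEM bundle in file
      # order, so that chain-ordering or format problems of the kind reported
      # above can be spotted. Requires the third-party "cryptography" package;
      # the default file name is just an example.
      import re
      import sys
      from cryptography import x509

      PEM_BLOCK = re.compile(
          rb"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----", re.DOTALL)

      def load_chain(path):
          """Return the certificates contained in a PEM bundle, in file order."""
          with open(path, "rb") as f:
              data = f.read()
          return [x509.load_pem_x509_certificate(blob) for blob in PEM_BLOCK.findall(data)]

      if __name__ == "__main__":
          path = sys.argv[1] if len(sys.argv) > 1 else "hostcert-chain.pem"
          for i, cert in enumerate(load_chain(path)):
              print(f"[{i}] subject: {cert.subject.rfc4514_string()}")
              print(f"    issuer : {cert.issuer.rfc4514_string()}")
              print(f"    expires: {cert.not_valid_after}")

  For a well-ordered chain, the issuer of entry [0] should match the subject of entry [1], and so on; a mismatch or an unexpected number of entries is a hint of the kind of problem reported above.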

  • Julia:
    • does ATLAS have a suggestion for how the communication could be improved?
  • Mario:
    • once we managed to open a ticket about the dysfunctional SSO,
      the reply was not helpful ("it works for us")
    • emotions ran high because data taking was in jeopardy
    • in the end, some things cannot be controlled, which is acceptable
    • as ancient SW still needs to be supported, maybe test for that in the future?

  • Sebastian:
    • a service provider needs to know the dependencies on their service
    • is data taking affected when the SSO is down?
  • Mario:
    • we could not log into SNow
    • the main problem for ATLAS turned out to be the Python client issue on CC7,
      but things were not clear, there were several reports and escalations
  • David S:
    • data can be buffered, but many services depend on SSO
    • the bottom line is that SSO needs to be considered critical for data taking
  • Eric G:
    • do we need to do something special for that?
    • SSO sits on the GPN and has dependencies on other services in the data center
  • Maarten:
    • to create a separate SSO for the experiments would be very expensive
    • we should not go overboard here: even though the incident was unexpected and
      happened at quite an awkward time, it was mostly resolved within 3 hours
    • we apply the lessons learned and that should already help a lot
  • Paolo:
    • agreed
  • Eric G:
    • the next major SSO change will happen during a Technical Stop,
      which may be fine for the experiments, but not for the accelerators
  • Sebastian:
    • interventions are carefully planned and usually go OK

  • Mario:
    • what is the relationship between Kerberos and SSO?
    • data taking needs Kerberos to work
  • Paolo + Sebastian:
    • SSO depends on Active Directory (AD) services: Kerberos and LDAP
    • AD is provided by a sizable number of Domain Controller machines,
      some of which are for special networks
      • ATLAS, CMS, Technical Network, ...
    • AD is one of the most reliable services in the Data Center
    • if SSO is down, AD is not affected
    • conversely, Kerberos and LDAP can be used to authenticate to SSO

  • Concezio:
    • LHCb agrees with the conclusions from the discussion so far
    • LHCb was less affected, as it was not taking data

  • Paolo:
    • for the record, the SSO was recovered in particular thanks to Vincent Brillault

  • Paolo:
    • please check OTG:0077330 linked to the agenda, about a
      new release of the CERN-CA-certs package (not the Grid CA)
    • the old certificate will expire on Sat July 29
  • Maarten:
    • is it used by Puppet?
  • Paolo:
    • not AFAIK
    • it may be used by XLDAP, but its clients typically use the ldap protocol, not ldaps
      • hence should not care about any certificate change
    • eduroam may well be the biggest user
  • Maarten:
    • a reminder in the week just before the expiration would be good
      • to be followed up
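
  As an illustration of the reminder idea above, the following sketch reports how many days remain before each certificate in a PEM bundle expires and warns when one is close to expiring. The bundle path and the 7-day warning threshold are assumptions made for illustration; the third-party Python "cryptography" package is assumed to be available.

      #!/usr/bin/env python3
      # Illustration only: report how many days remain before each certificate
      # in a PEM bundle expires, to support a reminder shortly before a date
      # such as the July 29 expiration mentioned above. The bundle path and the
      # 7-day threshold are assumptions; requires the "cryptography" package.
      import re
      import sys
      from datetime import datetime, timezone
      from cryptography import x509

      PEM_BLOCK = re.compile(
          rb"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----", re.DOTALL)

      def check_bundle(path, warn_days=7):
          with open(path, "rb") as f:
              data = f.read()
          now = datetime.now(timezone.utc)
          for blob in PEM_BLOCK.findall(data):
              cert = x509.load_pem_x509_certificate(blob)
              expiry = cert.not_valid_after.replace(tzinfo=timezone.utc)
              days_left = (expiry - now).days
              print(f"{days_left:5d} days left  {cert.subject.rfc4514_string()}")
              if days_left <= warn_days:
                  print(f"      WARNING: expires within {warn_days} days")

      if __name__ == "__main__":
          check_bundle(sys.argv[1] if len(sys.argv) > 1 else "CERN-CA-certs.pem")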

HEPScore status update

please see the presentation

Discussion

  • Stephan:
    • are there plans for making a WN's HS23 rating available on the WN?
    • it would allow pilots to have a better idea of what can still be run in a slot
  • Maarten + Julia:
    • we tried that for HS06 in the Machine Job Features TF
    • only LHCb was interested and after a few years we gave up and closed it
  • Stephan:
    • it does not look too hard?
  • Domenico:
    • the bare-metal HS23 value could be published, but there may well be
      discrepancies with the performance of a grid job slot
      • see page 3 in the presentation
    • LHCb preferred letting their jobs run a fast benchmark at the beginning
      • obviously with less precision
  • Stephan:
    • overclocking could be an issue
    • the value at the start of a job might not be valid N hours later
  • Alessandro DG:
    • if we set up a central DB with records for most of the machines,
      we could have a simple lookup table in CVMFS
      (a minimal sketch follows at the end of this discussion)
    • if we ask each site admin to supply numbers, that will not work,
      we tried it several times
    • if a precision of 20-30% is good enough, a table should be sufficient
  • Domenico:
    • we have endeavored to make it easy to publish results centrally
  • David M:
    • mind that we have declared HS06 ratings of old HW to be on a par with HS23,
      which would imply those values cannot be trusted for such purposes
  • Stephan:
    • we could publish the values as HS06 for old HW
  • Domenico:
    • HS06 for old HW is equivalent to HS23 only in matters of accounting
    • the performance table would be filled with new measurements
  • Thomas:
    • can we stop relying on certificates for uploading those values?
  • Domenico:
    • that depends on when ActiveMQ can be made to support tokens
  • Julia:
    • token support has been requested also for monitoring workflows
    • it is on the plate of the Authorization WG
  • Maarten:
    • how would cloud resources be handled?
    • one will not be able to obtain bare-metal HS23 values for those
  • Stephan:
    • a CE flag would indicate that
    • it is not only CPUs that are concerned, but also GPUs etc.
  • Maarten:
    • my concern would be that another big effort gets spent on this,
      only for a conclusion to be reached that the numbers being
      published on the WNs are too unreliable in practice
  • Julia:
    • can we first look into a table?
    • it would offload the sites
    • to be followed up
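
  A minimal sketch of the lookup-table idea discussed above, under the assumption that per-CPU-model HS23 measurements were published as a JSON table on CVMFS (the path and the table layout below are hypothetical) and that a slot's rating can be approximated by scaling the bare-metal value linearly with the number of allocated cores, which is where the 20-30% precision caveat would come from.

      #!/usr/bin/env python3
      # Illustration only: a pilot reads a hypothetical JSON table of
      # per-CPU-model HS23 measurements from CVMFS and estimates the rating of
      # its job slot. The CVMFS path, the table layout and the linear per-core
      # scaling are all assumptions made for illustration.
      import json

      TABLE_PATH = "/cvmfs/some-repo.cern.ch/hepscore/hs23-by-cpu-model.json"  # hypothetical

      def cpu_model():
          """Return the CPU model string from /proc/cpuinfo (Linux only)."""
          with open("/proc/cpuinfo") as f:
              for line in f:
                  if line.startswith("model name"):
                      return line.split(":", 1)[1].strip()
          return "unknown"

      def slot_hs23(allocated_cores):
          """Estimate the HS23 of a job slot from a whole-node measurement.

          Assumes entries of the form {"<model>": {"hs23": <score>, "cores": <n>}}.
          Returns None if the model is not in the table, in which case a pilot
          could fall back to a local fast benchmark, as LHCb prefers.
          """
          with open(TABLE_PATH) as f:
              table = json.load(f)
          entry = table.get(cpu_model())
          if entry is None:
              return None
          return entry["hs23"] * allocated_cores / entry["cores"]

      if __name__ == "__main__":
          print("Estimated HS23 for an 8-core slot:", slot_hs23(8))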

  • Julia:
    • thanks very much to Domenico and colleagues for all the work done!

APEL progress on HEPScore integration

Postponed till the next meeting.

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Moderate to normal activity levels, no major issues
  • Tokens for normal jobs in use at most of the HTCondor CE sites
  • Continuing with switching sites from single- to multicore jobs
    • ~90% of the resources already support 8-core or whole-node jobs

ATLAS

  • Smooth running with 500-750k slots on average, mostly Full Simulation
    • Two peaks beyond 850k slots, one drop to 300k due to low simulation
  • Smooth data taking
  • TAPE REST API
    • CERN, FZK, and RAL migrated.
    • Ticketing campaign to other sites, next potential site: PIC
  • LHCOPN cloud peering initiative with ESnet
  • Transfers with tokens and HTTP protocol enabled on one storage for each SE implementation
    • Will now advertise and push
  • Waiting for large-scale mc23 campaigns to start, once reco releases are available

CMS

  • conflicting CMS O&C meeting, we'll connect late
  • smooth collision data recording and processing
  • brief factory no-new-pilot submission period but otherwise no major issues
    • good core usage between 180k and 430k cores
    • usual production/analysis split of about 3:1
    • up to 32k cores from HPC/opportunistic sources
    • main production activity still Run 2 ultra-legacy Monte Carlo
  • waiting on python3 version/port of HammerCloud
  • planning SRM to REST migration for Tape endpoints
  • working with our DPM sites to migrate to other storage technologies
  • token migration progressing steadily
    • storage at 11/62 sites ready
  • looking forward to 24x7 production IAM support by CERN
    • several services at sites running a version without public key caching
  • in discussion with Andrea Valassi about the HEP_OSlibs RPM; thanks for connecting us, Maarten

Discussion

  • Julia:
    • next time we will invite the new team responsible for HammerCloud

LHCb

  • usual activities, no major issues
    • dominated by simulation production
    • detector commissioning at Point 8, with very small samples transferred to the Tier 0 and distributed to Tier 1 disk and tape
  • planning for a reprocessing of Run 2 data during the 2023/2024 EYETS
    • massive tape recalls expected

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • The main effort is on HEPScore integration; see Adrian's presentation.

Information System Evolution TF

  • NTR

DPM migration to other solutions

  • 17 out of 45 tickets solved.

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

  • Held a meeting on XRootD server monitoring with the main developers of the OSG shoveler and collector, and with test users
    • Helped identify some issues and plan tasks
    • Working in close collaboration with OSG to sort out the issues as soon as possible
      • Discussing the possibility of a "hack-a-thon" to speed things up
  • No news from dCache team about the integration with MONIT

Network Throughput WG


  • perfSONAR infrastructure - 5.0.2 bugfix was released this week
    • All sites are encouraged to update
    • This release is available only for CC7; sites on auto-updates will be updated automatically (a reboot is recommended)
    • Support for Alma8/9 is coming soon (next couple of weeks)
  • WLCG/OSG Network Monitoring Platform
    • We're currently working on updating our documentation to reflect changes in 5.0 and new deployment models
      • We plan to send a broadcast with summary information to all sites (at the end of June)
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

WG for Transition to Tokens and Globus Retirement

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

-- JuliaAndreeva - 2023-06-06
