WLCG Operations Coordination Minutes, July 13, 2023

Highlights

Agenda

https://indico.cern.ch/event/1306667/

Attendance

  • local: Julia (WLCG), Maarten (ALICE + WLCG), Natalie (Service Management)
  • remote: Adrian (APEL), Andrea (WLCG), Andrew (TRIUMF), David C (Technion), David M (FNAL), Eric (IN2P3), Giuseppe (CMS), Mark (LHCb + Birmingham), Matt D (Lancaster), Max (KIT), Panos (WLCG), Peter (ATLAS), Petr (ATLAS + Prague), Stephan (CMS), Steve (HammerCloud), Thomas (DESY)
  • apologies:

Operations News

  • the next meeting is planned for Sep 14

Special topics

HammerCloud status report

see the presentation

Discussion

  • Stephan:
    • when the old SSO is put behind the firewall, it will affect
      operators at FNAL who need to look at the jobs
    • and site admins as well
  • Steve:
    • people with CERN accounts can tunnel through lxplus
      • recipes will be provided
    • no estimate yet of when HammerCloud can be reached via the new SSO

  • Julia:
    • what kinds of data are in the DB and what should be their retention period(s)?
  • Steve:
    • the DB keeps job log files; besides results, it contains a lot of transient data not worth keeping
    • the large DB size slows down queries
    • the proposal is to keep just the last 3 months of results, TBC

CERN support units in the new WLCG helpdesk

see the presentation

Discussion

  • Stephan:
    • a first thing to note is that the new Helpdesk has progressed very slowly
    • 1 ticket every 2 days on average is already significant
      • also taking into account all the ticket updates
    • the SNow interface would also serve FNAL and possibly others
    • the use of SNow runs into permission issues all the time!
    • mind that tickets opened for CERN do not just concern CERN people:
      other parties need to be involved for more complex cases
      • the new Helpdesk is hoped to make that easier

  • Natalie:
    • 15 tickets per month is low compared to the 6k user tickets served per month
      across all CERN IT Department services
    • ticket visibility issues can be followed up
    • special ticket forms can have relevant parties added automatically
      as is done today for the WLCG CERN Collaborative tools form

  • Stephan:
    • we have often heard that access permissions would be reviewed:
      not much has come of it so far
    • at FNAL there have been similar complaints about their SNow instance
    • we do not expect any improvements
    • asking people to subscribe would be a big hurdle
    • mind they would typically be in different time zones as well

  • Peter:
    • we support the arguments brought forward by CMS
    • SNow ACLs have always been a problem
    • as tickets to CERN potentially affect thousands of grid users,
      their handling must be smooth
    • also mind we only have a skeleton crew for operations

  • Stephan:
    • CERN tickets are much more important than other tickets indeed

  • Natalie:
    • at CERN we still use SNow in the end
    • could we at least look into dissolving ROC_CERN?

  • all:
    • yes

  • Natalie:
    • it should not be up to the CERN Service Management team to
      train the developers of the new Helpdesk tool

  • David M:
    • can ping the SNow team at FNAL about the various matters raised

  • Stephan:
    • more sites may be interested
    • perhaps RAL?

  • Adrian:
    • for RAL, the new Helpdesk would need to be interfaced to JIRA

  • Maarten:
    • it appears we cannot draw firm conclusions yet
    • we will follow up with the parties concerned

APEL update on HEPscore integration

see the presentation

Discussion

  • Julia:
    • has there been feedback from sites about the new APEL client?

  • Adrian:
    • things look OK so far, but the portal testing is key
    • to be continued when key people are back from holidays

Overview of the site input for the accounting and pledge management survey

see the presentation

Discussion

  • Thomas:
    • are results available for each CE type?
  • Julia:
    • no, we did not ask sites which CE type they have

  • Max:
    • could the results be split by tier?
      • T1 sites may have different concerns than T2 sites
  • Julia:
    • yes, we can do that for the GDB presentation we will have in the end

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Moderate to normal activity levels, generally no major issues
    • KISTI was down for 1.5 days because of a security incident
      • Services for ALICE were not compromised
    • Discussion with RAL about their CTA buffer functionality (GGUS:162369)
  • Tokens for normal jobs in use at almost all HTCondor CE sites now
    • 2 small sites remaining
  • Continuing with switching sites from single- to multicore jobs
    • ~90% of the resources already support 8-core or whole-node jobs

ATLAS

  • Smooth running with 500-750k slots on average, mostly Full Simulation
    • Healthy mix of MC Reconstruction, Group Production, User Analysis
  • Smooth data taking and export
    • SARA-TAPE issue GGUS:162655, currently unsubscribed from ATLAS Tier-0 export of RAW data
  • 2022 data reprocessing launched on Google Cloud resources
  • Large scale MC campaigns continue for both 2022 and 2023
  • Due to ongoing issues with TW and AUSTRALIA, unique data has been removed and storage endpoints set offline
  • Phase out of (support for) x86-64-v1 CPUs (~2007) almost done

CMS

  • smooth collision data recording and processing
  • overall smooth running, no major issues
    • good core usage between 300k and 430k cores
    • significant HPC/opportunistic contributions
    • usual production/analysis split of about 3:1
    • main production activity Run 3 and Run 2 ultra-legacy Monte Carlo
  • waiting on python3 version/port of HammerCloud
  • starting SRM to REST migration for Tape endpoints
  • working with our DPM sites to migrate to other storage technology
  • token migration progressing steadily
    • working with native xrootd and StoRM sites
  • looking forward to 24x7 production IAM support by CERN
  • started to overhaul SAM worker node tests, to be phased in with IAM-issued token submission
    • discovered very small CVMFS caches on a few worker nodes and plan to require a minimum size
  • who coordinates/follows up on site downtimes matching EGI don't-use in case of security incidents?

Discussion

  • Maarten:
    • will check if the handling of the KISTI security incident
      left something to be improved for future cases

LHCb

  • Smooth running with no major issues to report
    • Mostly simulation running at present
  • Have switched most (~85%) of the HTCondor CEs to using tokens
    • Plan to move the ARC CEs next

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • Survey of the accounting and pledge management tools and procedures
  • HEPScore integration in the accounting workflow

Migration from DPM to alternative solutions

  • Steady progress

Information System Evolution TF

  • NTR

IPv6 Validation and Deployment TF

Detailed status here.

Monitoring

  • During shoveler (new component for xrootd monitoring) testing several issues have been detected. Debugging and fixing are ongoing.
  • Polishing documentation for launching a campaign on enabling site network monitoring.

Network Throughput WG


WG for Transition to Tokens and Globus Retirement

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

-- JuliaAndreeva - 2023-07-11
Topic revision: r14 - 2023-07-18 - MaartenLitmaath
 