WLCG Operations Coordination Minutes, July 2, 2020
Highlights
Agenda
https://indico.cern.ch/event/933764/
Attendance
- local:
- remote: Alberto (monitoring), Andrew (TRIUMF), Borja (monitoring), Concezio (LHCb), Dave (FNAL), David (Technion), Felix (ASGC), Gavin (T0), Giuseppe (CMS), Horst (Oklahoma), Johannes (ATLAS), Maarten (ALICE + WLCG), Matt (Lancaster), Nikolay (monitoring), Pedro (monitoring), Stephan (CMS), Vincent (security)
- apologies:
Operations News
- The next meeting is planned for Sep 3
Discussion
- Concezio:
- the MJF functionality is not critical for LHCb at WLCG sites
- we do depend on it for Vac and cloud setups, though
- Maarten:
- WLCG Ops Coordination is concerned with activities that are potentially
  of interest to more than one experiment
- for MJF that looked to be the case a few years ago, but in the end only LHCb
  wanted to pursue that functionality
- we have simply moved the TF to the page listing the closed TFs (done);
  all its materials remain available for continued use
Special topics
WLCG Critical Services proposal followup
- There were no objections to going ahead with the proposed changes,
  which can still be fine-tuned further as we implement them
- To be finalized by autumn
CERN Grid CA OCSP incident
- After a scheduled intervention on June 24, the CERN Grid CA OCSP service
became inaccessible from outside CERN (OTG:0057432)
- Requests to the service were dropped by the CERN perimeter firewall
- A CREAM CE will try to check a client certificate's status via OCSP,
if the existence of such an endpoint is indicated in the certificate details
- It appears other CE flavors rely on CRLs only and just ignore OCSP services
- Checks of CERN Grid CA certificates were then hanging until a timeout was reached
- The CREAM client code would time out first, thus failing job submissions that
used CERN Grid CA certificates
- This affected the 4 experiments and, through the SAM tests, sites running CREAM
- Some A/R recomputations may be needed
- The service was restored about 24h later on June 25
- Some improvements are foreseen to make a recurrence a lot less probable
Discussion
- Maarten: the trouble was that the requests were not failing quickly;
  as far as we know, an immediate failure, e.g. a refused connection, is not fatal
  (a minimal sketch of such an OCSP check with a timeout follows after this discussion)
- Stephan: is only CREAM affected? we saw a ticket implicating an Xrootd service
- Maarten: please send me the details and I will look into the need for followup
  - after the meeting: it probably was a mistaken inference
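Below is a minimal sketch, in Python, of the kind of OCSP check at issue; it is not the CREAM implementation. It looks up the OCSP responder URL advertised in a certificate and queries it with an explicit timeout, so that an unreachable responder (e.g. one whose requests are silently dropped by a firewall) fails fast instead of hanging. It assumes the third-party cryptography and requests packages; the file names in the usage part are placeholders.

```python
# Minimal sketch (not the CREAM implementation) of an OCSP status check with
# an explicit timeout. Assumes the 'cryptography' and 'requests' packages.
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.x509 import ocsp
from cryptography.x509.oid import AuthorityInformationAccessOID, ExtensionOID
import requests


def ocsp_status(cert_pem: bytes, issuer_pem: bytes, timeout: float = 5.0) -> str:
    cert = x509.load_pem_x509_certificate(cert_pem)
    issuer = x509.load_pem_x509_certificate(issuer_pem)

    # The OCSP responder URL, if any, is advertised in the certificate's
    # Authority Information Access extension
    try:
        aia = cert.extensions.get_extension_for_oid(
            ExtensionOID.AUTHORITY_INFORMATION_ACCESS).value
    except x509.ExtensionNotFound:
        return "no OCSP endpoint advertised; rely on CRLs only"
    urls = [d.access_location.value for d in aia
            if d.access_method == AuthorityInformationAccessOID.OCSP]
    if not urls:
        return "no OCSP endpoint advertised; rely on CRLs only"

    # Build a DER-encoded OCSP request and POST it; the timeout is what
    # prevents the check from hanging when requests are silently dropped
    req = (ocsp.OCSPRequestBuilder()
           .add_certificate(cert, issuer, hashes.SHA1())
           .build()
           .public_bytes(serialization.Encoding.DER))
    resp = requests.post(urls[0], data=req, timeout=timeout,
                         headers={"Content-Type": "application/ocsp-request"})
    return ocsp.load_der_ocsp_response(resp.content).certificate_status.name


# Hypothetical usage; the file names are placeholders
if __name__ == "__main__":
    with open("client-cert.pem", "rb") as c, open("ca-cert.pem", "rb") as i:
        print(ocsp_status(c.read(), i.read()))
```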
SAM migration progress
see the presentation
Middleware News
- Useful Links
- Baselines/News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Mostly business as usual, no major issues
ATLAS
- Stable Grid production with up to ~380k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~45k slots from the HLT/Sim@CERN-P1 farm and ~15k slots from Boinc. Occasional additional peaks of 200k job slots from HPCs.
- Continuing with about 60k job slots used for Folding@Home jobs since 4 April. 50% from ~55 different grid sites via opt-in and 50% at CERN-P1
- No other major issues apart from the usual storage- or transfer-related problems at sites
- Finishing the grand unification of production and analysis queues in PanDA in the coming days.
- All systems recovered quickly from the Oracle/DBonDemand downtime last Saturday - we would appreciate it if such downtimes could be avoided over the weekend next time
- CTA in production for ATLAS since Monday - still fixing some issues in Rucio/middleware
CMS
- Covid-19 compute contributions being returned to experiment use
- main processing activities:
- Run 2 ultra-legacy Monte Carlo
- Run 2 pre-UL Monte Carlo
- migration to Rucio ongoing
- production of nanoAOD samples configured for PhEDEx being bumped up to complete more quickly
LHCb
- still running F@H on part of the HLT farm
- large MC requests are coming up, so we are going to reduce this Covid-19-related activity
- processing (small) samples of lead-lead collisions and lead-neon fixed target collisions
- the grid was drained in preparation for the CERN Oracle/DBOD outage of last Saturday; DIRAC services and agents were switched off, then on again after the outage; everything went extremely smoothly
Discussion on F@H reductions
- Maarten: it is perfectly defensible to ramp down resources for F@H,
as we have already done a lot and we cannot neglect our own duties
Task Forces and Working Groups
GDPR and WLCG services
Accounting TF
Archival Storage WG
Containers WG
CREAM migration TF
Details here
Summary:
- 90 tickets
- 14 done: 7 ARC, 7 HTCondor
- 16 sites plan for ARC, 15 are considering it
- 20 sites plan for HTCondor, 14 are considering it, 8 consider using SIMPLE
- 14 tickets on hold, to be continued in the coming weeks / months
- 7 tickets without reply
- response times possibly affected by COVID-19 measures
dCache upgrade TF
DPM upgrade TF
StoRM upgrade TF
Information System Evolution TF
IPv6 Validation and Deployment TF
Detailed status here.
Machine/Job Features TF
Monitoring
MW Readiness WG
Network Throughput WG
- perfSONAR infrastructure status - version 4.2.4 was released - please upgrade
- 100 Gbps perfSONAR mesh established with participating sites TRIUMF, CERN, BNL, KIT, IC, AGLT2 and Prague
- New LHCONE mesh established, testing from sites to R&E perfSONAR endpoints (on LHCONE)
- OSG/WLCG infrastructure
- Discussing the plan for migration to the push model - direct publishing of results from the toolkits to RabbitMQ (a minimal sketch follows after this list)
- ESnet (router) traffic feed now available; working on its integration into our pipeline - a prototype is already working
- WLCG Network Throughput Support Unit: see the twiki for a summary of recent activities.
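Below is a minimal sketch of what such direct push-model publishing could look like, assuming the pika client library; the broker host, exchange and routing key are placeholders, not the actual OSG/WLCG configuration.

```python
# Minimal sketch of push-model publishing to RabbitMQ, assuming the 'pika'
# client library. Broker host, exchange and routing key are placeholders,
# not the actual OSG/WLCG configuration.
import json
import pika


def publish_result(result: dict,
                   host: str = "msg-broker.example.org",
                   exchange: str = "perfsonar.raw",
                   routing_key: str = "throughput") -> None:
    """Publish one measurement result directly from the toolkit to RabbitMQ."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    try:
        channel = connection.channel()
        channel.basic_publish(
            exchange=exchange,
            routing_key=routing_key,
            body=json.dumps(result),
            properties=pika.BasicProperties(content_type="application/json",
                                            delivery_mode=2))  # persistent message
    finally:
        connection.close()


# Example usage with a fabricated measurement record
publish_result({"source": "site-A", "destination": "site-B",
                "metric": "throughput", "value_gbps": 9.4})
```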
Traceability WG
Action list
Specific actions for experiments
Specific actions for sites
AOB