WLCG Operations Coordination Minutes, July 2, 2020
Highlights
Agenda
https://indico.cern.ch/event/933764/
Attendance
- local:
- remote: Alberto (monitoring), Andrew (TRIUMF), Borja (monitoring), Concezio (LHCb), Dave (FNAL), David (Technion), Felix (ASGC), Gavin (T0), Giuseppe (CMS), Horst (Oklahoma), Johannes (ATLAS), Maarten (ALICE + WLCG), Matt (Lancaster), Nikolay (monitoring), Pedro (monitoring), Stephan (CMS), Vincent (security)
- apologies:
Operations News
- The next meeting is planned for Sep 3
Discussion
- Concezio:
- the MJF functionality is not critical for LHCb at WLCG sites
- we do depend on it for Vac and cloud setups, though
- Maarten:
- WLCG Ops Coordination is concerned with activities that are potentially
  of interest to more than one experiment
- for MJF that looked to be the case a few years ago, but in the end only LHCb
  wanted to pursue that functionality
- we have simply moved the TF to the page listing the closed TFs (done);
  all its materials remain available for continued use
Special topics
WLCG Critical Services proposal followup
- There were no objections to going ahead with the proposed changes,
  which can still be fine-tuned further as we implement them
- To be finalized by autumn
CERN Grid CA OCSP incident
- After a scheduled intervention on June 24, the CERN Grid CA OCSP service
became inaccessible from outside CERN (OTG:0057432)
- Requests to the service were dropped by the CERN perimeter firewall
- A CREAM CE will try to check a client certificate's status via OCSP,
if the existence of such an endpoint is indicated in the certificate details
- It appears other CE flavors rely on CRLs only and just ignore OCSP services
- Checks of CERN Grid CA certificates were then hanging until a timeout was reached
- The CREAM client code would time out first, thus failing job submissions that
used CERN Grid CA certificates
- This affected the 4 experiments and, through the SAM tests, sites running CREAM
- Some A/R recomputations may be needed
- The service was restored about 24h later on June 25
- Some improvements are foreseen to make a recurrence a lot less probable
Discussion
- Maarten: the trouble was that the requests were not failing quickly;
  as far as we know, an immediate failure, e.g. a refused connection, is not fatal
  (a minimal sketch of such an OCSP check with a timeout follows after this discussion)
- Stephan: is only CREAM affected? we saw a ticket implicating an Xrootd service
- Maarten: please send me the details and I will look into the need for followup
  - after the meeting: it probably was a mistaken inference
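Below is a minimal sketch, in Python, of the kind of OCSP check at issue; it is not the CREAM implementation. It looks up the OCSP responder URL advertised in a certificate and queries it with an explicit timeout, so that an unreachable responder (e.g. one whose requests are silently dropped by a firewall) fails fast instead of hanging. It assumes the third-party cryptography and requests packages; the file names in the usage part are placeholders.

```python
# Minimal sketch (not the CREAM implementation) of an OCSP status check with
# an explicit timeout. Assumes the 'cryptography' and 'requests' packages.
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.x509 import ocsp
from cryptography.x509.oid import AuthorityInformationAccessOID, ExtensionOID
import requests


def ocsp_status(cert_pem: bytes, issuer_pem: bytes, timeout: float = 5.0) -> str:
    cert = x509.load_pem_x509_certificate(cert_pem)
    issuer = x509.load_pem_x509_certificate(issuer_pem)

    # The OCSP responder URL, if any, is advertised in the certificate's
    # Authority Information Access extension
    try:
        aia = cert.extensions.get_extension_for_oid(
            ExtensionOID.AUTHORITY_INFORMATION_ACCESS).value
    except x509.ExtensionNotFound:
        return "no OCSP endpoint advertised; rely on CRLs only"
    urls = [d.access_location.value for d in aia
            if d.access_method == AuthorityInformationAccessOID.OCSP]
    if not urls:
        return "no OCSP endpoint advertised; rely on CRLs only"

    # Build a DER-encoded OCSP request and POST it; the timeout is what
    # prevents the check from hanging when requests are silently dropped
    req = (ocsp.OCSPRequestBuilder()
           .add_certificate(cert, issuer, hashes.SHA1())
           .build()
           .public_bytes(serialization.Encoding.DER))
    resp = requests.post(urls[0], data=req, timeout=timeout,
                         headers={"Content-Type": "application/ocsp-request"})
    return ocsp.load_der_ocsp_response(resp.content).certificate_status.name


# Hypothetical usage; the file names are placeholders
if __name__ == "__main__":
    with open("client-cert.pem", "rb") as c, open("ca-cert.pem", "rb") as i:
        print(ocsp_status(c.read(), i.read()))
```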
SAM migration progress
see the presentation
Middleware News
- Useful Links
- Baselines/News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Mostly business as usual, no major issues
ATLAS
- Stable Grid production with up to ~380k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~45k slots from the HLT/Sim@CERN-P1 farm and ~15k slots from Boinc. Occasional additional peaks of 200k job slots from HPCs.
- Continuing with about 60k job slots used for Folding@Home jobs since 4 April. 50% from ~55 different grid sites via opt-in and 50% at CERN-P1
- No other major issues apart from the usual storage- or transfer-related problems at sites
- Finishing the grand unification of production and analysis queues in PanDA in the coming days.
- All systems recovered quickly from the Oracle/DBonDemand downtime last Saturday - we would appreciate it if such downtimes could be avoided over the weekend next time
- CTA in production for ATLAS since Monday - still fixing some issues in Rucio/middleware
CMS
- Covid-19 compute contributions being returned to experiment use
- main processing activities:
- Run 2 ultra-legacy Monte Carlo
- Run 2 pre-UL Monte Carlo
- migration to Rucio ongoing
- production of nanoAOD samples configured for PhEDEx being bumped up to complete more quickly
LHCb
- still running F@H on part of the HLT farm
- large MC requests are coming up, so we are going to reduce this Covid-19-related activity
- processing (small) samples of lead-lead collisions and lead-neon fixed target collisions
- the grid was drained in preparation for the CERN Oracle/DBOD outage of last Saturday; DIRAC services and agents were switched off, then on again after the outage; everything went extremely smoothly
Discussion on F@H reductions
- Maarten: it is perfectly defensible to ramp down resources for F@H,
as we have already done a lot and we cannot neglect our own duties
Task Forces and Working Groups
GDPR and WLCG services
Accounting TF
Archival Storage WG
Containers WG
CREAM migration TF
Details here
Summary:
- 90 tickets
- 14 done: 7 ARC, 7 HTCondor
- 16 sites plan for ARC, 15 are considering it
- 20 sites plan for HTCondor, 14 are considering it, 8 consider using SIMPLE
- 14 tickets on hold, to be continued in the coming weeks / months
- 7 tickets without reply
- response times possibly affected by COVID-19 measures
dCache upgrade TF
DPM upgrade TF
StoRM upgrade TF
Information System Evolution TF
IPv6 Validation and Deployment TF
Detailed status here.
Machine/Job Features TF
Monitoring
MW Readiness WG
Network Throughput WG
- perfSONAR infrastructure status - version 4.2.4 was released - please upgrade
- 100 Gbps perfSONAR mesh established with participating sites TRIUMF, CERN, BNL, KIT, IC, AGLT2 and Prague
- New LHCONE mesh established, testing from sites to R&E perfSONAR endpoints (on LHCONE)
- OSG/WLCG infrastructure
- Discussing the plan for migration to the push model - direct publishing of results from the toolkits to RabbitMQ (a minimal sketch follows after this list)
- ESnet (router) traffic feed now available; working on its integration into our pipeline - a prototype is already working
- WLCG Network Throughput Support Unit: see the twiki for a summary of recent activities.
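Below is a minimal sketch of what such direct push-model publishing could look like, assuming the pika client library; the broker host, exchange and routing key are placeholders, not the actual OSG/WLCG configuration.

```python
# Minimal sketch of push-model publishing to RabbitMQ, assuming the 'pika'
# client library. Broker host, exchange and routing key are placeholders,
# not the actual OSG/WLCG configuration.
import json
import pika


def publish_result(result: dict,
                   host: str = "msg-broker.example.org",
                   exchange: str = "perfsonar.raw",
                   routing_key: str = "throughput") -> None:
    """Publish one measurement result directly from the toolkit to RabbitMQ."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    try:
        channel = connection.channel()
        channel.basic_publish(
            exchange=exchange,
            routing_key=routing_key,
            body=json.dumps(result),
            properties=pika.BasicProperties(content_type="application/json",
                                            delivery_mode=2))  # persistent message
    finally:
        connection.close()


# Example usage with a fabricated measurement record
publish_result({"source": "site-A", "destination": "site-B",
                "metric": "throughput", "value_gbps": 9.4})
```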
Traceability WG
Action list
Specific actions for experiments
Specific actions for sites
AOB