Week of 230925

WLCG Operations Call details

  • The connection details for remote participation are provided on this agenda page.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to the wlcg-scod list (at cern.ch), so that the SCOD can make sure the relevant parties have time to collect the required information or invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Ivan (ATLAS), Maarten (ALICE, WLCG, GGUS)
  • remote: Alessandro (CNAF), Andrew (TRIUMF), Borja (Monitoring), Christian (NDGF), Christoph (CMS), Darren (RAL), David B (IN2P3), David M (FNAL), Julia (WLCG), Mark (LHCb), Nils (CERN Computing), Onno (NL-T1), Panos (Chair, WLCG)

Experiments round table:

  • ALICE
    • Low-ish activity on average, no problems
    • The cause of certain jobs misbehaving at CNAF is not yet understood
      • Experts are looking into further limiting the fallout of any bad jobs

  • LHCb
    • RAL: Some files unavailable (GGUS:163410)
    • IN2P3: Started a rolling reboot for a kernel update. Just to confirm: does WLCG recommend declaring a downtime in GOCDB for this? LHCb would prefer it!
      • Maarten: If resources are drained first, an announcement is not strictly necessary; it is at the site's discretion. If jobs will be killed, there should be a proper announcement.
      • David B.: The WNs are drained before the reboots. If WLCG needs downtimes declared for these operations, we would be happy to do so.
      • Maarten: It depends on how visible the reduction of resources is: there is no need if it is less than 50%, especially at multi-VO sites.

Sites / Services round table:

  • ASGC:
  • BNL:
  • CNAF:
  • EGI:
  • FNAL:
  • IN2P3: The number of jobs went down yesterday and stayed low until early this afternoon. It seems related to dCache, but the cause was not found and everything came back on track without any action on our side.
  • JINR: NTR
  • KISTI:
  • KIT:
  • NDGF: HPC2N moved to a 100 Gbit/s network last week. We saw a few sub-5-minute connectivity drops after the move. Pools were left read-only over the weekend, but the gremlins have been purged from the network and everything is back in production.
  • NL-T1: Last Friday we upgraded dCache from 8.2.24 to 8.2.32, hoping it would let us solve a directory listing issue: some directory listings are queued, waiting for someone else's listing to complete, and we have a user with extremely large directories. This used to work fine long ago, but the dCache team recently changed the directory-listing queuing algorithm, and the new algorithm caused these problems for our users. With 8.2.32 the algorithm is configurable, and we hoped to revert to the old one. Unfortunately reverting did not seem to work and we still see queuing, so we have reached out to the dCache developers for support. There is a (lengthy) ticket about this at https://github.com/dCache/dcache/issues/7252.
    • Experiments didn't report noticing any big effects on their side.
    • Maarten: The LHC experiments don't use that functionality very much, which is probably why they didn't notice. A possible solution would be for dCache to set a high hard limit on the directory size.
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: Lost one rack of batch servers this morning due to a faulty switch (OTG:0144681); only minor impact on capacity.
  • CERN storage services:
  • CERN databases:
  • GGUS: NTR
  • Monitoring: NTR
  • Middleware: NTR
  • Networks:
  • Security:
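The NL-T1 directory-listing report above describes a head-of-line blocking effect: when all listings share one FIFO queue served by a fixed number of workers, one listing of an extremely large directory delays everyone else's. A minimal, hypothetical Python sketch (not dCache code; the function name and costs are invented for illustration) shows the effect:

```python
# Hypothetical sketch of head-of-line blocking in a shared listing queue.
# Not dCache code: simulate() and the job costs are invented for illustration.
import queue
import threading
import time

def simulate(list_threads, jobs):
    """Run directory-listing 'jobs' (name, cost in seconds) through a
    single FIFO queue served by `list_threads` workers; return the
    order in which listings completed."""
    q = queue.Queue()
    done = []
    lock = threading.Lock()
    for job in jobs:
        q.put(job)

    def worker():
        while True:
            try:
                name, cost = q.get_nowait()
            except queue.Empty:
                return
            time.sleep(cost)  # pretend to scan the directory entries
            with lock:
                done.append(name)

    threads = [threading.Thread(target=worker) for _ in range(list_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done

# One worker thread: the huge listing submitted first delays all small ones,
# even though each small listing would finish almost instantly on its own.
order = simulate(1, [("huge-dir", 0.2), ("small-1", 0.01), ("small-2", 0.01)])
print(order)  # ['huge-dir', 'small-1', 'small-2'] -- small listings waited
```

This is why a hard cap on directory size (or a queuing policy that does not let one listing monopolize the workers) limits the damage a single oversized directory can do.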

AOB:

Topic revision: r14 - 2023-09-26 - NikolayVoytishin