Week of 220307

WLCG Operations Call details

  • For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
    • The pass code is provided on the wlcg-operations list.
    • You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic needs to be discussed at the operations meeting and requires information from sites or experiments, it is highly recommended to announce it by email to the wlcg-scod list (at cern.ch), so that the SCOD can make sure that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • remote: Kate (DB, chair), Julia (WLCG), Maarten (WLCG, ALICE), Borja (monitoring), Xavier (KIT), Peter (ATLAS), Raja (LHCb), Steve (storage), Andrew (TRIUMF), Onno (NL-T1), Vincenzo (CNAF), Christoph (CMS), DaveM (FNAL), Ville (NDGF), Marian (networks), Douglas (RAL), Alberto (monitoring), Nils (computing)

Experiments round table:

  • ATLAS reports ( raw view) -
    • is there any news on the SRR fix for dCache? (ATLAS has email threads and there are SRR issues besides dCache)
      • Julia mentioned that the issue is under WLCG observation and the dCache developers are involved
    • what is the expected impact of GÉANT dropping the LHCONE peering to Kurchatov?
      • Maarten reported that at the moment the impact is small thanks to alternative routes; a crisis team, of which the experiment computing coordinators are members, is trying to oversee the evolving situation

  • ALICE
    • NTR

  • LHCb reports ( raw view) -
    • running on 170k cores
    • datasets with a single copy at KIAE have been copied over to CERN (120 TB in total)
    • no major issues to report

Sites / Services round table:

  • ASGC:
  • BNL: NTR
  • CNAF: NTR
  • EGI:
  • FNAL: NTR
  • IN2P3: NTR
  • JINR: NTR
  • KISTI:
  • KIT:
    • All activity involving storage was less reliable and less performant due to the CMS MinBias campaign, which ran over the last two weeks and finished on Friday. Attempts to improve the situation only made things drastically worse, including crashing dCache pools, so we eventually decided to let the campaign run its course. For the future we will have to develop better strategies to avoid side effects on other users whenever a single VO puts extreme load on its storage share.
    • In the middle of last week one ATLAS dCache pool reported I/O errors. At first we relocated the service to a different server, but the errors moved with it, so we concluded that the source must lie in our GPFS storage system. We declared a downtime for the entire atlassrm-kit.gridka.de SE on Thursday in order to run an offline file system check. That check did spot some problems and supposedly fixed them. After we restarted the services, however, the entire data content hosted by that dCache pool had become inaccessible: 2,469,917 files totalling 1,075 TB. To avoid risking further damage we shut down the SE again for the whole weekend and are still busy trying to restore the data. We are in contact with IBM, but at this point we cannot tell how long that will take or how likely success or failure is.
    • All LHCb dCache pools with disk-only files ran into memory exhaustion on Friday morning, at about 3 a.m. CET. At the time, each pool had accumulated a number of xrootd transfers that was ultimately too high to sustain. The good news is that dCache is configured to restart autonomously in such a case, so the service was back online rather quickly for LHCb.

Christoph asked whether the LHCb and ATLAS issues could have been caused by the CMS load. Xavier explained that the dCache instances are separate but the underlying storage is shared; this can explain the performance issues, but not the I/O errors.

  • NDGF: No major issues currently
  • NL-T1: NTR
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: On Tuesday 1 March, about 23k cores of batch capacity went down due to a power issue at Point 8 (ref. OTG:0069491).
  • CERN storage services: NTR
  • CERN databases: NTR
  • GGUS: NTR
  • Monitoring:
    • Distributed draft SiteMon availability/reliability reports for February 2022
  • Middleware: NTR
  • Networks:
    • On Friday at 5 p.m. GMT, GÉANT shut down the peering with KIAE, as per the decision of the GÉANT Board of Directors.
  • Security:

AOB:
