Week of 231030

WLCG Operations Call details
General Information
Best practices for scheduled downtimes
Monday

WLCG Operations Call details

The connection details for remote participation are provided on this agenda page.

General Information

The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
The SCOD rota for the next few weeks is at ScodRota.
Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to the wlcg-scod list (at cern.ch). This allows the SCOD to make sure that the relevant parties have time to collect the required information, or to invite the right people to the meeting.

Best practices for scheduled downtimes

Best practices for scheduled downtimes

Monday

Attendance:

local: -
remote: Andrew (TRIUMF), Andrew P (NL-T1), Borja (Chair, Monitoring), Brian (RAL), David B (IN2P3), David M (FNAL), Doug (BNL), Henryk (LHCb), Ivan (ATLAS), Julia (WLCG), Maria (ATLAS), Panos (WLCG, CMS), Stephan (CMS), Steven M (Storage), Vincenzo (CNAF), Xavier (KIT)

Experiments round table:

ATLAS reports
- NTR

CMS reports
- NTR

ALICE
- Medium average load, no major issues

LHCb reports
- Problem with storage at SARA since Friday afternoon (ticket)
  - Disrupted staging, MC and analysis jobs during the whole weekend
  - No contact from the site until Monday morning
  - Issue solved before noon
- Problem with data staging at PIC
  - Many failed FTS transfers
  - Low transfer speed (10-100 MB/s)
- Low level of running jobs (not enough work to do)
- We are preparing for the replication of ion data

Sites / Services round table:

ASGC: NC
BNL: Planned dCache outage for migration to new servers on Tuesday 31-Oct-23 (Halloween), 09:00 - 18:00 EDT (14:00 - 22:00 UTC)
CNAF: NTR
EGI: NC
FNAL: NTR
- Julia: Please check accounting reports
- David: Done just a few moments ago, sorry for the delay
IN2P3: NTR
JINR: NC
KISTI: NC
KIT:
- Last week's update of the NXOS on storage switches (GOCDB #34615) did not go as smoothly as one could hope. At some point the main database was not reachable for any of our dCache instances. The issues were quickly resolved once they were discovered. Our vendor has been involved in the incident analysis and the downtime activities were aborted. That means we still have to apply updates to the remaining switches at some later date.
- This morning we applied some widespread changes to the CMS dCache namespace regarding owner uid and access mode settings (GOCDB #34663). This is one step towards enforcing the correct privileges for CMS users going forward. As far as we can tell, there were no (lasting) issues with this.
NDGF: NC
NL-T1:
- Surfsara's dCache authentication module (gPlazma) broke because of an empty LSC file. The problem started on Friday and was fixed this morning.
  - Henryk: Could you check why there was no answer to the ticket until Monday?
  - Maarten: SARA works on a best-effort basis, which means support outside working hours is not guaranteed for non-alarm tickets
  - Andrew: It's very unlikely that tickets arriving on Friday after 3 PM will be followed up, as other work might be in progress
  - Maarten: The suggestion is to raise an alarm ticket if needed
- Nikhef's ARC CEs were upgraded to the latest UMD4 at the end of last week. However, this update contained an ARC version that resulted in a broken information provider. Two of our ARC CEs, brug.nikhef.nl and dissel.nikhef.nl, have been downgraded and are working again. The third, klomp.nikhef.nl, was not downgraded and has been marked as in downtime (GOCDB #34666).
NRC-KI: NC
OSG: NC
PIC: NC
RAL:
- Partial success for the power intervention on our whole data-centre (GOCDB downtime 34592)
  - Power testing completed successfully
  - Federated Services and Tape services back in production within schedule.
  - Echo storage services delayed and still ongoing as of this meeting (this required an unscheduled new downtime/extension).
  - Batch services ready to go when storage is available
TRIUMF: NTR

CERN computing services: NTR
CERN storage services: NTR
CERN databases: NC
GGUS: NTR
Monitoring:
- WLCG sitemon reports for September were resent as requested by CMS
- Applied requested recomputation for KIT for ALICE and LHCb
Middleware: NTR
Networks: NC
Security: NC

AOB:

Change of the twiki SSO?
- Stephan: I noticed today for the first time that saving the twiki redirects me to the new SSO, and I got a failure while saving.
- Maarten: This should already have been the case for a few weeks. More people have reported the issue when saving; it is catalogued as intermittent and the workaround is to "go back and save again". If this becomes an issue we should report it again the next time it happens.

Topic revision: r16 - 2023-10-30 - BorjaGarridoBear

Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.