Week of 200323


WLCG Operations Call details

At CERN the meeting room is 513-R-068.

For remote participation we use the Vidyo system. Instructions can be found here.

General Information

The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
The SCOD rota for the next few weeks is at ScodRota.
General information about the WLCG Service can be accessed from the Operations Portal.
Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch. This allows the SCOD to make sure that the relevant parties have time to collect the required information, or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

remote: Kate (WLCG, DB, chair), Julia (WLCG), Maarten (WLCG, ALICE), Eric (storage), Xavier (KIT), Vincent (security), Vladimir (LHCb), Borja (monitoring), Michal (ATLAS), Darren (RAL), Cristi (storage), Christoph (CMS), Andrew (TRIUMF), Christian (NDGF), Dave M (FNAL), David B (IN2P3)

Experiments round table:

ATLAS reports -
- Activities:
  - Ongoing reprocessing
  - SFO->CTA stress test finished on Friday
- Issues
  - Staging errors (“Staging not allowed”) at PIC (GGUS:146082)
    - After the dCache upgrade, no stage requests were allowed through SRM
  - 22 lost RAW files on IN2P3-CC_DATATAPE - recovered
  - 55 lost EVNT files on SARA-MATRIX_MCTAPE - irrecoverable
  - “Performance marker timeouts” in transfers to IN2P3-CC (GGUS:146128)
    - ATLAS, CMS and LHCb have all had heavy transfer activity in recent days, which saturated the 40 Gbps network switch that connects the dCache servers to the LHCOPN network
  - On Saturday, GGUS ticket submission returned an Apache test page (this also happened a few days ago)
    - GGUS support informed by shifter

CMS reports -
- No major issues

ALICE -
- NTR

LHCb reports -
- Activity:
  - Stripping campaign is finished.
  - Staging is finished.
- Issues:
  - NTR

Sites / Services round table:

ASGC: nc
BNL: The dCache downtime scheduled for 25-27 March has been postponed indefinitely.
CNAF: CMS is being migrated from LSF to HTCondor today; ALICE and ATLAS were already migrated last week.
EGI: nc
FNAL: NTR
IN2P3: NTR
JINR: NTR
KISTI: nc
KIT: Reminder of the storage downtime on 1 April, when we have to update the firmware of our storage systems. All storage will be offline, so all VOs will be affected!
NDGF: NTR. Staff in all countries are now working from home. This was already 75% of the model for the rotating ops staff anyway, i.e. with access only to resources in their own country. No impact expected.
NL-T1: NTR
NRC-KI: nc
OSG: nc
PIC: The building's electrical maintenance has been canceled. There will be no downtime on Tuesday 7 April. This intervention will happen much later; no date has been fixed yet.
RAL: RAL is now effectively running in a "working from home" posture. Currently no impact on services, so far business as usual.
TRIUMF: NTR

CERN computing services: nc
CERN storage services:
- EOSALICE downtime on Sunday: OTG:0055546
  - caused by a HW issue similar to the one on Friday 13th: OTG:0055417
  - to prevent this from happening again, the namespace will be configured to run on a different machine: OTG:0055562
- CASTORCMS: slow stager_qry response times since the DB upgrade of 3 Feb 2020. This leads to timeouts in PhEDEx and breaks CMS staging. The DB team is investigating. OTG:0055568
- Client configuration change for the EOS mounts for EOSATLAS and EOSCMS; in production starting today
- EOSLHCB planned upgrade tomorrow at 10:00 OTG:0055312
CERN databases: NTR
GGUS:
- A new release is planned for Wednesday this week
  - Release notes
  - A downtime has been scheduled for 07:30-10:30 UTC
  - Test alarms will be submitted as usual
Monitoring: NTR
MW Officer: nc
Networks: NTR
Security: NTR

AOB:

Topic revision: r24 - 2020-03-23 - JosepFlix

Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.