Week of 200224

WLCG Operations Call details

At CERN the meeting room is 513-R-068.

For remote participation we use the Vidyo system. Instructions can be found here.

General Information

The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
The SCOD rota for the next few weeks is at ScodRota
General information about the WLCG Service can be accessed from the Operations Portal
Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information, or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

local: Kate (WLCG, chair), Maarten (WLCG, ALICE), Ben (comp), Renato (LHCb), Vincent (security), Michal (ATLAS), Roberto (storage), Marian (monit, network)
remote: Julia (WLCG), Andrew (NL-T1), Elena (CNAF), Dave (FNAL), Darren (RAL), Christoph (CMS), Andrew (TRIUMF)

Experiments round table:

ATLAS reports -
- Activities:
  - Ongoing reprocessing
- Issues
  - Thursday's incident (OTG:0054944)
    - ATLAS services recovered Thursday evening/Friday morning
    - any updates about causes, followups?
  - unavailable file on CERN-PROD_DERIVED (GGUS:145605)
    - file is on damaged tape
    - file declared lost and will be reproduced
  - "Connection closed" staging errors at NDGF-T1 (GGUS:145702)
  - high deletion rate caused by log removal
    - triggered by decommissioning of sites
    - reached over 700k files per hour
    - did not cause problems, though some sites could get close to full
  - Frontier/Squid infrastructure test with connections to CERN only
    - to test whether we can sustain the same level of jobs with Frontier launchpads at only one site
    - no problems observed so far
  - 24 hour proxy at HTCondor sites
    - the ADC team observed that at some sites which use HTCondor, jobs are being killed after one day because of an expired proxy, even though the original proxy has a 96-hour lifetime
    - email sent to all clouds to check the settings of HTCondor sites
    - later, we came up with the preferred settings:
      - DELEGATE_JOB_GSI_CREDENTIALS = True
      - DELEGATE_JOB_GSI_CREDENTIALS_LIFETIME = 0
      - Ben commented that proxy duration can be controlled using jdl during job submission
  - a lot of “No FTS server has updated the transfer status the last 900 seconds. Probably stalled” errors (INC:2337341)
    - CERN FTS machines were running out of space
    - machines drained and replaced with VMs with double the disk space on Friday
    - but there are still a lot of these failures
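
The preferred HTCondor proxy delegation settings listed in the ATLAS report above could be checked on a site with something like the following. This is only a minimal sketch: `parse_config` and `check_proxy_delegation` are hypothetical helpers, not part of HTCondor, and the parser assumes plain `KEY = VALUE` lines with case-insensitive names rather than full HTCondor configuration syntax.

```python
# Hypothetical helper, not part of HTCondor: compares a configuration snippet
# with the two settings preferred in the ATLAS report.
PREFERRED = {
    "DELEGATE_JOB_GSI_CREDENTIALS": "true",
    "DELEGATE_JOB_GSI_CREDENTIALS_LIFETIME": "0",
}


def parse_config(text):
    """Parse simple 'KEY = VALUE' lines; later assignments override earlier ones."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comment lines, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Assumption: names are treated as case-insensitive.
        settings[key.strip().upper()] = value.strip()
    return settings


def check_proxy_delegation(text):
    """Return a list of human-readable differences from the preferred settings."""
    settings = parse_config(text)
    problems = []
    for key, wanted in PREFERRED.items():
        actual = settings.get(key)
        if actual is None:
            problems.append(f"{key} is not set explicitly (preferred: {wanted})")
        elif actual.lower() != wanted:
            problems.append(f"{key} = {actual} (preferred: {wanted})")
    return problems


if __name__ == "__main__":
    sample = "DELEGATE_JOB_GSI_CREDENTIALS = True\nDELEGATE_JOB_GSI_CREDENTIALS_LIFETIME = 86400\n"
    for problem in check_proxy_delegation(sample):
        print(problem)
```

The text to check would typically come from the site's own configuration files; how that text is collected is left out here on purpose.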

CMS reports -
- Outage from last Thursday (Feb 20th) had impact on many CMS systems
  - Detector commissioning (PromptRECO)
  - Central production
  - CRAB (Grid analysis)
  - Various services used by individuals: TWiki, etc.
- Otherwise no major items

ALICE -
- Mostly business as usual
- Some fallout from the Ceph incident at CERN on Thu

LHCb reports -
- Activity:
  - Stripping campaign ongoing, occupying most of T0/1 capacity (no T2s)
  - Staging for 2016 is almost finished. Staging for 2017 is ongoing.
  - Validating "new" role for SAM/ETF tests.
- Issues:
  - CERN: any news about last week's issues? (No downtime declared?)
  - CNAF: low level of running jobs (GGUS:145692). No reply yet.
  - Renato commented that a downtime for the CERN site would be useful. Ben responded that it was not added due to an oversight and that he is now working on adding a retrospective downtime.

Sites / Services round table:

ASGC: nc
BNL: NTR
CNAF: NTR
EGI: nc
FNAL: NTR
IN2P3: nc
JINR: nc
KISTI: nc
KIT: NTR
NDGF: nc
NL-T1: NTR
NRC-KI: nc
OSG: nc
PIC: nc
RAL: NTR
TRIUMF: NTR

CERN computing services:
- Argus migration March 2nd: direct GLExec affected
  - Christoph confirmed that CMS shouldn't use glexec any more.
- Batch outage due to block storage failure
  - Root cause analysis is ongoing, decision on future improvements will be taken after analysis is completed.
CERN storage services:
- Major Ceph Block Storage incident on Thu 20th due to bit-flips in central metadata (OTG0054949, CEPH-823). The CERN Ceph team is working with Ceph developers in order to understand the underlying issue. Sorry for all inconveniences caused!
- Deployment of the new EOS FUSE (fusex) client on QA centrally managed EOS clients.
  - EOSCMS: Mon Mar 2 OTG0054952
  - EOSATLAS: Tue Mar 10 OTG0055034
CERN databases:
GGUS: NTR
Monitoring:
- ETF LHCb: following last week's discussion, LHCb now evaluating if they could use /lhcb/Role=NULL
MW Officer:
Networks: NTR
Security: NTR

AOB:

Topic revision: r16 - 2020-02-24 - MichalSvatos
