Week of 200217

WLCG Operations Call details

At CERN the meeting room is 513-R-068.

For remote participation we use the Vidyo system. Instructions can be found here.

General Information

The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
The SCOD rota for the next few weeks is at ScodRota
General information about the WLCG Service can be accessed from the Operations Portal
Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-scod@cern.ch to allow the SCOD to make sure that the relevant parties have the time to collect the required information, or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

local: Kate (WLCG, chair), Julia (WLCG), Maarten (WLCG, ALICE), Michal (ATLAS), Olga (computing), Marian (network, monitoring), Remy (storage), Nicolo (storage)
remote: Andrew (NL-T1), Jens (NDGF), Vladimir (LHCb), Darren (NDGF), Elena (CNAF), Christoph (CMS), Dave M (FNAL), Sang Un (KISTI)

Experiments round table:

ATLAS reports -
- Activities:
  - ATLAS S&C Week
  - Ongoing reprocessing (will stage new data week)
- Issues:
  - ATLAS dashboards showing no data between Wednesday evening and Thursday morning (RQF:1527003)
    - The monitoring project on HDFS ran out of quota for the "number of files" (OTG:0054839)
  - A few not fully filled bins in the running jobs plots (RQF:1513867, RQF:1529407)
    - The InfluxDB database had been unresponsive since last night. The database is operational again and the holes have been recovered.
  - atlas-in2p3-cc-frontier degraded (GGUS:145575)
  - Pilots failing at IN2P3-CC - mistake in the HTCondor-CE config
  - Pilots at BNL_PROD_UCORE are failing with "atlas software repository NOT found" (GGUS:145539, GGUS:145529) - a few broken WNs
  - Transfer timeouts from FZK-LCG2_MCTAPE (GGUS:145564)

CMS reports -
- No major problems to report

ALICE -
- NTR

LHCb reports -
- Activity:
  - Stripping campaign ongoing, occupying most of T0/1 capacity (no T2s)
  - Staging for 2016 is almost finished. Staging for 2017 is ongoing.
- Issues:
  - no significant issues

Sites / Services round table:

ASGC: nc
BNL: nc
CNAF: NTR
EGI: nc
FNAL: dCache upgrades finished for both disk and tape
IN2P3: IN2P3-CC will be in maintenance on March 17th, a Tuesday. As usual details will be available one week before the event. CEs and SEs are foreseen to be in downtime for the whole day.
JINR: NTR
KISTI: NTR
KIT: NTR
NDGF: NTR
NL-T1: NTR
NRC-KI: nc
OSG: nc
PIC: Tomorrow, Tuesday 18 February, we will reconfigure PIC's firewall. The intervention should be transparent for users.
RAL: nc
TRIUMF: (Feb17 is a holiday here)
- TRIUMF_SIM had no jobs running for about 1 day: worker nodes (3K cores) located at TRIUMF were inadvertently using legacy configurations for the CVMFS cache squid servers.
- Three tape cartridges had media errors; all 324 files were recovered.

CERN computing services: NTR
CERN storage services: NTR
CERN databases: NTR
GGUS: NTR
Monitoring:
- Final reports for the January availability sent around
- ETF LHCb/ALICE: GGUS:145475 - glite-ce clients broken in latest UMD4; glite-ce and condor clients coexistence appears challenging going forward
- ETF LHCb: change of VOMS role used in testing is planned (to /lhcb/Role=samtest)
MW Officer: nc
Networks: EELA-UTFSM and CBFP networking issues were resolved
Security: nc

AOB: Christoph asked why LHCb is planning to change the VOMS role. Marian explained that the change is intended to avoid using the production credential in testing. Maarten will check whether such a high-effort change is necessary.

Topic revision: r19 - 2020-02-17 - JosepFlix
