Week of 211018

WLCG Operations Call details

  • For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
    • The pass code is provided on the wlcg-operations list.
    • You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local:
  • remote: Miro (Chair), Xavier (KIT), Maarten (ALICE), Julia (WLCG), Kate (DB), Andrew (NL-T1), Pinja (Security), Ville (NDGF), Jiri (ATLAS), Henryk (LHCb), Andrew (TRIUMF), Borja (Monitoring), David (FNAL), Alberto (Monitoring), Nils (Compute), Christoph (CMS), Chien-De (ASGC), Vincenzo (CNAF), Marian (Network), Douglas (BNL)

Experiments round table:

  • CMS reports (raw view) -
    • (Partial) network outage at CERN on Friday late afternoon (Oct 15th): OTG:0066817
      • Several CMS services affected, in particular the CMS web services
      • Main issue: voms-admin clients failing (voms-proxy-init kept working though): INC:2952661

  • ALICE
    • Some periods of low activity due to lack of ready productions
    • No major issues
    • The tape challenge went fine Mon-Fri last week: plot

  • LHCb reports (raw view) -
    • Activity:
      • Tape data challenge (ongoing, ~10 GB/s throughput achieved)
    • Issues:
      • Network issue at CERN on Friday, almost all VO-boxes affected

Sites / Services round table:

  • ASGC: The site will be in downtime on Nov 7 for international circuit re-cabling.
  • BNL: Tape downtime from 04:00 to 20:00 on Oct 19. Over the past week BNL has been upgrading the ATLAS nodes to Condor 9.0.6.
  • CNAF:
  • EGI: NC
  • FNAL: NTR
  • IN2P3: NTR
  • JINR: NC
  • KISTI: NC
  • KIT: NTR
  • NDGF: Some sites were affected by the CERN network outage.
  • NL-T1:
    • Nikhef: After our network outages on October 6th and 8th, our ARC CEs were very slow to update their BDII information. This was due to a combination of a large number of dead jobs left over after the network problems and slow code in the CEinfo.pl script that ARC uses: for each batch queue the script loops over all batch system jobs to try to match them with jobs known to the CE. After the network issues we had on the order of 30,000 jobs known to each CE; each loop over the batch jobs took around 5 minutes per queue, which added up to around 1 hour for all our queues (a sketch follows after this list). After removing the jobs from the CEs that were no longer in the batch system, the update time eventually dropped to a couple of minutes in total. We also plan to follow up with the ARC developers with a patch to speed up the slow code.
    • SURF: Tape is currently unavailable. We are looking into it. A downtime has been added in the GOCDB.
  • NRC-KI: NC
  • OSG: NC
  • PIC: NC
  • RAL: NTR
  • TRIUMF: NTR
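
The following minimal Python sketch illustrates the Nikhef BDII report above. It is not the actual CEinfo.pl Perl code: the job dictionaries and field names are hypothetical, and it only contrasts the per-queue linear scan over ~30,000 CE jobs with an indexed lookup built once.

    # Sketch of the matching pattern described by Nikhef (hypothetical data
    # layout, not the real CEinfo.pl code). For Q queues, N batch jobs and
    # M CE jobs, the scan version does O(Q * N * M) comparisons, while the
    # indexed version does O(M + Q * N) work.

    def match_scan(queues, batch_jobs, ce_jobs):
        """Per queue, loop over all batch jobs and find the corresponding
        CE job with a linear search (slow when M is ~30,000)."""
        matched = {}
        for queue in queues:
            for bj in batch_jobs:
                if bj["queue"] != queue:
                    continue
                for cj in ce_jobs:  # linear scan repeated for every batch job
                    if cj["id"] == bj["id"]:
                        matched[bj["id"]] = cj
                        break
        return matched

    def match_indexed(queues, batch_jobs, ce_jobs):
        """Index the CE jobs by ID once; each match is then a dict lookup."""
        ce_index = {cj["id"]: cj for cj in ce_jobs}  # built once
        wanted = set(queues)
        return {
            bj["id"]: ce_index[bj["id"]]
            for bj in batch_jobs
            if bj["queue"] in wanted and bj["id"] in ce_index
        }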

  • CERN computing services:
    • Following the major network outage in the CERN computer centre late on Friday afternoon (OTG:0066817), batch and grid services were affected until around 20:00 (OTG:0066829). Furthermore, the VOMS admin application was unavailable during the weekend due to a database connectivity problem after the network incident (OTG:0066864).
    • The Condor-CE grid job router now also supports GPU resources (GGUS:154316).
    • The upgrade of the CMS and ATLAS IAM instances from OpenShift 3 to OpenShift 4 is planned for Monday 25th (OTG:0066928).
  • CERN storage services:
    • FTS service partially affected by network incidents
  • CERN databases:
    • Both the DBoD services and Oracle were affected by the major network outage.
    • The VOMS database crashed and was restarted. After the restart, a database parameter that had been modified in the past but not set persistently caused high load, impairing the lbtrans database.
    • The CMSR database listener was restarted but did not register all services properly, causing issues for Rucio.
    • The CMSONR ADG was restarted, but the Frontier service did not start properly and had to be restarted manually.
  • GGUS: NTR
  • Monitoring:
    • Distributed final SiteMon availability/reliability reports for September 2021
    • ETF impacted by the CERN DC network outage (OTG:0066841)
  • Middleware: NTR
  • Networks:
  • Security: NTR

AOB:
