Week of 210215


WLCG Operations Call details

At CERN the meeting room is 513-R-068.

For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
- The pass code is provided on the wlcg-operations list.
- You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.

General Information

The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
The SCOD rota for the next few weeks is at ScodRota.
Whenever a particular topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch. This allows the SCOD to make sure that the relevant parties have time to collect the required information, or to invite the right people to the meeting.

Best practices for scheduled downtimes


Monday

Attendance:

local:
remote: Alberto (Monitoring), Andrew P (NL-T1), Borja (Chair, Monitoring), Christoph (CMS), Claudia (?), Daniel (Security), Darren (RAL), David (IN2P3-CC), Dave (FNAL), Federico (LHCb), Julia (WLCG), Luis (Computing), Petr (ATLAS), Steven (Storage), Xavier (KIT)

Experiments round table:

ATLAS reports
- Stable production and quick recovery after Friday's DB network incident
- Enforcing CPU core limits (RDataFrame): we are improving our job submission side, but sites also need to configure their local batch systems properly
- HTTP-TPC explicit pull mode configured in Rucio (no more automatic fallback from pull to push with the FTS pilot instance)
- Sharing an FTS instance between VOs: unable to set an active connection limit per VO (ATLAS vs. Belle)

CMS reports
- Outage at CERN from Friday
  - Noticed by individual users (issues with VOMS client, login to web pages ...)
  - CTA at CERN went down
  - No severe impact on central production

ALICE
- Continued very high activity levels.
  - New record on Sunday: 183k concurrently running jobs.
- No major issues.

LHCb reports
- Activity:
  - MC and WG productions; User jobs.
- Issues:
  - CERN: ticket (GGUS:150584) for jobs HELD at CERN
    - ticket was closed - we lost pilots there.
  - IN2P3: Job submission issue (GGUS:150406)
    - was closed
  - GRIDKA: Job submission issue (GGUS:150403)
  - From the above tickets, and from other ones, we have a general issue with how to run on HTCondor-CEs. We don't want to run a local schedd, as this is not something formally requested. As a temporary measure, we have been adding a line to our submission string so that the job (pilot) output is deleted within 24 hours, but this is not ideal and may be fragile. A proper solution will not come easily (it would require development) and is not on the horizon.

Sites / Services round table:

ASGC: NC
BNL: NTR
CNAF: NTR
EGI: NC
FNAL: NTR
IN2P3:
- No access to LHCONE on February 12th between 16:30 and 19:00. Traffic went through the general-purpose link during this interruption, and some transfers with the site as source failed (GGUS:150562). No problem with LHCOPN.
- IN2P3-CC will be in maintenance on March 16th, a Tuesday. As usual details will be available one week before the event.
JINR: NC
KISTI: NC
KIT: On Wednesday, a malfunction on the border routers induced an overload on the central KIT core components, which also affected the GridKa connectivity. The reason for this is understood and we've taken steps to avoid the same issues in the future.
NDGF: Umeå, Sweden: kebnekaise CE down today from 12:00 to 16:00 UTC due to maintenance. Some data may be unavailable tomorrow, Tuesday, between 10:00 and 17:00 UTC due to dCache upgrades at Bergen, Norway.
NL-T1: Nikhef: We are having some issues with our new arc-ce system. There is a time unit mismatch which is causing submissions to fail or jobs to exit early. We also see some output files not being copied back to the arc-ce for later fetching.
NRC-KI: NC
OSG: NC
PIC: NC
RAL: NTR
TRIUMF: NTR (Canadian holiday today, won't be logged on. /Andrew)
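
As an illustration of the kind of time unit mismatch reported by NL-T1 above, the sketch below shows how a wall-time limit given in minutes shrinks by a factor of 60 if the receiving side reads the same number as seconds, which would make jobs exit early. The minutes do not say which field or which units are involved at Nikhef, so the function name, the units and the 48 hour figure are all hypothetical.

```python
# Hypothetical illustration only: names and numbers are NOT taken from the
# Nikhef arc-ce setup, which is not described in the minutes.

def to_seconds(value: int, unit: str) -> int:
    """Convert a time limit to seconds, given an explicit unit."""
    factors = {"s": 1, "min": 60, "h": 3600}
    return value * factors[unit]

# Assume a job requests a 48 hour wall-time limit, expressed in minutes.
requested_minutes = 48 * 60

# Intended handling: the number is converted using the unit the sender meant.
intended_limit_s = to_seconds(requested_minutes, "min")

# Mismatch: the receiver treats the very same number as seconds.
misread_limit_s = to_seconds(requested_minutes, "s")

print(intended_limit_s)                     # 172800 (48 hours)
print(misread_limit_s)                      # 2880 (48 minutes)
print(intended_limit_s // misread_limit_s)  # 60
```

Carrying the unit explicitly alongside the value, as `to_seconds` does here, is one simple way to make such a mismatch visible at the interface between two systems.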

CERN computing services: NTR
CERN storage services:
- The CASTOR service was down from 10:30 to 16:50 on Friday 12 Feb 2021 due to an Oracle database outage
  - OTG:0062187, https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=30186
CERN databases: NC
GGUS: NTR
Monitoring:
- Migration of SiteMon job to CRIC scheduled for Thursday OTG:0062075
  - Edit: Migration delayed by one week due to some CRIC inconsistencies found
- Some flows stopped working on Friday due to DB outage, still evaluating full impact and recovering them
Middleware:
- Nordugrid ARC 6.10.1 has been released today:
  - It fixes the privacy issue in the ARC CE information system reported in Dec.
  - Please upgrade at your earliest convenience.
Networks: NTR
Security: New vulnerability in "screen": it is possible to crash a screen session with UTF-8 characters. There is currently no exploit in the wild. (CVE-2021-26937 - https://access.redhat.com/security/cve/cve-2021-26937)

AOB:

Topic revision: r20 - 2021-02-17 - BorjaGarridoBear
