Week of 210215
WLCG Operations Call details
- For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
- The pass code is provided on the wlcg-operations list.
- You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch. This allows the SCOD to make sure that the relevant parties have time to collect the required information, or to invite the right people to the meeting.
Best practices for scheduled downtimes
Monday
Attendance:
- local:
- remote: Alberto (Monitoring), Andrew P (NL-T1), Borja (Chair, Monitoring), Christoph (CMS), Claudia (?), Daniel (Security), Darren (RAL), David (IN2P3-CC), Dave (FNAL), Federico (LHCb), Julia (WLCG), Luis (Computing), Petr (ATLAS), Steven (Storage), Xavier (KIT)
Experiments round table:
- ATLAS reports (raw view) -
- Stable production and quick recovery after Friday DB network incident
- Enforcing CPU core limits (RDataFrame): to be improved on our job submission side, but sites also need to configure their local batch systems properly
- HTTP-TPC explicit pull mode configured in Rucio (no more automatic fallback between pull > push with FTS pilot instance)
- Sharing an FTS instance between VOs: unable to set an active connection limit per VO (ATLAS vs. Belle)
- CMS reports (raw view) -
- Outage at CERN from Friday
- Noticed by individual users (issues with VOMS client, login to web pages ...)
- CTA at CERN went down
- No severe impact on central production
- ALICE
- Continued very high activity levels.
- New record on Sunday: 183k concurrently running jobs.
- No major issues.
- LHCb reports (raw view) -
- Activity:
- MC and WG productions; User jobs.
- Issues:
- CERN: ticket (GGUS:150584) for jobs HELD at CERN
- ticket was closed - we lost pilots there.
- IN2P3: Job submission issue (GGUS:150406)
- GRIDKA: Job submission issue (GGUS:150403)
- From the above tickets, and from other ones: we have a general issue with how to run on HTCondor-CEs. We don't want to run a local schedd, as this is not something formally requested. As a temporary measure, we have been adding a line to our submission string so that the job (pilot) output is deleted within the next 24 hours, but this may be fragile and is not ideal. A proper solution will not come easily (it would require development) and is not on the horizon.
Sites / Services round table:
- ASGC: NC
- BNL: NTR
- CNAF: NTR
- EGI: NC
- FNAL: NTR
- IN2P3:
- No access to LHCONE on February 12th between 16:30 and 19:00. Traffic went through the general-purpose link during this interruption and some transfers with IN2P3 as source failed (GGUS:150562). No problem with LHCOPN.
- IN2P3-CC will be in maintenance on March 16th, a Tuesday. As usual details will be available one week before the event.
- JINR: NC
- KISTI: NC
- KIT: On Wednesday, a malfunction on the border routers induced an overload on the central KIT core components, which also affected the GridKa connectivity. The reason for this is understood and we've taken steps to avoid the same issues in the future.
- NDGF: Umeå, Sweden: kebnekaise CE down today from 12:00 to 16:00 UTC due to maintenance. Some data may be unavailable tomorrow, Tuesday, between 10:00 and 17:00 UTC due to dCache upgrades at Bergen, Norway.
- NL-T1: Nikhef: We are having some issues with our new ARC-CE system. There is a time unit mismatch which is causing submissions to fail or jobs to exit early. We also see some output files not being copied back to the ARC-CE for later fetching.
- NRC-KI: NC
- OSG: NC
- PIC: NC
- RAL: NTR
- TRIUMF: NTR (Canadian holiday today, won't be logged on. /Andrew)
- CERN computing services: NTR
- CERN storage services:
- The CASTOR service was down from 10:30 to 16:50 on Friday 12 Feb 2021 due to an Oracle database outage
- CERN databases: NC
- GGUS: NTR
- Monitoring:
- Migration of SiteMon job to CRIC scheduled for Thursday (OTG:0062075)
- Edit: Migration delayed by one week due to some CRIC inconsistencies found
- Some flows stopped working on Friday due to the DB outage; we are still evaluating the full impact and recovering them
- Middleware:
- Nordugrid ARC 6.10.1 has been released today:
- It fixes the privacy issue in the ARC CE information system reported in Dec.
- Please upgrade at your earliest convenience.
- Networks: NTR
- Security: New vulnerability in "screen": it is possible to crash a screen session with UTF-8 characters. There is currently no exploit in the wild. (CVE-2021-26937 - https://access.redhat.com/security/cve/cve-2021-26937)
AOB: