Week of 210726

WLCG Operations Call details

  • For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
    • The pass code is provided on the wlcg-operations list.
    • You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • remote: Alberto (Monit), Andrew (TRIUMF), Borja (Chair, Monit), Chien-De (ASGC), Christoph (CMS), David (IN2P3), David (FNAL), Doug (BNL), Joao (Storage), Luis (Computing), Maarten (ALICE), Mark (LHCb), Onno (NL-T1), Pepe (PIC), Petr (ATLAS), Pinja (Security), Vincenzo (CNAF), Xavier (KIT)

Experiments round table:

  • ATLAS reports (raw view) -
    • FTS ATLAS Internal Server Error - alarm GGUS:153152, transfers moved temporarily to BNL FTS GGUS:153153
    • Improved transfer efficiency to INFN-T1 from busy dCache sites by changing default 5s StoRM timeout STORM_WEBDAV_TPC_TIMEOUT_IN_SECS GGUS:153084
    • New major Rucio release 1.26 deployed

  • CMS reports (raw view) -
    • Disk space usage in the 'unmerged' area is still (very) high
      • Present workflow mix is very demanding for that
      • New remote cleaning procedure is being commissioned (and therefore less effective)

  • ALICE
    • NTR

  • LHCb reports (raw view) -
    • Activity:
      • MC and WG productions; User jobs.
      • A large stripping campaign will start soon; it will involve significant staging from everywhere
    • Issues:
      • CERN: Issues with FTS transfers over the last week. CTA admins investigating. (GGUS:153132)

Sites / Services round table:

  • ASGC: NTR
  • BNL: Tape system update (Downtime 2-Aug 12:00 UTC to 6-Aug 00:00 UTC)
  • CNAF: Due to a severe security issue recently discovered in the Linux kernel, we are upgrading our computing infrastructure. As a result, jobs that last more than 3 days are not guaranteed to complete gracefully
  • EGI: NC
  • FNAL: NTR
  • IN2P3: NTR
  • JINR: NC
  • KISTI: NC
  • KIT: Going to update dCache, GPFS and OS versions for the ATLAS (26th, GOCDB #30969), CMS (27th, GOCDB #30968) and LHCb (29th, GOCDB #30967) SEs. The announcements were accidentally posted on the pre-prod GOCDB instance, which is why the downtimes in (production) GOCDB were added only today (and are therefore considered unscheduled).
  • NDGF: NC
  • NL-T1:
    • Nikhef will be updating the nfs server that holds the home directories for our grid jobs on Wednesday between 10am and midday. We'll be in downtime for the update (https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=30965). We've also started draining the system of jobs to ensure nothing is running while the nfs server is down.
    • This morning Surfsara restarted all WebDAV doors to increase the idle timeout (webdav.limits.idle-time and pool.mover.http.timeout.idle), which is needed for checksum calculation after uploading large files (> 150 GB). The idle timeout is now 2 hours which should work for files up to ~4TB.
  • NRC-KI: NC
  • OSG: NC
  • PIC: rebooting Worker Nodes this week to apply the patch for the announced security incident.
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: NTR
  • CERN storage services:
    • FTS incident already mentioned in the ATLAS report
  • CERN databases: NC
  • GGUS:
    • A new release is planned for Wed this week
      • Release notes
      • A downtime from 06:00 to 09:00 UTC will be announced via the usual channels
      • Test alarms will be submitted as usual
  • Monitoring: NTR
  • Middleware: NTR
  • Networks: NC
  • Security: NTR
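
A note on the NL-T1 report above: it relates the WebDAV idle timeout to the largest file whose checksum can be calculated after upload (2 hours for files up to ~4 TB). The sketch below makes that arithmetic explicit. It assumes a constant checksum throughput, derived only from those two quoted figures, and decimal units (1 TB = 10^12 bytes); the function and constant names are hypothetical, and real throughput depends on the pool hardware and load.

```python
# Rough sizing of an idle timeout against post-upload checksum time.
# Assumption: checksum throughput is constant and equals the rate implied
# by the report's figures (a 2 hour timeout covers files up to ~4 TB).

GB = 10**9   # decimal gigabyte in bytes (assumption)
TB = 10**12  # decimal terabyte in bytes (assumption)

# Implied rate in bytes per second, roughly 556 MB/s.
IMPLIED_RATE_BYTES_PER_S = 4 * TB / (2 * 3600)


def checksum_seconds(file_bytes, rate=IMPLIED_RATE_BYTES_PER_S):
    """Estimated time in seconds to checksum a file of the given size."""
    return file_bytes / rate


def max_file_bytes(idle_timeout_s, rate=IMPLIED_RATE_BYTES_PER_S):
    """Largest file size in bytes whose checksum fits in the idle timeout."""
    return idle_timeout_s * rate


if __name__ == "__main__":
    for size in (150 * GB, 1 * TB, 4 * TB):
        print(f"{size / TB:.2f} TB -> {checksum_seconds(size) / 60:.1f} min")
```

Under these assumptions a 150 GB file would need about 270 seconds of checksum time, so the threshold at which the previous timeout became a problem depends on what that timeout was, which the report does not state.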

AOB:

  • Maarten: Security patches for the kernel require rebooting worker nodes
    • Xavier: We are waiting for SL repositories to be updated and will look into a mitigation
    • Maarten: The situation currently is not critical and those repositories are expected to be updated very soon
  • Maarten: First semi-official announcement: Data Challenge related tests are foreseen during the first two weeks of October
    • The tests will affect T1 sites and be of two types: network throughput and storage (mainly tape, but not only)
    • More information will be provided in the upcoming weeks
Topic revision: r18 - 2021-07-26 - MaartenLitmaath
 