Week of 210802

WLCG Operations Call details

  • For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
    • The pass code is provided on the wlcg-operations list.
    • You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-scod@cern.ch to allow the SCOD to make sure that the relevant parties have the time to collect the required information, or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local:
  • remote: Petr (ATLAS), Mark (LHCb), Christoph (CMS), Julia (WLCG Ops, chair), Pinja (security), Xavier (KIT), Borja (Monit), Andrew (TRIUMF), Chien-De (ASGC), David B. (IN2P3), David M. (FNAL), Doug (BNL), Joao (Storage), Onno (NL-T1), Pepe (PIC), Elena (CNAF), Moore (RAL)

Experiments round table:

  • ATLAS reports
    • All ATLAS StoRM sites are asked to increase the STORM_WEBDAV_TPC_TIMEOUT_IN_SECS timeout to 30s (GGUS:153084)
    • PIC: slow transfers at 200kb/s
      • Spanish NREN-Geant 10Gb segment is saturated
      • large file transfers are often terminated by the 2h dCache timeout
        • repeated retries and failures make the situation unstable
        • ATLAS FTS concurrent transfers are now limited for PIC endpoints
    • IPv6 enabled at RRC-KI-T1; works fine except for SRM+GridFTP transfers with FZK (GGUS:153313)

  • CMS reports
    • Efforts to better control disk space usage in the 'unmerged' area continue

  • ALICE
    • Probably nobody will be able to connect today.
    • No major issues, at least until this morning.
    • Low activity occasionally due to lack of ready productions.

  • LHCb reports
    • Activity:
      • MC and WG productions; Some Staging; User jobs.
    • Issues:
      • NTR
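
As a rough illustration of the PIC item in the ATLAS report above: if the observed rate and the 2h dCache timeout both apply to a single transfer, the largest file that can complete is simply rate times timeout. The sketch below assumes that "200kb/s" means 200 kilobytes per second (the minutes do not say whether bits or bytes are meant) and uses decimal units; the function name is hypothetical.

```python
def max_bytes_before_timeout(rate_bytes_per_s, timeout_s):
    """Largest transfer size (in bytes) that can finish before the timeout."""
    return rate_bytes_per_s * timeout_s


# Assumption: 200 kB/s (kilobytes, decimal), taken from the "200kb/s" in the report.
rate_bytes_per_s = 200 * 1000
# 2 h timeout, as quoted for dCache in the report.
timeout_s = 2 * 3600

limit = max_bytes_before_timeout(rate_bytes_per_s, timeout_s)
print(f"{limit / 1e9:.2f} GB")  # 1.44 GB
```

Under these assumptions a file larger than about 1.44 GB cannot finish within the timeout, which would be consistent with large files being the ones that fail and are retried. If the rate is in kilobits per second instead, the limit is eight times smaller.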

Sites / Services round table:

  • ASGC: Jobs failed last Wednesday due to a misconfigured quota token on DPM
  • BNL: HPSS downtime has begun. A "feature" in CRIC mistakenly blacklisted all storage at BNL; disk-only Rucio endpoints were restored manually. 5-Aug-21 22:00 UTC - 6-Aug-21 01:00 UTC: reduced WAN network capacity while the main ESnet 200 Gbps WAN circuit is moved to a new router on the BNL site from NYC.
  • CNAF: NTR
  • EGI:
  • FNAL:
  • IN2P3: NTR
  • JINR: NTR
  • KISTI:
  • KIT:
    • Updated dCache SEs for ATLAS, CMS and LHCb last week to version 6.2. Issues for ATLAS and LHCb with "restarted transfer manager" (GGUS:153179, GGUS:153301) with unclear cause. For now resolved by reducing redundancy for dCache's central services.
    • Updated all Condor CEs yesterday for security fixes.
  • NDGF:
  • NL-T1: A robot in one of the tape libraries at SURF has died. It will be replaced tomorrow. Until then some files can't be retrieved from tape. https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=31004
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL: NTR
  • TRIUMF: Updating HTCondor to 8.8.15 on Tuesday Aug 3 at 17:00 UTC. GOCDB outage: https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=31002

  • CERN computing services: Batch capacity is reduced by at most 20% per week in order to upgrade the kernel. For details see https://cern.service-now.com/service-portal?id=outage&n=OTG0065291
  • CERN storage services: NTR
  • CERN databases:
  • GGUS: NTR, at least until this morning
  • Monitoring:
    • Distributed draft SiteMon availability/reliability reports for July 2021
  • Middleware: NTR, at least until this morning
  • Networks:
  • Security: An advisory for a critical vulnerability was sent on 28.7.; in addition, a high-risk advisory was sent. Check your emails and update/mitigate if these are applicable.

AOB:

Topic revision: r19 - 2021-08-02 - NikolayVoytishin