Week of 220307

WLCG Operations Call details

  • For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
    • The pass code is provided on the wlcg-operations list.
    • You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic needs to be discussed at the operations meeting and requires information from sites or experiments, it is highly recommended to announce it by email to the wlcg-scod list (at cern.ch), so that the SCOD can make sure that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • remote: Kate (DB, chair), Julia (WLCG), Maarten (WLCG, ALICE), Borja (monitoring), Xavier (KIT), Peter (ATLAS), Raja (LHCb), Steve (storage), Andrew (TRIUMF), Onno (NL-T1), Vincenzo (CNAF), Christoph (CMS), DaveM (FNAL), Ville (NDGF), Marian (networks), Douglas (RAL), Alberto (monitoring), Nils (computing)

Experiments round table:

  • ATLAS reports ( raw view) -
    • is there any news on the SRR fix for dCache? (ATLAS has email threads and there are SRR issues besides dCache)
      • Julia mentioned that the issue is under WLCG observation and the dCache developers are involved
    • what is the expected impact of GÉANT dropping the LHCONE peering to Kurchatov?
      • Maarten reported that at the moment the impact is small thanks to alternative routes; a crisis team, of which the experiment computing coordinators are members, is trying to oversee the evolving situation

  • ALICE
    • NTR

  • LHCb reports ( raw view) -
    • running on 170k cores
    • datasets with a single copy at KIAE have been copied over to CERN (120 TB in total)
    • no major issues to report

Sites / Services round table:

  • ASGC:
  • BNL: NTR
  • CNAF: NTR
  • EGI:
  • FNAL: NTR
  • IN2P3: NTR
  • JINR: NTR
  • KISTI:
  • KIT:
    • All activity involving storage was less reliable and less performant due to the CMS MinBias campaign, which ran over the last two weeks and finished on Friday. Attempts to improve the situation only made things drastically worse, including crashing dCache pools, so we eventually decided to let the campaign run its course. For the future we will have to develop better strategies to avoid side effects on other users whenever a single VO puts extreme load on its storage share.
    • In the middle of last week one ATLAS dCache pool reported I/O errors. At first we relocated the service to a different server, but the errors moved with it, so we concluded that the source must lie in our GPFS storage system. We declared a downtime for the entire atlassrm-kit.gridka.de SE on Thursday in order to run an offline file system check. That check did spot some problems and supposedly fixed them. After we restarted the services, however, the entire data content hosted by that dCache pool had become inaccessible: 2,469,917 files totalling 1,075 TB. To avoid risking further damage we shut down the SE again for the whole weekend and are still busy trying to restore the data. We are in contact with IBM, but at this point we cannot tell how long that will take or how likely success or failure is.
    • All LHCb dCache pools with disk-only files ran into memory exhaustion on Friday morning, at about 3 a.m. CET. At the time, each pool had accumulated a number of xrootd transfers that was ultimately too high to sustain. The good news is that dCache is configured to restart autonomously in such a case, so the service was back online rather quickly for LHCb.

Christoph asked whether the LHCb and ATLAS issues could have been caused by the CMS load. Xavier explained that the dCache instances are separate but the underlying storage is shared; this can explain the performance issues, but not the I/O errors.

  • NDGF: No major issues currently
  • NL-T1: NTR
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: On Tuesday 1 March, about 23k cores of batch capacity went down due to a power issue at Point 8 (ref. OTG:0069491).
  • CERN storage services: NTR
  • CERN databases: NTR
  • GGUS: NTR
  • Monitoring:
    • Distributed draft SiteMon availability/reliability reports for February 2022
  • Middleware: NTR
  • Networks:
    • On Friday at 5 p.m. GMT, GÉANT shut down the peering with KIAE, as per the decision of the GÉANT Board of Directors.
  • Security:

AOB:
