Week of 220110

WLCG Operations Call details

  • For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
    • The pass code is provided on the wlcg-operations list.
    • You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to the wlcg-scod list (at cern.ch) to allow the SCOD to make sure that the relevant parties have the time to collect the required information, or invite the right people at the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local:
  • remote: Miro (Chair, DB), Xavier (KIT), Borja (Monitoring), Maarten (ALICE), Alberto (Monitoring), David (IN2P3), Daniele (CNAF), Douglas (BNL), Luis (Compute), Steven (Storage), Concezio (LHCb), Andrew (TRIUMF), Christoph (CMS), Petr (ATLAS), Darren (RAL), Jose (PIC), David (FNAL), Mihai (Storage), Ville (NDGF)

Experiments round table:

  • ATLAS reports
    • smooth operation during Xmas
      • Rucio service degradation on the last day of 2021 - some signs suggest a short database issue, after which the Rucio services did not recover automatically
    • RAL deployed new WebDAV external gateways last week GGUS:155410
      • multihop transfers to TAPE / CASTOR affected by FTS-1749 (unable to remove bad intermediate file)
      • recent transfer backlog from RAL disappeared but we should still tune / increase FTS active transfer limit for WebDAV
    • plan to enable ATLAS IAM VOMS globally next week
      • open question: which service to use for SNOW tickets?
      • Comment from Maarten: IAM will be added properly to critical services (with priority). Maarten will follow up to expose the proper FE and announce it.
      • Comment from Luis: There is already a dedicated FE available for IAM.
      • After the meeting: the FE is WLCG-IAM and is expected soon to be included in the set of CERN SNow support units.

  • CMS reports
    • A good year 2022 for everybody!
    • Good CPU usage during the holiday break
    • No major problems

  • ALICE
    • Best wishes for 2022 !
    • Normal activity levels on average in the last 3 weeks.
      • No major issues.
      • Thanks to the sites!

  • LHCb reports
    • smooth operations during end-of-year break
      • reprocessing (stripping) of 2016 proton collision data successfully completed in 14 days (ending December 31st)
      • a couple of ALARM tickets at RAL: GGUS:155398 and GGUS:155440
    • moving to SRM+https for tape at all sites
    • long-standing ticket at NL-T1 (SARA) GGUS:153653 (network socket timeouts)

Sites / Services round table:

  • ASGC: NC
  • BNL
    • Happy 2022 to all. Over the year-end break we used the time to upgrade the tape system (HPSS Core), update all HTCondor CEs, and migrate machines between the old data center and the new one. We request more details on the upcoming Tape challenge. The new tape libraries are currently connected to the test core.
  • CNAF: NTR
  • EGI: NC
  • FNAL
    • The transatlantic link was severely degraded for about 10 days, Dec 27 - Jan 5. We believe this required some physical work at CERN that could not be done until people returned on site on Jan 5. Outgoing transfers FNAL --> Europe were saturated at 20 Gbit/s (tertiary bandwidth; the primary and secondary links were down). We are following up on an informal post-mortem on our side; the ESNet ticket ID is ESNET-20211227-006.
  • IN2P3: NTR. Best wishes for 2022 !
  • JINR: NC
  • KISTI: NC
  • KIT
    • No issues over the year-end break. However, last Thursday one server hosting an ATLAS dCache pool started rebooting spontaneously due to ECC errors. We moved the pool to a different server for now.
  • NDGF
    • Some more space at the ATLASDATADISK at NDGF. Happy new year!
  • NL-T1: NC
  • NRC-KI: NC
  • OSG: NC
  • PIC
    • Happy new year! We are upgrading the new IBM tape library, increasing its capacity by installing more expansion frames. These are not yet ready; technicians are fixing some problems with the installation. The rest of the library is fully operational and is being used by the experiments.
  • RAL
    • This week RAL will be performing a major upgrade of the Tape Robotics behind Castor (and Antares). The downtime is from 07:00 UTC on Wednesday to 13:00 UTC on Friday. During this time, the Tape Robot will be expanded with additional frames, and 16 LTO-9 drives will be added along with ~20 PB of media. Camera systems will also be installed to enable better remote debugging of future hardware problems. The hardware installation itself should finish on Wednesday, with validation happening overnight. The declared downtime is currently a slightly conservative best estimate; by Thursday morning, however, we should be in a very good position to inform VOs whether the downtime will be changed. If all has gone well with the hardware upgrade, we may be able to finish early. We apologise for the relatively short notice: it has been very difficult to schedule the delivery of the hardware this year, and we felt that getting the upgrade done as soon as possible was the best way of minimising the impact on the VOs.
  • TRIUMF: NTR

  • CERN computing services: NTR
  • CERN storage services: NTR
  • CERN databases:
    • CMSR DB: 30 min downtime plus rolling/at-risk intervention (raising the "compatible" setting to version 19.0.0, applying Oracle and OS security patches, and removing an unused node from the CMSR cluster) (OTG:0068249)
  • GGUS: NTR
  • Monitoring:
    • Distributed draft SiteMon availability/reliability reports for Dec 2021
  • Middleware: NTR
  • Networks: NTR
  • Security: NTR

AOB:

Topic revision: r24 - 2022-01-10 - MaartenLitmaath
 