Week of 210517

WLCG Operations Call details

  • For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
    • The pass code is provided on the wlcg-operations list.
    • You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-scod@cern.ch to allow the SCOD to make sure that the relevant parties have the time to collect the required information, or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • remote: Andrew (TRIUMF), Borja (Chair, Monitoring), Daniel (Security), Daniele (CNAF), David (FNAL), Maarten (ALICE), Michal (ATLAS), Nils (Computing), Xavier (KIT)

Experiments round table:

  • ATLAS reports
    • Issues:
      • "Changing file state because request state has changed" staging errors at RRC-KI-T1_MCTAPE (GGUS:151881)
        • discussion between DDM ops and site support ongoing
      • SRM_FILE_UNAVAILABLE transfer errors from FZK (GGUS:152010)
      • "TNetXNGFile::Close ERROR [ERROR] Invalid session" job failures at RAL (GGUS:151964)
        • Caused by tasks running in direct-IO transfer mode. RAL currently has limited support for vector reads.
        • There was also a storage/networking issue, which has been fixed.

  • CMS reports
    • Due to the vCHEP plenary in parallel, nobody from CMS will be able to attend
    • No major problems

  • ALICE
    • NTR

  • LHCb reports
    • same issue as ATLAS at RAL due to limited support for vector reads (GGUS:142350)
    • otherwise, nothing new to report
    • attending the vCHEP plenary as well, apologies for not being able to attend!

Sites / Services round table:

  • ASGC: NC
  • BNL: NC
  • CNAF: NTR
  • EGI: NC
  • FNAL: NTR
  • IN2P3:
    • IN2P3-CC will be in maintenance on Tuesday, June 15th. As usual, details will be available one week before the event.
    • Apologies for not attending due to vCHEP.
  • JINR: NC
  • KISTI: NC
  • KIT:
    • ATLAS ran out of disk space around 1 a.m. CEST today. The disk storage was not correctly sized to accommodate the entire pledged volume, which is now fixed. A couple of dCache pools had to be restarted, since they had shut down autonomously due to the unforeseen resource shortage. KIT was removed from the blacklist again around 11:30 a.m. CEST.
    • This incident also revealed that the ALICE share had not been sized correctly either.
  • NDGF: NC
  • NL-T1: NTR. Not able to attend due to CHEP.
  • NRC-KI: NC
  • OSG: NC
  • PIC: PIC will be in scheduled downtime on Thursday, May 20th from 08:00 to 16:00, CERN and local time. During this downtime two major interventions will be performed: the continued upgrade of our network infrastructure and upgrades to our dCache configuration. As usual, access to the computing farm will be closed at the start of the SD period.
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: NTR
  • CERN storage services: NC
  • CERN databases: NC
  • GGUS:
    • A new release is planned for Wed next week
      • Release notes
      • A downtime of 8 hours has been scheduled for 07:00-15:00 UTC
      • Test alarms will be submitted as usual
  • Monitoring:
    • Distributed final SiteMon availability/reliability reports for April 2021
  • Middleware: NTR
  • Networks: NTR
  • Security: NTR

AOB:

  • NOTE: the operations meeting next Mon will be virtual.
  • You may provide relevant incidents, announcements etc. for the operations record.
Topic revision: r18 - 2021-05-17 - BorjaGarridoBear
 