Week of 200323

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch. This allows the SCOD to make sure that the relevant parties have time to collect the required information, or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • remote: Kate (WLCG, DB, chair), Julia (WLCG), Maarten (WLCG, ALICE), Eric (storage), Xavier (KIT), Vincent (security), Vladimir (LHCb), Borja (monitoring), Michal (ATLAS), Darren (RAL), Cristi (storage), Christoph (CMS), Andrew (TRIUMF), Christian (NDGF), Dave M (FNAL), David B (IN2P3)

Experiments round table:

  • ATLAS reports -
    • Activities:
      • Ongoing reprocessing
      • SFO->CTA stress test finished on Friday
    • Issues
      • staging errors (“Staging not allowed”) at pic (GGUS:146082)
        • After the dCache upgrade, no stage requests were allowed through SRM
      • 22 lost RAW files on IN2P3-CC_DATATAPE - recovered
      • 55 lost EVNT files on SARA-MATRIX_MCTAPE - irrecoverable
      • “performance marker timeouts” in transfers to IN2P3-CC (GGUS:146128)
        • ATLAS, CMS and LHCb have all had heavy transfer activity in recent days, which saturated the 40 Gbps network switch that connects the dCache servers to the LHCOPN network
      • on Saturday, GGUS ticket submission returned an Apache test page (this had also happened a few days earlier)
        • GGUS support informed by shifter

  • ALICE -
    • NTR

  • LHCb reports -
    • Activity:
      • Stripping campaign is finished.
      • Staging is finished.
    • Issues:
      • NTR

Sites / Services round table:

  • ASGC: nc
  • BNL: dCache downtime scheduled for 03/25~27 has been postponed indefinitely.
  • CNAF: CMS is being migrated from LSF to HTCondor today; ALICE and ATLAS were already migrated last week.
  • EGI: nc
  • FNAL: NTR
  • IN2P3: NTR
  • JINR: NTR
  • KISTI: nc
  • KIT: Reminder for storage downtime on 1st of April, where we have to update the firmware of our storage systems. All storage will be offline, so all VOs will be affected!
  • NDGF: NTR. All countries are working from home now. This was already about 75% the model for rotating ops staff anyway, i.e. access only to resources in one's own country. No impact expected.
  • NL-T1: NTR
  • NRC-KI: nc
  • OSG: nc
  • PIC: The building's electrical maintenance has been canceled. There will be no downtime on Tuesday 7th of April. This intervention will happen much later, no date fixed yet.
  • RAL: RAL is now effectively running in a "working from home" posture. Currently no impact on services, so far business as usual.
  • TRIUMF: NTR

  • CERN computing services: nc
  • CERN storage services:
    • EOSALICE downtime on Sunday: OTG:0055546
      • caused by a similar HW issue as on Friday, 13th: OTG:0055417
      • to prevent this from happening again, the namespace will be configured to run on a different machine: OTG:0055562
    • CASTORCMS: slow stager_qry response times since the DB upgrade of 3rd Feb 2020. This leads to timeouts in PhEDEx and breaks its staging. The DB team is investigating. OTG:0055568
    • Client configuration change for the EOS mounts for EOSATLAS and EOSCMS; in production starting today
    • EOSLHCB planned upgrade tomorrow at 10:00 OTG:0055312
  • CERN databases: NTR
  • GGUS:
    • A new release is planned for Wed this week
      • Release notes
      • A downtime has been scheduled for 07:30-10:30 UTC
      • Test alarms will be submitted as usual
  • Monitoring: NTR
  • MW Officer: nc
  • Networks: NTR
  • Security: NTR

AOB:

Topic revision: r24 - 2020-03-23 - JosepFlix
 