Week of 240506

WLCG Operations Call details

  • The connection details for remote participation are provided on this agenda page.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to the wlcg-scod list (at cern.ch), so that the SCOD can make sure that the relevant parties have time to collect the required information, or invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Borja (monitoring), Maarten (ALICE + WLCG), Panos (CMS + WLCG)
  • remote: Alessandro (CNAF), Christian (NDGF), Christoph (CMS), David B (IN2P3-CC), David M (FNAL), Farrukh (FNAL), Julia (WLCG), Maria (computing), Xavier (KIT)

Experiments round table:

  • ATLAS reports
    • MC simulation release delayed so workload reduced over next week or two

  • CMS reports
    • CMS encountered problems writing to EOS at CERN on Sunday evening
      • Due to the impact on data taking, an alarm ticket was created (GGUS:166721); a user ticket had been opened before that (GGUS:166720) and was also processed on Sunday evening
      • The EOS team fixed a bug in the gridmap file that CMS had introduced itself
      • Apologies for sending an alarm on an internally created bug
      • CMS internal follow-up is ongoing
        • VM hosting the service (script) to generate the gridmap file is being migrated to Alma9
        • Looks like critical changes became active during the weekend
    • Campaign to phase out deprecated protocols is progressing
    • Discussion:
      • Maarten:
        • was there indeed a connection with the network incident? (OTG:0149898)
        • or was that just a coincidence?
      • Christoph:
        • it is not known at this time
      • Maarten:
        • there will probably be some follow-up on that outage,
          because the fallout looked bigger than it should have been
      • Maarten:
        • the user ticket that was first opened had these issues:
          • a nondescript subject line
          • CERN-PROD was not selected as the notified site,
            which left the ticket for first-line support in EGI,
            which would have reacted only Monday morning
        • by pure chance, I happened to see the alarm ticket arrive
        • on inspection, it looked like a tricky case for the CERN TI operator,
          because it referred to the other GGUS ticket for the details,
          but the operator can only access SNow tickets at best...
        • at least the subject line of the alarm ticket was quite helpful
        • I decided to call the operator to ensure the correct action was taken,
          viz. to contact EOS support according to the alarm ticket procedures
        • I assigned the user ticket to CERN-PROD and even to EOS support
        • the EOS expert on call analyzed the problem and applied a workaround
          to cure the problem within 1h, hooray!
      • Christoph:
        • the main problem was that the shift crew could not open an alarm ticket
          and had to rely on someone else to do that eventually
        • we will review the list of people who can open such tickets
      • Maarten:
        • FYI, the TI operator procedures will be further improved
        • if an alarm ticket does not seem to get a timely response,
          experiment control rooms can always call the TI operator

  • ALICE
    • NTR

Sites / Services round table:

  • BNL: NTR
  • CNAF: Updating to HTCondor 23; this week we will update 3 CEs to Alma9 and HTCondor 23 (GOCDB:35322 and GOCDB:35323)
  • EGI:
  • FNAL: Alma9 migration of services underway. Nothing major to report otherwise.
  • IN2P3: NTR
  • JINR:
  • KISTI:
  • KIT:
    • LHCb "spontaneously" gives https now higher priority than SRM (GGUS:166451), triggered by the issues that came up with streamlined contact endpoint names for their dCache SE (GGUS:166268).
    • CE-4 is draining in preparation for the reinstallation on RHEL-8 (GOCDB:35328).
  • NCBJ:
  • NDGF:
    • Two pools have been misbehaving since yesterday afternoon, generating a lot of failures. The cause is not yet understood, possibly related to the underlying Ceph storage; work is ongoing.
    • Some pools will be OS upgraded tomorrow. Only short interruptions expected. (GOCDB:35356)
  • NL-T1:
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL:
  • TRIUMF: NTR

  • CERN computing services:
    • Due to the network outage that happened on Sunday (OTG:0149898), we experienced some Puppet issues that should be solved now.
  • CERN storage services:
  • CERN databases:
  • GGUS: NTR
  • Monitoring:
    • Draft SiteMon reports for April sent around
  • Middleware: NTR
  • Networks:
  • Security:

AOB:

  • Next week the WLCG/HSF Workshop 2024 will take place Mon-Fri at DESY.
  • The week after that has the Whit Monday holiday.
  • NOTE: the next two operations meetings will be virtual.
  • You may provide relevant incidents, announcements etc. for the operations record.
Topic revision: r15 - 2024-05-06 - MaartenLitmaath
 