Week of 231204

WLCG Operations Call details

  • The connection details for remote participation are provided on this agenda page.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to the wlcg-scod list (at cern.ch) to allow the SCOD to make sure that the relevant parties have the time to collect the required information, or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Maarten (ALICE + WLCG)
  • remote: Andrew (TRIUMF), Christoph (CMS), Daniele (CNAF), Darren (RAL), David M (FNAL), Giacomo (Computing services), Henryk (LHCb + NCBJ), Ivan (BNL), Jose (PIC), Julia (WLCG), Michal (ATLAS), Panos (WLCG), Ville (NDGF), Xavier (KIT)

Experiments round table:

  • ATLAS reports
    • Issues:
      • HC mass blacklisting on Friday due to ALRB root (10:40 - 11:55, GVA time). Solved and reverted.
      • Deletion failures at RAL-LCG2 (GGUS:164289)
        • xrootd gateway load distribution issues are being investigated
      • Transfers to NDGF-T1 fail with "transfered a partial file" (GGUS:164436)
        • storage pools restarted and given more memory
      • Transfers from TRIUMF-LCG2 failed with "File not found" (GGUS:164435)
        • files declared lost

  • CMS reports
    • DC24 pre-challenge exercise (T0->T1 export), run jointly by ATLAS and CMS, went rather well
    • Xavier has observed that SAM tests have been failing for CMS since Friday. It is suspected that a test that is not in the critical profile is causing the errors. Christoph has followed up and the problem has presumably already been fixed, but a ticket will be opened to track it.

  • ALICE
    • Lowish activity on average
    • ALICE disk SE at KIT was dysfunctional for a few days (GGUS:164535)

Sites / Services round table:

  • ASGC:
  • BNL:
    • Downtime on 19.12.2023 (EST) for 24 hours, i.e. from 19.12.2023, 06:00 (CET) until 20.12.2023, 06:00 (CET)
      • Network intervention
      • US FTS update
      • No way to announce FTS downtime in OSG topology.
  • CNAF: NTR
  • EGI:
  • FNAL:
  • IN2P3:
    • LHCOPN link lost during pre-DC24 tests due to an NREN intervention. Traffic was automatically redirected to the LHCONE link with no major impact on the pre-DC24 tests (a few transfers lost). The LHCOPN link was reconnected this morning.
    • Site quarterly maintenance next week on Tuesday Dec. 12th:
      • update to dCache 9.2 -> SE in downtime in the morning (GOCDB #34782)
      • update to HTCondor 9.0.19 -> CE in downtime the whole day (GOCDB #34793)
      • maintenance on HPSS -> tape system in downtime the whole day (GOCDB #34795)
      • maintenance on network switch: update with no impact if everything goes well, but a risky operation (GOCDB #34791)
  • JINR: NTR
  • KISTI:
  • KIT:
    • GGUS:164535: many servers of the xrootd SE needed a reboot after their processes had turned into zombies (likely due to a GPFS bug that the upcoming downtime is supposed to fix).
    • Site downtime on Dec 6th/7th (GOCDB #34716)
  • NDGF:
    • Afternoon downtime on Thursday for dCache 9.2 upgrade. Hopefully only an hour or two: https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=34815
    • HPC2N compute data center lost cooling around 00:30 in the night from Sunday to Monday; power was cut later when it overheated. Cooling was restored around 03:00.
    • A router was lost on Saturday at 02:00 and restored at 07:30.
    • Some problems with dCache pools crashing, operating system / Ceph storage problems or similar issues
  • NL-T1: NTR
  • NRC-KI:
  • OSG:
  • PIC: PIC will be in OUTAGE from 12-12-2023 08:00 (PIC local time) until 12-12-2023 15:00 (PIC local time). We will perform a dCache upgrade and a migration of the virtualization platform. HTCondor will be stopped at the beginning of the SD.
  • RAL: We wish to notify everyone of the existence of 2 new job queues that have been made live at RAL.

    These queues are:

    EL8
    EL9

    These queues, when used, will run jobs in a Rocky 8 or Rocky 9 container environment, respectively. Experiments are asked to test jobs against these new queues and report back any issues or questions. This does NOT change the default queue. The default queue for all jobs submitted to RAL is still the “EL7” queue.

    These new queues (EL8, EL9) are available on any of the 7 CEs at RAL (5 production and 2 test CEs).

    We do wish to stress these queues exist currently as a test framework to confirm functionality of jobs in an EL8 or EL9 environment and jobs may encounter issues.

    One major change experiments should be aware of is the Python version installed by default in these images: Python 3 is now the default installation.
  • TRIUMF: NTR

  • CERN computing services:
    • Reminder: on Thursday December 7th the lxplus.cern.ch alias will switch to el9 and, concurrently, batch will default to el9 (OTG0076556)
    • VOMS DB upgrade done, with 30' downtime as expected (OTG0146328)
  • CERN storage services:
  • CERN databases:
  • GGUS: NTR
  • Monitoring:
  • Middleware: NTR
  • Networks:
  • Security:
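
Several of the downtimes above are quoted in local time, and the BNL one in two time zones. As a cross-check, the conversion can be sketched in a few lines of Python. This is only an illustration: it assumes the BNL downtime starts at 00:00 EST on 19.12.2023 (the minutes give only the date and the 24-hour duration), and it uses fixed winter offsets (EST = UTC-5, Geneva = CET = UTC+1) rather than a time zone database. The helper name downtime_window is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed winter offsets (assumption: no daylight saving time in December).
EST = timezone(timedelta(hours=-5), "EST")
CET = timezone(timedelta(hours=1), "CET")


def downtime_window(start_local, hours, target_tz):
    """Return (start, end) of a downtime, expressed in target_tz."""
    start = start_local.astimezone(target_tz)
    return start, start + timedelta(hours=hours)


# Assumed start: midnight EST on 19.12.2023, lasting 24 hours.
start, end = downtime_window(datetime(2023, 12, 19, 0, 0, tzinfo=EST), 24, CET)
fmt = "%d.%m.%Y, %H:%M (%Z)"
print(start.strftime(fmt), "until", end.strftime(fmt))
# prints "19.12.2023, 06:00 (CET) until 20.12.2023, 06:00 (CET)"
```

For dates close to a daylight saving change, fixed offsets are not safe and a time zone database (for example the standard zoneinfo module) would be the better choice.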

AOB:

Topic revision: r22 - 2023-12-18 - IvanGlushkov
 