Week of 211018

WLCG Operations Call details

  • For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
    • The pass code is provided on the wlcg-operations list.
    • You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local:
  • remote: Miro (Chair), Xavier (KIT), Maarten (ALICE), Julia (WLCG), Kate (DB), Andrew (NL-T1), Pinja (Security), Ville (NDGF), Jiri (ATLAS), Henryk (LHCb), Andrew (TRIUMF), Borja (Monitoring), David (FNAL), Alberto (Monitoring), Nils (Compute), Christoph (CMS), Chien-De (ASGC), Vincenzo (CNAF), Marian (Network), Douglas (BNL)

Experiments round table:

  • CMS reports (raw view) -
    • (Partial) network outage at CERN on Friday late afternoon (Oct 15th): OTG:0066817
      • Several CMS services affected, in particular the CMS web services
      • Main issue: voms-admin clients failing (voms-proxy-init kept working though): INC:2952661

  • ALICE
    • Some periods of low activity due to lack of ready productions
    • No major issues
    • The tape challenge went fine Mon-Fri last week: plot

  • LHCb reports (raw view) -
    • Activity:
      • Tape data challenge (ongoing, ~10 GB/s throughput achieved)
    • Issues:
      • Network issue at CERN on Friday, almost all VO-boxes affected

Sites / Services round table:

  • ASGC: The site will be in downtime on Nov 7 for international circuit re-cabling.
  • BNL: Tape downtime from 04:00 to 20:00 on Oct 19. Over the past week BNL has been upgrading the ATLAS nodes to Condor 9.0.6.
  • CNAF:
  • EGI: NC
  • FNAL: NTR
  • IN2P3: NTR
  • JINR: NC
  • KISTI: NC
  • KIT: NTR
  • NDGF: Some sites were affected by the CERN network outage.
  • NL-T1:
    • Nikhef: After our network outages on October 6th and 8th, our ARC CEs were very slow to update their BDII information. This was due to a combination of a large number of dead jobs left over after the network problems and slow code in the CEinfo.pl script that ARC uses: for each batch queue the script loops over all batch system jobs to try to match them with jobs known to the CE. After the network issues we had on the order of 30,000 jobs known to each CE; each loop over the batch jobs took around 5 minutes per queue, which added up to around 1 hour for all our queues (a sketch follows after this list). After removing the jobs from the CEs that were no longer in the batch system, the update time eventually dropped to a couple of minutes in total. We also plan to follow up with the ARC developers with a patch to speed up the slow code.
    • SURF: Tape is currently unavailable. We are looking into it. A downtime has been added in the GOCDB.
  • NRC-KI: NC
  • OSG: NC
  • PIC: NC
  • RAL: NTR
  • TRIUMF: NTR
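
The following minimal Python sketch illustrates the Nikhef BDII report above. It is not the actual CEinfo.pl Perl code: the job dictionaries and field names are hypothetical, and it only contrasts the per-queue linear scan over ~30,000 CE jobs with an indexed lookup built once.

    # Sketch of the matching pattern described by Nikhef (hypothetical data
    # layout, not the real CEinfo.pl code). For Q queues, N batch jobs and
    # M CE jobs, the scan version does O(Q * N * M) comparisons, while the
    # indexed version does O(M + Q * N) work.

    def match_scan(queues, batch_jobs, ce_jobs):
        """Per queue, loop over all batch jobs and find the corresponding
        CE job with a linear search (slow when M is ~30,000)."""
        matched = {}
        for queue in queues:
            for bj in batch_jobs:
                if bj["queue"] != queue:
                    continue
                for cj in ce_jobs:  # linear scan repeated for every batch job
                    if cj["id"] == bj["id"]:
                        matched[bj["id"]] = cj
                        break
        return matched

    def match_indexed(queues, batch_jobs, ce_jobs):
        """Index the CE jobs by ID once; each match is then a dict lookup."""
        ce_index = {cj["id"]: cj for cj in ce_jobs}  # built once
        wanted = set(queues)
        return {
            bj["id"]: ce_index[bj["id"]]
            for bj in batch_jobs
            if bj["queue"] in wanted and bj["id"] in ce_index
        }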

  • CERN computing services:
    • Following the major network outage in the CERN computer centre late on Friday afternoon (OTG:0066817), batch and grid services were affected until around 20:00 (OTG:0066829). Furthermore, the VOMS admin application was unavailable during the weekend due to a database connectivity problem after the network incident (OTG:0066864).
    • The Condor-CE grid job router now also supports GPU resources (GGUS:154316).
    • The upgrade of the CMS and ATLAS IAM instances from OpenShift 3 to OpenShift 4 is planned for Monday 25th (OTG:0066928).
  • CERN storage services:
    • FTS service partially affected by network incidents
  • CERN databases:
    • Both the DBoD services and Oracle were affected by the major network outage.
    • The VOMS database crashed and was restarted. After the restart, a database parameter that had been modified in the past but not set persistently caused high load, impairing the lbtrans database.
    • The CMSR database listener was restarted but did not register all services properly, causing issues for Rucio.
    • The CMSONR ADG was restarted, but the Frontier service did not start properly and had to be restarted manually.
  • GGUS: NTR
  • Monitoring:
    • Distributed final SiteMon availability/reliability reports for September 2021
    • ETF impacted by the CERN DC network outage (OTG:0066841)
  • Middleware: NTR
  • Networks:
  • Security: NTR

AOB:
