Week of 210809

WLCG Operations Call details

  • For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
    • The pass code is provided on the wlcg-operations list.
    • You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local:
  • remote: Miro (Chair, DB), Borja (Monitoring), David B. (IN2P3), David M. (FNAL), Maarten (ALICE), Nils (Compute), Peter (ATLAS), Pinja (Security), Steve (Storage), Xavier (KIT), Andrew (TRIUMF), Julia (WLCG), Darren (RAL), Christophe (CMS), Chien-De (ASGC), Ville (NDGF), Mihai (Storage), Alberto (Monitoring)

Experiments round table:

  • ATLAS reports (raw view) -
    • BNL - PIC transfer timeouts: the dCache limit is 7200 s (davs). Changing the number of FTS transfers had no effect; the Spanish NREN-GEANT link is saturated and due for an upgrade in 2021. GridFTP transfers hit the limit that is set dynamically from the file size assuming 500 kB/s (see the timeout sketch below). Our FTS transfers use less than 200 MB/s of the theoretical 1.25 GB/s available. We would need to know what is using the rest, e.g. CMS, and whether that can be scaled back. If CMS uses FTS, then maybe we could use a common instance for now.
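
    A back-of-the-envelope sketch of the timeout arithmetic above (illustrative only, not ATLAS or FTS code): assuming the dynamic GridFTP timeout is simply the file size divided by the quoted 500 kB/s, it shows which file sizes can still fit within the fixed 7200 s dCache (davs) limit. The constants come from the report; the helper names are made up.

      # Illustrative sketch of the timeout arithmetic quoted above; not ATLAS/FTS code.
      DAVS_LIMIT_S = 7200        # fixed dCache (davs) limit quoted in the report
      ASSUMED_RATE_BPS = 500e3   # 500 kB/s, the rate assumed for the dynamic limit

      def dynamic_timeout_s(file_size_bytes: float) -> float:
          """Timeout a transfer would be granted at the assumed 500 kB/s."""
          return file_size_bytes / ASSUMED_RATE_BPS

      def fits_in_davs_window(file_size_bytes: float) -> bool:
          """True if such a transfer could also finish within the fixed 7200 s limit."""
          return dynamic_timeout_s(file_size_bytes) <= DAVS_LIMIT_S

      if __name__ == "__main__":
          for size_gb in (1, 3.6, 10):
              size_b = size_gb * 1e9
              print(f"{size_gb:5.1f} GB -> dynamic timeout {dynamic_timeout_s(size_b):7.0f} s, "
                    f"fits in 7200 s davs window: {fits_in_davs_window(size_b)}")

    At the assumed 500 kB/s, anything larger than about 3.6 GB already needs more than 7200 s, which is consistent with large-file transfers hitting the davs timeout.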

  • CMS reports (raw view) -
    • CMS struggling a bit with lowish CPU efficiency - under investigation
      • Certainly caused by the type of workflows - no widespread site issue
    • Disk space usage in the 'unmerged' area still under observation

  • ALICE
    • NTR

Sites / Services round table:

  • ASGC: NTR
  • BNL: NC
  • CNAF: NC
  • EGI: NC
  • FNAL: T1 partially drained because a certificate quietly expired over the weekend; it has been updated and jobs are starting again (see the expiry-check sketch after this list).
  • IN2P3: NTR
  • JINR: NC
  • KISTI: NC
  • KIT:
    • The xrootd redirector for alice-tape-se.gridka.de died on Saturday afternoon (we lost 25% availability due to that); we are bringing it back online right now.
    • As requested by ATLAS (GGUS:153333), we've set up a new dCache pool and SpaceToken for their SRM+HTTPS tests.
    • Since Thursday we have been publishing two versions of the SRR for CMS: the "regular" one and an experimental version that reports numbers irrespective of SRM space reservations.
    • IPv6 connectivity issues between NRC-KI-T1 and GridKa have been resolved (GGUS:153313).
  • NDGF: The Blugrass ARC CE at Linköping had some hardware issues last week; some hardware was replaced and it looks OK now.
  • NL-T1: NC
  • NRC-KI: NC
  • OSG: NC
  • PIC: NC
  • RAL:
    • On Saturday August 14th the RAL core network will be in downtime for several hours to undergo a major upgrade. While this does not directly affect the LHCOPN link or other external links to the Tier-1, it will have a major impact on many services. Since we are not in a data-taking period and numerous people are currently on holiday, we intend to start taking most services down on Friday afternoon, with the aim of bringing them back up by lunchtime on Monday. The Castor tape archive will be down, as the Oracle backend databases are affected. Echo will stay up. Job submission to the batch farm will be paused, as jobs would lose external connectivity.
    • We hope that keeping Echo up will mitigate most of the impact on the VOs, since any urgently needed data can be transferred elsewhere. The FTS service and the CVMFS Stratum 0/1 will be taken down for the duration. The GOCDB will stay up in read-only mode at its backup location. While internal service monitoring will be impacted, GGUS alarm tickets will be processed as normal.
  • TRIUMF: NTR
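
  Regarding the FNAL item above, here is a minimal sketch of the kind of probe that would catch a host certificate expiring quietly over a weekend. It is purely illustrative and not FNAL's actual tooling; the endpoint and the 14-day warning window are assumptions. Python standard library only.

      # Hypothetical certificate-expiry probe; endpoint and warning window are assumptions.
      import socket
      import ssl
      import time

      WARN_WINDOW_S = 14 * 24 * 3600   # warn two weeks ahead of expiry (assumption)

      def seconds_until_expiry(host: str, port: int = 443) -> float:
          """Return the number of seconds until the service certificate's notAfter date."""
          ctx = ssl.create_default_context()
          with socket.create_connection((host, port), timeout=10) as sock:
              with ctx.wrap_socket(sock, server_hostname=host) as tls:
                  not_after = tls.getpeercert()["notAfter"]
          return ssl.cert_time_to_ts(not_after) - time.time()

      if __name__ == "__main__":
          # Placeholder endpoint; a probe like this is meant to warn *before* expiry,
          # since an already expired certificate would simply fail the TLS handshake.
          remaining = seconds_until_expiry("example.org", 443)
          print(f"certificate expires in {remaining / 86400:.1f} days -> "
                f"{'OK' if remaining > WARN_WINDOW_S else 'WARNING: renew soon'}")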

  • CERN computing services:
    • Reduced batch capacity during kernel upgrade campaign, ref. OTG:0065291
  • CERN databases: NTR
  • CERN storage services: NTR
  • GGUS: NTR
  • Monitoring: NTR
  • Middleware: NTR
  • Networks: NC
  • Security: NTR

AOB:
