Week of 180611

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Kate (chair, DB, WLCG), Julia (WLCG), Maarten (ALICE, WLCG), Gavin (comp), Ivan (ATLAS), Vincent (sec), Andrea V (LHCb), Borja (monit), Vladimir (LHCb)
  • remote: Marcelo (CNAF), Andrew (NL-T1), Pepe (PIC), Ville (NDGF), John (RAL), Christoph (CMS), David B (IN2P3), Di (TRIUMF), Dave M (FNAL), Xin (BNL), Balazs (MW), Sang Un (KISTI)

Experiments round table:

  • ATLAS reports -
    • Stable operation at:
      • Slots: 288k / 330k
      • Transfers: 8 GB/s
    • Starting tests to switch the Spanish Cloud to Harvester (new workload management system)

  • CMS reports -
    • High CPU usage: ~260k cores (~220k production)
    • CEPH outage at Wigner on Wednesday (OTG:0044290) impacted CMS Frontier service
    • CVMFS/Squids very slow on Wednesday (OTG:0044296)
      • Unrelated to CEPH issue as far as CMS Squid experts could tell

  • ALICE -
    • NTR

  • LHCb reports -
    • Activity
      • Data reconstruction for 2018 data, MC simulation, user jobs
    • Site Issues

Sites / Services round table:

  • ASGC: nc
  • BNL: NTR (ticketing system works after OSG changes; GGUS will be used directly once the reconfiguration is completed)
  • CNAF:
    • CMS kept 60k+ pending pilots, which is one order of magnitude more than their pledge and had a DoS effect on Tomcat in our CEs.
      • LHCb - Ticket opened for aborted pilots (GGUS:135582), a consequence of the disrupted behaviour of the CEs
      • LHCb - For the same reason only ~1000 jobs are running
    • More storage space will be installed this week; the amount is to be confirmed.
Maarten asked if the pilots were killed; Marcelo confirmed that most of them were. Maarten and Christoph suggested that CNAF open a ticket to CMS to follow up on the investigation of the issue.
  • EGI: nc
  • FNAL: NTR (same as at BNL: GGUS tickets to be passed directly to SNOW; alarm tickets have already been tested)
  • IN2P3: Site in scheduled maintenance tomorrow, June 12th. dCache will be upgraded to 3.2, so no jobs will run for ATLAS, CMS and LHCb. ALICE will still run as there will be no intervention on XRootD.
  • JINR: dCache and Postgres have been upgraded. Both SEs Disk & Buffer/MSS now run dCache 3.2 and Postgres 10.
  • KISTI: NTR
  • KIT: nc
  • NDGF:
  • NL-T1:
    • Starting on Friday we had several problems with our tape back-end at SURFsara, during which the tape system was sometimes unavailable. We managed to get it back online on Monday just before noon. We are investigating this incident together with the supplier of the system.
    • dCache will be upgraded from 2.16 to 3.2 on June 26 & 27. GOCDB
  • NRC-KI: nc
  • OSG: nc
  • PIC: NTR
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: CVMFS issues last week as mentioned by experiments
  • CERN storage services: nc
  • CERN databases: NTR
  • GGUS: NTR
  • Monitoring: NTR
  • MW Officer: NTR
  • Networks:
  • Security: NTR

AOB:

Topic revision: r20 - 2018-06-11 - KateDziedziniewicz
 