Week of 180709

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information, or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday - virtual meeting

  • You may provide relevant incidents, announcements etc. here for the operations record.

Attendance:

  • local:
  • remote:

Experiments round table:

  • ALICE -
    • NTR

Sites / Services round table:

  • ASGC:
  • BNL:
  • CNAF:
  • EGI:
  • FNAL:
  • IN2P3:
  • JINR:
  • KISTI:
  • KIT:
  • NDGF:
    • One CE (ce01.grid.uio.no) is currently down due to filesystem problems. It should return on Tuesday, or later if the problems persist. The vendor is involved in resolving them.
    • Another CE (atlas.triolith.nsc.liu.se) is going down on Thursday for two weeks and will return with a new name.
  • NL-T1:
    • A dCache pool node broke down on Monday. It was fixed only on Friday. We apologize for the inconvenience. Here's a small timeline:
      • Monday afternoon: we reported the issue to the vendor at 15:08 CEST
      • Wednesday afternoon: an engineer arrived on site (rather late, given that we have next-business-day (NBD) support) but without parts.
      • Wednesday evening: we escalated the case with the vendor. The vendor explained that one of the three replacement parts was out of stock.
      • Friday morning: engineer arrived with the two replacement parts that were in stock. One of those fixed the problem. Around lunchtime the node was up again.
      • This service level is not what we expect from this vendor and we pointed this out to them.
    • dCache 3.2 seems to have introduced a bug: when a file is deleted while dCache is writing it to tape, the pool on which the file resides is disabled and has to be restarted. A typical trigger is the SAM tests. A partial workaround, increasing the flush interval, was provided by IN2P3 on the dCache user forum; thanks!
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL: NTR.
  • TRIUMF:

  • CERN computing services:
  • CERN storage services:
  • CERN databases:
  • GGUS: NTR
  • Monitoring:
    • Draft availability reports for June 2018 have been sent around.
  • MW Officer:
  • Networks:
  • Security: NTR

AOB:

Topic revision: r11 - 2018-07-12 - VincentBrillault
 