Week of 180129

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a particular topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Kate (chair, WLCG, DB), Julia (WLCG), Maarten (ALICE, WLCG), Vladimir (LHCb), Borja (monitoring), Alberto (monitoring), Marian (network), Paul (storage), Petr (ATLAS)
  • remote: Onno (NL-T1), Xavier (KIT), Di Qing (TRIUMF), Kyle (OSG), Marcelo (CNAF), Darren (RAL), Stephan (CMS), David M (FNAL), Pepe (PIC)

Experiments round table:

  • ATLAS reports -
    • kernel update & reboot of CERN machines
      • did not cause any major problems
      • some of our services were in the same availability zone
    • CERN-P1 in downtime for cooling maintenance => ~40k fewer job slots for our production
    • preliminary tests with/without Meltdown & Spectre patches: a 2.5-3% performance difference for reconstruction
    • RAL FTS server issues: ~1M transfers stuck
      • started on Thursday, but at that time a US site in downtime was suspected (GGUS:133066)
        • that site was not automatically blacklisted for ATLAS transfers
      • on Friday we switched the ND & IT clouds to another FTS server, and this morning the UK cloud as well
    • RAL CASTOR performance issues (the database needed reindexing) after a deletion backlog caused by removing 600 TB from storage
    • tape buffers - manual blacklisting is currently needed to throttle transfers in case of a full buffer
      • more transfers to tape between sites because of changes in our policy (tape & disk replicas can't be at the same site)

  • CMS reports -
    • No major issues
    • Recovery from the Spectre/Meltdown reboots went fine
    • MWGR-1 Thursday/Friday went well, last run continued until Monday morning
    • Smooth running at about 180k cores (1/4 analysis, 3/4 production); an increase in the negotiation cycle was traced to a bad config at one of our sites.

  • ALICE -
    • Normal to high activity level on average.
    • Several periods of high job failure rates during the week.
      The cause is not clear; the problem may well still be present.
      There could be a network issue affecting the central services;
      GGUS:133052 was opened for network experts to have a look.
    • IN2P3-CC: very high job failure rates and VOBOX overloads starting
      Fri evening, apparently due to some WNs having a stale CVMFS cache.
      • Job submissions had to be stopped Sat evening.
      • GGUS:133088 opened today.
    • CNAF: first successful jobs from Thu late afternoon through Sun morning.
The network team will be pinged to check the GGUS ticket.

  • LHCb reports -
    • Activity
      • HLT farm is partially running
      • 2016 data restripping, MC simulation and user jobs
    • Site Issues
      • CERN/T0
        • NTR
      • T1
Vladimir asked what the best way is to follow up on a CERN cloud incident when there is no clear error or confirmation. Julia and Maarten suggested opening a GGUS ticket. Maarten reminded attendees of the issue with the CERN mail servers that has been delaying email delivery.

Sites / Services round table:

  • ASGC: nc
  • BNL: NTR
  • CNAF: Library is functional
    • Buffer is being recovered
    • Waiting for parts for the storage systems
    • Waiting for the return of the wet tapes (end of February)
    • Subfarm will be accessible to other experiments
  • EGI: nc
  • FNAL: NTR
  • IN2P3: nc
  • JINR: NTR
  • KISTI: nc
  • KIT: NTR
  • NDGF: nc
  • NL-T1: a non-HEP VO is submitting large numbers of jobs, making the SARA-MATRIX batch system unstable. GGUS:133089
  • NRC-KI: nc
  • OSG: NTR
  • PIC: NTR
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services:
    • The 2nd half of the Batch service (currently draining) will be rebooted for patching on Tuesday 30th, at which point full capacity should return.
  • CERN storage services:
    • EOSCMS: Updated to EOS 4.x (Citrine) last Tuesday; some instabilities followed, but a fix was deployed on Thursday.
  • CERN databases: Multiple rolling database interventions are planned for this week to apply the latest security patch (lhcbonr, lhcbr, lcgr, aliconr, adcr, atlr and castor databases). CMS DBs are to be patched next week.
  • GGUS:
    • A new release is planned for Wed this week
      • Release notes
        • In particular: tickets will be anonymized 18 months after getting closed
      • A downtime has been scheduled for 07:00-09:00 UTC
      • Test alarms will be submitted as usual
  • Monitoring: NTR
  • MW Officer: NTR
  • Networks: NTR
  • Security: NTR

AOB:

Topic revision: r23 - 2018-01-29 - MaartenLitmaath
 