Week of 210329

WLCG Operations Call details

  • For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
    • The pass code is provided on the wlcg-operations list.
    • You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-scod@cern.ch to allow the SCOD to make sure that the relevant parties have the time to collect the required information, or invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • remote: Alberto (MONIT), Andrew (TRIUMF), Borja (Chair, MONIT), Christoph (CMS), David M (FNAL), David B (IN2P3-CC), Federico (LHCb), Francesco (CNAF), Julia (WLCG), Kate (DB), Maarten (ALICE), Marian (MONIT, Network), Pepe (PIC), Petr (ATLAS), Pavlo (ATLAS), Xin (BNL)

Experiments round table:

  • ATLAS reports:
    • IFIC-LCG2: transfer errors, GGUS:151125
    • FZK-LCG2: transfer failures between Taiwan-LCG2 and FZK-LCG2, known issue GGUS:151050
    • CREAM CE end of life:
    • dCache restarts still necessary for Swiss CA update
      • DESY-HH (100%), DESY-ZN (86%), FZK-LCG2 (57%), IFAE (33%), IN2P3-CC (84%), PIC (94%), CA-SFU-T2 (44%), RRC-KI-T1 (100%), RU-Protvino-IHEP (100%), SARA (92%)
      • The percentage of failures in parentheses is a rough estimate of the share of dCache doors that need a restart

    • Julia: Do you need help notifying sites about the restart?
    • Petr: ATLAS would like to know which steps to follow next. Maarten already sent an email and nothing happened
    • Julia: The next step will be to send tickets
    • Maarten: Indeed, but ATLAS people should submit them, since WLCG is only acting as a middleman
    • Julia: It might be relevant for other VOs, so WLCG should take care of the tickets
    • David B: How are the percentages calculated? IN2P3-CC doesn't see any errors
    • Petr: The percentages are extracted from 100 test transfers; the error shows up only in tests
    • David B: We already restarted the WebDAV ports. Wasn't this enough? Should we do it at every update?
    • Maarten: All dCache services need to be restarted. It's unusual for a CA to renew itself in such a way, so this is very unlikely to happen again in the short term
    • Christoph: CMS still uses HTTP-TPC less than ATLAS, which may be why we have not been affected by this issue yet
    • David B: Checked with the dCache admin; the plan is to do the restart at the next major stop (June), if that is not a problem
    • Petr: We would like to finish the migration by May, but will check with ATLAS
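
As a rough illustration of how the percentages quoted above can be derived from a batch of test transfers, here is a minimal Python sketch. The function name and the sample data are hypothetical; the actual ATLAS test tooling is not described in these minutes.

```python
def failure_percentage(results):
    """Return the share of failed test transfers as a percentage (0-100).

    `results` is a list of booleans, True meaning the transfer succeeded.
    """
    if not results:
        raise ValueError("no test transfers recorded")
    failed = sum(1 for ok in results if not ok)
    return 100.0 * failed / len(results)


# Hypothetical sample: 100 test transfers, 57 of which failed
sample = [False] * 57 + [True] * 43
print(f"{failure_percentage(sample):.0f}% of test transfers failed")
```

With a fixed batch of 100 transfers the percentage equals the number of failures, which is why it only roughly reflects the fraction of doors still needing a restart.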

  • CMS reports:
    • Daylight saving time bug in XRootD
      • Sites using native authenticated xrootd servers or site redirectors are advised to restart the service
      • Problem is identified and soon to be addressed by developers
    • New HTCondor release deployment
      • Solves recent issues with ARC 6.10
      • CMS SAM/ETF upgraded
      • CMS pilot factories partly done - upgrade in the US continues this week
    • CMS Rucio will use a new certificate DN
      • Sites using explicit DN gridmap files need to update them before mid-April (when relying on VOMS roles only, no action should be needed)
      • old DN: "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsrucio/CN=430796/CN=Robot: CMS Rucio Data Transfer"
      • new DN: "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsrucio/CN=779320/CN=Robot: CMS Rucio Data Transfer"
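
For sites that maintain an explicit grid-mapfile, the update could be scripted along the following lines. This is only a sketch: it assumes the common `"DN" account` line format, the account name in the sample is hypothetical, and keeping the old mapping alongside the new one during the transition is a choice made here, not something stated in the report.

```python
OLD_DN = ("/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsrucio"
          "/CN=430796/CN=Robot: CMS Rucio Data Transfer")
NEW_DN = ("/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmsrucio"
          "/CN=779320/CN=Robot: CMS Rucio Data Transfer")


def add_new_dn(gridmap_text, old_dn=OLD_DN, new_dn=NEW_DN):
    """For each line mapping old_dn, add an identical mapping for new_dn.

    Existing lines are kept untouched, and a mapping that is already
    present is not added twice, so the function is safe to re-run.
    """
    lines = gridmap_text.splitlines()
    present = {line.strip() for line in lines}
    prefix = f'"{old_dn}"'
    out = []
    for line in lines:
        out.append(line)
        if line.strip().startswith(prefix):
            candidate = line.replace(old_dn, new_dn, 1)
            if candidate.strip() not in present:
                out.append(candidate)
                present.add(candidate.strip())
    trailing = "\n" if gridmap_text.endswith("\n") else ""
    return "\n".join(out) + trailing


# Hypothetical grid-mapfile content; "cmsuser" is an assumed local account
sample = f'"{OLD_DN}" cmsuser\n'
print(add_new_dn(sample), end="")
```

Once the old certificate is no longer in use, the line with the old DN can be removed by hand.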

  • ALICE
    • NTR

  • LHCb reports:
    • Activity:
      • MC and WG productions; User jobs.
    • Issues:

Sites / Services round table:

  • ASGC: NC
  • BNL: NTR
  • CNAF: CNAF to be considered "at risk" due to electrical works. GOC:30354
  • EGI: NC
  • FNAL: NTR
  • IN2P3: The dCache restart for the Swiss CA problem is foreseen for the next downtime in June.
  • JINR: NTR
  • KISTI: NC
  • KIT: NTR
  • NDGF: NC
  • NL-T1:
    • Tape maintenance ongoing. GOC:30371
    • Unable to attend, apologies.
  • NRC-KI: NC
  • OSG: NC
  • PIC: (Downtime to be updated by Pepe)
    • Petr: So the downtime will trigger the needed restart of dCache services?
    • Pepe: Yes, it will
    • Maarten: Pepe, please make sure the latest CAs are installed before the restart
  • RAL: NC
  • TRIUMF: 12-hour downtime from Monday March 29th 17:00 to March 30th 05:00 (UTC) for storage maintenance. GOC:30396

  • CERN computing services: NC
  • CERN storage services: NC
  • CERN databases:
    • ATLAS offline database (ATLR) upgrade to version 19c on 7th of April - OTG:0062675
  • GGUS: NTR
  • Monitoring:
    • ATLAS/CMS discovered some inconsistencies in the CRIC data used for SiteMon. We are waiting for a green light from CRIC to schedule a new migration
      • The migration tentatively planned for tomorrow is delayed

    • Julia: CRIC takes full responsibility for this delay. We will work with the experiments at high priority to sort this out, since each experiment's VOFeed has particularities that are hard to handle in a common way
    • ETF CMS was updated to HTCondor 8.8.13 on Thursday last week (25th of March)
    • ETF ATLAS will be updated on 1st of April and will retire CREAM-CE testing at the same time (because HTCondor 8.8.* in practice does not work with CREAM-CEs)
  • Middleware:
    • HTCondor 8.8.13 was released on March 23
      • It contains the ARC client fix needed for ARC >= 6.10.1
      • CMS pilot factories are being upgraded
      • SAM ETF has been upgraded for CMS
      • SAM ETF for ATLAS will be upgraded by 1st of April
  • Networks: The issue with SAMPA/LHCONE performance reported last week is still ongoing
  • Security: NC

AOB:

  • NOTE: the operations meeting next Monday will be virtual.
  • You may provide relevant incidents, announcements etc. for the operations record.
  • Have a good Easter break!
Topic revision: r22 - 2021-03-29 - MaartenLitmaath
 