Week of 210301

WLCG Operations Call details

  • For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
    • The pass code is provided on the wlcg-operations list.
    • You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information, or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • remote: Kate (DB, chair), Maarten (WLCG, ALICE), Christoph (CMS), Federico (LHCb), Pavlo (ATLAS), Francesco (CNAF), Alberto (monitoring), David B (IN2P3), Xavier (KIT), Pedro (monitoring), Andrew (TRIUMF), Marian (monitoring, network), Darren (RAL), Onno (NL-T1), Gavin (computing), DaveM (FNAL), Pepe (PIC), Steve (storage), Christian (NDGF)

Experiments round table:

  • ATLAS reports
    • EOS ATLAS back to GridFTP TPC because of bug in XRootD HTTP-TPC implementation - 30% failure rate after Monday EOS ATLAS update (GGUS:150647, GGUS:150699, OTG:0062286)
    • Testing patched HTCondor compatible with ARC-CE 6.10.1
    • INFN-T1 - AGLT2 stuck transfers still not understood, affects both protocols (gsiftp, https), transfer of big files (20GB+) almost never succeeds
    • Chile CA CRLs still not reachable with IPv6 - reported a month ago (GGUS:150245) - we saw occasional issues with transfers from Chile, probably related to fetch-crl timeouts
Maarten commented that the CAs should be contacted directly; he will contact the 3 CAs mentioned as not working. Maarten also requested that ATLAS review the procedure for opening ALARM tickets in case of DBoD issues, as an ALARM ticket was opened by mistake on Sunday night.

  • CMS reports
    • Contacted sites that ran into issues after upgrading to ARC-CE v6.10 (all applied a patch - TBC!)
    • More sites were contacted via GGUS regarding availability of webDAV endpoint
    • Otherwise no major problems and good resource usage

  • ALICE
    • NTR

  • LHCb reports
    • Activity:
      • MC and WG productions; User jobs.
    • Issues:
      • CERN: Migration to CTA to start this week, putting CASTOR in read-only mode on Wednesday.

Sites / Services round table:

  • ASGC: nc
  • BNL: nc
  • CNAF: ticket 150642 (INFN-T1 - AGLT2 stuck transfers) under further investigation, otherwise NTR.
  • EGI: nc
  • FNAL: NTR
  • IN2P3: NTR
  • JINR: nc
  • KISTI: nc
  • KIT: Security updates on network components went fine last Friday. However, two switches are still pending, for which we need to coordinate a downtime, since the entire storage is linked to them.
  • NDGF: NTR
  • NL-T1:
    • Nikhef: The job status and gridftp job submission problems on our ARC-CEs have been fixed:
      1. The access fields in the ARC-CE LDAP schema were reordered (originally patched to be GDPR compliant).
      2. The bdii-update script was patched to use the bind DN, infosys, for the ldapsearch parameters.
      3. Added the correct DN for the ATLAS robot to be able to submit and retrieve output from the test job.
    • Nikhef: Weds 3rd March 09:00 to 17:00, upgrading dCache storage to a new dCache patch release. System will remain up but some read transfers will be dropped as pools are updated. (https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=30277)
    • Surf (Sara-matrix): investigating an issue with certificates from our new CA Sectigo. GGUS:150757 GGUS:150706. Possibly multiple issues; one issue may be that Perl (fetch-crl!) in CentOS 8 prefers IPv6 over IPv4 and some CA CRL webservers are misconfigured for IPv6.
Maarten reported that the INFN CA was decommissioned and should be removed from the configuration soon. Issues are to be followed up on a case-by-case basis.
  • NRC-KI: nc
  • OSG: nc
  • PIC: NTR
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: NTR
  • CERN storage services: NTR
  • CERN databases: NTR
  • GGUS: NTR
  • Monitoring:
    • Migration of SiteMon job to CRIC delayed
    • SiteMon/ETF - as was agreed at the WLCG operations coordination, we will retire CREAM-CE testing for all experiments on March 8th (next Monday)
    • SiteMon/ETF - HTCondor 8.8.13 will contain a fix for ARC-CE 6.10 hashed DNs (fixes submission to ARC-CE 6.10+ via the gsiftp interface). SAM/ETF for ATLAS and CMS will be updated after the official HTCondor release (tentatively 9th of March).
      • Fixing the availability numbers can be done by either: applying a generic correction to the full VO (done by MONIT) or fixing site by site (done by Experiment representatives).
      • (Agreed during the meeting) ATLAS and CMS will check if they can produce the required recomputation requests for all affected sites. If not they should contact MONIT.
      • As a consequence the draft report for February (and consequently the final report) will be circulated with some delay.
It was discussed that ATLAS and CMS should review the list of affected sites and correct the data manually. The MONIT team is to be contacted if a manual fix is not possible.
  • Middleware:
    • ARC CE plans:
      • As of March 9, update HTCondor-G for SAM ETF and in pilot factories
      • When things look OK there, sites will be asked to update to 6.10.1
  • Networks: NTR
  • Security: NTR

AOB:

Topic revision: r23 - 2021-03-01 - KateDziedziniewicz
 