Week of 220822

WLCG Operations Call details

  • For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
    • The pass code is provided on the wlcg-operations list.
    • You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • Whenever a topic that requires information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to the wlcg-scod list (at cern.ch), so that the SCOD can make sure that the relevant parties have time to collect the required information and can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local:
  • remote: Christoph (CMS), Daniele (CNAF), Darren (RAL), Dave M (FNAL), David C (ATLAS), Doug (BNL), Julia (WLCG), Laurence (computing), Maarten (ALICE + WLCG), Marian (networks), Mark (LHCb + Birmingham), Ofer (BNL), Onno (NLT1), Xavier (KIT)

Experiments round table:

  • ATLAS reports (raw view) -
    • Synchronisation of VOMS information with Rucio was not working for several weeks due to cleanup of expired accounts. This meant new users could not use Rucio without manual registration. The correct permissions were reinstated last Thursday and the certificate owner will be changed soon.
      • Note: change of certificate ownership means change of DN
    • A small fraction of data transfers from CERN to T1s fail due to slow rates and/or timeouts; this is being investigated with the EOS team

  • CMS reports (raw view) -
    • In the middle of last week, stage-out of user analysis jobs submitted via CRAB had trouble due to an incompatible change in the CMS Rucio deployment
      • No (direct) impact on sites
      • Some inconvenience for CMS users though
    • Christoph: heard of packet losses seen on the link
      from CERN to the US, reported by an admin at Purdue

  • ALICE
    • NTR

  • LHCb reports (raw view) -
    • Issues:
      • RAL:
        • New hardware added to help with slow gateways problem (GGUS:156492)
        • Echo deletion problems (GGUS:155120) - an issue with the XRootD proxy serialising transfers has been identified and is being worked on
    • Timeout issues at SARA found to be due to limit on simultaneous connections (GGUS:153653)
      • Proposed solution: increase the number of allowed connections and use the '-n' flag for hadd
        • the potential risks of a significant increase in the number
          of concurrently allowed connections were discussed
        • it would seem much better for users of the hadd application
          to specify a reasonable number instead, e.g. 10 or 20
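As an illustration of the hadd workaround discussed above, the following is a minimal sketch, not taken from the ticket, of how a user script might cap the number of input files that hadd opens at once via its '-n' option. The helper name build_hadd_command, the storage URL and the file names are hypothetical and only serve the example; the value 10 follows the suggestion in the discussion.

```python
import shlex


def build_hadd_command(target, sources, max_open_files=10):
    """Return an hadd argument list that limits simultaneously open inputs.

    Hypothetical helper for illustration; only the hadd '-n' option itself
    comes from the discussion in the minutes.
    """
    if max_open_files < 0:
        raise ValueError("max_open_files must be non-negative")
    if not sources:
        raise ValueError("at least one source file is required")
    # '-n N' asks hadd to open at most N input files at once, so that it
    # does not hold one storage connection per input file.
    return ["hadd", "-n", str(max_open_files), target, *sources]


if __name__ == "__main__":
    # Placeholder input URLs; replace with real paths before use.
    sources = [
        f"root://storage.example.org//path/to/file_{i}.root" for i in range(3)
    ]
    cmd = build_hadd_command("merged.root", sources, max_open_files=10)
    # Print the command instead of running it, so the sketch works without ROOT.
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run on a machine where ROOT is installed.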

Sites / Services round table:

  • ASGC:
  • BNL:
    • Starting Friday, there were transfer failures from tape because we disabled the dCache GridFTP doors while FTS was still using GridFTP as a protocol.
    • Two-day tape outage in December (19-Dec to 20-Dec) for an HPSS upgrade.
  • CNAF: NTR
  • EGI:
  • FNAL: NTR
  • IN2P3:
  • JINR: NTR
  • KISTI:
  • KIT: NTR
  • NDGF:
  • NL-T1:
    • SARA-MATRIX's tape system was stuck from Saturday night until this morning.
    • The root/hadd problem is understood. Hadd opens all input files at once, which in this case led to queued transfers. A `-n` parameter works around this problem. GGUS:153653
  • NRC-KI:
  • OSG:
  • PIC:
  • RAL: NTR
  • TRIUMF: NTR

  • CERN computing services: NTR
  • CERN storage services:
  • CERN databases:
  • GGUS: NTR
  • Monitoring:
  • Middleware: NTR
  • Networks: ESnet upgrades at CERN will take place on 12-13 September (please check details at OTG:0071902, OTG:0071903)
  • Security: An advisory concerning a high-risk vulnerability was sent last Monday afternoon.

AOB:

  • Xavier:
    • the storage space accounting at KIT is hampered by an
      issue with SRR: is it understood how we can move forward?
  • Julia:
    • while the SRR link is provided via CRIC, the WSSA service
      needs more information mainly for CMS dCache instances,
      to know which storage areas must be taken into account
    • WSSA therefore has been using its own configuration instead
    • the matter has been discussed between experts and we think
      a proper solution can be implemented in the near future
Topic revision: r14 - 2022-08-28 - AndrewWong
 