Week of 231120
WLCG Operations Call details
- The connection details for remote participation are provided on this agenda page.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to the wlcg-scod list (at cern.ch) to allow the SCOD to make sure that the relevant parties have the time to collect the required information, or invite the right people at the meeting.
Best practices for scheduled downtimes
Monday
Attendance:
- local: Julia (WLCG), Maarten (ALICE + WLCG), Steve M (storage)
- remote: Andrew (TRIUMF), Brian (RAL), Christoph (CMS), David B (IN2P3-CC), David M (FNAL), Doug (BNL), Jens (NDGF), Onno (NLT1), Pepe (PIC), Steve T (computing), Xavier (KIT)
Experiments round table:
- ALICE
- Normal activity on average
- KISTI unusable since Fri evening CET due to an OPN problem being looked into
- Appears to have got fixed during the meeting!
Sites / Services round table:
- ASGC:
- BNL: NTR
- CNAF:
- EGI:
- FNAL:
- David:
- starting as of the first week in the new year,
we intend to upgrade our production dCache to 9.2
- Julia:
- might you upgrade a dev or test instance still this year?
- to allow testing the new XRootD monitoring workflow already
- David:
- Doug:
- would it help if BNL upgraded with a similar timeline?
- Julia:
- FNAL are the highest priority because of the pile-up data
being served to CMS jobs that may run anywhere on WLCG
- in general, any such upgrade would be welcome, though
- Doug:
- might a message about this be sent to the WLCG ops list?
- Julia:
- we will need to discuss with DOMA colleagues what we
would like to ask for between now and DC24
- we do not want to rock the boat too much
- IN2P3: dCache will be updated to 9.2 during the next quarterly maintenance on Tuesday, December 12th.
- JINR: NTR
- KISTI:
- KIT:
- Xavier:
- just a reminder of the site downtime planned for Dec 6-7
- NDGF:
- Jens:
- the dCache upgrade to 9.2 had to be postponed
- we now expect to do that early Dec
- NL-T1: NTR
- NRC-KI:
- OSG:
- PIC:
- We were involved in the problems regarding SSL, GSI and HTCondor-CE 9.0.19 (GGUS:164007).
Today we have updated to 9.0.20 and everything seems ok.
- Problems with LHCb pilots. HTCondor/Alma9 WNs overestimate the memory used and the jobs are
put on hold due to a PIC rule. We will not update the farm to Alma9 until this is clear.
- LHCb is also doing a massive recall from tape. Staging performance issue:
reducing the number of active requests to 7k (GGUS:164197).
- RAL: NTR
- TRIUMF: NTR
- CERN computing services: NTR
- CERN storage services:
- Steve M:
- Petr Vokac helped get the ATLAS FTS registered in the ATLAS IAM
- Maarten:
- that is good news as part of the preparations for DC24
- CERN databases:
- GGUS: NTR
- Monitoring:
- Middleware:
- The HTCondor CE campaign on EGI is continuing with condor v9.0.20
- Many thanks to Jaime Frey of the HTCondor team for implementing
a very non-trivial fix for the issue encountered earlier!
- Clients still presenting VOMS proxies can be configured to have
the SSL method tried first, with fallback to GSI still working
- When all clients of a given CE are mapped through tokens or SSL,
the CE can be upgraded to HTCondor v23 (which does not support GSI)
- Example configurations will be published in the coming months
- Another campaign is foreseen in the early months of next year
- Doug:
- Would that fix have to be ported to HTCondor v23?
- Maarten:
- The fix had to be done specifically for the 9.0.20 release that
is seen as a stepping stone toward HTCondor v23.
- The problem arose due to the fact that both SSL and GSI use
the same (VOMS) credential for authentication, which is
not an issue for later releases that do not support GSI.
- Networks:
- Security:
AOB: