Week of 220502
WLCG Operations Call details
- For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
- The pass code is provided on the wlcg-operations list.
- You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to the wlcg-scod list (at cern.ch) to allow the SCOD to make sure that the relevant parties have the time to collect the required information, or to invite the right people to the meeting.
Best practices for scheduled downtimes
Monday
Attendance:
- remote: Kate (DB, chair), Julia (WLCG), Maarten (ALICE, WLCG), Onno (NL-T1), Alberto (monit), Pinja (sec), Jiri (ATLAS), Christoph (CMS), Borja (monit), Andrew (TRIUMF), Dave (FNAL), Daniele (CNAF), Federico (LHCb), Douglas (BNL), Ville (NDGF)
Experiments round table:
- CMS reports (raw view)
- IGTF update 1.116 introduced a change for the CERN CA, causing trouble for various Java-based services (due to a buggy upstream library)
- SAM tests started to fail on the CMS dCache instance at KIT after installation. The failures went away after a restart. (No GGUS ticket; solved on the spot via MM.)
- Similar issue in Bari (GGUS:156985); not 100% clear whether it is really related to the IGTF update
- VOMS servers at CERN had trouble with CERN certificates: RQF:2026342
- ALICE
- added after the meeting: myproxy.cern.ch is broken (GGUS:157130)
Sites / Services round table:
- CERN computing services:
- Renewal of the CERN Grid Certification Authority certificate: OTG:0070386
- CERN storage services:
- CERN databases: NTR
- GGUS: NTR
- Monitoring:
- Distributed draft SiteMon availability/reliability reports for April 2022
- Middleware:
- CERN Grid CA renewal saga
- IGTF version 1.116 was released Monday last week to make a new CERN Grid CA certificate available with its lifetime extended by another 10 years
- The new certificate was put into production this morning
- Despite careful testing of the new certificate beforehand, several incidents occurred with instances of Java services that make use of the canl-java security library:
- dCache
- StoRM
- VOMS-Admin
- Argus
- Except when such a service uses the latest canl-java (e.g. the most recent dCache versions), such services all need to be restarted to get into a state that is consistent with clients using the new IGTF release
- VOMS-Admin at CERN was restarted Thu evening last week, earlier than foreseen, in response to tickets from ATLAS and CMS
- Argus services at CERN were restarted this morning
Douglas asked which dCache versions do not require the upgrade.
Maarten confirmed that the 7 family is OK; for previous versions a backport was planned, but its status is unknown.
- Networks: NTR
- Security: Advisory sent: up to critical risk vulnerability
Maarten remarked that services using recent Java versions are affected.
AOB: