WLCGOpsMinutes230504 (2023-05-31, ConcezioBozzi)
<font size="6"> %RED% *DRAFT* %BLACK% </font>

---+!! WLCG Operations Coordination Minutes, May 4, 2023

%TOC{depth="4"}%

---++ Main points

   * [[LCG/WLCGOpsMinutes230504#Impact_on_experiment_operations][CERN Grid CA expiration incident]]
   * [[LCG/WLCGOpsMinutes230504#RAL_LCG2_Network_Outage_October][RAL-LCG2 Network Outage October 2022]]
   * [[LCG/WLCGOpsMinutes230504#Discussion_AN2][HTCondor CE v6 accounting concern]]

---++ Agenda

https://indico.cern.ch/event/1282064/

---++ Attendance

   * local:
   * remote: Alastair (!RAL), Alessandra (Napoli), Eric (!IN2P3), Julia (WLCG), Maarten (ALICE + WLCG), Mario (ATLAS), Matt (Lancaster), Panos (WLCG), Stephan (CMS), Thomas (DESY)
   * apologies:

---++ Operations News

   * on Sat April 22, the CERN SSO was unavailable for about 2.5h starting at 13:20 CEST
      * at that time the old CERN Grid CA certificate expired
      * though the currently valid certificate had been available for a year, there were still a number of places where the old one was used
         * the SSO was one such example (OTG:0076975)
      * there was a lot of fall-out, some of which is reported below
      * a comprehensive service incident report is being worked on
      * a presentation about the matter is foreseen for our next meeting
   * our next meeting is planned for June 1st

---++ Special topics

---+++ RAL-LCG2 Network Outage October 2022

see the [[https://indico.cern.ch/event/1282064/#9-ral-network-service-incident][presentation]]

---++++ Discussion

   * Stephan:
      * 3 out of 4 switches failed - what was the correlation?
   * Alastair:
      * the Z9100 going down triggered crashes of the 2 others
      * as that HW was old, such trouble could have happened at any time
      * the DB switch was close to dying as well
      * we therefore decided to move all affected links to newer HW
   * Julia:
      * was it necessary to make all the changes in a single day?
   * Alastair:
      * originally, a plan in stages was foreseen, to minimize the risks
      * given the perilous state of the legacy network, however, it was decided to do all the operations in one go instead

---++ Middleware News

   * Useful Links
      * WLCGBaselineTable
      * Baselines/News

---++ Tier 0 News

---++ Tier 1 Feedback

---++ Tier 2 Feedback

---++ Experiments Reports

---+++ ALICE

   * Moderate activity levels, no major issues
   * Tokens for normal jobs are already in use at most HTCondor CE sites
   * Continuing to switch sites from single- to multicore jobs
      * ~85% of the resources already support 8-core or whole-node jobs

---+++ ATLAS

   * Smooth running with 500-750k slots on average, with lots of Full Simulation
   * CERN Grid CA certificate expiration
      * Ticketing and communication left a lot to be desired
      * ATLAS operations people (WFMS & DDM) had to investigate and put workarounds in place during the weekend to stabilise the situation
      * Had to insist on the release of a new RPM on Monday instead of waiting for another week
   * 2022 data reprocessing coming soon

---+++ CMS

   * recording collision data
   * overall smooth running, no major issues
   * good core usage between 350k and 450k cores
      * significant HPC/opportunistic contributions
   * usual production/analysis split of about 3:1
   * main production activity Run 2 ultra-legacy Monte Carlo
   * messy CERN Grid CA certificate expiration/update
      * various CMS services down for up to three and a half days
      * it seems the yum repos of old OSes were not updated and an RPM was missing a bundle update step
   * waiting on the python3 version/port of HammerCloud
   * working with our DPM sites to migrate to other storage technologies
   * token migration progressing steadily
      * working with native xrootd sites to enable IAM-issued token support
      * looking forward to 24x7 production IAM support by CERN

---++++ Discussion

   * Stephan:
      * also the !InCommon CA will expire this year
   * Maarten:
      * it is normal for a number of CAs to expire every few years
      * new certificates normally are made available at least 1 month in advance
      * some of our services then need to be restarted to pick them up
         * examples include versions of dCache, !StoRM, Argus and !VOMS-Admin
         * all using older versions of the =canl-java= library
      * the CERN Grid CA is used for T0 services and thus impacts a lot more
   * Maarten:
      * what were those old OSes?
   * Stephan:
      * e.g. !CentOS Stream 8
   * Maarten:
      * will check what can be done there

---+++ LHCb

---++ Impact on experiment operations of the CERN Grid CA expiration incident

   * the related __SSO outage__ is described [[LCG/WLCGOpsMinutes230504#Operations_News][above]]

---+++ ALICE

   * Data taking:
      * Only impacted by the SSO being dysfunctional: authentication for controls, monitoring, logging and documentation.
      * Last-resort workarounds needed to be applied.
      * If the beam had been lost, there might have been further fall-out.
      * Fortunately, Mattermost kept working, as well as unauthenticated Zoom.
   * Grid operations:
      * VOboxes at some sites had outdated CAs and failed in various ways.
      * CVMFS had outdated CAs in several places; these were quickly fixed.
      * Users had outdated CAs in local analysis installations, for which recipes were subsequently provided by experts.
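The outdated-CA problems reported above can be spotted locally before they bite. A minimal sketch, assuming the conventional IGTF certificate directory (the path and the =.pem= naming are site conventions; adjust as needed):

```shell
# check_expired DIR: list CA certificates under DIR that have expired.
# /etc/grid-security/certificates is the conventional location on grid
# hosts; adjust if your site installs the CA bundle elsewhere.
check_expired() {
    for cert in "$1"/*.pem; do
        [ -e "$cert" ] || continue   # pattern matched nothing
        # -checkend 0 exits non-zero if the certificate has already expired
        if ! openssl x509 -in "$cert" -noout -checkend 0 >/dev/null; then
            echo "EXPIRED: $cert"
            openssl x509 -in "$cert" -noout -subject -enddate
        fi
    done
}

check_expired "${CERTDIR:-/etc/grid-security/certificates}"
```

Running this regularly (or with =-checkend= set to, say, 30 days' worth of seconds) would have flagged the old CERN Grid CA certificate well before April 22.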
---+++ ATLAS

   * CERN Grid CA certificate expiration
      * Ticketing and communication left a lot to be desired
      * ATLAS operations people (WFMS & DDM) had to investigate and put workarounds in place during the weekend to stabilise the situation
      * Had to insist on the release of a new RPM on Monday instead of waiting for another week

---+++ CMS

   * messy CERN Grid CA certificate expiration/update
      * various CMS services down for up to three and a half days
      * it seems the yum repos of old OSes were not updated and an RPM was missing a bundle update step

---+++ LHCb

   * Online
      * SSO unavailability affected users who wanted to connect to web services
      * no other major issues
   * Offline
      * little impact
      * some web pages started to fail, quickly fixed

---++ Task Forces and Working Groups

---+++ GDPR and WLCG services

   * [[GDPRandWLCG][Updated list of services]]

---+++ Accounting TF

   * the APEL client supporting the HEPScore benchmark should be ready for testing next week
   * the APEL server with aggregation by benchmark name is supposed to be released in the week of May 22nd

---+++ Information System Evolution TF

   * NTR

---+++ DPM migration to other solutions

   * 13 tickets out of 45 are solved. Some of the sites which migrated still need to enable SRR and fix their information in CRIC

---+++ HEPScore in production

---+++ IPv6 Validation and Deployment TF

Detailed status [[WlcgIpv6#IPv6Depl][here]].

---+++ Monitoring

---+++ Network Throughput WG

%INCLUDE{ "NetworkTransferMetrics" section="04052023" }%

---+++ WG for Transition to Tokens and Globus Retirement

   * Further progress with the [[CEtokenSupportCampaign][CE token support campaign on EGI]]
      * 130 of 133 tickets have already been solved
      * only a few small sites remain

---++++ Discussion

   * Maarten:
      * in the next months we will need to run another campaign to get all HTCondor CEs upgraded to =v6= using HTCondor =10.x=
   * Thomas:
      * has APEL been made to work with those newer versions?
      * mind that it currently relies on =x509= variables being set for jobs
   * Maarten:
      * good point, we will need to see what can be done by when
      * unfortunately only __X509__ attributes are considered __by default__ (updated May 11):
         * __VOMS__ attributes are no longer made available in =10.x= __by default__ (ditto)
         * in particular, the __VO__ is no longer set __by default__ (ditto)
   * Julia:
      * we will discuss this matter with the APEL team
   * Maarten:
      * will bring this up also in our meeting with the HTCondor devs and EGI tomorrow
   * Stephan:
      * also ARC CEs need to be considered in this respect

---++ Action list

%INCLUDE{ "WLCGOpsCoordActionList" }%

---++ AOB
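The accounting concern discussed above can be checked on a given CE by looking for the VOMS-derived attribute in the HTCondor job history. A rough sketch, assuming the standard classad attribute name =x509UserProxyVOName= and a common (but site-dependent) history file location:

```shell
# vo_coverage FILE: report how many job records in a HTCondor history
# file carry the VOMS-derived VO attribute that accounting relies on.
# The attribute name is the standard HTCondor one; the history file
# location used below is a common default and may differ per site.
vo_coverage() {
    jobs=$(grep -c '^ClusterId' "$1")
    with_vo=$(grep -c '^x509UserProxyVOName' "$1")
    echo "$jobs job records, $with_vo with x509UserProxyVOName"
}

HISTORY=${HISTORY:-/var/lib/condor/spool/history}
if [ -r "$HISTORY" ]; then
    vo_coverage "$HISTORY"
fi
```

A large gap between the two counts on a =10.x= CE would confirm that jobs are arriving without the VO attribute, i.e. their accounting records would lack the VO field unless the site re-enables VOMS attribute handling.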