WLCGOpsMinutes220127 (2022-01-28, MaartenLitmaath)
<!-- --> <font size="6"> %RED% *DRAFT* %BLACK% </font> <br /><br /> <!-- -->

---+!! WLCG Operations Coordination Minutes, Jan 27, 2022

%TOC{depth="4"}%

---++ Highlights

   * The pre-GDB on operational effort and possible optimization will take place on February 24, 3:00-5:30 PM CET.
   * [[LCG/WLCGOpsMinutes220127#XRootD_monitoring][XRootD monitoring progress]]
   * [[LCG/WLCGOpsMinutes220127#Operations_News][IAM service support status]]
   * [[LCG/WLCGOpsMinutes220127#Network_Throughput_WG][perfSONAR v4.4.2 contains important bug fixes]]

---++ Agenda

https://indico.cern.ch/event/1120678/

---++ Attendance

   * local:
   * remote: Alberto (monitoring), Alessandra D (Napoli), Alessandra F (ATLAS + Manchester + WLCG), Andrea (WLCG), Borja (monitoring), Christoph (CMS), Concezio (LHCb), Danilo (CMS), David Cameron (ATLAS + ARC), David Cohen (Technion), Derek (Nebraska + monitoring), Enrico (CNAF), Eric (!IN2P3), Federica (CNAF), Horst (Oklahoma), James (CMS), Julia (WLCG), Lucia (CNAF), Maarten (ALICE + WLCG), Marian (networks + monitoring), Matt (Lancaster), Max (KIT), Mihai (FTS), Miltiadis (WLCG), Nikolay (monitoring), Panos (WLCG), Petr (ATLAS + Prague), Shawn (MWT2 + networks), Stephan (CMS), Thomas (DESY), Xin (BNL + WLCG)
   * apologies:

---++ Operations News

   * The next meeting is planned for March 3
   * IAM service support
      * IAM has been added to the [[LCG/WLCGCritSvc#CERN_IT_services][table]] of critical services at CERN
         * Its =urgency= and =impact= numbers have been copied from the !VOMS row
         * They are expected to change in the course of this year
      * The details for submitting a ticket have been added to these pages:
         * [[WLCGggusSNow][Twiki]] version (source)
         * SNow [[https://cern.service-now.com/service-portal/?id=kb_article&n=KB0007528][KB article]] (copy)
      * For the next few months, the support level is closer to *8/5* than *24/7*
         * That is expected to improve in the course of this year
      * It implies we should not yet rely on very short-lived tokens
         * Rather, imitate what is done with !VOMS proxies for now

---+++ Discussion

   * Petr:
      * in ATLAS we will need to use 96h tokens for now
      * ETF will need to do the same
   * David Cameron:
      * is the current IAM service support level consistent with <br/> the phaseout of X509 proxies in OSG by the end of Feb?
   * Maarten:
      * that is for ATLAS to decide, but IMO it is OK to use 96h tokens, <br/> because they are only used for job _submission_ to HTCondor CEs; <br/> they are not delegated to the jobs, i.e. not spread over the grid
         * to some extent better than what we have today with !VOMS proxies
      * _data management_ tokens used by jobs will need much shorter lifetimes, <br/> but we are not close to having those in any production workflows
      * developments in that area are discussed in the [[WLCGAuthorizationWG][Authorization WG]]
   * Thomas:
      * is there documentation on how to map tokens based on which identifiers?
   * Maarten:
      * yes, but it is probably not quickly found via the Authorization WG docs
         * we will follow up
      * currently such mappings mostly matter to HTCondor CEs

---++ Special topics

---+++ XRootD monitoring

See the [[https://indico.cern.ch/event/1120678/#7-xrootd-monitoring][presentation]].

---++++ Discussion

   * Danilo:
      * page 15: why are there loops between the Collectors and the MQ instances?
      * what is the expected timeline for implementing all this?
   * Derek:
      * the Shoveler uses the message bus to send data to its Collector reliably. <br/> These are raw messages, not aggregated for a single transfer event; <br/> the content of the message is exactly the same as the UDP packet content.
      * the Collector parses and aggregates the data and sends the results <br/> to its consumers also via that message bus. A different topic will be used, <br/> not the same as for the raw messages.
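The pass-through behaviour described above can be sketched as follows. This is a minimal illustration, not the actual Shoveler implementation: a =queue.Queue= stands in for the real message broker, and the topic names are hypothetical.

```python
import queue

# Hypothetical topic names; the real broker topics may differ.
RAW_TOPIC = "xrootd.monitoring.raw"

def shovel(datagram: bytes, bus: queue.Queue) -> None:
    """Relay one XRootD monitoring UDP datagram to the message bus.

    No parsing or aggregation happens here: the message content is
    exactly the UDP packet content, published on the raw-message topic.
    """
    bus.put((RAW_TOPIC, datagram))

def collect_raw(bus: queue.Queue, n: int) -> list:
    """Toy Collector input side: drain n raw messages for later
    parsing and aggregation on a separate, aggregated-data topic."""
    return [bus.get() for _ in range(n)]
```

The point of the sketch is the separation of concerns: the Shoveler only adds reliable delivery, while all parsing and aggregation stays in the Collector, which republishes its results under a different topic.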
      * Shovelers are expected to work at ~all OSG sites by ~end of Feb
   * Borja:
      * most of the remaining development work for the EGI/WLCG side is <br/> expected to be finished by ~end of Feb
      * testing would then start as of March
   * Christoph:
      * can the Shoveler also be used by dCache instances?
   * Derek:
      * that depends on the message format
      * dCache can send !XRootD-compatible messages, AFAIU
         * to be tested
   * Alessandra F:
      * the dCache devs will need to be contacted about this
   * Julia:
      * will do
   * Borja:
      * there also is !MonALISA to be considered in this respect
      * all senders will have to stick to the schema that we agree on; <br/> dCache reports are not expected to be the same as the raw messages sent by the Shoveler
   * Maarten:
      * when the various components are ready for deployment, <br/> we will need to get them into the right repositories
   * Julia:
      * the Shoveler would be part of !XRootD releases
      * dCache have their own releases
      * the Collector could be added to the WLCG repository
   * David Cameron:
      * will we be able to see mappings of transfers to sites?
   * Borja:
      * yes, the usual Monit functionality for transfers
   * Alessandra F:
      * the aim is to have consistency in Monit for all transfers
   * Andrea:
      * can the use of IPv6 be tracked?
   * Derek:
      * yes, it is an attribute in the aggregated data

---++ Middleware News

   * Useful Links
      * WLCGBaselineTable
      * Baselines/News

---++ Tier 0 News

---++ Tier 1 Feedback

---++ Tier 2 Feedback

---++ Experiments Reports

---+++ ALICE

   * Normal activity levels on average in the last 8 weeks
   * No major issues
   * Site VOboxes are being switched from legacy =AliEn= to =JAliEn=
      * Most sites and ~85% of the resources are done
      * Progress can be tracked [[http://alimonitor.cern.ch/stats?page=proxies][here]]
      * Issues at a few of the remaining sites are being followed up
   * =JAliEn= is needed for Run-3 multi-core jobs
      * Most sites should eventually only see 8-core jobs
         * Some already do, some others receive a mix
      * Such jobs can also run up to 8 single-core (legacy) tasks
      * For each task, Singularity is tried from CVMFS
         * If that fails, a local system installation is tried
         * If that fails, the task is run in the classic way

---+++ ATLAS

   * Smooth running over the Xmas break and in the last few weeks, with 700-800k slots
   * Main activities: Run 2 data and MC reprocessing
      * Including running MC reprocessing on 50% of the HLT farm
   * Another CA update (SlovakGrid) means all dCache services need to be restarted <br/> (same as with the Swiss and Brazilian updates last year)
   * Switch to the IAM VOMS server seemed to go smoothly; <br/> it required some cleanup of tools still using legacy (non-RFC) proxies
   * AGIS servers were shut down last week
   * Problems with slow transfers to and from RAL, hard to debug (GGUS:154436)
   * SRR storage reporting is shaky, especially at dCache sites. <br/> Several times storage got full because the SRR was not up to date.
   * Planning a Run 3 commissioning data transfer test ~end of Feb / beginning of March, <br/> involving the full T0 and export to T1 tapes

---+++ CMS

   * running smoothly with 300-350k cores
   * no significant issues during the holidays
      * transatlantic Fermilab--CERN link down to the tertiary, 20 Gbit/s, link during most of the holidays
         * waiting for the last pieces of information about the chain of responsibility <br/> linked to the machine that failed at CERN, causing the issue
      * internal saturation for analysis jobs during the holidays
         * traced to a large number of jobs with a low number of sub-jobs
   * usual production/analysis split of 3:1
   * HPC allocations contributing up to 30k cores
      * 2021 allocations for machines in the US were all consumed well before the end of the allocation period
   * production activity mainly Run 2 ultra-legacy Monte Carlo
   * re-reconstruction of parked B data, 11B of 12B processed
   * SRM+WebDAV commissioning at Tier-1 sites started
   * accidental deletion of SAM/HC datasets in the middle of December
      * at about a third of the sites
      * all files restored by the middle of January
   * HammerCloud instabilities for several weeks
      * job status queries failed, causing multiple jobs and an empty status page; corrected
      * no new jobs being submitted for a series, still being investigated
   * working on updating HC jobs for Run 3 software/input datasets
   * big thanks to all sites contributing above the pledge!
      * this is much appreciated while sites struggle to get new machines
      * a very welcome boost of the CMS physics program

---+++ LHCb

   * smooth running over the Xmas break and in January
      * using 140-160k cores
   * re-processing campaign of Run 2 ended on Dec 31st, 2021 (!)
   * simulation jobs at 95%, user jobs at 5%
   * some data movements / replicas to
      * recover disk space at PIC
      * deal with the long-term downtime of storage at CBPF
   * planning tape throughput tests at the Tier 0
      * tape read tests at the Tier 1 sites should also be planned

---++ Task Forces and Working Groups

---+++ GDPR and WLCG services

   * [[GDPRandWLCG][Updated list of services]]
   * A review of the status of publishing the CERN RoPOs for WLCG services hosted at CERN, <br/> and the WLCG Data Privacy Notice for other WLCG services, was given at the January WLCG MB. <br/> We were asked to go ahead and to accelerate this process. WLCG Ops Coordination will follow up.

---+++ Accounting TF

---+++ dCache upgrade TF

---+++ Information System Evolution TF

   * WLCG CRIC has been bootstrapped with initial information for the network topology. <br/> Tickets will be submitted against the sites, asking them to validate this data.

---+++ IPv6 Validation and Deployment TF

Detailed status [[WlcgIpv6#IPv6Depl][here]].

---+++ Monitoring

   * The kick-off meeting of the Monitoring Task Force took place on January 13. <br/> The main directions of work were agreed on. <br/> The Jira project WLCGMONTF has been created to follow up on the progress.

---+++ Network Throughput WG

%INCLUDE{ "NetworkTransferMetrics" section="27012022" }%

---+++ WG for Transition to Tokens and Globus Retirement

   * Progressing via the Authorization WG [[https://indico.cern.ch/category/68/][meetings]]

---++ Action list

%INCLUDE{ "WLCGOpsCoordActionList" }%

---++ AOB

-- Main.JuliaAndreeva - 2022-01-25