WLCG Operations Coordination Minutes, June 2, 2022
Highlights
Agenda
https://indico.cern.ch/event/1166315/
Attendance
- local:
- remote: Christoph (CMS), David B (IN2P3-CC), David C (ATLAS), David M (FNAL), Edoardo (networks), Eric (IN2P3), Julia (WLCG), Maarten (ALICE + WLCG), Masahiko (Tokyo), Miltiadis (WLCG), Nikolay (monitoring), Petr (Prague + ATLAS + WLCG), Raffaele, Renato (EGI), Romain, Sebastian (GoeGrid), Shawn (AGLT2 + networks), Stephan (CMS), Thomas (DESY), Xin (BNL)
- apologies:
Operations News
- the next meeting is planned for July 7
Special topics
WLCG site network monitoring
see the presentation
Discussion
- Edoardo:
- what is requested looks feasible, at least the documentation
- is the total traffic to be reported, or should it be separated?
- Shawn:
- initially we would just like to see the totals, as that is easier
- in the future we could e.g. have errors and discards added,
differentiate between experiments, network types etc.
- CRIC can be made to support further substructures
- Petr:
- might a push model be considered instead of the pull model?
- Shawn:
- the pull model looks easier to implement at this time
- a push model can be considered for the future
- Maarten:
- some site admins may feel uncomfortable about providing some of
the information you are requesting, for security reasons
- they may not want to expose network details, equipment used etc.
- though the info should never be world-readable in CRIC,
it might get exposed due to a configuration mistake or bug
- you would need to open a ticket per site and see how far
they are willing to go in these matters
- Shawn:
- most of the information is optional
- the more info is provided, the more we can help with network issues
- sensitive information will be protected
- mandatory information is not sensitive
- we can make adjustments depending on the feedback from sites
- Thomas:
- DESY-ZN is connected to the grid via DESY-HH:
can such an arrangement be handled?
- Shawn:
- yes: DESY-ZN would just report the traffic numbers they see,
whereas DESY-HH reports the totals for both sites together
- AGLT2 is a similar case: each of its 2 constituent sites reports
external traffic totals that include the inter-site traffic
- Edoardo:
- do you have plans to show the in/out counters somewhere?
- Shawn:
- the plan is to get them into MONIT
- the displays would somewhat resemble the ESnet dashboard,
as shown on page 15 of the attached presentation
- Edoardo:
- will there be links to perfSONAR instances in CRIC?
- Shawn:
- the CRIC devs are augmenting the schema to allow perfSONAR
instances to be registered
- we would like CRIC to become the source of truth for perfSONAR
- Julia:
- it is WIP at the prototype stage
- Julia:
- Victoria provided feedback on this meeting's Twiki page:
what do people think of their suggestions?
- Shawn:
- a central repository at CERN has been considered,
but the difficulty is in the authN/authZ of site admins
- we may look further into that approach later
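The DESY-ZN / DESY-HH arrangement discussed above implies some arithmetic when reported totals are combined: if DESY-HH reports the totals for both sites together, adding DESY-ZN's numbers on top would count that traffic twice. A minimal sketch of the subtraction follows. It is an illustration only: the `TrafficReport` structure, the `own_traffic` helper and the numbers are hypothetical and were not presented at the meeting, and the sketch assumes both reports cover the same time interval.

```python
from dataclasses import dataclass


@dataclass
class TrafficReport:
    """Hypothetical per-site report: total bytes in and out over one interval."""
    site: str
    bytes_in: int
    bytes_out: int


def own_traffic(parent: TrafficReport, children: list[TrafficReport]) -> TrafficReport:
    """Return the parent's traffic with the downstream sites' totals subtracted.

    Assumes the parent's totals already include the traffic of every child
    and that all reports cover the same interval.
    """
    bytes_in = parent.bytes_in - sum(c.bytes_in for c in children)
    bytes_out = parent.bytes_out - sum(c.bytes_out for c in children)
    if bytes_in < 0 or bytes_out < 0:
        raise ValueError("child totals exceed parent totals")
    return TrafficReport(parent.site, bytes_in, bytes_out)


# Made-up numbers, in arbitrary units, purely for illustration.
zn = TrafficReport("DESY-ZN", bytes_in=40, bytes_out=10)
hh = TrafficReport("DESY-HH", bytes_in=100, bytes_out=30)  # includes DESY-ZN
hh_only = own_traffic(hh, [zn])
```

With these made-up numbers, `hh_only` holds the part of the DESY-HH totals that is not attributable to DESY-ZN. The AGLT2 case is different, since each constituent site reports totals that include inter-site traffic, and it is not covered by this sketch.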
Middleware News
- Useful Links
- Baselines/News
- A CE token support campaign was launched yesterday
- All EGI sites supporting at least one of ALICE, ATLAS or LHCb were ticketed
- CMS Operations had already launched such a campaign for all CMS sites
- The sites were asked:
- to upgrade their CEs to versions compatible with the use of tokens,
- to configure the tokens of ATLAS and/or CMS as needed, and
- for ARC CEs, to enable the REST interface
- Each experiment needs to check for itself whether the CEs at its sites actually work
- The tickets can be used to convey observations or requests
- The WLCGBaselineTable has been updated accordingly
Discussion
- Julia:
- what about ATLAS sites on OSG?
- Maarten:
- OSG is taking care of them
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Mostly business as usual
- CVMFS problem affected 2 sites in Japan and Thailand
- For a few weeks they often got stuck at the same obsolete revision
- The Stratum-1 service at ASGC had run out of disk space (fixed)
- Only a few NorduGrid sites still run the AliEn legacy MW
- To be switched to the JAliEn stack in the next weeks
ATLAS
- Smooth running with 500-700k cores including 200k from HPC
- Ran a lot of multicore event generation in preparation for Run 3 simulation which should start soon
- Some issues with ADC Oracle databases: incidents on 4 May and 27 May, and a read-only replica out of sync for 10 days
- Deletion campaigns on disk and tape have freed ~30PB from each
- Next week we plan to remove SRM and GridFTP disk endpoints for TPC from CRIC and Rucio (SRM will be kept for tape and may need to be kept for space reporting for some sites)
- Only one US T3 site still relies on GridFTP and will be cut off
CMS
- Overall rather smooth operations
- Good CPU usage: ~370k cores (usual split 75% production, 25% user)
- Significant contribution from HPC: 20-50k cores
- WebDAV transition: all Tier-1 and Tier-2 sites OK. Most Tier-3 sites are ready or in production; we expect three more sites to be ready very soon
- Still waiting on a Python3 version of HammerCloud
- First round of integrating 2022 pledges completed
LHCb
Site input for Network monitoring discussion
CA-VICTORIA-WESTGRID-T2
Suggest using git, twiki, or some other tool suitable for documentation, to store, serve and manage the site network information templates in one place, instead of standalone webservers/webpages for each site.
Task Forces and Working Groups
GDPR and WLCG services
- Julia:
- the latest version of the WLCG Privacy Notice can be customized
- it is available from the WLCG document repository: WLCGPrivacyNotice2022.pdf
- sites will be asked to publish either a locally defined privacy notice
or the WLCG Privacy Notice, possibly customized as needed
Accounting TF
dCache upgrade TF
Progress is tracked via twiki page
Information System Evolution TF
IPv6 Validation and Deployment TF
Detailed status here.
Monitoring
- the May GDB had 2 talks featuring monitoring:
- WLCG Monitoring Task Force update
- Tape Challenge Results (conclusions and future directions)
Network Throughput WG
WG for Transition to Tokens and Globus Retirement
- A CE token support campaign was launched, as reported under the Middleware News
Action list
Specific actions for experiments
Specific actions for sites
AOB