Week of 210906
WLCG Operations Call details
- For remote participation we use Zoom: https://cern.zoom.us/j/99591482961
- The pass code is provided on the wlcg-operations list.
- You can contact the wlcg-ops-coord-chairpeople list (at cern.ch) if you do not manage to subscribe.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- Whenever a particular topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information, or can invite the right people to the meeting.
Best practices for scheduled downtimes
Monday
Attendance:
- local:
- remote: Alberto (monitoring), Andrew (TRIUMF), Chien-De (ASGC), Christoph (CMS), Daniel (security), Daniele (CNAF), Dave (FNAL), David B (IN2P3-CC), Maarten (ALICE + WLCG), Mark (LHCb), Mihai (FTS), Pepe (PIC), Peter (ATLAS), Priyanshu (computing)
Experiments round table:
- ATLAS reports
- CERN: LDAP/eGroup sync problem last week; the impact included AIADM login failures and the Rucio cron being unavailable. No direct impact on grid production.
- CMS reports
- Nothing major to report
- LDAP sync issue at CERN at the beginning of last week
- Led to some failures in CMS internal site availability evaluation
- ALICE
- Very high activity, no issues
- New record reached: 191k concurrent jobs
- LHCb reports
- Activity:
- MC and WG productions; User jobs.
- Issues:
- RAL: We found a workaround for the transfer issues at RAL. The problems were caused by a script sourced on the RAL WNs that sets GFAL (and other) environment variables (GGUS:153532)
Sites / Services round table:
- ASGC: NTR
- BNL: US national holiday (Labor Day). Last week (02-Sep 18:00 to 03-Sep 20:00, 26 hrs) ATLAS wrote to tape at 2.1 GB/s, with peaks up to 7 GB/s. This filled the HPSS disk cache; the site had to be temporarily blacklisted and the number of write drives was increased from 4 to 10. The rate is close to the Run 3 requirement of 2.2 GB/s. Working with ATLAS DDM Ops to understand whether this will be the norm or was exceptional.
- CNAF: NTR
- EGI:
- FNAL: NTR
- IN2P3: IN2P3-CC will be in maintenance on Tuesday, September 21st. As usual, details will be available one week before the event.
- JINR:
- KISTI:
- KIT:
- NDGF:
- NL-T1:
- Unable to attend; apologies!
- Sara-matrix did an emergency core router replacement last Thursday. The old core router had memory issues which caused traffic problems (affecting LHCOPN and LHCONE, including Nikhef's WLCG connectivity). The other core router took over most, but not all, of the traffic during the maintenance. We hope the network is more stable now.
- NRC-KI:
- OSG:
- PIC: NTR
- RAL: NTR
- TRIUMF: NTR
- CERN computing services: Migrate ATLAS T0 to bare metal; Migrate CMS T0 to bare metal
- CERN storage services:
- CERN databases:
- GGUS: NTR
- Monitoring:
- SiteMon draft reports for August sent to WLCG office.
- Middleware:
- A campaign will be launched shortly for sites to deploy additional LSC files to support new VOMS servers of the LHC experiments
- Networks: NTR
- Security: NTR
AOB: