Week of 180129
WLCG Operations Call details
- For remote participation we use the Vidyo system. Instructions can be found here.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Portal
- Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch to make sure that the relevant parties have the time to collect the required information or to invite the right people to the meeting.
Best practices for scheduled downtimes
Monday
Attendance:
- local: Kate (chair, WLCG, DB), Julia (WLCG), Maarten (ALICE, WLCG), Vladimir (LHCb), Borja (monitoring), Alberto (monitoring), Marian (network), Paul (storage), Petr (ATLAS)
- remote: Onno (NL-T1), Xavier (KIT), Di Qing (TRIUMF), Kyle (OSG), Marcelo (CNAF), Darren (RAL), Stephan (CMS), David M (FNAL), Pepe (PIC)
Experiments round table:
- ATLAS reports -
- kernel update & reboot of CERN machines
- did not cause any major problems
- some of our services were in the same availability zone
- CERN-P1 in downtime for cooling maintenance => ~40k fewer job slots for our production
- preliminary tests with/without Meltdown & Spectre patches show a 2.5-3% performance impact for reconstruction
- RAL FTS server issues: ~1M transfers stuck
- started on Thursday, but at that time a US site in downtime was suspected (GGUS:133066)
- was not automatically blacklisted for ATLAS transfers
- on Friday we switched the ND & IT clouds to another FTS; this morning also the UK cloud
- RAL CASTOR performance issues (database reindexing needed) after a deletion backlog caused by removing 600 TB from storage
- tape buffers currently need manual blacklisting to throttle transfers in case of a full buffer
- more transfers to tape between sites because of a change in our policy (tape & disk replicas cannot be at the same site)
- CMS reports -
- No major issues
- Recovery from the Spectre/Meltdown reboots went fine
- MWGR-1 on Thursday/Friday went well; the last run continued until Monday morning
- Smooth running at about 180k cores (1/4 analysis, 3/4 production); an increase in the negotiation cycle was traced to a bad config at one of our sites
- ALICE -
- Normal to high activity level on average.
- Several periods of high job failure rates during the week.
The cause is not clear; the problem may well still be present.
There could be a network issue affecting the central services;
GGUS:133052 was opened for network experts to have a look.
- IN2P3-CC: very high job failure rates and VOBOX overloads starting
Fri evening, apparently due to some WNs having a stale CVMFS cache.
- Job submissions had to be stopped Sat evening.
- GGUS:133088 opened today.
- CNAF: the first successful jobs ran from Thu late afternoon through Sun morning.
The network team will be pinged to check the GGUS ticket.
- LHCb reports -
- Activity
- HLT farm is partially running
- 2016 data restripping, MC simulation and user jobs
- Site Issues
Vladimir asked what the best way is to follow up on a CERN cloud incident when there is no clear error or confirmation. Julia and Maarten suggested opening a GGUS ticket.
Maarten reminded everyone about the issue with the mail servers at CERN that has been causing delays in email delivery.
Sites / Services round table:
- ASGC: nc
- BNL: NTR
- CNAF: Library is functional
- Buffer is being recovered
- Waiting for parts for the storage systems
- Waiting for the return of the wet tapes (end of February)
- Subfarm will be accessible to other experiments
- EGI: nc
- FNAL: NTR
- IN2P3: nc
- JINR: NTR
- KISTI: nc
- KIT: NTR
- NDGF: nc
- NL-T1: a non-HEP VO is submitting large numbers of jobs, making the SARA-MATRIX batch system unstable. GGUS:133089
- NRC-KI: nc
- OSG: NTR
- PIC: NTR
- RAL: NTR
- TRIUMF: NTR
- CERN computing services:
- The 2nd half of the Batch service (currently draining) will be rebooted for patching on Tuesday 30th, at which point full capacity should return.
- CERN storage services:
- EOSCMS: updated to EOS 4.x (Citrine) last Tuesday; some instabilities, but a fix was deployed on Thursday.
- CERN databases: multiple rolling database interventions are planned for this week to apply the latest security patch (lhcbonr, lhcbr, lcgr, aliconr, adcr, atlr and CASTOR databases; CMS DBs to be patched next week).
- GGUS:
- A new release is planned for Wed this week
- Release notes
- In particular: tickets will be anonymized 18 months after being closed
- A downtime has been scheduled for 07:00-09:00 UTC
- Test alarms will be submitted as usual
- Monitoring: NTR
- MW Officer: NTR
- Networks: NTR
- Security: NTR
AOB: