Week of 190415
WLCG Operations Call details
- For remote participation we use the Vidyo system. Instructions can be found here.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Portal
- Whenever a particular topic needs to be discussed at the operations meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-scod@cern.ch to allow the SCOD to make sure that the relevant parties have the time to collect the required information, or to invite the right people to the meeting.
Best practices for scheduled downtimes
Monday
Attendance:
- local: Ivan (ATLAS), Borja (Monitoring), Miroslav (Chair), Maarten (ALICE), Gavin (Compute), Vincent (Security), Andrei (DB), Enrico (ST)
- remote: Andrew (NIKHEF), Marcelo (INFN), Sang Un (KISTI), Raja (LHCb), Dave (FNAL), Di (TRIUMF), Darren (RAL), Jens (NDGF), David (IN2P3)
Experiments round table:
- LHCb reports
- Activity
- User jobs, MC productions, staging and some reprocessing this week.
- Issues
- RAL:
- Continuing migration from Castor to ECHO
- A disk server (gdss811) is down - causing various hold-ups and slow-downs of the different productions and the migration
- PIC: Machine ran out of disk space (GGUS:140715); fixed now, thanks!
- IN2P3: Batch system issues (GGUS:140652), possibly ongoing
Sites / Services round table:
- ASGC: NC
- BNL: NTR
- CNAF: NTR
- EGI: NC
- FNAL: NTR
- IN2P3: several batch system issues last week due to different incidents on the NFS storage used by the batch system. Instabilities in resource sharing impacting LHCb are still under investigation, and a workaround has been set up to get a more stable situation. Apologies for all these instabilities.
- JINR: NTR
- KISTI: Planned downtime for storage layer upgrade today. All OK afterwards
- KIT: NC
- NDGF: NTR
- NL-T1: A router firmware upgrade was done at Nikhef on Saturday 13th April. This was relatively trouble-free, with the exception of one storage node where the dpm-gridftp service failed and had to be restarted.
- NRC-KI: NC
- OSG: NC
- PIC: NC
- RAL: NTR
- TRIUMF: NTR
- CERN computing services: NTR
- CERN storage services:
- EOSATLAS crash and software update: OTG0049876
- EOSCMS software update: OTG0049776
- Certificate for s3.cern.ch is not trusted by IGTF, i.e., it works for web browsers but not for grid sites. Solutions are still being investigated.
- CERN databases: NTR
- GGUS: NTR
- Monitoring: NTR
- MW Officer: NTR
- Networks: NTR
- Security: Several Jenkins and Confluence server compromises are being reported globally (not within sites). Please make sure such servers are up to date and secure.
AOB:
- NOTE: the operations meeting next Monday will be virtual.
- You may provide relevant incidents, announcements etc. for the operations record.
- Have a good Easter break!