Week of 130415
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
The SCOD rota for the next few weeks is at
ScodRota
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Attendance:
- local: AndreaS, Jarka, Simone, MariaD
- remote: Lisa (FNAL), Onno (NL-T1), Michael (BNL), Lucia (CNAF), Wei-Jen (ASGC), Xavier (KIT), Tiju (RAL), Kyle (OSG), Rolf (IN2P3), Rob (OSG)
Experiments round table:
- ATLAS reports (raw view) -
- T1s:
- Some failures over the weekend copying data to FZK-LCG2 (GGUS:93320). The problem went away by itself. The ticket can be closed and a new one will be opened if the problem reappears.
- CMS reports (raw view) -
- No significant events in recent days.
- LHC / CMS
- Re-reconstruction of 2012 data is in the tails; the load at the T1 sites is small. User analysis goes on at a constant pace.
- CERN / central services and T0
- Tier-1:
- Tier-2:
- ALICE -
- CNAF: VOBOX host cert expired Sat afternoon, causing CNAF to get steadily drained of ALICE jobs since 04:30 today (GGUS:93319) [Lucia adds that they already requested a new certificate but the CA is late in delivering it]
- KIT: concurrent jobs cap lowered to 2500 on Fri afternoon to avoid overloading the firewall with off-site SE accesses while the new Xrootd servers are being debugged
- LHCb reports (raw view) -
- Apologies for not attending due to LHCb Analysis and Software week
- Still mainly user jobs
- T0:
- Still a problem with the SLS LFC sensor
- Activities:
- Re-stripping postponed until the beginning of May; it will start with the latest reprocessed data to take advantage of files still on disk. Then jobs will trigger staging from tape. Tier-1s should be prepared for a high staging rate, so that the rest of the re-stripping can be achieved within 60 days:
| Site | Total (TB) | Rate (MB/s) |
| CERN-RDST | 259.4 | 50 |
| CNAF-RDST | 794.6 | 153 |
| GRIDKA-RDST | 643.5 | 124 |
| IN2P3-RDST | 696.8 | 134 |
| PIC-RDST | 201.4 | 39 |
| RAL-RDST | 575.8 | 111 |
| SARA-RDST | 537.7 | 104 |
Sites / Services round table:
- ASGC: ntr
- BNL: ntr
- CNAF: ntr
- FNAL: ntr
- IN2P3: ntr
- KIT: ntr
- NL-T1: reminder from NIKHEF about a network maintenance intervention this Thursday affecting both computing and storage services; the queues will be drained starting from Wednesday afternoon.
- PIC: ntr
- RAL: ntr
- OSG: ntr
- Dashboards: ntr
AOB:
Thursday
Attendance:
- local: AndreaS, Jarka, Simone, Xavier, Gavin, MariaD
- remote: Xavier (KIT), David (CMS), John (RAL), Lucia (CNAF), Ronald (NL-T1), Rolf (IN2P3), Rob (OSG)
Experiments round table:
- ATLAS reports (raw view) -
- Central services
- The SAM-ATLAS-PROD machine had been down since 13:40 on April 16: https://cern.service-now.com/service-portal/view-incident.do?n=INC279368 . The importance of the machine was set to 5 (very low) while it should have been >50; this was changed yesterday (Wednesday) around 14:00. The problem was fixed at 10:30 on Thursday morning.
- SARA-MATRIX: ~2800 transfer failures because a certificate expired on Tuesday. GGUS:93353 solved; the issue was fixed the same day.
- RAL-LCG2: still some transfer failures at low rate with "checksum mismatch". GGUS:93315 updated on Tuesday. Under investigation.
- CMS reports (raw view) -
- LHC / CMS
- Re-reconstruction of 2012 data is in the tails; the load at the T1s continues to tail off. Analysis load continues as usual.
- CERN / central services and T0
- GGUS:93372 CMS VOMS -- currently correcting some manually entered incomplete DNs.
- Tier-1:
- GGUS:93440 KIT -- black hole worker node due to CVMFS partition being marked read only -- resolved quickly.
- Tier-2:
- ALICE -
- central services
- a cleanup operation unexpectedly caused a very high I/O load on the AliEn DB for a few hours starting Tue mid morning, causing lots of jobs and services around the grid to time out
- the cleanup was interrupted mid afternoon, after which the DB needed a few more hours to roll back
- normal conditions were restored early evening
- KIT: the new Xrootd servers look better now - thanks!
- concurrent jobs cap raised to 10k on Wed afternoon
- SARA: dCache Xrootd interface fixed for writing - thanks! (GGUS:93045)
Sites / Services round table:
- ASGC: ntr
- CNAF: ntr
- IN2P3: ntr
- KIT: tomorrow, emergency downtime from 09:00 to 13:00 local time to update the GPFS cluster; 900 TB of ATLAS data will be offline
- NL-T1: NIKHEF is in downtime for network maintenance as previously announced; everything should be back by the end of the afternoon
- PIC: ntr
- RAL: next Wednesday all Oracle databases will be patched; the upgrade should be transparent, but in case of problems with the ATLAS 3D database, the ATLAS Frontier servers will fail over to use the IN2P3 3D database. Rolf confirms that no intervention is planned at IN2P3 at that time.
- OSG: ntr
- CERN batch and grid services: Next Tuesday the NFS shared storage used by myproxy.cern.ch to store proxy credentials will be moved to a new system. We expect to preserve read-only access during the intervention so that proxy certificate renewal should keep working, but it will not be possible to store or delete stored proxy certificates for a few minutes. https://itssb.web.cern.ch/planned-intervention/change-backend-storage-server-myproxycernch/23-04-2013
- CERN storage services: last Tuesday from 1500 to 1540 there was a problem on the EOS head nodes for ATLAS and ALICE and the EOS service was unavailable
- Dashboards: ntr
- GGUS: GGUS release next Wednesday, April 24, from 06:00 to 07:00 UTC, with the usual ALARM test round. Reminder: as announced on https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek130304#Thursday and at the WLCG Ops Coordination meeting last week, the GGUS host certificate will be renewed. This certificate is used for SOAP authentication and hence impacts all systems that consume GGUS web services. The new certificate is attached to the relevant tickets in Savannah:136227.
AOB: