Week of 130819
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
The SCOD rota for the next few weeks is at ScodRota
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Attendance:
- local: AndreaV/SCOD, Ben/Grid, Xavi/Storage, Ivan/Dashboard, Felix/ASGC
- remote: Lucia/CNAF, Michael/BNL, Sang-Un/KISTI, Marc/IN2P3, Xavier/KIT, Onno/NLT1, Tiju/RAL, Roger/NDGF, Kyle/OSG, David/CMS, Elena/ATLAS
Experiments round table:
- ATLAS reports (raw view) -
- T0
- T1
- RAL (GGUS:96674): FTS3 problem. Transfers fail with SRM_FILE_BUSY error. Under investigation.
- Several sites were affected by the CVMFS upgrade to version 2.1.14.
- CMS reports (raw view) -
- MC production and reconstruction continue at low-ish levels
- GGUS:96662 Apparent CVMFS black hole node at KIT -- node is offline and CMS see no remaining effects, so ticket may be closed
- GGUS:96664 Hammercloud failures at KIT on Aug 16 -- monitoring shows "cancelled by user". Metric green 24 hours following
- GGUS:96668 glexec failures at T2_EE_Estonia -- ongoing
- GGUS:96688 EGI accounting portal down yesterday -- now solved
Sites / Services round table:
- Lucia/CNAF: ntr
- Michael/BNL: ntr
- Sang-Un/KISTI: ntr
- Marc/IN2P3: applied workaround for the CVMFS exploit, waiting for next scheduled downtime before upgrading to next version of CVMFS
- Xavier/KIT: two downtimes tomorrow
- intervention on local hard disks of pool nodes in the morning
- intervention on dcap host in the afternoon, CMS will not be able to use dcap at all
- Onno/NLT1: two issues
- one dCache pool node crashed at SARA and files were not available; fixed after a reboot
- there is a broken DIMM in the namespace node, which is a single point of failure; replacing it will require bringing down the whole dCache cluster. Not urgent; will be announced in GOCDB
- Tiju/RAL: ntr
- Roger/NDGF: next Wed emergency maintenance of network to Finland, some dcache pools will be unavailable for ATLAS
- Kyle/OSG: ntr
- Felix/ASGC: reminder, ASGC is now in downtime for router interface upgrade
- Ben/Grid: currently reinstalling lxbatch to fix CVMFS issues
- Xavi/Storage: reminder: intervention tomorrow on CASTOR (DB upgrade); it should be transparent, but the service is declared at risk
- Ivan/Dashboard: ntr
AOB: The WLCG MB on Aug 20 is CANCELLED.
Thursday
Attendance:
- local: MariaD/SCOD, Felix/ASGC, Eva/CERN-DB, Ivan/CERN-Dashboards
- remote: Xavier/KIT, Lucia/CNAF, Elena/ATLAS, Sang-Un/KISTI, David M./CMS, Dennis/NL_T1, John K./RAL, Philippe Charpentier/LHCb, Roger/NDGF, Jeremy/GridPP, Rob/OSG
Experiments round table:
- ATLAS reports (raw view) -
- T0
- T1
- RAL (GGUS:96674): FTS3 testing. A new FTS3 update yesterday (to solve the issue of no contact between the FTS3 service and the SE for 300 seconds) caused a spike of errors. The ticket was closed by RAL, but we still observe failures of some transfers in the DE cloud.
- NDGF (GGUS:96762): Transfers failing (error message "Marking Space as Being Used"). Under investigation (the problem has no obvious cause or trivial fix).
- CMS reports (raw view) -
- CVMFS issues at KIT -- two tickets: GGUS:96714, GGUS:96761. The first is on hold, the other in progress. Several nodes involved
- CVMFS issues on a single node at RAL -- node has been disabled and problems not seen by CMS -- ticket may be closed
- GGUS:96725 T2_KR_KISTI troubles moving to EMI-3. Ongoing
- GGUS:96739 CVMFS problem on worker at UKI-SOUTHGRID-OX-HEP -- worker upgraded and awaiting confirmation it's fixed.
- ALICE -
- KIT: 23 production files lost due to a damaged tape; as they already had 2 other replicas, only catalog entries needed to be adjusted
- LHCb reports (raw view) -
- Only MC productions ongoing, including using BOINC prototype
- Testing reconstruction on multicore (using GaudiMP) on the CERN cloud
- For the moment problems with CVMFS stability
- T0:
- T1:
Sites / Services round table:
- ASGC: ntr
- BNL: none connected
- CNAF: ntr
- FNAL: none connected
- KIT: A typo in a site name caused wrong information to be published. As soon as the mistake was discovered, the correct information started being propagated.
- PIC: none connected
- RAL: Bank holiday next Monday and Tuesday in the UK. RAL will be closed. Nobody will join.
- OSG: ntr
- NDGF: Network maintenance next Monday evening. dCache instance in Finland will be unavailable. Some ALICE data will not be accessible.
- NL_T1: A disk controller failure this morning caused 15 mins of unscheduled downtime.
- IN2P3: none connected.
- KISTI: ntr
- CERN DB: Problem with a switch on RAC 9 requires an intervention next Monday, in principle transparent, but the site is declared at risk as the LHCb online and the CMS and ATLAS archive DBs are involved in this change.
AOB: