Week of 130819

Daily WLCG Operations Call details
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Thursday

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

Dial +41227676000 (Main) and enter access code 0119168, or
To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

VO Summaries of Site Usability				SIRs	Broadcasts	Operations Web
ALICE	ATLAS	CMS	LHCb	WLCG Service Incident Reports	Broadcast archive	Operations Web

General Information

General Information			GGUS Information	LHC Machine Information
CERN IT status board	WLCG Baseline Versions	WLCG Blogs	GgusInformation	Sharepoint site - LHC Page 1

Monday

Attendance:

local: AndreaV/SCOD, Ben/Grid, Xavi/Storage, Ivan/Dashboard, Felix/ASGC
remote: Lucia/CNAF, Michael/BNL, Sang-Un/KISTI, Marc/IN2P3, Xavier/KIT, Onno/NLT1, Tiju/RAL, Roger/NDGF, Kyle/OSG, David/CMS, Elena/ATLAS

Experiments round table:

ATLAS reports (raw view) -
- T0
  - ntr
- T1
  - RAL (GGUS:96674): FTS3 problem. Transfers fail with an SRM_FILE_BUSY error. Under investigation.
  - Several sites were affected by the CVMFS upgrade to version 2.1.14.

CMS reports (raw view) -
- MC production and reconstruction continue at low-ish levels
- GGUS:96662 Apparent CVMFS black hole node at KIT -- node is offline and CMS sees no remaining effects, so ticket may be closed
- GGUS:96664 Hammercloud failures at KIT on Aug 16 -- monitoring shows "cancelled by user". Metric green for the following 24 hours
- GGUS:96668 glexec failures at T2_EE_Estonia -- ongoing
- GGUS:96688 EGI accounting portal down yesterday -- now solved

ALICE -
- NTR

LHCb reports (raw view) -
- No report

Sites / Services round table:

Lucia/CNAF: ntr
Michael/BNL: ntr
Sang-Un/KISTI: ntr
Marc/IN2P3: applied workaround for the CVMFS exploit, waiting for next scheduled downtime before upgrading to next version of CVMFS
Xavier/KIT: two downtimes tomorrow
- intervention on local hard disks of pool nodes in the morning
- intervention on dcap host in the afternoon, CMS will not be able to use dcap at all
Onno/NLT1: two issues
- one dcache pool node crashed at SARA and files were not available, fixed after a reboot
- there is a broken DIMM in the namespace node, which is a single point of failure; replacing it will require bringing down the whole dcache cluster; not urgent, will be announced in GOCDB
Tiju/RAL: ntr
Roger/NDGF: next Wed emergency maintenance of network to Finland, some dcache pools will be unavailable for ATLAS
Kyle/OSG: ntr
Felix/ASGC: reminder, ASGC is now in downtime for router interface upgrade

Ben/Grid: currently reinstalling lxbatch to fix CVMFS issues
Xavi/Storage: reminder intervention tomorrow on CASTOR, DB upgrade that should be transparent, but service is declared at risk
Ivan/Dashboard: ntr

AOB: The WLCG MB on Aug 20 is CANCELLED.

Thursday

Attendance:

local: MariaD/SCOD, Felix/ASGC, Eva/CERN-DB, Ivan/CERN-Dashboards
remote: Xavier/KIT, Lucia/CNAF, Elena/ATLAS, Sang-Un/KISTI, David M./CMS, Dennis/NL_T1, John K./RAL, Philippe Charpentier/LHCb, Roger/NDGF, Jeremy/GridPP, Rob/OSG.

Experiments round table:

ATLAS reports (raw view) -
- T0
  - ntr
- T1
  - RAL (GGUS:96674): FTS3 testing. A new FTS3 update yesterday (to solve the issue of no contact between the FTS3 service and the SE for 300 seconds) caused a spike of errors. The ticket was closed by RAL, but we still observe failures of some transfers in the DE cloud.
  - NDGF (GGUS:96762): Transfers failing (error message Marking Space as Being Used). Under investigation (the problem has no trivial cause or fix)

CMS reports (raw view) -
- CVMFS issues at KIT -- two tickets: GGUS:96714, GGUS:96761 First is on hold, the other in progress. Several nodes involved
- CVMFS issues on a single node at RAL -- node has been disabled and problems not seen by CMS -- ticket may be closed
- GGUS:96725 T2_KR_KISTI troubles moving to EMI-3. Ongoing
- GGUS:96739 CVMFS problem on worker at UKI-SOUTHGRID-OX-HEP -- worker upgraded and awaiting confirmation it's fixed.

ALICE -
- KIT: 23 production files lost due to a damaged tape; as they already had 2 other replicas, only catalog entries needed to be adjusted

LHCb reports (raw view) -
- Only MC productions ongoing, including using BOINC prototype
- Testing reconstruction on multicore (using GaudiMP) on the CERN cloud
  - For the moment problems with CVMFS stability
- T0:
  - NTR
- T1:
  - NTR

Sites / Services round table:

ASGC: ntr
BNL: none connected
CNAF: ntr
FNAL: none connected
KIT: A typo in a site name caused wrong information publishing. As soon as this mistake was discovered the correct information started being propagated.
PIC: none connected
RAL: Bank holiday next Monday and Tuesday in the UK. RAL will be closed. No one will join.
OSG: ntr
NDGF: Network maintenance next Monday evening. dCache instance in Finland will be unavailable. Some ALICE data will not be accessible.
NL_T1: A disk controller failure this morning caused 15 mins of unscheduled downtime.
IN2P3: none connected.
KISTI: ntr

CERN DB: Problem with a switch on RAC 9 requires an intervention next Monday, in principle transparent, but site declared at risk as the LHCb online and the CMS and ATLAS archive DBs are involved in this change.

CERN Dashboards: ntr

AOB:

Topic revision: r12 - 2013-08-22 - MaartenLitmaath
