Week of 130708

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO summaries of site usability: ALICE, ATLAS, CMS, LHCb
  • Service Incident Reports: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web

General Information

  • General information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS information: GgusInformation
  • LHC machine information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: AndreaV/SCOD, Kate/Databases, Jerome/Grid, Eddie/Dashboard, LucaM/Storage, MariaD/GGUS, Maarten/ALICE
  • remote: Matteo/CNAF, Alexander/NLT1, Xavier/KIT, Lisa/FNAL, Tiju/RAL, Pepe/PIC, Wei-Jen/ASGC, Rob/OSG, Michael/BNL, Peter/ATLAS, Alexey/LHCb

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • CASTOR ATLAS SLS monitoring down, GGUS:95485 & INC:333771. A certificate on the SAM monitoring boxes had expired; fixed (see the expiry-check sketch after this report). [LucaM: there were some monitoring glitches due to issues with one machine]
      • Analysis at CERN failing, GGUS:95445: failed critical SAM test with error "File was NOT deleted from SRM". Quota reached for the number of files (5M) on atlasscratchdisk; transfers of 0-length files failing; files declared lost. Ongoing. [LucaM: analysed the issue with Alessandro this morning. The permissions did not allow file deletion. Should be fixed as of 10am this morning: permissions now allow file deletion and overwriting in case of failures.]
    • T1
      • BNL lost files. Recovery ongoing.
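
  The expired-certificate incident above is the kind of failure a routine expiry check catches early. A minimal sketch, assuming openssl is on the PATH and using a hypothetical certificate path (illustrative only, not the actual SAM monitoring setup):

      import subprocess

      # Hypothetical certificate path for a SAM monitoring box; adjust to
      # the actual deployment.
      CERT = "/etc/grid-security/hostcert.pem"
      TWO_WEEKS_S = 14 * 24 * 3600

      # `openssl x509 -checkend N` exits non-zero if the certificate
      # expires within the next N seconds.
      result = subprocess.run(
          ["openssl", "x509", "-noout", "-checkend", str(TWO_WEEKS_S), "-in", CERT],
          capture_output=True,
      )
      if result.returncode != 0:
          print("WARNING: %s expires within 14 days" % CERT)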

  • ALICE -
    • KIT: the concurrent jobs cap has been lowered to 2k since Saturday afternoon to avoid firewall overload due to off-site data access; the recurrent issues with the local SE will be the subject of a meeting foreseen for this week (a sketch of such a concurrency cap follows)
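
  Purely as an illustration of the cap mechanism (at the site the limit is presumably enforced in the batch or VOBOX configuration, not in experiment code), a minimal bounded-concurrency sketch in Python; MAX_CONCURRENT_JOBS mirrors the 2k figure and run_job is a hypothetical stand-in for a real payload:

      import threading

      MAX_CONCURRENT_JOBS = 2000  # mirrors the 2k cap mentioned above
      slots = threading.BoundedSemaphore(MAX_CONCURRENT_JOBS)

      def run_job(payload):
          # Block until one of the slots is free, run the payload,
          # then release the slot for the next waiting job.
          with slots:
              payload()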

  • LHCb reports (raw view) -
    • Incremental stripping campaign in progress and MC productions ongoing
    • T0:
      • [LucaM: GGUS:95501 is now solved. There were two BeStMan instances running.]
      • [LucaM: as discussed with Philippe, LHCb has reached 95% of its EOS space; capacity will be added.]
    • T1:
      • GridKa: staging is still problematic (GGUS:95135)

Sites / Services round table:

  • Matteo/CNAF: ntr
  • Alexander/NLT1: ntr
  • Xavier/KIT: ntr
  • Lisa/FNAL: ntr
  • Tiju/RAL: upgrading CASTOR stager for ATLAS to 2.1.13-9
  • Pepe/PIC: the upgrade to SLC6/EMI2 is ongoing; half of the farm is being migrated now, and the whole farm should be done by next week
  • Wei-Jen/ASGC: ntr
  • Rob/OSG: ntr
  • Michael/BNL: ntr

  • Kate/Databases: ntr
  • Jerome/Grid: ntr
  • Eddie/Dashboard: ntr
  • LucaM/Storage: nta

  • MariaD/GGUS:
    • Reminder: release this Wed, July 10th, with ALARM tests and GGUS host certificate renewal (no DN change); a client-side verification sketch follows this list.
    • GGUS:94704 needs an answer from PIC (about the continuation of condor-utils support).
    • GGUS:94706 needs an answer from INFN (about the continuation of torque-utils support).
    • All GGUS ticket activity is up-to-date in file ggus-tickets.xls attached to twiki page WLCGOperationsMeetings.
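
  Since the release renews the GGUS host certificate without a DN change, the renewal can be checked from the client side: the subject DN should be unchanged while the expiry date is new. A minimal sketch using the Python standard library, assuming ggus.eu as the host to query:

      import socket
      import ssl

      HOST = "ggus.eu"  # assumed GGUS host name

      ctx = ssl.create_default_context()
      with socket.create_connection((HOST, 443)) as sock:
          with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
              cert = tls.getpeercert()

      # The subject should show the same DN as before the renewal;
      # notAfter should show the new expiry date.
      print("subject :", cert["subject"])
      print("notAfter:", cert["notAfter"])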

AOB: none

Thursday

Attendance:

  • local: AndreaV/SCOD, Kate/Databases, Eddie/Dashboard, Jerome/Grid, Maarten/ALICE
  • remote: Xavier/KIT, Kyle/OSG, Wei-Jen/ASGC, Gareth/RAL, Rolf/IN2P3, Jeff/NLT1

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • NTR
    • T1
      • BNL file recovery ongoing; job failures in the US cloud were reduced after increasing the transfer timeout. Load on the Site Services machine was reduced.

  • CMS reports (raw view) -
    • Legacy re-reco of the 2011 data is beginning at the T1s.
    • GGUS:95553 File read errors at IN2P3, similar to a previous ticket about file access timing out (GGUS:95434). The current ticket possibly results from the near-simultaneous startup of ~3600 jobs (see the jitter sketch after this report). Looks OK now (the last comment says to close the ticket).
    • GGUS:95654 IN2P3 SAM CE briefly red due to a job submission timeout; also looks OK now.
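
  A near-simultaneous startup of ~3600 jobs is a classic thundering-herd pattern. As a generic mitigation sketch only (not necessarily what CMS or IN2P3 did), each job can sleep for a random jitter before first touching storage; the window size below is an illustrative value:

      import random
      import time

      STARTUP_JITTER_S = 300  # illustrative: spread first file opens over 5 minutes

      def open_input(path):
          # Random delay so thousands of jobs submitted together do not
          # hit the storage system in the same instant.
          time.sleep(random.uniform(0, STARTUP_JITTER_S))
          return open(path, "rb")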

  • ALICE -
    • CERN: almost all CEs refuse ALICE proxies, alarm ticket GGUS:95662 [Jerome/Grid: the issue is now understood and is being fixed]
    • KIT: the concurrent jobs cap had to be lowered to 2k late yesterday evening; a plan to address the issues with the local SE is being worked out

Sites / Services round table:

  • Xavier/KIT: three downtimes for GridKa and dCache, spread over four days, are scheduled for the next few weeks; they will be entered in GOCDB
  • Kyle/OSG: the GGUS alarm tests ran yesterday; all was OK for ATLAS, but there was an issue with the CMS tests, so another test was requested to check what happened
  • Wei-Jen/ASGC: the Taiwan-CERN network link is unavailable due to submarine cable maintenance; it will be back tomorrow
  • Gareth/RAL:
    • upgraded CASTOR for ATLAS yesterday; the outage was longer than foreseen but all was OK in the end
    • CE issues this morning, all OK now
  • Rolf/IN2P3: ntr
  • Jeff/NLT1: ntr

  • Kate/Databases: ntr
  • Eddie/Dashboard: ntr
  • Jerome/Grid: nta

AOB: none
