Week of 141208
WLCG Operations Call details
- At CERN the meeting room is 513 R-068.
- For remote participation we use the Vidyo system. Instructions can be found here.
General Information
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Web
Monday
Attendance:
- local: Alessandro (ATLAS), Christoph (CMS), Maarten (SCOD + ALICE), Nathalie, Pepe (PIC), Stefan (LHCb), Tsung-Hsun (ASGC)
- remote: Christian (NDGF), Dimitri (KIT), Gareth (RAL), Lisa (FNAL), Michael (BNL), Onno (NLT1), Rob (OSG), Rolf (IN2P3)
Experiments round table:
- ATLAS reports (raw view) -
- CentralService & Tier0/Tier1s
- BNL-ATLAS GGUS:110582 not really a site issue: it was due to changes in the periodic-remove algorithm used in the AutoPilotFactory.
- Daily Activity overview:
- not much ongoing in the system, for several reasons: 1) not much workload from the MC and Derivation Framework coordinators; 2) some issues with Rucio (described in detail below), which left some tasks broken (this could be improved with more retries and exponentially increasing retry delays; a sketch follows below); 3) some Derivation Framework tasks were failing due to SW issues.
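A minimal, generic sketch of such a retry policy with exponentially increasing delays, as a hypothetical Python illustration (the function name and parameters here are not actual ATLAS/Rucio code):

import random
import time

def retry_with_backoff(action, max_retries=5, base_delay=1.0, max_delay=300.0):
    # Retry 'action' (any callable that raises on a transient failure)
    # with exponentially increasing delays; illustrative names only.
    for attempt in range(max_retries):
        try:
            return action()
        except Exception:  # in practice, catch only the transient error type
            if attempt == max_retries - 1:
                raise  # out of retries: propagate the failure
            # the delay doubles each attempt, capped, with jitter added
            # to avoid synchronized retry bursts from many tasks
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))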
- CMS
- Trouble with the PhEDEx data service over the weekend (GGUS:110600)
- Problems reading data from EOS for Tier-0 tests on Friday (GGUS:110593)
- ALICE -
- KIT: many raw data files are still being read remotely
- an inconsistency was found between the AliEn File Catalog and the SE inventory
- a consistency check is being prepared (a sketch of the idea follows below)
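A minimal sketch of the kind of catalog/SE consistency check meant above, assuming plain-text dumps with one file path per line (the dump file names are hypothetical; the real check runs against the AliEn File Catalog and the SE inventory):

def load_paths(dump_file):
    # one file path per line; blank lines are ignored
    with open(dump_file) as f:
        return {line.strip() for line in f if line.strip()}

catalog = load_paths("alien_catalog_dump.txt")   # what the catalog lists for this SE
inventory = load_paths("se_inventory_dump.txt")  # what the SE actually stores

dark_data = inventory - catalog    # on the SE but unknown to the catalog
lost_files = catalog - inventory   # registered but missing from the SE

print("dark data: %d files, lost: %d files" % (len(dark_data), len(lost_files)))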
- LHCb reports (raw view) -
- MC and user jobs ongoing. "Legacy Run1 Stripping" campaign running at full steam and progressing well
- T0: Very low number of stripping job errors, but CERN still shows an error rate several times higher than the other T1 sites; we would like to understand this. Resubmission of the same file usually works (GGUS:110604)
- T1:
- IN2P3: downtime tomorrow, draining tonight
- RAL: Again a low total number of errors, but several stripping jobs failed with "bus error" during the weekend
Sites / Services round table:
- ASGC: ntr
- BNL: ntr
- CNAF:
- FNAL: ntr
- GridPP:
- IN2P3:
- downtime tomorrow, batch system will be drained as of 22:00 CET
- JINR:
- KISTI:
- KIT: ntr
- NDGF: ntr
- NL-T1:
- OSG: ntr
- PIC:
- dCache upgrade to 2.10 next Mon
- HTTP read-only access to be enabled also for LHCb next week
- RAL:
- there has been a problem with the ATLAS SRM since yesterday
- a downtime has been declared
- the trouble may come from "bad" file names containing double slashes (a normalization sketch follows after this round table)
- under investigation
- RRC-KI:
- TRIUMF:
- CERN batch and grid services:
- CERN storage services:
- Databases:
- GGUS:
- GGUS update on the 10th of December, including a cleanup of unused VOs and the removal of the ticket category 'SPAM'
- Supporters using SNOW are reminded to keep using GGUS as the main tool until the synchronization issues have been solved
- Grid Monitoring:
- MW Officer:
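Regarding the RAL ATLAS SRM issue above: double slashes in file names can make the same file appear under two different paths in a namespace. A hypothetical normalization helper in Python (not RAL's actual fix, which is still under investigation):

import re

def normalize_lfn(path):
    # Collapse runs of slashes so that '/atlas//x' and '/atlas/x'
    # resolve to the same entry before comparison or registration.
    return re.sub(r"/{2,}", "/", path)

assert normalize_lfn("/atlas//datadisk///file.root") == "/atlas/datadisk/file.root"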
AOB:
Thursday
Attendance:
- local: Daniel (grid services), Maarten (SCOD + ALICE), Pablo (GGUS + grid monitoring), Stefan (LHCb), Ulf (NDGF)
- remote: Andrej (ATLAS), Antonio (CNAF), Christian (NDGF), Christoph (CMS), Dennis (NLT1), Michael (BNL), Pepe (PIC), Rob (OSG), Rolf (IN2P3), Thomas (KIT), Tiju (RAL), Tsung-Hsun (ASGC), Young-Bok (KISTI)
Experiments round table:
- ATLAS reports (raw view) -
- CentralService & Tier0/Tier1s
- FZK issue GGUS:110693: not reproducible, but "solved by itself".
- Andrej:
- a few sites complained about jobs from 1 task running in multi-core mode in single-core queues; the task was killed and the cause is being investigated to avoid such incidents in the future
- low activity because of Rucio and ProdSys2 commissioning, but also awaiting SW readiness for Run 2 Monte Carlo campaign
- disks are quite full: the normal deletion activities are being monitored carefully after last week's incident
- CMS reports (raw view) -
- Still some trouble with CMS web services
- Testing some rather high I/O jobs on HLT (with EOS experts in the loop)
- LHCb reports (raw view) -
- "Legacy Run1 Stripping" campaign running full steam and progressing well + MC and user jobs.
- T0: Two GGUS tickets open and not answered yet
- GGUS:110604 - investigation of higher failure rate for stripping jobs at CERN
- GGUS:110583 - SAM probes were timing out at CERN queues
- T1:
- IN2P3:
- Stefan: ~1500 jobs had to be killed on the morning of the intervention; next time we will try to reduce the fallout
Sites / Services round table:
- ASGC: ntr
- BNL: ntr
- CNAF: ntr
- FNAL:
- GridPP:
- IN2P3:
- the downtime on Tue generally went OK; Xrootd was back later than foreseen
- JINR:
- KISTI: ntr
- KIT:
- updating CMS dCache to 2.11 today
- updated ATLAS dCache on Tue and Wed
- the CMS SRM was accidentally included in the downtime extension
- Christoph: fixed after its effect on PhEDEx operations was noticed
- a high load on the GridFTP doors has been observed since the upgrade; being investigated
- NDGF:
- 1 tape with ATLAS data got destroyed; under investigation
- NL-T1: ntr
- OSG: ntr
- PIC:
- next Mon downtime for dCache upgrade to 2.10
- RAL:
- the ATLAS SRM instabilities were fixed on Mon
- CMS and ATLAS CASTOR head nodes were upgraded to SL6 on Tue and Wed
- RRC-KI:
- TRIUMF:
- CERN batch and grid services: ntr
- CERN storage services:
- Databases:
- GGUS:
- New release of GGUS deployed on Wednesday. All alarms have been acknowledged.
- SNOW has fixed the synchronization issue with GGUS. Supporters using GGUS and SNOW can go back to using their preferred interface.
- Grid Monitoring: ntr
- MW Officer:
AOB: