Week of 141006

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Alessandro (ATLAS), Ben (grid services), Hervé (storage), Maarten (SCOD + ALICE), Tsung-Hsun (ASGC)
  • remote: Christian (NDGF), Christoph (CMS), Dea-Han (KISTI), John (RAL), Lisa (FNAL), Michael (BNL), Onno (NLT1), Pepe (PIC), Rob (OSG), Rolf (IN2P3), Sonia (CNAF)

Experiments round table:

  • ATLAS reports (raw view) -
    • Daily Activity overview -- for WLCG
      • Rucio: one job in Panda finished (by Friday night)!
      • MCORE reconstruction went to zero during the weekend: tasks were not launched and only half were submitted; 250M are now submitted. From AleF: the mixture of tasks submitted needs to be improved.
      • FTS at CERN was down during the weekend, now fixed.
        • Alessandro: though the load of the FTS-3 service at CERN can be taken over by BNL and RAL, the FTS-3 service needs to have the correct criticality; this will be followed up in other fora
      • HotDisk: will disable 2 or 3 of them so that Rucio won't return them as valid replicas. FZK and IN2P3-CC will be asked about accesses (NDGF to be checked by David).
    • CentralService/T0/T1s

  • CMS reports (raw view) -
    • Christoph: no big issues, but the FTS-3 criticality will indeed be followed up

  • ALICE -
    • NDGF: potentially big data loss from tape to be investigated ASAP

  • LHCb reports (raw view) -
    • No activities: LHCbDIRAC downtime to migrate tables from local MySQL to MySQL in the DBOD service.
    • T0: lbvobox19 and lbvobox27 down (ALARM ticket: GGUS:109086)
    • T0: CERN: FTS3 was down because the underlying DBOD database failed; now back in production.
    • T1: NTR
    • Services: NTR

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI: ntr
  • KIT:
  • NDGF: ntr
  • NL-T1:
    • downtime this Wed to migrate CREAM CEs
  • OSG:
    • quiet weekend after ShellShock...
  • PIC: ntr
  • RAL:
    • downtime to fix DB RAID issue affecting SRM and stagers for ATLAS and ALICE
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services:
    • FTS-3 incident will be followed up
  • CERN storage services: ntr
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Andrea Sciabà, Joel Closier (LHCb), Hervé Rousseau (IT-DSS), Ben Jones (central services), Pablo Saiz (WLCG monitoring), Maarten Litmaath (ALICE)
  • remote: Dea-Han Kim (KISTI), Tommaso Boccali (CMS), Michael Ernst (BNL), Andrej Filipcic (ATLAS), John Kelly (RAL), Rolf Rumler (IN2P3-CC), Thomas Hartmann (KIT), Pepe Flix (PIC), Rob Quick (OSG), Jeremy Coles (GridPP)

Experiments round table:

  • ATLAS reports (raw view) -
    • Daily Activity overview
      • Panda-Rucio integration: the bulk of jobs was submitted "by hand". One problem was understood: the GUID was passed fully capitalized, which caused an issue; now fixed. It had not been understood before because the job used was "hacked", i.e. not really a production job. A list of real jobs is now being submitted, which we hope will get through in one or two days.
      • Panda-Rucio integration: "dev" APFs with a customized pilot pointing to the Rucio-Panda server have been set up for many queues.
      • Panda-Rucio integration: the "task" test is now also being tried.
      • Rucio "full chain": the staging needs to be tested extensively. David already tested it, but up to now it has "never really worked". Still to be tested: transfers from T1 to T2, staging from T1 (to T1), and that Panda works with the staging.
      • Derivation production framework: many submitted tasks are "broken", and there is actually no visible difference between the successful and the broken ones. To be checked with Andreu/GDP. Yesterday's talk from James Catmore explained that there are still some problems; to be followed up with them.
    • CentralService/T0/T1s
      • ntr except for yesterday's ADCR issue, which caused the loss of about 100,000 jobs

  • CMS reports (raw view) -
    • Production and analysis at full steam
    • Global Run ongoing, including tests of the Tier-0
    • No major issues, just one relevant ticket:
      • GGUS:109201: for the last 4 days the CMS SRM SLS has shown frequent problems. This seems to be due to overload caused by an ongoing transfer of several million streamer files from CASTORCMS to EOSCMS via FTS3.

  • ALICE -
    • CERN: fewer job failures thanks to improved parameters of the batch system queue and of the pilot code
    • NDGF: files accidentally deleted from tape are being replicated from CERN

Sites / Services round table:

  • ASGC:
  • BNL: ntr
  • CNAF:
  • FNAL:
  • GridPP: ntr
  • IN2P3: ntr
  • JINR:
  • KISTI: ntr
  • KIT: ntr
  • NDGF:
  • NL-T1: ATLAS filed several tickets about failing data transfers. We're looking into that, as it may be unrelated to the network upgrade. The downtime for the site lasts at least until Friday night.
  • OSG: Maintenance on 14/10; one of the two OSG BDIIs will be out of production. It should be transparent.
  • PIC: ntr
  • RAL: Last Monday there was a database outage which affected CASTOR. Next week there will be a one-hour intervention on CASTOR for ATLAS and ALICE. Gareth is in touch with the experiments.
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services:
    • FTS software upgrade 3.2.27 -> 3.2.28 on fts3.cern.ch, from 08:00 UTC on Monday 13th October; it should be transparent. Announced in the ITSSB. Includes a fix for the staging error - thank you, LHCb.
  • CERN storage services: a faulty disk server in CASTORCMS caused some CMS file transfers to fail; it has now been fixed.
  • Databases:
  • GGUS:
    • Test alarm for INFN-T1 still open after more than one week
  • Grid Monitoring:
    • In order to ensure a smooth transition to the new message brokers, the port number has to be changed in the FTS configuration. The change has already been performed for many servers; the ones listed below still need to be updated. Required change: set the port number to the new value "61113" in "/etc/fts3/fts-msg...conf" and restart the "fts-msg-bulk" service (see the sketch after the host list):
      • fts02.usatlas.bnl.gov (192.12.15.14)
      • fts3-node1-kit.gridka.de (192.108.45.59)
      • fts3-node2-kit.gridka.de (192.108.45.146)
      • ftshost05.cr.cnaf.infn.it (131.154.129.14)
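      • For illustration only, a minimal Python sketch of the change described above. It assumes a simple "KEY = value" config layout; since the config filename is truncated in the minutes, the path, key name and restart command used here are placeholders/assumptions to be confirmed against the actual FTS3 messaging setup on each host.

        # fts_msg_port_update.py -- hedged sketch of the broker port change described above.
        # Assumptions: CONFIG_PATH and BROKER_PORT_KEY are placeholders (the real filename
        # is truncated in the minutes); the service is restarted via the "service" command.
        import re
        import shutil
        import subprocess

        CONFIG_PATH = "/etc/fts3/fts-msg-example.conf"  # placeholder path, not the real filename
        BROKER_PORT_KEY = "PORT"                        # assumed key holding the broker port
        NEW_PORT = "61113"                              # new message-broker port from the minutes

        def update_port(path=CONFIG_PATH):
            """Back up the config, rewrite the broker port, restart fts-msg-bulk."""
            shutil.copy2(path, path + ".bak")           # keep a backup of the original file
            with open(path) as f:
                text = f.read()
            # Replace 'PORT = <old value>' (or 'PORT=<old value>') with the new port number.
            text = re.sub(r"^(%s\s*=\s*)\S+" % BROKER_PORT_KEY,
                          r"\g<1>" + NEW_PORT, text, flags=re.MULTILINE)
            with open(path, "w") as f:
                f.write(text)
            # Restart the messaging service so the new port takes effect.
            subprocess.check_call(["service", "fts-msg-bulk", "restart"])

        if __name__ == "__main__":
            update_port()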

  • MW Officer:

AOB:
