LCG Web>WLCGCommonComputingReadinessChallenges>WLCGOperationsMeetings>WLCGDailyMeetingsWeek131028 (2013-10-31, MaartenLitmaath)


Week of 131028

WLCG Operations Call details
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Thursday

WLCG Operations Call details

To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:

Dial +41227676000 (Main) and enter access code 0119168, or
To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs: WLCG Service Incident Reports
Broadcasts: Broadcast archive
Operations Web: Operations Web

General Information

General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - LHC Page 1

Monday

Attendance:

local: AndreaS, Luca, Alberto, Felix, Maarten
remote: Vladimir/LHCb, Sang-Un/KISTI, Rolf/IN2P3-CC, Xavier/KIT, Michael/BNL, Salvatore/CNAF, Onno/NL-T1, Oli/CMS, Christian/NDGF, Tiju/RAL, Rob/OSG, Wahid/ATLAS, Pepe/PIC

Experiments round table:

ATLAS reports (raw view) -
- Nothing to Report

CMS reports (raw view) -
- T2_FR_GRIF_IRFU (WLCG name: IFRU): the site is being flagged with SAM test failures although jobs are running fine:
  - One CE, node74.datagrid.cea.fr, simply disappeared from the SSB (GGUS:98341). Can the site support team please follow up? This may be related to the next issue.
  - The CEStatus of the second CE, lpnhe-cream.in2p3.fr, for the CMS queue differs between top-bdii.cern.ch and topbdii.grif.fr: GGUS:98418

Andrea: inconsistencies in the CERN top BDII have been seen often lately.

Maarten: indeed, some BDII bugs were found and corrected. One remaining issue is not yet understood and not fixed, but it is usually remedied by a restart. Unfortunately the main developer is on holiday for 1-2 weeks. CERN is using the latest version, by the way.

ALICE -
- CERN
  - because of persistent low efficiencies and high failure rates, job submission to SLC6 queues was disabled Thu last week
  - to be debugged at a smaller scale
  - meanwhile half of the SLC5 jobs will be made to use CVMFS instead of Torrent, to check if the use of CVMFS has anything to do with the matter
- KIT
  - error rate has been highish due to known issues in the CVMFS version on the WNs
  - looking forward to seeing 2.1.15 deployed...

Andrea: Xavier, when do you plan to update CVMFS?

Xavier: we are currently busy migrating to SL6; the CVMFS update will be done afterwards.

LHCb reports (raw view) -
- Main activities are incremental stripping (T0/1) and simulation (T2)
- T0: FTS3 Downtime
- T1:
  - SARA: Downtime

Sites / Services round table:

ASGC: one DPM disk server has a high failure rate and impacts ATLAS datadisk, trying to recover.
BNL: ntr
CNAF: ntr
IN2P3-CC: will have a dCache downtime on Nov 12 to install a SHA-2 compliant version.
KIT: ntr
KISTI: ntr
NDGF: ntr
NL-T1: about the SARA downtime, cksum verifications found some corrupted files. We needed vendor help to remount a filesystem, but we are getting some I/O errors, so we will soon migrate the data off it (it will also go out of warranty within a month). The advantage of migrating the files is that dCache will checksum all of them and we will therefore spot any corrupted files. A list of such files will eventually be circulated.
PIC: ntr
RAL: ntr
OSG: ntr
CERN FTS - CERN FTS3 was updated this morning to 3.1.33 as published on ITSSB. It is accepted this should have been published in GOCDB as well. Monitoring at https://fts3.cern.ch:8449/ will be available outside CERN shortly.
CERN storage services: we just upgraded CASTORCMS and CASTORALICE. The former is confirmed to be OK, for the latter it would be good if ALICE could test it. [Maarten: I will pass the message]. Tomorrow we'll upgrade CASTORLHCB and CASTORPUBLIC.

AOB:

Thursday

Attendance:

local: AndreaS, Maarten, Alessandro, Luca, Alberto
remote: Gareth/RAL, Jhen-Wei/ASGC, Burt/FNAL, Sang-Un/KISTI, Kyle/OSG, Michael/BNL, Ronald/NL-T1

Experiments round table:

ATLAS reports (raw view) -
- T0/T1
  - TAIWAN-LCG2 high job failure rate due to storage issue. GGUS:98482
  - INFN-T1 file transfer issue GGUS:98471

Jhen-Wei explains that the problem is due to the disk server failure reported last Monday. Experts are trying to recover the data, but in the worst case about one million files and 140 TB would be lost. Waiting for (hopefully) good news from Taiwan. Andrea asks for a SIR to be produced.

CMS reports (raw view) -

ALICE -
- CERN
  - SLC6 jobs: the efficiency climbed to very good levels on Tue at ~11:20 CET, for reasons unknown
    - the current job mix should be more demanding on I/O, if anything
    - the efficiencies varied between 70 and 90% during the last 24h and were significantly higher than the SLC5 jobs efficiency for a number of hours
  - SLC5 jobs: 1 of the 2 VOBOXes was switched to CVMFS on Tue morning
    - about half the SLC5 jobs were running with CVMFS, the rest with Torrent
    - Torrent jobs could no longer run as of yesterday due to an unexpected side effect of an AliEn update for CVMFS
      - this also affected other sites that had not yet fully migrated to CVMFS (RRC-KI, NIKHEF, ...)
      - Torrent jobs now use the previous version for the time being

Maarten suspects that the job failures and inefficiencies might be related to worker nodes or disk servers at Wigner; collecting more evidence will be necessary. Alessandro suggests that PES could report on the fraction of SLC6 resources at Wigner, at least to have an idea if they could produce a measurable effect on efficiency. All experiments are invited to share experiences, which would help in finding the cause of the problem.

LHCb reports (raw view) -
- Main activities are incremental stripping (T0/1) and simulation (T2)
- T0:
- T1:
  - SARA: Downtime

Sites / Services round table:

ASGC: ntr
BNL: ntr
FNAL: installing SLC6 on the worker nodes; it should be completed by the end of November. It took some time due to the need to puppetize everything.
KIT:
- Lost 2260 files for CMS. The SIR has already been created and uploaded: https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/KIT_SIR_Storage_20131028.pdf. The list of lost files was sent to CMS today.
- CMS requested to convert 500 TB of read-tape disk cache to disk-only. Today we started this for 688 TB in total.
KISTI: ntr
NL-T1: the problem at SARA mentioned by LHCb is GGUS:98370
RAL: there will be a scheduled downtime on Tuesday-Wednesday for work on the power supplies; some services will be taken down (FTS-3 should keep running, apart possibly from a short interruption).
OSG: ntr
CERN batch and grid services: ntr
CERN storage: the upgrade of the CASTOR stagers went OK; on Monday we will upgrade the nameserver and it should be transparent.

AOB:

Ticket INC:422227 was opened for the Audioconf experts to look into why the standard "daily" meeting teleconference was suddenly deemed "not yet started" at 15:00 CET.

Topic revision: r11 - 2013-10-31 - MaartenLitmaath
