Week of 130819

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

VO Summaries of Site Usability | SIRs | Broadcasts | Operations Web
ALICE, ATLAS, CMS, LHCb | WLCG Service Incident Reports | Broadcast archive | Operations Web

General Information

General Information | GGUS Information | LHC Machine Information
CERN IT status board, WLCG Baseline Versions, WLCG Blogs | GgusInformation | Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: AndreaV/SCOD, Ben/Grid, Xavi/Storage, Ivan/Dashboard, Felix/ASGC
  • remote: Lucia/CNAF, Michael/BNL, Sang-Un/KISTI, Marc/IN2P3, Xavier/KIT, Onno/NLT1, Tiju/RAL, Roger/NDGF, Kyle/OSG, David/CMS, Elena/ATLAS

Experiments round table:

  • ATLAS reports (raw view) -
    • T0
      • ntr
    • T1
      • RAL (GGUS:96674): FTS3 problem. Transfers fail with SRM_FILE_BUSY error. Under investigation.
      • Several sites were affected by the CVMFS upgrade to version 2.1.14.

  • CMS reports (raw view) -
    • MC production and reconstruction continue at low-ish levels
    • GGUS:96662 Apparent CVMFS black hole node at KIT -- node is offline and CMS sees no remaining effects, so the ticket may be closed
    • GGUS:96664 Hammercloud failures at KIT on Aug 16 -- monitoring shows "cancelled by user". Metric was green for the following 24 hours
    • GGUS:96668 glexec failures at T2_EE_Estonia -- ongoing
    • GGUS:96688 EGI accounting portal down yesterday -- now solved

  • ALICE -
    • NTR

Sites / Services round table:

  • Lucia/CNAF: ntr
  • Michael/BNL: ntr
  • Sang-Un/KISTI: ntr
  • Marc/IN2P3: applied workaround for the CVMFS exploit, waiting for next scheduled downtime before upgrading to next version of CVMFS
  • Xavier/KIT: two downtimes tomorrow
    • intervention on local hard disks of pool nodes in the morning
    • intervention on dcap host in the afternoon, CMS will not be able to use dcap at all
  • Onno/NLT1: two issues
    • one dcache pool node crashed at SARA and files were not available, fixed after a reboot
    • there is a broken DIMM in the namespace node, which is a single point of failure; replacing it will require bringing down the whole dcache cluster. Not urgent; the downtime will be announced in GOCDB
  • Tiju/RAL: ntr
  • Roger/NDGF: next Wed emergency maintenance of network to Finland, some dcache pools will be unavailable for ATLAS
  • Kyle/OSG: ntr
  • Felix/ASGC: reminder, ASGC is now in downtime for router interface upgrade

  • Ben/Grid: currently reinstalling lxbatch to fix CVMFS issues
  • Xavi/Storage: reminder: intervention tomorrow on CASTOR, a DB upgrade that should be transparent, but the service is declared at risk
  • Ivan/Dashboard: ntr

AOB: The WLCG MB on Aug 20 is CANCELLED.

Thursday

Attendance:

  • local: MariaD/SCOD, Felix/ASGC, Eva/CERN-DB, Ivan/CERN-Dashboards
  • remote: Xavier/KIT, Lucia/CNAF, Elena/ATLAS, Sang-Un/KISTI, David M./CMS, Dennis/NL_T1, John K./RAL, Philippe Charpentier/LHCb, Roger/NDGF, Jeremy/GridPP, Rob/OSG

Experiments round table:

  • ATLAS reports (raw view) -
    • T0
      • ntr
    • T1
      • RAL (GGUS:96674): FTS3 testing. Yesterday's new FTS3 update (to solve the issue of no contact between the fts3 service and the SE for 300 seconds) caused a corresponding spike of errors. The ticket was closed by RAL, but we still observe failures of some transfers in the DE cloud.
      • NDGF (GGUS:96762): Transfers failing (error message "Marking Space as Being Used"). Under investigation (the problem has no trivial cause or fix).

  • CMS reports (raw view) -
    • CVMFS issues at KIT -- two tickets: GGUS:96714 and GGUS:96761. The first is on hold, the other in progress. Several nodes involved
    • CVMFS issues on a single node at RAL -- node has been disabled and problems are no longer seen by CMS -- ticket may be closed
    • GGUS:96725 T2_KR_KISTI troubles moving to EMI-3. Ongoing
    • GGUS:96739 CVMFS problem on a worker at UKI-SOUTHGRID-OX-HEP -- worker upgraded, awaiting confirmation that it is fixed

  • ALICE -
    • KIT: 23 production files lost due to a damaged tape; as they already had 2 other replicas, only the catalog entries needed to be adjusted

  • LHCb reports (raw view) -
    • Only MC productions ongoing, including using BOINC prototype
    • Testing reconstruction on multicore (using GaudiMP) on the CERN cloud
      • For the moment problems with CVMFS stability
    • T0:
      • NTR
    • T1:
      • NTR

Sites / Services round table:

  • ASGC: ntr
  • BNL: none connected
  • CNAF: ntr
  • FNAL: none connected
  • KIT: A typo in a site name caused wrong information publishing. As soon as this mistake was discovered the correct information started being propagated.
  • PIC: none connected
  • RAL: Bank holiday next Monday and Tuesday in the UK. RAL will be closed. No one will join.
  • OSG: ntr
  • NDGF: Network maintenance next Monday evening. The dCache instance in Finland will be unavailable. Some ALICE data will not be accessible.
  • NL_T1: A disk controller failure this morning caused 15 mins of unscheduled downtime.
  • IN2P3: none connected.
  • KISTI: ntr

  • CERN DB: A problem with a switch on RAC 9 requires an intervention next Monday. It is in principle transparent, but the site is declared at risk because the LHCb online DB and the CMS and ATLAS archive DBs are involved in this change.

  • CERN Dashboards: ntr

AOB:

Topic revision: r12 - 2013-08-22 - MaartenLitmaath
 