Week of 130805

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The scod rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: Maarten, Andrew, Xavier, Ulrich, Ivan, Stefan
  • remote: Alexander, Michael, Xavier, Stephane, Saverio, Sang-Un, Tiju, Thomas, Wei-Jen, Pepe, Lisa, anonymous, Rob, Stefano

Experiments round table:

  • ATLAS
    • T0/Central services
      • NTR
    • T1
      • RAL FTS3 :
        • Friday-Saturday: A burst of FTS jobs was submitted on Friday. The transfer rate went down, affecting transfers for production. An intervention by the RAL FTS admin on Saturday morning temporarily solved the problem. Saturday afternoon: ATLAS rolled back many UK and DE sites to FTS2 and things went smoothly. Now waiting for feedback from the FTS3 developers and admin.
      • INFN-T1: (GGUS:96114): Problems transferring files from CNAF

  • CMS
    • Not much production running. Analysis activity at steady pace
    • No incident worth reporting, smooth running.
    • Open GGUS tickets as of today only refer to T2s:
      • GGUS:96329 IN2P3-CC-T2 did not prioritize SAM test
      • GGUS:96328 INFN-Bari persistent (obscure) SRM error copying one file
      • GGUS:96280 Budapest has been unable to transfer files out for 2 days
      • GGUS:96261 Bristol local stageout problem, likely a glitch, CMS investigating
      • GGUS:96217 London-Brunel seems opened by an independent user

  • ALICE
    • CNAF: VOBOX was in bad shape during part of the weekend (GGUS:96345)

  • LHCb
    • Mostly MC productions ongoing, tail of reprocessing and restripping campaign
    • T0:
      • CERN: NTR
    • T1:
      • GRIDKA: Marked improvement in transfers during the last week.
      • IN2P3: Weekend problems with SRM which affected other experiments were resolved for LHCb too.

Sites / Services round table:

  • NL-T1: NTR
  • BNL: NTR
  • KIT: NTR
  • CNAF: NTR
  • KISTI: NTR
  • RAL: NTR
  • NDGF: DT tomorrow at 11.00 for some network services; as a consequence, some ATLAS data won't be available.
  • ASGC: NTR
  • PIC: Powered on more CPUs a few days ago, now offering 30% more resources than pledged; VO contacts have been informed.
  • FNAL: NTR
  • OSG: APEL problem is fixed (reported on Friday).
  • Dashboard: NTR
  • Storage: NTR
  • Grid Services: Suffered from a monitoring issue when all CEs started failing SRM tests on Saturday morning; recovering now. Likely to be followed up with the monitoring team.

AOB:

Thursday

Attendance:

  • local: Andrew, Maarten, Ulrich, Ivan, Xavier, Stefan
  • remote: Michael, Xavier, David, Dennis, John, Wei-Jen, Sonia, Rolf, Elizabeth, Thomas, Lisa

Experiments round table:

  • ATLAS
    • T0/T1
      • CERN-PROD GGUS:96476 FTS2 channels stuck. FTS channels CERN -> INFN-T1/ASGC/NDGF/BNL/GRIDKA stuck. Problem fixed "by itself".
      • INFN-T1: GGUS:96480 ATLASMCTAPE free space low and decreasing (already seen and fixed a few days ago, GGUS:96414)
      • RAL FTS3: On Monday we reported issues due to high load on FTS3. The problem was understood and fixed. We put back on FTS3 the sites we had removed last weekend. We also added NDGF-T1 and noticed a factor 2 increase in the number of files transferred to the site.

  • CMS
    • Keeping T1-2 resources reasonably busy with MC production
    • GGUS:96429 T1_FR_CCIN2P3 seeing file read errors in MC reprocessing, possibly due to a dataset inconsistency. SAM tests timing out, possibly due to high load and long workflows not giving up slots needed for the test.
    • GGUS:96482 Transfers from Caltech to T1_UK_RAL failing, though errors perhaps at source end
    • GGUS:96452 DNS reverse lookup problems at T2_IN_TIFR

  • ALICE
    • NTR

  • LHCb
    • Mostly MC productions ongoing, tail of reprocessing and restripping campaign
    • T0:
      • CERN: NTR
    • T1:
      • Recovered from network interruptions at RAL earlier in the week
      • Local transfer failures at IN2P3 and SARA resolved (SRM overloads?)

Sites / Services round table:

  • BNL: NTR
  • KIT: NTR
  • NIKHEF: Failing ATLAS disk server (memory problems); an engineer from the supplier will check it tomorrow. Service currently running but may crash again at any time (happened already, GGUS:96463)
  • RAL: NTR
  • ASGC: NTR
  • CNAF: NTR
  • IN2P3: NTR
  • OSG: NTR
  • NDGF: NTR
  • FNAL: Question: where are problems concerning FTS3 posted? Maarten, Stefan: you may subscribe to the fts3-steering mailing list.
  • KISTI: NTR
  • PIC: NTR
  • Storage: Two small outages for ALICE yesterday and the night before because of DNS resolution problems. Time limits were increased.
  • Dashboards: NTR
  • Grid Services: Transparent move of services to the new OpenStack infrastructure: ARGUS done on Tuesday, BDII services yesterday/today

AOB:

Topic revision: r10 - 2013-08-09 - StefanRoiser
 