Week of 130708

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO summaries of site usability: ALICE, ATLAS, CMS, LHCb
  • Service Incident Reports: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web

General Information

  • General information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS information: GgusInformation
  • LHC machine information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: AndreaV/SCOD, Kate/Databases, Jerome/Grid, Eddie/Dashboard, LucaM/Storage, MariaD/GGUS, Maarten/ALICE
  • remote: Matteo/CNAF, Alexander/NLT1, Xavier/KIT, Lisa/FNAL, Tiju/RAL, Pepe/PIC, Wei-Jen/ASGC, Rob/OSG, Michael/BNL, Peter/ATLAS, Alexey/LHCb

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • CASTOR ATLAS SLS monitoring down, GGUS:95485 & INC:333771. A certificate on the SAM monitoring boxes had expired; fixed (see the expiry-check sketch after this report). [LucaM: there were some monitoring glitches due to issues with one machine]
      • Analysis at CERN failing, GGUS:95445: failed critical SAM test with error "File was NOT deleted from SRM". Quota reached for the number of files (5M) on atlasscratchdisk; transfers of 0-length files failing; files declared lost. Ongoing. [LucaM: analysed the issue with Alessandro this morning. The permissions did not allow file deletion. Should be fixed as of 10am this morning: permissions now allow file deletion and overwriting in case of failures.]
    • T1
      • BNL lost files. Recovery ongoing.
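
  The expired-certificate incident above is the kind of failure a routine expiry check catches early. A minimal sketch, assuming openssl is on the PATH and using a hypothetical certificate path (illustrative only, not the actual SAM monitoring setup):

      import subprocess

      # Hypothetical certificate path for a SAM monitoring box; adjust to
      # the actual deployment.
      CERT = "/etc/grid-security/hostcert.pem"
      TWO_WEEKS_S = 14 * 24 * 3600

      # `openssl x509 -checkend N` exits non-zero if the certificate
      # expires within the next N seconds.
      result = subprocess.run(
          ["openssl", "x509", "-noout", "-checkend", str(TWO_WEEKS_S), "-in", CERT],
          capture_output=True,
      )
      if result.returncode != 0:
          print("WARNING: %s expires within 14 days" % CERT)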

  • ALICE -
    • KIT: the concurrent jobs cap has been lowered to 2k since Saturday afternoon to avoid firewall overload due to off-site data access; the recurrent issues with the local SE will be the subject of a meeting foreseen for this week (a sketch of such a concurrency cap follows)
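
  Purely as an illustration of the cap mechanism (at the site the limit is presumably enforced in the batch or VOBOX configuration, not in experiment code), a minimal bounded-concurrency sketch in Python; MAX_CONCURRENT_JOBS mirrors the 2k figure and run_job is a hypothetical stand-in for a real payload:

      import threading

      MAX_CONCURRENT_JOBS = 2000  # mirrors the 2k cap mentioned above
      slots = threading.BoundedSemaphore(MAX_CONCURRENT_JOBS)

      def run_job(payload):
          # Block until one of the slots is free, run the payload,
          # then release the slot for the next waiting job.
          with slots:
              payload()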

  • LHCb reports (raw view) -
    • Incremental stripping campaign in progress and MC productions ongoing
    • T0:
      • [LucaM: GGUS:95501 is now solved. There were two BeStMan instances running.]
      • [LucaM: as discussed with Philippe, LHCb has reached 95% of its EOS space; capacity will be added.]
    • T1:
      • GridKa: staging is still problematic (GGUS:95135)

Sites / Services round table:

  • Matteo/CNAF: ntr
  • Alexander/NLT1: ntr
  • Xavier/KIT: ntr
  • Lisa/FNAL: ntr
  • Tiju/RAL: upgrading CASTOR stager for ATLAS to 2.1.13-9
  • Pepe/PIC: the upgrade to SLC6/EMI2 is ongoing; half of the farm is being migrated now, and the whole farm should be done by next week
  • Wei-Jen/ASGC: ntr
  • Rob/OSG: ntr
  • Michael/BNL: ntr

  • Kate/Databases: ntr
  • Jerome/Grid: ntr
  • Eddie/Dashboard: ntr
  • LucaM/Storage: nta

  • MariaD/GGUS:
    • Reminder: release this Wed, July 10th, with ALARM tests and GGUS host certificate renewal (no DN change); a client-side verification sketch follows this list.
    • GGUS:94704 needs an answer from PIC (about the continuation of condor-utils support).
    • GGUS:94706 needs an answer from INFN (about the continuation of torque-utils support).
    • All GGUS ticket activity is up-to-date in file ggus-tickets.xls attached to twiki page WLCGOperationsMeetings.
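
  Since the release renews the GGUS host certificate without a DN change, the renewal can be checked from the client side: the subject DN should be unchanged while the expiry date is new. A minimal sketch using the Python standard library, assuming ggus.eu as the host to query:

      import socket
      import ssl

      HOST = "ggus.eu"  # assumed GGUS host name

      ctx = ssl.create_default_context()
      with socket.create_connection((HOST, 443)) as sock:
          with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
              cert = tls.getpeercert()

      # The subject should show the same DN as before the renewal;
      # notAfter should show the new expiry date.
      print("subject :", cert["subject"])
      print("notAfter:", cert["notAfter"])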

AOB: none

Thursday

Attendance:

  • local: AndreaV/SCOD, Kate/Databases, Eddie/Dashboard, Jerome/Grid, Maarten/ALICE
  • remote: Xavier/KIT, Kyle/OSG, Wei-Jen/ASGC, Gareth/RAL, Rolf/IN2P3, Jeff/NLT1

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • NTR
    • T1
      • BNL file recovery ongoing; job failures in the US cloud were reduced after increasing the transfer timeout. Load on the Site Services machine was reduced.

  • CMS reports (raw view) -
    • Legacy re-reco of the 2011 data is beginning at the T1s.
    • GGUS:95553 File read errors at IN2P3, similar to a previous ticket about file access timing out (GGUS:95434). The current ticket possibly results from the near-simultaneous startup of ~3600 jobs (see the jitter sketch after this report). Looks OK now (the last comment says to close the ticket).
    • GGUS:95654 IN2P3 SAM CE briefly red due to a job submission timeout; also looks OK now.
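
  A near-simultaneous startup of ~3600 jobs is a classic thundering-herd pattern. As a generic mitigation sketch only (not necessarily what CMS or IN2P3 did), each job can sleep for a random jitter before first touching storage; the window size below is an illustrative value:

      import random
      import time

      STARTUP_JITTER_S = 300  # illustrative: spread first file opens over 5 minutes

      def open_input(path):
          # Random delay so thousands of jobs submitted together do not
          # hit the storage system in the same instant.
          time.sleep(random.uniform(0, STARTUP_JITTER_S))
          return open(path, "rb")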

  • ALICE -
    • CERN: almost all CEs refuse ALICE proxies, alarm ticket GGUS:95662 [Jerome/Grid: the issue is now understood and is being fixed]
    • KIT: the concurrent jobs cap had to be lowered to 2k late yesterday evening; a plan to address the issues with the local SE is being worked out

Sites / Services round table:

  • Xavier/KIT: three downtimes for GridKa and dCache, spread over four days, are scheduled for the next few weeks; they will be entered in GOCDB
  • Kyle/OSG: the GGUS alarm tests ran yesterday; all was OK for ATLAS, but there was an issue with the CMS tests, so another test was requested to check what happened
  • Wei-Jen/ASGC: the Taiwan-CERN network link is unavailable due to submarine cable maintenance; it will be back tomorrow
  • Gareth/RAL:
    • upgraded CASTOR for ATLAS yesterday; the outage was longer than foreseen but all was OK in the end
    • CE issues this morning, all OK now
  • Rolf/IN2P3: ntr
  • Jeff/NLT1: ntr

  • Kate/Databases: ntr
  • Eddie/Dashboard: ntr
  • Jerome/Grid: nta

AOB: none
