Week of 141208

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Alessandro (ATLAS), Christoph (CMS), Maarten (SCOD + ALICE), Nathalie, Pepe (PIC), Stefan (LHCb), Tsung-Hsun (ASGC)
  • remote: Christian (NDGF), Dimitri (KIT), Gareth (RAL), Lisa (FNAL), Michael (BNL), Onno (NLT1), Rob (OSG), Rolf (IN2P3)

Experiments round table:

  • ATLAS reports (raw view) -
    • CentralService & Tier0/Tier1s
      • BNL-ATLAS GGUS:110582: not really a site issue; it was due to changes in the periodic remove algorithm used in the AutoPilotFactory.
    • Daily Activity overview:
      • not much is running in the system, for various reasons: 1) not much workload from the MC and Derivation Framework coordinators; 2) some issues with Rucio (described in detail below) which left some tasks broken (could be improved with more retries and an exponentially increasing delay between retries); 3) some Derivation Framework tasks were failing due to SW issues.

  • CMS
    • Trouble with PhEDEx data service over the weekend (GGUS:110600)
    • Problems reading data from EOS for Tier-0 tests on Friday (GGUS:110593)

  • ALICE -
    • KIT: many raw data files are still being read remotely
      • an inconsistency was found between the AliEn File Catalog and the SE inventory
      • a consistency check is being prepared

  • LHCb reports (raw view) -
    • MC and user jobs. "Legacy Run1 Stripping" campaign running full steam and progressing well
    • T0: Very low number of stripping job errors, but CERN still shows an error rate several times higher than at the T1 sites; we would like to understand this. Resubmission of the same file usually works (GGUS:110604)
    • T1:
      • IN2P3: DT tomorrow, draining tonight
      • RAL: Again, a low total number of errors, but several stripping jobs failed with "bus error" during the weekend
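
The ATLAS remark above about adding more retries with a growing delay can be illustrated with a minimal sketch of retry with exponential backoff. This is not the Rucio or ProdSys2 implementation; the function name `retry_with_backoff` and all parameter names and defaults are assumptions made for illustration only.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, jitter=False, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is exhausted.

    Hypothetical helper, not part of any ATLAS tool. The wait before
    retry n is base_delay * 2**(n - 1) seconds, capped at max_delay.
    With jitter=True the wait is drawn uniformly from [0, delay], which
    spreads out clients that would otherwise retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: let the last error propagate.
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            if jitter:
                delay = random.uniform(0, delay)
            sleep(delay)
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies. Catching every `Exception` is a simplification: a real client would normally retry only on errors known to be transient.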

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF:
  • FNAL: ntr
  • GridPP:
  • IN2P3:
    • downtime tomorrow, batch system will be drained as of 22:00 CET
  • JINR:
  • KISTI:
  • KIT: ntr
  • NDGF: ntr
  • NL-T1:
    • MSS down on Wed
  • OSG: ntr
  • PIC:
    • dCache upgrade to 2.10 next Mon
    • HTTP read-only access to be enabled also for LHCb next week
  • RAL:
    • there is a problem with the ATLAS SRM since yesterday
      • a downtime has been declared
      • the trouble may come from "bad" file names with double slashes
      • under investigation
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services:
  • CERN storage services:
  • Databases:
  • GGUS:
    • GGUS update on the 10th of December, including cleanup of unused VOs and removal of the ticket category 'SPAM'
    • Supporters using SNOW are reminded to keep using GGUS as the main tool until the synchronization issues have been solved
  • Grid Monitoring:
  • MW Officer:
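
RAL reported above that the ATLAS SRM trouble may come from file names with double slashes. As a minimal sketch of how such names could be spotted and normalized, assuming plain storage paths rather than full URLs (where `//` legitimately follows the scheme), something like the following might be used. The function names are hypothetical and do not reflect RAL's actual investigation or any SRM or CASTOR tooling.

```python
import re

# Two or more consecutive slashes anywhere in a path.
_REPEATED_SLASHES = re.compile(r"/{2,}")


def has_repeated_slashes(path):
    """Return True if the path contains a run of two or more slashes."""
    return _REPEATED_SLASHES.search(path) is not None


def collapse_slashes(path):
    """Replace each run of slashes with a single slash."""
    return _REPEATED_SLASHES.sub("/", path)


# Hypothetical example paths, for illustration only.
paths = ["/store/atlas//data/file1.root", "/store/atlas/data/file2.root"]
suspect = [p for p in paths if has_repeated_slashes(p)]
```

Here `suspect` would hold only the first path, and `collapse_slashes` would turn it into `/store/atlas/data/file1.root`.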

AOB:

Thursday

Attendance:

  • local: Daniel (grid services), Maarten (SCOD + ALICE), Pablo (GGUS + grid monitoring), Stefan (LHCb), Ulf (NDGF)
  • remote: Andrej (ATLAS), Antonio (CNAF), Christian (NDGF), Christoph (CMS), Dennis (NLT1), Michael (BNL), Pepe (PIC), Rob (OSG), Rolf (IN2P3), Thomas (KIT), Tiju (RAL), Tsung-Hsun (ASGC), Young-Bok (KISTI)

Experiments round table:

  • ATLAS reports (raw view) -
    • CentralService & Tier0/Tier1s
      • FZK issue GGUS:110693, not reproducible but "solved by itself".
    • Andrej:
      • a few sites complained about jobs from one task running in multi-core mode in single-core queues; the task was killed and the cause investigated to avoid such incidents in the future
      • low activity because of Rucio and ProdSys2 commissioning, but also awaiting SW readiness for Run 2 Monte Carlo campaign
      • disks are quite full: the normal deletion activities are being monitored carefully after last week's incident

  • CMS reports (raw view) -
    • Still some trouble with CMS web services
    • Testing some rather high I/O jobs on HLT (with EOS experts in the loop)

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • "Legacy Run1 Stripping" campaign running full steam and progressing well + MC and user jobs.
    • T0: Two GGUS tickets open and not answered yet
      • GGUS:110604 - investigation of higher failure rate for stripping jobs at CERN
      • GGUS:110583 - SAM probes were timing out at CERN queues
    • T1:
      • IN2P3:
        • Stefan: ~1500 jobs had to be killed on the morning of the intervention; next time we will try to have less fallout

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL:
  • GridPP:
  • IN2P3:
    • the downtime on Tue generally went OK; Xrootd was back later than foreseen
  • JINR:
  • KISTI: ntr
  • KIT:
    • updating CMS dCache to 2.11 today
    • updated ATLAS dCache on Tue and Wed
      • in the downtime extension the CMS SRM was mentioned accidentally
      • Christoph: fixed after its effect on PhEDEx operations was noticed
    • a high load on the GridFTP doors has been observed since the upgrade; being investigated
  • NDGF:
    • 1 tape with ATLAS data got destroyed; under investigation
  • NL-T1: ntr
  • OSG: ntr
  • PIC:
    • next Mon downtime for dCache upgrade to 2.10
  • RAL:
    • the ATLAS SRM instabilities were fixed on Mon
    • CMS and ATLAS CASTOR head nodes were upgraded to SL6 on Tue and Wed
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services: ntr
  • CERN storage services:
  • Databases:
  • GGUS:
    • New release of GGUS done on Wednesday. All alarms have been acknowledged.
    • SNOW fixed the synchronization issue with GGUS. Supporters using GGUS and SNOW can go back to using their preferred interface
  • Grid Monitoring: ntr
  • MW Officer:

AOB:

Topic revision: r10 - 2014-12-11 - MaartenLitmaath
 