Week of 140609

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday: Whit Monday holiday

  • The meeting will be held on Tuesday instead.

Tuesday

Attendance:

  • local: Jerome (CERN grid services), Maarten (SCOD), Stefan (LHCb)
  • remote: Alexander (NLT1), Alexey (ATLAS), Gabriela (ATLAS), Kyle (OSG), Lisa (FNAL), Michael (BNL), Pavel (KIT), Rolf (IN2P3), Sang-Un (KISTI), Tiju (RAL), Tommaso (CMS)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
    • T0/T1s
      • "No such file or directory" transfer errors (very few) from FZK: JIRA:ATLDDMOPS-4678

  • CMS reports (raw view) -
    • Very quiet period
    • CERN had some transparent interventions this morning (DB, CVMFS, etc.) - we do not seem to be suffering from them at the moment
    • T0
      • NTR
    • T1
      • NTR

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Main activity: MC and user jobs, low activity over the long weekend
    • T0:
    • T1:
      • CNAF: storage problem, LFC dump given to the site for consistency checks
      • RAL: Problems with ARC CEs publishing wrong values for MaxCPUTime in the BDII (GGUS:106059), fixed
      • RAL: the local batch system seems to return 0 for the CPU time left for a landed pilot, so all jobs are failing (GGUS:)
      • RAL & PIC in DT today

Sites / Services round table:

  • ASGC:
  • BNL: ntr
  • CNAF:
  • FNAL: ntr
  • GridPP:
  • IN2P3:
    • network maintenance outage Tue Jun 17
    • the MSS will also be down on Wed Jun 18
  • JINR:
  • KISTI: ntr
  • KIT: ntr
  • NDGF:
  • NL-T1: ntr
  • OSG:
    • looking into rollout of Condor-C CE to replace GRAM-5 CEs, but first we need to ensure the SAM tests can handle the new type
      • Maarten: will notify the SAM team (and experiment experts) to get this matter on the agenda
  • PIC:
  • RAL:
    • CASTOR Name Server upgrade went OK and the batch system has been restarted
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services: ntr
  • CERN storage services:
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Andrea M (MW Officer), Felix (ASGC), Jerome (CERN grid services), Kate (databases), Maarten (SCOD), Maria A (WLCG), Stefan (LHCb)
  • remote: Antonio (CNAF), Dennis (NLT1), Gareth (RAL), Kyle (OSG), Lisa (FNAL), Michael (BNL), Pepe (PIC), Rolf (IN2P3), Thomas (NDGF), Wahid (ATLAS)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • Problems with FTS3 are causing a huge backlog for transferring jobs of multicloud production (from US to TW, CA, FR), GGUS:106095
      • We will move the TW and FR cloud endpoints to the RAL FTS3

  • CMS reports (raw view) -
    • Very quiet period
    • a CERN DB machine needs a reboot after upgrade, should be completely transparent to Computing
    • During the night the European Xrootd redirector went down; it was restored at 9 am
    • T0
      • NTR
    • T1
      • NTR

  • ALICE -
    • CERN: 'permission denied' errors while trying to read raw data files from CASTOR; fixed by the CASTOR team

  • LHCb reports (raw view) -
    • Main activity: MC and user jobs, reprocessing campaign about to start today or tomorrow
    • T0:
    • T1:
      • CNAF: back from their DT, in contact with the LHCb DataMgmt team to restore missing files
        • Antonio: a SIR will be provided when the incident has been resolved with the vendor etc.

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: nta
  • FNAL: ntr
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI:
  • KIT:
  • NDGF: ntr
  • NL-T1: ntr
  • OSG: ntr
  • PIC:
    • Tue downtime went well. We are now running dCache 2.6.29 at PIC!
  • RAL: ntr
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services:
    • ce208 has been taken out because of a HW problem
  • CERN storage services:
  • Databases:
    • late on Fri evening there was a power supply problem lasting ~2h that affected production RACs hosting various databases, including ADCR, CMSR, LCGR and CASTOR
    • the CMS integration DB currently cannot be reached from off-site; remote users could e.g. tunnel through lxplus; will be fixed next week
  • GGUS:
  • Grid Monitoring:
  • MW Officer:
    • today's UMD update release includes the fix (bouncycastle-mail-1.46-2) for GGUS:104768 ("Job submission fails for VOs supported by VOMS server with SHA-512 host certificate")
      • WLCG sites and experiments should update their affected nodes:
        • Argus
        • CREAM
        • UI
        • WN
    • the next UMD updates are planned for July and Oct

AOB:

WLCG Operations Coordination presented some changes concerning operations meetings at the last Operations Coordination Meeting. See slides for more details. As far as the Mon and Thu meetings are concerned, there are no big changes apart from the new report by the MW Officer, who will gather feedback on open MW issues and act as a link with the MW developers. Please do not hesitate to report any issue that requires his attention.

Apart from this, T1s and T2s now have a slot at the biweekly Operations Coordination meeting where they can raise issues and give feedback relevant to the audience of the meeting, such as experiments, other sites, operations coordinators and TF coordinators. We expect more participation from sites, so you are all welcome to attend. The Operations Coordination meeting discusses more of the short- and long-term plans from experiments and TFs that may have an impact on the sites, so your feedback and questions are much appreciated.

Topic revision: r13 - 2014-06-13 - MaartenLitmaath