Week of 141006

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Alessandro (ATLAS), Ben (grid services), Hervé (storage), Maarten (SCOD + ALICE), Tsung-Hsun (ASGC)
  • remote: Christian (NDGF), Christoph (CMS), Dea-Han (KISTI), John (RAL), Lisa (FNAL), Michael (BNL), Onno (NLT1), Pepe (PIC), Rob (OSG), Rolf (IN2P3), Sonia (CNAF)

Experiments round table:

  • ATLAS reports (raw view) -
    • Daily Activity overview -- for WLCG
      • Rucio: one job in Panda finished (by Friday night)!
      • MCORE reconstruction went to zero during the weekend: tasks were not launched and only half were submitted; 250M are now submitted. From AleF: the mixture of tasks submitted needs to be improved.
      • FTS at CERN was down during the weekend, now fixed.
        • Alessandro: though the load of the FTS-3 service at CERN can be taken over by BNL and RAL, the FTS-3 service needs to have the correct criticality; this will be followed up in other fora
      • HotDisk: will disable 2 or 3 of them so that Rucio won't return them as valid replicas. FZK and IN2P3-CC will be asked about accesses (NDGF to be checked by David).
    • CentralService/T0/T1s

  • CMS reports (raw view) -
    • Christoph: no big issues, but the FTS-3 criticality will indeed be followed up

  • ALICE -
    • NDGF: potentially big data loss from tape to be investigated ASAP

  • LHCb reports (raw view) -
    • No activities: LHCbDIRAC downtime to migrate tables from local MySQL to MySQL in the DBOD service.
    • T0: lbvobox19 and lbvobox27 down (ALARM ticket: GGUS:109086)
    • T0: CERN: FTS3 was down because the underlying DBOD database failed; now back in production.
    • T1: NTR
    • Services: NTR

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI: ntr
  • KIT:
  • NDGF: ntr
  • NL-T1:
    • downtime this Wed to migrate CREAM CEs
  • OSG:
    • quiet weekend after ShellShock...
  • PIC: ntr
  • RAL:
    • downtime to fix DB RAID issue affecting SRM and stagers for ATLAS and ALICE
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services:
    • FTS-3 incident will be followed up
  • CERN storage services: ntr
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Andrea Sciabà, Joel Closier (LHCb), Hervé Rousseau (IT-DSS), Ben Jones (central services), Pablo Saiz (WLCG monitoring), Maarten Litmaath (ALICE)
  • remote: Dea-Han Kim (KISTI), Tommaso Boccali (CMS), Michael Ernst (BNL), Andrej Filipcic (ATLAS), John Kelly (RAL), Rolf Rumler (IN2P3-CC), Thomas Hartmann (KIT), Pepe Flix (PIC), Rob Quick (OSG), Jeremy Coles (GridPP)

Experiments round table:

  • ATLAS reports (raw view) -
    • Daily Activity overview
      • Panda-Rucio integration: the bulk of jobs was submitted "by hand". One problem was understood: the GUID was passed fully capitalized, which caused an issue; now fixed. It had not been understood before because the job used was "hacked", i.e. not really a production job. A list of real jobs is now being submitted, which we hope will get through in one or two days.
      • Panda-Rucio integration: "dev" APFs with a customized pilot pointing to the Rucio-Panda server have been set up for many queues.
      • Panda-Rucio integration: the "task" test is now also being tried.
      • Rucio "full chain": the staging needs to be tested extensively. David already tested it, but up to now it has "never really worked". Still to be tested: transfers from T1 to T2, staging from T1 (to T1), and that Panda works with the staging.
      • Derivation production framework: many submitted tasks are "broken", and there is actually no visible difference between the successful and the broken ones. To be checked with Andreu/GDP. Yesterday's talk from James Catmore explained that there are still some problems; to be followed up with them.
    • CentralService/T0/T1s
      • ntr except for yesterday's ADCR issue, which caused the loss of about 100,000 jobs

  • CMS reports (raw view) -
    • Production and analysis at full steam
    • Global Run ongoing, including tests of the Tier-0
    • No major issues, just one relevant ticket:
      • GGUS:109201: for the last 4 days the CMS SRM SLS has shown frequent problems. This seems to be due to overload caused by an ongoing transfer of several million streamer files from CASTORCMS to EOSCMS via FTS3.

  • ALICE -
    • CERN: fewer job failures thanks to improved parameters of the batch system queue and of the pilot code
    • NDGF: files accidentally deleted from tape are being replicated from CERN

Sites / Services round table:

  • ASGC:
  • BNL: ntr
  • CNAF:
  • FNAL:
  • GridPP: ntr
  • IN2P3: ntr
  • JINR:
  • KISTI: ntr
  • KIT: ntr
  • NDGF:
  • NL-T1: ATLAS filed several tickets about failing data transfers. We're looking into that, as it may be unrelated to the network upgrade. The downtime for the site lasts at least until Friday night.
  • OSG: Maintenance on 14/10; one of the two OSG BDIIs will be out of production. It should be transparent.
  • PIC: ntr
  • RAL: Last Monday there was a database outage which affected CASTOR. Next week there will be a one-hour intervention on CASTOR for ATLAS and ALICE. Gareth is in touch with the experiments.
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services:
    • FTS software upgrade 3.2.27 -> 3.2.28 on fts3.cern.ch, from 08:00 UTC on Monday 13th October; it should be transparent. Announced in the ITSSB. Includes a fix for the staging error - thank you, LHCb.
  • CERN storage services: a faulty disk server in CASTORCMS caused some CMS file transfers to fail; it has now been fixed.
  • Databases:
  • GGUS:
    • Test alarm for INFN-T1 still open after more than one week
  • Grid Monitoring:
    • In order to ensure a smooth transition to the new message brokers, the port number has to be changed in the FTS configuration. The change has already been performed for many servers; the ones listed below still need to be updated. Required change: set the port number to the new value "61113" in "/etc/fts3/fts-msg...conf" and restart the "fts-msg-bulk" service (see the sketch after the host list):
      • fts02.usatlas.bnl.gov (192.12.15.14)
      • fts3-node1-kit.gridka.de (192.108.45.59)
      • fts3-node2-kit.gridka.de (192.108.45.146)
      • ftshost05.cr.cnaf.infn.it (131.154.129.14)
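      • For illustration only, a minimal Python sketch of the change described above. It assumes a simple "KEY = value" config layout; since the config filename is truncated in the minutes, the path, key name and restart command used here are placeholders/assumptions to be confirmed against the actual FTS3 messaging setup on each host.

        # fts_msg_port_update.py -- hedged sketch of the broker port change described above.
        # Assumptions: CONFIG_PATH and BROKER_PORT_KEY are placeholders (the real filename
        # is truncated in the minutes); the service is restarted via the "service" command.
        import re
        import shutil
        import subprocess

        CONFIG_PATH = "/etc/fts3/fts-msg-example.conf"  # placeholder path, not the real filename
        BROKER_PORT_KEY = "PORT"                        # assumed key holding the broker port
        NEW_PORT = "61113"                              # new message-broker port from the minutes

        def update_port(path=CONFIG_PATH):
            """Back up the config, rewrite the broker port, restart fts-msg-bulk."""
            shutil.copy2(path, path + ".bak")           # keep a backup of the original file
            with open(path) as f:
                text = f.read()
            # Replace 'PORT = <old value>' (or 'PORT=<old value>') with the new port number.
            text = re.sub(r"^(%s\s*=\s*)\S+" % BROKER_PORT_KEY,
                          r"\g<1>" + NEW_PORT, text, flags=re.MULTILINE)
            with open(path, "w") as f:
                f.write(text)
            # Restart the messaging service so the new port takes effect.
            subprocess.check_call(["service", "fts-msg-bulk", "restart"])

        if __name__ == "__main__":
            update_port()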

  • MW Officer:

AOB:
