Week of 131028

WLCG Operations Call details

To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site, LHC Page 1


Monday

Attendance:

  • local: AndreaS, Luca, Alberto, Felix, Maarten
  • remote: Vladimir/LHCb, Sang-Un/KISTI, Rolf/IN2P3-CC, Xavier/KIT, Michael/BNL, Salvatore/CNAF, Onno/NL-T1, Oli/CMS, Christian/NDGF, Tiju/RAL, Rob/OSG, Wahid/ATLAS, Pepe/PIC

Experiments round table:

  • CMS reports (raw view) -
    • T2_FR_GRIF_IRFU (WLCG name: IFRU): the site is being flagged as problematic by SAM tests although jobs are running fine:
      • One CE, node74.datagrid.cea.fr, simply disappeared from the SSB (GGUS:98341). Can the site support team please follow up? This may be related to the next issue.
      • The CEStatus of the second CE, lpnhe-cream.in2p3.fr, for the CMS queue differs between top-bdii.cern.ch and topbdii.grif.fr (GGUS:98418).
Andrea: inconsistencies in the CERN top BDII have often been seen lately.

Maarten: indeed, some BDII bugs were found and corrected. One remaining issue is not yet understood and not fixed, but it is usually remedied by a restart. Unfortunately the main developer is on holiday for 1-2 weeks. CERN is using the latest version, by the way.

  • ALICE -
    • CERN
      • because of persistent low efficiencies and high failure rates, job submission to SLC6 queues was disabled Thu last week
      • to be debugged at a smaller scale
      • meanwhile half of the SLC5 jobs will be made to use CVMFS instead of Torrent, to check if the use of CVMFS has anything to do with the matter
    • KIT
      • error rate has been highish due to known issues in CVMFS version on WN
      • looking forward to seeing 2.1.15 deployed...
Andrea: Xavier, when do you plan to update CVMFS?

Xavier: we are currently busy migrating to SL6; the CVMFS update will be done afterwards.

  • LHCb reports (raw view) -
    • Main activities are incremental stripping (T0/1) and simulation (T2)
    • T0: FTS3 Downtime
    • T1:
      • SARA: Downtime

Sites / Services round table:

  • ASGC: one DPM disk server has a high failure rate and impacts ATLAS datadisk, trying to recover.
  • BNL: ntr
  • CNAF: ntr
  • IN2P3-CC: will have a dCache downtime on Nov 12 to install a SHA-2 compliant version.
  • KIT: ntr
  • KISTI: ntr
  • NDGF: ntr
  • NL-T1: about the SARA downtime: cksum verifications found some corrupted files. We needed vendor help to remount a filesystem, but we are getting some I/O errors, so we will soon migrate the data off it (it will also go out of warranty within a month). The advantage of migrating files is that dCache will checksum all of them, and therefore we will spot any corrupted files. A list of such files will eventually be circulated.
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr
  • CERN FTS - CERN FTS3 was updated this morning to 3.1.33 as published on ITSSB. It is accepted this should have been published in GOCDB as well. Monitoring at https://fts3.cern.ch:8449/ will be available outside CERN shortly.
  • CERN storage services: we just upgraded CASTORCMS and CASTORALICE. The former is confirmed to be OK, for the latter it would be good if ALICE could test it. [Maarten: I will pass the message]. Tomorrow we'll upgrade CASTORLHCB and CASTORPUBLIC.
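
The checksum-based detection of corrupted files mentioned in the NL-T1 report can be illustrated with a minimal sketch. It assumes Adler-32 checksums (an assumption: the minutes do not say which algorithm is used at SARA) and a hypothetical catalogue mapping file paths to their expected checksums. The function names are illustrative and are not part of dCache.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as an 8-digit hex string."""
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so large files are read in chunks.
            checksum = zlib.adler32(chunk, checksum)
    return format(checksum & 0xFFFFFFFF, "08x")


def find_corrupted(catalogue):
    """Return the sorted paths whose current checksum differs from the catalogue.

    `catalogue` is a hypothetical dict {path: expected_hex_checksum}.
    Files that cannot be read (e.g. I/O errors) are reported as well.
    """
    corrupted = []
    for path, expected in catalogue.items():
        try:
            actual = adler32_of_file(path)
        except OSError:
            corrupted.append(str(path))
            continue
        if actual != expected.lower():
            corrupted.append(str(path))
    return sorted(corrupted)
```

Calling find_corrupted with a catalogue recorded before a migration would yield the kind of list of damaged files that the site intends to circulate.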

AOB:

Thursday

Attendance:

  • local: AndreaS, Maarten, Alessandro, Luca, Alberto
  • remote: Gareth/RAL, Jhen-Wei/ASGC, Burt/FNAL, Sang-Un/KISTI, Kyle/OSG, Michael/BNL, Ronald/NL-T1

Experiments round table:

Jhen-Wei explains that the problem is due to the disk server failure reported last Monday. Experts are trying to recover the data, but in the worst case about one million files and 140 TB would be lost. Waiting for (hopefully) good news from Taiwan. Andrea asks to produce a SIR.

  • ALICE -
    • CERN
      • SLC6 jobs: the efficiency climbed to very good levels on Tue at ~11:20 CET, for unknown reasons
        • the current job mix should be more demanding on I/O, if anything
        • the efficiencies varied between 70 and 90% during the last 24h and were significantly higher than the efficiency of the SLC5 jobs for a number of hours
      • SLC5 jobs: 1 of the 2 VOBOXes was switched to CVMFS on Tue morning
        • about half the SLC5 jobs were running with CVMFS, the rest with Torrent
        • Torrent jobs could no longer run as of yesterday due to an unexpected side effect of an AliEn update for CVMFS
          • this also affected other sites that had not yet fully migrated to CVMFS (RRC-KI, NIKHEF, ...)
          • Torrent jobs now use the previous version for the time being

Maarten suspects that the job failures and inefficiencies might be related to worker nodes or disk servers at Wigner; collecting more evidence will be necessary. Alessandro suggests that PES could report on the fraction of SLC6 resources at Wigner, at least to have an idea if they could produce a measurable effect on efficiency. All experiments are invited to share experiences, which would help in finding the cause of the problem.

  • LHCb reports (raw view) -
    • Main activities are incremental stripping (T0/1) and simulation (T2)
    • T0:
    • T1:
      • SARA: Downtime

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • FNAL: installing SLC6 on the worker nodes; it should be completed by the end of November. It took some time due to the need to puppetize everything.
  • KIT:
  • KISTI: ntr
  • NL-T1: the problem at SARA mentioned by LHCb is GGUS:98370
  • RAL: there will be a scheduled downtime on Tuesday-Wednesday for work on the power supplies; some services will be taken down (FTS-3 should keep running, apart possibly from a short interruption).
  • OSG: ntr
  • CERN batch and grid services: ntr
  • CERN storage: the upgrade of the CASTOR stagers went OK; on Monday we will upgrade the nameserver and it should be transparent.

AOB:

  • Ticket INC:422227 was opened for the Audioconf experts to look into why the standard "daily" meeting teleconference was suddenly deemed "not yet started" at 15:00 CET.
Topic revision: r11 - 2013-10-31 - MaartenLitmaath
 