Week of 121112

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE ATLAS CMS LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Massimo, Rod, Mark, MariaD, MariaG, Maarten, Felix, Szymon, Ian);remote(Stefano CNAF, Roger and Ulf NDGF, Rolf IN2P3, Ronald and Alexander NLT1, Gonzalo PIC, Tiju RAL, Alain OSG, Dimitri KIT, Lisa FNAL).

Experiments round table:

  • ATLAS reports -
    • Report to WLCGOperationsMeetings
      • T0/T1
        • 70k lost files at NDGF in process of being recovered from other sites
        • DATADISK space problem at PIC, INFN-T1, FZK and SARA
          • suspended from T0 data distribution, but only DATADISK, not TAPE
          • caused mostly by reprocessing. Some intermediate datasets to be cleaned after validation

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Disk server failure and file loss reported Friday evening at 17:30. These were primarily streamer files from P5; files that had not yet been successfully repacked into RAW data files were resent from P5.
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • ALICE reports -
    • KIT: still investigating issues preventing normal job behavior; the SW provisioning was switched back to the shared SW area on Sat, but very few jobs show up in the MonALISA plots

  • LHCb reports -
    • Reprocessing at T1s and "attached" T2 sites
    • User analysis at T0/T1 sites
    • Prompt reconstruction at CERN + 4 attached T2s
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • GridKa:
        • A number of timeouts when transferring. Possibly only restricted to a few pool nodes? (GGUS:88425)
Sites / Services round table:
  • ASGC: ntr
  • CNAF: glitch on CE (failing due to a cert change on Saturday without restart). Now fixed.
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: File loss. Still looking into it. The site should prepare an incident report, which may also benefit other sites; it seems to be a problem in the HD firmware
  • NLT1: ntr
  • PIC: ntr
  • RAL: Over the weekend: Saturday, CRL problems on a CE; Sunday, ATLAS SRM DB overload. Tomorrow, short network intervention (in GOCDB)
  • OSG: BDII problem recovered (last Friday, 2h downtime)

  • CASTOR/EOS:
    • CMS incident: "timing" issue in a reinstall script (machine reinstalled itself after a routine reboot). Script being protected.
    • CASTOR --> EOS (ALICE): 12 M files left (out of 36 M). On track (deadline 30 Nov)
    • FYI: CASTOR PUBLIC upgrade this Wednesday (transparent; LHC instances untouched)
  • Data bases: On Saturday the Streams propagation (ATLAS) was blocked due to an overload on ADCR (co-hosted DB). The root cause is being investigated.
  • Dashboard: ntr
  • GGUS: A KIT intervention affecting GGUS availability is scheduled for 19-Nov 5:00-7:00 UTC (2h). Announcement at https://ggus.eu/pages/news_detail.php?ID=472
AOB:

Tuesday

Attendance: local(Ian, Ignacio, Ivan, Maarten, Mark, Rod, Wei-Jen);remote(Gareth, Lisa, Roger, Rolf, Ronald, Scott, Stefano, Xavier).

Experiments round table:

  • ATLAS reports -
    • T0/T1
      • 70k NDGF files from last week (confusion with the 4 new file servers, which failed but are recoverable)
      • T1 DATADISK situation eased a little, due to cleaning old data
    • Request to add perfSONAR hosts to GOCDB (see the lookup sketch below)
      • service types: net.perfSONAR.Bandwidth, net.perfSONAR.Latency
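
      The two service types above are how the perfSONAR hosts would be looked up once registered. Below is a minimal sketch of how a site could check what is currently registered, assuming the public GOCDB programmatic interface at https://goc.egi.eu/gocdbpi/public/ with its get_service_endpoint method; the base URL and XML field names are assumptions, not something confirmed in these minutes.

        # Hypothetical check of perfSONAR endpoints registered in GOCDB.
        # Assumes the public GOCDB PI and its XML layout; verify against the
        # actual GOCDB documentation before relying on it.
        import urllib.request
        import xml.etree.ElementTree as ET

        GOCDB_PI = "https://goc.egi.eu/gocdbpi/public/"  # assumed base URL

        def list_endpoints(service_type):
            # Query one service type and print the registered hosts and sites.
            url = (GOCDB_PI + "?method=get_service_endpoint"
                   "&service_type=" + service_type)
            with urllib.request.urlopen(url) as resp:
                root = ET.parse(resp).getroot()
            # Each SERVICE_ENDPOINT element is assumed to carry HOSTNAME/SITENAME.
            for ep in root.findall("SERVICE_ENDPOINT"):
                print(service_type, ep.findtext("HOSTNAME"), ep.findtext("SITENAME"))

        for st in ("net.perfSONAR.Bandwidth", "net.perfSONAR.Latency"):
            list_endpoints(st)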

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • ALICE reports -
    • KIT: job profile looking a lot better (already 2k out of 5k jobs showing up in MonALISA), but not understood why - nothing was changed on the ALICE side, as far as is known

  • LHCb reports -
    • Reprocessing at T1s and "attached" T2 sites
    • User analysis at T0/T1 sites
    • Prompt reconstruction at CERN + 4 attached T2s
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • GridKa:
        • Still having significant problems with FTS transfers to GridKa (GGUS:88425)
        • Had a large peak of FTS transfers to GridKa-RAW this morning - we wondered if the write access was suffering due to the tuning for staging?
          • Xavier: we noticed, but did not look into it further
          • Mark/Maarten: OK, let's have a closer look next time
Sites / Services round table:

  • ASGC - ntr
  • CNAF
    • around 11:30, during a routine intervention, the hypervisor of the LHCb VObox was accidentally rebooted. The machine restarted quickly and is now working fine. Apparently this did not cause any big issue for LHCb operations. We are not aware of any other service that suffered from this.
  • FNAL - ntr
  • IN2P3
    • reminder: there will be an outage from Dec 10 morning until Dec 12 noon to replace an external power supply cable with a new one from a separate electricity grid; all our Tier-1 services will be down, while remaining core services will depend on a local power generator
  • KIT - nta
  • NDGF
    • still recovering the crashed disk servers
  • NLT1 - ntr
  • OSG
    • currently in the regular maintenance window
    • sites can define their PerfSONAR bandwidth and latency endpoints in OIM now
  • RAL
    • router intervention this morning went OK

  • dashboards
    • yesterday a new version of the CMS Site Status Board was deployed, offering various enhancements and bug fixes
  • grid services
    • CREAM ce203.cern.ch developed a problem with job submission and has been put in downtime
    • the Argus service was upgraded to EMI-2 on SL6 to fix a problem with authentication failures
AOB:

Wednesday

Attendance: local(Massimo, Wei Jen, Mark, Julia, Luca, Ignacio, Ian); remote(audio application not working, again).

Experiments round table:

  • ATLAS reports -
    • Report to WLCGOperationsMeetings
      • ATLAS rep at GDB today
      • T0/T1
        • PIC and SARA have >100 TB of free DATADISK, so they are back in T0 data distribution
          • ASGC is down to 20 TB, so it was removed
        • INFN-T1 tape downtime tomorrow. They asked which data to pre-stage, but got no useful response from ATLAS.
          • Not much tape-resident data is needed for reprocessing; thanks for the warning, but we will live with the 5-hour downtime

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • T0 and T1 at CERN are suffering HammerCloud failures. A ticket was submitted; the failures seem to be concentrated on ce203.cern.ch
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • ALICE reports -
    • KIT: job profile looks mostly OK again, not yet understood why; still using the shared SW area for now

  • LHCb reports -
    • Reprocessing and Reconstruction ramping down while waiting for new Conditions DB
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • GridKa:
        • Still having significant problems with FTS transfers to GridKa. Now pointing towards overloaded storage using gridFTP2. (GGUS:88425)
      • SARA:
        • SARA_BUFFER seemed to go offline this morning for ~2 hours. No ticket was opened, as it was fixed without intervention from the VO side.
Sites / Services round table:
  • ASGC: Migration to EMI2 CEs
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: ntr
  • NLT1: The issue reported by LHCb corrected itself
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: ntr
  • Central Services: ce203 seems to be the origin of the CMS problem. After debugging, it is back in production (under observation)
  • Data bases: ntr
  • Dashboard: ntr
AOB:

Thursday

Attendance: local(Massimo, Wei Jen, Ignacio, Mark, Guido, Maarten); remote(Paolo, Lisa, Rolf, Marian, Roger, Ulf, Tiju, Ronald, Gonzalo, Ian).

Experiments round table:

  • ATLAS reports -
    • Report to WLCGOperationsMeetings
      • T0
      • T1
        • ASGC staging from tape giving errors. System overloaded; no improvement yet (GGUS:88490)
        • INFN-T1 tape downtime, no problem so far
      • T2
        • Problem with Sheffield SRM ("failed to contact remote SRM"; one of the disk servers was overloaded, now fixed) (GGUS:88508)
        • Problems with INFN-MILANO-ATLASC (still to be followed up); apparently the problem disappeared (GGUS:88506)

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • A CASTOR issue this morning stopped transfers from P5. We did not see it on the offline monitors because the number of active transfers seemed to stay constant. Recovered and the backlog promptly cleared.
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR
  • ALICE reports -
    • NTR

  • LHCb reports -
    • Reprocessing and Reconstruction ramping down while waiting for new Conditions DB
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • GridKa:
        • Still having significant problems with FTS transfers to GridKa. Now pointing towards overloaded storage using gridFTP2. (GGUS:88425)
Sites / Services round table:
  • ASGC: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: ntr
  • NLT1: ntr
  • PIC: ntr
  • RAL: ntr

  • CASTOR/EOS: Upgrade to CASTOR 2.1.13-6: we propose CMS & LHCb on Wed 21 (8:00 UTC and 13:00 UTC) and ALICE & ATLAS on Thu 22 (8:00 UTC and 13:00 UTC). The transparent upgrade is already in production on PUBLIC
  • Central Services: The LSF problem is not quite the same as the previous one. This time it looks like not all the slots are filled. The root cause might be the same.
AOB:

Friday

Attendance: local(Massimo, Wei Jen, Ignacio, Mark, Guido, Maarten, Eva); remote(Paolo, Lisa, Rolf, Elisabeth, Michael, Dimitrios, Roger, Ulf, Tiju, Alexander, Gonzalo, Ian, Christian).

Experiments round table:

  • ATLAS reports -
    • Report to WLCGOperationsMeetings
      • T0
        • LSF quotas and shares not respected. A reconfiguration during the night helped; the situation seems back to normal and the problem is understood (GGUS:88493)
        • Castoratlas: RAID controller broken on a server (GGUS:88528), content drained, problem fixed for ATLAS (a hardware ticket was opened for the broken server)
      • T1
        • TW staging from tape giving errors. System overloaded; no improvement yet: "[StatusOfBringOnlineRequest][SRM_FAILURE] Stager did not answer within the requested time" (GGUS:88490)
      • T2
        • nothing to report

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Problem with LSF overnight: only 400 jobs running for CMS. A reconfiguration fixed the issue.
    • Tier-1:
    • Tier-2:
      • NTR

  • LHCb reports -
    • Reprocessing and Reconstruction ramping down while waiting for new Conditions DB
      • LHCb identified a glitch in their monitoring (some job failures were not accounted correctly; being fixed)
    • New processing runs should be starting later today
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • GridKa:
        • Still having significant problems with FTS transfers to GridKa. Investigations still ongoing. (GGUS:88425)
Sites / Services round table:
  • ASGC:ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: KIT asks ATLAS and CMS to reduce their tape usage, as the disk caches are almost full.
  • NDGF: Two interventions on Tuesday (dCache head node and disk firmware). Already in GOCDB
  • NLT1: ntr
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: The 4 interventions on CASTOR were confirmed by the experiments (pending a final mail from ATLAS)
  • Central Services: ntr
  • Data bases: ntr
AOB:

-- JamieShiers - 18-Sep-2012
