Week of 130415

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

VO Summaries of Site Usability | SIRs | Broadcasts | Operations Web
ALICE, ATLAS, CMS, LHCb | WLCG Service Incident Reports | Broadcast archive | Operations Web

General Information

General Information | GGUS Information | LHC Machine Information
CERN IT status board, WLCG Baseline Versions, WLCG Blogs | GgusInformation | Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: AndreaS, Jarka, Simone, MariaD
  • remote: Lisa (FNAL), Onno (NL-T1), Michael (BNL), Lucia (CNAF), Wei-Jen (ASGC), Xavier (KIT), Tiju (RAL), Kyle (OSG), Rolf (IN2P3), Rob (OSG)
Experiments round table:

  • ATLAS reports (raw view) -
    • T1s:
      • Some failures over the weekend when copying data to FZK-LCG2 (GGUS:93320). The problem went away by itself. The ticket can be closed; another one will be opened if the problem reappears.

  • CMS reports (raw view) -
    • No significant events in last days.
    • LHC / CMS
      • Rereconstruction of 2012 data is in the tails; the load at the T1 sites is small. User analysis continues at a constant pace.
    • CERN / central services and T0
      • ntr
    • Tier-1:
      • ntr
    • Tier-2:
      • ntr

  • ALICE -
    • CNAF: VOBOX host cert expired Sat afternoon, causing CNAF to get steadily drained of ALICE jobs since 04:30 today (GGUS:93319) [Lucia adds that they already requested a new certificate but the CA is late in delivering it]
    • KIT: concurrent jobs cap lowered to 2500 on Fri afternoon to avoid overloading the firewall with off-site SE accesses while the new Xrootd servers are being debugged

  • LHCb reports (raw view) -
    • Apologies for not attending due to LHCb Analysis and Software week
    • Still mainly user jobs
    • T0:
      • Still a problem with the SLS LFC sensor
    • Activities:
      • Re-stripping postponed until the beginning of May. It will start with the latest reprocessed data to take advantage of files still on disk; after that, jobs will trigger staging from tape. Tier1s should be prepared for a high staging rate at that point, in order to complete the re-stripping within 60 days:
Site          Total TB   Rate (MB/s)
CERN-RDST        259.4            50
CNAF-RDST        794.6           153
GRIDKA-RDST      643.5           124
IN2P3-RDST       696.8           134
PIC-RDST         201.4            39
RAL-RDST         575.8           111
SARA-RDST        537.7           104
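
The rates in the table appear consistent with spreading each site's tape volume evenly over the 60-day window, assuming decimal units (1 TB = 10^6 MB). The sketch below reproduces that arithmetic; the function and variable names are hypothetical and are not taken from any LHCb tool.

```python
# Sketch: average staging rate needed to read a given volume from tape
# within a fixed window. Assumes decimal units (1 TB = 1,000,000 MB);
# this unit convention is an assumption, not stated in the minutes.
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000


def required_rate_mb_s(total_tb, days=60):
    """Return the average rate in MB/s needed to stage total_tb in `days` days."""
    return total_tb * MB_PER_TB / (days * SECONDS_PER_DAY)


# Volumes per site in TB, copied from the table above.
volumes_tb = {
    "CERN-RDST": 259.4,
    "CNAF-RDST": 794.6,
    "GRIDKA-RDST": 643.5,
    "IN2P3-RDST": 696.8,
    "PIC-RDST": 201.4,
    "RAL-RDST": 575.8,
    "SARA-RDST": 537.7,
}

if __name__ == "__main__":
    for site, tb in volumes_tb.items():
        print(f"{site:12s} {tb:7.1f} TB  {required_rate_mb_s(tb):4.0f} MB/s")
```

These are sustained averages over the whole window; any downtime during the 60 days would raise the rate needed for the remainder.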

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NL-T1: reminder from NIKHEF about a network maintenance intervention this Thursday affecting both computing and storage services; the queues will be drained starting from Wednesday afternoon.
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr
  • Dashboards: ntr
AOB:

Thursday

Attendance:

  • local: AndreaS, Jarka, Simone, Xavier, Gavin, MariaD
  • remote: Xavier (KIT), David (CMS), John (RAL), Lucia (CNAF), Ronald (NL-T1), Rolf (IN2P3), Rob (OSG)
Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • SAM-ATLAS-PROD machine was down from 16 April 13:40 (https://cern.service-now.com/service-portal/view-incident.do?n=INC279368). The importance of the machine was set to 5 (very low), while it should have been >50; this was changed yesterday (Wed) around 14:00. Problem fixed at 10:30 on Thursday morning.
      • SARA-MATRIX ~2800 transfer failures: the certificate expired on Tuesday. GGUS:93353 solved: the issue was fixed the same day.
      • RAL-LCG2: still some transfer failures at low rate with "checksum mismatch". GGUS:93315 updated on Tuesday. Under investigation.

  • CMS reports (raw view) -
    • LHC / CMS
      • Rereconstruction of 2012 data is in the tails; the load at the T1s continues to tail off. Analysis load continues as usual.
    • CERN / central services and T0
      • GGUS:93372 CMS VOMS -- currently correcting some manually entered incomplete DNs.
    • Tier-1:
      • GGUS:93440 KIT -- black hole worker node due to CVMFS partition being marked read only -- resolved quickly.
    • Tier-2:
      • NTR

  • ALICE -
    • central services
      • a cleanup operation unexpectedly caused a very high I/O load on the AliEn DB for a few hours starting Tue mid morning, causing lots of jobs and services around the grid to time out
      • the cleanup was interrupted mid afternoon, after which the DB needed a few more hours to roll back
      • normal conditions were restored early evening
    • KIT: the new Xrootd servers look better now - thanks!
      • concurrent jobs cap raised to 10k on Wed afternoon
    • SARA: dCache Xrootd interface fixed for writing - thanks! (GGUS:93045)

Sites / Services round table:

  • ASGC: ntr
  • CNAF: ntr
  • IN2P3: ntr
  • KIT: tomorrow, emergency downtime from 0900LT to 1300LT to update the GPFS cluster; 900 TB of ATLAS data will be offline
  • NL-T1: NIKHEF is in downtime for network maintenance as previously announced; everything should be back by the end of the afternoon
  • PIC: ntr
  • RAL: next Wednesday all Oracle databases will be patched; the upgrade should be transparent, but in case of problems with the ATLAS 3D database, the ATLAS Frontier servers will fail over to use the IN2P3 3D database. Rolf confirms that no intervention is planned at IN2P3 at that time.
  • OSG: ntr
  • CERN batch and grid services: Next Tuesday the NFS shared storage used by myproxy.cern.ch to store proxy credentials will be moved to a new system. We expect to preserve read-only access during the intervention so that proxy certificate renewal should keep working, but it will not be possible to store or delete stored proxy certificates for a few minutes. https://itssb.web.cern.ch/planned-intervention/change-backend-storage-server-myproxycernch/23-04-2013
  • CERN storage services: last Tuesday from 1500 to 1540 there was a problem on the EOS head nodes for ATLAS and ALICE and the EOS service was unavailable
  • Dashboards: ntr
  • GGUS: GGUS Release next Wednesday, April 24 from 06:00 to 07:00 UTC with ALARM test round as usual. Reminder: As announced on https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek130304#Thursday and at the WLCG Ops. Coord. meeting last week, the GGUS host certificate will be renewed. This certificate is used for authentication purposes of SOAP and hence impacts all systems that consume GGUS web services. The new certificate is attached to the relevant tickets in Savannah:136227.
AOB:
Topic revision: r9 - 2013-04-18 - AndreaSciaba
 