Week of 120723

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. Have the system call you via the usual link.
The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Massimo, Oliver, Xavier, Edward, Raja); remote(Michael, Saerda, Jhen-Wei, Paolo, Lisa, Rob, Dmitri, Marc, Onno, Jeremy)

Experiments round table:

  • CMS reports -
    • LHC machine / CMS detector
      • physics data taking
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • NTR
    • Other:
      • NTR

  • ALICE reports -
    • CERN: Alarm ticket GGUS:84419 was opened late Friday night because of continuous errors experienced by the CASTOR Xrootd redirector. Nothing was changed on the ALICE side and a reboot did not help. The problem went away by itself, came back for a while, and has now been gone for many hours. An intermittent network problem is suspected (our machine shows no conclusive evidence, though).

  • LHCb reports -
    • User analysis and reconstruction at T1s
    • MC production at T2s
    • Prompt reconstruction at CERN + Tier-1s
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • SARA: Missing files (GGUS:84328). More missing files found. Would like to understand what happened to the new set.
      • IN2P3: The batch system reports jobs running over the CPU limit every night at ~8-9:30 PM UTC. Investigation ongoing with Aresh (LHCb site contact). Waiting to understand the cause before opening a GGUS ticket.
Sites / Services round table:
  • ASGC: DPM HW problem. Now fixed
  • BNL: ntr
  • CNAF: ntr
  • FNAL: Downtimes on Thu and Fri (5:45 - 15:00 Chicago time). The first affects all the WM (running jobs will be killed and rescheduled, new jobs will pend); the rest of the site will be "at risk" (running on power generators). The second will affect tape: enough disk should be in place to absorb new data being written at the site; recall from tape will be unavailable.
  • IN2P3: Question for ALICE: the efficiency is going up; what did ALICE change?
  • KIT: Additional Squid (CMS request) will be deployed on Wednesday
  • NDGF: ntr
  • NLT1: ntr (Time stamps for the LHCb investigation received from Elisa)
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: ntr
  • Central Services: ntr
  • Dashboard: ntr

  • GGUS: (MariaDZ) ALARM drills for tomorrow's MB are attached at the end of this page.
AOB:

Tuesday

Attendance: local(Massimo, Pete, Peter, Edward, Raja, Philippe); remote(Michael, Saerda, Jhen-Wei, Paolo, Lisa, Xavier, Marc, Ronald, Tiju)

Experiments round table:

  • CMS reports -
    • LHC machine / CMS detector
      • physics data taking
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • NTR
    • Other:
      • Peter Kreuzer is CRC for the week, until next Tuesday.
      • He will be absent from the Daily WLCG Call tomorrow due to the visit of NASA astronauts to the CMS CC at 3 PM.

  • ALICE reports -
    • CERN: yesterday's CASTOR Xrootd redirector problem has not reappeared so far, since the net.ipv4.neigh.default.gc_threshN kernel parameters were increased considerably on that machine (see the sketch below).
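
As an illustration of the tuning mentioned above (a minimal sketch, not the actual CERN procedure; the values applied on the redirector are not recorded in these minutes), the parameters in question are the kernel's ARP neighbor-table garbage-collection thresholds. The snippet only reads the current settings and notes how they would typically be raised:

import os

# Read the current ARP neighbor-table GC thresholds
# (net.ipv4.neigh.default.gc_thresh1/2/3). The values applied on the CASTOR
# Xrootd redirector are not recorded in these minutes.
BASE = "/proc/sys/net/ipv4/neigh/default"
for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3"):
    path = os.path.join(BASE, name)
    try:
        with open(path) as f:
            print("net.ipv4.neigh.default.%s = %s" % (name, f.read().strip()))
    except IOError as err:
        print("could not read %s: %s" % (path, err))

# Raising the limits persistently would normally go through /etc/sysctl.conf,
# e.g. a line like "net.ipv4.neigh.default.gc_thresh3 = 8192" followed by
# "sysctl -p" (the value 8192 is purely illustrative).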

  • LHCb reports -
    • User analysis and reconstruction at T1s
    • MC production at T2s
    • Prompt reconstruction at CERN + Tier-1s
    • New GGUS (or RT) tickets
      • T0:
        • CERN: Backlog of jobs waiting to run
      • T1:
        • GridKa: (GGUS:84476) Mysterious problem with SRM. Appeared at 3 PM yesterday (23 July) and went away at 10 AM today (24 July). No action was taken by anyone. It mainly affected FTS transfers into GridKa, and some jobs were rescheduled.
Sites / Services round table:
  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: ntr
  • NLT1: ntr
  • RAL: ntr

  • CASTOR/EOS: ntr
  • Central Services: ntr
  • Dashboard: ntr
AOB:

Wednesday

Attendance: local(Massimo, Pete, Peter, Edward, Raja, Philippe, Eva); remote(Michael, Saerda, Lisa, Xavier, Pavel, Alexander, Tiju, Rob)

Experiments round table:

  • ATLAS reports -
    • EMI-2 CREAM-CE converting-to-lower-case bug (GGUS:84509): we request that sites be warned that update 2.1.0-1 will break them for ATLAS.

  • CMS reports -
    • LHC machine / CMS detector
      • physics data taking, with intermittent tests (effect on CMS of LHCb polarity-changes)
    • CERN / central services and T0
      • CMS had an issue with its Savannah-to-GGUS bridging mechanism, which was fixed by Günter Grein, thanks!
    • Tier-1/2:
      • NTR
    • Other:
      • NTR

  • LHCb reports -
    • User analysis and reconstruction at T1s
    • MC production at T2s
    • Prompt reconstruction at CERN + Tier-1s
    • New GGUS (or RT) tickets
    • T0:
    • T1:
    • Other :
      • Transfers from CERN->GridKa have been stopped since 11 AM. Investigation ongoing.
Sites / Services round table:

  • BNL: ntr
  • FNAL: we are in the first of the two downtimes. Tomorrow's will affect tape (essentially tape recalls)
  • KIT: ntr
  • NDGF: ntr
  • NLT1: ntr
  • RAL: Problem with ICE in upgrading the WMS (ticket to the developers: GGUS:84546)
  • OSG: yesterday's maintenance was OK (transparent)

  • CASTOR/EOS: Aug 9 10:00: Castor PUBLIC stager/SRM database will be migrated to generation 4 NAS storage.
  • Central Services: ntr. In response to questions from LHCb: the backlog observed yesterday was due to insufficient submission (empty slots on LSF). Regarding the long term of LSF, PES is investigating the current LSF limitations (5k hosts) together with Platform. Other solutions are also under consideration to tackle recurrent issues (slow response time and long "boot" time of the master nodes).
  • Data bases: ntr
  • Dashboard: ntr.
AOB:

Thursday

Attendance: local(Massimo, Pete, Edward, Raja, Steve, Marcin); remote(Peter, Michael, Saerda, Lisa, Marc, Pavel, Ronald, Garreth, Rob, Alexander, JhenWei, Gonzalo, Paolo)

Experiments round table:

  • CMS reports -
    • LHC machine / CMS detector
      • physics data taking, with intermittent tests (effect on CMS of LHCb polarity-changes)
    • CERN / central services and T0
      • Major CMS Tier-0 issue, actually 2 issues, reported and discussed in GGUS:84553
        • 13h45 - 15h45: CMS Tier-0 jobs were partially failing on the public 1nw queue, due to some machine crashes/reboots, and partially not running due to competing fair share with other users. CMS acknowledged that there was indeed an internal miscommunication about various workflows running simultaneously and competing with each other (cmst1 vs cmsprod). The jobs on the dedicated cmst0 farm were not affected.
        • 17:00 - 06:20: this time the number of running jobs on the dedicated cmst0 farm gradually went down to 1/5 of the expected number (800). The problem suddenly disappeared at 06:20. We would be pleased to find out what happened.
    • Tier-1/2:
      • CCIN2P3 (T1): the VOMS role "t1_access", used for special user access, is using 25% of CMS resources, while it is agreed it should be only 5%. Expecting an answer from the site admins in GGUS:84580
      • CCIN2P3 (T2): there seems to be an LFN-to-PFN mismatch, causing job crashes due to disk accesses on the T1 area rather than on the T2 area (see the illustrative sketch after this report). For the time being, the CMS MC production team has stopped sending data to the site. See all information in Savannah:30536
    • Other:
      • The central CMS PhEDEx team went through a site password renewal procedure, communicating to sites what they had to change in their local PhEDEx configuration via the standard CMS Computing Operations Hypernews forum. However, it appears that many sites failed to follow the instructions precisely, causing many local PhEDEx agents to be down and transfers to be stalled. Affected Tier-1s are KIT, IN2P3 and FNAL. Tickets are being sent to them right now. Tickets to Tier-2s will follow.
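
To make the LFN-to-PFN mismatch mentioned above concrete, here is a hedged sketch of how such a mapping works; the rules, paths and file name below are invented for illustration and are not the real CCIN2P3 configuration:

import re

# Hypothetical LFN-to-PFN rules in the spirit of a site-local file catalog.
RULES = [
    # (regex matched against the LFN, replacement producing the PFN)
    (r"^/store/(.*)", r"/dcache/t2-area/store/\1"),   # intended T2 mapping
    # A misconfigured rule like the following would instead route T2 jobs
    # to the T1 disk area, which is the kind of mismatch reported above:
    # (r"^/store/(.*)", r"/dcache/t1-area/store/\1"),
]

def lfn_to_pfn(lfn):
    """Return the PFN for the first matching rule, or None if nothing matches."""
    for pattern, result in RULES:
        if re.match(pattern, lfn):
            return re.sub(pattern, result, lfn)
    return None

print(lfn_to_pfn("/store/mc/example/file.root"))
# -> /dcache/t2-area/store/mc/example/file.root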

  • LHCb reports -
    • User analysis and reconstruction at T1s
    • MC production at T2s
    • Prompt reconstruction at CERN + Tier-1s
    • New GGUS (or RT) tickets
    • T0:
    • T1:
    • Other :
      • Transfers from CERN->GridKa (GGUS:84550) have improved. It is not yet understood why the throughput fell the way it did.
      • The DIRAC server hosting output sandboxes ran out of space, causing job failures. Firefighting ongoing; the immediate problem has been alleviated.
Sites / Services round table:
  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • IN2P3: ntr. They will get back to CMS on the PhEDEx issue
  • KIT: ntr
  • NDGF: ntr
  • NLT1: SARA has problems with one dCache pool: some existing files cannot be read, but new files can be received
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: ntr
  • Central Services:
    • LSF investigation continuing (more info on the ticket)
    • Transparent update of CVMFS (ATLAS) next Monday at 10:00. 1h at risk
  • Data bases: ntr
  • Dashboard: ntr
AOB:

Friday

Attendance: local(Massimo, Pete, Edward, Raja, Steve, Marcin); remote(Peter, Michael, Saerda, Lisa, Marc, Pavel, Ronald, Garreth, Rob, Alexander, JhenWei, Gonzalo, Paolo)

Experiments round table:

  • CMS reports -
    • LHC machine / CMS detector
      • physics data taking, with intermittent tests (effect on CMS of LHCb polarity-changes)
    • CERN / central services and T0
      • Follow up on the Tier-0 issue of Jul 25-26 (covered in GGUS:84553): it is now understood that the decreasing number of running jobs on the dedicated cmst0 farm was due to an LSF reconfiguration that triggered some awkward behavior of the scheduler (this has happened before; only 1-2 jobs per host get scheduled). We wonder whether one could implement sensors and alarms on the LSF side to react faster in such cases (a sketch of such a check follows this report), or whether the LSF team relies entirely on alarms sent by the experiments?
    • Tier-1/2:
      • CCIN2P3 (T2): LFN-to-PFN mismatch issue (Savannah:130536): still no solution, hence CMS is not submitting any jobs to the site.
    • Other:
      • News on the issue related to the central PhEDEx "password renewal procedure": CMS sent tickets to all sites that still had their agents down on the evening of Jul 26. We are now left with 10 sites that still need to re-configure their local PhEDEx agents, among them one T1 (CCIN2P3); see GGUS:84584
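
To illustrate the kind of "sensor" asked about above (a hedged sketch; get_running_jobs() is a placeholder and the thresholds are assumptions, not an existing LSF feature), a periodic check could compare the number of running Tier-0 jobs with the expected level and raise an alarm when it stays low:

# Hypothetical monitoring check: alarm when the number of running jobs on a
# dedicated farm stays well below the expected level for several readings.
EXPECTED_RUNNING = 800      # expected number of running cmst0 jobs (from the report)
ALARM_FRACTION = 0.5        # alarm below 50% of the expected level (illustrative)
CONSECUTIVE_CHECKS = 3      # require several low readings to avoid flapping

def get_running_jobs():
    # Placeholder: in practice this would query the batch system
    # (e.g. parse the output of an LSF job listing), which is not shown here.
    raise NotImplementedError

def should_alarm(history):
    """True if the last CONSECUTIVE_CHECKS readings are all below threshold."""
    recent = history[-CONSECUTIVE_CHECKS:]
    return (len(recent) == CONSECUTIVE_CHECKS and
            all(n < EXPECTED_RUNNING * ALARM_FRACTION for n in recent))

# Example with canned readings instead of real batch-system queries:
print(should_alarm([790, 390, 350, 160]))   # -> True
print(should_alarm([790, 800, 350, 160]))   # -> False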

  • ALICE reports -
    • This afternoon ALICE reached 50k concurrent jobs for the first time, in particular thanks to CERN increasing the cap on LSF jobs!

  • LHCb reports -
    • User analysis and reconstruction at T1s
    • MC production at T2s
    • Prompt reconstruction at CERN + Tier-1s
    • Validation productions for new versions of LHCb application software.
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • PIC: SE under load and very slow at streaming data to user jobs. These (user) jobs end up with very low efficiency and get killed by the DIRAC watchdog. User jobs are now limited to 150 at PIC. Ongoing consultation with the LHCb contact at PIC.
    • Other :
Sites / Services round table:
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: all tickets with CMS are either closed or waiting for information
  • KIT: LHCb asked about problems killing jobs: nothing pathological going on
  • NDGF: ntr
  • NLT1: Yesterday's problem is still there. A vendor call is open. Hope to get it fixed today (otherwise Monday)
  • RAL: ntr
  • OSG: bdii205.cern.ch not updated (BNL/FNAL info stale)

  • CASTOR/EOS: ntr
  • Central Services: comments from CMS received
  • Dashboard: ntr
AOB:

-- JamieShiers - 09-Jul-2012

Topic attachments
  • ggus-data.ppt (PowerPoint, 2532.5 K, r1, 2012-07-12 16:27, MariaDimou): GGUS slides for the 20120724 WLCG MB. ALARMs drilled up to 20120712 included.