Week of 150525

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Monday: Whit Monday holiday

  • The meeting will be held on Tuesday instead.

Tuesday

Attendance:

  • local: Maria Dimou (SCOD)
  • remote: Dmytro Karpenko (NDGF), Alexander (NL_T1), Daniele Bonacorsi (CMS), Felix Hung-Te Lee (ASGC), Sonia (CNAF), Kyle Gross (OSG), Sang Un Ahn (KISTI), David Cameron (ATLAS), Di Qinq (Triumf).

Experiments round table:

  • ATLAS reports (raw view) -
    • No bad effects seen from stress tests performed over the weekend (T0 export and data consolidation)
    • SARA datadisk full (30TB free), PIC also in trouble (110TB free). Will clean up and move files.
    • Alarm ticket opened for BNL FTS today, resolved quickly. Stager daemon had died again.
David added that a GGUS ALARM for BNL was opened this morning and solved very fast. The FTS daemon had died. Maria asked why this twiki contains information that appears as "ATLAS internal" in the ADCOperationsDailyReports2015 ATLAS twiki. David said that each CRC decides what to announce to this meeting or not.

  • CMS reports (raw view) -
    • Due to an overlapping CMS Computing meeting most likely the CRC will join in the early part of the call only - sorry
      • Please address feedback, questions or requests to Daniele Bonacorsi (CMS CRC this week) and Christoph Wissing
    • 2015 Production proceeding at good pace, focus mostly on RunII* and Upgrade* campaigns
      • large number of Digireco jobs running on the T2 level (10 sites stably contributing, others coming and going)
      • almost 600M evts produced, over the weekend 60-80k slots used
      • error rate comparable to the T1 sites
    • Tier-0 quite stable in preparation for Run-II collisions, only sporadic Storage Manager related issues being followed up
    • Just came on shift, did not check through all CMS GGUS tickets in detail. A quick look shows only Tier-2 related tickets over the last few days (and nothing of vital importance)

  • ALICE -
    • normal to high activity
    • CERN: team ticket GGUS:113858 opened Thu evening
      • CEs were publishing bad job numbers for a few hours
        • E.g. zero running and/or waiting
        • ALICE VOBOXes use those numbers to decide if more jobs can be submitted
        • by the time the problem was fixed, the number of jobs queued in LSF had risen to 14k
Nobody present.

  • LHCb - Nobody present. No report found in the ProductionOperationsWLCGdailyReports twiki.

Sites / Services round table:

  • ASGC: ntr
  • BNL: not connected
  • CNAF: ntr
  • FNAL: not connected
  • GridPP: not connected
  • IN2P3: not connected
  • JINR: not connected
  • KISTI: ntr
  • KIT: not connected
  • NDGF: (written report by Dmytro at the airport) DCSC (Copenhagen) is still suffering from the power outage at the weekend. The storage is OK, but the computing part is down. Should be fixed today. Tomorrow we will have a storage outage at NDGF at 13:00 CET, since the dCache head nodes have to be rebooted. 1 hour at most, hopefully less. In addition, the tape system in Oslo will require maintenance, so the data there will be unavailable for the whole working day. Hopefully, there are no unique replicas there.
  • NL-T1: ntr
  • NRC-KI: not connected
  • OSG: ntr
  • PIC: not connected
  • RAL: not connected
  • TRIUMF: ntr

  • CERN batch and grid services: Nobody present. No report.
  • CERN storage services: Nobody present. No report.
  • Databases: Nobody present. No report.
  • GGUS: Nobody present. No report.
  • Grid Monitoring: Nobody present. No report.
  • MW Officer: This is given on Thursdays.

AOB:

Thursday

Attendance:

  • local: Maria Dimou (SCOD), Alexei Klimentov (ATLAS), Maarten Litmaath (ALICE), Andrea Manzi (MW Officer), Belinda Chan Kwok Cheong & Xavier Espinal Curull (CERN Data mgnt), Gavin McCance (CERN Grid Services).
  • remote: Felix Hung-Te Lee (ASGC), Lucia (CNAF), Kyle Gross (OSG), Di Qinq (Triumf), Lisa Giacchetti (FNAL), Michael Ernst (BNL), Dennis van Dok (NL_T1), Rolf Rumler (IN2P3), Thomas Hartmann (KIT), Vladimir Romanovski (LHCb)

Experiments round table:

  • ATLAS reports (raw view) -
    • Distributed Computing is running well
    • cosmic reprocessing for the Tile calorimeter started today (10 TB of data or so)
    • issue with CASTOR-SRM ATLAS (May 27th), status update will be appreciated
    • Problems with NDGF tape staging due to excessive load are being discussed between NDGF/dCache/FTS experts.
Comment by Xavi on the CASTOR SRM problem: The system was observed not to cope with the load induced last Monday/Tuesday. On Wednesday a 3rd node was added. The investigation continues, as it remains unknown whether the currently observed load is due to the stagein request/abort logic (too short expiration time). ATLAS was asked to open a GGUS ticket for this.

  • CMS - No report. No participation.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Operations dominated by MC jobs and user analysis; Preparation for data receiving
    • T0
Xavi reported that the certificate of the gridftp doors expired. The nodes are on quattor. No warning was sent. Now new certs are installed and the service has resumed.

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: since yesterday CMS uses CNAF resources in multicore mode.
  • FNAL: ntr
  • GridPP: not connected
  • IN2P3: ntr
  • JINR: not connected
  • KISTI: not connected
  • KIT: ntr
  • NDGF: not connected
  • NL-T1: ntr
  • NRC-KI: not connected
  • OSG: ntr
  • PIC: not connected
  • RAL: ntr
  • TRIUMF: ntr

  • CERN batch and grid services: ntr
  • CERN storage services:
    • The EOS-ATLAS namespace was repaired and restarted successfully today 9-10am CEST. All went fine.
    • There was a successful CASTOR intervention for ATLAS last Tuesday, upgrade to v2.1.15-8.
    • CMS ticket GGUS:113967 reports transfer failures for exports from the T0 to T1s and T2s. Stalled gridftp processes were blocking transfer slots on a certain diskserver, so transfers piled up and eventually failed. The situation was cleaned up and, in parallel, is being looked at with the CASTOR developers.
  • Databases: nobody present
  • GGUS: nobody present
  • Grid Monitoring: nobody present
  • MW Officer: The dCache problem reported by ATLAS and observed at NDGF is due to an issue in gfal2 (used by FTS); a downgrade of gfal2 is planned on the testing FTS at RAL.

AOB:

Topic attachments
  • MB-May-15.pptx (r1, 2886.8 K), 2015-05-26 09:52, PabloSaiz
Topic revision: r18 - 2015-05-28 - MaartenLitmaath
 