Week of 130318

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The scod rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE ATLAS CMS LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: Jerome, Maarten, Maria D, Mark, Pablo, Xavi
  • remote: David, Kyle, Lisa, Michael, Onno, Paolo, Pepe, Rolf, Stephane, Thomas, Tiju, Wei-Jen, Xavier M

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • CERN-PROD: file transfer failures due to SECURITY_ERROR, seen in the week of 3-10 March, re-appeared on Wednesday. Transfer failures with this error were observed on Thursday and Friday; the transfers succeeded within one hour. No errors during the weekend. The reopened ALARM GGUS:92166 is without reply.
        • Maria: I have escalated that ticket again
        • Jerome: will ping colleagues
      • CERN FTS: there was a request for a new FTS channel RRC-KI-T1 -> CERN (GGUS:91755). FT over 72h during the weekend had 100% efficiency. Thanks.
    • T1s
      • TRIUMF: Bad connectivity between TRIUMF and SARA-NIKHEF on Friday (GGUS:92550). The network outage between SARA and TRIUMF was quickly resolved by resetting the BGP session. Thanks.
      • RAL: FT have been switched to FTS3. Many failures were observed, so FT were switched back to FTS2.
      • RAL: ATLASDATATAPE staging failures (GGUS:92571). Network glitch on Friday afternoon caused a number of tape servers to hang. Fixed. Thanks.

  • CMS reports (raw view) -
    • LHC / CMS
      • Rereconstruction of 2012 data progressing well -- processing the last of 4 datataking eras. Still utilizing all T1 resources for this
    • CERN / central services and T0
      • Working on reconfiguring for reprocessing
        • CASTOR disk pools are being cleaned; CMS will ask to move them to EOS
          • Xavi: which CASTOR pools?
          • David: t0export? have to check; by now there may just be a tail left
          • Xavi: CMS should still be using CASTOR for data recording and export to T1 sites; please send an e-mail describing the plans
          • David: will follow up
    • Tier-1:
      • Started to drain queues this morning at IN2P3 ahead of planned downtime.
    • Tier-2:
      • Moving some T1 workflows to larger T2's with xrootd input, due to high T1 use.
        • US T2's working well, expanding to German and Italian T2's.

  • ALICE -
    • CERN: again sporadically slow job submission to all CERN CEs since Thu evening (GGUS:92521 in progress, no news)
    • RAL: VOBOX stopped working on March 15 after an upgrade of "sudo" to a version with a buggy post-install script that made /etc/nsswitch.conf readable only by root (GGUS:92570, fixed)
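      For illustration only, a minimal sketch (not the procedure actually used at RAL, which is not described in the ticket summary above) of the kind of permission check involved: /etc/nsswitch.conf is conventionally world-readable (mode 0644), and a copy readable only by root breaks NSS lookups for non-root processes such as the VOBOX services.

        # check_nsswitch.py - hypothetical sketch: verify that /etc/nsswitch.conf
        # is world-readable and restore the conventional 0644 mode if it is not.
        import os
        import stat
        import sys

        PATH = "/etc/nsswitch.conf"

        mode = stat.S_IMODE(os.stat(PATH).st_mode)
        if mode & stat.S_IROTH:
            print("%s mode %o looks fine" % (PATH, mode))
            sys.exit(0)

        print("%s mode %o is not world-readable; NSS lookups by non-root "
              "processes will fail" % (PATH, mode))
        if os.geteuid() == 0:
            os.chmod(PATH, 0o644)  # restore the conventional permissions
            print("restored mode 0644")
        else:
            sys.exit("re-run as root to fix")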

  • LHCb reports (raw view) -
    • Mainly user jobs with some MC ongoing.
    • T0:
      • NTR
    • T1:
      • RAL: Ongoing issues with SetupProject which should hopefully be solved by a CVMFS upgrade

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • reminder: downtime all day tomorrow, draining the long queues as of tonight
  • KIT - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • dashboards - ntr
  • GGUS
    • Monthly Release on 2013/03/27 as usual (last Wednesday of the month). There will be NO ALARM tests this time; see Savannah:136545 for the reason. Basically, the release is minor; the list of development items is http://bit.ly/12nJ9ev .
  • grid services - ntr
  • storage - ntr

AOB:

  • (MariaD) The complete WLCG Service Report is now on tomorrow's MB agenda. There were 11 real ALARMs in 2 months. Thanks to the dashboard team for the plots and their interpretation slides.
  • the Alcatel Audioconf incident has been reported to the experts (INC:261646)
  • next meeting on Thu

Thursday

Attendance:

  • local: Alex B, Jerome, Maarten, Mark, Xavi
  • remote: David, Gareth, Lisa, Pepe, Rob, Rolf, Ronald, Salvatore, Stephane, Thomas, Wei-Jen, Xavier M

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • CERN-PROD ALARM (GGUS:92166): file transfer failures due to 'SECURITY_ERROR'. The last update from ATLAS, on the evening of the 18th, reported the same error. No reply.
        • Maarten: let's close that ticket if the problem currently is gone and open a new ticket if it comes back
    • T1s
      • RAL: All UK FT have been switched to FTS3. The failures from last week have not been reproduced yet
      • SARA: Still instabilities with SL6, due to environment variables not being overwritten correctly for SL6. Looks almost solved.

  • CMS reports (raw view) -
    • LHC / CMS
      • Rereconstruction of 2012 data progressing well -- processing the last of 4 datataking eras. Still utilizing all T1 resources for this
    • CERN / central services and T0
      • Working on reconfiguring for reprocessing
        • CASTOR disk space that is no longer needed is being cleaned; CMS will ask to move the servers to EOS (more will follow as we determine/reduce the CASTOR needs)
          • This involves T1Transfer, T0Streamer, T0Input, CMSProd, T0Export, Default and Archive. Most of these would be reduced, some to zero; Archive would be new.
          • Note we are not waiting on action from the CERN side here, but are taking our time with the cleanup; the detective work of understanding the leftover detritus of years of data taking of course takes time...
          • We are in the tails of deletion and cleaning on our side, looking at requesting changes in a few weeks' time.
          • Xavi: can we already drain some CASTOR disk servers and move them to EOS?
          • David: probably OK, to be followed up
          • Xavi: an LS1 planning meeting next week would be good
    • Tier-1:
      • GGUS:92607 against CNAF was resolved quickly -- data migration alarms were set off by a bug (on the CMS side) that injected nonexistent files into the CMS bookkeeping DBs.
    • Tier-2:
      • Continuing to move some T1 workflows to larger T2's with xrootd input, due to high T1 use.
        • US T2's working well, expanding to larger European T2's -- requests to enable xrootd fallback and accept the t1production role were issued last week

  • ALICE -
    • CERN: very bad efficiency over the last few days due to excessive job submission times (GGUS:92521) and failures, which caused CERN to get almost fully drained of ALICE jobs today; a slow, steady increase of jobs has been seen since ~noon, but the problems persist; thanks to IT-PES for working on this with top priority!
      • as of ~18:00 CET the situation has improved dramatically, but for unknown reasons
    • CERN: problems with migration of raw test data from P2 to CASTOR March 18-20; solved by the CASTOR team (switch to CASTOR-ALICE Xrootd redirector, bug found in CASTOR file logic, manual cleanups), thanks!
    • KIT: installing new Xrootd servers and redirectors, ~1.2 PB already migrated, thanks!

  • LHCb reports (raw view) -
    • Mainly user jobs with some MC ongoing.
    • T0:
      • NTR
    • T1:
      • RAL: Ongoing issues with SetupProject, which should hopefully be solved by a CVMFS upgrade - but should this be to 2.1.x, or is there a 2.0.x version available with the fix?
        • Maarten: RAL are very much involved in CVMFS, they will know what version to deploy
      • RAL: Occasional network glitches - experts are aware.

Sites / Services round table:

  • ASGC - ntr
  • CNAF
    • responding to EGI alert on WN kernel security update: preparing rollback option for gradual upgrades starting on Monday
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC
    • still seeing a small fraction of ATLAS jobs being multi-threaded while submitted to a single-threaded queue (see the sketch after this list)
  • RAL
    • we tried nicing jobs as a mitigation for the LHCb SetupProject issue, to no avail
    • there have been 2 network issues since Monday, both causing a partial break between the T1 and the rest of the site:
      • Tue around noon
      • Thu morning
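    Regarding the PIC observation above, a hypothetical sketch (not a tool mentioned at the meeting) of how a site could spot multi-threaded payloads in a single-threaded queue, by reading the Threads field of /proc/<pid>/status for a given pool account; the account name "atlasprd" below is an assumption:

      # threads_check.py - hypothetical sketch: list processes of a pool account
      # that run more than one thread, to spot multi-threaded payloads in a
      # single-threaded batch queue. Linux-only (reads /proc).
      import os
      import pwd
      import sys

      user = sys.argv[1] if len(sys.argv) > 1 else "atlasprd"  # hypothetical pool account
      uid = pwd.getpwnam(user).pw_uid

      for pid in filter(str.isdigit, os.listdir("/proc")):
          try:
              with open("/proc/%s/status" % pid) as f:
                  fields = dict(line.split(":\t", 1)
                                for line in f.read().splitlines() if ":\t" in line)
          except (IOError, OSError):
              continue  # the process exited in the meantime
          if int(fields.get("Uid", "0").split()[0]) != uid:
              continue
          threads = int(fields.get("Threads", "1"))
          if threads > 1:
              print("PID %s (%s): %d threads" % (pid, fields.get("Name", "?"), threads))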

  • dashboards - ntr
  • grid services - ntr
  • storage - ntr

AOB:

  • next meeting on Mon