Week of 130722

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site, LHC Page 1


Monday

Attendance:

  • local: AndreaS, Steve, Maarten, Ken/CMS, Eddie, Massimo
  • remote: Alexander/NL-T1, Michael/BNL, Peter/ATLAS, Xavier/KIT, Saverio/CNAF, Boris/NDGF, Nadia/IN2P3-CC, Kyle/OSG, Tiju/RAL, Lisa/FNAL, Wei-Jen/ASGC, Sang-Un/KISTI

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • NTR
    • T1
      • BNL file recovery progressing
      • RAL disk server problem. All files successfully recovered.
      • SARA: transfer errors ("Failed to contact on remote SRM"), GGUS:95911. Fixed.

  • CMS reports (raw view) -
    • Continuing 2011 legacy rereco activity and some Upgrade MC generation, everything pretty quiet
    • From last week, RAL GGUS:95825 was declared a Castor bug.
    • Castor T1TRANSFER service was degraded again over the weekend, but this is still not a real problem; it probably deserves some discussion with CMS as to what the shifter instructions should be. [Massimo: the problem is related to an increasing mismatch between the CASTOR tuning and the evolution of the CMS usage patterns; it would be good to have a meeting to discuss these issues.]
    • GGUS:95887 to T1_FR_IN2P3 is about file read errors. Ticket came Friday, site started working on it today.
    • Wigner T-Systems 100 Gbit link was down for several hours over the weekend. [Maarten: Wigner vs. CERN computer centre should be transparent: do not worry too much about SLS errors unless they have a direct effect. Ken: will check if there was any impact on CMS.]

  • ALICE -
    • CERN: all CERN CEs started refusing proxies of ALICE and other VOs around 18:20 Sunday evening; fixed around 09:15 Monday morning (TEAM ticket GGUS:95914). [Maarten: the problem was most likely related to a bad ARGUS configuration; it was solved by Ulrich, but there are no details in the ticket.]

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3-CC: ntr
  • KIT: ntr, just a reminder that a downtime starts tomorrow, beginning with closing the queues, to update the SE for LHCb
  • KISTI: ntr
  • NDGF: SRM downtime tomorrow from 7 to 9, to update dCache and install a new host certificate; some disk pools will also be updated.
  • NL-T1: ntr
  • RAL: tomorrow, "at risk" intervention to the site firewall, followed from 10 to 16 by a CASTOR upgrade for LHCb and CMS
  • OSG: ntr
  • CERN batch and grid services: ntr
  • CERN storage services: just upgraded SRM for all experiments (apart from LHCb, which was upgraded last week). The upgrade should have been transparent.
  • Dashboards: ntr

AOB:

Thursday

Attendance:

  • local: Eddie, Eva, Luca M, Maarten, Marcin, Nilo, Steve
  • remote: Boris, Dennis, Federico, Gareth, Kyle, Lisa, Matteo, Michael, Nadia, Pavel, Peter, Sang-Un, Stefano, Wei-Jen

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • ATLR database outage yesterday evening (incident); looks fine now
        • Marcin:
          • since ~19:00 there was a problem with the NFS filer serving several DBs including ATLR
          • the fail-over mechanism was unable to move the load from one node to another
          • the affected DBs thus remained inaccessible
          • a priority-one case was opened with NetApp
          • the problem was solved shortly after midnight
          • there was no data loss
          • a HW problem was identified
          • spare parts are expected this Fri
        • Peter: are the DBs currently at risk?
        • Marcin: yes, fail-over is not possible at the moment
    • T1
      • All PRODDISK endpoints except CERN and BNL have been removed from ATLAS data management
      • FZK and DE T2 sites now configured to use FTS3

  • CMS reports (raw view) -
    • T1 disk/tape separation is getting done (RAL is the first site) and we are starting to nail down details of how to send user analysis jobs to T1s using pilots
    • Continuing 2011 legacy rereco activity and some Upgrade MC generation, everything pretty quiet
    • GGUS:95887 to T1_FR_IN2P3 about file read errors was solved by reconfiguring dCache there (more time allowed for dCache movers). Details not understood.
    • GGUS:96102: file read error at T1_UK_RAL. Site now looking into this.
    • nothing else worth mentioning

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Mostly MC productions ongoing, tail of reprocessing and restripping campaign
    • T0:
      • CERN: Issues with Oracle DBs
        • see ATLAS report
    • T1:

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF
    • ATLAS job failures due to lack of disk space being investigated
    • CMS file system maintenance, queues being drained, HammerCloud tests will fail; intervention should be finished tomorrow
  • FNAL - ntr
  • IN2P3 - ntr
  • KISTI
    • the second half of the 2013 CPU pledge will become available in Sep
  • KIT
    • still in scheduled downtime, no problems observed so far, expect to finish a bit earlier than announced
      • main firewall upgraded OK
      • dCache work ongoing
      • GPFS, CVMFS, Xrootd, FTS, ... done
  • NDGF
    • yesterday ATLAS tried writing to disks in Sweden for which SRM reported the wrong size, and the disks became full; the matter is being investigated
  • NL-T1 - ntr
  • OSG - ntr
  • RAL
    • CASTOR upgrade for CMS and LHCb went OK on Tue
    • generic instance (also used by ALICE) will be done next Tue
    • batch server problem being investigated

  • dashboards - ntr
  • databases
    • see ATLAS report
  • grid services
    • yesterday 4 out of 8 CREAM CEs were upgraded to EMI-3
  • storage - ntr

AOB:

Topic revision: r6 - 2013-07-25 - MaartenLitmaath
 