Week of 140616

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Andrea Sciabà, Akos Hencz (CERN batch and grid services), Katarzyna Dziedziniewicz (CERN databases), Andrew McNab (LHCb), Felix Lee (ASGC), Xavier Espinal (CERN storage services), Maria Alandes, Maarten Litmaath (ALICE), Pablo Saiz (grid monitoring)
  • remote: Wahid Bhimji (ATLAS), Tommaso Boccali (CMS), Michael Ernst (BNL), Onno Zweers (NL-T1), Tiju Idiculla (RAL), Roger Oscarsson (NDGF), Rolf Rumler (IN2P3-CC), Lisa Giacchetti (FNAL), Rob Quick (OSG), Pepe Flix (PIC), Dmitry Nilsen (KIT)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • (from Saturday) GGUS malfunctioning from 4:12 to ~11:30: every ticket operation returned "No ticket or wrong ticket number." OK now.
    • T0/T1s

  • CMS reports (raw view) -
    • Very quiet period
    • The only real "event" was the GGUS unavailability. It did not create major problems, since we did not have to contact centres during the ~12-hour period (just lucky!). We contacted support@ggus.org and helpdesk@ggus.org, and that seemed to work. We are nevertheless wondering what the correct escalation procedure is in a case like this.
    • Apparently the CRC receives mails about SNOW incidents, but cannot see them on the web. A ticket was opened (INC:0574896), which the CRC cannot see either; no follow-up yet. Ivan Glushkov created the ticket and can escalate it if needed. The point is that the CMS CRC e-group should be granted the privileges to see these SNOW tickets.
    • T0
      • NTR
    • T1

  • ALICE -
    • CERN: most CEs unavailable for a few hours Sat evening (GGUS:106199). The problem is most likely due to ARGUS and it caused a sizable draining of slots.

  • LHCb reports (raw view) -
    • Main activity: user jobs continuing, the reprocessing campaign has restarted, and we expect many more MC jobs later this week
    • T0:
    • T1:
      • CNAF: still checking files after the downtime

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: no report
  • FNAL: ntr
  • GridPP: no report
  • IN2P3: reminded that there will be a full day outage tomorrow
  • JINR: no report
  • KISTI: no report
  • KIT: today we had a 30-minute downtime of the ATLAS SRM to perform a filesystem check
  • NDGF: there is a problem (probably at the network level) with the ATLAS dCache pool in Slovenia, causing stability issues
  • NL-T1: SARA-MATRIX is in downtime as planned to upgrade dCache and to perform firmware updates requested by a vendor. It should finish in time.
  • OSG: ntr
  • PIC: on Saturday there were SRM problems for CMS, apparently due to the CERN CA CRL having expired; they disappeared after several reboots. The cause is not yet understood and is still being investigated. The fetch-crl cron job was running; we need to verify whether it failed to download the CRL. It might be related to a dCache upgrade 5 days earlier.
  • RAL: tomorrow we will upgrade CASTOR for CMS between 10:00 and 14:00.
  • RRC-KI: ntr
  • TRIUMF: ntr
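
The PIC report above boils down to detecting an expired CA CRL. A minimal sketch of that check, assuming the "nextUpdate=" line printed by `openssl crl -in <file> -noout -nextupdate` (the helper name and usage are ours, not a site tool):

```python
from datetime import datetime

def crl_expired(nextupdate_line, now=None):
    """Return True if a CRL's nextUpdate time lies in the past.

    `nextupdate_line` is the line printed by
    `openssl crl -in <crl-file> -noout -nextupdate`,
    e.g. "nextUpdate=Jun 14 00:00:00 2014 GMT".
    """
    if now is None:
        now = datetime.utcnow()
    # Strip the "nextUpdate=" prefix and parse openssl's date format.
    date_str = nextupdate_line.split("=", 1)[1].strip()
    next_update = datetime.strptime(date_str, "%b %d %H:%M:%S %Y %Z")
    return next_update < now
```

A monitoring probe built around such a check would have flagged the stale CERN CA CRL before SRM requests started failing.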

  • CERN batch and grid services: about the CE issues reported by ALICE, the ARGUS service managers have been notified
  • CERN storage services: ntr
  • Databases:
    • today at 1530 there will be a rolling intervention on the integration database for LCG
    • tomorrow, from 0830 to 1330, a rolling intervention on the LHCb online DB, which will be available but degraded
    • proposed to CMS some dates for rolling interventions to apply the latest CPU patch:
      • cmsarc: Tuesday at 1400
      • CMSR: Wednesday at 1000
      • CMSONR and CMSONR ADG: Wednesday at 1400
  • GGUS:
    • GGUS was down from Friday 13th 19:50 UTC until Saturday 14th 09:17 UTC. The system did not recover from a DB connection problem. The on-call service was not notified about the outage because the Remedy server had not been correctly integrated into the on-call service; this has already been fixed.
  • Grid Monitoring: ntr
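
The GGUS outage above is a classic failure mode: a service that loses its DB connection and never re-establishes it. A hedged sketch of reconnect-with-backoff logic that avoids it (the `connect` callable and retry limits are illustrative assumptions, not GGUS code):

```python
import time

def reconnect_with_backoff(connect, attempts=5, base_delay=1.0):
    """Retry `connect` with exponential backoff; return the new connection,
    or re-raise the last error so an operator alarm can fire instead of the
    service hanging silently."""
    delay = base_delay
    last_error = None
    for _ in range(attempts):
        try:
            return connect()
        except Exception as exc:  # real code would catch the DB driver's error class
            last_error = exc
            time.sleep(delay)
            delay *= 2  # exponential backoff between attempts
    raise last_error
```

The key point is the final `raise`: when reconnection ultimately fails, the error must surface where the on-call notification chain can see it.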

AOB: none

Thursday

Attendance:

  • local: Akos (CERN batch and grid services), Andrew (LHCb), Felix (ASGC), Jan (CERN storage), Maarten (SCOD + ALICE), Maria A (WLCG)
  • remote: Antonio (CNAF), Dennis (NLT1), John (RAL), Michael (BNL), Oliver (CMS), Pepe (PIC), Rob (OSG), Roger (NDGF), Rolf (IN2P3), Sang-Un (KISTI)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • The DDM VOboxes are still undergoing a service intervention (much more stable than yesterday)
    • T0/T1s

  • CMS reports (raw view) -
    • No major issues, processing and production is continuing at high scales

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Main activity: User jobs continuing, test reprocessing done and preparing for more, larger MC productions running again
    • T0: NTR
    • T1: NTR

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF:
    • replication of the lost LHCb files is ongoing; in other respects the SE looks OK
  • FNAL:
  • GridPP:
  • IN2P3:
    • the network upgrade foreseen for this week's downtime had to be canceled in the end
    • services were unavailable for 20 min more than declared in the downtime, which led to SRM access failures for LHCb
    • the tape robot ran into a microcode issue and there was another HW problem
    • the MSS remained unavailable yesterday, but all looks OK now
  • JINR:
  • KISTI: ntr
  • KIT:
  • NDGF: ntr
  • NL-T1: ntr
  • OSG:
    • OSG would like to make Condor-C the default CE type around Oct-Nov, replacing the current GRAM-5 based CE
    • we first need to ensure SAM will be able to deal with the new CE type
    • Maria: we will add it as an action item in today's Ops Coordination meeting
    • Maarten: the matter was already reported for inclusion in the to-do list of the WLCG Monitoring Consolidation project
      • meetings usually are held every second week: agendas
  • PIC:
    • the CMS errors during the weekend were due to the fetch-crl cron job not working since the dCache upgrade last week; this has been fixed
    • it is not understood why other VOs were not affected
  • RAL:
    • lingering problems since Tuesday's CASTOR upgrade for CMS: Xrootd behaves differently in versions 2.1.13 and 2.1.14; being followed up with the CASTOR developers
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services: ntr
  • CERN storage services:
    • EOS-CMS dropped out of the AAA federation because the process that controls participation in the federation was absent; the monitoring did not raise an alarm about it. Both issues have been fixed.
    • EOS-ALICE SLC5 disk servers are suffering instabilities that may be due to a recent kernel upgrade; they have been set read-only while the matter is being investigated
      • those servers represent less than half of the capacity
      • they are on older HW that cannot be upgraded to SLC6
      • instead, they are being phased out gradually as new HW gets added
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Topic attachments

  • MB-June.pptx (pptx, 2861.7 K, 2014-06-16 12:16, PabloSaiz)
Topic revision: r12 - 2014-06-19 - MaartenLitmaath