Week of 140616

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Andrea Sciabà, Akos Hencz (CERN batch and grid services), Katarzyna Dziedziniewicz (CERN databases), Andrew McNab (LHCb), Felix Lee (ASGC), Xavier Espinal (CERN storage services), Maria Alandes, Maarten Litmaath (ALICE), Pablo Saiz (grid monitoring)
  • remote: Wahid Bhimji (ATLAS), Tommaso Boccali (CMS), Michael Ernst (BNL), Onno Zweers (NL-T1), Tiju Idiculla (RAL), Roger Oscarsson (NDGF), Rolf Rumler (IN2P3-CC), Lisa Giacchetti (FNAL), Rob Quick (OSG), Pepe Flix (PIC), Dmitry Nilsen (KIT)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • (from Saturday) GGUS malfunctioning from 4:12 to ~11:30: every ticket operation returned "No ticket or wrong ticket number." OK now.
    • T0/T1s

  • CMS reports (raw view) -
    • Very quiet period
    • The only real "event" was the GGUS unavailability. It did not create major problems, since we did not have to contact centres during the ~12-hour period (just lucky!). We contacted support@ggus.org and helpdesk@ggus.org, and that seemed to work. We are nevertheless wondering what the correct escalation procedure is in a case like this.
    • Apparently the CRC receives mails about SNOW incidents, but cannot see them on the web. A ticket was opened (INC:0574896), which the CRC cannot see either; no follow-up yet. Ivan Glushkov created the ticket and can escalate it if needed. The point is that the CMS CRC e-group should be granted the privileges to see these SNOW tickets.
    • T0
      • NTR
    • T1

  • ALICE -
    • CERN: most CEs unavailable for a few hours Sat evening (GGUS:106199). The problem is most likely due to ARGUS and it caused a sizable draining of slots.

  • LHCb reports (raw view) -
    • Main activity: user jobs continuing, the reprocessing campaign has restarted, and we expect many more MC jobs later this week
    • T0:
    • T1:
      • CNAF: still checking files after the downtime

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: no report
  • FNAL: ntr
  • GridPP: no report
  • IN2P3: reminded that there will be a full day outage tomorrow
  • JINR: no report
  • KISTI: no report
  • KIT: today we had a 30-minute downtime of the ATLAS SRM to perform a filesystem check
  • NDGF: there is a problem (probably at the network level) with the ATLAS dCache pool in Slovenia, causing stability issues
  • NL-T1: SARA-MATRIX is in downtime as planned to upgrade dCache and to perform firmware updates requested by a vendor. It should finish in time.
  • OSG: ntr
  • PIC: on Saturday there were SRM problems for CMS, apparently due to the CERN CA CRL having expired; they disappeared after several reboots. The cause is not yet understood and is still being investigated. The fetch-crl cron job was running; we need to verify whether it failed to download the CRL. It might be related to a dCache upgrade 5 days earlier.
  • RAL: tomorrow we will upgrade CASTOR for CMS between 10:00 and 14:00.
  • RRC-KI: ntr
  • TRIUMF: ntr
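
The PIC report above boils down to detecting an expired CA CRL. A minimal sketch of that check, assuming the "nextUpdate=" line printed by `openssl crl -in <file> -noout -nextupdate` (the helper name and usage are ours, not a site tool):

```python
from datetime import datetime

def crl_expired(nextupdate_line, now=None):
    """Return True if a CRL's nextUpdate time lies in the past.

    `nextupdate_line` is the line printed by
    `openssl crl -in <crl-file> -noout -nextupdate`,
    e.g. "nextUpdate=Jun 14 00:00:00 2014 GMT".
    """
    if now is None:
        now = datetime.utcnow()
    # Strip the "nextUpdate=" prefix and parse openssl's date format.
    date_str = nextupdate_line.split("=", 1)[1].strip()
    next_update = datetime.strptime(date_str, "%b %d %H:%M:%S %Y %Z")
    return next_update < now
```

A monitoring probe built around such a check would have flagged the stale CERN CA CRL before SRM requests started failing.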

  • CERN batch and grid services: about the CE issues reported by ALICE, the ARGUS service managers have been notified
  • CERN storage services: ntr
  • Databases:
    • today at 1530 there will be a rolling intervention on the integration database for LCG
    • tomorrow, from 0830 to 1330, a rolling intervention on the LHCb online DB, which will be available but degraded
    • proposed to CMS some dates for rolling interventions to apply the latest CPU patch:
      • cmsarc: Tuesday at 1400
      • CMSR: Wednesday at 1000
      • CMSONR and CMSONR ADG: Wednesday at 1400
  • GGUS:
    • GGUS was down from Friday 13th 19:50 UTC until Saturday 14th 09:17 UTC. The system did not recover from a DB connection problem. The on-call service was not notified about the outage because the Remedy server had not been correctly integrated into the on-call service; this has already been fixed.
  • Grid Monitoring: ntr
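
The GGUS outage above is a classic failure mode: a service that loses its DB connection and never re-establishes it. A hedged sketch of reconnect-with-backoff logic that avoids it (the `connect` callable and retry limits are illustrative assumptions, not GGUS code):

```python
import time

def reconnect_with_backoff(connect, attempts=5, base_delay=1.0):
    """Retry `connect` with exponential backoff; return the new connection,
    or re-raise the last error so an operator alarm can fire instead of the
    service hanging silently."""
    delay = base_delay
    last_error = None
    for _ in range(attempts):
        try:
            return connect()
        except Exception as exc:  # real code would catch the DB driver's error class
            last_error = exc
            time.sleep(delay)
            delay *= 2  # exponential backoff between attempts
    raise last_error
```

The key point is the final `raise`: when reconnection ultimately fails, the error must surface where the on-call notification chain can see it.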

AOB: none

Thursday

Attendance:

  • local: Akos (CERN batch and grid services), Andrew (LHCb), Felix (ASGC), Jan (CERN storage), Maarten (SCOD + ALICE), Maria A (WLCG)
  • remote: Antonio (CNAF), Dennis (NLT1), John (RAL), Michael (BNL), Oliver (CMS), Pepe (PIC), Rob (OSG), Roger (NDGF), Rolf (IN2P3), Sang-Un (KISTI)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • The DDM VOboxes are still undergoing a service intervention (much more stable than yesterday)
    • T0/T1s

  • CMS reports (raw view) -
    • No major issues, processing and production is continuing at high scales

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Main activity: User jobs continuing, test reprocessing done and preparing for more, larger MC productions running again
    • T0: NTR
    • T1: NTR

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF:
    • replication of the lost LHCb files is ongoing; in other respects the SE looks OK
  • FNAL:
  • GridPP:
  • IN2P3:
    • the network upgrade foreseen for this week's downtime had to be canceled in the end
    • services were unavailable for 20 min more than declared in the downtime, which led to SRM access failures for LHCb
    • the tape robot ran into a microcode issue and there was another HW problem
    • the MSS remained unavailable yesterday, but all looks OK now
  • JINR:
  • KISTI: ntr
  • KIT:
  • NDGF: ntr
  • NL-T1: ntr
  • OSG:
    • OSG would like to make Condor-C the default CE type around Oct-Nov, replacing the current GRAM-5 based CE
    • we first need to ensure SAM will be able to deal with the new CE type
    • Maria: we will add it as an action item in today's Ops Coordination meeting
    • Maarten: the matter was already reported for inclusion in the to-do list of the WLCG Monitoring Consolidation project
      • meetings usually are held every second week: agendas
  • PIC:
    • the CMS errors during the weekend were due to the fetch-crl cron job not working since the dCache upgrade last week; this has been fixed
    • it is not understood why other VOs were not affected
  • RAL:
    • lingering problems since Tuesday's CASTOR upgrade for CMS: Xrootd behaves differently in versions 2.1.13 and 2.1.14; being followed up with the CASTOR developers
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services: ntr
  • CERN storage services:
    • EOS-CMS dropped out of the AAA federation because the process that controls participation in the federation was absent; the monitoring did not raise an alarm about it. Both issues have been fixed.
    • EOS-ALICE SLC5 disk servers are suffering instabilities that may be due to a recent kernel upgrade; they have been set read-only while the matter is being investigated
      • those servers represent less than half of the capacity
      • they are on older HW that cannot be upgraded to SLC6
      • instead, they are being phased out gradually as new HW gets added
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Topic attachments

  • MB-June.pptx (pptx, 2861.7 K, 2014-06-16 12:16, PabloSaiz)
Topic revision: r12 - 2014-06-19 - MaartenLitmaath