Week of 140707

WLCG Operations Call details

At CERN the meeting room is 513 R-068.

For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
2. To have the system call you, click here

In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

The SCOD rota for the next few weeks is at ScodRota
General information about the WLCG Service can be accessed from the Operations Web

Monday: CANCELED because of the WLCG Collaboration Workshop in Barcelona

check out the presentations

Thursday

Attendance:

local: Andrea Sciabà, Alessandro Di Girolamo (ATLAS), Nacho Barrientos (IT-PES), Felix Lee (ASGC), Andrea Manzi (MW officer), Maria Dimou, Maria Alandes, Xavier Espinal (IT-DSS)
remote: Lucia Morgani (CNAF), Michael Ernst (BNL), Rolf Rumler (IN2P3-CC), David Group (NL-T1), Renato Santana (LHCb), Thomas Hartmann (KIT), Gareth Smith (RAL), Lisa Giacchetti (FNAL)

Experiments round table:

ATLAS reports (raw view) -
- Central Services:
  - FTS3 at CERN: poor performance observed compared with the RAL instance. Reported to the CERN FTS3 experts, who are looking into it.
- FZK tape system issue GGUS:106764: seems to be solved or close to a solution.
- RAL file transfer issue GGUS:106753 (timeouts in put done, under investigation)
- SARA-MATRIX: jobs fail while uploading their output to the storage, as if there were no space left on the device, although the site SRM information shows that space is available (GGUS:106748). A similar issue was already reported on Monday (GGUS:106672). David says that the cause has not yet been understood, but it might well be due to the dCache information provider.
- TAIWAN file transfer issue GGUS:106736

CMS reports (raw view) -
- (Giuseppe) Due to a colliding meeting, I cannot attend this WLCG Operations Call (sorry!)
- No major issues; processing and production are continuing at high scale
  - CSA14 exercise ongoing: kick-off on Monday 7th
- Smooth update of CMSWEB production service on Tuesday 8th
- T0
  - NTR
- T1
  - GGUS:106763 SAM CE and SAM SRM problems at T1_TW_ASGC

ALICE -
- Last Friday several sites (including CERN) reported that some ALICE jobs created large "cint" (ROOT interpreter) files of up to several GB in /tmp instead of the job's work area. They were found to come from one user, who has since been instructed to avoid such usage on the grid.

LHCb reports (raw view) -
- Mainly MC (80%) and user (19%) jobs, with no critical problems.
- T0: NTR
- T1: NTR

Sites / Services round table:

ASGC: There is a serious problem with CASTOR and DPM due to high load; the site is trying to improve the situation.
BNL: ntr
CNAF:
FNAL: ntr
GridPP:
IN2P3: ntr
JINR:
KISTI:
KIT: ntr
NDGF:
NL-T1:
- There is scheduled network maintenance at SARA-MATRIX on July 23rd between 10.00 and 12.00 CEST. This may result in transient connectivity issues and termination of ongoing data streams.
- The GridFTP time-out issues (GGUS:106397) have been resolved locally. Following the connection to LHCONE, traffic to selected T2s was directed over these links, but some of the T2s were not jumbo-frame capable and the router at Nikhef facing LHCONE/LHCOPN was not configured to send 'fragmentation-needed' responses back. This has now been fixed. The remaining issues (for four T2 sites) over the public internet are due to the remote end not fragmenting correctly, and we suggest opening tickets against the other T2s as and when needed.
OSG:
PIC:
RAL: last Tuesday we upgraded the ATLAS CASTOR instance, which came fully back only the next morning. We are aware of some remaining issues. It was the last instance still at version 2.1.14.
RRC-KI:
TRIUMF:
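
The NL-T1 report on GGUS:106397 describes a classic path-MTU problem: large packets are silently lost when a router on the path does not return a 'fragmentation-needed' response. The following is only a toy model of that mechanism, not a description of the actual Nikhef configuration. The function names and the MTU values (9000 for jumbo frames, 1500 for standard Ethernet) are assumptions made for illustration.

```python
def deliver(packet_size, path_mtus, router_sends_icmp):
    """Try to push one unfragmentable packet across a path of link MTUs."""
    for mtu in path_mtus:
        if packet_size > mtu:
            if router_sends_icmp:
                # The router tells the sender which size would fit.
                return ("frag_needed", mtu)
            # No response: the packet simply disappears.
            return ("blackholed", None)
    return ("delivered", None)


def send_with_pmtud(packet_size, path_mtus, router_sends_icmp, max_tries=5):
    """Return the packet size that finally gets through, or None on time-out."""
    size = packet_size
    for _ in range(max_tries):
        status, mtu = deliver(size, path_mtus, router_sends_icmp)
        if status == "delivered":
            return size
        if status == "frag_needed":
            size = mtu  # shrink to the advertised MTU and retry
            continue
        return None  # sender never learns why, so the transfer times out
    return None


# Jumbo-frame sender, remote site limited to standard frames (assumed values).
path = [9000, 1500]
with_icmp = send_with_pmtud(9000, path, router_sends_icmp=True)
without_icmp = send_with_pmtud(9000, path, router_sends_icmp=False)
small_packet = send_with_pmtud(1400, path, router_sends_icmp=False)
```

In this model the sender recovers only when the router reports the smaller MTU, while small packets pass in both cases. That matches the symptom of transfers that connect but later time out.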

CERN batch and grid services: ntr
CERN storage services: yesterday we deployed new keys on all CASTOR servers; it was fully transparent.
Databases:
GGUS:
- Tuesday 2014/07/08: A short (few minutes) database connection problem around 10am. The developers assume that it was caused by database tests performed at that time. It produced a message like 'No ticket or wrong ticket number.' when someone tried to access the ticket-info page.
- Wednesday 2014/07/16: GGUS summer release. Highlights:
  - ALARM testing as usual.
  - GGUS host certificate change. It is used to sign ALARM email notifications. Normally transparent due to no DN/CA change. This was announced on 2014/06/05 at the WLCG Ops Coord meeting.
  - Decommissioning of GGUS ticket submission by email. No change for GGUS ticket updates by email. This was announced on 2014/05/08 here and multiple times in all WLCG Ops Coord fora. The gSOAP interface is still available.

Grid Monitoring:
MW Officer:
- there's an issue with delegation in CANL and Gridsite affecting FTS3; as it's not easy to fix in the libraries, a new version of FTS3 with a workaround will be released next week.
- About 50% of sites still have versions of the CVMFS client older than 2.1.19. Next week we will open tickets against them.
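
Selecting the sites that run a CVMFS client older than 2.1.19 requires a numeric comparison of the version components, because a plain string comparison would rank 2.1.9 above 2.1.19. A minimal sketch follows, assuming versions are given as dotted integers. The site names and versions are made up for illustration and do not come from any real inventory.

```python
def parse_version(text):
    """Turn '2.1.19' into (2, 1, 19) so tuples compare numerically."""
    return tuple(int(part) for part in text.split("."))


MINIMUM = parse_version("2.1.19")


def needs_ticket(client_version):
    """True if the client is older than the minimum version."""
    return parse_version(client_version) < MINIMUM


# Hypothetical inventory, for illustration only.
sites = {
    "SITE-A": "2.1.15",
    "SITE-B": "2.1.19",
    "SITE-C": "2.0.20",
    "SITE-D": "2.1.9",
}

outdated = sorted(name for name, version in sites.items() if needs_ticket(version))
print(outdated)
```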

AOB:

Topic revision: r10 - 2014-07-11 - AndreaSciaba
