Week of 140707

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday: CANCELED because of the WLCG Collaboration Workshop in Barcelona

Thursday

Attendance:

  • local: Andrea Sciabà, Alessandro Di Girolamo (ATLAS), Nacho Barrientos (IT-PES), Felix Lee (ASGC), Andrea Manzi (MW officer), Maria Dimou, Maria Alandes, Xavier Espinal (IT-DSS)
  • remote: Lucia Morgani (CNAF), Michael Ernst (BNL), Rolf Rumler (IN2P3-CC), David Group (NL-T1), Renato Santana (LHCb), Thomas Hartmann (KIT), Gareth Smith (RAL), Lisa Giacchetti (FNAL)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services:
      • FTS3 at CERN: poor performance observed compared to the RAL instance. Reported to the CERN FTS3 experts, who are looking into it.
    • FZK tape system issue GGUS:106764: seems to be solved or in the process of being solved.
    • RAL file transfer issue GGUS:106753 (timeouts in put done, under investigation)
    • SARA-MATRIX jobs failing while trying to upload their output to the storage, as if no space were left on the device, although the site SRM info shows that space is available (GGUS:106748). A similar issue was already reported on Monday (GGUS:106672). David says that the cause hasn't yet been understood, but it might well be due to the dCache information provider.
    • TAIWAN file transfer issue GGUS:106736

  • CMS reports (raw view) -
    • (Giuseppe) Due to a colliding meeting, I cannot attend this WLCG Operations Call (sorry!)
    • No major issues; processing and production are continuing at high scale
      • CSA14 exercise ongoing: kick-off on Monday 7th
    • Smooth update of CMSWEB production service on Tuesday 8th
    • T0
      • NTR
    • T1
      • GGUS:106763 SAM CE and SAM SRM problems at T1_TW_ASGC

  • ALICE -
    • last Fri several sites (including CERN) reported that some ALICE jobs created large "cint" (ROOT interpreter) files of up to several GB in /tmp instead of the job's work area: they were found to come from 1 user who has since been instructed to avoid such usage on the grid

  • LHCb reports (raw view) -
    • Mainly MC(80%) and User(19%) jobs, with no critical problems.
    • T0: NTR
    • T1: NTR

Sites / Services round table:

  • ASGC: There's a serious problem with CASTOR and DPM due to high load; the site is trying to improve the situation.
  • BNL: ntr
  • CNAF:
  • FNAL: ntr
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI:
  • KIT: ntr
  • NDGF:
  • NL-T1:
    • There is scheduled network maintenance at SARA-MATRIX on July 23rd between 10.00 and 12.00 CEST. This may result in transient connectivity issues and termination of ongoing data streams.
    • The GridFTP time-out issues (GGUS:106397) have been resolved locally. Following the connection to LHCONE, traffic to selected T2s was directed over these links, but some of the T2s were not jumbo-frame capable and the router at Nikhef facing LHCONE/LHCOPN was not configured to send 'fragmentation-needed' responses back. This has been fixed. The remaining issues (for four T2 sites) over the public internet are due to the remote end not fragmenting correctly, and we suggest opening tickets against the other T2s as and when needed.
  • OSG:
  • PIC:
  • RAL: last Tuesday we upgraded the ATLAS CASTOR instance, which came fully back only the next morning. We are aware of some issues left. It was the last instance still at version 2.1.14.
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services: ntr
  • CERN storage services: yesterday we deployed new keys on all CASTOR servers; it was fully transparent.
  • Databases:
  • GGUS:
    • Tuesday 2014/07/08: A short (few minutes) database connection problem occurred around 10am. The developers assume that it was caused by database tests being performed at that time. It caused a message like 'No ticket or wrong ticket number.' when someone tried to access the ticket-info page.
    • Wednesday 2014/07/16: GGUS summer release. Highlights:
      • ALARM testing as usual.
      • GGUS host certificate change. It is used to sign ALARM email notifications. Normally transparent due to no DN/CA change. This was announced on 2014/06/05 at the WLCG Ops Coord meeting.
      • Decommissioning of GGUS ticket submission by email. No change for GGUS ticket updates by email. This was announced on 2014/05/08 here and multiple times in all WLCG Ops Coord fora. The gSOAP interface is still available.

  • Grid Monitoring:
  • MW Officer:
    • there's an issue with delegation in CANL and Gridsite affecting FTS3; as it's not easy to fix in the libraries, a new version of FTS3 with a workaround will be released next week.
    • About 50% of sites still have versions of the CVMFS client older than 2.1.19. Next week we will open tickets against them.

AOB:

Topic revision: r10 - 2014-07-11 - AndreaSciaba