Week of 140113
WLCG Operations Call details
- At CERN the meeting room is 513 R-068.
- For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
- Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
- To have the system call you, click here
- In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.
General Information
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Web
Monday
Attendance:
- local: Felix, Stefan, Alberto, Alessandro, Belinda, Andrej, Maarten, Eddie
- remote: Rolf, Michael, Lisa, David, Xavier, Kyle, Sang-Un, Tiju, Pepe, Salvatore, Christian
Experiments round table:
- ATLAS
- Central services/T0
- GGUS:100110 FTS3 pilot service will be put back (most probably tomorrow) for ATLAS Functional Tests
- Andrej Filipcic has joined the ADC operations team
- Asking CERN IT/PES whether Frontier squids can be run as a service
- Maarten: do you still see OpenSSL-related errors? Alessandro: for data transfers no; otherwise will get back to you
- CMS
- Running 2011 legacy MC rereco and 13 TeV MC at T1 sites
- GGUS:100198 Requesting access for FNAL to replicate CMS VOMS DB
- A few other GGUS tickets open to T2s and T3s
- LHCb reports
- Heavy ion reprocessing completed except for 2 files (events too complex, jobs timing out)
- MC and user jobs only: no tape recalls required, only disk access for user jobs and uploads from MC and MC-merging jobs
- T0:
- T1:
- SRM instability at IN2P3. The downtime is over, but the SE is still under scrutiny
- T2:
- Problems with CBPF storage: the number of concurrent transfers was limited to 4 (see the sketch below)
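For illustration, a minimal sketch of what such a cap amounts to on the submitting side: a semaphore with 4 slots queues any surplus transfers. The transfer function and file names are hypothetical stand-ins, not LHCb/DIRAC code.

```python
import threading
import time
import random

MAX_CONCURRENT = 4                        # cap agreed for the CBPF endpoint
slots = threading.Semaphore(MAX_CONCURRENT)

def transfer(name):
    """Hypothetical stand-in for one file transfer to the site."""
    with slots:                           # blocks until one of the 4 slots is free
        time.sleep(random.uniform(1, 3))  # simulate the copy itself
        print(f"{name} transferred")

threads = [threading.Thread(target=transfer, args=(f"file{i:03d}",))
           for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```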
Sites / Services round table:
- ASGC: NTR
- IN2P3: problems with dCache/SRM; the cause was found at the kernel level and a relevant parameter was changed, which should fix the problem. The recommendation is to upgrade to another kernel, which will be done ASAP. It would be useful to be able to limit FTS3 activity during such problems. Stefan: will contact the FTS3 team
- BNL: NTR
- FNAL: NTR
- KIT: NTR
- OSG: was there a BDII problem on Saturday at 7.30 am CT? Stefan: not aware of a problem. Maarten: could you open a ticket a posteriori? Kyle: yes, will do (a basic BDII availability probe is sketched after this list). OSG will not attend next Monday (local holiday)
- KISTI: NTR
- RAL: NTR
- PIC: next Thursday the FTS2 instance will be stopped for 3 hours because of a move to a new machine; already inserted in GOCDB and the experiments were informed. Alessandro: shall we take this opportunity to move to FTS3? Pepe: will get back to you
- CNAF: NTR. Alessandro: is there a plan to upgrade StoRM, and when? Salvatore: it was foreseen for tomorrow but has been postponed; no date is available yet
- NDGF: today's downtime finished fine; Wednesday morning some fixes on pool nodes, with a dCache upgrade at the same time
- Storage: restarted EOS ATLAS namespace this morning to help Rucio renaming
- Grid Services: WMS, GridSite and L&B to be updated soon to EMI3 update 12. Announcement coming soon.
- Monitoring: NTR
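Regarding the OSG BDII question above, a basic availability probe is an anonymous LDAP query against the top-level BDII. The sketch below uses the third-party ldap3 package (an assumption; any LDAP client works); the host, port 2170 and base o=grid follow the usual BDII conventions of the time.

```python
from ldap3 import Server, Connection, SUBTREE

# Standard top-level BDII endpoint and base (assumed; adjust for your infrastructure).
server = Server("lcg-bdii.cern.ch", port=2170)
conn = Connection(server, auto_bind=True)    # anonymous bind, as BDII allows

# Ask for a handful of service entries; an empty or failed result
# suggests the BDII was unreachable or not publishing.
ok = conn.search(search_base="o=grid",
                 search_filter="(objectClass=GlueService)",
                 search_scope=SUBTREE,
                 attributes=["GlueServiceEndpoint"],
                 size_limit=5)
print("BDII responding:", ok, "-", len(conn.entries), "entries")
conn.unbind()
```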
AOB:
Thursday
Attendance:
- local: Simone (SCOD), Maarten (Alice), Belinda (CERN Storage), Felix (ASGC), Alberto (CERN Grid Services), Eddie (CERN Dashboards)
- remote: Philippe (LHCb), Sang-Un (KISTI), Lisa (FNAL), Michael (BNL), Kyle (OSG), Matteo (CNAF), Jeremy (GridPP), Christian (NDGF), Tommaso (CMS), Rolf (IN2P3-CC), Xavier (KIT)
Experiments round table:
- ATLAS reports
- Central services/T0
- No problems seen with the FTS3 functional tests; today the ES cloud was included to use FTS3 for all ATLAS transfers
- Again observed high load on the ADCR database (central catalogue and PanDA queries), causing many site services daemons to time out (Wednesday after 17:15 CET). We were told that high load on the underlying Oracle storage was also observed; could this be the root cause? (A generic timeout/backoff sketch follows this report.)
- T1
- Staging errors at FZK-LCG2 solved after vendor intervention on tape robots
- Input file staging failures continue at INFN-T1 (GGUS:99982)
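As an aside on the timeout symptom above: a common client-side mitigation is bounded retries with exponential backoff and jitter, so stalled daemons do not re-query an overloaded database in lockstep. A generic sketch, not ATLAS site-services code; the names and limits are invented.

```python
import random
import time

def query_with_backoff(query, attempts=5, base_delay=2.0):
    """Run query(); on TimeoutError wait base_delay * 2**i plus jitter
    before retrying, so retry bursts do not pile onto a loaded DB."""
    for i in range(attempts):
        try:
            return query()
        except TimeoutError:
            if i == attempts - 1:
                raise                     # give up after the last attempt
            time.sleep(base_delay * 2 ** i + random.random())

# Hypothetical usage with a stand-in catalogue call:
# result = query_with_backoff(lambda: catalogue.list_datasets("mc12_8TeV"))
```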
- CMS reports
- Running 2011 legacy MC rereco and 13 TeV MC at T1 sites
- Not much to report, smooth operations.
- GGUS:100198 Requesting access for FNAL to replicate CMS VOMS DB [SOLVED]
- GGUS:100315: T2_UK_London_IC, problems with Proxy lifetime [NEW]
- As a note, CMS is starting to send jobs via CRAB also to those T1s that have completed DISK/TAPE separation.
- ALICE
- central services moved to new HW on Jan 14-15
- also more efficient power/UPS distribution and cooling
- Torrent no longer available
- CERN
- high error rates after central services restart, to be debugged
- LHCb reports
- Only MC and user jobs
- T0: NTR
- T1: NTR
- T2:
- problems of transfers to CBPF (FTS3) partially understood (gridftp not returning performance markers, so they were disabled), but transfers longer than ~3650 seconds still fail, although the timeout is 7200 s (see the watchdog sketch below)
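A plausible reading of the ~3650 s failures: with gridftp performance markers disabled, nothing refreshes the transfer's activity timestamp, so an inactivity watchdog (often around 3600 s, shorter than the 7200 s global timeout) fires first. A minimal sketch of that mechanism; the class, names and values are illustrative assumptions, not the FTS3 implementation.

```python
import time

GLOBAL_TIMEOUT = 7200   # overall transfer timeout from the report (s)
MARKER_TIMEOUT = 3600   # assumed no-activity cutoff once markers stop (s)

class MarkerWatchdog:
    """Abort a transfer when no performance marker has arrived for
    MARKER_TIMEOUT seconds, even if GLOBAL_TIMEOUT has not expired."""

    def __init__(self):
        self.start = time.monotonic()
        self.last_marker = self.start

    def on_marker(self):
        self.last_marker = time.monotonic()   # progress reported by the server

    def should_abort(self):
        now = time.monotonic()
        stalled = now - self.last_marker > MARKER_TIMEOUT
        too_long = now - self.start > GLOBAL_TIMEOUT
        return stalled or too_long

# With markers disabled, on_marker() is never called, so should_abort()
# becomes true after MARKER_TIMEOUT (~3600 s), matching the observed failures.
```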
Sites / Services round table:
- OSG: the BDII ticket discussed on Monday was opened (GGUS:100263), but the issue may be hard to debug because the problem fixed itself (disappeared)
- CERN Storage: on Tuesday EOSATLAS had to be put in R/O mode for 1 hour following an accidental reboot. There is also a current EOS LHCb SRM problem which is being looked at.
- Grid Services: CERN's WMS to be updated to EMI3 update 12 on 21 January. ITSSB entry
AOB:
--
SimoneCampana - 16 Dec 2013