Week of 150413

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting and requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Monday

Attendance:

  • local: Alessandro (ATLAS), Belinda (storage), Lorena (databases), Maarten (SCOD + ALICE), Raja (LHCb)
  • remote: Alexey (PIC), Di (TRIUMF), Dimitri (KIT), Felix (ASGC), John (RAL), Lisa (FNAL), Onno (NLT1), Rob (OSG), Rolf (IN2P3), Sang-Un (KISTI)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • Nothing urgent to report: a couple of Squids were down at 2 Tier-1s; GGUS tickets were opened and the services were restarted.
      • the DDM load has not yet been moved back onto the FTS at RAL; this will be done in the next hours/days.
        • John: various experts will be occupied at CHEP
        • Alessandro: we are in communication via e-mail and this week would be OK for them; we will follow up offline

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • User and MC jobs are running in the system. The failure rate of the jobs is very low.
    • CERN & Tier-1s running fine
      • Problem of failing LHCb jobs solved promptly by RAL admins last Friday evening
    • FTS transfer problems to two Tier-2 sites - Grif (GGUS:112985), Nipne (GGUS:112044)
      • Alessandro: are the issues due to the FTS? if so, we would be interested
      • Raja: one looks due to the network, the other due to the local SE
    • VOMS-Admin issues waiting for a solution from the VOMS admins: GGUS:112282 and GGUS:112279
      • Maarten: the first ticket is waiting for a new release, I will "ping" the second

Sites / Services round table:

  • ASGC: ntr
  • BNL:
  • CNAF:
  • FNAL: ntr
  • GridPP:
  • IN2P3:
    • batch system outage for upgrade: draining starting on Sun Apr 26, remaining jobs killed Mon morning Apr 27, service back at noon
  • JINR:
  • KISTI: ntr
  • KIT: ntr
  • NDGF:
  • NL-T1:
    • last week the disk storage was expanded to the new pledge level
    • the computing cluster was expanded with 100 nodes of 24 cores each and 8 GB RAM per core
    • a Squid service went down over the weekend and was restarted OK
  • NRC-KI:
  • OSG:
    • issues with the latest monthly WLCG accounting report are being followed up; a few sites have been ticketed
  • PIC:
    • running with 70% of the CPU capacity; the remainder is waiting for full recovery of the power and cooling infrastructure
  • RAL:
    • the CASTOR upgrade to 2.1.14-15 that was canceled last week will happen this Wed
  • TRIUMF: ntr

  • CERN batch and grid services:
  • CERN storage services:
    • CASTOR-ATLAS upgrade to 2.1.15-3 tomorrow 09:00-12:00 (the other 3 experiments already have it)
  • Databases: ntr
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Alessandro (ATLAS), Andrea (MW Officer), Lorena (databases), Maarten (SCOD + ALICE), Raja (LHCb)
  • remote: Alexey (PIC), Dennis (NLT1), Di (TRIUMF), Felix (ASGC), Jens (NDGF), Kyle (OSG), Lisa (FNAL), Matteo (CNAF), Pavel (KIT), Rolf (IN2P3), Sang-Un (KISTI), Tiju (RAL), Tommaso (CMS)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • The CASTOR intervention yesterday went OK; the EOS namespace was also restarted (quota setting issue, now solved)
      • The RAL FTS3 (https://fts3-test.gridpp.rl.ac.uk:8446) was put back into preproduction; it serves 237 DDM endpoints, fts3.cern.ch serves 244 and fts.usatlas.bnl.gov serves 138.

  • CMS reports (raw view) -
    • all quiet, not much activity at the sites (waiting to launch RECO for the 2015 MC samples)
    • The EU redirector Xrootd.ba.infn.it had problems twice during the week: a daemon crashed due to "too many threads" (though apparently still below the configured limit). This is being investigated with the developers. It should have no effect on site readiness (it only produces a "Warning"), but for 2 sites (T2_HU_Budapest and T2_IN_TIFR) it became "Critical"; it is not clear why, investigating.
    • The quota limit on the EOS unmerged area was reached, since now also the T0 writes there. The situation is being understood and new quotas are being set.
    • Some sites are complaining about "hot files" being requested much more often than average. This is understood at the level of our software: the first pile-up file is always opened in order to read the metadata. We are trying to find solutions.
    • some small issues reported:
      • GGUS:113049 (CERN) the CASTOR T0CMS instance was in bad shape for ~24 hours, then recovered. It is not clear that we understood why.
      • GGUS:113036 T0->T1 tests were failing for a few days, which turned all the T1s red in the readiness; mostly corrected, apart from 2 days that are still red and will need another correction.
      • GGUS:112942 (CERN) ARGUS was not answering correctly for some DNs. This is being followed up; for the moment it was cured by adding servers, but the real issue is not understood (as far as I understand).

  • ALICE -
    • high activity

  • LHCb reports (raw view) -
    • Mostly user and MC jobs in the system.
    • FTS transfer problems to Nipne continuing (GGUS:112044)
    • VOMS-Admin issues waiting for a solution from the VOMS admins / developers: GGUS:112282 and GGUS:112279
    • Raja: thanks to RAL for the smooth CASTOR upgrade on Wed

Sites / Services round table:

  • ASGC: ntr
  • BNL:
  • CNAF: ntr
  • FNAL: ntr
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: ntr
  • NRC-KI:
  • OSG: ntr
  • PIC:
    • CMS opened a ticket against PIC because of slow response times experienced by jobs:
      • there was too much activity by CMS, running at 1.5 times their pledge plus doing many transfers etc.
      • the ticket was therefore bounced back to CMS support
  • RAL:
    • the CASTOR upgrade on Wed went OK
  • TRIUMF:
    • next Wed from 17:00 to 21:00 UTC dCache will be upgraded to 2.10.24 and other services will be upgraded too

  • CERN batch and grid services:
  • CERN storage services:
  • Databases: ntr
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

-- AndreaSciaba - 2015-02-27
