Week of 150406

WLCG Operations Call details

At CERN the meeting room is 513 R-068.

For remote participation we use the Vidyo system. Instructions can be found here.

General Information

The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
The SCOD rota for the next few weeks is at ScodRota.
General information about the WLCG Service can be accessed from the Operations Web.
Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Monday: Easter Monday holiday

The meeting will be held on Tuesday instead.

Tuesday

Attendance:

local: Maarten (SCOD + ALICE)
remote: Alessandro (ATLAS), Christoph (CMS), Di (TRIUMF), Jeremy (GridPP), Kyle (OSG), Lisa (FNAL), Onno (NLT1), Pepe (PIC), Rolf (IN2P3), Tiju (RAL), Ulf (NDGF)

Experiments round table:

ATLAS reports (raw view) -
- Central Services
  - CERN-PROD GGUS:112835: one or more files are not being staged within the timeout (which is now 2 days). At this point these files should be checked; it is not normal to wait this long for files to be prestaged.

CMS -
- Sorry, nobody could join the call today.
- Nothing to report

ALICE -
- High activity, except between Sun ~19:00 and Mon ~07:00 CEST

LHCb reports (raw view) -
- No report

Sites / Services round table:

ASGC:
BNL:
CNAF:
FNAL:
- downtime this Wed 06:30-19:00 local time for power maintenance affecting the building hosting the WN
GridPP: ntr
IN2P3: ntr
JINR:
KISTI:
KIT:
NDGF:
- received a lot of ATLAS data OK over the weekend
NL-T1:
- congratulations on the first beams in the LHC!
NRC-KI:
OSG: ntr
PIC:
- last week there was a 2-day downtime for a UPS upgrade, which went NOT OK:
  - a major electrical fault occurred towards the end of the maintenance, which had to be extended by 4h
  - a DDN system hosting ~100 TB then suffered a RAID controller misconfiguration, which took ~10h to solve
  - a broken climatization unit affected two thirds of the CPU capacity, which had to be kept switched off over Easter
  - at the moment the T1 is running with 50% of the CPU
  - more blades will be moved to the main computer room to recover part of the missing capacity
  - the new UPS is suspected to have been installed incorrectly; if so, that will have to be fixed in the near future
RAL:
- quiet Easter
- TRIUMF-RAL network issue under investigation
- CASTOR downtime for upgrade tomorrow
RRC-KI:
TRIUMF: ntr

CERN batch and grid services:
CERN storage services:
Databases:
GGUS:
Grid Monitoring:
MW Officer:

AOB:

Thursday

Attendance:

local: Maarten (SCOD + ALICE), Steve (grid services)
remote: Alessandro (ATLAS), Dennis (NLT1), Di (TRIUMF), Felix (ASGC), John (RAL), Matteo (CNAF), Rob (OSG), Rolf (IN2P3), Sang-Un (KISTI), Ulf (NDGF), Zoltan (LHCb)

Experiments round table:

ATLAS reports (raw view) -
- Central Services
  - any news from RAL? it seems from GOCDB that the downtime is over, is everything ok?
    - Alessandro: can we use the FTS-3 instances again?
    - John: AFAIK it was proposed that ATLAS wait until Monday; will be followed up offline
    - Alessandro: OK, we keep using CERN and BNL for now

CMS reports (raw view) -
- No report

ALICE -
- NTR

LHCb reports (raw view) -
- User and MC jobs are running in the system. The failure rate of the jobs is very low.
- No issues to report

Sites / Services round table:

ASGC: ntr
BNL:
CNAF: ntr
FNAL:
GridPP:
IN2P3: ntr
JINR:
KISTI: ntr
KIT:
NDGF:
- Downtime for storage 11-12 UTC on Monday due to head node reboots to get new kernels.
- We got an ATLAS ticket complaining about lack of space: they had filled the space token for MCTAPE. We increased it.
NL-T1: ntr
NRC-KI:
OSG:
- can ATLAS have a look at GGUS:112470 about SRM failures?
  - Alessandro: OK
- there appears to be an issue with the latest monthly accounting report, possibly related to the introduction of multi-core records; this matter is being looked into
PIC:
RAL:
- yesterday there was a scheduled downtime foreseen for a CASTOR upgrade
- it was a good opportunity to reboot a switch as well
- that unexpectedly caused a lot of network problems, which took until this morning to resolve
- the CASTOR upgrade had to be canceled; a new date will be agreed
TRIUMF: ntr

CERN batch and grid services: ntr
CERN storage services:
Databases:
GGUS:
Grid Monitoring:
MW Officer:

AOB:

-- AndreaSciaba - 2015-02-27

Topic revision: r7 - 2015-04-09 - MaartenLitmaath
