Week of 160926


WLCG Operations Call details

At CERN the meeting room is 513 R-068.

For remote participation we use the Vidyo system. Instructions can be found here.

General Information

The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
The SCOD rota for the next few weeks is at ScodRota
General information about the WLCG Service can be accessed from the Operations Web
Whenever a particular topic that requires information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:

- A Tier-1 should check the downtimes calendar to see whether another Tier-1 already has an "outage" downtime in the desired time slot.
- If there is a conflict, another time slot should be chosen.
- If stronger constraints prevent choosing another time slot, the Tier-1 will point out the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
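The conflict check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool used by the SCOD or the Tier-1 sites: the `Downtime` record, the `find_outage_conflicts` function, the site names `T1_A` to `T1_D` and all dates are hypothetical, and in practice the downtime entries would come from the downtimes calendar rather than being typed in by hand.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class Downtime:
    """Hypothetical record for one declared downtime."""
    site: str
    vos: frozenset       # VOs supported by the site
    severity: str        # assumed values: "outage" or "warning"
    start: datetime
    end: datetime


def find_outage_conflicts(downtimes):
    """Return (a, b, shared_vos) for every pair of "outage" downtimes at
    different sites that overlap in time and share at least one VO."""
    outages = [d for d in downtimes if d.severity == "outage"]
    conflicts = []
    for a, b in combinations(outages, 2):
        if a.site == b.site:
            continue
        shared = a.vos & b.vos
        # Two half-open intervals overlap when each starts before the other ends.
        overlaps = a.start < b.end and b.start < a.end
        if shared and overlaps:
            conflicts.append((a, b, shared))
    return conflicts


# Hypothetical example data.
example = [
    Downtime("T1_A", frozenset({"ATLAS", "CMS"}), "outage",
             datetime(2016, 10, 3, 8), datetime(2016, 10, 3, 18)),
    Downtime("T1_B", frozenset({"ATLAS"}), "outage",
             datetime(2016, 10, 3, 12), datetime(2016, 10, 4, 12)),
    Downtime("T1_C", frozenset({"LHCb"}), "outage",
             datetime(2016, 10, 3, 8), datetime(2016, 10, 3, 18)),
    Downtime("T1_D", frozenset({"ATLAS"}), "warning",
             datetime(2016, 10, 3, 8), datetime(2016, 10, 3, 18)),
]

conflicts = find_outage_conflicts(example)
for a, b, shared in conflicts:
    print(f"{a.site} and {b.site} overlap for {sorted(shared)}")
```

In this sketch only the pair `T1_A` / `T1_B` is reported: `T1_C` shares no VO with the others, and the `T1_D` entry is not classified as an "outage". Restricting the input to the current and the following two weeks would mirror the SCOD check.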

Links to Tier-1 downtimes

ALICE	ATLAS	CMS	LHCB
	BNL	FNAL

Monday

Attendance:

local: Raja (LHCb), Doug B. (ATLAS), Belinda (Storage), Julia A. (WLCG), Kate (DB, chair), Maria A (WLCG), Maarten (ALICE), Laurence (Computing), Andrea (MW officer, storage), Marian (Network), Gavin (Computing), Alberto A. (monitoring)
remote: Eygene (NRC-KI), Vincenzo Spinoso (EGI), Dave Mason (FNAL), Rolf Rumler (IN2P3), Sang-Un Ahn (KISTI), Dimitri (KIT), Jens (NDGF), Kyle Gross (OSG), Jose Flix (pic), John (RAL), Vincent (security)

Experiments round table:

ATLAS reports -
- Activities and global report
  - Production going well (above 250k running jobs), mostly event generation and simulation but also analysis and reprocessing tests.
  - Large backlog of user jobs (peaked at 600k jobs) drained by Sunday afternoon.
  - T0 running grid jobs when CPU available
  - Some staging activity in order to prepare next reprocessing campaign
- Problems:
  - Data
    - Lost file on EOS (Friday 16 Sept.) due to an Xrootd client issue (dual open on write with a fault on close). EOS issued a patch to correct the issue.
    - Most T2 site problems were related to data (transfer, deletion, access)
  - Central services
    - Frontier server atlast0frontier3-ai.cern.ch:8000 (connect: timeout), part of a 4-node cluster. The VM was restarted; there is no information in the Openstack logs to explain why the VM stopped.
    - e-group deletion activities inadvertently removed ATLAS users from zp group. INC:1136368
      - Thank you to all the people involved. Perhaps a "lesson learned" would be useful, e.g. the synchronization between e-groups, AFS and computing accounts seems not to be fully clear to everybody (at least not to us). Also, a fix is still pending for some users who still do not have the computing group ZP first.
      - A post mortem related to this issue will be requested. LHCb also expressed interest in this PM.

CMS reports -
- data taking overview:
  - Plans for week September 19-25
    - 2.5km beta* setup and physics for the next 4 days
    - then physics
- production activity
  - ramp-up on 21st-22nd. Around 130k running job cores in the Grid.
  - data rereco campaign on-going
- issues
  - Very quiet week
  - 20th Sept: CERN URL Shortening Service not working. It was promptly fixed: RQF:0643842
  - Following up on tape recycles/repacks at the T1s, after the massive data deletion.

ALICE -
- NTR

LHCb reports -
- Activity
  - Monte Carlo simulation, data reconstruction/stripping and user jobs on the Grid
- Site Issues
  - T0:
    - Mass recursive deletion by a user causing EOS problems (GGUS:123957) - resolved
  - T1:
    - NIKHEF: CVMFS issues (GGUS:124026) - resolved
    - RAL: Diskserver down
- Others : ARC CEs publishing incorrect numbers of Waiting and Running jobs by default. Please contact LHCb or ARC developers or other ARC sites if starting to deploy ARC CEs for LHCb.

Sites / Services round table:

ASGC: nc
BNL: ntr
CNAF: Unscheduled downtime from September 21st until 7am on September 22nd. The storage system failed during failover. ALICE and ATLAS were affected. The possible cause is a bug in the firmware responsible for failover.
EGI: ntr
FNAL: Follow-up on the downtime not being propagated to GOCDB, although OSG had entered all the downtime details. According to Kyle's explanation, there was never a pull in GOCDB, only in the CMS dashboard, and the issue was related to a wrongly defined service element.
GridPP: nc
IN2P3: NTR
JINR: Nothing to report
KISTI: ntr
KIT: nc
NDGF: The network provider will remove the main switch on Thursday, 13-15 CEST. ATLAS and ALICE data will be unavailable. Abel (Norwegian computation site) will be taken offline for a week starting Friday. This will reduce NDGF compute by about 25%.
NL-T1:
- SARA is preparing for a datacenter move, starting this coming weekend.
- Unable to dial in today.
NRC-KI: The Belgian Tier-2 networking problem persists. We offered the possibility to reach Belnet in Belgium itself; Belnet refused to peer with us (RU-VRF) on the LHCONE space as an NREN and offered ULB the option to connect to us directly, but this needs some money from the ULB side.
OSG: ntr
PIC: ntr
RAL: ntr
TRIUMF: ntr

CERN computing services:
- LSF intervention today to fix one of the issues that occurred last week (ghost jobs).
CERN storage services:
- FTS: upgrade to v3.5.2 (new Optimizer implementation with range of actives) and migration to C7 on Wed. OTG:0033008
- CASTOR: LHC MD next week. Planning to deploy 2.1.16-9 everywhere. Will inform Storage Managers.
CERN databases: ntr
GGUS: Release this Wednesday 28/9 with test ALARMs as usual. Maria D. will be travelling. Please contact ggus-info at cern dot ch in case of any trouble.
Monitoring:
- Final reports for the August availability sent around
- Link to Monitoring Portal: http://monit.cern.ch
MW Officer:
- A new version of the WN bundle has been published to CVMFS (/cvmfs/grid.cern.ch/emi-wn-3.17.1-1_sl6v1); it includes gfal2 v2.11 (GGUS:123994)
Networks:
Security:
- critical EGI Advisory-SVG-2016-11476, deadline to update & restart affected services: 2016-09-29 00:00 UTC

AOB:

Topic revision: r23 - 2016-09-27 - MaartenLitmaath
