Week of 130415
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
The SCOD rota for the next few weeks is at
ScodRota
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Attendance:
- local: AndreaS, Jarka, Simone, MariaD
- remote: Lisa (FNAL), Onno (NL-T1), Michael (BNL), Lucia (CNAF), Wei-Jen (ASGC), Xavier (KIT), Tiju (RAL), Kyle (OSG), Rolf (IN2P3), Rob (OSG)
Experiments round table:
- ATLAS reports (raw view) -
- T1s:
- Some failures over the weekend copying data to FZK-LCG2 (GGUS:93320). The problem went away by itself. The ticket can be closed and a new one will be opened if the problem reappears.
- CMS reports (raw view) -
- No significant events in recent days.
- LHC / CMS
- Re-reconstruction of 2012 data is in the tails; the load at the T1 sites is small. User analysis goes on at a constant pace.
- CERN / central services and T0
- Tier-1:
- Tier-2:
- ALICE -
- CNAF: VOBOX host cert expired Sat afternoon, causing CNAF to get steadily drained of ALICE jobs since 04:30 today (GGUS:93319) [Lucia adds that they already requested a new certificate but the CA is late in delivering it]
- KIT: concurrent jobs cap lowered to 2500 on Fri afternoon to avoid overloading the firewall with off-site SE accesses while the new Xrootd servers are being debugged
- LHCb reports (raw view) -
- Apologies for not attending due to LHCb Analysis and Software week
- Still mainly user jobs
- T0:
- Still a problem with the SLS LFC sensor
- Activities:
- Re-stripping postponed until the beginning of May; it will start with the latest reprocessed data to take advantage of files still on disk. Then jobs will trigger staging from tape. Tier-1s should be prepared for a high staging rate, so that the rest of the re-stripping can be achieved within 60 days:
| Site | Total (TB) | Rate (MB/s) |
| CERN-RDST | 259.4 | 50 |
| CNAF-RDST | 794.6 | 153 |
| GRIDKA-RDST | 643.5 | 124 |
| IN2P3-RDST | 696.8 | 134 |
| PIC-RDST | 201.4 | 39 |
| RAL-RDST | 575.8 | 111 |
| SARA-RDST | 537.7 | 104 |
Sites / Services round table:
- ASGC: ntr
- BNL: ntr
- CNAF: ntr
- FNAL: ntr
- IN2P3: ntr
- KIT: ntr
- NL-T1: reminder from NIKHEF about a network maintenance intervention this Thursday affecting both computing and storage services; the queues will be drained starting from Wednesday afternoon.
- PIC: ntr
- RAL: ntr
- OSG: ntr
- Dashboards: ntr
AOB:
Thursday
Attendance:
- local: AndreaS, Jarka, Simone, Xavier, Gavin, MariaD
- remote: Xavier (KIT), David (CMS), John (RAL), Lucia (CNAF), Ronald (NL-T1), Rolf (IN2P3), Rob (OSG)
Experiments round table:
- ATLAS reports (raw view) -
- Central services
- The SAM-ATLAS-PROD machine had been down since 13:40 on April 16: https://cern.service-now.com/service-portal/view-incident.do?n=INC279368 . The importance of the machine was set to 5 (very low) while it should have been >50; this was changed yesterday (Wednesday) around 14:00. The problem was fixed at 10:30 on Thursday morning.
- SARA-MATRIX: ~2800 transfer failures because a certificate expired on Tuesday. GGUS:93353 solved; the issue was fixed the same day.
- RAL-LCG2: still some transfer failures at low rate with "checksum mismatch". GGUS:93315 updated on Tuesday. Under investigation.
- CMS reports (raw view) -
- LHC / CMS
- Re-reconstruction of 2012 data is in the tails; the load at the T1s continues to tail off. Analysis load continues as usual.
- CERN / central services and T0
- GGUS:93372 CMS VOMS -- currently correcting some manually entered incomplete DNs.
- Tier-1:
- GGUS:93440 KIT -- black hole worker node due to CVMFS partition being marked read only -- resolved quickly.
- Tier-2:
- ALICE -
- central services
- a cleanup operation unexpectedly caused a very high I/O load on the AliEn DB for a few hours starting Tue mid morning, causing lots of jobs and services around the grid to time out
- the cleanup was interrupted mid afternoon, after which the DB needed a few more hours to roll back
- normal conditions were restored early evening
- KIT: the new Xrootd servers look better now - thanks!
- concurrent jobs cap raised to 10k on Wed afternoon
- SARA: dCache Xrootd interface fixed for writing - thanks! (GGUS:93045)
Sites / Services round table:
- ASGC: ntr
- CNAF: ntr
- IN2P3: ntr
- KIT: tomorrow, emergency downtime from 09:00 to 13:00 local time to update the GPFS cluster; 900 TB of ATLAS data will be offline
- NL-T1: NIKHEF is in downtime for network maintenance as previously announced; everything should be back by the end of the afternoon
- PIC: ntr
- RAL: next Wednesday all Oracle databases will be patched; the upgrade should be transparent, but in case of problems with the ATLAS 3D database, the ATLAS Frontier servers will fail over to use the IN2P3 3D database. Rolf confirms that no intervention is planned at IN2P3 at that time.
- OSG: ntr
- CERN batch and grid services: Next Tuesday the NFS shared storage used by myproxy.cern.ch to store proxy credentials will be moved to a new system. We expect to preserve read-only access during the intervention so that proxy certificate renewal should keep working, but it will not be possible to store or delete stored proxy certificates for a few minutes. https://itssb.web.cern.ch/planned-intervention/change-backend-storage-server-myproxycernch/23-04-2013
- CERN storage services: last Tuesday from 1500 to 1540 there was a problem on the EOS head nodes for ATLAS and ALICE and the EOS service was unavailable
- Dashboards: ntr
- GGUS: GGUS release next Wednesday, April 24, from 06:00 to 07:00 UTC, with the usual ALARM test round. Reminder: as announced on https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek130304#Thursday and at the WLCG Ops Coordination meeting last week, the GGUS host certificate will be renewed. This certificate is used for SOAP authentication and hence impacts all systems that consume GGUS web services. The new certificate is attached to the relevant tickets in Savannah:136227.
AOB: