Week of 140623

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Maria Alandes (chair, minutes), Prasanth Kothuri (IT-DB), Felix Lee (ASGC), Maarten Litmaath (ALICE), Raja Nandakumar (LHCb)
  • remote: Michael Ernst (BNL), Oliver Gutsche (CMS), Tiju Idiculla (RAL), Lisa Giacchetti (FNAL), Dmitry Nilsen (KIT), Rolf Rumler (IN2P3-CC), Boris Wagner (NDGF), Onno Zweers (NL-T1)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central Services
      • DDM voboxes failing from time to time: ss-ft (atlas-SS01) and ss-staging (atlas-SS09)
    • T0/T1s
      • NIKHEF-ELPROD: failing transfers to SARA-MATRIX (GGUS:106397). The site is investigating
      • FZK-LCG2: 25% of transfers to CERN-PROD failing with "Connection timed out" (GGUS:106398)

  • CMS reports (raw view) -
    • CERN
      • intermittent xrootd read errors from EOS, tickets: GGUS:105678, GGUS:105836, GGUS:106113
      • problem is not reproducible, but causes a significant file read failure rate
      • all tickets have been moved from the EOS 3rd Line Support group to the EOS Functional Manager group
    • T1 sites

Regarding the EOS problems, Oliver adds that CMS has given all possible feedback to help debug the problem. All the details are now in the relevant tickets.

  • ALICE -
    • CERN: SLC6 CEs published bad job numbers for a short while on Sunday afternoon (GGUS:106399)

Maria asks whether this refers to the numbers published by the BDII. Maarten confirms that these are indeed the numbers of running and waiting jobs published by the CREAM CE resource BDIIs. Maarten adds that this affects the ALICE job submission engine and that it seems to come from a misconfiguration in the LSF information providers. The problem went away without any intervention, but it has been reported for the record.

  • LHCb reports (raw view) -
    • Main activity: user jobs continuing; test reprocessing done and preparing for more; larger MC productions running again
    • T0: CASTORLHCB currently seems to be down (INC0583679)
    • T1:
    • Various sites:
      • BDII and SRM publish inconsistent storage information (RAL - GGUS:105571, CBPF - GGUS:105572)
      • Multiple sites (SARA, RRC-KI) provided extra storage last Friday - thanks!

Maria mentions that she will follow up on the file protocol issue with the dCache developers. Regarding the BDII and SRM inconsistencies at RAL, Raja adds that these will presumably be fixed with the next CASTOR upgrade, scheduled for this week.

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: no report
  • FNAL: ntr
  • GridPP: no report
  • IN2P3: ntr
  • JINR: no report
  • KISTI: no report
  • KIT: ntr
  • NDGF: Some dCache ATLAS files were corrupted. These have now been deleted. ATLAS is aware of the issue.
  • NL-T1: Warning downtime of 1h tomorrow afternoon.
  • OSG: no report
  • PIC: no report
  • RAL: A full-day CASTOR upgrade will take place on Tuesday for ALICE and on Thursday for LHCb.
  • RRC-KI: no report
  • TRIUMF: no report

  • CERN batch and grid services: no report
  • CERN storage services: no report
  • Databases: A series of interventions is scheduled over the coming days:
    • ATONR: Monday 30th June 14:00-16:00 (database unavailable: 14:00-14:20)
    • ADCR: Tuesday 1st July 09:00-12:00 (database unavailable: 09:00-09:20)
    • ATLR: Tuesday 1st July 09:00-12:00 (database unavailable: 09:00-09:20)
    • ATLARC: Tuesday 1st July 14:30-17:30 (database unavailable: 14:30-14:50)
    • CMSONR & CMSONR ADG: Wednesday 25th June 14:00-17:00
    • CMSARC: Tuesday 24th June 14:00-16:00
    • CMSR: Wednesday 25th June 10:00-12:00
  • GGUS: no report
  • Grid Monitoring: no report
  • MW Officer: no report

AOB:

None

Thursday

Attendance:

  • local: Maria Alandes (chair, minutes), Xavier Espinal (Storage), Marcin Blaszczyk (IT-DB), Felix Lee (ASGC), Maarten Litmaath (ALICE), Raja Nandakumar (LHCb), Ignacio Reguero (Batch and Grid services)
  • remote: Jeremy Coles (GridPP), Dennis van Dok (NL-T1), Michael Ernst (BNL), Antonio Falabella (CNAF), Pepe Flix (PIC), Lisa Giacchetti (FNAL), Kyle Gross (OSG), Thomas Hartmann (KIT), Rolf Rumler (IN2P3-CC), Gareth Smith (RAL), Boris Wagner (NDGF), Rob Quick (OSG)

Experiments round table:

  • CMS reports (raw view) -
    • No major issues; processing and production are continuing at high scale
    • Regarding the GGUS tickets about CERN EOS mentioned in Monday's report: CMS is trying to assess whether the problems are still present and to what extent.
    • The GGUS tickets with KIT mentioned on Monday have been solved. Thanks.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Main activity: user jobs continuing; test reprocessing done and preparing for more; larger MC productions running again
    • T0:
      • CASTORLHCB was down on Monday due to a single user overloading it. Fine now after the user was banned.
      • Brazilian proxy problem on EOS (GGUS:106434)
    • T1:
      • GridKa: CVMFS problem on WNs (GGUS:106405), ongoing
      • RAL: GGUS:105571 - the inconsistent storage information publishing may actually be a sensor problem (mix-up of tape and disk info)?

Regarding the CASTORLHCB incident, Raja explains that the user concerned is being trained to avoid a similar incident in the future. Xavier explains that on the CASTOR side a limit is currently set explicitly for this user. Raja mentions that this could be removed next week; LHCb will get back to the CASTOR team.

Maria confirms that the inconsistency between the BDII and SRM numbers published by RAL in the dashboard is now understood (see the Information Systems report at the bottom of the minutes).

Maarten suggests pinging the EOS people again about the Brazilian proxy ticket, since this seems to be a real bug that could affect everyone. Maria suggests waiting for the EOS team's answer and following up next week if necessary.

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • GridPP: ntr
  • IN2P3: There will be an outage of the batch system between 09:00 and 12:00 on Tuesday 1st July.
  • JINR: no report
  • KISTI: no report
  • KIT: ntr
  • NDGF: no report
  • NL-T1: ntr
  • OSG: As already discussed with the SAM team in a phone conference, a test CE is being identified to check whether the HTCondor CE works with SAM. It will probably be at Nebraska, still to be confirmed.
  • PIC: Export of CMS data via xrootd is now enabled. Work is ongoing to also enable xrootd monitoring.
  • RAL: The LHCb CASTOR upgrade went fine. The final one is the ATLAS CASTOR upgrade, scheduled for Tuesday 8th July.
  • RRC-KI: no report
  • TRIUMF: no report

  • CERN batch and grid services: ntr
  • CERN storage services: The CASTOR DBs will undergo a set of rolling interventions for security patches that need to be applied by the DB team (this should be more or less transparent for users):
    • CMS and ATLAS: Monday 30th June at 2pm
    • ALICE and LHCb: Tuesday 1st July at 9 am
    • Nameserver: Wednesday 2nd July at 10 am
  • Databases: Rolling interventions due to the same security patches will also affect:
    • LCGR: Wednesday 2nd July at 10:30 am
    • LHCb Online: Thursday 3rd July at 1:30 pm
    • ATLAS: Apart from the rolling intervention for the security patches, there will be a 20-minute downtime of the DB at the very beginning of the intervention to apply some parameter changes in the DBs:
      • ATLAS online: Monday 30th June at 2 pm
      • ATLAS offline: Tuesday 1st July at 9 am
      • ATLAS archive: Tuesday 1st July at 2:30 pm
  • GGUS:
  • Grid Monitoring:
  • MW Officer:
    • The workaround for dCache publishing the file/nfs protocols is documented in the MW Issues table. A fix will be released in version 2.6.31 and included in the next EMI update, scheduled for the 7th of July. Note that this issue affects dCache versions >= 2.2.
  • Information Systems:
    • The RAL BDII storage capacities for LHCb are now in sync with the numbers reported by SRM (see Dashboard). There is just one remaining mismatch for tape, which is being sorted out with RAL: the numbers are published in both the online and nearline attributes, whereas for tape the online attributes should normally be 0 (see the sketch after this list). See GGUS:105571.
    • New T1 storage deployment metrics collected from the BDII have been created in the Dashboard. Note that the MW Officer still plans to maintain the Storage Deployment table for the time being, since planned changes cannot be collected automatically for the dashboard. The dashboard offers an up-to-date status of the storage deployment as well as historic information as versions get updated. T1s have been contacted in the past days when some of the information published in the BDII was not coherent. The GGUS tickets opened are tracked here.
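
As an aside, here is a minimal sketch of how such a consistency check can be made by hand: the online and nearline GLUE 1.3 storage-area sizes can be read directly from a top-level BDII, e.g. with the Python ldap3 library. The BDII endpoint, base DN and VO filter used below are assumptions that may differ per deployment; this is an illustrative sketch, not part of the official tooling.

from ldap3 import ALL, Connection, Server

# Illustrative sketch (assumptions): a top-level BDII endpoint and the standard
# GLUE 1.3 base DN; real deployments may use a different host or port.
BDII_URL = "ldap://lcg-bdii.cern.ch:2170"
BASE_DN = "Mds-Vo-name=local,o=grid"

# GLUE 1.3 storage-area size attributes (values are published in GB).
ATTRS = [
    "GlueSALocalID",
    "GlueSATotalOnlineSize", "GlueSAUsedOnlineSize",
    "GlueSATotalNearlineSize", "GlueSAUsedNearlineSize",
]

def dump_storage_areas(vo="lhcb"):
    """Print the online/nearline sizes of every GlueSA entry visible for a VO."""
    server = Server(BDII_URL, get_info=ALL)
    conn = Connection(server, auto_bind=True)  # anonymous, read-only bind
    # Sites publish the access rule either as "lhcb" or "VO:lhcb", hence the OR.
    conn.search(
        BASE_DN,
        f"(&(objectClass=GlueSA)"
        f"(|(GlueSAAccessControlBaseRule={vo})(GlueSAAccessControlBaseRule=VO:{vo})))",
        attributes=ATTRS,
    )
    for entry in conn.entries:
        values = entry.entry_attributes_as_dict
        print(entry.entry_dn)
        for attr in ATTRS:
            print(f"  {attr}: {values.get(attr, ['(not published)'])}")

if __name__ == "__main__":
    dump_storage_areas()

For a tape-backed storage area one would expect the online size attributes to be 0, which is precisely the condition questioned in GGUS:105571.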

AOB:
