Week of 130114

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

VO Summaries of Site Usability | SIRs | Broadcasts | Operations Web
ALICE ATLAS CMS LHCb | WLCG Service Incident Reports | Broadcast archive | Operations Web

General Information

General Information | GGUS Information | LHC Machine Information
CERN IT status board, WLCG Baseline Versions, WLCG Blogs | GgusInformation | Sharepoint site - LHC Page 1


Monday

Attendance: local (AndreaV, Alessandro, Jerome, Eva, Xavi, Eddie, MariaD, Maarten); remote (Michael/BNL, Gonzalo/PIC, Saverio/CNAF, Matteo/CNAF, Wei-Jen/ASGC, Lisa/FNAL, Christian/NDGF, Tiju/RAL, Renaud/IN2P3, Onno/NLT1, Rob/OSG, Dimitri/KIT; Ian/CMS, Vladimir/LHCb).

Experiments round table:

  • ATLAS reports -
    • BNL-ATLAS GGUS:90352 - not a site issue, as discussed in the ticket: the resources behind that PandaResource are opportunistic, so shifters simply see errors.
    • TAIWAN-LCG2 GGUS:90349 - a problematic disk server on MCTAPE caused file access problems. [Wei-Jen: the disk server issue has been fixed.]
    • Network issue between FZK and 2 Polish sites (CYFRONET and PSNC) GGUS:90341 - "Checked on Polish side PIONIER network, no problem found. Notified DFN-NOC." Transfers worked again after a few hours.
    • CERN-PROD EOS SRM/GRIDFTP issue GGUS:90338
    • FTS channels not defined for 2 UK sites on RAL-LCG2 FTS GGUS:90291, now solved, thanks.
    • [Alessandro: last week we also observed many problems at Tier-2s, mainly with storage, reported in GGUS.]

  • CMS reports -
    • LHC / CMS
      • Getting ready for HI
    • CERN / central services and T0
      • There was an issue with Castor transfers from P5 this morning. It is not clear whether it was a problem with the transfer system or the storage system. [Xavi: this was an issue with a specific server. Are you using the new workflow, and are you observing any issues for lead-proton? Ian: we expect that the rate will not be as high as for heavy ions in 2010; we are watching it but do not expect any issues.]
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • LHCb reports -
    • Mostly Simulation jobs on all Tier levels
    • This week prompt processing restarts at CERN, and 2011 data reprocessing restarts at T1 sites and attached T2s
    • T0:
      • VOMS not reachable from outside CERN (GGUS:90295)
    • T1:

Sites / Services round table:

  • Michael/BNL: ntr
  • Gonzalo/PIC: ntr
  • Saverio/CNAF: ntr
  • Wei-Jen/ASGC: an issue with the Castor database is being followed up
  • Lisa/FNAL: ntr
  • Christian/NDGF: we are coming back up after the downtime; all is going smoothly except for some problems with VOMS
  • Tiju/RAL: ntr
  • Renaud/IN2P3: we are still having problems with SRM due to the proxies generated by CRAB. Because of this issue we had around 30 SRM reboots last week, which was quite difficult to handle. We have no news from the dCache people yet. For this reason we want to ask CMS if they can stop these proxies in CRAB. [Ian: will follow up with user education, but technically there is nothing we can do. Maarten: the site can ban the users. Renaud: not sure this would work, as some of the proxies come from outside sites. Ian: will follow up offline because we need to analyse this in more detail. Maarten: IN2P3, please open a GGUS ticket for CMS.] (A sketch of extracting the end-entity DN from such a proxy chain is included after this round table.)
  • Onno/NLT1:
    • last Friday the maintenance of the compute cluster file server lasted longer than expected, but all was OK by Saturday
    • last Friday we also had a problem with a proxy that was rebooted but got stuck, affecting CVMFS because the failover did not work; this was fixed by this morning. [Maarten: please follow up with CVMFS people why the failover did not work.]
    • [Alessandro: are you upgrading WNs to SLC6 in January? Onno: will ask my colleagues.]
    • [Alessandro: are last week's problems with CEs fixed? Onno: will also follow up with my colleagues.]
  • Rob/OSG: ntr
    • [MariaD: there is a ticket assigned to Rob for the migration, can you please follow it up]

  • Eva/Databases: we had an issue with storage during the Active Data Guard upgrade for ALICE on Friday; everything was fixed on Saturday
  • Jerome/Grid: ntr
  • Eddie/Dashboard: ntr
  • Xavi/Storage: nta
  • MariaD/GGUS: The January release will take place on 2013/01/23 (earlier than the usual "last Wednesday of the month"). NB: the GGUS-VOMS synchronisation mechanism will change, in order to rely on GGUS's own server rather than the GridKa one used so far. Details in Savannah:133063. The WLCG request to allow all VO members to view their TEAMers and ALARMers from within GGUS (not only via VOMS) will also enter production with this release. Details in Savannah:134672 and Savannah:134697.
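
A note on the IN2P3/CMS proxy discussion above: before a site can ban the users behind problematic proxies, it has to identify the end-entity DN in the delegated proxy chain. The sketch below is a minimal, hypothetical illustration using the Python cryptography package (a recent version is assumed); the proxy file path is a placeholder, this is not a tool that IN2P3 or CMS actually use, and an actual ban would be applied through the site's authorization framework rather than by a script like this.

<verbatim>
# Minimal sketch (hypothetical): list the certificate subjects in a delegated
# proxy chain. The non-proxy subject is the end-entity DN a ban would target.
from cryptography import x509

PROXY_FILE = "/tmp/x509up_example"  # placeholder path, not a real proxy file

def chain_subjects(path):
    """Return the subject DN of every PEM certificate found in the proxy file."""
    data = open(path, "rb").read()
    begin = b"-----BEGIN CERTIFICATE-----"
    end = b"-----END CERTIFICATE-----"
    subjects = []
    # A proxy file concatenates the proxy certificate(s), the private key and
    # the user certificate; parse each PEM certificate block in turn (the key
    # block is skipped automatically because it has a different PEM header).
    for chunk in data.split(begin)[1:]:
        pem = begin + chunk.split(end)[0] + end + b"\n"
        cert = x509.load_pem_x509_certificate(pem)
        subjects.append(cert.subject.rfc4514_string())
    return subjects

if __name__ == "__main__":
    for dn in chain_subjects(PROXY_FILE):
        print(dn)
</verbatim>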

AOB: none

Tuesday

Attendance: local (AndreaV, Eddie, Jerome, Eva, Xavi); remote (Saverio/CNAF, Michael/BNL, Xavier/KIT, Ronald/NLT1, Wei-Jen/ASGC, Tiju/RAL, Christian/NDGF, Lisa/FNAL; Vladimir/LHCb, MariaD/GGUS).

Experiments round table:

  • ATLAS reports -
    • No data taking till Wednesday
    • Nothing else to report.

  • LHCb reports -
    • Mostly Simulation jobs on all Tier levels
    • This week prompt processing restarts at CERN, and 2011 data reprocessing restarts at T1 sites and attached T2s
    • T0:
      • VOMS not reachable from outside CERN (GGUS:90295): Solved
    • T1:
      • GRIDKA: timeouts of transfers to GridKa disk storage are under investigation (GGUS:88425, GGUS:88906). A downtime for the separation of LHCb from the gridka-dcache.fzk.de SE will hopefully solve this problem

Sites / Services round table:

  • Saverio/CNAF: ntr
  • Michael/BNL: ntr
  • Xavier/KIT:
    • there was a downtime for LHCb as mentioned in their report
    • there was also another downtime for upgrading Frontier Squids
  • Ronald/NLT1: there were two open questions yesterday
    • yes, there is the intention to migrate SARA to SL6, but this is more likely to happen in the first half of February than in January
    • yes, the issue with the CEs and CVMFS is fixed
  • Wei-Jen/ASGC: ntr
  • Tiju/RAL: ntr
  • Christian/NDGF:
    • VOMS issue mentioned yesterday is now fixed
    • there will be a short downtime on Thursday morning
  • Lisa/FNAL: ntr

  • Xavi/Storage: added 6 new disk servers for CMS to address the issue mentioned yesterday
  • Jerome/Grid: ntr
  • Eddie/Dashboard: ntr
  • Eva/Databases: another issue since 6 am this morning with the Active Data Guard copy of ADCR (the DB is up but not being synchronised)
  • MariaD/GGUS: ntr

AOB: none

Wednesday

Attendance: local (AndreaV, Xavi, Jerome, Eddie, LucaC); remote (Saverio/CNAF, Matteo/CNAF, John/RAL, Lisa/FNAL, Pavel/KIT, Rob/OSG, Rolf/IN2P3, Ron/NLT1; Vladimir/LHCb, MariaD/GGUS).

Experiments round table:

  • LHCb reports -
    • Mostly Simulation jobs on all Tier levels
    • This week prompt processing restarts at CERN, and 2011 data reprocessing restarts at T1 sites and attached T2s
    • T0:
      • CASTOR problem: possible unavailability detected in c2lhcb/lhcbdisk (Wed Jan 16 08:50:58 2013); 2 nodes seem to be unavailable. [Xavi: the CASTOR issue is now fixed.]
    • T1:
      • GRIDKA: timeouts of transfers to GridKa disk storage are under investigation (GGUS:88425, GGUS:88906). A downtime for the separation of LHCb from the gridka-dcache.fzk.de SE will hopefully solve this problem

Sites / Services round table:

  • Christian/NDGF [via email before the meeting]: there will be another downtime tomorrow for updates of the dCache headnodes. This will affect all data. The downtime is only expected to last a few minutes as the servers reboot.
  • Saverio/CNAF: ntr
  • John/RAL: ntr
  • Lisa/FNAL: ntr
  • Pavel/KIT: ntr
  • Rob/OSG: ntr
  • Rolf/IN2P3: ntr
  • Ron/NLT1: ntr

  • Xavi/Storage: nta
  • Eddie/Dashboard: ntr
  • Jerome/Grid: ntr
  • LucaC/Databases: the issue reported yesterday with the ADCR Active Data Guard is due to a lost write and is being investigated with Oracle and NetApp. Active Data Guard for ADCR was restored yesterday.
  • MariaD/GGUS: ntr

AOB: none

Thursday

Attendance: local (AndreaV, Eddie, Jerome, Xavi, MariaD); remote (Saverio/CNAF, Matteo/CNAF, Michael/BNL, Wei-Jen/ASGC, Ronald/NLT1, WooJin/KIT, Rolf/IN2P3, Rob/OSG, Jeremy/GridPP, Christian/NDGF, Lisa/FNAL, Dimitrios/RAL; Vladimir/LHCb).

Experiments round table:

  • LHCb reports -
    • Mostly Simulation jobs on all Tier levels
    • This week prompt processing restarts at CERN, and 2011 data reprocessing restarts at T1 sites and attached T2s
    • T0:
    • T1:
      • GRIDKA: timeouts of transfers to GridKa disk storage are under investigation (GGUS:88425, GGUS:88906). A downtime for the separation of LHCb from the gridka-dcache.fzk.de SE will hopefully solve this problem. The downtime has finished and we are trying to use the new SRM endpoint.

Sites / Services round table:

  • Saverio/CNAF: ntr
  • Michael/BNL: ntr
  • Wei-Jen/ASGC: ntr
  • Ronald/NLT1: ntr
  • WooJin/KIT:
    • as reported by LHCb, the downtime is finished
    • a tape library is still down and the vendor is working on it
    • the Squid upgrade is also complete
  • Rolf/IN2P3: ntr
  • Rob/OSG: ntr
  • Jeremy/GridPP: ntr
  • Christian/NDGF: someone has cut our main network fiber; we are up but service quality is degraded
  • Lisa/FNAL: ntr
  • Dimitrios/RAL: ntr

  • Jerome/Grid: ntr
  • Xavi/Storage: ntr
  • Eddie/Dashboard: ntr
  • MariaD/GGUS: Reminder: the January release will take place next Wednesday, 2013/01/23. There will be the usual series of test ALARMs.

AOB: none

Friday

Attendance: local (AndreaV, Xavi, Ignacio, Jerome, Eddie, Eva); remote (Michael/BNL, Lisa/FNAL, Jeremy/GridPP, Wei-Jen/ASGC, Xavier/KIT, Onno/NLT1, Kyle/OSG, Rolf/IN2P3, Gareth/RAL, Saverio/CNAF, Matteo/CNAF; Vladimir/LHCb, Ian/CMS).

Experiments round table:

  • CMS reports -
    • LHC / CMS
      • Waiting for beam
    • CERN / central services and T0
      • The use of 2 Tier-0 systems (the new one as primary and the old one as backup), combined with writing streamers to tape, has the potential to put a high load on Castor. We are watching this and will try to turn off elements as soon as we are convinced that the new system's results are fully reliable. [Xavi: this is in line with what we reported on Monday. We are watching this too and will keep in contact with CMS.]
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • ALICE reports -
    • CERN: expired CRL being served for the Polish CA, team ticket GGUS:90539 in progress. [Jerome: this was solved half an hour ago.]
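
A note on the expired-CRL ticket above: the symptom can be verified independently of the middleware by fetching the CA's CRL and comparing its nextUpdate field with the current time. The sketch below is a minimal, hypothetical check using the Python cryptography package (a recent version is assumed); the CRL URL is a placeholder, not the actual Polish CA distribution point.

<verbatim>
# Minimal sketch (hypothetical): check whether a published CRL has expired.
from datetime import datetime, timezone
from urllib.request import urlopen

from cryptography import x509

CRL_URL = "http://example-ca.example.org/ca.crl"  # placeholder, not the real CA endpoint

def crl_is_expired(url):
    """Download a CRL and report whether its nextUpdate time has passed."""
    data = urlopen(url, timeout=30).read()
    try:
        crl = x509.load_der_x509_crl(data)   # most CAs publish DER-encoded CRLs
    except ValueError:
        crl = x509.load_pem_x509_crl(data)   # fall back to PEM
    # nextUpdate is the deadline by which the CA must publish a fresh CRL.
    next_update = crl.next_update.replace(tzinfo=timezone.utc)
    return next_update < datetime.now(timezone.utc)

if __name__ == "__main__":
    print("CRL expired:", crl_is_expired(CRL_URL))
</verbatim>

If nextUpdate is in the past, the file being served is stale and certificate validation starts to fail, which matches the symptom reported in GGUS:90539.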

  • LHCb reports -
    • Mostly Simulation jobs on all Tier levels
    • This week prompt processing restarts at CERN, and 2011 data reprocessing restarts at T1 sites and attached T2s
    • T0:
    • T1:
      • GRIDKA: after a few fixes LHCb started transfers. The error level is lower than before, and we have no SRM errors anymore

Sites / Services round table:

  • Michael/BNL: ntr
  • Lisa/FNAL: ntr
  • Jeremy/GridPP: ntr
  • Wei-Jen/ASGC: one Castor issue last night due to high load, now fixed
  • Xavier/KIT: ntr
  • Onno/NLT1: ntr
  • Kyle/OSG: ntr
  • Rolf/IN2P3: ntr
  • Gareth/RAL: ntr
  • Saverio/CNAF: ntr

  • Eddie/Dashboard: ntr
  • Xavi/Storage: nta
  • Jerome/Grid: nta
  • Ignacio/Grid: the LFC upgrade is on hold waiting for ATLAS; can they confirm whether we can go ahead? [AndreaV: will follow up with ATLAS.]
  • Eva/Databases: ntr

AOB: none
