Week of 120220

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

Dial +41227676000 (Main) and enter access code 0119168, or
To have the system call you, click here
The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News

Monday

Attendance: local(Andrew, Dan, Ignacio, Maarten, Maria D, Massimo, Mike, Steve, Xavier);remote(Gonzalo, Jhen-Wei, Joel, Jose, Lisa, Mette, Michael, Onno, Rob, Rolf, Stefano P, Tiju).

Experiments round table:

ATLAS reports -
- LYON LFC. SAV:126485, GGUS:79347 ALARM ticket
  - FR cloud was set offline for DDM, production and analysis
  - table space was increased, DDM transfers resumed at 17:50, and the FR cloud was set online again at 19:40
- Dan: reminder for T1 sites to fill in the FTS 2.2.8 deployment spreadsheet announced Thu last week

CMS reports -
- NTR. Quiet weekend.

ALICE reports -
- NTR

LHCb reports -
- T0
  - NTR.
- T1
  - IN2P3 : stalled jobs : (GGUS:79356)
  - SARA : problem of shared area : (GGUS:79368)
  - PIC : Request for space token migration (GGUS:79305)
  - GridKa : Request for space token migration (GGUS:79303)
  - SARA : Request for space token migration (GGUS:79307)
  - SARA : FTS timeout transferring a file to CNAF. (GGUS:79325)
- T2
  - LAL (GGUS:79200). CREAM CE problem?

Sites / Services round table:

ASGC
- downtime for all services Thu 01:00-04:00 UTC for core switch maintenance
BNL - ntr
CNAF - ntr
FNAL
- network problem yesterday 08:30-16:30 CST, still reduced capacity, but users should not notice that
IN2P3 - ntr
NDGF - ntr
NLT1 - ntr
OSG - ntr
PIC - ntr
RAL
- CMS CASTOR intervention ongoing as planned, should finish at 16:00

dashboards - ntr
GGUS/SNOW - ntr
grid services
- AFS UI: current version for gLite 3.2/SL5 now pointing to 3.2.11
  - bug: glite-ce-job-output did not work, GGUS:79384 opened - fixed
- FTS pilot updated to official EMI FTS 2.2.8 release, too early to tell if it is working fine, more news tomorrow
- all CREAM CEs updated to EMI-1 release
storage
- EOS-ATLAS updated with fix for memory leak
- CASTOR LHCb default pool overloaded by 2 users
- CASTOR public update on Wed 09:30-12:30 CET with SAM tests for experiments "at risk"

AOB:

(MariaDZ) File ggus-tickets.xls is up-to-date and attached to page WLCGOperationsMeetings. We had 2 real ALARMs last week, both from ATLAS and both solved: GGUS:79278 and GGUS:79347.

Steve: sites running an EMI-1 DPM should watch out for GGUS:79354 !

Tuesday

Attendance: local(Alessandro, Dan, Maarten, Maria D, Michail, Mike, Steve, Xavier);remote(Burt, Gareth, Gonzalo, Jeremy, Joel, Kyle, Mette, Michael, Oliver, Paco, Rolf, Stefano P).

Experiments round table:

ATLAS reports -
- T0/1 problems: NTR
- Panda queues are starting to be automatically set to brokeroff, offline, test, online by the Site Status Board according to declared downtimes. Details are here: https://atlas-logbook.cern.ch/elog/ATLAS+Computer+Operations+Logbook/33952
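
As an illustration of the idea only, the downtime-driven switching could be sketched as below. The four status names come from the report above; the drain and recovery windows, the function name and the precedence rules are assumptions made for this sketch and are not taken from the Site Status Board (the linked logbook entry describes the actual rules).

```python
from datetime import datetime, timedelta

# Hypothetical policy values, NOT taken from the Site Status Board:
LEAD = timedelta(hours=6)      # assumed window before a downtime with brokering stopped
RECOVERY = timedelta(hours=1)  # assumed test window after a downtime ends


def queue_status(now, downtimes):
    """Return the status of one queue, given its declared (start, end) downtimes."""
    status = "online"
    for start, end in downtimes:
        if start <= now < end:
            # Inside a declared downtime: nothing should run.
            return "offline"
        if start - LEAD <= now < start:
            # A downtime is approaching: stop brokering new jobs.
            status = "brokeroff"
        elif end <= now < end + RECOVERY and status == "online":
            # A downtime has just ended: run test jobs before going online.
            status = "test"
    return status
```

In this sketch "offline" wins over everything else, and an approaching downtime ("brokeroff") wins over a recovery period ("test") when several declared downtimes are close together.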

CMS reports -
- Tier-0:
  - next Mid Week Global Run (Cosmics data taking without magnetic field): Wednesday, Feb 22 - Thursday, Feb 23
- Processing activities:
  - 8 TeV MC simulation on Tier-1 sites, ramping up on Tier-2 sites
- Tier-1 site issues, open GGUS tickets (went through all open CMS GGUS tickets today):
  - GGUS:79259: T1_IT_CNAF: request to update FTSMonitor, will be handled within 2-4 weeks due to separation from lemon that runs on the same server, last update 2012-02-21
  - GGUS:79257: T1_DE_KIT: request to update FTSMonitor, will be looked at next week, last update 2012-02-16
  - GGUS:78725: T1_FR_CCIN2P3: CREAM CE proxy expiration issue, should have been fixed on Feb. 7th update, should be closed today
  - GGUS:79162: T1_FR_CCIN2P3: JobRobot errors, "Error while loading shared library", should be closed today
  - GGUS:75983: T1_FR_CCIN2P3: network degradation between ALL US T2s and IN2P3, 2nd 10 Gbit interface provided by RENATER, strong advice to use LHCOne, last update 2012-01-26
  - Oliver: SRM problem at KIT, no ticket yet
- Services and Infrastructure issues, open GGUS tickets (went through all open CMS GGUS tickets today):
  - GGUS:79389: BDII 2nd Line Support: Discrepancy between OSG interoperability BDII and top level CERN BDII, last updated 2012-02-21
  - GGUS:79331: glite WMS: Condor version in current gliteWMS causes churn on GRAM5 Compute Elements, last update 2012-02-21
  - GGUS:76713: lcg_util Development: lcg-get-checksum command issues, request to lower priority, last update 2011-11-30
    - Maria D: you can use the escalation button if you disagree with how a ticket is treated
  - GGUS:76686: gLite Yaim Core: glexec error on cream, traced back to missing group in /etc/groups, installation problem, last update 2012-01-26
  - GGUS:74318: VOMS-Admin: output of the 'voms-admin list-members' contains multiple field separators, on hold, last update 2012-01-26
  - GGUS:66764 gLite VOBOX: Problem renewing proxies in the CMS VOBOX (phedex), suggestion to use new VOBox release, last update 2012-02-20
  - Oliver: keep generic tickets associated with CMS?
  - Maria D: keep such tickets with CMS as concerned VO while assigned to the correct support unit

ALICE reports -
- SARA job submission stopped working when the VOBOX host certificate was recently changed to have a new subject DN, not configured in myproxy.cern.ch (GGUS:79423 opened for that now)

LHCb reports -
- T0
  - SLS issue : someone removed some readonly volume yesterday around 3pm and one of our tests using this volume was timing out. The problem is fixed now.
- T1
  - IN2P3 : (GGUS:79356) : using the queue with higher memory limit helps to run our jobs.
    - Rolf: now that your T2 jobs are running OK, will you keep the higher memory requirement and update the VO card?
    - Joel: there is a memory leak in the current application, a new version is being tested; we will inform you when the problem has been fixed
  - PIC : Request for space token migration (GGUS:79305)
  - GridKa : Request for space token migration (GGUS:79303) "nearly finished"
  - SARA : Request for space token migration (GGUS:79307)
- T2
  - LAL (GGUS:79200). CREAM CE problem?
    - Maarten: you can use the ticket escalation button when there is no response

Sites / Services round table:

BNL
- intervention tomorrow 15:00-19:00 UTC to move conditions data infrastructure to new HW and Oracle 11g, will affect data access and streaming
CNAF - ntr
FNAL
- still reduced network capacity after Sunday outage, but meeting the operational requirements; non-disruptive intervention Thu morning CST to get back to full capacity
GridPP - ntr
IN2P3 - ntr
NDGF - ntr
NLT1 - ntr
OSG - ntr
PIC
- planned network intervention went OK this morning, traffic was low
RAL
- yesterday's CMS CASTOR update went OK, ATLAS instance will be done tomorrow
- Oracle security patch will be applied on ATLAS and LHCb 3D databases on Thu, rolling intervention, services "at risk"

dashboards - ntr
GGUS/SNOW - ntr
grid services
- CERN FTS: the Lyon Monitor https://fts-monitor.cern.ch/ will be redirected from https://fts300.cern.ch/ (version 1.5.1) to https://fts200.cern.ch/ (version 1.6.1) on Thursday 24th at 09:00 UTC. There will be a 10-minute outage for the alias. The old service will be destroyed quietly in two weeks' time.
- FTS 2.2.8 pilot looks fairly OK, though a small failure rate has been seen, not yet confirmed if that is due to the new code; proposal to upgrade T2 service tomorrow; known issues will be made available to T1 sites
storage
- reminder - upgrade of CASTOR public instance tomorrow 09:30-12:30 CET
  - better Xrootd plugin
  - tape access improvements
- upgrade of experiment instances to be decided after experience with public instance

AOB:

Wednesday

Attendance: local (AndreaV, Mike, Dan, Alessandro, Eva, Edoardo, Massimo, Xavi, MariaDZ); remote (Gonzalo/PIC, Mette/NDGF, Lisa/FNAL, Alexander/NLT1, Jhen-Wei/ASGC, Pavel/KIT, Tiju/RAL, Rolf/IN2P3, Rob/OSG, Michael/BNL; Joel/LHCb, Oliver/CMS).

Experiments round table:

ATLAS reports -
- Tier0:
  - The castorpublic outage lists the following services under "Affected Service Endpoints": prod-lfc-atlas.cern.ch, fts22-t0-export.cern.ch, fts-t2-service.cern.ch. Are those really affected? The ATLAS auto-exclusion agents can act on this list to exclude these services, so it should be accurate. [Massimo: long ago there was an internal dependency of tests on castorpublic, now gone. It must be the CE/FTS tests that depend on it. Andrea: who is responsible for these tests? Alessandro: maybe IT-GT or IT-PES. Massimo: should be IT-PES, will follow up with them.]
  - [Massimo: in general, please remember that critical services should not depend on castorpublic for availability testing! ]
- Tier1:
  - BNL CondDB outage just started. Clients should failover, so we are not taking any actions to disable Panda queues.
- ADC Central Services:
  - Disk failure (warning) in one of the pandacache servers used to host user input sandboxes. If it is taken offline, 50% of user jobs will fail. Instead we have taken the machine out of the load balancer and are now waiting until all dependent analysis jobs have completed. A more resilient design is being implemented.
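
The "take it out of the load balancer, then wait for dependent jobs" approach described above can be sketched roughly as follows. This is a minimal illustration only: the class, its method names and the least-loaded assignment rule are hypothetical and do not describe the actual pandacache setup.

```python
class SandboxPool:
    """Toy model of cache hosts behind a load balancer (hypothetical names)."""

    def __init__(self, hosts):
        self.active = set(hosts)               # hosts that receive new sandboxes
        self.draining = set()                  # hosts removed from the balancer
        self.jobs = {h: set() for h in hosts}  # job ids whose sandbox lives on each host

    def assign(self, job_id):
        # Assumed rule: pick the active host with the fewest dependent jobs.
        host = min(self.active, key=lambda h: (len(self.jobs[h]), h))
        self.jobs[host].add(job_id)
        return host

    def start_drain(self, host):
        # New sandboxes no longer land here; existing jobs can still read theirs.
        self.active.discard(host)
        self.draining.add(host)

    def job_finished(self, job_id):
        for jobs in self.jobs.values():
            jobs.discard(job_id)

    def safe_to_remove(self, host):
        # Only take the machine offline once no job depends on it any more.
        return host in self.draining and not self.jobs[host]
```

Draining first avoids failing the jobs whose input sandboxes are stored only on the affected machine, at the cost of running on reduced capacity until they complete.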

CMS reports -
- Tier-0:
  - next Mid Week Global Run (Cosmics data taking without magnetic field): Wednesday, Feb 22 - Thursday, Feb 23
- Processing activities:
  - 8 TeV MC simulation on Tier-1 sites, ramping up on Tier-2 sites
- Tier-1 site issues
  - GGUS:79441: T1_ES_PIC: problem with PIC FTS server, CMS group was missing from the common groups.conf for yaim, SOLVED 2012-02-21
  - GGUS:79259: T1_IT_CNAF: request to update FTSMonitor, will be handled within 2-4 weeks due to separation from lemon that runs on the same server, last update 2012-02-21
  - GGUS:79257: T1_DE_KIT: request to update FTSMonitor, will be looked at next week, last update 2012-02-16
  - GGUS:75983: T1_FR_CCIN2P3: network degradation between ALL US T2s and IN2P3, 2nd 10 Gbit interface provided by RENATER, strong advice to use LHCOne, problem was handed to Internet2 who will contact Dante, set to SOLVED 2012-02-22

ALICE reports -
- NTR

LHCb reports -
- Experiment activities: User analysis, Validation productions for restripping
- T0
  - CERN : pilot aborted (GGUS:79459)
- T1
  - PIC : Request for space token migration (GGUS:79305)
  - GridKa : Request for space token migration (GGUS:79303) "nearly finished"
  - SARA : Request for space token migration (GGUS:79307)
- T2
  - LAL (GGUS:79200). CREAM CE problem? [Joel: not happy about the reply from LAL about this issue, "we are on holiday". MariaDZ: is Rolf representing the French NGI here too? Rolf: no, representing only the French T1, not the French NGI. MariaDZ: will follow up on this.]

Sites / Services round table:

Gonzalo/PIC: ntr
Mette/NDGF: ntr
Lisa/FNAL: ntr
Alexander/NLT1: ntr
Jhen-Wei/ASGC: one CMS disk had a partition failure, we are looking into it. [Oliver: any idea which data it contains? Jhen-Wei: will follow up.]
Pavel/KIT: about FTS, our expert is away this week, will look into the issue next week
- [Joel: any news about CVMFS? Pavel: last week reported that it will take a few weeks, no news since then.]
Tiju/RAL: successfully upgraded Castor for ATLAS
Rolf/IN2P3: ntr
Rob/OSG: we are progressing on the issue of attachment exchange with GGUS
Michael/BNL: intervention on ATLAS CondDB is starting and will take 4h

Xavi/Storage: castorpublic intervention is finished; interventions on the VOs will be announced
Mike/Dashboard: ntr
Eva/Databases: ntr
Edoardo/Network: ntr

Steve/Grid: CERN FTS T2 service (CERN ITSSB): upgrade to EMI 2.2.8 from gLite 2.2.4 completed by 12:30, overrunning by 30 minutes. It was not quite as transparent as intended and the service endpoint was removed for 30 minutes. Propose to update the T0 in a similar fashion on Tuesday 28th February. [Alessandro: this is in line with what was agreed with Steve, ATLAS would like to wait a few days to make sure all is OK before this is deployed on a more general scale. Oliver: also true for CMS, no particular urgency on our side.]

AOB:

MariaDZ: reminder to T0/1 sites to complete the sheet https://docs.google.com/spreadsheet/ccc?key=0AthhzXLQok7XdFpUeDBfLXE2S1RDZE4zcHp6QWVpUFE#gid=0 with their FTS upgrade plans. [Andrea: there is no T1SCM tomorrow anyway]
MariaDZ: next release of GGUS will be on Monday 27th and announced in gocdb, note the unusual day (usually releases are on Wednesday)

Thursday

Attendance: local(Alessandro, Alex, Dan, Eva, Giuseppe, Maarten, Maria D, Mike, Oliver K, Xavier);remote(Burt, Gonzalo, Jhen-Wei, Joel, Mette, Oliver G, Rob, Rolf, Ronald).

Experiments round table:

ATLAS reports -
- Tier0:
  - NTR
- Tier1:
  - It took RAL some extra time (~2 hrs) to come out of their downtime yesterday due to a fault in the new ATLAS SSB status switcher. Apologies for that.
- ADC Central Services:
  - There was a drop in pilots coming out of the three CERN pilot factories yesterday afternoon due to a failed CondorG upgrade. Fixed last night.

CMS reports -
- Tier-0:
  - next Mid Week Global Run (Cosmics data taking without magnetic field): Wednesday, Feb 22 - Thursday, Feb 23
- Processing activities:
  - 8 TeV MC simulation on Tier-1 sites, ramping up on Tier-2 sites
- Issues:
  - NTR

ALICE reports -
- NTR

LHCb reports -
- T0
  - CERN : EMI WMS problem fixed
```
2012-02-22 14:13:25 UTC WorkloadManagement/TaskQueueDirector/gLitePilotDirector
INFO: Reference https://wms301.cern.ch:9000/GnF9HdDoA5-r5LwXWlT29Q for TaskQueue 2758122

2012-02-22 14:15:28 UTC WorkloadManagement/TaskQueueDirector/gLitePilotDirector
INFO: Reference https://wms301.cern.ch:9000/Fpl3XHXTKY_-_ps3D-z9SQ for TaskQueue 2758135
```
    This means the problem is solved and LHCb is now successfully using this WMS. The other WMS instances that need to be updated with this patch are at least:
    - wms01.pic.es
    - wms02.pic.es
    - wms-lhcb.grid.cnaf.infn.it
    - lcgwms01.gridpp.rl.ac.uk
  - Maarten: GGUS:79096 provides a few rpms that fix the problem; other sites could go ahead and deploy them as well; an official release should appear some time in March
- T1
  - RAL : zombie jobs (GGUS:79545)
  - PIC : Request for space token migration (GGUS:79305)
  - GridKa : Request for space token migration (GGUS:79303) "nearly finished"
  - SARA : Request for space token migration (GGUS:79307)
- T2
  - LAL (GGUS:79200). CREAM CE problem?
    - Maria D: will look into how the ticket got handled

Sites / Services round table:

ASGC - ntr
BNL - ntr
FNAL - ntr
IN2P3 - ntr
NDGF - ntr
NLT1 - ntr
OSG
- problem with ticket attachment exchanges between GGUS and OSG: solution expected next week
- ticket exchange between FNAL and OSG currently broken, being looked into
PIC - ntr

dashboards - ntr
databases - ntr
GGUS/SNOW - ntr
storage
- proposal for CASTOR ATLAS upgrade on March 6
  - Alessandro: probably OK, we will let you know

AOB:

(MariaDZ) CERN Grid Services: please note that GGUS:77157 requires attention, as requested in the AOB of WLCGDailyMeetingsWeek120213#Tuesday.
FTS 2.2.8
- Oliver K: release notes will have the CERN experience with the upgrade; let us know ASAP if a gLite 3.2 release is needed
- Alessandro: some T1 already indicated when they can do the upgrade
  - missing/incomplete: IN2P3, INFN-T1, KIT, PIC, SARA, TRIUMF
  - let's try to agree a schedule at the next T1 Service Coordination Meeting
  - each upgrade can be smooth, no real worries

Friday

Attendance: local(Massimo, Oliver, Dan, Alex, Eva, David);remote(Gonzalo, Onno, Xavier, Rolf, Jhen-Wei, Lisa, Yiju, Joel, Roger, Rob, Michael).

Experiments round table:

ATLAS reports -
- Tier0: NTR
- Tier1:
  - RAL: transfer errors at RAL-LCG2, mainly with destination RAL-LCG2_DATADISK, with errors
    - [SRM_INTERNAL_ERROR] Too many threads busy with Castor at the moment.
    - [SRM_FAILURE] No valid space tokens found matching query
    - submitted GGUS:79592
    - "the problem looks to be load related and to the fact that there is a script generating a lot of load as it cleans tape backed files from disk that aren't needed". Need to pause that script.
    - problem solved.

CMS reports -
- Tier-0:
- Processing activities:
  - 8 TeV MC simulation on Tier-1 sites, ramping up on Tier-2 sites
- Issues:
  - GGUS:79623: T0_CH_CERN: CASTOR-SRM_CMS is not available, last update 2012-02-24

ALICE reports -
- NTR

LHCb reports -
- T0
  - RAL : EMI WMS problem fixed
- T1
  - PIC : Request for space token migration (GGUS:79305)
  - GridKa : Request for space token migration (GGUS:79303) "nearly finished"
  - SARA : Request for space token migration (GGUS:79307)
- T2
  - LAL (GGUS:79200). CREAM CE problem?

Sites / Services round table:

KIT (Xavier)
- GridKa's annual site downtime is now scheduled for March 13th, 4 a.m., till 15th, 11 a.m. (UTC). All services will be down, except for FTS, which will be down only March 14th.
- To favour the preparations for the downtime, the priority of CVMFS has been lowered a bit. We target the week after the downtime (cw 12) for first production tests.
PIC
- NTR
- LHCb asked if they are going to put in place a workaround for the EMI WMS. This will be asked next week.
NLT1: NTR
IN2P3: NTR
ASGC: NTR
FNAL: NTR
NDGF: Network problems. Investigating.
OSG: Fix for the ticket attachment exchange problem is available; it will be in production on Monday
BNL: NTR
RAL: The site asks ATLAS why they have not seen activity since yesterday 5 GMT. It looks to be a problem for other sites as well.

Central Services:
Dashboard: NTR
Storage: Investigating the CMS ticket
DB: WLCG prod will be unavailable ~15 Monday at 9:00 (reboot needed for enabling the Ora11 compatibility mode)

AOB: (MariaDZ) There was an (admittedly cryptic) update to the LAL CREAM CE problem ticket (GGUS:79200), which is now waiting for a reply from LHCb. Please don't forget the unusual date for the GGUS release: Monday 27 Feb.

-- JamieShiers - 18-Jan-2012

Topic revision: r22 - 2012-03-07 - JaroslavaSchovancova
