Week of 111128

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs   GgusInformation Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Massimo, Oliver, Alexei, David, Manuel, Eva, Jhen-Wei);remote(Michael, Onno, Gonzalo, Lisa, Tiju, Rolf, Roger, Kyle, Paolo).

Experiments round table:

  • ATLAS reports -
    • DDM Dashboard issue: statistics plots have not been updated since Nov 25, ~21:00
    • Issues at FR and US Tier-2 sites
    • Sun Nov 27: distributed analysis monitoring
      • some tables show no information; the experts report that the information is probably missing from the PanDA database; under investigation
    • Mon Nov 28, 11:40: the CASTOR to EOS migration of the physics-group areas is in progress

  • CMS reports -
    • LHC / CMS detector
      • Heavy Ions collision data taking
    • CERN / central services
      • [CERN SRM]: srm-cms.cern.ch availability showed some gaps yesterday; under investigation. https://sls.cern.ch/sls/service.php?id=CASTOR-SRM_CMS
      • [CERN/Vanderbilt FTS]: dedicated channel created; apologies for the earlier competing request to increase the STAR channel; the experts determined that a dedicated channel is preferable. GGUS:76783, solved.
      • [CASTORCMS_T0TEMP]: 400 active transfers stuck since the 24th, continuation of GGUS:76649
    • T0
      • Running HI express and prompt reconstruction.
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Run2011 data reprocessing.
      • [T1_DE_KIT]: Reduced throughput of transfers from KIT to US T2s. iperf tests ongoing with some US sites; the problems seem to be mostly political. GGUS:75985 (in progress, last update 11/21)
      • [T1_FR_CCIN2P3]: Reduced throughput of transfers from CCIN2P3; improvements visible, but not enough. IN2P3 asked to start low-level investigations with iperf (an illustrative throughput-probe sketch follows this report); due to Thanksgiving we can follow up on Monday. GGUS:75983 (in progress, last update 11/23), GGUS:71864 (in progress, last update 11/23)
      • [T1_IT_CNAF]: CREAM reporting dead jobs @ CNAF T1 as REALLY-RUNNING. A workaround is queued for the next EMI release; CNAF has been asked to also clear the REALLY-RUNNING jobs left for debugging. This does not hurt processing right now; cleanup will be done on Monday due to Thanksgiving. GGUS:76597 (in progress, last update 11/23)
    • T2 sites:
      • NTR (at least, relevant for this meeting)
    • Other:
      • NTR
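
  The "low-level investigations with iperf" mentioned in the T1 items above are point-to-point network throughput measurements, run independently of the storage systems so that raw network bandwidth can be compared with the observed transfer rates. Purely as an illustration (this is not the procedure used by the sites; the port and host argument below are hypothetical placeholders), a minimal iperf-style TCP throughput probe could look like this:

    # Minimal iperf-style TCP throughput probe (illustrative sketch only; the actual
    # investigations referenced in GGUS:75985 / GGUS:75983 use the standard iperf tool).
    import socket
    import sys
    import time

    PORT = 5001          # placeholder port (iperf's default)
    BLOCK = 64 * 1024    # 64 KiB send/receive block size
    DURATION = 10        # seconds the client transmits

    def serve():
        """Receive one stream and report the achieved throughput."""
        with socket.create_server(("", PORT)) as srv:
            conn, peer = srv.accept()
            total, start = 0, time.time()
            with conn:
                while True:
                    chunk = conn.recv(BLOCK)
                    if not chunk:          # client closed the connection
                        break
                    total += len(chunk)
            elapsed = time.time() - start
            print(f"{peer[0]}: {total / elapsed / 1e6:.1f} MB/s over {elapsed:.1f} s")

    def send(host):
        """Stream zero-filled blocks to the server for DURATION seconds."""
        payload = b"\0" * BLOCK
        sent = 0
        with socket.create_connection((host, PORT)) as conn:
            deadline = time.time() + DURATION
            while time.time() < deadline:
                conn.sendall(payload)
                sent += len(payload)
        print(f"sent {sent / 1e6:.1f} MB in {DURATION} s")

    if __name__ == "__main__":
        # no argument: run as the receiving server; with an argument: send to that host
        serve() if len(sys.argv) == 1 else send(sys.argv[1])

  In practice the sites run the standard iperf tool (server at one end, client at the other, e.g. "iperf -s" and "iperf -c <host>"); the sketch above only illustrates what such a measurement does.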

Sites / Services round table:

  • IN2P3:
    • UPS incident this morning. Service incident report is being prepared.
    • Several interventions are being prepared and will be bundled in a single downtime (Dec 6th). WLCG affected services: dCache and HPSS
  • PIC: Transparent interventions from today to Wednesday. The only side effect (since we are moving worker nodes) is some oscillation (up to -30%) of the total capacity.
  • NLT1: Several interventions are being prepared and will be bundled in a single downtime (Dec 13th)
  • KIT: A misbehaving filesystem was stopped on Friday; it is now back after debugging. This affected 5 CMS pools.
  • CERN:
    • Dashboards: Follow-up of the ATLAS incident: the last sets of statistics are being regenerated
    • Storage Services: In order to allow swift addition of capacity, a new version of EOS has been deployed on EOSCMS. The CMS T0TEMP ticket is being followed up but it is not impacting production (the 400 transfers are "zombie" transfers).
    • Databases: CMS online DB problems: analysis ongoing (including a service request to Oracle)

AOB: (MariaDZ) ALARM drills for the last 3 weeks (since last MB) are attached to this page.

Tuesday:

Attendance: local(Alex, Claudio, Jhen-Wei, Maarten, Maria D, Massimo, Mike, Przemyslaw);remote(Burt, Jeremy, Michael, Rob, Roger, Ronald, Tiju).

Experiments round table:

  • ATLAS reports -
    • T0
      • Migrated CERN-PROD group disk area from CASTOR to EOS. Done at 16:00, Nov. 28. No major issue observed.
        • Massimo: there still are other volumes to be moved, in particular atlast3 (large); a big jump is expected after the HI run has finished
    • T1 sites
      • Set IT cloud offline for LFC migration. Nov 29th-Dec 1st 2011.
    • T2 sites
      • ntr

  • CMS reports -
    • LHC / CMS detector
      • No data taking due to LHC problems until 21:00; then, if the recovery is OK, HI physics
    • CERN / central services
      • srm-cms availability went down to 0 during the night, due to T1TRANSFER service class unavailability. Had ~540 (zombie) active transfers yesterday. OK since this morning.
        • Massimo: the problem was not with the SRM, but due to overload of the service class, which caused monitoring requests to remain queued and the status to go red; we will discuss this offline
      • 400 active transfers on T0TEMP have been cleaned (continuation of GGUS:76649)
      • FTS channel between CERN and Vanderbilt (GGUS:76783): ticket reopened because it had to use CERN-PROD, not CERN.
    • T0
      • Running HI express and prompt reconstruction.
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Run2011 data reprocessing.
      • [T1_DE_KIT]: Reduced throughput of transfers from KIT to US T2s. iperf tests ongoing with some US sites; the problems seem to be mostly political. GGUS:75985 (in progress, last update 11/21)
      • [T1_FR_CCIN2P3]: Reduced throughput of transfers from CCIN2P3; improvements visible, but not enough; low-level investigations with iperf started yesterday. GGUS:75983 (in progress, last update 11/28; GGUS:71864 closed as duplicate)
      • [T1_IT_CNAF]: CREAM reporting dead jobs @ CNAF T1 as REALLY-RUNNING. A workaround is queued for the next EMI release; CNAF has been asked to also clear the REALLY-RUNNING jobs left for debugging; this does not hurt processing right now. GGUS:76597 (in progress, last update 11/24)
    • T2 sites:
      • NTR (at least, relevant for this meeting)
    • Other:
      • NTR

Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • GridPP - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • RAL - ntr

  • CASTOR/EOS - nta
  • dashboards - ntr
  • databases
    • yesterday morning the CMS online DB suffered 3 hangs of 0.5 hour each; the problem has been identified by Oracle support and a bugfix will be available at some point; meanwhile a workaround has been applied
    • tomorrow between 09:30 and 12:30 CET the LHCb integration DB will be upgraded to Oracle 11g
  • GGUS/SNOW - ntr
  • grid services - ntr

AOB:

  • The AFS team at CERN has observed a high number of failing AFS callback connections to remote machines; for instance, yesterday about 35k failed callback connections were detected. In addition to AFS consistency relying on proper callback delivery, failures to deliver callbacks may result in delays for other clients, as the server needs to time out on the unresponsive clients first (see the illustrative estimate after this item). As this impacts the AFS service at CERN, it would be desirable to understand why these machines do not respond to callbacks and whether that could be changed. The sites with the most failed connections will be contacted by the AFS team.
    • Massimo: the first site would be RAL
    • Tiju: I will send our helpdesk e-mail address to the list (done)
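
  As a back-of-the-envelope illustration of the effect described above (an illustrative model only, not AFS code; it assumes callbacks are delivered one after another, and the timing constants are invented for the example), every unresponsive client adds a full timeout before the rest of a callback batch is served:

    # Illustrative estimate of how unresponsive callback targets slow down delivery.
    # Assumes sequential delivery; the constants are assumptions, not AFS settings.
    CALLBACK_TIMEOUT = 2.0   # seconds spent waiting on an unresponsive client (assumed)
    HEALTHY_COST = 0.01      # seconds to deliver a callback to a responsive client (assumed)

    def delivery_time(n_clients: int, n_unresponsive: int) -> float:
        """Total time to work through one batch of callbacks."""
        healthy = (n_clients - n_unresponsive) * HEALTHY_COST
        timeouts = n_unresponsive * CALLBACK_TIMEOUT
        return healthy + timeouts

    # 1000 callbacks, all responsive: ~10 s; the same batch with 50 dead clients: ~110 s.
    print(delivery_time(1000, 0), delivery_time(1000, 50))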

Wednesday

Attendance: local(Jhen-Wei, Massimo, Claudio, David, Luca, MariaDZ, Maarten, Dirk);remote(Michael/BNL, Giovanni/CNAF, John/RAL, Ron/NL-T1, Roger/NDGF, Lisa/FNAL, Pavel/KIT, Rolf/IN2P3).

Experiments round table:

  • ATLAS reports - Jhen-Wei
    • T0
      • ntr
    • T1 sites
      • IN2P3-CC_DATADISK: "Internal timeout. Also affected T0 export." GGUS:76911. High srmput activity yesterday evening (ATLAS and CMS import + WNs) made the transfers slower and generated a backlog, which explains the timeouts seen. The backlog disappeared by itself around midnight. Ticket closed.
      • FZK-LCG2 DATATAPE and MCTAPE: "TRANSFER_TIMEOUT". GGUS:76908. Many queued GridFTP movers in our dCache system; the active movers have very low throughput, which leads to timeouts. MCTAPE had been blacklisted to let transfers focus on T0 export. Some transfers are back in DATATAPE.
        • Pavel/KIT: contacted dCache support, who suggested updating to the latest release; this will be done soon.
      • INFN-T1 Scheduled Downtime for one LFC and SRM [LFC Atlas consolidation and Atlas GPFS filesystem check]. Wed, 30 November, 07:00 – Thu, 1 December, 12:00.
      • Set IT cloud offline for LFC migration. Nov 29th-Dec 1st 2011.
    • T2 sites
      • ntr
  • CMS reports - Claudio
    • LHC / CMS detector
      • No data taking due to LHC problems. HI physics will restart hopefully tomorrow.
    • CERN / central services
      • The CASTOR_T1TRANSFER load showed up briefly again yesterday afternoon (with a corresponding short glitch in srm-cms availability). The reason is not clear yet: production activity was high, but not enough to explain the peak; probably it overlapped with some individual operation.
        • Massimo: added additional resources to the CMS pool.
      • FTS channel to Vanderbilt reconfigured at CERN (GGUS:76783)
    • T0
      • Running HI express and prompt reconstruction.
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Run2011 data reprocessing.
      • [T1_DE_KIT]: Reduced throughput of transfers from KIT to US T2s. iperf tests ongoing with some US sites; the problems seem to be mostly political. GGUS:75985 (in progress, last update 11/21)
      • [T1_FR_CCIN2P3]: Reduced throughput of transfers from CCIN2P3; improvements visible, but not enough; low-level investigations with iperf started yesterday. GGUS:75983 (in progress, last update 11/28; GGUS:71864 closed as duplicate)
      • [T1_IT_CNAF]: CREAM reporting dead jobs as REALLY-RUNNING (GGUS:76597). Ticket closed. Will test the new EMI CREAM as soon as it is installed.
    • T2 sites:
      • NTR (at least, relevant for this meeting)
    • Other:
      • NTR

Sites / Services round table:

  • Michael/BNL - NTR
  • Giovanni/CNAF: in scheduled outage for the CMS SRM: GPFS is offline due to h/w failures. Expect to finish by 18:00.
  • John/RAL: NTR
  • Ron/NL-T1: NTR
  • Roger/NDGF: NTR
  • Lisa/FNAL: power work in the CC tomorrow, 7:00-13:00, might affect some dCache pool nodes (expect access delays)
  • Pavel/KIT: NTR
  • Rolf/IN2P3: NTR
  • Rob/OSG: a combination of changes (several weeks ago and recent) broke the comparison report between OSG- and SAM-calculated availability for a few days. This has been fixed and the report looked OK this morning.

AOB: (MariaDZ) This concerns the Tier-0 only: in view of the CERN Year End closure, the GGUS-SNOW ticket handling was reviewed for the period 2011/12/22-2012/01/04. The conclusions for production grid services at CERN (in the IT PES, DSS and DB groups) are summarised by Maite Barroso below:

  • The main route for getting notified of critical problems is GGUS ALARM tickets. This is working and does not need any change.
  • For GGUS TEAM tickets, the members of e-group grid-cern-prod-admins will be put in the Outside Working Hours (OWH) group in order to have access to the SNOW instance of the ticket at all times.
  • For the rest (CERN-specific), all services are correctly declared in SNOW with OWH support groups, and in the Services' Data Base (SDB) with the right criticality, so it should be fine.

Thursday

Attendance: local(Jhen-Wei, Massimo, Eva, David, Alex, Dirk);remote(Gonzalo/PIC, Michael/BNL, Gareth/RAL, Giovanni/CNAF, Rolf/IN2P3, Roger/NDGF, Claudio/CMS, Lisa/FNAL, Rob/OSG, Ronald/NL-T1).

Experiments round table:

  • ATLAS reports - Jhen-Wei
    • T0
      • ntr
    • T1 sites
      • FZK-LCG2 DATATAPE and MCTAPE: "TRANSFER_TIMEOUT". GGUS:76908. Many queued GridFTP movers in our dCache system; the active movers have very low throughput, which leads to timeouts. Is the rolling upgrade done? Do other T1s using the same version have a potential issue?
      • INFN-T1 Scheduled Downtime for one LFC and SRM [LFC Atlas consolidation and Atlas GPFS filesystem check]. Wed, 30 November, 07:00 – Thu, 1 December, 12:00.
      • Set IT cloud offline for LFC migration. Nov 29th-Dec 1st 2011.
    • T2 sites
      • ntr
  • CMS reports -
    • LHC / CMS detector
      • HI Physics data taking
    • CERN / central services
      • Smooth...
      • Still peaky use of the T1TRANSFER pool. No consequences. The activity is legitimate. We are investigating whether we need more resources allocated permanently to CMS.
    • T0
      • Running HI express and prompt reconstruction.
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Run2011 data reprocessing.
      • T1_IT_CNAF back after the storage intervention
      • Problem of throughput from T1_DE_KIT and T1_FR_CCIN2P3 to US T2s still open (GGUS:75985 and GGUS:75983)
    • T2 sites:
      • NTR (at least, relevant for this meeting)
    • Other:
      • NTR

Sites / Services round table:
  • Gonzalo/PIC - ntr
  • Michael/BNL - ntr
  • Gareth/RAL - ntr
  • Giovanni/CNAF - ntr
  • Rolf/IN2P3 - ntr
  • Roger/NDGF - ntr
  • Claudio/CMS
  • Lisa/FNAL - in power maintenance with a few dCache nodes down; they should be up by 1 pm
  • Ronald/NL-T1 - ntr
  • Rob/OSG - ntr
  • Massimo/CERN - next week: transparent CASTOR SRM upgrade on Tue (ALICE and LHCb) and Wed (CMS and ATLAS)
AOB:
  • In yesterday's meeting Michael asked for an update on the networking issues at KIT; Pavel will follow up with the experts at KIT.

Friday

Attendance: local(AndreaV, Jhen-Wei, Claudio, Eva, Lola, David); remote(Michael/BNL, Giovanni/CNAF, John/RAL, Alexander/NL-T1, Mette/NDGF, Xavier/KIT, Rolf/IN2P3, Gonzalo/PIC, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • T0
      • ADCR high load because of deletion activity after the IT LFC merging. Propose to stop deletions temporarily.
    • T1 sites
      • FZK-LCG2 DATATAPE and MCTAPE: "TRANSFER_TIMEOUT". GGUS:76908. Many queued GridFTP movers in our dCache system; the active movers have very low throughput, which leads to timeouts. All GridKa dCache pools of ATLAS have been upgraded and all of them are back in production. Ticket closed.
    • T2 sites
      • ntr

  • CMS reports -
    • LHC / CMS detector
      • HI Physics data taking
    • CERN / central services
      • The size of the T1TRANSFER pool is ok. It is acceptable that during usage peaks we have queued transfers.
    • T0
      • Running HI express and prompt reconstruction.
      • Jobs stuck on one T0 WN could not be killed because the machine was down (GGUS:76990). Fixed.
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Problem of throughput from T1_DE_KIT and T1_FR_CCIN2P3 to US T2s still open (GGUS:75985 and GGUS:75983)
    • T2 sites:
    • Other:
      • NTR

  • ALICE reports -
    • Problem with one xrootd server this morning; will follow up with the Quattor experts. It only affected the machine for a very short time.

Sites / Services round table:

  • Michael/BNL: ntr
  • Giovanni/CNAF: ntr
  • John/RAL: ntr
  • Alexander/NL-T1: ntr
  • Mette/NDGF: ntr
  • Xavier/KIT: staging of files for CMS was blocked, now fixed and working on the backlog
  • Rolf/IN2P3: ntr
  • Gonzalo/PIC: ntr
  • Rob/OSG: ntr

  • Eva/Databases: ntr
  • David/Dashboard: ntr

AOB: none

-- JamieShiers - 31-Oct-2011

Topic attachments
  • ggus-data_MB_20111129.ppt (PowerPoint, 2330.0 K, uploaded 2011-11-28 11:03 by MariaDimou)