Week of 111128

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information GGUS Information LHC Machine Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions WLCG Blogs   GgusInformation Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Massimo, Oliver, Alexei, David, Manuel, Eva, Jhen-Wei);remote(Michael, Onno, Gonzalo, Lisa, Tiju, Rolf, Roger, Kyle, Paolo).

Experiments round table:

  • ATLAS reports -
    • DDM Dashboard issue: statistics plots have not been updated since Nov 25, ~21:00
    • Issues at FR and US Tier-2 sites
    • Sun Nov 27: distributed analysis monitoring
      • some tables show no information; the experts report that the information is probably missing from the PanDA database; under investigation
    • Mon Nov 28, 11:40: the CASTOR to EOS migration of the physics-group areas is in progress

  • CMS reports -
    • LHC / CMS detector
      • Heavy Ions collision data taking
    • CERN / central services
      • [CERN SRM]: srm-cms.cern.ch availability showed some gaps yesterday; under investigation. https://sls.cern.ch/sls/service.php?id=CASTOR-SRM_CMS
      • [CERN/Vanderbilt FTS]: dedicated channel created; apologies for the earlier competing request to increase the STAR channel; the experts determined that a dedicated channel is preferable. GGUS:76783, solved.
      • [CASTORCMS_T0TEMP]: 400 active transfers stuck since the 24th, continuation of GGUS:76649
    • T0
      • Running HI express and prompt reconstruction.
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Run2011 data reprocessing.
      • [T1_DE_KIT]: Reduced throughput of transfers from KIT to US T2s. iperf tests ongoing with some US sites; the problems seem to be mostly political. GGUS:75985 (in progress, last update 11/21)
      • [T1_FR_CCIN2P3]: Reduced throughput of transfers from CCIN2P3; improvements visible, but not enough. IN2P3 asked to start low-level investigations with iperf (an illustrative throughput-probe sketch follows this report); due to Thanksgiving we can follow up on Monday. GGUS:75983 (in progress, last update 11/23), GGUS:71864 (in progress, last update 11/23)
      • [T1_IT_CNAF]: CREAM reporting dead jobs @ CNAF T1 as REALLY-RUNNING. A workaround is queued for the next EMI release; CNAF has been asked to also clear the REALLY-RUNNING jobs left for debugging. This does not hurt processing right now; cleanup will be done on Monday due to Thanksgiving. GGUS:76597 (in progress, last update 11/23)
    • T2 sites:
      • NTR (at least, relevant for this meeting)
    • Other:
      • NTR
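
  The "low-level investigations with iperf" mentioned in the T1 items above are point-to-point network throughput measurements, run independently of the storage systems so that raw network bandwidth can be compared with the observed transfer rates. Purely as an illustration (this is not the procedure used by the sites; the port and host argument below are hypothetical placeholders), a minimal iperf-style TCP throughput probe could look like this:

    # Minimal iperf-style TCP throughput probe (illustrative sketch only; the actual
    # investigations referenced in GGUS:75985 / GGUS:75983 use the standard iperf tool).
    import socket
    import sys
    import time

    PORT = 5001          # placeholder port (iperf's default)
    BLOCK = 64 * 1024    # 64 KiB send/receive block size
    DURATION = 10        # seconds the client transmits

    def serve():
        """Receive one stream and report the achieved throughput."""
        with socket.create_server(("", PORT)) as srv:
            conn, peer = srv.accept()
            total, start = 0, time.time()
            with conn:
                while True:
                    chunk = conn.recv(BLOCK)
                    if not chunk:          # client closed the connection
                        break
                    total += len(chunk)
            elapsed = time.time() - start
            print(f"{peer[0]}: {total / elapsed / 1e6:.1f} MB/s over {elapsed:.1f} s")

    def send(host):
        """Stream zero-filled blocks to the server for DURATION seconds."""
        payload = b"\0" * BLOCK
        sent = 0
        with socket.create_connection((host, PORT)) as conn:
            deadline = time.time() + DURATION
            while time.time() < deadline:
                conn.sendall(payload)
                sent += len(payload)
        print(f"sent {sent / 1e6:.1f} MB in {DURATION} s")

    if __name__ == "__main__":
        # no argument: run as the receiving server; with an argument: send to that host
        serve() if len(sys.argv) == 1 else send(sys.argv[1])

  In practice the sites run the standard iperf tool (server at one end, client at the other, e.g. "iperf -s" and "iperf -c <host>"); the sketch above only illustrates what such a measurement does.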

Sites / Services round table:

  • IN2P3:
    • UPS incident this morning. Service incident report is being prepared.
    • Several interventions are being prepared and will be bundled in a single downtime (Dec 6th). WLCG affected services: dCache and HPSS
  • PIC: Transparent interventions from today to Wednesday. The only side effect (since we are moving worker nodes) is some oscillation (up to -30%) of the total capacity.
  • NLT1: Several interventions are being prepared and will be bundled in a single downtime (Dec 13th)
  • KIT: A misbehaving filesystem was stopped on Friday; it is now back after debugging. This affected 5 CMS pools.
  • CERN:
    • Dashboards: Follow-up of the ATLAS incident: the last sets of statistics are being regenerated
    • Storage Services: In order to allow swift addition of capacity, a new version of EOS has been deployed on EOSCMS. The CMS T0TEMP ticket is being followed up but it is not impacting production (the 400 transfers are "zombie" transfers).
    • Databases: CMS online DB problems: analysis ongoing (including a service request to Oracle)

AOB: (MariaDZ) ALARM drills for the last 3 weeks (since last MB) are attached to this page.

Tuesday:

Attendance: local(Alex, Claudio, Jhen-Wei, Maarten, Maria D, Massimo, Mike, Przemyslaw);remote(Burt, Jeremy, Michael, Rob, Roger, Ronald, Tiju).

Experiments round table:

  • ATLAS reports -
    • T0
      • Migrated CERN-PROD group disk area from CASTOR to EOS. Done at 16:00, Nov. 28. No major issue observed.
        • Massimo: there still are other volumes to be moved, in particular atlast3 (large); a big jump is expected after the HI run has finished
    • T1 sites
      • Set IT cloud offline for LFC migration. Nov 29th-Dec 1st 2011.
    • T2 sites
      • ntr

  • CMS reports -
    • LHC / CMS detector
      • No data taking due to LHC problems until 21:00; then, if the recovery is OK, HI physics
    • CERN / central services
      • srm-cms availability went down to 0 during the night, due to T1TRANSFER service class unavailability. Had ~540 (zombie) active transfers yesterday. OK since this morning.
        • Massimo: the problem was not with the SRM, but due to overload of the service class, which caused monitoring requests to remain queued and the status to go red; we will discuss this offline
      • 400 active transfers on T0TEMP have been cleaned (continuation of GGUS:76649)
      • FTS channel between CERN and Vanderbilt (GGUS:76783): ticket reopened because it had to use CERN-PROD, not CERN.
    • T0
      • Running HI express and prompt reconstruction.
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Run2011 data reprocessing.
      • [T1_DE_KIT]: Reduced throughput of transfers from KIT to US T2s. iperf tests ongoing with some US sites; the problems seem to be mostly political. GGUS:75985 (in progress, last update 11/21)
      • [T1_FR_CCIN2P3]: Reduced throughput of transfers from CCIN2P3; improvements visible, but not enough; low-level investigations with iperf started yesterday. GGUS:75983 (in progress, last update 11/28; GGUS:71864 closed as duplicate)
      • [T1_IT_CNAF]: CREAM reporting dead jobs @ CNAF T1 as REALLY-RUNNING. A workaround is queued for the next EMI release; CNAF has been asked to also clear the REALLY-RUNNING jobs left for debugging; this does not hurt processing right now. GGUS:76597 (in progress, last update 11/24)
    • T2 sites:
      • NTR (at least, relevant for this meeting)
    • Other:
      • NTR

Sites / Services round table:

  • BNL - ntr
  • FNAL - ntr
  • GridPP - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • RAL - ntr

  • CASTOR/EOS - nta
  • dashboards - ntr
  • databases
    • yesterday morning the CMS online DB suffered 3 hangs of 0.5 hour each; the problem has been identified by Oracle support and a bugfix will be available at some point; meanwhile a workaround has been applied
    • tomorrow between 09:30 and 12:30 CET the LHCb integration DB will be upgraded to Oracle 11g
  • GGUS/SNOW - ntr
  • grid services - ntr

AOB:

  • The AFS team at CERN has observed a high number of failing AFS callback connections to remote machines; for instance, yesterday about 35k failed callback connections were detected. In addition to AFS consistency relying on proper callback delivery, failures to deliver callbacks may result in delays for other clients, as the server needs to time out on the unresponsive clients first (see the illustrative estimate after this item). As this impacts the AFS service at CERN, it would be desirable to understand why these machines do not respond to callbacks and whether that could be changed. The sites with the most failed connections will be contacted by the AFS team.
    • Massimo: the first site would be RAL
    • Tiju: I will send our helpdesk e-mail address to the list (done)
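
  As a back-of-the-envelope illustration of the effect described above (an illustrative model only, not AFS code; it assumes callbacks are delivered one after another, and the timing constants are invented for the example), every unresponsive client adds a full timeout before the rest of a callback batch is served:

    # Illustrative estimate of how unresponsive callback targets slow down delivery.
    # Assumes sequential delivery; the constants are assumptions, not AFS settings.
    CALLBACK_TIMEOUT = 2.0   # seconds spent waiting on an unresponsive client (assumed)
    HEALTHY_COST = 0.01      # seconds to deliver a callback to a responsive client (assumed)

    def delivery_time(n_clients: int, n_unresponsive: int) -> float:
        """Total time to work through one batch of callbacks."""
        healthy = (n_clients - n_unresponsive) * HEALTHY_COST
        timeouts = n_unresponsive * CALLBACK_TIMEOUT
        return healthy + timeouts

    # 1000 callbacks, all responsive: ~10 s; the same batch with 50 dead clients: ~110 s.
    print(delivery_time(1000, 0), delivery_time(1000, 50))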

Wednesday

Attendance: local(Jhen-Wei, Massimo, Claudio, David, Luca, MariaDZ, Maarten, Dirk);remote(Michael/BNL, Giovanni/CNAF, John/RAL, Ron/NL-T1, Roger/NDGF, Lisa/FNAL, Pavel/KIT, Rolf/IN2P3).

Experiments round table:

  • ATLAS reports - Jhen-Wei
    • T0
      • ntr
    • T1 sites
      • IN2P3-CC_DATADISK: "Internal timeout. Also affected T0 export." GGUS:76911. High srmput activity yesterday evening (ATLAS and CMS import + WNs) made the transfers slower and generated a backlog, which explains the timeouts seen. The backlog disappeared by itself around midnight. Ticket closed.
      • FZK-LCG2 DATATAPE and MCTAPE: "TRANSFER_TIMEOUT". GGUS:76908. Many queued GridFTP movers in our dCache system; the active movers have very low throughput, which leads to timeouts. MCTAPE had been blacklisted to let transfers focus on T0 export. Some transfers are back in DATATAPE.
        • Pavel/KIT: contacted dCache support, who suggested updating to the latest release; this will be done soon.
      • INFN-T1 Scheduled Downtime for one LFC and SRM [LFC Atlas consolidation and Atlas GPFS filesystem check]. Wed, 30 November, 07:00 – Thu, 1 December, 12:00.
      • Set IT cloud offline for LFC migration. Nov 29th-Dec 1st 2011.
    • T2 sites
      • ntr
  • CMS reports - Claudio
    • LHC / CMS detector
      • No data taking due to LHC problems. HI physics will restart hopefully tomorrow.
    • CERN / central services
      • The CASTOR_T1TRANSFER load showed up briefly again yesterday afternoon (with a corresponding short glitch in srm-cms availability). The reason is not clear yet: production activity was high, but not enough to explain the peak; probably it overlapped with some individual operation.
        • Massimo: added additional resources to the CMS pool.
      • FTS channel to Vanderbilt reconfigured at CERN (GGUS:76783)
    • T0
      • Running HI express and prompt reconstruction.
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Run2011 data reprocessing.
      • [T1_DE_KIT]: Reduced throughput of transfers from KIT to US T2s. iperf tests ongoing with some US sites; the problems seem to be mostly political. GGUS:75985 (in progress, last update 11/21)
      • [T1_FR_CCIN2P3]: Reduced throughput of transfers from CCIN2P3; improvements visible, but not enough; low-level investigations with iperf started yesterday. GGUS:75983 (in progress, last update 11/28; GGUS:71864 closed as duplicate)
      • [T1_IT_CNAF]: CREAM reporting dead jobs as REALLY-RUNNING (GGUS:76597). Ticket closed. Will test the new EMI CREAM as soon as it is installed.
    • T2 sites:
      • NTR (at least, relevant for this meeting)
    • Other:
      • NTR

Sites / Services round table:

  • Michael/BNL - NTR
  • Giovanni/CNAF: in scheduled outage for the CMS SRM: GPFS is offline due to h/w failures. Expect to finish by 18:00.
  • John/RAL: NTR
  • Ron/NL-T1: NTR
  • Roger/NDGF: NTR
  • Lisa/FNAL: power work in the CC tomorrow, 7:00-13:00, might affect some dCache pool nodes (expect access delays)
  • Pavel/KIT: NTR
  • Rolf/IN2P3: NTR
  • Rob/OSG: a combination of changes (several weeks ago and recent) broke the comparison report between OSG- and SAM-calculated availability for a few days. This has been fixed and the report looked OK this morning.

AOB: (MariaDZ) This concerns the Tier-0 only: in view of the CERN Year End closure, the GGUS-SNOW ticket handling was reviewed for the period 2011/12/22-2012/01/04. The conclusions for production grid services at CERN (in the IT PES, DSS and DB groups) are summarised by Maite Barroso below:

  • The main route for getting notified of critical problems is GGUS ALARM tickets. This is working and does not need any change.
  • For GGUS TEAM tickets, the members of e-group grid-cern-prod-admins will be put in the Outside Working Hours (OWH) group in order to have access to the SNOW instance of the ticket at all times.
  • For the rest (CERN-specific), all services are correctly declared in SNOW with OWH support groups, and in the Services' Data Base (SDB) with the right criticality, so it should be fine.

Thursday

Attendance: local(Jhen-Wei, Massimo, Eva, David, Alex, Dirk);remote(Gonzalo/PIC, Michael/BNL, Gareth/RAL, Giovanni/CNAF, Rolf/IN2P3, Roger/NDGF, Claudio/CMS, Lisa/FNAL, Rob/OSG, Ronald/NL-T1).

Experiments round table:

  • ATLAS reports - Jhen-Wei
    • T0
      • ntr
    • T1 sites
      • FZK-LCG2 DATATAPE and MCTAPE: "TRANSFER_TIMEOUT". GGUS:76908. Many queued GridFTP movers in our dCache system; the active movers have very low throughput, which leads to timeouts. Is the rolling upgrade done? Do other T1s using the same version have a potential issue?
      • INFN-T1 Scheduled Downtime for one LFC and SRM [LFC Atlas consolidation and Atlas GPFS filesystem check]. Wed, 30 November, 07:00 – Thu, 1 December, 12:00.
      • Set IT cloud offline for LFC migration. Nov 29th-Dec 1st 2011.
    • T2 sites
      • ntr
  • CMS reports -
    • LHC / CMS detector
      • HI Physics data taking
    • CERN / central services
      • Smooth...
      • Still peaky use of the T1TRANSFER pool. No consequences. The activity is legitimate. We are investigating whether we need more resources allocated permanently to CMS.
    • T0
      • Running HI express and prompt reconstruction.
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Run2011 data reprocessing.
      • T1_IT_CNAF back after the storage intervention
      • Problem of throughput from T1_DE_KIT and T1_FR_CCIN2P3 to US T2s still open (GGUS:75985 and GGUS:75983)
    • T2 sites:
      • NTR (at least, relevant for this meeting)
    • Other:
      • NTR

Sites / Services round table:
  • Gonzalo/PIC - ntr
  • Michael/BNL - ntr
  • Gareth/RAL - ntr
  • Giovanni/CNAF - ntr
  • Rolf/IN2P3 - ntr
  • Roger/NDGF - ntr
  • Claudio/CMS
  • Lisa/FNAL - in power maintenance with a few dCache nodes down; they should be up by 1 pm
  • Ronald/NL-T1 - ntr
  • Rob/OSG - ntr
  • Massimo/CERN - next week: transparent CASTOR SRM upgrade on Tue (ALICE and LHCb) and Wed (CMS and ATLAS)
AOB:
  • In yesterday's meeting Michael asked for an update on the networking issues at KIT; Pavel will follow up with the experts at KIT.

Friday

Attendance: local(AndreaV, Jhen-Wei, Claudio, Eva, Lola, David); remote(Michael/BNL, Giovanni/CNAF, John/RAL, Alexander/NL-T1, Mette/NDGF, Xavier/KIT, Rolf/IN2P3, Gonzalo/PIC, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • T0
      • ADCR high load because of deletion activity after the IT LFC merging. Propose to stop deletions temporarily.
    • T1 sites
      • FZK-LCG2 DATATAPE and MCTAPE: "TRANSFER_TIMEOUT". GGUS:76908. Many queued GridFTP movers in our dCache system; the active movers have very low throughput, which leads to timeouts. All GridKa dCache pools of ATLAS have been upgraded and all of them are back in production. Ticket closed.
    • T2 sites
      • ntr

  • CMS reports -
    • LHC / CMS detector
      • HI Physics data taking
    • CERN / central services
      • The size of the T1TRANSFER pool is ok. It is acceptable that during usage peaks we have queued transfers.
    • T0
      • Running HI express and prompt reconstruction.
      • Jobs stuck on one T0 WN could not be killed because the machine was down (GGUS:76990). Fixed.
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Problem of throughput from T1_DE_KIT and T1_FR_CCIN2P3 to US T2s still open (GGUS:75985 and GGUS:75983)
    • T2 sites:
    • Other:
      • NTR

  • ALICE reports -
    • Problem with one xrootd server this morning; will follow up with the Quattor experts. It only affected the machine for a very short time.

Sites / Services round table:

  • Michael/BNL: ntr
  • Giovanni/CNAF: ntr
  • John/RAL: ntr
  • Alexander/NL-T1: ntr
  • Mette/NDGF: ntr
  • Xavier/KIT: staging of files for CMS was blocked, now fixed and working on the backlog
  • Rolf/IN2P3: ntr
  • Gonzalo/PIC: ntr
  • Rob/OSG: ntr

  • Eva/Databases: ntr
  • David/Dashboard: ntr

AOB: none

-- JamieShiers - 31-Oct-2011

Topic attachments
  • ggus-data_MB_20111129.ppt (PowerPoint, 2330.0 K, uploaded 2011-11-28 11:03 by MariaDimou)