Week of 120701

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Luc, Torre, Maarten, Luca, Zbigniew, Eddy, Jacob, Dirk);remote(Michael, Vladimir, Rolf, Rob, Paolo, Tiju, Onno, Jhen-Wei, Dimitri).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • Fake alarms after power cut: Get errors, disk shortage. Are all disk servers back?
        • Luca: one or two servers are still down - replication should cover this.
      • During the power cut 4 of 5 LFC frontends were unavailable, whereas the Central Catalogs were OK. Possible to have the same criticality for LFC & CC?
        • Maarten: please file a GGUS ticket to LFC support.
    • T1
      • IN2P3-CC: SRM instabilities & dCache problems (dCache core servers running under SL6, affected by the leap second). GGUS:83729 solved.
      • NDGF-T1: due to the leap second several dCache services had to be restarted.
    • T2
      • NTR

  • CMS reports -
    • LHC machine / CMS detector
      • intensity ramp-up
    • CERN / central services and T0
      • iCMS down all weekend due to disk failure caused by power cut on Friday, being fixed today GGUS:83726
    • Tier-1/2:
      • T1_DE_KIT: was failing HammerCloud due to problems with cream-2; seems to be fixed already (GGUS:83736)
      • FTS for T1_IT_CNAF: credential delegation problem, actively followed up in GGUS:83486 (currently OK, keeping it here in case of recurrence)

  • ALICE reports -
    • IN2P3: looking into the low number of job submissions since Saturday.

  • LHCb reports -
    • User analysis at T1s ongoing
    • MC production at all sites
    • T0:
      • Moving DIRAC accounting services to new machines.
      • CERN: (GGUS:83713) 30 TB are still missing on LHCb-Disk
    • T1:
      • FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites).
      • IN2P3 / GridKa file corruption

    • Eddy: Nagios T1 tests not running since Saturday - being investigated - Stefan is aware
    • Luca: posted a list of unavailable files. LHCb could decide either to wait (typically two days) for recovery or to re-import. Vladimir: will notify about the decision.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • KIT - ntr
  • IN2P3 - ntr
  • NLT1 - SARA also suffered from leap second problem.
  • OSG - ntr
  • RAL - ntr

AOB:

Tuesday

Attendance: local(Claudio, Torre, Eva, Eddy, Luca, Gavin, MariaDZ, Dirk);remote(Tiju/RAL, Dimitri/KIT, Michael/BNL, Dmytro/NDGF, XavierM/KIT, Jhen-Wei, Ronald/NL-T1, Burt/FNAL, Rolf/IN2P3, Vladimir/LHCb, Paolo/CNAF, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • CERN-PROD_TZERO shows (since last night) high fraction of staging failures, ticketed this morning, under investigation GGUS:83781
      • Regarding the LFC robustness issue reported yesterday: discussed in the ATLAS/IT meeting yesterday; only one LFC server is currently in the critical category. This will be increased to three.
      • Brief problem with GGUS yesterday morning: individual ticket pages gave a 500 error; reported to be due to server problems, under study
    • T1
      • NDGF-T1: LFC migration to CERN is underway today, no reported problems so far
    • T2
      • NTR
    • MariaDZ: GGUS issue Sun to Mon due to the leap second; see the sketch after this list. (Full explanation by GGUS developer Guenter Grein: "The problem started after introducing a leap second in the night 20120630 to 20120701. It ended Monday morning 09:30 after restarting not only the Remedy server but also the underlying OS. Especially the Java based Remedy mid-tier was facing problems which increased with the rising number of accesses to the GGUS web portal this morning. Several publications in the www report problems of Linux kernels and Java applications related to the leap second. We believe this has caused the GGUS problems too. The GGUS server is running on a Linux OS and a major part of Remedy is Java-based.")
    • Luca: all files staged under "default" pools - is that expected? Torre: will check
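
A minimal sketch (added for illustration) of the mitigation that was widely circulated for the 2012 leap-second problem described above: stop ntpd and re-set the system clock, which clears the confused kernel timer state that made Java applications spin. The service name and exact commands are assumptions and site-specific; this is not necessarily the procedure the GGUS team actually followed.

```python
# Minimal sketch only; assumes root on an affected Linux host and an
# "ntpd" service name. Re-setting the clock to its current value was the
# commonly reported way to clear the leap-second timer condition.
import subprocess

def reset_clock_after_leap_second():
    subprocess.call(["service", "ntpd", "stop"])  # avoid ntpd immediately re-stepping the clock
    # "date MMDDhhmmCCYY.ss" sets the clock; here we simply set it to "now".
    now = subprocess.check_output(["date", "+%m%d%H%M%Y.%S"]).decode().strip()
    subprocess.check_call(["date", now])
    subprocess.call(["service", "ntpd", "start"])

if __name__ == "__main__":
    reset_clock_after_leap_second()
```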

  • CMS reports -
    • LHC machine / CMS detector
      • Physics data taking during beam intensity ramp-up
    • CERN / central services and T0
    • Tier-1/2:
      • Persistent HammerCloud failures at T2_FR_GRIF_IRFU. T2_RU_PNPI has had problems for days.
      • GGUS:83486 (FTS delegation problem): currently no problems but keeping here until sw is fixed

  • ALICE reports -
    • IN2P3: low number of job submissions still being investigated by AliEn experts.

  • LHCb reports -
    • User analysis at T1s ongoing; reconstruction started
    • MC production at all sites
    • T0:
      • Moving DIRAC accounting services to new machines.
      • CERN: (GGUS:83713) 30 TB are still missing on LHCb-Disk; waiting for repair
      • Problems with the SAM SRM probes (GGUS:83782); Requested to remove probes [org.lhcb.SRM-VOLs, org.lhcb.SRM-VODel] from profile LHCB_CRITICAL
    • T1:
      • FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites)
      • IN2P3 / GridKa file corruption

    • Luca: the missing disk server is back - please test

Sites / Services round table:

  • Tiju/RAL - ntr
  • Michael/BNL - ntr
  • Dmytro/NDGF - due to the LFC server migration no jobs are running right now
  • XavierM/KIT - ntr
  • Jhen-Wei/ASGC - ntr
  • Ronald/NL-T1 - ntr
  • Burt/FNAL - running at reduced capacity due to the cooling load, but pledge value kept so far
  • Rolf/IN2P3 - ntr
  • Paolo/CNAF - ntr
  • Rob/OSG - tomorrow US holiday: nobody will attend call
  • MariaDZ: got several requests for server certificates from users in the DOE grid. Is there somebody closer for this type of request? Rob: yes, mail to osg-ra@opensciencegrid.org would work too.
  • Gavin:
  • MariaDZ: there will be a GGUS release on Monday, followed by alarm tests. This release will introduce an interface for GGUS requests (in addition to incidents)

AOB:

Wednesday

Attendance: local(Torre, Maarten, Luca, Luca, Claudio, Eddy, Dirk);remote(Michael/BNL, Dmytro/NDGF, Jhen-Wei/ASGC, John/RAL, Rolf/IN2P3, Vladimir/LHCb, Pavel/KIT, Ron/NL-T1, Paolo/CNAF).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • CERN-PROD_TZERO staging failures have abated today, yesterday's problem still under investigation. GGUS:83781
    • T1
      • SARA-MATRIX: recurring source transfer failures, spacemanager crashing periodically, site opened ticket to dCache developers. GGUS:83774
      • IN2P3-CC: ramping up jobs after the CVMFS outage from yesterday evening was resolved this morning.
      • NDGF-T1: LFC migration to CERN went smoothly yesterday
    • T2
      • NTR
  • Luca: the T0 pool suffers from recalls concentrated on 1-2 disk servers. Please contact us if new cases appear. We could try changing the scheduling policy to random instead of best fit (see the sketch below).
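
A toy sketch (added for illustration, not CASTOR's actual scheduler code) of why a best-fit pool-selection policy can keep hitting the same one or two disk servers while a random policy spreads the recall load. The "best fit" semantics, server names and free-space figures are assumptions.

```python
# Toy illustration only, not CASTOR code: contrast a "best fit" policy
# (pick the server whose free space most closely fits the request) with
# a "random" policy. Server names and free-space figures are invented.
import random
from collections import Counter

servers = {"srv01": 500, "srv02": 480, "srv03": 2000, "srv04": 1900}  # free GB

def best_fit(free, size):
    # Smallest free space that still fits -> the same few servers win repeatedly.
    fitting = {s: f for s, f in free.items() if f >= size}
    return min(fitting, key=fitting.get)

def random_fit(free, size):
    # Any server that fits, chosen uniformly -> recall load is spread out.
    return random.choice([s for s, f in free.items() if f >= size])

for policy in (best_fit, random_fit):
    picks = Counter(policy(servers, size=10) for _ in range(1000))
    print(policy.__name__, dict(picks))
```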

  • CMS reports -
    • LHC machine / CMS detector
      • Physics data taking with 1380 bunches
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • GGUS:83486 (FTS delegation problem): currently no problems but keeping here until sw is fixed
    • Other:
      • NTR

  • ALICE reports -
    • IN2P3: normal job submission rates since yesterday evening; the cause of the problem was a configuration issue in the central AliEn services.

  • LHCb reports -
    • User analysis and reconstruction at T1s
      • MC production at T2s
    • T0:
      • Problems with the SAM SRM probes (GGUS:83782); Requested to remove probes [org.lhcb.SRM-VOLs, org.lhcb.SRM-VODel] from profile LHCB_CRITICAL
    • T1:
      • FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites).
      • IN2P3 / GridKa file corruption
      • IN2P3: CVMFS problem; fixed

Sites / Services round table:

  • Michael/BNL - ntr
  • Dmytro/NDGF - ntr
  • Jhen-Wei/ASGC - ntr
  • John/RAL - ntr
  • Rolf/IN2P3 - the CVMFS problem already appeared some months ago and seems related to a new subdirectory being pushed from CERN. The problem is already fixed and a new CVMFS version has now been deployed at IN2P3. Experiments should inform the local site contact about new subdirectories, to confirm that the bug is indeed gone.
  • Pavel/KIT - ntr
  • Ron/NL-T1 - still working on SRM instability with dcache devs
  • Paolo/CNAF - ntr

AOB:

Thursday

Attendance: local(Torre, Eva, Eddy, Dirk);remote(Vladimir/LHCb, Michael/BNL, Dmytro/NDGF, Jhen-Wei/ASGC, Rolf/IN2P3, Ronald/NL-T1, Gonzalo/PIC, Rob/OSG, Jeremy/GridPP, Burt/FNAL, Paolo/CNAF, Gareth/RAL).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • CERN-PROD_TZERO staging timeout ticket remains open, possible cause/solution found, to be confirmed. No current errors. GGUS:83781
    • T1
      • SARA-MATRIX: continuing (this morning) to see transfer failures. GGUS:83774
    • T2
      • NTR

  • CMS reports -
    • LHC machine / CMS detector
      • Mainly physics data taking with 1380 bunches
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • Slow restart of PIC after the scheduled downtime (SD). Now OK.
      • GGUS:83486 (FTS delegation problem): currently no problems but keeping here until sw is fixed
    • Other:
      • Apologies for not being able to attend today

  • LHCb reports -
    • User analysis and reconstruction at T1s
      • MC production at T2s
    • T0:
    • T1:
      • FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites).
      • IN2P3 / GridKa file corruption

Sites / Services round table:

  • Michael/BNL - ntr
  • Dmytro/NDGF - ntr
  • Jhen-Wei/ASGC - ntr
  • Rolf/IN2P3 - network problem this morning: the NREN had cut off the centre from 7:45 to 8:10 - no impact on running jobs
  • Ronald/NL-T1 - SRM problems at SARA: an internal cleanup of the PostgreSQL DB has lowered the CPU load - can't tell yet whether stability has improved too.
  • Gonzalo/PIC - next week planning a 3-day shutdown to migrate the dCache namespace (planned for Tuesday afternoon to Friday afternoon)
  • Rob/OSG - ntr
  • Jeremy/GridPP -ntr
  • Burt/FNAL - ntr
  • Paolo/CNAF - scheduled downtime for ATLAS GPFS on Monday morning (closed the day before). Is it possible to schedule a downtime only for one file system instead of the full service? Dirk: suggest using a warning instead, specifying the affected area, as this allows each experiment to decide whether it is worth keeping the site during the outage.
  • Gareth/RAL - ntr

  • Eva: found a storage misconfiguration in the PDBR cluster; a rolling intervention will be needed to fix it. In discussion with the affected user, suggesting an intervention this afternoon.
  • MariaDZ: Tier0 service managers please note that with next Monday's (20120709) GGUS release the GGUS-SNOW interface for "Change Requests" and the mandatory introduction of the SNOW Service Element (SE) enter production. In case of problems, please open a GGUS ticket for the Support Unit 'GGUS', i.e. the dev team.

AOB:

Friday

Attendance: local(Torre, Maarten, Claudio, Eddy, Luca, Dirk);remote(Dmytro/NDGF, Michael/BNL, Gonzalo/PIC, Xavier/KIT, John/RAL, Jhen-Wei/ASGC, Rolf/IN2P3, Onno/NL-T1, Paolo/CNAF, Vladimir/LHCb, Rob/OSG, Burt/FNAL).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • CERN-PROD_TZERO staging timeout ticket closed, no further problems in 2 days. GGUS:83781
    • T1
      • PIC: Tier 0 export failures, ticketed at 9:13 this morning with 100% failures since 8:50; converted to ALARM at 11:25 with 68% failures; excluded from T0 export at 12:00 with 100% failures. dCache pool network instabilities, PNFS overload. GGUS:83923
      • SARA-MATRIX: continuing to see transfer failures. Will be removed from Tier 0 export for the weekend if still unresolved this afternoon. GGUS:83774
    • T2
      • NTR
    • Onno: please update GGUS with an example of a failed transfer - just looked but did not find a local problem. Torre: will check again
    • Gonzalo: issues due to PNFS overload - dCache experts are trying to solve it

  • CMS reports -
    • LHC machine / CMS detector
      • Not much data taken. TOTEM run today
    • CERN / central services and T0
      • Corruption in a DB record caused T0 jobs to crash and the DAQ to get stuck during the night. The offending record has been removed, but it also caused job failures at remote sites until the squid caches were refreshed.
    • Tier-1/2:
      • GGUS:83486 (FTS delegation problem): currently no problems but keeping here until sw is fixed

  • LHCb reports -
    • User analysis and reconstruction at T1s
      • MC production at T2s
    • T0:
    • T1:
      • PIC: Files unavailable after downtime (GGUS:83916); Fixed
      • IN2P3: reconstruction jobs failed due to memory; under investigation
      • FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites).
      • IN2P3 / GridKa file corruption

Sites / Services round table:

  • Dmytro/NDGF: the Swedish T2 did not come back from downtime. On Monday (1-3pm) the SRM will be down for a software upgrade - data will be unavailable
  • Michael/BNL: ntr
  • Gonzalo/PIC: acknowledged the LHCb issue: a problematic sensor (restarting the network) since the last upgrade - now fixed. Yesterday noon: an alarm for a water leak (condensation water) led to stopping the worker nodes - 2000 jobs stopped. Now restarting and preparing a SIR for this issue.
  • Xavier: ntr
  • Jhen-Wei: last night an NFS server crashed due to HW problems, affecting CMS and ATLAS - ATLAS back since the morning, CMS coming up.
  • Rolf: ntr
  • Onno: the SARA SRM seems to work better now (experiments should confirm): the "vacuuming" of the PostgreSQL DB has been tuned to improve index performance (see the sketch after this list). The SRM space manager crashed during a high-load period yesterday evening; a cron job is now used for automatic restart. The load right now is higher than yesterday and the space manager runs fine. Reported the instability to the dCache developers, as the new 2.2.1 could have a bug that could also affect other sites upgrading to this version.
  • Paolo - ntr
  • Rob: ntr
  • Burt: ntr
  • John/RAL - ntr
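
A minimal sketch (added for illustration) of the kind of PostgreSQL "vacuuming" mentioned in the SARA item above: VACUUM ANALYZE reclaims dead rows and refreshes planner statistics, which helps index performance. The connection details and table names below are invented and do not reflect SARA's actual procedure or the dCache schema.

```python
# Minimal sketch, assuming psycopg2 and invented connection details /
# table names: run VACUUM ANALYZE on a few (hypothetical) space-manager
# tables, which is the kind of "vacuuming" tuning mentioned above.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="srm", user="srm", password="***")
conn.autocommit = True  # VACUUM cannot run inside a transaction block

with conn.cursor() as cur:
    for table in ("srmspace", "srmspacefile"):  # hypothetical table names
        cur.execute("VACUUM ANALYZE " + table)

conn.close()
```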

  • Luca: transparent CASTOR database upgrade planned for Monday

AOB:

-- JamieShiers - 29-Jun-2012
