Week of 120701

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Luc, Torre, Maarten, Luca, Zbigniew, Eddy, Jacob, Dirk);remote(Michael, Vladimir, Rolf, Rob, Paolo, Tiju, Onno, Jhen-Wei, Dimitri).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • Fake alarms after power cut: Get errors, disk shortage. Are all disk servers back?
        • Luca: one or two servers are still down - replication should cover this.
      • During the power cut 4 of 5 LFC frontends were unavailable, whereas the Central Catalogs were OK. Possible to have the same criticality for LFC & CC?
        • Maarten: please file a GGUS ticket to LFC support.
    • T1
      • IN2P3-CC: SRM instabilities & dCache problems (dCache core servers running under SL6, affected by the leap second). GGUS:83729 solved.
      • NDGF-T1: due to the leap second several dCache services had to be restarted.
    • T2
      • NTR

  • CMS reports -
    • LHC machine / CMS detector
      • intensity ramp-up
    • CERN / central services and T0
      • iCMS down all weekend due to disk failure caused by power cut on Friday, being fixed today GGUS:83726
    • Tier-1/2:
      • T1_DE_KIT: was failing HammerCloud due to problems with cream-2; seems to be fixed already (GGUS:83736)
      • FTS for T1_IT_CNAF: credential delegation problem, actively followed up in GGUS:83486 (currently OK, keeping it here in case of recurrence)

  • ALICE reports -
    • IN2P3: looking into the low number of job submissions since Saturday.

  • LHCb reports -
    • User analysis at T1s ongoing
    • MC production at all sites
    • T0:
      • Moving DIRAC accounting services to new machines.
      • CERN: (GGUS:83713) 30 TB are still missing on LHCb-Disk
    • T1:
      • FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites).
      • IN2P3 / GridKa file corruption

    • Eddy: Nagios T1 tests not running since Saturday - being investigated - Stefan is aware
    • Luca: posted a list of unavailable files. LHCb could decide either to wait (typically two days) for recovery or to re-import. Vladimir: will notify about the decision.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • KIT - ntr
  • IN2P3 - ntr
  • NLT1 - SARA also suffered from leap second problem.
  • OSG - ntr
  • RAL - ntr

AOB:

Tuesday

Attendance: local(Claudio, Torre, Eva, Eddy, Luca, Gavin, MariaDZ, Dirk);remote(Tiju/RAL, Dimitri/KIT, Michael/BNL, Dmytro/NDGF, XavierM/KIT, Jhen-Wei, Ronald/NL-T1, Burt/FNAL, Rolf/IN2P3, Vladimir/LHCb, Paolo/CNAF, Rob/OSG).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • CERN-PROD_TZERO shows (since last night) high fraction of staging failures, ticketed this morning, under investigation GGUS:83781
      • Regarding the LFC robustness issue reported yesterday: discussed in the ATLAS/IT meeting yesterday; only one LFC server is currently in the critical category. This will be increased to three.
      • Brief problem with GGUS yesterday morning: individual ticket pages gave a 500 error; reported to be due to server problems, under study
    • T1
      • NDGF-T1: LFC migration to CERN is underway today, no reported problems so far
    • T2
      • NTR
    • MariaDZ: GGUS issue Sun to Mon due to the leap second; see the sketch after this list. (Full explanation by GGUS developer Guenter Grein: "The problem started after introducing a leap second in the night 20120630 to 20120701. It ended Monday morning 09:30 after restarting not only the Remedy server but also the underlying OS. Especially the Java based Remedy mid-tier was facing problems which increased with the rising number of accesses to the GGUS web portal this morning. Several publications in the www report problems of Linux kernels and Java applications related to the leap second. We believe this has caused the GGUS problems too. The GGUS server is running on a Linux OS and a major part of Remedy is Java-based.")
    • Luca: all files staged under "default" pools - is that expected? Torre: will check
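
A minimal sketch (added for illustration) of the mitigation that was widely circulated for the 2012 leap-second problem described above: stop ntpd and re-set the system clock, which clears the confused kernel timer state that made Java applications spin. The service name and exact commands are assumptions and site-specific; this is not necessarily the procedure the GGUS team actually followed.

```python
# Minimal sketch only; assumes root on an affected Linux host and an
# "ntpd" service name. Re-setting the clock to its current value was the
# commonly reported way to clear the leap-second timer condition.
import subprocess

def reset_clock_after_leap_second():
    subprocess.call(["service", "ntpd", "stop"])  # avoid ntpd immediately re-stepping the clock
    # "date MMDDhhmmCCYY.ss" sets the clock; here we simply set it to "now".
    now = subprocess.check_output(["date", "+%m%d%H%M%Y.%S"]).decode().strip()
    subprocess.check_call(["date", now])
    subprocess.call(["service", "ntpd", "start"])

if __name__ == "__main__":
    reset_clock_after_leap_second()
```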

  • CMS reports -
    • LHC machine / CMS detector
      • Physics data taking during beam intensity ramp-up
    • CERN / central services and T0
    • Tier-1/2:
      • Persistent HammerCloud failures at T2_FR_GRIF_IRFU. T2_RU_PNPI has had problems for days.
      • GGUS:83486 (FTS delegation problem): currently no problems but keeping here until sw is fixed

  • ALICE reports -
    • IN2P3: low number of job submissions still being investigated by AliEn experts.

  • LHCb reports -
    • User analysis at T1s ongoing; reconstruction started
    • MC production at all sites
    • T0:
      • Moving DIRAC accounting services to new machines.
      • CERN: (GGUS:83713) 30 TB are still missing on LHCb-Disk; waiting for repair
      • Problems with the SAM SRM probes (GGUS:83782); Requested to remove probes [org.lhcb.SRM-VOLs, org.lhcb.SRM-VODel] from profile LHCB_CRITICAL
    • T1:
      • FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites)
      • IN2P3 / GridKa file corruption

    • Luca: the missing disk server is back - please test

Sites / Services round table:

  • Tiju/RAL - ntr
  • Michael/BNL - ntr
  • Dmytro/NDGF - due to the LFC server migration no jobs are running right now
  • XavierM/KIT - ntr
  • Jhen-Wei/ASGC - ntr
  • Ronald/NL-T1 - ntr
  • Burt/FNAL - running at reduced capacity due to the cooling load, but pledge value kept so far
  • Rolf/IN2P3 - ntr
  • Paolo/CNAF - ntr
  • Rob/OSG - tomorrow US holiday: nobody will attend call
  • MariaDZ: got several requests for server certificates from users in the DOE grid. Is there somebody closer for this type of request? Rob: yes, mail to osg-ra@opensciencegrid.org would work too.
  • Gavin:
  • MariaDZ: there will be a GGUS release on Monday, followed by alarm tests. This release will introduce an interface for GGUS requests (in addition to incidents)

AOB:

Wednesday

Attendance: local(Torre, Maarten, Luca, Luca, Claudio, Eddy, Dirk);remote(Michael/BNL, Dmytro/NDGF, Jhen-Wei/ASGC, John/RAL, Rolf/IN2P3, Vladimir/LHCb, Pavel/KIT, Ron/NL-T1, Paolo/CNAF).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • CERN-PROD_TZERO staging failures have abated today, yesterday's problem still under investigation. GGUS:83781
    • T1
      • SARA-MATRIX: recurring source transfer failures, spacemanager crashing periodically, site opened ticket to dCache developers. GGUS:83774
      • IN2P3-CC: ramping up jobs after the CVMFS outage from yesterday evening was resolved this morning.
      • NDGF-T1: LFC migration to CERN went smoothly yesterday
    • T2
      • NTR
  • Luca: the T0 pool suffers from recalls concentrated on 1-2 disk servers. Please contact us if new cases appear. We could try changing the scheduling policy to random instead of best fit (see the sketch below).
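
A toy sketch (added for illustration, not CASTOR's actual scheduler code) of why a best-fit pool-selection policy can keep hitting the same one or two disk servers while a random policy spreads the recall load. The "best fit" semantics, server names and free-space figures are assumptions.

```python
# Toy illustration only, not CASTOR code: contrast a "best fit" policy
# (pick the server whose free space most closely fits the request) with
# a "random" policy. Server names and free-space figures are invented.
import random
from collections import Counter

servers = {"srv01": 500, "srv02": 480, "srv03": 2000, "srv04": 1900}  # free GB

def best_fit(free, size):
    # Smallest free space that still fits -> the same few servers win repeatedly.
    fitting = {s: f for s, f in free.items() if f >= size}
    return min(fitting, key=fitting.get)

def random_fit(free, size):
    # Any server that fits, chosen uniformly -> recall load is spread out.
    return random.choice([s for s, f in free.items() if f >= size])

for policy in (best_fit, random_fit):
    picks = Counter(policy(servers, size=10) for _ in range(1000))
    print(policy.__name__, dict(picks))
```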

  • CMS reports -
    • LHC machine / CMS detector
      • Physics data taking with 1380 bunches
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • GGUS:83486 (FTS delegation problem): currently no problems but keeping here until sw is fixed
    • Other:
      • NTR

  • ALICE reports -
    • IN2P3: normal job submission rates since yesterday evening; the cause of the problem was a configuration issue in the central AliEn services.

  • LHCb reports -
    • User analysis and reconstruction at T1s
      • MC production at T2s
    • T0:
      • Problems with the SAM SRM probes (GGUS:83782); Requested to remove probes [org.lhcb.SRM-VOLs, org.lhcb.SRM-VODel] from profile LHCB_CRITICAL
    • T1:
      • FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites).
      • IN2P3 / GridKa file corruption
      • IN2P3: CVMFS problem; fixed

Sites / Services round table:

  • Michael/BNL - ntr
  • Dmytro/NDGF - ntr
  • Jhen-Wei/ASGC - ntr
  • John/RAL - ntr
  • Rolf/IN2P3 - the CVMFS problem already appeared some months ago and seems related to a new subdirectory being pushed from CERN. The problem is already fixed and a new CVMFS version has now been deployed at IN2P3. Experiments should inform the local site contact about new subdirectories, to confirm that the bug is indeed gone.
  • Pavel/KIT - ntr
  • Ron/NL-T1 - still working on SRM instability with dcache devs
  • Paolo/CNAF - ntr

AOB:

Thursday

Attendance: local(Torre, Eva, Eddy, Dirk);remote(Vladimir/LHCb, Michael/BNL, Dmytro/NDGF, Jhen-Wei/ASGC, Rolf/IN2P3, Ronald/NL-T1, Gonzalo/PIC, Rob/OSG, Jeremy/GridPP, Burt/FNAL, Paolo/CNAF, Gareth/RAL).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • CERN-PROD_TZERO staging timeout ticket remains open, possible cause/solution found, to be confirmed. No current errors. GGUS:83781
    • T1
      • SARA-MATRIX: continuing (this morning) to see transfer failures. GGUS:83774
    • T2
      • NTR

  • CMS reports -
    • LHC machine / CMS detector
      • Mainly physics data taking with 1380 bunches
    • CERN / central services and T0
      • NTR
    • Tier-1/2:
      • Slow restart of PIC after the scheduled downtime (SD). Now OK.
      • GGUS:83486 (FTS delegation problem): currently no problems but keeping here until sw is fixed
    • Other:
      • Apologies for not being able to attend today

  • LHCb reports -
    • User analysis and reconstruction at T1s
      • MC production at T2s
    • T0:
    • T1:
      • FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites).
      • IN2P3 / GridKa file corruption

Sites / Services round table:

  • Michael/BNL - ntr
  • Dmytro/NDGF - ntr
  • Jhen-Wei/ASGC - ntr
  • Rolf/IN2P3 - network problem this morning: the NREN had cut off the centre from 7:45 to 8:10 - no impact on running jobs
  • Ronald/NL-T1 - SRM problems at SARA: an internal cleanup of the PostgreSQL DB has lowered the CPU load - can't tell yet whether stability has improved too.
  • Gonzalo/PIC - next week planning a 3-day shutdown to migrate the dCache namespace (planned for Tuesday afternoon to Friday afternoon)
  • Rob/OSG - ntr
  • Jeremy/GridPP -ntr
  • Burt/FNAL - ntr
  • Paolo/CNAF - scheduled downtime for ATLAS GPFS on Monday morning (closed the day before). Is it possible to schedule a downtime only for one file system instead of the full service? Dirk: suggest using a warning instead, specifying the affected area, as this allows each experiment to decide whether it is worth keeping the site during the outage.
  • Gareth/RAL - ntr

  • Eva: found a storage misconfiguration in the PDBR cluster; a rolling intervention will be needed to fix it. In discussion with the affected user, suggesting an intervention this afternoon.
  • MariaDZ: Tier0 service managers please note that with next Monday's (20120709) GGUS release the GGUS-SNOW interface for "Change Requests" and the mandatory introduction of the SNOW Service Element (SE) enter production. In case of problems, please open a GGUS ticket for the Support Unit 'GGUS', i.e. the dev team.

AOB:

Friday

Attendance: local(Torre, Maarten, Claudio, Eddy, Luca, Dirk);remote(Dmytro/NDGF, Michael/BNL, Gonzalo/PIC, Xavier/KIT, John/RAL, Jhen-Wei/ASGC, Rolf/IN2P3, Onno/NL-T1, Paolo/CNAF, Vladimir/LHCb, Rob/OSG, Burt/FNAL).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • CERN-PROD_TZERO staging timeout ticket closed, no further problems in 2 days. GGUS:83781
    • T1
      • PIC: Tier 0 export failures, ticketed at 9:13 this morning with 100% failures since 8:50; converted to ALARM at 11:25 with 68% failures; excluded from T0 export at 12:00 with 100% failures. dCache pool network instabilities, PNFS overload. GGUS:83923
      • SARA-MATRIX: continuing to see transfer failures. Will be removed from Tier 0 export for the weekend if still unresolved this afternoon. GGUS:83774
    • T2
      • NTR
    • Onno: please update GGUS with an example of a failed transfer - just looked but did not find a local problem. Torre: will check again
    • Gonzalo: issues due to PNFS overload - dCache experts are trying to solve it

  • CMS reports -
    • LHC machine / CMS detector
      • Not much data taken. TOTEM run today
    • CERN / central services and T0
      • Corruption in a DB record caused T0 jobs to crash and the DAQ to get stuck during the night. The offending record has been removed, but it also caused job failures at remote sites until the squid caches were refreshed.
    • Tier-1/2:
      • GGUS:83486 (FTS delegation problem): currently no problems but keeping here until sw is fixed

  • LHCb reports -
    • User analysis and reconstruction at T1s
      • MC production at T2s
    • T0:
    • T1:
      • PIC: Files unavailable after downtime (GGUS:83916); Fixed
      • IN2P3: reconstruction jobs failed due to memory; under investigation
      • FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites).
      • IN2P3 / GridKa file corruption

Sites / Services round table:

  • Dmytro/NDGF: the Swedish T2 did not come back from downtime. On Monday (1-3pm) the SRM will be down for a software upgrade - data will be unavailable
  • Michael/BNL: ntr
  • Gonzalo/PIC: acknowledged the LHCb issue: a problematic sensor (restarting the network) since the last upgrade - now fixed. Yesterday noon: an alarm for a water leak (condensation water) led to stopping the worker nodes - 2000 jobs stopped. Now restarting and preparing a SIR for this issue.
  • Xavier: ntr
  • Jhen-Wei: last night an NFS server crashed due to HW problems, affecting CMS and ATLAS - ATLAS back since the morning, CMS coming up.
  • Rolf: ntr
  • Onno: the SARA SRM seems to work better now (experiments should confirm): the "vacuuming" of the PostgreSQL DB has been tuned to improve index performance (see the sketch after this list). The SRM space manager crashed during a high-load period yesterday evening; a cron job is now used for automatic restart. The load right now is higher than yesterday and the space manager runs fine. Reported the instability to the dCache developers, as the new 2.2.1 could have a bug that could also affect other sites upgrading to this version.
  • Paolo - ntr
  • Rob: ntr
  • Burt: ntr
  • John/RAL - ntr
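
A minimal sketch (added for illustration) of the kind of PostgreSQL "vacuuming" mentioned in the SARA item above: VACUUM ANALYZE reclaims dead rows and refreshes planner statistics, which helps index performance. The connection details and table names below are invented and do not reflect SARA's actual procedure or the dCache schema.

```python
# Minimal sketch, assuming psycopg2 and invented connection details /
# table names: run VACUUM ANALYZE on a few (hypothetical) space-manager
# tables, which is the kind of "vacuuming" tuning mentioned above.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="srm", user="srm", password="***")
conn.autocommit = True  # VACUUM cannot run inside a transaction block

with conn.cursor() as cur:
    for table in ("srmspace", "srmspacefile"):  # hypothetical table names
        cur.execute("VACUUM ANALYZE " + table)

conn.close()
```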

  • Luca: transparent CASTOR database upgrade planned for Monday

AOB:

-- JamieShiers - 29-Jun-2012
