Week of 121112

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE ATLAS CMS LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Massimo, Rod, Mark, MariaD, MariaG, Maarten, Felix, Szymon, Ian);remote(Stefano CNAF, Roger and Ulf NDGF, Rolf IN2P3, Ronald and Alexander NLT1, Gonzalo PIC, Tiju RAL, Alain OSG, Dimitri KIT, Lisa FNAL).

Experiments round table:

  • ATLAS reports -
    • Report to WLCGOperationsMeetings
      • T0/T1
        • 70k lost files at NDGF in process of being recovered from other sites
        • DATADISK space problem at PIC, INFN-T1, FZK and SARA
          • suspended from T0 data distribution, but only DATADISK, not TAPE
          • caused mostly by reprocessing. Some intermediate datasets to be cleaned after validation

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Disk server failure and file loss reported Friday evening at 17:30. These were primarily streamer files from P5; files that had not yet been successfully repacked into RAW data files were resent from P5.
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • ALICE reports -
    • KIT: still investigating issues preventing normal job behavior; the SW provisioning was switched back to the shared SW area on Sat, but very few jobs show up in the MonALISA plots

  • LHCb reports -
    • Reprocessing at T1s and "attached" T2 sites
    • User analysis at T0/T1 sites
    • Prompt reconstruction at CERN + 4 attached T2s
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • GridKa:
        • A number of timeouts when transferring. Possibly only restricted to a few pool nodes? (GGUS:88425)
Sites / Services round table:
  • ASGC: ntr
  • CNAF: glitch on CE (failing due to a cert change on Saturday without restart). Now fixed.
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: File loss. Still looking into it. The site should prepare an incident report, which may also benefit other sites; it seems to be a problem in the HD firmware
  • NLT1: ntr
  • PIC: ntr
  • RAL: Over the weekend: Saturday, CRL problems on a CE; Sunday, ATLAS SRM DB overload. Tomorrow, short network intervention (in GOCDB)
  • OSG: BDII problem recovered (last Friday, 2h downtime)

  • CASTOR/EOS:
    • CMS incident: "timing" issue in a reinstall script (machine reinstalled itself after a routine reboot). Script being protected.
    • CASTOR --> EOS (ALICE): 12 M files left (out of 36 M). On track (deadline 30 Nov)
    • FYI: CASTOR PUBLIC upgrade this Wednesday (transparent; LHC instances untouched)
  • Data bases: On Saturday the Streams propagation (ATLAS) was blocked due to an overload on ADCR (co-hosted DB). The root cause is being investigated.
  • Dashboard: ntr
  • GGUS: A KIT intervention affecting GGUS availability is scheduled for 19-Nov 5:00-7:00 UTC (2h). Announcement at https://ggus.eu/pages/news_detail.php?ID=472
AOB:

Tuesday

Attendance: local(Ian, Ignacio, Ivan, Maarten, Mark, Rod, Wei-Jen);remote(Gareth, Lisa, Roger, Rolf, Ronald, Scott, Stefano, Xavier).

Experiments round table:

  • ATLAS reports -
    • T0/T1
      • 70k NDGF files from last week (confusion with the 4 new file servers, which failed but are recoverable)
      • T1 DATADISK situation eased a little, due to cleaning old data
    • Request to add perfSONAR hosts to GOCDB (see the lookup sketch below)
      • service types: net.perfSONAR.Bandwidth, net.perfSONAR.Latency
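
      The two service types above are how the perfSONAR hosts would be looked up once registered. Below is a minimal sketch of how a site could check what is currently registered, assuming the public GOCDB programmatic interface at https://goc.egi.eu/gocdbpi/public/ with its get_service_endpoint method; the base URL and XML field names are assumptions, not something confirmed in these minutes.

        # Hypothetical check of perfSONAR endpoints registered in GOCDB.
        # Assumes the public GOCDB PI and its XML layout; verify against the
        # actual GOCDB documentation before relying on it.
        import urllib.request
        import xml.etree.ElementTree as ET

        GOCDB_PI = "https://goc.egi.eu/gocdbpi/public/"  # assumed base URL

        def list_endpoints(service_type):
            # Query one service type and print the registered hosts and sites.
            url = (GOCDB_PI + "?method=get_service_endpoint"
                   "&service_type=" + service_type)
            with urllib.request.urlopen(url) as resp:
                root = ET.parse(resp).getroot()
            # Each SERVICE_ENDPOINT element is assumed to carry HOSTNAME/SITENAME.
            for ep in root.findall("SERVICE_ENDPOINT"):
                print(service_type, ep.findtext("HOSTNAME"), ep.findtext("SITENAME"))

        for st in ("net.perfSONAR.Bandwidth", "net.perfSONAR.Latency"):
            list_endpoints(st)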

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • ALICE reports -
    • KIT: job profile looking a lot better (already 2k out of 5k jobs showing up in MonALISA), but not understood why - nothing was changed on the ALICE side, as far as is known

  • LHCb reports -
    • Reprocessing at T1s and "attached" T2 sites
    • User analysis at T0/T1 sites
    • Prompt reconstruction at CERN + 4 attached T2s
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • GridKa:
        • Still having significant problems with FTS transfers to GridKa (GGUS:88425)
        • Had a large peak of FTS transfers to GridKa-RAW this morning - we wondered if the write access was suffering due to the tuning for staging?
          • Xavier: we noticed, but did not look into it further
          • Mark/Maarten: OK, let's have a closer look next time
Sites / Services round table:

  • ASGC - ntr
  • CNAF
    • around 11:30, during a routine intervention, the hypervisor of the LHCb VObox was accidentally rebooted. The machine restarted quickly and is now working fine. Apparently this did not cause any big issue for LHCb operations. We are not aware of any other service that suffered from this.
  • FNAL - ntr
  • IN2P3
    • reminder: there will be an outage from Dec 10 morning until Dec 12 noon to replace an external power supply cable with a new one from a separate electricity grid; all our Tier-1 services will be down, while remaining core services will depend on a local power generator
  • KIT - nta
  • NDGF
    • still recovering the crashed disk servers
  • NLT1 - ntr
  • OSG
    • currently in the regular maintenance window
    • sites can define their PerfSONAR bandwidth and latency endpoints in OIM now
  • RAL
    • router intervention this morning went OK

  • dashboards
    • yesterday a new version of the CMS Site Status Board was deployed, offering various enhancements and bug fixes
  • grid services
    • CREAM ce203.cern.ch developed a problem with job submission and has been put in downtime
    • the Argus service was upgraded to EMI-2 on SL6 to fix a problem with authentication failures
AOB:

Wednesday

Attendance: local(Massimo, Wei Jen, Mark, Julia, Luca, Ignacio, Ian); remote(audio application not working, again).

Experiments round table:

  • ATLAS reports -
    • Report to WLCGOperationsMeetings
      • ATLAS rep at GDB today
      • T0/T1
        • PIC and SARA have >100 TB of free DATADISK, so they are back in T0 data distribution
          • ASGC is down to 20 TB, so it was removed
        • INFN-T1 tape downtime tomorrow. They asked which data to pre-stage, but got no useful response from ATLAS.
          • Not much tape-resident data is needed for reprocessing; thanks for the warning, but we will live with the 5-hour downtime

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • T0 and T1 at CERN are suffering HammerCloud failures. A ticket was submitted; the failures seem to be concentrated on ce203.cern.ch
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR

  • ALICE reports -
    • KIT: job profile looks mostly OK again, not yet understood why; still using the shared SW area for now

  • LHCb reports -
    • Reprocessing and Reconstruction ramping down while waiting for new Conditions DB
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • GridKa:
        • Still having significant problems with FTS transfers to GridKa. Now pointing towards overloaded storage using gridFTP2. (GGUS:88425)
      • SARA:
        • SARA_BUFFER seemed to go offline this morning for ~2 hours. No ticket was opened, as it was fixed without intervention from the VO side.
Sites / Services round table:
  • ASGC: Migration to EMI2 CEs
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: ntr
  • NLT1: The issue reported by LHCb corrected itself
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: ntr
  • Central Services: ce203 seems to be the origin of the CMS problem. After debugging, it is back in production (under observation)
  • Data bases: ntr
  • Dashboard: ntr
AOB:

Thursday

Attendance: local(Massimo, Wei Jen, Ignacio, Mark, Guido, Maarten); remote(Paolo, Lisa, Rolf, Marian, Roger, Ulf, Tiju, Ronald, Gonzalo, Ian).

Experiments round table:

  • ATLAS reports -
    • Report to WLCGOperationsMeetings
      • T0
      • T1
        • ASGC staging from tape giving errors. System overloaded; no improvement yet (GGUS:88490)
        • INFN-T1 tape downtime, no problem so far
      • T2
        • Problem with Sheffield SRM ("failed to contact remote SRM"; one of the disk servers was overloaded, now fixed) (GGUS:88508)
        • Problems with INFN-MILANO-ATLASC (still to be followed up); apparently the problem disappeared (GGUS:88506)

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • A CASTOR issue this morning stopped transfers from P5. We did not see it on the offline monitors because the number of active transfers seemed to stay constant. Recovered and the backlog promptly cleared.
    • Tier-1:
      • NTR
    • Tier-2:
      • NTR
  • ALICE reports -
    • NTR

  • LHCb reports -
    • Reprocessing and Reconstruction ramping down while waiting for new Conditions DB
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • GridKa:
        • Still having significant problems with FTS transfers to GridKa. Now pointing towards overloaded storage using gridFTP2. (GGUS:88425)
Sites / Services round table:
  • ASGC: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: ntr
  • NDGF: ntr
  • NLT1: ntr
  • PIC: ntr
  • RAL: ntr

  • CASTOR/EOS: Upgrade to CASTOR 2.1.13-6: we propose CMS & LHCb on Wed 21 (8:00 UTC and 13:00 UTC) and ALICE & ATLAS on Thu 22 (8:00 UTC and 13:00 UTC). The transparent upgrade is already in production on PUBLIC
  • Central Services: The LSF problem is not quite the same as the previous one. This time it looks like not all the slots are filled. The root cause might be the same.
AOB:

Friday

Attendance: local(Massimo, Wei Jen, Ignacio, Mark, Guido, Maarten, Eva); remote(Paolo, Lisa, Rolf, Elisabeth, Michael, Dimitrios, Roger, Ulf, Tiju, Alexander, Gonzalo, Ian, Christian).

Experiments round table:

  • ATLAS reports -
    • Report to WLCGOperationsMeetings
      • T0
        • LSF quotas and shares not respected. A reconfiguration during the night helped; the situation seems back to normal and the problem is understood (GGUS:88493)
        • Castoratlas: RAID controller broken on a server (GGUS:88528), content drained, problem fixed for ATLAS (a hardware ticket was opened for the broken server)
      • T1
        • TW staging from tape giving errors. System overloaded; no improvement yet: "[StatusOfBringOnlineRequest][SRM_FAILURE] Stager did not answer within the requested time" (GGUS:88490)
      • T2
        • nothing to report

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • Problem with LSF overnight: only 400 jobs running for CMS. A reconfiguration fixed the issue.
    • Tier-1:
    • Tier-2:
      • NTR

  • LHCb reports -
    • Reprocessing and Reconstruction ramping down while waiting for new Conditions DB
      • LHCb identified a glitch in their monitoring (some job failures were not accounted correctly; being fixed)
    • New processing runs should be starting later today
    • MC productions at T2s and T1s if resources available
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • GridKa:
        • Still having significant problems with FTS transfers to GridKa. Investigations still ongoing. (GGUS:88425)
Sites / Services round table:
  • ASGC:ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • IN2P3: ntr
  • KIT: KIT asks ATLAS and CMS to reduce their tape usage, as the disk caches are almost full.
  • NDGF: Two interventions on Tuesday (dCache head node and disk firmware). Already in GOCDB
  • NLT1: ntr
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr

  • CASTOR/EOS: The 4 interventions on CASTOR were confirmed by the experiments (pending a final mail from ATLAS)
  • Central Services: ntr
  • Data bases: ntr
AOB:

-- JamieShiers - 18-Sep-2012
