Week of 120430

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(Alessandro, Alexei, Luca M, Maarten, Maria D, Pablo, Raja);remote(Giovanni, Ian, Jhen-Wei, Lisa, Michael, Rob, Roger, Tiju).

Experiments round table:

  • ATLAS reports -
    • ATLAS/LHC: beam back as of early this morning. Stable beams expected as of Sunday afternoon
      • Update Sun Apr 29: stable beams NET (not earlier than) Monday, most probably Tuesday morning
    • ASGC re-included in all activities, including RAW data export (SantaClaus)
    • Very urgent HLT reprocessing during the afternoon; no issues to report, almost all jobs started immediately (average waiting time 8 min, max 23 min).
    • About the wget issue: it has to be handled internally within ATLAS, i.e. test the Squid with a different configuration; to be discussed with the ATLAS central services on Monday.
    • Nothing else to report as of 20:00 on Saturday
    • Fast reprocessing ~done; datasets are being replicated to T1s and T2s

  • CMS reports -
    • LHC machine / CMS detector
      • Expect low pile-up runs sometime tonight
    • CERN / central services and T0
      • Systematic lowering of LXBATCH availability over the weekend: it dropped to the high eighties in percent. Was there maintenance?
        • Alessandro: ATLAS observed no issues with jobs at CERN during the weekend
        • Raja: also normal for LHCb
        • Maarten: and for ALICE
        • Ian: as CMS had low job activity, maybe the overall fraction of issues for all experiments combined exceeded a threshold?
    • Tier-1/2:
      • Job Robot failures over the weekend. Some fixed, some ongoing. Looks like it could be a problem with the test infrastructure.
        • Maarten: would that rather be HammerCloud instead of Job Robot?
        • Ian: yes
    • Other:
      • Ian Fisk is CRC until tomorrow, then Stephen Gowdy

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • CPU-intensive workflows successful at Tier-2s.
    • New GGUS (or RT) tickets
    • T0
    • T1
      • GridKa
        • Problem with the LFC (GGUS:81734) from 13:00 on 29 April. Solved this morning.
          • Q: Not clear why the ticket was only assigned to GridKa at 05:20 this morning
          • Q: Not clear why GridKa's internal monitoring did not pick this up earlier - the problem was found via failing LHCb jobs.
          • A ticket has been open for 1 month requesting the introduction of the LFC probes into the LHCb_CRITICAL profile (GGUS:80753). Still waiting for action on it.
            • Maarten: seems to depend on an upcoming SAM-Nagios release; Stefan is in the loop
      • PIC
        • Possible problem with queues / scaling of some nodes: many jobs fail repeatedly due to lack of wall time before eventually succeeding.
          • In contact with the LHCb site contact. Will open a GGUS ticket if we cannot resolve this internally.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • NDGF
    • some CE failures during the weekend
  • OSG - ntr
  • RAL - ntr

  • dashboards - ntr
  • GGUS/SNOW - ntr
  • storage - ntr

AOB:

Tuesday - No meeting - May Day holiday

Attendance: local();remote().

Experiments round table:

Sites / Services round table:

AOB:

Wednesday

Attendance: local(Jamie, Jhen-Wei, Luca C, Maarten, Maria D);remote(Dimitri, Giovanni, Gonzalo, John, Lisa, Michael, Raja, Rob, Roger, Rolf, Ron, Stephen).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
    • T1s/CalibrationT2s
      • SARA-MATRIX: transfer errors to disk and tape. Site excluded from T0 export. The site is dealing with this. GGUS:81786
        • Ron: the main Chimera node had a high load due to various cron jobs; they were killed and those particular error messages no longer occurred, while the remaining errors were due to attempts to overwrite existing files, which is not allowed by our dCache configuration

  • CMS reports -
    • LHC machine / CMS detector
      • Physics run ongoing
    • CERN / central services and T0
      • Yesterday some online testing of the HLT was blocked by an unresponsive web server (GGUS:81783).
      • The HLTMON Express stream was running at 900 Hz overnight (instead of O(30 Hz)), which caused a backlog of express jobs to process. Recovered after FNAL operations called Point 5; we caught up by morning.
    • Tier-1/2:
      • Some network problems between CERN and KNU (Korea), which was recently connected to LHCONE (Savannah:128323): cannot connect to lcgvoms or myproxy. Should we open a GGUS ticket against CERN? [Has now been fixed]
    • Other:
      • NTR

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • Prompt reprocessing of data taken overnight
    • New GGUS (or RT) tickets
    • T0
    • T1
      • PIC
        • Possible problem with queues / scaling of some nodes: many jobs fail repeatedly due to lack of wall time before eventually succeeding.
          • GGUS ticket opened (GGUS:81814). Still trying to pin down the details of the problem.
      • SARA
        • Worker node with a possibly broken CVMFS mount (GGUS:81787). Remounted just now.
        • Jobs failing to resolve input data. Investigating whether it is an issue on the LHCb side.
      • RAL
        • SRM not visible from outside RAL (GGUS:81816)
        • Prompt reconstruction failing due to the Squid proxy cache not refreshing itself. In contact with Catalin at RAL about it.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • FTS Globus version was patched to re-enable the use of GridFTP v2
      • Maarten: that patch should be applied on every FTS instance; PIC have already had it for a few days; this will be further discussed during tomorrow's T1 service coordination meeting; contact FTS support if you need the details sooner
  • KIT - ntr
  • NDGF
    • 1 tape pool had a failure earlier today, fixed now
  • NLT1 - nta
  • OSG - ntr
  • PIC
    • LHCb jobs appear to be using more resources than expected; for now the wall time limit was increased accordingly; still under investigation
      • Raja: Philippe has provided further input now
  • RAL
    • looking into the issues reported by LHCb

  • dashboards - ntr
  • databases - ntr
  • GGUS/SNOW
    • for tomorrow's T1 service coordination meeting please report outstanding GGUS ticket issues to Maria D
  • grid services
    • The 87% SLS availability of LXBATCH over the weekend is understood: it was due to a bad monitoring script, which has been corrected.

AOB:

Thursday

Attendance: local(Eva, Jhen-Wei, Maarten, Maria D, Mike, Raja, Ricardo, Simone, Xavier E);remote(Giovanni, Gonzalo, John, Lisa, Michael, Rob, Roger, Rolf, Ronald, Stephen, Xavier M).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
    • T1s/CalibrationT2s
      • SARA-MATRIX: transfer errors to disk and tape. The problem was gone after the site killed some cron jobs that were putting load on the name server. Ticket closed; the site has been re-included in T0 export. GGUS:81786
      • TRIUMF: a job failed due to an input file get error. A DDM expert is looking into this. GGUS:81829
    • Middleware
      • ATLAS is having issues with authentication in PanDA and DDM because of mod_gridsite:
        • With the old version (1.5.19-3) there is a (not really understood) authentication problem when verifying the FQAN: two users from the same CA and the same institute (Dresden) obtain different behaviors (failure in one case, success in the other).
        • ATLAS was advised to upgrade to a newer version (1.7.19), but this shows a different problem: if the client host does not have a DNS reverse mapping (e.g. $ host 10.153.232.180 returns "host not found"), the authentication fails; a minimal reverse-lookup check is sketched after this report. When the newer version was put into production ATLAS "lost" several sites (10k cores) and therefore the change was rolled back.
        • Everything is tracked in GGUS:81757 (currently assigned to gridsite support). Not critical but "unpleasant".
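
A minimal sketch (not part of the minutes) of the reverse-DNS check that the mod_gridsite 1.7.19 behavior above appears to depend on: it simply tests whether a client IP has a PTR record, using the IP quoted in the report as an example. The helper name is hypothetical.

  import socket

  def has_reverse_mapping(ip: str) -> bool:
      """Return True if the IP resolves back to a hostname (a PTR record exists)."""
      try:
          hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
          print(f"{ip} -> {hostname}")
          return True
      except (socket.herror, socket.gaierror):
          # Equivalent to '$ host <ip>' returning "host not found": per the report,
          # such clients fail authentication with mod_gridsite 1.7.19.
          print(f"{ip} -> no reverse mapping found")
          return False

  if __name__ == "__main__":
      has_reverse_mapping("10.153.232.180")  # IP quoted in the ATLAS report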

  • CMS reports -
    • LHC machine / CMS detector
      • Cryo recovery and RF studies, with a fill later
    • CERN / central services and T0
      • One of the (three) Frontier servers was unavailable earlier today during a DNS alias move.
    • Tier-1/2:
      • NTR
    • Other:
      • NTR

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • Prompt reprocessing of data
    • New GGUS (or RT) tickets
    • T0
      • Raja: 1 corrupted file was discovered on CASTOR; the original is still available online
      • Xavier E: it has a checksum mismatch while the size is correct; it sits on a server of an older type, which is now being drained; under investigation (a checksum re-computation sketch follows this report)
    • T1
      • PIC
        • The problem with the queues has been sorted out. Queue lengths temporarily increased for LHCb.
          • Hopefully the basic cause will be fixed with the next version of our reconstruction and stripping application software.
      • RAL
        • SRM not visible from outside RAL (GGUS:81816)
          • Actually a problem with FTS. GGUS ticket submitted to FTS support (GGUS:81844)
        • Prompt reconstruction failing due to squid proxy cache not refreshing itself.
          • Solved.
      • GridKa
        • Failing pilots (GGUS:81846)
          • Xavier M: fixed, please check
      • CNAF : Trying to understand proxy issues with directly submitted pilots.
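
A minimal sketch (not part of the minutes) of how a suspected corruption like the CASTOR file above could be re-verified by hand: it recomputes a file's ADLER32 checksum for comparison with the value stored in the catalogue. ADLER32 is assumed here because it is commonly used for grid file integrity; the minutes do not say which checksum algorithm was flagged, so treat this purely as an illustration.

  import sys
  import zlib

  def adler32_of_file(path: str, chunk_size: int = 1 << 20) -> str:
      """Stream the file in 1 MB chunks and return its ADLER32 checksum as 8 hex digits."""
      value = 1  # zlib's ADLER32 seed value
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(chunk_size), b""):
              value = zlib.adler32(chunk, value)
      return f"{value & 0xFFFFFFFF:08x}"

  if __name__ == "__main__":
      # Compare the printed value with the checksum recorded for the replica.
      print(adler32_of_file(sys.argv[1]))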

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • the PNFS daemon crashed again last night, as it did almost 1 month ago, causing a brief service interruption; together with the dCache developers we are investigating how to run that daemon on a different host, i.e. not together with Chimera
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT
    • 2 new file servers recently installed for CMS have performance issues with disk and network I/O; under investigation
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • dashboards - ntr
  • databases
    • security patch being applied to CMS integration DB
  • GGUS/SNOW - ntr
  • grid services
  • storage
    • CASTOR Oracle DB rolling security patches:
      • Wed May 9, 14:00 CEST - ALICE + ATLAS
      • Thu May 10, 14:00 CEST - CMS + LHCb

AOB:

Friday

Attendance: local(Eva, Jhen-Wei, Maarten, Raja, Ricardo, Stephen, Xavier E);remote(Alain, Catalin, Gonzalo, John, Michael, Onno, Paolo, Rolf, Ulf, Xavier M).

Experiments round table:

  • ATLAS reports -
    • T0/Central Services
      • ntr
    • T1s/CalibrationT2s
      • ntr

  • CMS reports -
    • LHC machine / CMS detector
      • Cryo recovery and collisions this evening
      • Fill 2587 (852x852): 97.2% efficiency by luminosity recorded
    • CERN / central services and T0
      • Logged 10 kHz into the express data for 10 minutes due to a misconfiguration in an online test. It did not cause any problems.
    • Tier-1/2:
      • T2_US_UCSD lost cooling water yesterday evening; they hope to be back soon. We have two CRAB servers there for glideinWMS.
    • Other:
      • NTR

  • LHCb reports -
    • DataReprocessing of 2012 data at T1s with new alignment
    • MC simulation at Tier-2s
    • Prompt reprocessing of data
    • New GGUS (or RT) tickets
    • T0
      • Corrupted file (RAW data) re-transferred and migrated
        • Raja: the checksum was OK right after the transfer
        • Xavier E: indeed, the corruption appears to have occurred shortly afterwards; a hardware problem is suspected; the machine was already scheduled for retirement because of its age
    • T1
      • CNAF : Still trying to understand proxy issues with directly submitted pilots.
    • Others
      • FTS : Problem with FTS proxy expiration. GGUS ticket submitted to FTS by RAL FTS admins (GGUS:81844)
        • Maarten: the main ongoing issue with FTS 2.2.8 is intermittent failures due to expired proxies, so far only seen at some of the T1 sites; the developers are investigating this with high priority, but the cause may be hard to pin down

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF
    • CREAM proxy issue for LHCb has been fixed
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC- ntr
  • RAL
    • Monday May 7 is a bank holiday in the UK; RAL will not connect to the meeting

  • databases
    • Oracle released a new security patch that is particularly urgent for listeners exposed to the internet; next week it will be tested in integration, the week after it should be deployed in production; as the patch is rolling, downtimes are not expected; exact dates and times will be agreed with the experiments
  • grid services
    • Raja: are any improvements in the fair-share algorithm foreseen?
    • Ricardo: with LSF we currently cannot do much better in that area
    • Maarten: other batch solutions will be looked into, but they are not for this year
    • Raja: we still see some issues, will follow up offline
  • storage - nta

AOB:

  • Alain: can RAL provide details on the network issue affecting the GOCDB?
  • John: there was a problem with a core router that did not affect the T1; fixed now, details to follow (see below)

GOCDB downtime details

The GOCDB system experienced an unexpected downtime between 03.05.12 and 04.05.12. During this time parts of the system were inaccessible to end users and programmatic tools. The details of this downtime are below:

  • A network outage due to a configuration problem on a router at STFC's RAL site caused complete inaccessibility between 15:42 and 16:58 UTC on 03.05.12
  • At around 21:40 UTC on 03.05.12 we experienced an issue with the storage attached to the virtual machine hypervisor that hosts the GOCDB VM instances
    • This issue resulted in the read/write portal instance (gocdb4.esc.rl.ac.uk) being completely inaccessible
    • The read only web portal (goc.egi.eu) was unaffected, as was the programmatic interface provided by the read only portal
  • The storage issue caused the filesystems of both hosts to be mounted read-only. The cause of the problem was not initially clear, but once it was diagnosed and the fault cleared we rebooted both hosts to re-mount the filesystems in read/write mode; both hosts were back in production at 10:45 UTC. (A minimal read-only mount check is sketched below.)
  • The GOCDB team maintain a read-only failover instance with colleagues in Germany to address outages. However, we did not fail over in this case as our read-only instance was unaffected for the majority of this downtime.
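
A minimal sketch (not part of the report) of how the read-only symptom described above could be spotted on a host: it parses /proc/mounts (Linux-specific) and lists filesystems currently mounted read-only. Note that some pseudo-filesystems are legitimately read-only, so the output needs human interpretation.

  def read_only_mounts():
      """Return (device, mountpoint) pairs for filesystems mounted read-only."""
      ro = []
      with open("/proc/mounts") as mounts:
          for line in mounts:
              device, mountpoint, _fstype, options = line.split()[:4]
              if "ro" in options.split(","):
                  ro.append((device, mountpoint))
      return ro

  if __name__ == "__main__":
      for device, mountpoint in read_only_mounts():
          print(f"{device} mounted read-only at {mountpoint}")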

-- JamieShiers - 22-Mar-2012
