Week of 120109

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

Dial +41227676000 (Main) and enter access code 0119168, or
To have the system call you, click here
The SCOD rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News

Monday:

Attendance: local(Massimo, Alessandro, Cedric, Ian, Kate, Eddy, MariaD, MariaG, AndreaV, Jamie, Steve, Gavin, Simone, Stefan, Maarten); remote(Tore, Stephane, Michael, Gonzalo, Onno, Jhen-Wei, Paolo, Tiju, Burt, Pavel, Rob)

Experiments round table:

ATLAS reports -
- GGUS 77923: cvmfs issue in TAIWAN. Solved
- FTS 2.2.8 pilot: no problems observed. The migration of the FTS production machines at CERN should now be prepared.
- GGUS 77965: Significant error rate when accessing the LFC at IN2P3-CC. Deletion at IN2P3-CC almost stopped and the success rate of production jobs degraded. Possibly related to one of the front-ends.

CMS reports -
- LHC / CMS detector
  - Shutdown
- CERN / central services
  - Nothing to report
- T0
  - LHE MadGraph production at 8 TeV continues on the Tier-0 farm
- T1 sites:
  - Problem reported with the job robot at CNAF over the weekend. It has since recovered.
- T2 sites:
  - NTR

ALICE reports -
- NTR

LHCb reports -
- MC11: Monte Carlo productions
  - T0
    - Preparing for downtime tomorrow. Will jobs in the system be killed or suspended?
  - T1

Sites / Services round table:

ASGC: ntr
BNL: ntr
CNAF: A transfer problem noticed last week is now solved (no tickets)
FNAL: ntr
IN2P3: ntr (CMS will provide the needed contacts for the US network)
KIT: ntr
NDGF: ntr
NLT1: ATLAS and LHCb agreed to discuss the possibility of running some tests on the settings of the CERN-SARA channel. They will keep SARA in the loop.
PIC: ntr
RAL: ntr
OSG: ntr

CASTOR/EOS: Reminder of future interventions:
- Tomorrow 10:00-12:00 upgrade of CASTOR PUBLIC (important improvements on the tape part); transparent (except for some OPS tests)
- Wednesday 9:00-12:00 upgrade of the CASTOR Name Server: NOT transparent, ALL instances (ALICE, ATLAS, CMS, LHCb) affected; this is to migrate the DB to Oracle 11g and move it to new hardware
Central Services:
- A SIR for the LSF problem in December will be provided (Platform has not yet come back with a full solution)
- ATLAS is discussing the deployment of FTS 2.2.8 with IT. CMS has not observed problems so far. LHCb will also be kept in the loop. A staged rollout should be possible.
- The LSF master will most probably be replaced/upgraded this Wednesday (at the same time as the CASTOR Name Server upgrade)
- Experiments should communicate how they want their queues to be handled during the CASTOR upgrade
Data bases: See the CASTOR line
Dashboard:

AOB: (MariaDZ) A file with drills of all real ALARM tickets since the last WLCG MB (6 weeks) is attached at the end of this twiki page.

Tuesday:

Attendance: local(Massimo, Cedric, Gavin, Eddy, Eva, Maarten, MariaD); remote(Gonzalo, Michael, Paolo, Tiju, Burt, Rolf, Ronald, Rob).

Experiments round table:

ATLAS reports -
- T0/Central Services
  - ntr
- T1s
  - ntr
- T2s
  - SE issue on UKI-SOUTHGRID-BHAM-HEP GGUS:78007. Solved.
  - Incorrect computation of space token sizes at 2 StoRM sites: TECHNION-HEP GGUS:77963 and WEIZMANN-LCG2 GGUS:77962

CMS reports -
- Nothing to report from CMS
- Markus (CRC) will not be able to attend the daily chat today (and Thursday and Friday)

ALICE reports -
- NTR

LHCb reports -
- Experiment activities
  - MC11: Monte Carlo productions
- T0
  - For Wednesday's downtime, no special treatment of the queues by PES is needed
- T1

Sites / Services round table:

ASGC:
BNL:ntr
CNAF:ntr
FNAL:ntr
IN2P3:ntr
KIT:
NDGF:
NLT1:
PIC: ntr
RAL: Connectivity problem (probably DNS related) under investigation. Site in downtime until 5pm.
OSG: Discovered a bug affecting sites using GUMS for auth. Users with more than one certificate in VOMS Admin 2.6.1 end up with no certificate in GUMS. 600 ATLAS and CMS users are affected. Investigating together with the VOMS developers. A ticket will be posted soon.

CASTOR/EOS: CASTORPUBLIC upgrade successful.
Central Services: LSF master upgrade taking place tomorrow. It is assumed that no special arrangements for the LSF queues are needed during the CASTOR intervention (otherwise, experiments should contact Gavin before 5pm today).
Data bases: CASTOR DB intervention (Name Server) tomorrow morning (9-12) for the upgrade (11g included). The ATLAS online DB will be upgraded to 11g in the afternoon (2-6pm).
Dashboard: ntr

AOB:

Wednesday:

Attendance: local(Massimo, Cedric, Eddie, Edoardo, Gavin, Alessandro, Lola, Maarten, Eva, Giuseppe, Maria Dimou);remote(Michael, John, Giovanni, Burt, Markus, Rolf, Rob, Ron).

Experiments round table:

ATLAS reports -
- T0/Central Services
  - CASTOR problems after NameServer database upgrade.
- T1s
  - RAL : DNS problem finally fixed yesterday evening.
- T2s
  - UKI-NORTHGRID-MAN-HEP : storage problem GGUS:78073
  - RO-14-ITIM : missing library GGUS:78071
  - DESY-HH : network problem

CMS reports -
- The CASTOR nameserver DB is being upgraded to Oracle 11g today. The system is still not back up. Can we get a time estimate?

ALICE reports -
- Yesterday at CERN almost all ALICE jobs failed. GGUS:78064 was opened. The root cause was the presence of an unexpected mount point /opt/alien on lxplus (also on lxbatch) containing an old AliEn version and thereby affecting the build of the AliEn version used on the WN. The same build worked at a few other sites, for reasons that have not been understood, which confused the investigations. Around midnight a new build with a workaround was put into production.

LHCb reports -
- Experiment activities
  - MC11: Monte Carlo productions
- T0
  - Problem accessing CASTOR after today's downtime; it was swiftly fixed by the CASTOR team (GGUS:78103)
- T1

Sites / Services round table:

ASGC: ntr
BNL: ntr
CNAF: ntr
FNAL: ntr
IN2P3: Preliminary announcement: on Feb 7 the site will be down for a series of interventions
KIT:
NDGF:
NLT1: CVMFS problem at NIKHEF. Due to config changes, mounts were not possible on the WNs. This was solved by throwing away all the caches and restarting the service on the WNs. The problem was solved within one and a half hours. LHCb submitted GGUS:78085 for this.
PIC:
RAL: DNS problem: back at 9pm. Investigating the root cause.
OSG: Working to solve the problem we reported yesterday (GUMS/VOMS Admin problem with multiple certificates). We should make sure VOMSrs is not retired before a tested solution is available.

CASTOR/EOS: CASTOR Name Server upgrade problem: reconfiguration problem after DB upgrade. Now solved
- SIR: https://twiki.cern.ch/twiki/bin/view/CASTORService/Incidentsnameserver11Jan2012
Central Services: Successful LSF upgrade
Data bases: Investigating replication problems for LHCb
Dashboard: ntr
Network: ntr

AOB:

ATLAS asked about the plans for the 11g upgrade at the Tier-1s. We will touch base at the next T1coord meeting.

Thursday:

Attendance: local(Massimo, Maria Dimou, Cedric, Eva);remote(Michael, Gonzalo, Tore, Jhen-Wei, Ronald, Xavier, Gareth, Rob, Burt, Rolf, Paolo).

Experiments round table:

ATLAS reports -
- T0/Central Services
  - ntr
- T1s
  - ntr
- T2s
  - UKI-SCOTGRID-GLASGOW : Transfer failures GGUS:78153
  - IFIC-LCG2 : Squid problem GGUS:78142, stage-out problem GGUS:78143

CMS reports -
- NTR

ALICE reports -
- T0/Central Services
  - There was a configuration problem in the torrent build. Experts are working on it.
- T1s
  - FZK: no space was left on the xrootd disk-only instance, which made jobs write their output to other sites. This situation strained the GridKa firewall and affected all experiments using the site. The limit on running ALICE jobs has been lowered to 1000.

LHCb reports -

Sites / Services round table:

ASGC: Next Wed (2am-10am UTC) the site will be in downtime (power)
BNL: Next Tue, in the shadow of the ATLAS central DB intervention, they will configure a hot backup for their dCache storage system (Chimera PostgreSQL DB).
CNAF: ntr
FNAL:ntr
IN2P3:ntr
KIT: CE problem (publication problem)
NDGF: ntr
NLT1:ntr
PIC: Next Tue (11-12 UTC) job submission will be unavailable due to an intervention (running and queued jobs won't be affected)
RAL:ntr
OSG:ntr

CASTOR/EOS:
- ATLAS 2.1.11-9 upgrade done (transparent). The upgrade of CASTOR ALICE was cancelled (CDB incident) and it will be rescheduled.
Central Services:
- A new FTS 2.2.8 release candidate was installed on the pilot fts-pilot-service (Oracle 11g) this morning. It fixes the "process not found in /proc" problem that CMS found and correctly discovers the gsiftp endpoints on EOS. At least one more release candidate will follow to address a deployment problem.
Data bases: PVSS ATLAS replication (online-offline) is down (investigating). Conditions replication had problems and is catching up.
Dashboard:ntr

AOB:

Friday:

Attendance: local(Massimo, Cedric, Eva, Maarten, Stefan, Gavin, Eddie); remote(Michael, Gonzalo, Onno, Kyle, Burt, Rolf, Paolo, Andreas, John).

Experiments round table:

ATLAS reports -
- T0/Central Services
  - ntr
- T1s
  - In order to finish high priority tasks, T1s are asked to reduce the share for analysis from 20% to 5% of the pledged computing resources. The sooner, the better.
- T2s
  - GOEGRID : Low transfer efficiency GGUS:78212
  - LRZ-LMU : Too little space left on local disk to run jobs GGUS:78204

CMS reports -
- Nothing to report (not able to connect)

ALICE reports -
- NTR

LHCb reports -
- Experiment activities
  - MC11: Monte Carlo productions
- T0
  - SAM jobs failing when accessing CEs (GGUS:78185)
- T1
  - IN2P3: SAM issues when testing the shared software area, fixed (GGUS:78054)

Sites / Services round table:

ASGC:
BNL:ntr (ATLAS shares already done)
CNAF:ntr
FNAL:ntr
IN2P3:ntr
KIT:ntr
NDGF:ntr
NLT1: Future interventions: 18-Jan outage at NIKHEF; 1-Feb outage at SARA (FTS and LFC LHCb). Details on GOCDB
PIC:ntr
RAL: Future interventions: 16-19 Jan transparent SRM interventions on ATLAS,...,LHCb (one per day). Details on GOCDB
OSG:ntr

CASTOR/EOS:ntr
Central Services: Will look into the CE (LCG-CE) problem at CERN (monitoring problem)
Data bases: ATLAS replication problem solved
Dashboard:ntr

AOB:

-- JamieShiers - 29-Nov-2011

Attachments

Topic attachments
I	Attachment	History	Action	Size	Date	Who	Comment
ppt	ggus-data.ppt	r1	manage	2336.0 K	2012-01-09 - 11:03	MariaDimou	Final ALARM drills for the 2012/01/10 MB covering 6 weeks

Topic revision: r22 - 2012-01-13 - MaartenLitmaath
