Week of 130128

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

VO Summaries of Site Usability | SIRs | Broadcasts | Operations Web
ALICE ATLAS CMS LHCb | WLCG Service Incident Reports | Broadcast archive | Operations Web

General Information

General Information | GGUS Information | LHC Machine Information
CERN IT status board, WLCG Baseline Versions, WLCG Blogs | GgusInformation | Sharepoint site - LHC Page 1


Monday

Attendance: local(Belinda, Eddie, Jerome, Luc, Luca M, Maarten, Maria D);remote(Ian, Lisa, Matteo, Michael, Onno, Pepe, Rob, Tiju, Ulf, Vladimir, Wei-Jen).

Experiments round table:

  • ATLAS reports -
    • T0/Central services
      • Acrontab service not working. Alarm GGUS:90928
      • T0 reconstruction tasks crashing due to low memory. Elog:42411. To be rerun in 64-bit mode
    • Tier1s
      • FZK & CNAF DATADISKS full. Taken out of T0 export
      • TRIUMF backlog of functional test transfers from T0. Very high data activity (300 MB/s). Link saturation?
        • Maarten: open a ticket?
      • FTS3 deployed in all UK (except QMUL)
        • Maarten: correction - all UK sites except QMUL are participating in functional tests through FTS3; RAL is the only UK site with FTS3 deployed; QMUL cannot participate yet due to a bug in StoRM
    • Tier2s
      • NTR

  • CMS reports -
    • LHC / CMS
      • Running
    • CERN / central services and T0
      • Reasonably quiet weekend
    • Tier-1:
    • Tier-2:
      • NTR
    • Luca: green light for the transparent part of the EOS upgrade?
    • Ian: OK, the non-transparent part is to be done on Feb 14

  • LHCb reports -
    • Activity continues as last week. Very intense activity: reprocessing, prompt processing, MC and user jobs. We are hitting our limit on the number of running jobs.
    • T0:
    • T1:

Sites / Services round table:

  • ASGC
    • network maintenance on the 10 Gbit link tomorrow, 2.5 Gbit backup link will be used during that time
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • NDGF
    • one tape system has become full for ALICE; a second tape system at that site is expected to come online in 2 weeks; furthermore a new site supporting ALICE should join around the same time
  • NLT1
    • on Sat there was a power surge in one of the SARA data centers, affecting one of the two tape libraries: 9 drives broke and have meanwhile been replaced; during the incident some tape accesses may have been slow
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • dashboards - ntr
  • GGUS/SNOW
    • Maria will check if the SNOW team already applied the patch to fix the problem seen with last week's alarm tests, viz. the creation of real instead of test incidents in SNOW, which upsets the alarm ticket statistics calculations
  • grid services - ntr
  • storage - ntr

AOB:

Tuesday

Attendance: local(Eddie, Ian, Ignacio, Luc, Luca M, Maarten, Maria D, Ulrich);remote(Federico, Jeremy, Lisa, Matteo, Michael, Pepe, Rob, Rolf, Ronald, Saverio, Tiju, Ulf, Wei-Jen, Xavier).

Experiments round table:

  • ATLAS reports -
    • T0/Central services
      • NTR
    • Tier1s
      • FZK & CNAF DATADISKS full. Still out of T0 export
      • TRIUMF backlog of functional test transfers from T0. A priori no saturation of the 5 Gb/s link
      • All UK sites (except QMUL) participating in FTS3 tests
      • TW FTS hw failure. Elog:42514
    • Tier2s
      • T2 starvation in the DE cloud. Big dataset (5k files) stuck due to a Site Services machine bug. Fix applied to all Site Services (SS)

  • CMS reports -
    • LHC / CMS
      • Running. The total data volume expected from the HI run is in line with our estimates: Raw + Reco should be under 500 TB collected and distributed. Event size and reco time are both a little better than initially expected.
    • CERN / central services and T0
      • NTR
    • Tier-1:
      • A black-hole node was reported at KIT. GGUS ticket issued.
    • Tier-2:
      • NTR

  • LHCb reports -
    • Activity continues as last week. NTR

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3 - ntr
  • KIT
    • at-risk downtime for tape maintenance had to be prolonged by 4h but completed OK, hopefully bringing some performance/stability improvements
  • NDGF - ntr
  • NLT1 - ntr
  • OSG
    • as of tomorrow, OSG users will be recommended to request new certificates from the new DigiCert CA instead of the DOEGrids CA, which is being phased out for new certificates in the course of this year
      • Maria: I have adjusted the LCG catch-all CA procedures and updated the docs, will let Rob proofread them before dissemination
    • Ian: GGUS currently seems to be having a problem with the DOEGrids CRL; will follow up offline
  • PIC - ntr
  • RAL - ntr

  • dashboards - ntr

  • GGUS/SNOW
    • Tests will start now as agreed yesterday to make sure the SNOW patch on "Ticket Category: Test" is working properly.
    • There will be a meeting on 2013/02/05 at 2:30pm CET about alternatives to personal-certificate authentication for GGUS and possibly also GOCDB. The WLCG position is summarised in Savannah:132872#comment6 . Draft agenda: https://indico.cern.ch/conferenceDisplay.py?confId=229577 . The room is 28-1-025 (~12 people) but there is an audioconf. If we need to change the room, please tell MariaD a.s.a.p.
  • grid services
    • GSI problems holding up the LFC migration to EMI-2 have been fixed: the Globus GSI libraries need to be preloaded so that their symbols, rather than the corresponding Oracle symbols, take precedence (see the sketch after this list). One EMI-2 front-end node has already been included for ATLAS; the other nodes will be replaced similarly one by one over the coming days, also for LHCb and the central instance (e.g. used by OPS). ATLAS and LHCb will be informed of the progress.
  • storage - ntr
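
A minimal sketch of the preload fix mentioned under grid services, assuming a wrapper script is used to set LD_PRELOAD before starting the daemon; the library and daemon paths are hypothetical illustrations, not the actual CERN configuration:

    #!/usr/bin/env python
    # Hedged sketch: start a daemon with selected libraries preloaded so that
    # their symbols take precedence over same-named symbols pulled in from
    # other shared libraries (here: Globus GSI vs. the Oracle client).
    # All paths below are illustrative only.
    import os

    GSI_LIBS = [
        "/usr/lib64/libglobus_gssapi_gsi.so.4",    # hypothetical path
        "/usr/lib64/libglobus_gsi_callback.so.0",  # hypothetical path
    ]

    env = dict(os.environ)
    preload = ":".join(GSI_LIBS)
    if env.get("LD_PRELOAD"):
        preload += ":" + env["LD_PRELOAD"]  # keep any existing preloads
    env["LD_PRELOAD"] = preload

    # Replace this process with the daemon, inheriting the adjusted environment.
    os.execvpe("/usr/sbin/lfcdaemon", ["/usr/sbin/lfcdaemon"], env)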

AOB:

Wednesday

Attendance: local(Alex, Belinda, Luc, Luca C, Luca M, Maarten);remote(Alexander, Federico, Ian, Lisa, Pavel, Rob, Rolf, Saverio, Ulf, Wei-Jen).

Experiments round table:

  • ATLAS reports -
    • T0/Central services
      • NTR
    • Tier1s
      • FZK & CNAF DATADISKS full. Cleaning (unmerged HITS) ongoing. Still out of T0 export
    • Tier2s
      • NTR

  • ALICE reports -
    • KIT: ALICE will increase by 500 the number of concurrent (analysis) jobs, currently throttled at 2k; KIT will check if the load on their firewall remains sustainable

  • LHCb reports -
    • Nothing new to report

Sites / Services round table:

  • ASGC - ntr
  • CNAF - ntr
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF - ntr
  • NLT1
    • At SARA we had a Warning downtime this morning because we had to reboot our tape management servers. We did this in order to solve a problem with our tape backend (for some reason tape access has become slow over the last few days), but unfortunately it didn't solve the problem.
  • OSG
    • switching from DOEGrids to DigiCert CA for new certificates as of today

  • databases - ntr
  • grid services - ntr
  • storage - ntr

AOB:

  • many remote people got their Audioconf connections cut multiple times and some gave up eventually; ticket INC:230369 was opened for the experts:
    • "There was a general problem on Cern phone lines. The technicians fixed it now."

Thursday

Attendance: local(Alex B, Alex L, Eva, Luc, Maarten);remote(Gareth, Kyle, Matteo, Pepe, Rolf, Ronald, Saverio, Ulf, Wei-Jen, WooJin).

Experiments round table:

  • ATLAS reports -
    • T0/Central services
      • NTR
    • Tier1s
      • FZK & CNAF DATADISKS full. Cleaning ongoing. Still out of T0 export
      • PIC & SARA DATADISK getting full.
      • RAL Castor instance problem fixed (disk server failure)
      • QMUL to RAL transfers failing GGUS:91029. Suspect RAL FTS agent.
        • Gareth: this appears to have started before the CASTOR issue, so would be unrelated
    • Tier2s
      • NTR

  • ALICE reports -
    • KIT: the number of concurrent analysis jobs was further increased to 2750.

  • LHCb reports -
    • Nothing new to report

Sites / Services round table:

  • ASGC
    • CMS are affected by a transfer problem from IN2P3 to CASTOR at ASGC; looking into it
  • CNAF
    • CREAM ce08 will be down from tomorrow noon to Feb 7 for an upgrade to EMI-2 on SL6
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF
    • last night there was a break in the OPN connection to Finland: a planned maintenance took 3h15m instead of the expected 20 min; only ALICE may have been affected, but no inquiries were received
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL - nta

  • dashboards - ntr
  • databases
    • during the next 2 weeks the security and recommended Oracle patches will be deployed on the integration databases
  • grid services - ntr

AOB:

Friday

Attendance: local(Alessandro, Alex L, Eddie, Eva, Ian, Ignacio, Luca M, Maarten);remote(Federico, John, Kyle, Lisa, Onno, Roger, Rolf, Saverio, Wei-Jen, Xavier).

Experiments round table:

  • ATLAS reports -
    • Central services
      • Panda saw an increase in "holding" jobs (by a factor of ~5) starting around 15:00 CET. The Panda SLS plots suggest the problem was related to the LFC+DQ2 registration time, which could be due to LFC or DQ2 slowness. Another Panda SLS plot reports the DQ2 registration time alone, which seems to be the culprit. Experts are investigating. Since a new EMI-2 LFC node was added to the ATLAS production LFC only yesterday afternoon, to avoid issues we asked the LFC service manager to remove the new EMI node from the alias for now, even though the LFC performance looks OK in the SLS monitoring (ATLAS LFC); a small sketch of checking the alias membership follows this report.
        • Alessandro: the DBAs did not see any issues in the DB usage by the LFC service
        • Ignacio: the ATLAS LFC service now has 3 EMI-2 and 4 gLite 3.2 nodes; the first EMI-2 node was introduced on Tue; next week the upgrade will be resumed if no issues are observed
    • Tier0/Tier1s
      • QMUL to RAL transfers failing (GGUS:91029, reported yesterday): RAL answered that they had a CASTOR problem, which has since been fixed.
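
A hedged sketch of checking which nodes a DNS alias currently resolves to (e.g. before and after a node is taken out), using only standard DNS lookups; the alias name below is an illustrative placeholder, not necessarily the real ATLAS LFC alias:

    # Sketch: list the hosts currently behind a DNS round-robin alias.
    # "prod-lfc-atlas.cern.ch" is used only as an illustrative alias name.
    import socket

    alias = "prod-lfc-atlas.cern.ch"
    name, _aliases, addresses = socket.gethostbyname_ex(alias)
    print("canonical name:", name)
    for ip in addresses:
        # Reverse-resolve each address to see the individual node names.
        try:
            print(ip, "->", socket.gethostbyaddr(ip)[0])
        except socket.herror:
            print(ip, "-> (no reverse DNS entry)")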

  • CMS reports -
    • LHC / CMS
      • NTR
    • CERN / central services and T0
      • NTR
    • Tier-1:
      • GGUS ticket issued for a glexec issue at a subset of WNs at RAL (GGUS:91060)
    • Tier-2:
      • Suspected File Corruption at T2_US_Purdue (SAV:135604)

  • ALICE reports -
    • KIT: another 500 slots were just added to the number of concurrent analysis jobs. Experts at KIT will keep monitoring their firewall load and alert ALICE when the traffic for ALICE jobs is getting unsustainable.
      • Xavier: the WN routing configuration was wrong in 2 racks, causing their Xrootd traffic to go through the firewall unnecessarily; fixed
      • Maarten: very good news indeed, thanks! We will see if this allows us to add yet more job slots before the weekend
        • this is indeed happening

  • LHCb reports -
    • Nothing new to report, just a few tickets for pilots aborting at Tier-2s. The new LFC is OK.

Sites / Services round table:

  • ASGC - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - nta
  • NDGF - ntr
  • NLT1
    • early in the night 2 dCache pool nodes at SARA failed; they were restarted at 07:30 CET, their files having been unavailable for ~7h
  • OSG - ntr
  • RAL
    • will follow up on the open tickets

  • dashboards - ntr
  • databases - ntr
  • grid services
    • all CERN LFC services (ATLAS, LHCb, shared incl. OPS) currently have a mix of EMI-2 and gLite 3.2 nodes; the upgrades will continue next week if no issues are observed in the meantime
  • storage - ntr

AOB:
