Week of 051017

Open Actions from last week:

bundle clients for new release into LCG (Gavin/Louis)
Test new QF23 FTS clients (James) DONE - DEPLOYED
All services - upgrade kernel to be scheduled. DONE

check possibility of using LCG Quattor WG components for LFC/DPM/... (Jan/Vlado/Sophie) TO TEST
Escalate problem with lost packets on loopback I/F to linux.support (Olof) IN PROGRESS

On Call: Roberto/Antonio

Monday:

Log: Nothing

New Actions:

(Sophie) Shift list!
Jan/James - problem seen with chkconfig in the lcg-mon-gridftp rpm, which meant the services were restarted on reboot. FIX next week?
bug in VOBOX impl - will release new rpm in 2.6 and inform sites. (Simone) DONE

Discussion:

Castor problem report (Eric) - On Friday there was a problem restarting Castor. On the DB side, the DB server became very loaded on Friday, until it was so unresponsive that operators restarted the machine, which led to 1 hour of downtime. Work has been done on resource management at the DB level; a DB restart will be needed to put this in place.
Urgent work in progress to look at rehosting it on more powerful hardware
Jamie would like to discuss at the LCGSCM the logging/reporting and escalation of problems for key services.
oplapro75 - seems to be a powerbar problem - will be investigated today by Andreas.

Tuesday:

Log:

new release of castor done yesterday at 4PM.
stager dead last night at 8pm for three hours - found by a user (Andrew Smith) and fixed by a developer - no monitoring caught this
another 4 hour outage at 3am last night
Derek Ross reports that he is getting TURLs back to machines not in the WAN cluster - no monitoring caught this (looks like this is CMS internal pool nodes)

New Actions:

Thursday morning - FTS, DPM, LFC, VOBOX to be rebooted. (Sophie) DONE on WEDNESDAY
move VOBOXes to Quattor
Pilot to move to 1.4.1 (Gavin)
check with GSSDATLAS/Alice that they use production FTS (Gavin)
electrics upgrade on oplapro nodes tomorrow at 9.30 (Andreas) DONE

Discussion:

kernel upgrade looks like it might solve the loopback interface problem. Vlado running with iptables on 15/16

Wednesday:

Log: Nothing

Actions:

reannounce the intervention + coordinate the "service back" announcement through hep-service-sc-level2@cernNOSPAMPLEASE.ch (Roberto) DONE
intervention at 9:30 (two hours) : CASTOR database moved to a more powerful machine + Power bar rack change + FTS upgrade + FTS/LFC/DPM/VOBOX reboot DONE
ASCC/SARA problems : check with CERN network people (Romain) SOLVED

Discussion:

INDEX problem in the CASTOR database
Jan : CASTOR database monitored
LHCb : concerned about SC3-CASTOR problems

Thursday:

Log:

all interventions went fine
grid8 (=FTS Pilot database) had to be restarted : too many open connections

Actions:

check lcg-sc.support@cernNOSPAMPLEASE.ch (Sophie)
check David Cameron's mapping on CASTOR2 (Olof)
CERN-PIC channel not defined ? -> to check (Gavin)
grid8 problem to follow-up (Gavin/Miguel)

Discussion:

Eric : CASTOR database intervention went fine. Low load on the database.
Simone : David Cameron is having problems -> sent a mail to shiva which we didn't get. Olof will check David's mapping on CASTOR2.
Romain : ASCC/SARA connectivity problem solved
LHCb : CERN-PIC channel seems not to be defined - dCache error

Friday:

Log:

Actions:

Discussion:

-- SophieLemaitre - 19 Oct 2005

Topic revision: r6 - 2007-02-02 - FlaviaDonno
