Week of 051017
Open Actions from last week:
- bundle clients for new release into LCG (Gavin/Louis)
- Test new QF23 FTS clients (James)
DONE - DEPLOYED
- All services - upgrade kernel to be scheduled.
DONE
- check possibility of using LCG Quattor WG components for LFC/DPM/... (Jan/Vlado/Sophie)
TO TEST
- Escalate problem with lost packets on loopback I/F to linux.support (Olof)
IN PROGRESS
On Call: Roberto/ Antonio
Monday:
Log: Nothing
New Actions:
- (Sophie) Shift list!
- Jan/James - problem seen in lcg-mon-gridftp rpm on chkconfig which meant they were restarted on reboot.
FIX next week ?
- bug in VOBOX impl - will release new rpm in 2.6 and inform sites. (Simone)
DONE
Discussion:
- CASTOR problem report (Eric) - On Friday there was a problem restarting CASTOR. On the DB side, the DB server became very loaded on Friday, until it was so unresponsive that operators restarted the machine, which led to 1 hour of downtime. Worked on resource management at DB level; a DB restart will be needed to put this in place.
- Urgent work in progress to look at rehosting it on more powerful hardware
- Jamie would like to discuss at the LCGSCM the logging/reporting and escalation of problems for key services.
- oplapro75 - seems to be a powerbar problem - will be investigated today by Andreas.
Tuesday:
Log:
- new release of castor done yesterday at 4PM.
- stager dead last night at 8pm for three hours - found by a user (Andrew Smith) and fixed by a developer - no monitoring caught this
- another 4 hour outage at 3am last night
- Derek Ross reports that he is getting TURLs back to machines not in the WAN cluster - no monitoring caught this (looks like this is CMS internal pool nodes)
New Actions:
- Thursday morning - FTS, DPM, LFC, VOBOX to be rebooted. (Sophie)
DONE on WEDNESDAY
- move VOBOXes to Quattor
- Pilot to move to 1.4.1 (Gavin)
- check with GSSDATLAS/Alice that they use production FTS (Gavin)
- electrics upgrade on oplapro nodes tomorrow at 9.30 (Andreas)
DONE
Discussion:
- kernel upgrade looks like it might solve the loopback interface problem. Vlado running with iptables on 15/16
Wednesday:
Log: Nothing
Actions:
- reannounce the intervention + coordinate the "service back" announcement through hep-service-sc-level2@cernNOSPAMPLEASE.ch (Roberto)
DONE
- intervention at 9:30 (two hours) : CASTOR database on a more powerful machine + Power bar rack change + FTS upgrade + FTS/LFC/DPM/VOBOX reboot
DONE
- ASCC/SARA problems : check with CERN network people (Romain)
SOLVED
Discussion:
- INDEX problem in the CASTOR database
- Jan : CASTOR database monitored
- LHCb : concerned about SC3-CASTOR problems
Thursday:
Log:
- all interventions went fine
- grid8 (=FTS Pilot database) had to be restarted : too many open connections
Actions:
- check lcg-sc.support@cernNOSPAMPLEASE.ch (Sophie)
- check David Cameron's mapping on CASTOR2 (Olof)
- CERN-PIC channel not defined ? -> to check (Gavin)
- grid8 problem to follow-up (Gavin/Miguel)
Discussion:
- Eric : CASTOR database intervention went fine. Low load on the database.
- Simone : David Cameron is having problems -> sent a mail to shiva which we didn't get. Olof will check David's mapping on CASTOR2.
- Romain : ASCC/SARA connectivity problem solved
- LHCb : CERN-PIC channel seems not to be defined - dCache error
Friday:
Log:
Actions:
Discussion:
--
SophieLemaitre - 19 Oct 2005