Week of 051017

Open Actions from last week:
  • bundle clients for new release into LCG (Gavin/Louis)
  • Test new QF23 FTS clients (James) DONE - DEPLOYED
  • All services - upgrade kernel to be scheduled. DONE

  • check possibility of using LCG Quattor WG components for LFC/DPM/... (Jan/Vlado/Sophie) TO TEST
  • Escalate problem with lost packets on loopback I/F to linux.support (Olof) IN PROGRESS

On Call: Roberto/Antonio

Monday:

Log: Nothing

New Actions:

  • (sophie) Shift list !
  • Jan/James - problem seen in lcg-mon-gridftp rpm on chkconfig which meant they were restarted on reboot. FIX next week ?
  • bug in VOBOX impl - will release new rpm in 2.6 and inform sites. (Simone) DONE

Discussion:

  • Castor problem report (Eric) - On Friday there was a problem restarting Castor. On the DB side, the DB server got very loaded on Friday. It became so unresponsive that operators restarted the machine, which led to 1 hour of downtime. Worked on resource management at DB level - a DB restart will be needed to put this in place.
  • Urgent work in progress to look at rehosting it on more powerful hardware
  • Jamie would like to discuss at the LCGSCM the logging/reporting and escalation of problems for key services.
  • oplapro75 - seems to be a powerbar problem - will be investigated today by Andreas.

Tuesday:

Log:
  • new release of castor done yesterday at 4PM.
  • stager dead last night at 8pm for three hours - found by a user (Andrew Smith) and fixed by a developer - no monitoring caught this
  • another 4 hour outage at 3am last night
  • Derek Ross reports that he is getting TURLs back to machines not in the WAN cluster - no monitoring caught this (looks like this is CMS internal pool nodes)

New Actions:

  • Thursday morning - FTS, DPM, LFC, VOBOX to be rebooted. (Sophie) DONE on WEDNESDAY
  • move VOBOXes to Quattor
  • Pilot to move to 1.4.1 (Gavin)
  • check with GSSDATLAS/Alice that they use production FTS (Gavin)
  • electrics upgrade on oplapro nodes tomorrow at 9.30 (Andreas) DONE

Discussion:

  • kernel upgrade looks like it might solve the problem of loopback interface. Vlado running with iptables on 15/16

Wednesday

Log: Nothing

Actions:

  • reannounce the intervention + coordinate the "service back" announcement through hep-service-sc-level2@cernNOSPAMPLEASE.ch (Roberto) DONE
  • intervention at 9:30 (two hours) : CASTOR database moved to a more powerful machine + Power bar rack change + FTS upgrade + FTS/LFC/DPM/VOBOX reboot DONE
  • ASCC/SARA problems : check with CERN network people (Romain) SOLVED

Discussion:

  • INDEX problem in the CASTOR database
  • Jan : CASTOR database monitored
  • LHCb : concerned about SC3-CASTOR problems

Thursday

Log:
  • all interventions went fine
  • grid8 (=FTS Pilot database) had to be restarted : too many open connections

Actions:

  • check lcg-sc.support@cernNOSPAMPLEASE.ch (Sophie)
  • check David Cameron's mapping on CASTOR2 (Olof)
  • CERN-PIC channel not defined ? -> to check (Gavin)
  • grid8 problem to follow-up (Gavin/Miguel)

Discussion:

  • Eric : CASTOR database intervention went fine. Low load on the database.
  • Simone : David Cameron is having problems -> sent a mail to shiva which we didn't get. Olof will check David's mapping on CASTOR2.
  • Romain : ASCC/SARA connectivity problem solved
  • LHCb : CERN-PIC channel seems not to be defined - dCache error

Friday

Log:

Actions:

Discussion:

-- SophieLemaitre - 19 Oct 2005

Topic revision: r6 - 2007-02-02 - FlaviaDonno
 