Week of 051205

Open Actions from last week:
  • Move castor2 stager. Stop at 09.30 Monday for 2 hours.
  • Schedule service stop on LXFS6051 for Elonex disk intervention. Proposal is 09.30 Monday for 90 minutes.
  • L.Field will look at the discrepancy between GRIDVIEW reports of cms traffic and what CMS (and lemon) see.
  • Decide how to avoid logs in /opt/'lcg-application'/var filling the root file system.
  • Schedule deployment of new lcg-mon-gridftp.

Chair: Harry

On Call: J-P.Baud, D.Smith

Monday:

Log: lcg-mon-gridftp raised a wrong alarm on oplapro77 (a castor2 disk server). This exposed a bug in a Python script, which was fixed by ML.

New Actions: Stop all fts services and perform the castor2 stager move and lxfs6051 disk intervention. Probably couple this with new versions of gridftp and srm.

Discussion: Sophie (on call) was unable to access oplapro77 and suggested that all level2 support staff should have universal access. James pointed out that the oplapro nodes are rather specialised castor2 disk servers and, with improved software, should not need such access. From tomorrow there will be no UPS backup for a week except in the critical power area.

Tuesday:

Log: oplapro78 generated a wrong lcg-mon-gridftp alarm due to a 1 second time window while a new version was being put in. lxshare026d gave a high-load alarm. P.Badino reported it as due to the CERN-BNL fts channel eating memory and has temporarily enabled a cron task to restart the agents every 30 minutes. gdrb01 and gdrb08 generated high-load alarms. The default threshold of 16 is probably too low.

New Actions: D.Smith to follow up on redefining the gdrb high-load threshold. DONE - new value is 24.

Discussion: M.dos Santos plans to install a new castor client on 5 lxplus nodes. He will confirm date/time later. DONE at 10.30.

Wednesday:

Log: lfc010 gave an lfc_db_error. lfcdaemon was restarted by the operator as per the opnote. lfc001 gave the same error, but it is a test of the new version, which restarted automatically.

Actions: Stop unnecessary lfc_db_error alarms in the new software. Upgrade all lfc servers to the new software from 09.30 Thursday. There will be 30 minutes of downtime, so send a warning mail to the list.

Discussion: CS have requested router tests from 09.30 to 10.30 on the Force10 router in front of the SC3 hardware. The lxshare026d high-load alarm is still under investigation; we are trying to reproduce it in the pilot service.

Thursday:

Log: Nothing

Actions:

Discussion:

Friday:

Log: lxfsrk524 has a file system error and needs a reboot. It is the reference classic SE. It was agreed with ML to reboot at 09.30. lxb2003 logged an error that is probably connected, as it NFS-mounts lxfsrk524.

Actions: Reboot lxfsrk524 at 09.30 (sysadmin) and check it (ML). DONE.

Discussion: The castor VDQM upgrade failed again (OB report). They will try again on Monday morning.

Topic revision: r7 - 2005-12-12 - HarryRenshall
 