Week of 060116

Open Actions from last week:

Chair: James

On Call: Roberto + Simone

Monday:

Log: CASTOR problems over the weekend. DLF crashed during the nights. SARA and FZK took no traffic over the weekend. NDGF still has some networking problems to resolve before starting up.

New Actions:

Discussion:

Tuesday:

Log: SARA network has gone down again.

New Actions:

  • Need to check whether the switch for the c2sc3 cluster is overloaded. JanvE to check with Marc for monitoring pages. DONE

Discussion:
  • FZK is OK again. NDGF is working through the routing problems, now at GEANT.
  • Need to check whether the switch for the c2sc3 cluster is overloaded. JanvE to check with Marc for monitoring pages.

Wednesday:

Log: SARA has network problems again.

Actions:

  • Move ATLAS DC work off the switch - Jan DONE

Discussion:

  • The switch is running at 87% on its outgoing port (1g). It looks like the ATLAS DC work puts around 300MB of load on it from time to time (7 hours yesterday); see the note below.
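
For scale (assuming the outgoing port is 1 Gbit/s, as "1g" suggests):

    0.87 × 1 Gbit/s ≈ 870 Mbit/s ≈ 109 MB/s of sustained outgoing traffic

which leaves little headroom on that port while the ATLAS DC load is also present.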

Thursday:

Log: LCGMON_GRIDFTP_WRONG on lxfsrm522. VAR_FULL on lxb0729. BNL report problems with not enough jobs being scheduled in FTS.

Actions:

  • Update the operations procedure for LCGMON_GRIDFTP_WRONG - James
  • Clean up the WS logs on lxb0729 and write a logrotate configuration for them (sketch below) - Gavin DONE
  • Announce the DB intervention on all services for 26 Jan - James/Maarten DONE
  • Announce details of the new update slot of Wed 9.30AM-10.30AM - James
  • Look at lxshare026d regarding BNL's problem - Gavin/Paolo DONE
  • Find a usage plan from the experiments for castorgridsc after the SC3 rerun is over - Harry
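
A minimal sketch of what the lxb0729 logrotate configuration could look like (the file name, log path and retention below are assumptions for illustration, not taken from the meeting):

    # /etc/logrotate.d/ws-logs  -- hypothetical name and log path
    /var/log/ws/*.log {
        daily
        rotate 14
        compress
        missingok
        notifempty
    }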

Discussion:

  • DB intervention needed for Oracle non-rolling upgrades on RAC. It will be done after the 10-day SC phase, on the 26th.
  • It is noted that fsra3008 has half the files of the other nodes, and correspondingly half the data rate.
  • BNL are not seeing their transfer slots filled; the cause needs investigation.
  • The Gridview rate is now OK, with an error of about 5% (30-50 MB/s); see the rough cross-check below.
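
As a rough cross-check of that error figure (assuming the aggregate transfer rate was in the 600-1000 MB/s range, which is not stated above):

    5% of 600 MB/s ≈ 30 MB/s and 5% of 1000 MB/s ≈ 50 MB/s

which matches the 30-50 MB/s quoted.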

Friday:

Log: FTS under high load during the afternoon. CASTOR2 was in debug logging mode, slowing down the scheduler.

Actions:

  • Pilot LFC/FTS DB backends (grid1/8) to be rebooted on Monday at 9.30.

Discussion:
  • The high load was caused by a memory leak in the agent, plus problems connecting to the Oracle backend (queries taking a long time); as a result the channels were not completely filled for several hours in the afternoon/early evening.
  • Olof changed the debug level and restarted at 9 PM, which helped queue throughput in CASTOR.
  • Gavin restarted all the channel agents at 9, which brought the rate back up to 1 Gig/s again.
