Week of 060116
Open Actions from last week:
Chair: James
On Call: Roberto + Simone
Monday:
Log: CASTOR problems over the weekend. DLF crashed during the nights. SARA+FZK taking no traffic during the weekend. NDGF still has some networking problems to resolve before starting up.
New Actions:
Discussion:
Tuesday:
Log: SARA network has gone down again.
New Actions:
- need to check whether the switch for the c2sc3 cluster is overloaded. JanvE to check with Marc for the monitoring pages.
DONE
Discussion:
- FZK is OK again.
- NDGF working through the routing problems - now at GEANT.
Wednesday:
Log: SARA has network problems again.
Actions:
- Move ATLAS DC work off the switch - Jan
DONE
Discussion:
- Switch is running at 87% on the outgoing port (1Gb/s). It looks like the ATLAS DC work is putting 300MB of load on it from time to time (7 hours yesterday).
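For reference, a rough sketch of the kind of check behind a port-utilisation figure like the 87% above: sample the port's SNMP byte counter twice and divide the delta by link capacity. This is not the actual monitoring pages; the switch name, community string and ifIndex below are assumptions, not the real c2sc3 values.

    # Sketch only: estimate a switch port's outgoing utilisation via SNMP.
    # SWITCH, COMMUNITY and IFINDEX are made-up placeholders.
    import subprocess
    import time

    SWITCH = "switch.example.cern.ch"   # assumption: not the real c2sc3 switch
    COMMUNITY = "public"                # assumption
    IFINDEX = 1                         # assumption: index of the 1Gb/s port
    LINK_BPS = 1e9                      # 1Gb/s link capacity
    INTERVAL = 60                       # seconds between samples

    def out_octets():
        # IF-MIB::ifHCOutOctets is the 64-bit outgoing byte counter;
        # the 32-bit ifOutOctets would wrap in under a minute at these rates.
        value = subprocess.check_output(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv",
             SWITCH, "IF-MIB::ifHCOutOctets.%d" % IFINDEX])
        return int(value)

    before = out_octets()
    time.sleep(INTERVAL)
    after = out_octets()
    utilisation = (after - before) * 8 / INTERVAL / LINK_BPS
    print("outgoing port utilisation: %.0f%%" % (utilisation * 100))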
Thursday:
Log: LCGMON_GRIDFTP_WRONG on lxfsrm522. VAR_FULL on lxb0729. BNL report problems with not enough jobs being scheduled in FTS.
Actions:
- update op procedure for LCGMON_GRIDFTP_WRONG - James
- clean up WS logs on lxb0729 and write a logrotate config for it (see sketch after this list) - Gavin
DONE
- announce DB intervention on all services for 26th Jan - James/Maarten
DONE
- announce details of the new update slot, Wed 9.30AM-10.30AM - James
- look at BNL's problem on lxshare026d - Gavin/Paolo
DONE
- Find usage plan from the experiments for castorgridsc after the SC3 rerun is over - Harry
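A minimal logrotate sketch for the lxb0729 action above. The log path and retention policy are assumptions and would need confirming against the actual WS log location:

    # /etc/logrotate.d/ws-logs -- sketch only; path and policy are assumptions
    /var/log/ws/*.log {
        daily
        rotate 7
        compress
        missingok
        notifempty
        # truncate in place so rotation works while the service holds the file open
        copytruncate
    }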
Discussion:
- DB intervention needed for Oracle non-rolling upgrades on the RAC. Will be done after the 10-day SC phase, on the 26th.
- It's noted that fsra3008 has half the files of the other nodes, and correspondingly half the data rate.
- BNL not seeing their transfer slots filled; the cause needs investigation.
- GridView rate is now OK, with an error of about 5% (30-50MB/s).
Friday:
Log: FTS high load during the afternoon. castor2 was in debug logging mode, slowing down the scheduler.
Actions:
- Pilot LFC/FTS DB backends (grid1/8) to be rebooted on Monday at 9.30.
Discussion:
- High load was caused by a memory leak in the agent; that, plus problems connecting to the Oracle backend (queries taking a long time), meant that channels weren't completely filled for several hours in the afternoon/early evening (see memory-watch sketch below).
- At 9pm Olof changed the debug level and restarted, which helped queue throughput in castor.
- Gavin restarted all channel agents at 9, which brought the rate back up to 1GB/s.
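A minimal sketch of one way to spot an agent memory leak like the one above before it bites: periodically sample the process's resident memory (VmRSS) from /proc. This is not the FTS tooling, and the process name "channel-agent" is a placeholder.

    # Sketch only: log an agent's resident memory over time to spot a leak.
    # The pgrep pattern "channel-agent" is an assumption, not the real name.
    import subprocess
    import time

    pid = subprocess.check_output(
        ["pgrep", "-f", "channel-agent"]).split()[0].decode()

    while True:
        # VmRSS in /proc/<pid>/status is the process's resident set size
        for line in open("/proc/%s/status" % pid):
            if line.startswith("VmRSS"):
                print(time.strftime("%H:%M:%S"), line.strip())
        time.sleep(300)  # sample every five minutes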