Week of 060116
Open Actions from last week:
Chair: James
On Call: Roberto + Simone
Monday:
Log: CASTOR problems over the weekend. DLF crashed during the nights. SARA+FZK taking no traffic during the weekend. NDGF still has some networking problems to resolve before starting up.
New Actions:
Discussion:
Tuesday:
Log: SARA network has gone down again.
New Actions:
- need to check whether the switch for the c2sc3 cluster is overloaded. JanvE to check with Marc for the monitoring pages.
DONE
Discussion:
- FZK is OK again.
- NDGF working through the routing problems - now at GEANT.
Wednesday:
Log: SARA has network problems again.
Actions:
- Move ATLAS DC work off the switch - Jan
DONE
Discussion:
- Switch is running at 87% on the outgoing port (1Gb/s). It looks like the ATLAS DC work is putting 300MB of load on it from time to time (7 hours yesterday).
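For reference, a rough sketch of the kind of check behind a port-utilisation figure like the 87% above: sample the port's SNMP byte counter twice and divide the delta by link capacity. This is not the actual monitoring pages; the switch name, community string and ifIndex below are assumptions, not the real c2sc3 values.

    # Sketch only: estimate a switch port's outgoing utilisation via SNMP.
    # SWITCH, COMMUNITY and IFINDEX are made-up placeholders.
    import subprocess
    import time

    SWITCH = "switch.example.cern.ch"   # assumption: not the real c2sc3 switch
    COMMUNITY = "public"                # assumption
    IFINDEX = 1                         # assumption: index of the 1Gb/s port
    LINK_BPS = 1e9                      # 1Gb/s link capacity
    INTERVAL = 60                       # seconds between samples

    def out_octets():
        # IF-MIB::ifHCOutOctets is the 64-bit outgoing byte counter;
        # the 32-bit ifOutOctets would wrap in under a minute at these rates.
        value = subprocess.check_output(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv",
             SWITCH, "IF-MIB::ifHCOutOctets.%d" % IFINDEX])
        return int(value)

    before = out_octets()
    time.sleep(INTERVAL)
    after = out_octets()
    utilisation = (after - before) * 8 / INTERVAL / LINK_BPS
    print("outgoing port utilisation: %.0f%%" % (utilisation * 100))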
Thursday:
Log: LCGMON_GRIDFTP_WRONG on lxfsrm522. VAR_FULL on lxb0729. BNL report problems with not enough jobs being scheduled in FTS.
Actions:
- update op procedure for LCGMON_GRIDFTP_WRONG - James
- clean up WS logs on lxb0729 and write a logrotate config for it (see sketch after this list) - Gavin
DONE
- announce DB intervention on all services for 26th Jan - James/Maarten
DONE
- announce details of the new update slot, Wed 9.30AM-10.30AM - James
- look at BNL's problem on lxshare026d - Gavin/Paolo
DONE
- Find usage plan from the experiments for castorgridsc after the SC3 rerun is over - Harry
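A minimal logrotate sketch for the lxb0729 action above. The log path and retention policy are assumptions and would need confirming against the actual WS log location:

    # /etc/logrotate.d/ws-logs -- sketch only; path and policy are assumptions
    /var/log/ws/*.log {
        daily
        rotate 7
        compress
        missingok
        notifempty
        # truncate in place so rotation works while the service holds the file open
        copytruncate
    }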
Discussion:
- DB intervention needed for Oracle non-rolling upgrades on the RAC. Will be done after the 10-day SC phase, on the 26th.
- It's noted that fsra3008 has half the files of the other nodes, and correspondingly half the data rate.
- BNL not seeing their transfer slots filled; the cause needs investigation.
- GridView rate is now OK, with an error of about 5% (30-50MB/s).
Friday:
Log: FTS high load during the afternoon. castor2 was in debug logging mode, slowing down the scheduler.
Actions:
- Pilot LFC/FTS DB backends (grid1/8) to be rebooted on Monday at 9.30.
Discussion:
- High load was caused by a memory leak in the agent; that, plus problems connecting to the Oracle backend (queries taking a long time), meant that channels weren't completely filled for several hours in the afternoon/early evening (see memory-watch sketch below).
- At 9pm Olof changed the debug level and restarted, which helped queue throughput in castor.
- Gavin restarted all channel agents at 9, which brought the rate back up to 1GB/s.
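A minimal sketch of one way to spot an agent memory leak like the one above before it bites: periodically sample the process's resident memory (VmRSS) from /proc. This is not the FTS tooling, and the process name "channel-agent" is a placeholder.

    # Sketch only: log an agent's resident memory over time to spot a leak.
    # The pgrep pattern "channel-agent" is an assumption, not the real name.
    import subprocess
    import time

    pid = subprocess.check_output(
        ["pgrep", "-f", "channel-agent"]).split()[0].decode()

    while True:
        # VmRSS in /proc/<pid>/status is the process's resident set size
        for line in open("/proc/%s/status" % pid):
            if line.startswith("VmRSS"):
                print(time.strftime("%H:%M:%S"), line.strip())
        time.sleep(300)  # sample every five minutes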