Week of 050718

Open Actions from last week:
  • LEMON sensors for FTS - Gavin WAITING ON FIO
  • Gavin: install last FTS version and test
  • Move all users over to castor2 - James
  • Hourly tests (like SFT) - Simone FIRST VERSION DONE
  • Jan/Sophie/Laurence : How to run "central" LEMON sensors LEAVE
  • All: Go through the 2nd level support
  • Ben: check oplapro79 to see if system killed SRM process NOTHING FOUND
  • Olof: castor2 - Try with one node moving DONE
  • bdflush parameters. lxshare220d.
  • James/Ben: check oplapro80 for access for Ben. DONE

On shift: Jean-Philippe/Jamie

Monday:

Log: Many interventions.
  • FTS005 died NO_CONTACT - came back up
  • /tmp 92% on sc3-fts - Gavin - FTS dumping core
  • castorgridsc - 1 LSF plugin problem.
    • poor throughput in LSF - restarting LSF helps.

New Actions:

  • Gavin: Investigate core dumping on glite-url-copy
  • Check again the operator levels.

Discussion:

  • hourly tests - like to try with new version of lcg_utils
  • Olof: Wait until this afternoon to move castor1 nodes into WAN pool

Tuesday:

Log:
  • RAS

Actions:

  • Testing of L2 procedures - JPB for FTS, JDS for LFC
  • All users have now been moved to CASTOR2. Castor1 pools will be moved over today by Vlado
  • LEMON sensors? no news
  • New FTS version: being installed / tested
  • Hourly tests - still need to be deployed
  • Daily publishing of FTS plots can now be done properly
  • BD flushing: to be done
  • Core dumping on glite URL copy: Gav
  • Operator levels: to be checked

New actions:

  • Move of castor1 nodes into wan pool
  • GDB - monitoring talk? One of Gridview team

AOB:

  • castor lsf monitoring pages now available - will be added to monitoring pages
  • GDB tomorrow is open - please attend!

Wednesday:

Log:
  • Rate down - not clear why. Accidental temporary removal of PhEDEx pools (affected DESY). Misconfiguration. Understood.
  • Problems with Spanish - cert problem. Refreshed CRLs - back now.
  • CRL proxy now set up (Thorsten). Recipe given to Vlado. CRLs on all nodes 'soon'.
  • CASTOR2 migration - diskservers now in wan pool. Problem now is how to populate them. Servers are chosen on the basis of capability and, as these are low compared to IA64, they will rarely be selected. Could move to separate service class temporarily and then back. Olof will do this a.m.

Actions:

  • LEMON sensors. Still pending.
  • Testing of latest FTS - Paolo.
  • LFC smoke tests being checked. Partly a learning process and partly debugging (JDS)...
  • Core dumps on gLite URL copy. Mail from Gav classifying dumps. Majority inside globus - Ben will investigate - most transfers to Triumf. Maarten can also help... Core filesize set to 1KB - this results in no dump at all... Gav will set up cron job to 'archive' core dumps.
  • operator levels - no calls on fts production box at night. Need to up level with FIO.
  • monitoring talk by member of gridview team.
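
The core-dump archiving job mentioned above could look roughly like the sketch below. This is an illustration only, not the script Gav actually wrote: the function name, the `core*` file pattern and the dated archive layout are all assumptions, and the directories passed in are hypothetical.

```python
import shutil
import time
from pathlib import Path


def archive_core_dumps(src_dir, archive_dir, pattern="core*"):
    """Move core files matching `pattern` out of src_dir.

    Files go into a subdirectory of archive_dir named after today's date
    (YYYYMMDD). The pattern and layout are assumptions for illustration.
    Returns the list of archived paths.
    """
    src = Path(src_dir)
    dest = Path(archive_dir) / time.strftime("%Y%m%d")
    moved = []
    for core in sorted(src.glob(pattern)):
        if not core.is_file():
            continue  # skip directories that happen to match the pattern
        dest.mkdir(parents=True, exist_ok=True)
        # Append the modification time so later dumps with the same
        # name do not overwrite earlier ones in the archive.
        target = dest / f"{core.name}.{int(core.stat().st_mtime)}"
        shutil.move(str(core), str(target))
        moved.append(target)
    return moved
```

Run periodically (for example from cron), this would keep the dumps available for analysis while stopping them from filling the working area, which is the kind of problem seen with /tmp on sc3-fts on Monday.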

New actions:

  • fts alarms - james -> vlado

AOB:

  • problem with Oracle backups for castor ns - requires 5 minutes downtime. No backup currently being done! Can live like this until end July unless we schedule some interventions.
  • DB statistics being gathered without restart - action removed.

Thursday:

Log:
  • FTS: 17:00 daemon died - reasons unknown. Nothing in logs. No core files. watchdog script had a problem due to lockfile(?) Fixed (Paolo) - QF of config service: Gav -> Alberto test. bug open, moved to critical.
  • gridview problem - archiver went down 22:00 - 05:00. Running 2 archives (prod + backup). Will copy data over.
  • Traffic went down at 05:00 to zero - no alarms. Came back after one hour. Gav - check FTS logs. Around 360MB/s since. SARA down. Tried rebooting dCache several times. Complete reinstall of dCache from scratch(!) incl. upgrade to latest version.
  • Mail from Ron - organise a dCache workshop. Suggest to Michael.
  • 1000 files copied into 9 extra ia32 nodes.
  • Ben has tagged version of SRM that fixes many of current bugs. Test now - deploy eventually Monday?
  • Deploy new version of LSF plug-in today (as patch). Should avoid regular restart of LSF.
  • lfc log on lfc004 grew rapidly recently - jpb to check

Actions:

  • Lemon sensor for FTS - Gav to follow with FIO.
  • Mail from James announcing tape tests. Purge all outstanding jobs. Special tape details for some sites. Ask for others. Probably turn 3-4 on this morning.
  • Testing of new inter-VO FTS on-going
  • FTS/LFC smoke-tests ongoing.
  • SC3 service phase resource scheduling needed for PEB Tuesday

Friday:

Log:
  • Files 'stuck' in castor - cleanup of these affected scheduler queues. OB - loss of throughput partly due to this but also due to DB lock (prevented any activity...) Lock retained after closing application. Killed session - ok. 16 - 18 stuck files. Not fully understood. Might be related to reconfigs in early July. Correlates to ME's high error rate on these?
  • Some sites writing to tape - in2p3, infn (60), bnl, sara(120-20-60), ral(40-60)
  • rlsatlas: activity? Tail of Rome production; some analysis. Couldn't connect once out of three. (Sophie's migration is complete - cleanup on lfc side)
  • Upgrade of gridftp monitors - requires upgrade of r-gma. Schedule slot for next week.

Actions:

  • Foresee intervention slot at 09:30 of one hour daily to clean problems, rather than attempting on the fly. Mail to service-challenge-tech (JS) to announce
  • New FTS comes not only with multi-VO support but also agent framework. FH -> PEB
  • Lemon sensors for FTS - no change
  • Why FTS died? No progress - switch core file back on.
  • Gridview problem - lost archiver last night - understood.
  • Copying files to ia32 - bug in script - need to rerun.
  • Smoketest schedule - still pending
  • dCache workshop at DESY end August. Use cern-desy challenge for testing. 'reference setup'

Topic revision: r6 - 2005-07-22 - JamieShiers