Week of 051031

Open Actions from last week:
  • Look at queries in FTS for locking problem (Gavin/Kris) IN PROGRESS
  • Pilot to move to 1.4.1 (Gavin) FOR Intervention

  • move VOBOXes to Quattor
  • check possibility of using LCG Quattor WG components for LFC/DPM/... (Jan/Vlado/Sophie) TO TEST
  • Escalate problem with lost packets on loopback I/F to linux.support (Vlado) ESCALATED

On Call:

  • Simone + Patricia

Monday:

Log: Nothing

New Actions:

  • Procedure

Discussion:

  • Sebastien had some problems during the upgrade, but by 9pm everything was up and running. Oracle 10g

Tuesday:

Log: Nothing

New Actions:

  • Stress test all components for 24 hours with a view to returning to full production at 09.00 tomorrow. Checkpoint meeting at 14.00 in 28-R-015.

Discussion:

  • Monday upgrades went well. CASTOR2 finished by 8.40 and batch was restarted. Physics databases kernel patches done. LFC moved to using RAC writer/reader accounts but upgrade to version supporting VOMS and virtual ids has been postponed. FTS moved to RAC and agents are back and reconfigured on all production and pilot nodes. Gavin will now test 3rd party copies and make site tests. The plan is to use SRM copy for all T1 dcache sites. Ben Couturier discovered a compatibility problem with the new version of the CASTOR2 client and rebuilt the SRM using new CASTOR2 header files.

Tuesday 17:00 Checkpoint meeting

  • LFC: OK, still waiting for aliases to be set up

  • FTS:
    • Network problem (SARA); CS investigating.
    • Servers running OK; tests running. CERN OK, problems against T1s.
    • All channels except BNL & GridKA are using 3rd party copy. Problems with SRM copy; asking T1s to investigate.
    • Neither is urgent for ATLAS, but GridKA is important for both ALICE and LHCb.

  • CASTOR2 DLF: high load. Proposal to move to a new machine. To be followed up.

  • To bring service back up:
    • Need to contact GridKA, Lyon, RAL and CNAF, sort out network with SARA, understand srmcopy (GridKA and BNL)

Wednesday

Log: Nothing

Actions: Need to understand srmcopy (to GridKA and BNL) and the out-of-memory behaviour on lxshare026d.

Discussion: Most sites fixed during the day. CNAF thought to have one bad server causing a 30% transfer failure rate. SARA network route problem fixed at end of afternoon. RAL problem fixed about 17.00 going from 0 to 100% success in transfers. Later Patricia performed systematic tests of moving 100 files from CERN to each T1 and this provoked lxshare026d to run out of memory and kill processes. Simone observed several glite_data_conf processes each taking 300MB - to be followed up. It was agreed to inform experiments that grid services are back in production. Sophie reported that the requested lfc-cern load-balanced alias has still not been created by CS group.
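The per-site figures quoted above (a 30% failure rate, 0 to 100% success) are simple ratios over a batch of transfers. As a minimal sketch of that arithmetic only, assuming each test result is recorded as a (site, succeeded) pair: the function name and the sample data below are hypothetical and are not taken from the FTS tools or from the actual test logs.

```python
from collections import defaultdict


def success_rates(transfers):
    """Return {site: percentage of successful transfers} from (site, ok) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for site, ok in transfers:
        totals[site] += 1
        if ok:
            successes[site] += 1
    return {site: 100.0 * successes[site] / totals[site] for site in totals}


# Hypothetical sample: 100 files per site, with 30 failures at one site.
sample = [("CNAF", i >= 30) for i in range(100)] + [("RAL", True)] * 100
rates = success_rates(sample)
```

Keeping the tally per site makes it easy to see which T1 still needs attention after a batch of test transfers.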

Thursday

Log: Nothing

Actions: Find memory leak in FTS transfer agent code. DONE

Discussion: FTS transfers to most sites became fully operational yesterday though SARA still has a routing problem. Aggregate rates of 200MB/sec out of CERN were reached. lxshare026d, the main server for the transfer agents, again gave a high load alarm about 21.00 and was using a GB of swap. Gavin has reproduced the problem on fts006 and they are looking for a memory leak. Meanwhile he has set up a cron entry to restart the affected daemon hourly. Harry reported we are close to swapping over the lcg-bdii server to higher reliability hardware but prefers to let it run over the weekend. We need an operational procedure in any case. He noticed TRIUMF-GC-LCG2 bdii does not respond but this turned out to be normal (production bdii is TRIUMF-LCG2).
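One simple way to tell a suspected leak apart from ordinary load is to sample a process's resident memory periodically and check whether it only ever grows. The sketch below is a heuristic under assumed inputs (a list of memory samples in MB) and is not the method used to track down the FTS transfer agent problem; the function name, threshold and sample values are hypothetical.

```python
def looks_like_leak(samples_mb, min_growth_mb=50.0):
    """Heuristic: True if memory never shrinks and grows by at least min_growth_mb."""
    if len(samples_mb) < 3:
        # Too few samples to say anything about a trend.
        return False
    never_shrinks = all(b >= a for a, b in zip(samples_mb, samples_mb[1:]))
    total_growth = samples_mb[-1] - samples_mb[0]
    return never_shrinks and total_growth >= min_growth_mb


# Hypothetical samples, in MB, taken at regular intervals.
growing = [120.0, 180.0, 240.0, 300.0]
steady = [120.0, 125.0, 118.0, 122.0]
```

A process that releases memory between samples is treated as healthy here, which keeps the check conservative; a real investigation would still need a profiler or code inspection to locate the leak.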

Friday

Log: Problems yesterday at CERN - network router outage in vault and power cut. CASTOR NS down 1.45 to 5 pm.

Actions:

  • QF for FTS memory leak code - beginning of next week (Paolo)
  • ATLAS problem with LFC (Simone)

Discussion:

-- HarryRenshall - 28 Oct 2005

Topic revision: r10 - 2007-02-19 - FlaviaDonno
 