Analysis of incident with the SRM service at PIC on 31-Oct - 1-Nov:

(added by Olof on behalf of Gonzalo)

Problem and impact

On Friday 31 October, at around 16:30 UTC, a problem was noticed on the SRM service. It led to significant service degradation for all VOs using the SRM service: most SRM operations timed out. Corrective actions were taken at 23:00 UTC, but the service did not recover normal operation until 02:00 UTC. A second glitch started at 06:30 UTC on 1 November and lasted for one hour.

Detail

The dCache SRM head node (dcsrm02.pic.es) and the pnfs server (dcns03.pic.es) were under high load. Requests to dcsrm02 timed out, and network map scans on the service port (8443) frequently returned "filtered", meaning that the service was not accepting new TCP connections. The queues on the pnfsManager (the dCache component on the pnfs server) were relatively long (over 100 queries queued) for some of the threads.
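
The port check described above can be approximated with a short script. This is a minimal sketch, not the tool used during the incident: the function name probe_tcp_port and the returned labels are assumptions chosen for illustration, and the "filtered" label here only means that a connection attempt got no answer within the timeout, which is a rough analogue of what a port scanner reports.

```python
import socket


def probe_tcp_port(host, port, timeout=5.0):
    """Try to open one new TCP connection and classify the result.

    Returns (labels are illustrative, not taken from any scanner):
      "open"     - the connection was accepted
      "closed"   - the connection was actively refused
      "filtered" - no answer within `timeout` seconds
      "error"    - any other failure (e.g. name resolution, no route)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "closed"
    except (socket.timeout, TimeoutError):
        return "filtered"
    except OSError:
        return "error"


# Hypothetical usage against the SRM head node named in this report:
#   probe_tcp_port("dcsrm02.pic.es", 8443)
```

Running such a probe periodically and recording the result would show when the service stops accepting new connections, independently of whether established connections are still being served.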

Actions

The following actions were taken:

  • restart of the SRM server
  • restart of the dCache head nodes (poolmanager, admin domain, location manager)
  • complete reboot of the SRM server

The system eventually became responsive again, but the cause was external (a decrease in load from the application) and not our actions.

Follow-up

  • Understand how to improve pnfs performance to avoid pnfsManager thread queues, which generate heavy load and timeouts.
  • Try different performance approaches for both the pnfs and SRM servers:
    • Upgrading the SRM server to a 64-bit machine with a 64-bit Java virtual machine
    • Upgrading the pnfs server to a faster version
    • Upgrading the pnfs PostgreSQL database to 8.3 (a performance boost is expected)
  • Understand the effect of the checks FTS performs when transferring data

-- OlofBarring - 16 Apr 2009

Topic revision: r1 - 2009-04-16 - OlofBarring
 