Analysis of incident with the SRM service at PIC on 31-Oct - 1-Nov:
(added by Olof on behalf of Gonzalo)
Problem and impact
On Friday 31st October, at around 16:30 UTC, a problem was noticed on the SRM service. It led to significant service degradation for all VOs using the SRM service: most SRM operations timed out.
Corrective actions were taken at 23:00 UTC, but the service did not recover normal operation until 02:00 UTC. A second glitch started at 06:30 UTC on 1st November and lasted for one hour.
Detail
The dCache srm head node (dcsrm02.pic.es) and the pnfs server
(dcns03.pic.es) were under high load.
Requests to dcsrm02 timed out, and network map scans of the service port (8443) frequently returned "filtered", meaning that the service was not answering new TCP connections.
The queues on the pnfsManager (the dCache component running on the pnfs server) were relatively long for some of the threads (over 100 queries queued).
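The port check described above can be approximated with a plain TCP connect attempt. The following is only a minimal sketch, not the tool that was used during the incident: the function name `probe_tcp_port` and the three result labels are assumptions made for illustration, and a connect probe is coarser than a real port scanner. It does distinguish a port that accepts connections, one that actively refuses them, and one that gives no usable answer within the timeout.

```python
import socket


def probe_tcp_port(host, port, timeout=5.0):
    """Try one TCP connection and classify the result.

    Returns "open" if the connection is accepted, "closed" if it is
    actively refused, and "no-response" for a timeout or any other
    socket-level failure (including name resolution errors).
    The labels are illustrative and are not the scanner's own terms.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on the port.
        return "closed"
    except OSError:
        # Covers socket.timeout and unreachable-host errors.
        return "no-response"


if __name__ == "__main__":
    # Host and port are the ones named in this report.
    print(probe_tcp_port("dcsrm02.pic.es", 8443))
```

Run periodically, a probe of this kind would show when the SRM port stops accepting new connections, which is the symptom reported here.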
Actions
The following actions were taken:
- restart of the srm server
- restart of the dcache head nodes (poolmanager, admin domain, location manager)
- complete reboot of the srm server
- the system eventually became responsive again, but the cause was external (a decrease in load from the application) rather than any of our actions.
Follow-up
- Understand how to improve pnfs performance so as to avoid long pnfsManager thread queues, which generate heavy load and timeouts.
- Try different performance approaches for both the pnfs and srm servers:
  - Upgrading srm server to a 64-bit machine with a 64-bit java virtual machine
  - Upgrading pnfs server to faster version
  - Upgrading pnfs postgresql database to 8.3 - a performance boost is expected
- Understand the effect of FTS behavior doing checks when transferring data
--
OlofBarring - 16 Apr 2009