PICServiceIncidentReport20090416



Analysis of incident with the SRM service at PIC on 31-Oct - 1-Nov:

(added by Olof on behalf of Gonzalo)

Problem and impact

On Friday 31 October, at around 16:30 UTC, a problem was noticed on the SRM service. It led to significant service degradation for all VOs using the SRM service: most SRM operations timed out. Corrective actions were taken at 23:00 UTC, but the service did not recover normal operation until 02:00 UTC. A second glitch of the service started at 06:30 UTC on 1 November and lasted for one hour.
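As a small worked example, the length of the two degradation windows can be computed from the times quoted above. This is only a sketch: the report gives no year, so 2008 is assumed here (31 October 2008 fell on a Friday), and the start times are approximate ("around 16:30").

```python
from datetime import datetime, timedelta

# Assumption: the year is 2008; the report itself only gives day and month.
YEAR = 2008

# (start, end) of each degradation window in UTC, taken from the report text.
windows = {
    "first degradation": (datetime(YEAR, 10, 31, 16, 30), datetime(YEAR, 11, 1, 2, 0)),
    "second glitch": (datetime(YEAR, 11, 1, 6, 30), datetime(YEAR, 11, 1, 7, 30)),
}


def hours(start: datetime, end: datetime) -> float:
    """Return the length of the interval [start, end] in hours."""
    return (end - start) / timedelta(hours=1)


durations = {name: hours(start, end) for name, (start, end) in windows.items()}
total_hours = sum(durations.values())

for name, length in durations.items():
    print(f"{name}: {length:.1f} h")
print(f"total: {total_hours:.1f} h")
```

Under these assumptions the service was degraded for roughly 9.5 hours in the first window and 1 hour in the second.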

Detail

The dCache srm head node (dcsrm02.pic.es) and the pnfs server (dcns03.pic.es) were under high load. Requests to dcsrm02 timed out, and network map scans of the service port (8443) frequently returned "filtered", meaning that the service was not answering new TCP connections. The queues on the pnfsManager (the dCache component running on the pnfs server) were relatively high (over 100 queued queries) for some of the threads.
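The port check described above can be approximated with a plain TCP connect attempt. The following is a minimal sketch, not the tool that was actually used at PIC: the function name `probe_tcp` and the three-way classification are illustrative assumptions. A connect that times out is reported as "filtered", which is also what an overloaded server that silently drops new connection attempts would look like from the outside.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify a TCP port with a single connect attempt (hypothetical helper).

    Returns "open" if the connection is accepted, "closed" if it is actively
    refused, "filtered" if nothing answers within `timeout` seconds, and
    "error" for any other failure (for example, name resolution).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "closed"
    except socket.timeout:
        # No reply at all: dropped by a firewall or by an overloaded host.
        return "filtered"
    except OSError:
        return "error"


if __name__ == "__main__":
    # Host and port are the ones named in the report.
    print(probe_tcp("dcsrm02.pic.es", 8443))
```

Run periodically, such a probe gives a rough timeline of when the service stopped and resumed accepting connections, independently of the SRM client tools.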

Actions

The following actions were taken:

- Restart of the srm server
- Restart of the dCache head nodes (poolmanager, admin domain, location manager)
- Complete reboot of the srm server

The system eventually became responsive again, but the cause was external (a decrease in load from the application) rather than any of our actions.

Follow-up

Understand how to improve pnfs performance to avoid pnfsManager thread queues, which generate heavy load and timeouts.
Try different performance approaches for both the pnfs and srm servers:
- Upgrading the srm server to a 64-bit machine with a 64-bit Java virtual machine
- Upgrading the pnfs server to a faster version
- Upgrading the pnfs PostgreSQL database to 8.3, from which a performance boost is expected
Understand the effect of the checks that FTS performs when transferring data.

-- OlofBarring - 16 Apr 2009

Topic revision: r1 - 2009-04-16 - OlofBarring