Second meeting on "busy" storage services (26/02/2009)

Participants

J.P. Baud, G. Behrman, B. Bockelman, F. Donno, A. Frohner, E. Lanciotti, G. Lo Presti, L. Magnoni, A. Sciabà, A. Sim, D. Smith, R. Zappi

Minutes

We started by going through the conclusions of the previous meeting.

Gerd commented that one should not explicitly say that an SRM server in a "busy" condition should return SRM_INTERNAL_ERROR: it is implicit that it should do so when an excessive load on the server or the backend generates transient errors. ACTION: rephrase in the Twiki.

It was agreed to remove the item:

  • The SRM server SHOULD clean up a request after a reasonable time, but not before the request is completed, failed or aborted.

It was pointed out that CASTOR, dCache and DPM do not abort ongoing transfers when a request is aborted.

It was suggested to change item 3. as follows:

  • The SRM server SHOULD return a remainingTotalRequestTime. If it does, remainingTotalRequestTime MUST be less than or equal to the time until the request times out.

It was agreed to modify item 4. as follows:

  • [...] The status of the single files MAY NOT be returned.

In case of SRM_INTERNAL_ERROR returned by the server, the following sentence is preferred:

  • If the client application receives an SRM_INTERNAL_ERROR from the SRM server, it MAY repeat the request. If it does, it SHOULD use a randomized exponential retry time.
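A randomized exponential retry time could look like the following minimal sketch. The function name, the base delay, the growth factor, the cap and the number of attempts are all illustrative assumptions; none of them was fixed at the meeting.

```python
import random


def retry_delays(base=1.0, factor=2.0, cap=300.0, max_attempts=6, rng=random):
    """Yield one randomized delay (in seconds) per retry attempt.

    The upper bound grows exponentially with the attempt number and is
    limited by `cap`; the actual delay is drawn uniformly below that bound,
    so that many clients hitting a busy server do not retry in lockstep.
    All default values are hypothetical.
    """
    for attempt in range(max_attempts):
        upper = min(cap, base * factor ** attempt)
        yield rng.uniform(0.0, upper)


if __name__ == "__main__":
    for i, delay in enumerate(retry_delays()):
        print(f"retry {i}: wait {delay:.2f} s")
```

Drawing the delay from the whole interval below the exponential bound is only one possible randomization; adding a small random offset to a fixed exponential delay would satisfy the same wording.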

It was discussed whether, when the client specifies too high a value for desiredTotalRequestTime, the server should cap that value or return SRM_INVALID_REQUEST. The conclusion was that the former is preferable.
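The preferred behaviour, capping instead of rejecting, can be sketched as follows. The limit and the function name are hypothetical and do not correspond to any existing SRM implementation.

```python
# Hypothetical server-side limit, in seconds; no value was agreed at the meeting.
MAX_TOTAL_REQUEST_TIME = 4 * 3600


def cap_request_time(desired, limit=MAX_TOTAL_REQUEST_TIME):
    """Return the total request time the server will actually grant.

    A value above `limit` is silently reduced to `limit` instead of
    failing the request with SRM_INVALID_REQUEST.
    """
    return min(desired, limit)


if __name__ == "__main__":
    print(cap_request_time(10 ** 9))  # reduced to the limit
    print(cap_request_time(600))      # accepted unchanged
```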

Akos mentioned that GFAL/lcg-util allows the user to set desiredTotalRequestTime, while FTS sets it to a very high value.

Gerd said that dCache does not honour desiredTotalRequestTime. ACTION: to check with Timur.

Finally, the general agreement was to drop any recommendation about desiredTotalRequestTime, both on the client and the server side. After the meeting, Timur suggested that clients specify a desiredTotalRequestTime, because for dCache not doing so is equivalent to specifying a possibly small default value, and because dCache currently invalidates TURLs when the request times out.

Then, we moved to the "questions to be answered" from the previous meeting.

The first two became irrelevant, given the agreement on desiredTotalRequestTime.

It was discussed how quickly the storage systems report that a file is LOST or UNAVAILABLE. Giuseppe said that in CASTOR this information is immediately available (for example, if a file is on a disk which is offline).

Regarding the advantages and disadvantages of srmStatusOf... calls compared to srmLs, Gerd said that he thinks they have comparable performance, but he should check.

About the polling from SRM to the backend in CASTOR, Giuseppe said that it was intended to protect the backend; he is not committing to any change, for the moment.

At the end, the situation was summarized.

Akos said that FTS 2.2 will not have any of the agreed changes, and the priority is now on the checksum checks. The coding for the issue under discussion will begin in April, and at least three months are expected before it is in production. Remi will not implement anything before mid-March either.

Gerd did not commit to anything concerning the dCache client, waiting for Timur.

The StORM developers said that the StORM client does not contain any retry logic. They expect to have the agreed changes (SRM_INTERNAL_ERROR, meaningful estimatedWaitTime) implemented in the server by September.

A technical plan will be prepared and submitted to the relevant bodies (EGEE TMB, WLCG MB).

Conclusions

Prescriptions for the SRM server

Synchronous requests

  1. The SRM server MUST return SRM_INTERNAL_ERROR when it is experiencing transient problems (possibly caused by an excessive load on the server or the backend).

Asynchronous requests

  1. The SRM server MUST return SRM_INTERNAL_ERROR when it is experiencing transient problems (possibly caused by an excessive load on the server or the backend). The status of the single files MAY NOT be returned.
  2. If the request will be processed (request status equal to SRM_REQUEST_QUEUED or SRM_REQUEST_INPROGRESS), the SRM server SHOULD return an estimatedWaitTime for each file in the request to tell the client when the next polling SHOULD happen in order to have a new update on the status of each file.
  3. The initial value of remainingTotalRequestTime, if returned, MAY differ from the desiredTotalRequestTime specified by the client.

Prescriptions for the polling algorithm of a client application

Synchronous requests

  1. If the client application receives an SRM_INTERNAL_ERROR from the SRM server, it MAY repeat the request. If it does, it SHOULD use a randomized exponential retry time.

Asynchronous requests

  1. If the client application receives an SRM_INTERNAL_ERROR from the SRM server, it MAY repeat the request. If it does, it SHOULD use a randomized exponential retry time.
  2. The client application SHOULD poll the status of a request again after a time of the order of the estimatedWaitTime of the files in the request, if available, or after an exponential polling time if the typical estimatedWaitTime is -1 or undefined.
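The choice of the next polling time for an asynchronous request can be sketched as below. This is only an illustration: using the smallest known estimatedWaitTime as the "typical" value is an assumption, as are the function name, the floor, the base delay and the cap.

```python
def next_poll_delay(estimated_wait_times, attempt, base=2.0, cap=600.0, floor=1.0):
    """Return the number of seconds to wait before the next status poll.

    `estimated_wait_times` holds the estimatedWaitTime reported for each
    file; None or a negative value (such as -1) means "unknown".
    If at least one value is known, the smallest one is used (assumption),
    but never less than `floor`. Otherwise the delay grows exponentially
    with `attempt`, up to `cap`. All defaults are hypothetical.
    """
    known = [t for t in estimated_wait_times if t is not None and t >= 0]
    if known:
        return max(floor, min(known))
    return min(cap, base * 2 ** attempt)


if __name__ == "__main__":
    print(next_poll_delay([30, -1, None], attempt=0))  # server gave an estimate
    print(next_poll_delay([-1, None], attempt=3))      # fall back to exponential polling
```

A client that also wants to avoid synchronized polling could randomize the fallback delay in the same way as the retry time after SRM_INTERNAL_ERROR.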

Timur suggested reintroducing this prescription:

  1. A client application SHOULD specify a desiredTotalRequestTime under the assumption that the SRM server SHOULD time out the request only after the desiredTotalRequestTime has elapsed. Note that the current versions of dCache will invalidate the TURLs produced by srmPrepareToPut and srmPrepareToGet requests when the request times out. Future releases will invalidate a TURL only when the pin time expires.

-- Flavia Donno, Akos Frohner, Elisa Lanciotti and Andrea Sciaba - 17 Feb 2009
