Guidelines for the GRID user

The GRID is a massively distributed, multi-user, inhomogeneous system for the storage and analysis of data. We use it in ALICE for simulation, reconstruction and data analysis. It is a very powerful tool, to which every ALICE member is given full access, and its properties make it very useful in scientific work. However, since it is shared equally between all ALICE users, the actions of each user influence the GRID experience of all the others. We therefore ask you to use it responsibly, keeping in mind that the GRID has its limitations. To help you, we present these simple guidelines for GRID usage. They come from several years of experience in using the complex GRID system. They are not hard rules but rather important points that you should keep in mind.

Be aware of the current GRID status. Use the MonALISA monitoring page.

Before starting to use the GRID, or any time you intend to run a significant job, please consult the ALICE Grid monitoring page: MonALISA. In particular, inspect the running job counter and the Storage Element (SE) status page. After a few tries you will become familiar with what the "normal" state looks like and will be able to quickly recognize when the GRID is in an abnormal state. Also follow the ALICE-off and alice-project-analysis-task-force mailing lists, where important announcements about the GRID status are sent.

Always test your jobs on a small sample before submitting a large number of jobs

Use the many debugging tools provided by AliEN to debug your jobs. Always run a small test job and validate its results before launching a big analysis or simulation job.

When submitting jobs, do not submit more than 1000 subjobs at once

The submission of a job produces a short spike in the usage of the central AliEN services, so submitting a lot of jobs at once can congest them. Submit your jobs in batches instead, preferably submitting the next batch only once the previous one has been properly split.
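The batching advice above can be sketched in Python as follows. This is only an illustration under stated assumptions: `submit` and `wait_until_split` are hypothetical callables that you would have to implement yourself with whatever AliEN tools you use; they are not part of any AliEN API. The limit of 1000 is taken from the guideline above.

```python
from typing import Callable, Iterator, List, Sequence

MAX_SUBJOBS_PER_BATCH = 1000  # limit taken from the guideline above


def batches(items: Sequence[str],
            size: int = MAX_SUBJOBS_PER_BATCH) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def submit_in_batches(inputs: Sequence[str],
                      submit: Callable[[List[str]], int],
                      wait_until_split: Callable[[int], None],
                      size: int = MAX_SUBJOBS_PER_BATCH) -> int:
    """Submit one batch at a time and return the number of batches.

    `submit` and `wait_until_split` are hypothetical placeholders:
    the first should submit one batch and return a job identifier,
    the second should block until that job has been split.
    """
    count = 0
    for batch in batches(inputs, size):
        job_id = submit(batch)      # hypothetical: submit one batch
        wait_until_split(job_id)    # hypothetical: wait for the split
        count += 1
    return count
```

Waiting for the split before sending the next batch keeps the load on the central services spread out over time instead of concentrated in one spike.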

Do not be alarmed if you see a success rate slightly below 100%

The GRID is a very complex and distributed system. For a job to finish properly, many independent pieces must work together and all must succeed. It is not unexpected that sometimes one of the elements fails and the job does not finish, so some number of errors is to be expected. As a general rule, a 95% success rate for simulation jobs and a 90% success rate for analysis jobs that access data from the SEs are acceptable levels. Success rates lower than that should be investigated, but always keep in mind the first guideline: check the current GRID status.
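As a rough illustration of the thresholds above, the check can be written as a few lines of Python. The function and dictionary names are made up for this sketch, and the counts of finished and failed subjobs are assumed to be read by you from the job monitoring.

```python
# Acceptable success rates quoted in the guideline above.
ACCEPTABLE_SUCCESS_RATE = {
    "simulation": 0.95,
    "analysis": 0.90,  # analysis jobs that access data from the SEs
}


def success_rate(done: int, failed: int) -> float:
    """Fraction of finished subjobs that succeeded."""
    total = done + failed
    if total == 0:
        raise ValueError("no finished subjobs to evaluate")
    return done / total


def needs_investigation(job_type: str, done: int, failed: int) -> bool:
    """True if the success rate is below the acceptable level."""
    return success_rate(done, failed) < ACCEPTABLE_SUCCESS_RATE[job_type]
```

For example, an analysis job with 890 successful and 110 failed subjobs has an 89% success rate, which is below the 90% level and worth a closer look.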

Avoid using par files whenever possible

The usage of par files requires compilation of code. The GRID consists of computing sites, each of which can have its own software and architecture configuration. In such a scenario it is difficult to ensure a consistent compilation environment, so the compilation of par files may fail quite often. The officially prepared software packages, on the other hand, are built centrally and do not suffer from these issues. It is therefore advisable to commit your analysis code to SVN and have it officially distributed rather than using par files.

Monitor your jobs' execution

Use the AliEN tools to monitor your jobs' execution. Take action if you notice strange behavior, a large error rate or extremely long wait times.

-- AdamKisiel - 2009-09-25
