Introduction

This is a small CREAM stress test suite written in Python. It provides stress testing of the CREAM service, plus a utility script that monitors target hosts and/or services, gathers the data and then plots it. It is a work in progress and more features will be included in the future. The test suite supports Scientific Linux 6. If an SL5-compliant version is required, please feel free to contact me. It can also be easily ported to any Linux platform; anyone wanting to use the test suite in an unsupported environment is likewise welcome to contact me.

For news and more info, you can follow us on Twitter and Posterous!

Scientific Linux 6 Installation

The test suite has to be run on a host that is able to submit jobs to a CREAM endpoint; an EMI-UI is ideal for the job. To install the test suite, execute as root:

wget http://yum.gridctb.uoa.gr/repository/robot_testing_sl6.repo -O /etc/yum.repos.d/robot_testing_sl6.repo
yum install cream_stress_test

(Please note that the dependencies are taken care of automatically. Also, in previous versions, the robot_testing repository had to take precedence over the standard SL repos. This is no longer the case.)

Scientific Linux 6 Dependencies

The available packages and their dependencies are:

Package	Dependency
cream_stress_test	python-argparse
	python-matplotlib
	pexpect

All dependencies are either installed by default or available in the standard repos (SL, EPEL, etc.).

Installed files

The CREAM test suite installs the following files:

File	Description
/opt/cream_stress_test/bin/cream_stress_test.py	The stress test script
/opt/cream_stress_test/bin/cream_stress_statistics.py	The monitor script
/opt/cream_stress_test/docs/cream_stress_test.7.gz and /usr/share/man/man7/cream_stress_test.7.gz	The stress test's man page
/opt/cream_stress_test/docs/cream_stress_statistics.7.gz and /usr/share/man/man7/cream_stress_statistics.7.gz	The monitor's man page
/opt/cream_stress_test/docs/COPYING	Copyright document
/opt/cream_stress_test/docs/CHANGELOG	The test suite's changelog

Also, once installed, the man pages "cream_stress_test" and "cream_stress_statistics" are available.

Deployment

No special steps are needed to deploy the test.

Test Execution

Stress

Arguments

First of all, let's check the script's arguments:

Argument	Description
-h --help	Show the help message
-n --numjobs	The total number of jobs to run. Note that this value is actually ignored when running in constant mode (see below).
-c --concurrentSubmitters	The total number of submitter processes to create. Defaults to 1. WATCH OUT: if this argument is very high, the system will deadlock due to memory shortage (e.g. 160 processes on a host with 512 MB of RAM).
-s --slots	The number of jobs each submitter process will submit/manage concurrently. Defaults to 1.
-t --testype	The type of test to run. Either "submit" or "cancel". More test types will be added in the future.
-j --jdl	Path to the jdl file to use. If the argument is omitted, a sample built-in jdl running uname will be used. If "sleep" is given, a jdl which sleeps for --sleep seconds will be created. Note that the --sleep argument defaults to 0, so it must be specified explicitly if the jdl is supposed to actually sleep for some time.
--sleep	The time to sleep, in case of sleep jdl being used. Defaults to 0 seconds.
-e --endpoint	The endpoint where the jobs are submitted. Example: cream.uoa.gr:8443/cream-pbs-dteam
-d --delegation	The type of delegation to use for the job submission. Either "multiple" or "single". Defaults to "multiple".
--nocolor	Specify this flag to turn off colored output (on by default)
--noprint	Specify this flag to turn off printing to the screen.
--constant	Specify this option to keep a constant queue of the number of jobs given to the -n option. Its integer argument is the time, in minutes, for which the script will submit jobs. Note that the queue will be sustained for an undetermined amount of time, since many variables affect the total number of submitted jobs. Generally, with many submitters the queue grows slowly, whereas with a single submitter and a large -s (slots) argument the queue grows quickly. Note that if the number of concurrent submitters isn't a divisor of the number of jobs, the number of slots (-s) is forcibly set to numJobs/concSubmitters and numJobs is then forcibly set to slots*concSubmitters.
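
The forced adjustment of -s and -n described for --constant can be illustrated with a little arithmetic. The following is only a sketch and is not taken from the test suite's source: the function name normalize_constant_args is hypothetical, and it assumes that numJobs/concSubmitters is an integer (floor) division.

```python
def normalize_constant_args(num_jobs, conc_submitters, slots):
    """Return (slots, num_jobs) after the adjustment described for --constant.

    ASSUMPTION: numJobs/concSubmitters is an integer (floor) division.
    """
    if conc_submitters < 1:
        raise ValueError("at least one submitter is required")
    if num_jobs % conc_submitters == 0:
        # The submitters divide the jobs evenly: the documented
        # adjustment does not apply, so both values are returned as given.
        return slots, num_jobs
    forced_slots = num_jobs // conc_submitters
    forced_jobs = forced_slots * conc_submitters
    return forced_slots, forced_jobs


# 30 submitters do not divide 1000 jobs evenly.
print(normalize_constant_args(1000, 30, 10))   # (33, 990)
# 125 submitters divide 5000 jobs evenly, so nothing changes.
print(normalize_constant_args(5000, 125, 40))  # (40, 5000)
```

Under that assumption, asking for 1000 jobs with 30 submitters would leave each submitter with 33 slots and a queue of 990 jobs.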

Test Types

The test types supported are "submit" and "cancel", in two modes (simple and "constant"), giving a total of 4 basic types of tests.

Simple Submit	During this test, a total of -n jobs will run and then the test will exit.
Constant Submit	During this test, jobs will be submitted for --constant minutes, and then they will be collected (this step takes as long as it needs, independently of the --constant argument). Then the test will exit.
Simple Cancel	During this test, a total of -n jobs will be submitted and then cancelled. Then the test will exit.
Constant Cancel	During this test, jobs will be submitted and then cancelled for --constant minutes, and then they will be collected (this step takes as long as it needs, independently of the --constant argument, though less time than in the "constant submit" counterpart). Then the test will exit.

Submitters and Slots

The number of submitters and slots can lead to very different behaviour (and load applied to the CREAM subsystems). As previously stated, the number of submitter processes must not be too large, or the system will eventually run out of memory. Also, with fast-exiting jobs, if there are too many slots per submitter, finished jobs will be left unaccounted for until the submitter's slots are filled (submitting jobs always takes priority over other operations). Furthermore, perhaps contrary to the initial impression, many submitters with many slots will not fill a CREAM endpoint more quickly than a single submitter with a huge number of slots. This is because all the submitter processes run on the same host and ultimately use the same resources: the more of them are running, the more they compete for these resources, spending more and more time on function overhead and context switching.

A scenario that is still unexplored is the effect that multiple actual users submitting jobs concurrently would have on the test. All in all, the best setup for these two arguments varies per site and must be found by experiment, in order to arrive at the best values for the testing scenario you want to implement for your site and/or service.

Examples

1. Use 125 submitter processes, each managing 40 job slots, submitting 60 second sleep jobs, using a single delegation, until 5000 jobs are submitted.

cream_stress_test.py -d single -n 5000 -c 125 -s 40 -e ctb04.gridctb.uoa.gr:8443/cream-pbs-see -t submit --jdl sleep --sleep 60

2. Same as above, but instead of only submitting, also cancel each job.

cream_stress_test.py -d single -n 5000 -c 125 -s 40 -e ctb04.gridctb.uoa.gr:8443/cream-pbs-see -t cancel --jdl sleep --sleep 60

3. Use 127 submitters, each managing 7 jobs concurrently, submitting and then cancelling the jdl in the given path, until 1924 jobs are cancelled.

cream_stress_test.py -d multiple -n 1924 -c 127 -s 7 -e ctb04.gridctb.uoa.gr:8443/cream-pbs-see -t cancel -j ~/path/to/jdl/a_jdl.jdl

4. Run in constant submit mode for 100 minutes (note that the -n 1 argument is ignored), using 50 submit processes with slot size 15, using single delegation and the given jdl.

cream_stress_test.py -d single -n 1 -c 50 -s 15 -e ctb04.gridctb.uoa.gr:8443/cream-pbs-see -t submit -j ~/path/to/jdl/a_jdl.jdl --constant 100

Monitor

Arguments

These are the script's arguments:

-h --help	Show help message
-d --delay	The delay between monitoring operations, in seconds
-s --savestats	If this flag is set, the gathered statistics are saved in a text file under /tmp, in a parse-friendly format. Data are stored on a per-host basis.
watchlist	This is the main argument to the script. First and foremost, it must be given inside single quotes (') to prevent the shell from interpreting the metacharacters it uses as delimiters. It has the following format: 'user,host,port{commands}[procs]:...' where "commands" and "procs" look like: {vmstat,iostat,sar,qstat}[java,BUpdaterPBS]. The only available options for the "commands" part are "vmstat", "iostat", "sar" and "qstat". The "procs" part can contain any process running on the target host (actually any string; if it doesn't exist as a process on the target host, no data are gathered). The user, host, port and "commands" parts are mandatory; "procs" is optional. The brackets "[", "]" and curly brackets "{", "}" MUST be used! The watchlist must ALWAYS be passed inside single quotes.
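
Since the watchlist format is easy to get wrong by hand, a small helper can assemble it. The sketch below is not part of the test suite: the function names watchlist_entry and build_watchlist are hypothetical. The text does not say whether the square brackets may be left out or left empty when no processes are watched, so the sketch always emits them, which is an assumption.

```python
import shlex

ALLOWED_COMMANDS = ("vmstat", "iostat", "sar", "qstat")


def watchlist_entry(user, host, port, commands, procs=()):
    """Build one 'user,host,port{commands}[procs]' entry."""
    if not commands:
        raise ValueError("the commands part is mandatory")
    unknown = [c for c in commands if c not in ALLOWED_COMMANDS]
    if unknown:
        raise ValueError("unsupported commands: " + ", ".join(unknown))
    target = ",".join([user, host, str(port)])
    # ASSUMPTION: empty brackets are emitted when no processes are given.
    return target + "{" + ",".join(commands) + "}" + "[" + ",".join(procs) + "]"


def build_watchlist(entries):
    """Join the per-host entries with ':' as in the documented format."""
    return ":".join(entries)


watchlist = build_watchlist([
    watchlist_entry("root", "cream.gridctb.uoa.gr", 22,
                    ["vmstat", "iostat", "sar"], ["java", "BUpdaterPBS"]),
    watchlist_entry("root", "db.gridctb.uoa.gr", 22,
                    ["vmstat", "iostat", "sar"], ["mysqld"]),
])
# shlex.quote wraps the value in single quotes for pasting into a shell.
print(shlex.quote(watchlist))
```

Printing the value through shlex.quote gives a string that is already wrapped in single quotes, ready to be pasted after the monitor script's name.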

Gathered Statistics

These are the available gathered statistics:

iostat	Read and write in KB per second, for all the used hard disk devices
vmstat	Number of processes in ready and in sleeping state. Free, cached, buffered and swap memory. User, system, iowait and idle cpu utilization. System interrupts and context switches per second.
sar	The receive and transmit rates in KBps for all registered interfaces
ps	The cpu and size of a process, as reported by the ps program
qstat	The number of jobs queued at Torque, at the given time.

Examples

1. Gather data from vmstat, iostat, sar and monitor java, BUpdaterPBS, BNotifier on cream.gridctb.uoa.gr. Gather data from vmstat, iostat, sar and monitor mysqld on db.gridctb.uoa.gr. Gather data from vmstat, iostat, sar, qstat and monitor pbs_server, maui, munged on lrms.gridctb.uoa.gr. Update the monitor data every 5 seconds. Also, at the end, save the collected data.

cream_stress_statistics.py 'root,cream.gridctb.uoa.gr,22{vmstat,iostat,sar}[java,BUpdaterPBS,BNotifier]:root,db.gridctb.uoa.gr,22{vmstat,iostat,sar}[mysqld]:root,lrms.gridctb.uoa.gr,22{vmstat,iostat,sar,qstat}[pbs_server,maui,munged]' -d 5 -s

Combine

Finally, if you want to use both scripts simultaneously (which is the best available option!), run the monitor script in one terminal, wait a couple of minutes so that it gathers idle statistics for the system under test, and then start the stress test in another terminal. Once the stress test has finished, wait again for a couple of minutes (or maybe even more, depending on the load applied to the subsystems), in order to gather statistics for the time it takes to return to the same levels of resource utilization as in the idle state before the stress test.
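
The same sequence could also be driven from a single Python script instead of two terminals. This is only a rough sketch under stated assumptions: the helper names (monitor_command, stress_command, run_combined) are hypothetical, both scripts are assumed to be on the PATH, and the monitor is assumed to finish up when it receives an interrupt (as with Ctrl-C), which this page does not confirm. The flags used are the ones documented above.

```python
import signal
import subprocess
import time

# Script names as used in the examples above; assumed to be on PATH.
MONITOR = "cream_stress_statistics.py"
STRESS = "cream_stress_test.py"


def monitor_command(watchlist, delay=5):
    """Monitor invocation: watchlist, delay between samples, save statistics."""
    # Passed as a list, so no shell is involved and no quoting is needed.
    return [MONITOR, watchlist, "-d", str(delay), "-s"]


def stress_command(endpoint, num_jobs, submitters, slots, sleep=60):
    """Simple submit test with sleep jobs and a single delegation."""
    return [STRESS, "-d", "single", "-n", str(num_jobs),
            "-c", str(submitters), "-s", str(slots), "-e", endpoint,
            "-t", "submit", "--jdl", "sleep", "--sleep", str(sleep)]


def run_combined(watchlist, endpoint, idle_seconds=120):
    """Start the monitor, idle, run the stress test, idle again, stop."""
    monitor = subprocess.Popen(monitor_command(watchlist))
    try:
        time.sleep(idle_seconds)  # baseline statistics before the load
        subprocess.run(stress_command(endpoint, 5000, 125, 40), check=True)
        time.sleep(idle_seconds)  # let utilization settle back to idle
    finally:
        # ASSUMPTION: the monitor finishes up when interrupted (Ctrl-C).
        monitor.send_signal(signal.SIGINT)
        monitor.wait()
```

Calling run_combined with a watchlist string and an endpoint such as ctb04.gridctb.uoa.gr:8443/cream-pbs-see would then reproduce the manual procedure; the idle period of 120 seconds is just a placeholder for "a couple of minutes".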