Ideas regarding co-scheduling

Introduction

The objective of this note is to explore and capture a possible component-level use-case and implementation of the co-scheduling feature of ETICS. The overall goal is for ETICS users to be able to execute tests on several resources (i.e. machines) at the same time, following a user-defined deployment. In other words, the idea is to dynamically deploy a test-bed on generic resources prior to the execution of the test-suite, gather the test results and release the resources.

This note doesn't address the user interface; it focuses on the server-side activities between the B&T WS and Metronome, as well as on what happens as part of the job execution.

Upgrade to the ETICS data model

In order to execute the test, including the assignment of the nodes and the deployment of the required services, an upgrade to the ETICS data model is required.

Here's a possible extension:

For a multi-node test, the user defines a deployment. This deployment contains one or several nodes. Each node, in turn, contains one or several configurations. For example:

deployment
   - node1
      - configuration1
      - configuration2
   - node3
      - configuration1
      - configuration3
   - node4
      - configuration4
      - configuration5
   - node5
      - configuration3
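
For illustration only (this is just one possible rendering of the example above, not part of the current data model), such a deployment could be represented as a simple nested structure:

# Hypothetical representation of the deployment example above
deployment = {
    'node1': ['configuration1', 'configuration2'],
    'node3': ['configuration1', 'configuration3'],
    'node4': ['configuration4', 'configuration5'],
    'node5': ['configuration3'],
}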

The idea of using configurations as the container of deployment information has pros and cons compared, for example, to packages. One advantage is that, using configurations, a deployment can be platform independent (e.g. RPMs cannot be deployed on Windows). On the other hand, if a configuration generates several custom packages, how do we fetch the right list of packages?

We also need to upgrade the TestCommand type to possibly include targets such as:

  • clean
  • init
  • deploy
  • test
  • publish
  • finalise

with pre/post commands for each of them.
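
As a sketch only (the class and attribute names below are assumptions, not the actual schema), the extended TestCommand could carry an ordered list of targets, each with optional pre and post commands:

# Illustrative sketch of an extended TestCommand (names are assumptions)
class TestTarget:
    def __init__(self, name, pre=None, cmd=None, post=None):
        self.name = name    # e.g. 'clean', 'init', 'deploy', 'test', 'publish', 'finalise'
        self.pre = pre      # optional pre-command
        self.cmd = cmd      # the command itself
        self.post = post    # optional post-command

class TestCommand:
    def __init__(self, targets):
        self.targets = targets  # ordered list of TestTarget objects, executed in sequence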

Taking this deployment structure as the input to a test, let's look at a possible use-case.

USECASE: Deploy and execute a multi-node test-suite

CHARACTERISTIC INFORMATION

Goal: Execute a test-suite following a multi-node deployment model

Scope: Server and execution engine side

Preconditions: A deployment model is provided. All the required platforms are available. All packages have been built for the required platforms.

Success End Condition: Test report is generated and available to the submit node, all used resources are released.

Failed End Condition: Same as "Success End Condition" (a test report is still generated and available to the submit node, and all used resources are released)

Primary Actor: build and test WS

Trigger: This use-case is triggered by the submit method of the B&T WS


MAIN SUCCESS SCENARIO

  1. B&T WS crafts a multi-node submit file (based on the node info in the deployment model) - see section Sample nmi submit file for an example
  2. B&T WS submits to Metronome
  3. Metronome match-makes and allocates the required resources based on the submit file info
  4. All nodes register their hostname with the info system:
    • host_i = '<ip address or hostname>' (where i is the index of the node allocated by Metronome)
  5. The following part of the execution runs on each individual node
  6. The job waits for the other nodes to have set their hostname (the job needs to be able to fetch the number of nodes allocated, or the list of node indexes this node has a dependency on)
  7. Execute the init target
  8. Set in the info system that the current init target has completed
  9. Wait for all nodes to have completed the execution of the init target (e.g. blocking read of the value set by other nodes)
  10. Execute the test target
  11. Set in the info system that the current test target has completed
  12. Wait for all nodes to have completed the execution of the test target (e.g. blocking read of the value set by other nodes)
  13. Execute the finalise target
  14. Set in the info system that the current finalise target has completed
  15. Wait for all nodes to have completed the execution of the finalise target (e.g. blocking read of the value set by other nodes)
  16. Exit the job
  17. End of the part of the execution that runs on each individual node
  18. Metronome releases the resources
  19. Metronome gathers the reports.tar.gz back to the submit node
  20. post_all script consolidates the reports from all the nodes
  21. post_all registers the artefacts with the repository
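
As a rough sketch of steps 20 and 21 (file names, naming convention and layout are assumptions), the post_all consolidation could look like this:

#!/usr/bin/env python
# Hypothetical post_all sketch: unpack each node's reports archive into one
# directory tree and repackage it for registration with the repository.
# Assumes Metronome has placed one reports*.tar.gz per node in the current directory.
import glob
import os
import tarfile

merged = 'consolidated-reports'
os.mkdir(merged)
for archive in glob.glob('reports*.tar.gz'):
    node = os.path.basename(archive).replace('.tar.gz', '')
    tarfile.open(archive).extractall(os.path.join(merged, node))
out = tarfile.open('consolidated-reports.tar.gz', 'w:gz')
out.add(merged)
out.close()
# registration of the consolidated artefact with the repository would follow here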

EXTENSIONS

  • (any step) If a step fails during job execution, the default behaviour is to abort the job execution. This requires that, at every blocking read (i.e. synchronisation point), the read implementation checks for errors and triggers an abort (i.e. exits the job); see the sketch below.
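
As a sketch of such a synchronisation point (the getter is passed in as a parameter, since the actual info-system API is still an open question), a blocking read with error checking could look like:

# Hypothetical synchronisation point: poll the info system until the given node
# reports the target as done, aborting the job on error or time-out
import sys
import time

def wait_for(get_status, node_index, target, poll_interval=10, timeout=3600):
    waited = 0
    while waited < timeout:
        status = get_status('status' + str(node_index))
        if status == target + 'Done':
            return
        if status == 'error':
            sys.exit(1)        # abort the whole job on a remote failure
        time.sleep(poll_interval)
        waited += poll_interval
    sys.exit(1)                # abort on time-out as well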


NOTES

  • The "info system" could be implemented using ClassAds or a simple WS with setter and getter methods; a minimal sketch of the latter is shown below.
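
For illustration only (the class and method names are assumptions), a simple info system with setter and getter methods could be little more than a key/value store scoped by a test-run identifier; that scoping is also one way to satisfy the "boxing" constraint below:

# Minimal sketch of an info system scoped by test run
class InfoSystem:
    def __init__(self):
        self.store = {}                        # {(run_id, key): value}

    def set(self, run_id, key, value):
        self.store[(run_id, key)] = value

    def get(self, run_id, key, default=None):
        return self.store.get((run_id, key), default)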


CONSTRAINTS AND REQUIREMENTS

  • The info system must "box" the information for each test run (in other words, a job cannot set or get a value belonging to a job that is not part of the current test run)

QUESTIONS

  1. Do we want to express dependencies between nodes and/or services?
  2. If yes, do we need it right from the start?
  3. Can we (re)use inter-configuration dependencies?
  4. Are we supporting cross sites testing?
  5. How do we consolidate the reports from several nodes?

-- MebSter - 15 May 2007

SAMPLE NMI SUBMIT FILE

Here's the multi-node submit file GuillermoDiezAndinoSancho wrote last summer with MarianZUREK. This submit file executed successfully in parallel on 2 nodes.

We might want to add to the submit file the total number of nodes required, which might help the different jobs to find each other.
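
For example, a line such as the following could be added (the attribute name is an assumption, not an existing attribute of this submit file format):

# hypothetical attribute: total number of nodes in this deployment
node_count = 2

The submit file itself, which does not yet include such an attribute, follows: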

project = System scheduling
component = WS-UTILS
description = testing WS and UTILS
version = 0.1
inputs = /tmp/scripts.scp,/tmp/scripts.scp
platforms = (x86_slc_3,x86_slc_3)
                                                                                                                                           
register_artefact = false
                                                                                                                                           
#Job 0
project_0 = org.etics
component_0 = org.etics.build-system.webservice
version_0 = org.etics.build-system.webservice.HEAD
checkout_options_0 = --project-config=org.etics.HEAD
submit_options_0 =
                                                                                                                                           
service_name_0 = WS
depends_0 = UTILS
remote_declare_args_0 = 0
remote_declare_0 = remote_declare
remote_task_0 = remote_task
remote_post_0 = remote_post
post_all_0 = post_all
post_all_args_0 = x86_slc_3
                                                                                                                                           
#Job 1
project_1 = org.etics
component_1 = org.etics.build-system.java-utils
version_1 = org.etics.build-system.java-utils.HEAD
checkout_options_1 = --project-config=org.etics.HEAD
submit_options_1 =
                                                                                                                                           
                                                                                                                                           
service_name_1 = UTILS
depends_1 =
remote_declare_args_1 = 1
remote_declare_1 = remote_declare
remote_task_1 = remote_task
remote_post_1 = remote_post
post_all_1 = post_all
post_all_args_1 = x86_slc_3
                                                                                                                                           
platform_pre_0 = noop.sh
platform_post_0 = noop.sh
                                                                                                                                           
platform_pre_1 = noop.sh
platform_post_1 = noop.sh
                                                                                                                                           
run_type = build
notify = Guillermo.Sancho@cern.ch
                                                                                                                                           
# EVENTUALLY ENABLE but disabling for early testing so as not to have to
# login all the time and remove root-expiry file.
# run_as_root = true
                                                                                                                                           
# try to match based on sudo_enabled
append_requirements = sudo_enabled==TRUE
                                                                                                                                           
reimage_delay = 60m

remote_declare FILE

This is the remote_declare task script, which works in conjunction with the submit file shown above.

This task generates the task.sh file.

#!/usr/bin/env python
import os
import sys
                                                                                                                                           
tasklist = open('task.sh','w')
taskIndex = sys.argv[1]
tasklist.write('taskIndex=%s\n' %sys.argv[1])
tasklist.write('wget "http://eticssoft.web.cern.ch/eticssoft/repository/org.etics/client/0.3.1/noarch/etics-workspace-setup.py" -O etics-workspace-setup\n')
tasklist.write('python etics-workspace-setup\n')
tasklist.write('python etics/src/etics-get-project $NMI_project_%s\n' %taskIndex)
tasklist.write('python etics/src/etics-checkout $NMI_checkout_options_%s -c $NMI_version_%s $NMI_component_%s\n' %(taskIndex,taskIndex,taskIndex))
                                                                                                                                           
# Source nmi stuff
tasklist.write('source util.sh\n')
# Configure Chirp.
tasklist.write('setup_chirp\n')
                                                                                                                                           
#Wait for the dependencies to be resolved
if len(os.getenv('NMI_depends_%s' % taskIndex, '')) > 0:
  dependencies=os.environ['NMI_depends_%s' %taskIndex]
  dependencyList = dependencies.split(';')
  for i in dependencyList:
    tasklist.write('discover_services nodeid %s 35\n' %i)
    tasklist.write('echo \"************************************************\"\n')
    tasklist.write('echo \"Printing requirement _NMI_HOST_%s\"\n' %i)
    tasklist.write('echo $_NMI_HOST_%s\n' %i)
    tasklist.write('echo \"************************ENV***********************\"\n')
    tasklist.write('env|sort\n')
    tasklist.write('echo \"************************************************\"\n')
                                                                                                                                           
if 'NMI_run_type' not in os.environ or os.getenv('NMI_run_type') == 'build':
   cmd = 'python etics/src/etics-build $NMI_submit_options_%s $NMI_component_%s\n' %(taskIndex,taskIndex)
else:
   cmd = 'python etics/src/etics-test $NMI_submit_options_%s $NMI_component_%s\n' %(taskIndex,taskIndex)
tasklist.write(cmd)
                                                                                                                                           
tasklist.write('SRV_PREFIX=$NMI_service_name_%s\n' %taskIndex)
tasklist.write('RESULT=OK_result\n')
tasklist.write('publish_service ${SRV_PREFIX} ${RESULT}\n')
tasklist.close()

-- GuillermoDiezAndinoSancho - 15 May 2007

IMPLEMENTATION OF THE JOB WRAPPER

Here's a pseudo-code/Python implementation of the job wrapper for co-scheduling. This implementation, like the above use-case, doesn't include inter-service dependencies.

This implementation requires a modification to the current execution of etics-test, so that each target is executed one at a time, with a synchronisation point between target executions.


myHostname = env('HOSTNAME') # could also be the ip address
myNodeIndex = env('NODEINDEX')
nodesDeployed = int(env('NOOFNODESDEPLOYED'))
set('hostname'+myNodeIndex, myHostname) # set a key/value providing hostname value to all other jobs part of this deployment
hostnames = []

# Wait for all nodes to be allocated
for i in range(nodesDeployed):
   hostname = get('hostname'+str(i), 'block') # this is a blocking read on the existence of the key 'hostname'+i, which is a simple way to provide synchronisation
   hostnames.append(hostname)

# Execute all targets, waiting for all others to be completed before the next target is executed
targets = getTestTargets()
for t in targets:
   execute(t)
   set('status'+myNodeIndex,t+'Done')
   for i in range(nodesDeployed):
      status = get('status'+str(i), 'noblock')
      while status != t+'Done':
         if status == 'error':
            ... handle error ...
         ... need some sort of time-out mechanism if we don't have a callback mechanism ...
         status = get('status'+str(i), 'noblock')
jobStatus = ...
return jobStatus

-- MebSter - 16 May 2007

