Needed tests

This list was used during EGEE 2 and is out of date now. The new list is here

This is an evolving list of tests we need for gLite certification. Test developers are encouraged to add content.

APEL

Uses R-GMA to propagate and display job accounting information for infrastructure monitoring.

Test developer: Alessio Gianelle (INFN), Nikos Voutsinas (GRNET)

For more information on installation and configuration of new APEL RPMs (R_2_0_8) please follow this link

Layer 1 - service ping tests

Test whether the apel-pbs-parser cron job returns any errors on each CE, and whether apel-pbs-publisher works correctly on the MON host.

Parsing:

env RGMA_HOME=/opt/glite APEL_HOME=/opt/glite /opt/glite/bin/apel-pbs-log-parser -f /opt/glite/etc/glite-apel-pbs/parser-config-yaim.xml

If there are no errors returned, the gLiteCE (and other CEs tested this way) publishes information correctly to the MON host.

Publishing:

env RGMA_HOME=/opt/glite APEL_HOME=/opt/glite /opt/glite/bin/apel-publisher -f /opt/glite/etc/glite-apel-publisher/publisher-config-yaim.xml

If there are no errors, accounting data is correctly published to the central Registry.

BLAH

BLAH interfaces the CE with the local batch system. BLAHPD has to be installed on the Computing Element.

Developers: David Rebatto, Giuseppe Fiorentino, Massimo Mezzadri.

To test the core daemon, the easiest approach is to write fake scripts mimicking the real LRMS ones, e.g. tst_submit.sh, tst_status.sh, etc. You can then start blahp and issue commands to submit fake jobs, specifying gridtype="tst" in the submission classad. The fake tst_submit.sh could, for example, just log the arguments passed by blahp and return a fake jobId. Testing the LRMS scripts requires some more work and thought...
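
As an illustration, a fake submit script along these lines might look as follows (a minimal sketch only; the argument and output conventions of the real LRMS scripts should be checked against the BLAH documentation):

#!/bin/sh
# tst_submit.sh - fake LRMS submit script for testing the BLAH core daemon.
# It only logs the arguments passed by blahpd and returns a fake jobId.
LOGFILE=/tmp/tst_submit.log
echo "$(date) submit called with: $*" >> "$LOGFILE"
# Print a fake but well-formed job identifier for blahpd to pick up.
echo "tst/$(date +%s).$$"
exit 0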

Test developer: Elisabetta Molinari (INFN)

Link to implemented BLAH tests

Layer 1 - service ping tests DONE

  • Check that the blahparser is running on the machine where the LRMS logfiles are available (usually the CE).

Layer 2 - service functionality tests DONE

  • Provide scripts to test the following blah commands: BLAH_JOB_CANCEL, BLAH_JOB_HOLD, BLAH_JOB_REFRESH_PROXY, BLAH_JOB_RESUME, BLAH_JOB_STATUS, BLAH_JOB_SUBMIT

Layer 3 - system tests

  • Test the possibility of passing parameters from the JDL to the local resource management system.

Data Management (DM)

  • Test all possible combinations of options of lcg_utils (a sample sequence is sketched below).
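
A sample sequence exercising several lcg_utils commands (the SE hostname and LFN are placeholders):

# Copy a local file to an SE and register it in the catalogue under an LFN.
lcg-cr --vo dteam -d se.example.org -l lfn:/grid/dteam/tests/myfile file:///tmp/myfile
# List the replicas of the file.
lcg-lr --vo dteam lfn:/grid/dteam/tests/myfile
# Copy the file back from the grid.
lcg-cp --vo dteam lfn:/grid/dteam/tests/myfile file:///tmp/myfile.back
# Delete all replicas and the catalogue entry.
lcg-del --vo dteam -a lfn:/grid/dteam/tests/myfile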

DGAS

Collects, stores and transfers accounting data.

DGAS development coordinator: Andrea Guarise; co-developer: Rosario Piro.

Test developer: Alessio Gianelle (INFN)

Link to implemented DGAS tests

Layer 1 - service ping tests DONE

  • Check that the following services are running on the HLR node: hlr-serverd, hlr-tqd.
  • Check that the hlr-serverd answers to 'ping' service requests from remote clients.

Layer 2 - service functionality tests DONE

  • Test that the following local commands work:
    • glite-dgas-hlr-adduser
    • glite-dgas-hlr-addgroup
    • glite-dgas-hlr-addvo
    • glite-dgas-hlr-addresource
    • glite-dgas-hlr-deluser
    • glite-dgas-hlr-delgroup
    • glite-dgas-hlr-delvo
    • glite-dgas-hlr-delresource
    • glite-dgas-hlr-queryuser
    • glite-dgas-hlr-querygroup
    • glite-dgas-hlr-queryvo
    • glite-dgas-hlr-queryresource
    • glite-dgas-hlr-advancedquery

  • Test that the following (client) commands work:
    • glite-dgas-ping
    • glite-dgas-atmClient
    • glite-dgas-query
    • glite-dgas-userinfo
    • glite-dgas-resourceinfo

  • Test basic job accounting procedures for site HLR accounting (using fake job info) DONE
  • Test basic job accounting procedures for site-VO HLR accounting (using fake job info)
  • Test that a real hello world job is being accounted

Layer 4 - stress tests

  • Test that a stream of jobs is accounted

gLite CE

The gLite CE is no longer part of the release

The Computing Element (CE) is the service representing a computing resource. Its main functionality is job management (job submission, job control, etc.). The CE may be used by a generic client: an end-user interacting directly with the Computing Element, or the Workload Manager, which submits a given job to an appropriate CE found by a matchmaking process. Tests described here deal with direct submission to the CE.

Test developer: Alessio Gianelle (INFN)

Layer 1 - service ping tests

  • Check if the following basic services respond on the CE: condor and glite-gatekeeper
  • Check for the following optional services: blahparser, glite-lb-logd, glite-lb-interlogd

Layer 2 - service functionality tests

Install Condor on a machine to be used for launching the testing scripts; from there, test that the following Condor commands work, using ad-hoc submit files: condor_submit, condor_rm, condor_q.
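
A minimal sketch of such an ad-hoc submit file and the corresponding commands; the grid_resource line uses the generic Condor-C form with a placeholder CE host and must be adapted to the actual setup:

# hello.sub - ad-hoc submit file for direct submission tests
universe      = grid
grid_resource = condor ce.example.org ce.example.org:9619
executable    = /bin/hostname
output        = hello.out
error         = hello.err
log           = hello.log
queue

condor_submit hello.sub    # submit the job
condor_q                   # inspect the queue
condor_rm <cluster_id>     # remove the job again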

Write guidelines on how to check that the information published on the gliteCE is correct.

Layer 4 - stress tests

  • Massive job submission: provide a script that sequentially submits hundreds of jobs via the condor_submit command (a minimal driver loop is sketched after this list)

  • Massive job cancellation: provide a script that sequentially cancels hundreds of jobs via the condor_rm command.

  • Provide a script that cancels jobs while submitting
  • Provide time measurements
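
A minimal driver loop along these lines (hello.sub is the submit file sketched above; error handling is deliberately simple):

#!/bin/bash
# Sequentially submit N jobs via condor_submit and count failed submissions.
N=${1:-200}
fails=0
for i in $(seq 1 $N); do
    condor_submit hello.sub >> submit.log 2>&1 || fails=$((fails+1))
done
echo "$fails failed submissions out of $N"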

R-GMA

The Relational Grid Monitoring Architecture (R-GMA) provides a service for information, monitoring and logging in a distributed computing environment. For more information have a look at the R-GMA homepage

Layer 1 - service ping tests

  • SAM CE-sft-rgma-sc sensor

Layer 2 - service functionality tests

  • On the R-GMA client: rgma-client-check script (comes with R-GMA) checks that R-GMA has been configured correctly. This is accomplished by publishing data using various APIs and verifying that the data is available via R-GMA. This test is probably already executed by a SAM sensor.
  • On the R-GMA MON box: rgma-server-check (comes with R-GMA) checks if the R-GMA server is configured correctly and tries to connect to the servlets.

Layer 3 - system tests

  • Test a complete use case using the most important commands of the CLI: show, describe, select, set query, insert, secondaryproducer, set producer, set timeout/maxage (a sample session is sketched after this list)
  • Do the above test also with the APIs
  • Publish a tuple with a simple predicate and check that it can be consumed (continuous query)
  • Create a secondary (latest) producer and check that it can be consumed
  • Create a secondary (history) producer and check that it can be consumed
  • Write garbage to R-GMA; check what happens when the MON box connects to the registry with a bad string
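
A sample CLI session of this kind (the userTable columns shown are the ones used by rgma-client-check and are an assumption here):

rgma
> show tables
> describe userTable
> insert into userTable (userId, aString, aReal, anInt) values ('tester', 'hello', 1.0, 1)
> set query latest
> select * from userTable
> set query continuous
> set timeout 30
> select * from userTable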

Layer 4 - stress tests

  • Create a simple program that publishes tuples into the userTable. Submit a flood of jobs to get this script to run on each worker node. Check that no information is lost.
  • Measure response time
  • Do consuming and producing at the same time

VOMS

The Virtual Organisation Membership Service (VOMS) is a system for managing authorization data within multi-institutional collaborations. VOMS provides a database of user roles and groups, and a set of tools for accessing and manipulating the database and using the database contents to generate Grid credentials for users when needed. Have a look at the VOMS TWiki page

VOMS development is divided into voms core and voms-admin. The developers are Vincenzo Ciaschini and Valerio Venturi for voms core, and Andrea Ceccanti for voms admin.

Test developer: Maria Allandes (CERN), Viktor Galaktianov (CERN), PSNC

Link to implemented VOMS tests

VOMS core tests

Layer 1 - service ping tests DONE

    • Check voms core is up and running.

Layer 2 - service functionality tests

    • Starting a VO in a single VO VOMS server
    • Stopping a VO in a single VO VOMS server
    • Restarting a VO in a single VO VOMS server
    • Starting a VO in a multi VO VOMS server
    • Stopping a VO in a multi VO VOMS server
    • Restarting a VO in a multi VO VOMS server
    • simple voms-proxy-init with one VOMS server (examples for this and the following items are sketched after this list)
    • Test the correct return value of the voms-proxy-init command. It should return 0 whenever a valid proxy is returned (also in fail-over mode) and non-zero otherwise.
    • simple voms-proxy-init with two working VOMS servers with the same VO name and the same db
    • simple voms-proxy-init with two VOMS servers, one is down, with the same VO name and the same db
    • simple voms-proxy-init with two VOMS servers with the same VO name and different db
    • simple voms-proxy-init with user belonging to one or more VOMS groups
    • voms-proxy-init with user belonging to one or more VOMS groups and reordering of attributes
    • simple voms-proxy-init with user having one or more roles
    • voms-proxy-init with user having one or more roles and reordering of attributes
    • simple voms-proxy-info
    • simple voms-proxy-destroy
    • simple voms-proxy-list
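
Examples for these items, using the VO, group and role names from the VO Setup section below (placeholders for the actual certification setup):

voms-proxy-init --voms dteam                              # simple proxy for one VO
voms-proxy-init --voms dteam:/dteam/higgs                 # proxy with a group attribute
voms-proxy-init --voms dteam:/dteam/Role=lcgadmin         # proxy with a role attribute
voms-proxy-init --voms dteam:/dteam/higgs/Role=production \
                --voms dteam:/dteam/susy                  # several attributes, to test ordering
echo $?                                                   # must be 0 only if a valid proxy was obtained
voms-proxy-info --all                                     # inspect the attribute list
voms-proxy-destroy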

Layer 4 - stress tests

    • Execute a high number of parallel voms-proxy-init tests
    • Execute a low number of parallel voms-proxy-init tests repeated over a large number of iterations

Layer 5 - performance tests

    • Execute voms core stress tests to verify the database and tomcat performance.

VOMS admin tests

Layer 1 - service ping tests DONE

    • Check voms admin is up and running.

Layer 2 - service functionality tests DONE

    • Create a VO (a voms-admin CLI sketch for several of these operations follows this list)
    • Mass VO creation
    • Delete a VO without removing the DB
    • Delete a VO removing the DB
    • Create user
    • Delete user
    • List users
    • LDAP synchronisation
    • Create role
    • Delete role
    • List roles
    • Assign a role to a user
    • Remove role from a user
    • List which users have certain role
    • List the roles of a user
    • Create group
    • Delete group
    • List groups
    • List sub-groups
    • Add member in a group
    • Remove member from a group
    • List members of a group
    • Create a new ACL entry
    • Create a new default ACL entry
    • Remove an ACL entry
    • Remove a default ACL entry
    • List the ACL of a group or role
    • List default ACL applied to new groups or roles
    • Generate a grid-mapfile
    • SOAP client tests
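
Several of these operations can also be driven from the voms-admin command-line client, which is convenient for scripting the tests; the subcommand names below are indicative only and must be checked against the deployed voms-admin version:

voms-admin --vo dteam list-users
voms-admin --vo dteam create-group /dteam/higgs
voms-admin --vo dteam create-role lcgadmin
voms-admin --vo dteam list-members /dteam/higgs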

Layer 4 - stress tests

    • Mass grid-mapfile generation for one VO
    • Mass grid-mapfile generation for multiple VOs
    • Mass grid-mapfile generation combined with user registration and user list

Layer 5 - performance tests

    • Execute voms admin stress tests to verify the database and tomcat performance.

WMProxy

WMProxy API test plan provided by CSIC(IFIC)

WMS

The Workload Management System (WMS) comprises a set of Grid middleware components responsible for the distribution and management of tasks across Grid resources, in such a way that applications are conveniently, efficiently and effectively executed. For more information have a look at the WMS User's Guide

Link to implemented WMS tests

Layer 1 - service ping tests

Check if the following services respond on the resource broker: gridFTP server, Network Server, WM Proxy, CEMON asynchronous notification, Condor Collector, Logging and Bookkeeping (this service might be on another host), R-GMA export of LB data

Layer 2 - service functionality tests

  • test if the following NS commands work (by submitting one job): glite-job-submit, glite-job-list-match, glite-job-cancel, glite-job-output, glite-job-status, glite-job-logging-info. Check if the SandBox works (sending a file with the job); send SandBoxes of increasing size to see if the (hardcoded) size limit for SandBoxes works.
  • test if the following WMProxy commands work (a hello world example is sketched after this list): glite-wms-job-submit, glite-wms-job-delegate-proxy, glite-wms-job-list-match, glite-wms-job-cancel, glite-wms-job-output, glite-wms-job-perusal.
  • verify that the information provider on WMS publishes correct data (provide a document describing how this test should be done)
  • provide a test that produces very large jdls to test for jdl size limitations
  • ranking: test based on the number of CPUs
  • matchmaking: jobs should only be sent to CEs that support your VO and role and that respect the job's time limit
  • test all the options of all the WMS commands
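
A minimal hello world JDL and WMProxy command sequence for these tests (the delegation id and file names are arbitrary):

# hello.jdl
[
  Executable    = "/bin/echo";
  Arguments     = "hello world";
  StdOutput     = "std.out";
  StdError      = "std.err";
  OutputSandbox = {"std.out", "std.err"};
]

glite-wms-job-delegate-proxy -d mydeleg
glite-wms-job-list-match -d mydeleg hello.jdl
glite-wms-job-submit -d mydeleg -o jobids.txt hello.jdl
glite-wms-job-status --noint -i jobids.txt
glite-wms-job-output --noint -i jobids.txt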

Layer 3 - system tests

Describe a test scenario for the case that the VOMS proxy and/or the role expires while jobs are running.

Layer 4 - stress tests

Massive job submission.

  • single stream submission
  • multiple stream submission with different users/VO's
  • submit a simple hello world job that returns immediately
  • test DAG jobs
  • submit a job that sleeps for a certain time (which can be chosen by the user), so that the WMS does not have to take back jobs while it is still submitting.
  • possibility to submit to a chosen CE.
  • possibility to have different SandBoxes (empty ones, one with a big file, and SandBoxes containing up to 30 files)
  • possibility to have OutputSandboxes with up to 20 files
  • provide a script that returns the status of the test (how many jobs have already been submitted); a minimal driver is sketched after this list
  • provide a script that returns the status of the submitted jobs, success rate, average submission time and failure messages
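
A minimal single-stream driver for these requirements (hello.jdl as sketched above; the summary relies on the 'Current Status' lines printed by glite-wms-job-status):

#!/bin/bash
# Submit N hello world jobs, report progress, and summarise the job states.
N=${1:-100}
for i in $(seq 1 $N); do
    glite-wms-job-submit -a --noint -o jobids.txt hello.jdl > /dev/null 2>&1 \
        || echo "submission $i failed" >> submit-errors.txt
    echo "submitted $i of $N"
done
glite-wms-job-status --noint -i jobids.txt | grep 'Current Status' | sort | uniq -c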

Massive job canceling

  • features like the job submission test
  • cancel while submitting
  • cancel when all jobs are running (i.e. simple jobs that just sleep)

Bulk submission DONE

  • features like the job submission test
  • a set of scripts developed by Lin is available here

LB

The LB service provides a higher level view of the jobs by gathering events from various WMS components in the system.

Test developers: Othmane Bouhali, Shkelzen Rugovac (University of Bruxelles)

Link to implemented LB tests

Layer 1 - service ping tests DONE

Check the response of very basic L&B commands

Layer 2 - service functionality test DONE

Generate unique jobIDs and test basic L&B APIs

Check the locallogger: Job registration

Check the interlogger: event logging

Check the event delivery by the interlogger to the bookkeeping server

Other tests are also planned and will be reported.

Scalability testing: when submitting 400 jobs, Di noticed some jobs failing with a message like "Resource temporarily unavailable". See also bug #15450.

Security

  • Tests which check whether services are able to ban users (especially on the CE).
  • A list of security tests is given on the SecurityAuditing page.

It is necessary to ensure that a given deployment does not introduce security vulnerabilities resulting from the interactions between the various pieces of middleware and the OS. It is necessary to test the deployment as a whole to ensure that there are no ways in which a principal is able to bypass the security infrastructure, access data beyond their rights, or carry out any other unauthorised activity.

Information System

Layer 1 - service ping tests

  • Check that the LDAP server (top level BDII) is reachable
  • Check that the top level BDII is publishing some info (basic BDII troubleshooting test); both checks can be done with the queries sketched below
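
Both checks can be performed with plain LDAP queries (the hostname is a placeholder; 2170 is the standard BDII port):

# Check that the LDAP server answers at all.
ldapsearch -x -H ldap://bdii.example.org:2170 -b o=grid -s base
# Check that some information is published, e.g. list the CE unique ids.
ldapsearch -x -H ldap://bdii.example.org:2170 -b o=grid '(objectClass=GlueCE)' GlueCEUniqueID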

Layer 3 - system tests

  • Setup a CE and run it under some load, no stress needed, but multiple VOs and queues, the more complex the better.
  • Do an LDAP query to the site BDII that is connected to the CE and verify that the information is:
    • complete (are all queues and VOs present?)
    • consistent (do the published numbers make sense? Compare with the configuration and the GLUE schema 1.2 document.)

Layer 4 – stress tests

  • Performance test of the information system on the gliteCE:
    • Two setups needed: Batch system head node on the CE, batch system head node NOT on the CE
    • Objective: Verify that the information provider returns the needed information fast enough to ensure that the CE will stay visible in the information system even when loaded.
    • Test design: Both the CE and the batch system have to be loaded. Configure the batch system to accept several jobs per WN. Submit MANY jobs to the CE: there should be more than 10k jobs queued in the batch system and a few hundred running (these have to be minimal jobs that just sleep for a few minutes). The queue and VO structure should be the same as in the previous tests. Then log on to the CE as the user under whose id the info system runs (on most systems this is edg-info). As this user, go to /opt/lcg/var/gip/plugin and run from the command line: time ./lcg-info-dynamic-scheduler-wrapper. The timing tests should be run several times, ideally at different load levels, and the results recorded. Make sure that the command returns something. Real-time responses longer than 30 seconds are problematic, and those longer than 1 minute are critical.
  • Provide a script that performs n ldapsearch queries in parallel and checks that the services keep publishing (a minimal sketch follows)
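
A minimal sketch of such a script (the hostname is again a placeholder):

#!/bin/bash
# Run N ldapsearch queries in parallel against a BDII and count the failures.
BDII=bdii.example.org:2170
N=${1:-50}
fails=0
for i in $(seq 1 $N); do
    ldapsearch -x -H ldap://$BDII -b o=grid -s base > /dev/null 2>&1 &
done
for pid in $(jobs -p); do
    wait $pid || fails=$((fails+1))
done
echo "$fails of $N parallel queries failed"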

Further tests (not yet classified):

  • Check that the published information is consistent.
  • Use gstat on the certification testbed to further develop tests for the information system.
  • Test the service discovery API with both R-GMA and the BDII as the backend. The information behind the SD API, i.e. the content in GlueService and GlueServiceStatus, also needs checking.
  • Tests for FCR

Batch Systems

For every Batch System (BS) the following points have to be verified:

Link to implemented BS tests

  1. Installation: BS with BS head node off the CE.
  2. Installation: BS with BS head node on the CE.
  3. Testing the queue structure and scheduler config needed for the implementation of the "Short Deadline Jobs", "Message Passing" and "Job Priorities" recommendations. Links to these recommendations are on the TCG Active Working Groups page.
  4. Ensure that the info providers produce what is expected. This can be done by querying the GRIS on the CE and checking that the information concerning the BS is correct (an example query is sketched after this list). Compare the information about running jobs etc. given by the information system with the one given by the commands provided by the BS (e.g. qstat for Torque). For this one might be interested in a description of the GLUE Schema.
  5. Verify that the info providers create only reasonable load on the head node and the CE.
  6. Verify that the audit files are properly written.
  7. Verify that the accounting files are working.
  8. Verify that VOMS based authorization to the queues works.
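
For point 4, an example comparison (port 2135 and the mds-vo-name=local base are the historical GRIS defaults and are assumptions here; the CE hostname is a placeholder):

# Ask the GRIS on the CE for the job counts it publishes...
ldapsearch -x -H ldap://ce.example.org:2135 -b mds-vo-name=local,o=grid \
    'GlueCEUniqueID=*' GlueCEStateRunningJobs GlueCEStateWaitingJobs
# ...and compare with what the batch system itself reports (Torque example).
qstat | wc -l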

How to do the test

  1. Submit a large number of jobs to the CE.
    • Start with 100 jobs and double until the CE is saturated.
    • Provide measurements for each of the steps: time to process the jobs, load figures on the CE and head node, memory consumption.
    • Document the failure rate in all modes.
  2. For the saturation test, document at what load the CE stops working and the symptoms (crash, freeze, crawl, ...)
  3. Tests must be run with different user scenarios:
    • 1 User
    • 5 users from the same VO
    • 4 VOs with 5 users each
  4. Submit from multiple UIs (for the lcg-CE) and from multiple RBs and WMS nodes at the same time. This is needed to verify that the handling of parallel asynchronous events is working.
  5. Do a BS standalone test to understand what the envelope is. Do test 1. on a BS client (e.g. with qsub for Torque) and compare the results (including error rates).
  6. Try the proxy renewal. Here you submit jobs at the 30% level with proxies that will expire almost at the same time.

Batch systems: LSF, Condor, Sun Grid Engine, Torque

Tests verifying the User Guides (TAR UI)

Test that the TAR UI is installable as non-root and usable. All commands in the user guide must be tested. Verify that the UI works with bash as well as tcsh. For the user guides of the services and nodes respectively, we should make sure that every command mentioned therein is tested by at least one test. This correspondence should be documented. DONE

Link to implemented UI tests

dCache

Contacts: Owen Synge, Greig Cowan

dCache functional testsuite

Proxy renewal

Make sure that proxy renewal works for all services: e.g. the proxy might still be valid while the VOMS role no longer is, or the proxy needs to be renewed while the job is still in the queue.

Catalogs

  • Performance tests

Users Tools

  • Grid Peek
  • R-GMA JobMonitoring

Interoperability Tests

  • OSG - this will be handled between the PPS and OSG's ITR (their testbed). There are now fortnightly meetings to coordinate.
  • ARC
  • UNICORE
  • NAREGI

LFC

Link to implemented LFC tests

  • lcg-utils specific tests DONE
  • LFC specific tests
  • LFC Python interface tests DONE
  • LFC Perl interface tests

VO Setup

It must be possible to do tests with different combinations of VOs, Roles and Groups. On the certification testbed the following setup should be available on all services:

  • At least 100 test user certificates
  • 4 VOs: dteam, org.glite.voms-test, one really long VO with strange characters, one VO dedicated to VOMS tests (this VO needs only be available on the VOMS server)
  • All test users must be registered to all VOs.
  • Per VO we need 2 Roles: lcgadmin, production
  • Per VO we need 2 Groups: higgs, susy
  • It must be possible to get a voms proxy with attributes /VO/Group/Role and /VO/Role for any combination of VO, Group and Role. This only needs to be available for some test users.

VOBOX

  • proxy renewal
  • gsissh -p <port>: the default port is 1975; you must have the sgm role in order to log in with gsissh (see the examples after this list).
  • subset (?) of UI tests
  • gridftp (globus-url-copy)
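
Examples for the gsissh and gridftp items (the VOBOX hostname is a placeholder; the proxy must carry the sgm role as noted above):

# Log in to the VOBOX via gsissh on the default port.
gsissh -p 1975 vobox.example.org
# Transfer a file through the VOBOX gridftp server.
globus-url-copy file:///tmp/testfile gsiftp://vobox.example.org/tmp/testfile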

Worker Node (WN)

Link to implemented WN tests

  • test the 64bit WN for 32bit compatibility (a minimal check is sketched below) DONE
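
A minimal check of this kind, assuming a gcc with 32-bit multilib support is available on the WN to build the test binary:

# Build a trivial 32-bit binary and run it on the 64-bit WN.
echo 'int main(void){return 0;}' > hello32.c
gcc -m32 -o hello32 hello32.c
./hello32 && echo "32-bit binaries run on this WN"
# Also check that the 32-bit runtime linker is installed.
ls /lib/ld-linux.so.2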

General test topics

Sensible error messages

Does the application/service always give a sensible error message when some problem is encountered, e.g. when a bad argument is supplied on the command line, or when some expected argument is missing? For example, there are bugs open about the lack of robust argument parsing by lcg-utils, and their error messages in general.

Robustness of services against reboots or restarts

If a service is restarted, does it allow pending requests to be recovered, e.g. can a client reconnect and refer to previous requests as if the service had been available all the time? For example, when a CE, RB or WMS is rebooted, jobs in a steady state should not be affected. This should be re-verified and also tested for other services for which such robustness is deemed important (and implementable).

Functionality of commands

Test if all/most combinations of options to available data management and wms commands work. For example:
edg-job-status --vo dteam --all

Selected Virtual Organisation name (from --vo option): dteam

Retrieving Information from LB server lxb2016.cern.ch:9000
Please wait: this operation could take some seconds.

**** Warning: API_NATIVE_ERROR ****
Error while calling the "Status:getStates" native api
Unable to retrieve status query information from: lxb2016.cern.ch: 
edg_wll_QueryJobsExt: No indexed condition in query: no indices configured: jobid required in query

**** Error: LB_API_OPEN_ALL ****
Unable to open connection with any LB supplied

This indicates that there is a problem with the --all option. (Note: it is defensible not to configure indices, because it might invite users to run big queries instead of keeping track of their own job IDs.)

Check behavior under error conditions

Check behavior under error conditions as well as normal operation, e.g. for R-GMA there have been many problems if a MON box is behind a firewall or responds slowly.

Stress tests

Stress tests should be submitted under several users and VOs at the same time.

-- AndreasUnterkircher - 11 Oct 2007
Topic attachments: testplan_wmproxy_api.pdf - WMProxy API test plan (80.5 K, 2007-05-02, AndreasUnterkircher)