Deployment SCENARIO: "Mixed Mode"
Before Starting
- HOSTNAME SL5: emitestbed34.cnaf.infn.it + 2 IPs for virtual machines: emitestbed35.cnaf.infn.it, emitestbed36.cnaf.infn.it
- HOSTNAME SL6: cert-06.cnaf.infn.it + 2 IPs for virtual machines: emitestbed19.cnaf.infn.it, emitestbed20.cnaf.infn.it
- OS: SL5 / SL6 X86_64 Installed
- No Host certificate required
- No Network Bridge configured
- Hardware must support virtualization (please run grep --color vmx /proc/cpuinfo)
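The check above only matches the Intel vmx flag; AMD CPUs report svm instead. A minimal sketch of a helper that accepts either flag follows. The function name has_hw_virt is an assumption made for illustration and is not part of EMI or WNoDeS.

```shell
# has_hw_virt: hypothetical helper, not shipped with EMI or WNoDeS.
# Returns 0 when the given cpuinfo file (default /proc/cpuinfo) lists
# the vmx (Intel VT-x) or svm (AMD-V) CPU flag, non-zero otherwise.
has_hw_virt() {
    cpuinfo_file="${1:-/proc/cpuinfo}"
    grep -q -E '^flags[[:space:]]*:.*[[:space:]](vmx|svm)([[:space:]]|$)' "$cpuinfo_file"
}

# Example:
#   has_hw_virt && echo "hardware virtualization available"
```

Note that the flag can be present in the CPU while virtualization is still disabled in the BIOS, so this is a necessary check rather than a sufficient one.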
Service Installation
- Repositories ( see EMI basic configuration): egi-trustanchors.repo + emi-2-rc-sl5.repo + epel.repo
- $> yum clean all
- $> yum makecache
- $> yum install ca-policy-egi-core
- $> yum install lcg-CA
- $> yum install yum-protectbase.noarch
- INSTALLING WN + TORQUE
- $> yum install emi-wn emi-torque-client
- $> yum install emi-release
- $> yum install kvm-qemu-img
- $> yum install kmod-kvm
- $> yum install libvirt
- $> yum install python-virtinst
- $> yum install pyOpenSSL
- INSTALLING WNODES
- $> yum install wnodes*
Service Configuration
CONFIGURE WN with Torque/Maui (SL5, follow same rules for SL6 -> check documentation)
- Install WN with Torque following the WN deployment logbook, excluding GLEXEC and MPI
- Copy /etc/munge/munge.key from the CE to /etc/munge/munge.key on this node
- $> chown munge /etc/munge/munge.key
- $> /etc/init.d/munge start
- Wnodes specific configuration
- Edit /etc/wnodes/nameserver/mac_list.ini:
[DEFAULT_VLAN]
network_type = OPEN
bait_host =
vm_host = emitestbed35.cnaf.infn.it^00:16:3E:MACADDRESS;emitestbed36.cnaf.infn.it^00:16:3E:MACADDRESS
-
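The vm_host value packs each virtual machine as hostname^MAC, with entries separated by semicolons. As a quick sanity check of that entry, a small sketch that prints one "hostname MAC" pair per line is shown here. The function name list_vm_hosts is hypothetical, and the sketch assumes a single vm_host line per file, as in the example above.

```shell
# list_vm_hosts: hypothetical helper for inspecting a mac_list.ini-style file.
# Prints "hostname MAC" for every hostname^MAC entry of the vm_host line.
list_vm_hosts() {
    sed -n 's/^[[:space:]]*vm_host[[:space:]]*=[[:space:]]*//p' "$1" \
        | tr ';' '\n' \
        | tr '^' ' ' \
        | awk 'NF == 2'    # keep only well-formed "host MAC" pairs
}

# Example:
#   list_vm_hosts /etc/wnodes/nameserver/mac_list.ini
```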
- $> grep -v "#" /etc/wnodes/nameserver/wnodes_hv_config.ini
[HV_CONF]
HV_PORT=8222
BAIT_PORT=8111
LOG_FILE_NAME=wnodes_hv.log
MAX_LOG_FILE_SIZE=100000
MAX_COUNT_LOG_FILE=5
LOCAL_REPO_DIR=/usr/local/wnodes/repo
BAIT_IMG_TAG=wnodes_sl5_bait
BAIT_VM_RAM=800
HOST_GROUP_EMITESTBED=emitestbed*
ENABLED_VLAN_GROUP_EMITESTBED=DEFAULT_VLAN
SSH_KEY_FILE=/root/.ssh/hv_id_rsa
USE_LVM=NO
VOLUME_GROUP=vg0
SERVICE_NIC_IP=10.1.1.2
SERVICE_NIC_IP_MASK=255.255.255.0
DNS_RANGE=10.1.1.3,10.1.1.30
DNS_LEASE_TIME=5m
ENABLE_MIXED_MODE=yes
-
- $> grep -v "#" /etc/wnodes/nameserver/wnodes_bait_config.ini
[BAIT_CONF]
min_vm_cpu = 1
default_vm_bandwidth = 50
min_vm_mem = 1500
reservation_length = 1200
max_vm_bandwidth = 100
log_file_name = wnodes_bait.log
status_retry_count = 3
max_vm_storage = 15
batch_system_type = PBS
max_log_file_size = 100000
enabled_vlan_group_emitestbed = DEFAULT_VLAN
lsf_profile = /etc/profile.d/lsf.sh
max_vm_mem = 2000
min_vm_bandwidth = 10
max_vm_cpu = 1
enable_mixed_mode = yes
type = BATCH;BATCH_REAL
default_vm_img = wnodes-emi-images
scheduling_interval = 60
hv_port = 8222
use_lvm = NO
default_vm_storage = 10
min_vm_storage = 10
max_count_log_file = 5
host_group_emitestbed = emitestbed*
bait_port = 8111
default_job_type = BATCH
default_vm_cpu = 1
px_failed_return_status = 3
vm_unreach_timeout = 600
default_vm_mem = 2000
-
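The hypervisor and bait configurations repeat several settings (hv_port, bait_port, enable_mixed_mode, use_lvm), in upper case without spaces in one file and in lower case with spaces in the other. A minimal sketch for checking that the two files agree on a given key is shown here. The helper names ini_value and same_setting are assumptions for illustration, not WNoDeS tools.

```shell
# ini_value FILE KEY: hypothetical helper. Prints the value of KEY
# (matched case-insensitively, spaces around "=" ignored) from FILE.
ini_value() {
    awk -F'=' -v key="$2" '
        /^[ \t]*#/ { next }                      # skip comment lines
        {
            k = $1
            gsub(/[ \t]/, "", k)                 # drop blanks around the key
            if (tolower(k) == tolower(key)) {
                v = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim the value
                print v
                exit
            }
        }' "$1"
}

# same_setting FILE1 FILE2 KEY: succeeds when both files define KEY
# with the same non-empty value.
same_setting() {
    val_a=$(ini_value "$1" "$3")
    val_b=$(ini_value "$2" "$3")
    [ -n "$val_a" ] && [ "$val_a" = "$val_b" ]
}

# Example:
#   for k in hv_port bait_port enable_mixed_mode use_lvm; do
#       same_setting /etc/wnodes/nameserver/wnodes_hv_config.ini \
#                    /etc/wnodes/nameserver/wnodes_bait_config.ini "$k" \
#           || echo "mismatch on $k"
#   done
```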
- $> grep -v "#" /etc/wnodes/manager/wnodes.ini
[NAMESERVER]
NS_HOST = emitestbed34.cnaf.infn.it
NS_PORT = 8219
-
- $> service wnodes_nameserver start
- $> wnodes_manager -a wnodes-emi-images http torquemada.cr.cnaf.infn.it/wnodes/wnodes_sl5_wn_emi x86_64 raw /dev/mapper/VolGroup00-LogVol00
- $> [root@emitestbed34 ~]# wnodes_manager -l
tag loca path arch form dev
wnodes-emi-images http torquemada.cr.cnaf.infn.it/wnodes/wnodes_sl5_wn_emi x86_64 raw /dev/mapper/VolGroup00-LogVol00
-
- $> grep -v "#" /etc/wnodes/hypervisor/wnodes.ini
[NAMESERVER]
NS_HOST = emitestbed34.cnaf.infn.it
NS_PORT = 8219
-
- $> grep -v "#" /etc/wnodes/bait/wnodes.ini
[NAMESERVER]
NS_HOST = emitestbed34.cnaf.infn.it
NS_PORT = 8219
-
- $> mkdir -p /usr/local/wnodes/repo --> workaround
- $> service libvirtd start
- $> service wnodes_hypervisor start --> this will start the process wnodes_bait
- SOME CHECKS
[root@emitestbed34 ~]# wnodes_manager -t all
emitestbed34 :
[root@emitestbed34 ~]# wnodes_manager -s emitestbed34
Bait : emitestbed34;
Bait status : ['OPEN', 'Everything is OK, the BAIT process can start', 1347887272.8612399, 0, 0, {'MEM': 3965, 'BANDWIDTH': 1000, 'CPU': 3}]
No active jobs
[root@emitestbed34 ~]# wnodes_manager -S emitestbed34
[root@emitestbed34 ~]#
-
- $> chmod 500 /usr/bin/wnodes/site_specific/wnodes_preexec
- Download patch_wnodes_preexec.txt (e.g. with wget) and apply the patch
- $> cp /etc/wnodes/site_specific/wnodes_preexec.conf.tpl /etc/wnodes/site_specific/wnodes_preexec.conf
- $> [root@emitestbed34 ~]# grep -v "#" /etc/wnodes/site_specific/wnodes_preexec.conf ----------> NOTE : this file must be the same on the template virtual image
[general]
TMPFILE=/tmp/my_bait
LOCAL_DOMAIN=cnaf.infn.it
NS_HOST=emitestbed34.cnaf.infn.it
NS_PORT=8219
BAIT_PORT=8111
FAIL_RETURN_STATUS = 3
[default]
TYPE=BATCH
IMG=wnodes-emi-images
NETWORK_TYPE=OPEN
CPU=1
MEM=1900
STORAGE=30
ENABLEVIRTIO=YES
BANDWIDTH=10
PX_SCRIPT=/usr/bin/wnodes/site_specific/wnodes_preexec
['dongiovanni']
TYPE=BATCH
IMG=wnodes-emi-images
NETWORK_TYPE=OPEN
CPU=1
MEM=2500
STORAGE=30
ENABLEVIRTIO=YES
BANDWIDTH=10
PX_SCRIPT=
-
- $> cat /var/torque/mom_priv/prologue
#!/bin/bash
while [ ! -f /usr/bin/wnodes/site_specific/wnodes_preexec ]; do sleep 3 ; done
sleep 10
/usr/bin/wnodes/site_specific/wnodes_preexec -f /etc/wnodes/site_specific/wnodes_preexec.conf --jobid $1 --username $2 &> /root/prologue.txt
-
- $> chmod 500 /var/torque/mom_priv/prologue
- Wnodes specific configuration: WN image
Configuration ON Torque server
[root@emi-demo13 ~]# cat /etc/cron.d/fix_wnodes_job
*/5 * * * * root /usr/bin/fix_jobs_maui.sh
[root@emi-demo13 ~]# cat /usr/bin/fix_jobs_maui.sh
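#!/bin/bash
# Release jobs that Maui reports as held or deferred
for i in $(diagnose -q | grep -P -i "Hold|Def" | awk '{print $2}')
do
    releasehold $i
done
# Release jobs that showq lists in BatchHold state
for i in $(showq | grep BatchHold | awk '{print $1}')
do
    releasehold $i
done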
[root@emi-demo13 ~]# cat siteinfo/wnodes_queue_command
create queue qwnodes
set queue qwnodes queue_type = Execution
set queue qwnodes Priority = 1000000
set queue qwnodes max_running = 80
set queue qwnodes resources_max.cput = 100:00:00
set queue qwnodes resources_max.walltime = 100:00:00
set queue qwnodes resources_default.neednodes = cloudtf
set queue qwnodes enabled = True
set queue qwnodes started = True
[root@emi-demo13 ~]# cat siteinfo/wnodes_queue_commandsl6
create queue qwnodessl6
set queue qwnodessl6 queue_type = Execution
set queue qwnodessl6 Priority = 1000000
set queue qwnodessl6 max_running = 80
set queue qwnodessl6 resources_max.cput = 100:00:00
set queue qwnodessl6 resources_max.walltime = 100:00:00
set queue qwnodessl6 resources_default.neednodes = cloudtfsl6
set queue qwnodessl6 enabled = True
set queue qwnodessl6 started = True
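The two queue definition files differ only in the queue name and in the node property used for neednodes. A sketch of generating them from one template follows. The function name wnodes_queue_commands is hypothetical, and the limits are simply the values used above.

```shell
# wnodes_queue_commands QUEUE PROPERTY: hypothetical helper that prints the
# qmgr commands used above for a WNoDeS queue.
#   QUEUE    = queue name (e.g. qwnodes, qwnodessl6)
#   PROPERTY = node property for resources_default.neednodes
wnodes_queue_commands() {
    queue_name="$1"
    node_prop="$2"
    cat <<EOF
create queue $queue_name
set queue $queue_name queue_type = Execution
set queue $queue_name Priority = 1000000
set queue $queue_name max_running = 80
set queue $queue_name resources_max.cput = 100:00:00
set queue $queue_name resources_max.walltime = 100:00:00
set queue $queue_name resources_default.neednodes = $node_prop
set queue $queue_name enabled = True
set queue $queue_name started = True
EOF
}

# Example:
#   wnodes_queue_commands qwnodes cloudtf > /root/siteinfo/wnodes_queue_command
#   wnodes_queue_commands qwnodessl6 cloudtfsl6 > /root/siteinfo/wnodes_queue_commandsl6
```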
[root@emi-demo13 ~]# qmgr < /root/siteinfo/wnodes_queue_command
Max open servers: 9
create queue qwnodes
set queue qwnodes queue_type = Execution
set queue qwnodes Priority = 1000000
set queue qwnodes max_running = 80
set queue qwnodes resources_max.cput = 100:00:00
set queue qwnodes resources_max.walltime = 100:00:00
set queue qwnodes resources_default.neednodes = cloudtf
set queue qwnodes enabled = True
set queue qwnodes started = True
[root@emi-demo13 ~]# qmgr < /root/siteinfo/wnodes_queue_commandsl6
.....output
....
[root@emi-demo13 ~]# qmgr -c "set queue qwnodes resources_default.neednodes = cloudtf"
[root@emi-demo13 ~]# qmgr -c "set server managers += root@emitestbed35.cnaf.infn.it"
[root@emi-demo13 ~]# qmgr -c "set server managers += root@emitestbed36.cnaf.infn.it"
[root@emi-demo13 ~]# qmgr -c "set server managers += root@emitestbed34.cnaf.infn.it"
[root@emi-demo13 ~]# qmgr -c "set queue demo resources_default.neednodes = lcgpro"
FOR SL6
[root@emi-demo13 ~]# qmgr -c "set queue qwnodessl6 resources_default.neednodes = cloudtfsl6"
[root@emi-demo13 ~]# qmgr -c "set server managers += root@cert-06.cnaf.infn.it"
[root@emi-demo13 ~]# qmgr -c "set server managers += root@emitestbed19.cnaf.infn.it"
[root@emi-demo13 ~]# qmgr -c "set server managers += root@emitestbed20.cnaf.infn.it"
[root@emi-demo13 ~]# cat /var/torque/server_priv/nodes
emitestbed23.cnaf.infn.it np=2 lcgpro
emi-demo09.cnaf.infn.it np=2 lcgpro
emitestbed34.cnaf.infn.it np=3 cloudtf bait
emitestbed35.cnaf.infn.it qwnodes
emitestbed36.cnaf.infn.it qwnodes
cert-06.cnaf.infn.it np=3 cloudtfsl6 bait
emitestbed19.cnaf.infn.it qwnodessl6
emitestbed20.cnaf.infn.it qwnodessl6
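In the nodes file the bait hosts carry the properties that the queues request through neednodes (cloudtf, cloudtfsl6), plus the bait property. A small sketch for listing the hosts that carry a given property is shown here. The function name nodes_with_property is an assumption for illustration.

```shell
# nodes_with_property FILE PROPERTY: hypothetical helper that prints the
# hostnames in a Torque nodes file whose property list contains PROPERTY
# (exact word match, so cloudtf does not match cloudtfsl6).
nodes_with_property() {
    awk -v prop="$2" '
        /^[ \t]*(#|$)/ { next }    # skip comments and blank lines
        {
            for (i = 2; i <= NF; i++)
                if ($i == prop) { print $1; break }
        }' "$1"
}

# Example:
#   nodes_with_property /var/torque/server_priv/nodes bait
```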
[root@emi-demo13 ~]# cat /var/spool/maui/maui.cfg
# MAUI configuration example
SERVERHOST emi-demo13.cnaf.infn.it
ADMIN1 root
ADMIN3 edginfo rgma edguser ldap
ADMINHOSTS emi-demo13.cnaf.infn.it
RMCFG[base] TYPE=PBS
SERVERPORT 40559
SERVERMODE NORMAL
# Set PBS server polling interval. If you have short
# queues or/and jobs it is worth to set a short interval. (10 seconds)
RMPOLLINTERVAL 00:00:10
# a max. 10 MByte log file in a logical location
LOGFILE /var/log/maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 1
# Set the delay to 1 minute before Maui tries to run a job again,
# in case it failed to run the first time.
# The default value is 1 hour.
DEFERTIME 00:01:00
# Necessary for MPI grid jobs
ENABLEMULTIREQJOBS TRUE
NODECFG[emitestbed34.cnaf.infn.it] PARTITION=virtual
CLASSCFG[qwnodes] PLIST=virtual PDEF=virtual
[root@emi-demo13 ~]# /etc/init.d/maui restart
Shutting down MAUI Scheduler: [ OK ]
Starting MAUI Scheduler: [ OK ]
[root@emi-demo13 ~]#
CHECKING NODES ->
[root@emi-demo13 ~]# pbsnodes -a
emitestbed23.cnaf.infn.it
state = free
np = 2
properties = lcgpro
ntype = cluster
status = rectime=1347958286,varattr=,jobs=,state=free,netload=14304694806,gres=,loadave=0.00,ncpus=1,physmem=2021380kb,availmem=3822532kb,totmem=4117852kb,idletime=5419971,nusers=2,nsessions=2,sessions=2527 31331,uname=Linux emitestbed23.cnaf.infn.it 2.6.18-308.4.1.el5 #1 SMP Tue Apr 17 14:33:50 EDT 2012 x86_64,opsys=linux
gpus = 0
emi-demo09.cnaf.infn.it
state = free
np = 2
properties = lcgpro
ntype = cluster
status = rectime=1347958285,varattr=,jobs=,state=free,netload=20164290477,gres=,loadave=0.04,ncpus=1,physmem=1504260kb,availmem=3301768kb,totmem=3600732kb,idletime=10178413,nusers=1,nsessions=1,sessions=3571,uname=Linux emi-demo09.cnaf.infn.it 2.6.18-308.4.1.el5 #1 SMP Tue Apr 17 14:33:50 EDT 2012 x86_64,opsys=linux
gpus = 0
emitestbed34.cnaf.infn.it
state = free
np = 3
properties = cloudtf,bait
ntype = cluster
status = rectime=1347958296,varattr=,jobs=,state=free,netload=23750277315,gres=,loadave=0.01,ncpus=4,physmem=6108660kb,availmem=6737480kb,totmem=8205132kb,idletime=1104,nusers=3,nsessions=4,sessions=5758 7222 7377 11649,uname=Linux emitestbed34.cnaf.infn.it 2.6.18-308.13.1.el5 #1 SMP Tue Aug 21 18:44:22 EDT 2012 x86_64,opsys=linux
gpus = 0
emitestbed35.cnaf.infn.it
state = down,offline
np = 1
properties = qwnodes
ntype = cluster
status = rectime=1347884870,varattr=,jobs=,state=free,netload=392665007,gres=,loadave=1.08,ncpus=1,physmem=509772kb,availmem=4815276kb,totmem=4966212kb,idletime=53,nusers=1,nsessions=1,sessions=2064,uname=Linux emitestbed35.cnaf.infn.it 2.6.18-274.17.1.el5 #1 SMP Tue Jan 10 16:13:44 EST 2012 x86_64,opsys=linux
gpus = 0
emitestbed36.cnaf.infn.it
state = down,offline
np = 1
properties = qwnodes
ntype = cluster
gpus = 0
cert-06.cnaf.infn.it
state = free
np = 3
properties = cloudtfsl6,bait
ntype = cluster
status = rectime=1347958307,varattr=,jobs=,state=free,netload=3443115449,gres=,loadave=0.00,ncpus=4,physmem=5990356kb,availmem=12496476kb,totmem=14378956kb,idletime=87091,nusers=3,nsessions=3,sessions=1305 11233 30070,uname=Linux cert-06.cnaf.infn.it 2.6.32-279.5.1.el6.x86_64 #1 SMP Tue Aug 14 16:11:42 CDT 2012 x86_64,opsys=linux
gpus = 0
emitestbed19.cnaf.infn.it
state = down,offline
np = 1
properties = qwnodessl6
ntype = cluster
status = rectime=1347632818,varattr=,jobs=,state=free,netload=2338705,gres=,loadave=0.00,ncpus=1,physmem=1863464kb,availmem=6186072kb,totmem=6319904kb,idletime=1034,nusers=1,nsessions=1,sessions=2267,uname=Linux emitestbed19.cnaf.infn.it 2.6.18-308.13.1.el5 #1 SMP Tue Aug 21 18:44:22 EDT 2012 x86_64,opsys=linux
gpus = 0
emitestbed20.cnaf.infn.it
state = offline
np = 1
properties = qwnodessl6
ntype = cluster
status = rectime=1347958318,varattr=,jobs=,state=free,netload=408458961,gres=,loadave=0.00,ncpus=1,physmem=1863464kb,availmem=6141812kb,totmem=6319904kb,idletime=321715,nusers=1,nsessions=1,sessions=2267,uname=Linux emitestbed20.cnaf.infn.it 2.6.18-308.13.1.el5 #1 SMP Tue Aug 21 18:44:22 EDT 2012 x86_64,opsys=linux
gpus = 0
[root@emi-demo13 ~]#
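In the listing above the bait hosts are free while the virtual machine entries show down or offline until a VM is started. To spot this at a glance, the pbsnodes output can be filtered as in this sketch. The function name nodes_not_free is hypothetical, and the parsing assumes the layout shown above (a node name on its own line, followed by "state = ..." lines).

```shell
# nodes_not_free: hypothetical filter. Reads `pbsnodes -a` output on stdin
# and prints "node state" for every node whose state is not exactly "free".
nodes_not_free() {
    awk '
        NF == 1 && $0 !~ /=/ { node = $1; next }     # node name line
        /^[ \t]*state[ \t]*=/ {                      # "state = ..." line only
            state = $0
            sub(/^[ \t]*state[ \t]*=[ \t]*/, "", state)
            sub(/[ \t]+$/, "", state)
            if (state != "free") print node, state
        }'
}

# Example:
#   pbsnodes -a | nodes_not_free
```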
Service Testing
On WN hosting Wnodes server
- Check daemons:
On Torque server
- Enter a pool account user: $> su - tst01
- Submit a test job
- $> qsub -q qwnodes test.sh -> (where test.sh is a bash script containing commands such as /bin/hostname)
- $> qstat -a
Check that the job output (hostname) shows a virtual host.
- Submit a grid test job
glite-ce-job-submit -d -r emi-demo13.cnaf.infn.it:8443/cream-pbs-qwnodes -a test.jdl
2012-09-18 11:43:49,108 INFO - *************************************
2012-09-18 11:43:49,109 INFO - CREAM User Interface version 1.2.0 - Starting at Tue Sep 18 11:43:49 2012
2012-09-18 11:43:49,109 DEBUG - Using certificate proxy file [/tmp/x509up_u500]
2012-09-18 11:43:49,137 INFO - VO from certificate=[testers.eu-emi.eu]
2012-09-18 11:43:49,139 WARN - No configuration file suitable for loading. Using built-in configuration
2012-09-18 11:43:49,140 INFO - Logfile is [/tmp/glite_cream_cli_logs/glite-ce-job-submit_CREAM_dongiovanni_20120918-114349.log]
2012-09-18 11:43:49,141 DEBUG - Processing file [/home/dongiovanni/Test.sh]...
2012-09-18 11:43:49,142 DEBUG - Inserting mangled InputSandbox in JDL: [{"/home/dongiovanni/Test.sh"}]...
2012-09-18 11:43:49,149 INFO - Registering to [http://emi-demo13.cnaf.infn.it:8443/ce-cream/services/CREAM2] JDL=[ StdOutput = "test.out"; BatchSystem = "pbs"; QueueName = "qwnodes"; ShallowRetryCount = 3; RetryCount = 3; Executable = "Test.sh"; VirtualOrganisation = "testers.eu-emi.eu"; outputsandboxbasedesturi = "gsiftp://localhost"; OutputSandbox = { "test.out","test.err" }; InputSandbox = { "/home/dongiovanni/Test.sh" }; StdError = "test.err" ]
2012-09-18 11:43:49,150 INFO - certUtil::generateUniqueID() - Generated DelegationID: [0cde6d23c197f82743992bdcdcc5ebb725a5742e]
2012-09-18 11:43:50,706 INFO - JobID=[https://emi-demo13.cnaf.infn.it:8443/CREAM575922133]
2012-09-18 11:43:50,707 INFO - UploadURL=[gsiftp://emi-demo13.cnaf.infn.it/var/cream_sandbox/testers/CN_Danilo_Nicola_Dongiovanni_L_CNAF_OU_Personal_Certificate_O_INFN_C_IT_testers_eu_emi_eu_Role_NULL_Capability_NULL_tst29/57/CREAM575922133/ISB]
2012-09-18 11:43:50,710 INFO - Sending file [gsiftp://emi-demo13.cnaf.infn.it/var/cream_sandbox/testers/CN_Danilo_Nicola_Dongiovanni_L_CNAF_OU_Personal_Certificate_O_INFN_C_IT_testers_eu_emi_eu_Role_NULL_Capability_NULL_tst29/57/CREAM575922133/ISB/Test.sh]
2012-09-18 11:43:51,247 INFO - Now invoking JobStart for JobID [https://emi-demo13.cnaf.infn.it:8443/CREAM575922133]
https://emi-demo13.cnaf.infn.it:8443/CREAM575922133
[dongiovanni@emitestbed08 ~]$ glite-ce-job-status https://emi-demo13.cnaf.infn.it:8443/CREAM575922133
****** JobID=[https://emi-demo13.cnaf.infn.it:8443/CREAM575922133]
Status = [IDLE]
[dongiovanni@emitestbed08 ~]$ glite-ce-job-status https://emi-demo13.cnaf.infn.it:8443/CREAM575922133
****** JobID=[https://emi-demo13.cnaf.infn.it:8443/CREAM575922133]
Status = [RUNNING]
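The job goes from IDLE to RUNNING once the virtual machine is up, so the status usually has to be queried more than once. A minimal sketch for extracting the status from the glite-ce-job-status output follows. The function name cream_job_status is an assumption for illustration, and the parsing relies only on the "Status = [...]" line shown above.

```shell
# cream_job_status: hypothetical filter. Reads glite-ce-job-status output
# on stdin and prints the text found inside "Status = [...]".
cream_job_status() {
    sed -n 's/^[[:space:]]*Status[[:space:]]*=[[:space:]]*\[\(.*\)\].*$/\1/p' \
        | head -n 1
}

# Example polling loop (JOBID holds the https://... identifier returned at submission):
#   while [ "$(glite-ce-job-status "$JOBID" | cream_job_status)" = "IDLE" ]; do
#       sleep 30
#   done
```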
Notes & Service Troubleshooting
- IMAGE CONFIGURATION:
- To use Wnodes through a grid job you need to configure an emiWN in the image. As wnlist, just put the hostname.
- Problems starting virtual machines can occur when the wnodes rpm versions on the image are not up to date, or when the configuration files differ from those on the bait host. Remember to check that they are consistent. A better approach would be to share those files through a shared file system.
- Note also that the fetch-crl operation at virtual machine boot can take a long time. A shared file system is suggested for CRL handling too.
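Since wnodes_preexec.conf must be the same on the bait host and on the template image, a quick comparison that ignores comments and blank lines can help catch drift. This is only a sketch: the helper names effective_config and configs_match are hypothetical, and the path of the image copy in the example is a placeholder for wherever the image file system is mounted or copied.

```shell
# effective_config FILE: hypothetical helper that prints FILE without
# comment lines and blank lines.
effective_config() {
    grep -v '^[[:space:]]*#' "$1" | grep -v '^[[:space:]]*$'
}

# configs_match FILE1 FILE2: succeeds when both files have the same
# effective (non-comment, non-blank) content.
configs_match() {
    [ "$(effective_config "$1")" = "$(effective_config "$2")" ]
}

# Example (the second path is a placeholder, not a real location):
#   configs_match /etc/wnodes/site_specific/wnodes_preexec.conf \
#                 /path/to/image/etc/wnodes/site_specific/wnodes_preexec.conf \
#       || echo "wnodes_preexec.conf differs between bait host and image"
```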
--
DaniloDongiovanni - 04-May-2012