OS level stuff
- watch the SCSI disk order!! (system disks vs. FC disks). See how the itrac* cluster is set up.
- allow SPMA to install user packages (for the testbed anyway):
rpm -ivh heartbeat-2.0.1-1.cern.1.i386.rpm heartbeat-pils-2.0.1-1.cern.1.i386.rpm heartbeat-stonith-2.0.1-1.cern.1.i386.rpm
- I recommend /srv/somethingsomething as mount point for shared disks (as per the Filesystem Hierarchy Standard)
- set up internal (heartbeat) IP addresses! And don't forget to open the heartbeat port in iptables.
- do make sure that /etc/ha.d is the same across cluster members (and use multicast and/or different port numbers if multiple clusters share a network)
- if you want to stop or restart a service, make sure it's stopped on both nodes: stopping Linux-HA on the primary node will cause the service to fail over to the backup (as expected).
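The multicast/port advice above can be sketched as an ha.cf fragment (the group address and interface are example values, not something tested on this testbed; pick a distinct group/port per cluster):

```
# ha.cf fragment: multicast heartbeat instead of broadcast, so several
# clusters can share a network without hearing each other.
# syntax: mcast <dev> <mcast group> <udp port> <ttl> <loop>
mcast eth1 239.0.0.43 694 1 0
# and let the heartbeat traffic through iptables on the internal
# interface, e.g.:
# iptables -A INPUT -i eth1 -p udp --dport 694 -j ACCEPT
```

694/udp is heartbeat's default port; clusters that must stay separate should use different ports and/or multicast groups.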
Heartbeat (aka Linux-HA) setup
Q&A
- Q: How do I manually move the resources back after a failover? A:
/usr/lib/heartbeat/hb_standby failback
(to be issued on the node which is to return the resources to their rightful owner)
- Q: How to simulate a host failure? A:
[root@lxdev13 root]# sync ; echo b > /proc/sysrq-trigger
(SysRq 'b' reboots the box immediately, without syncing or unmounting filesystems; the sync beforehand just limits the collateral damage)
Basic test: one IP and one filesystem taken over by the other machine
Yes, this works: lxdev14 took over when lxdev13 had a simulated failure.
After rebooting, lxdev13 came back, and
hb_standby
successfully moved the mounted filesystem back. Then:
[root@lxdev13 root]# sync ; echo b > /proc/sysrq-trigger
and lxdev14 took over again.
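A quick way to see which node currently holds the test resources is to check for the service IP and the mountpoint from haresources (a sketch; adjust the address and path to whatever your haresources actually says):

```shell
# Sketch: does this node currently hold the test resources?
# 192.168.0.160 and /srv/hatest are the values from haresources.
holds_resources() {
  # service IP configured on some interface?
  ip addr show 2>/dev/null | grep -q '192\.168\.0\.160' || return 1
  # shared filesystem mounted?
  mountpoint -q /srv/hatest || return 1
  echo "this node holds the service IP and the filesystem"
}
```

Run it on both nodes before and after a failover; exactly one node should report holding the resources.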
Basic Test config
[root@lxdev14 ha.d]# chmod 600 authkeys
[root@lxdev14 ha.d]# cat authkeys
# authentication of nodes (shared pw)
auth 1
1 sha1 linuxgridhatestclusterauthpassword
[root@lxdev14 ha.d]# cat haresources
# resources: one service IP and one filesystem
lxdev13.cern.ch IPaddr::192.168.0.160 Filesystem::/dev/sda1::/srv/hatest::xfs
#
[root@lxdev14 ha.d]# cat ha.cf
# use syslog
logfacility local0
# heartbeat parameters
keepalive 1
deadtime 10
warntime 5
initdead 60
bcast eth1
# fail back to the original owners?
auto_failback off
# who's a cluster member? (uname -n)
node lxdev13.cern.ch
node lxdev14.cern.ch
[root@lxdev14 ha.d]#
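The keepalive/deadtime/warntime values above follow the usual rule of thumb (deadtime well above keepalive, warntime in between). A small sketch that checks this mechanically (check_hacf is a hypothetical helper, not part of heartbeat):

```shell
# Sketch: sanity-check the timing parameters in an ha.cf-style file.
# Rules of thumb: deadtime >= 2*keepalive, warntime < deadtime.
check_hacf() {
  f="$1"
  keepalive=$(awk '$1=="keepalive"{print $2}' "$f")
  deadtime=$(awk '$1=="deadtime"{print $2}' "$f")
  warntime=$(awk '$1=="warntime"{print $2}' "$f")
  ok=yes
  [ "$deadtime" -lt $((keepalive * 2)) ] && { echo "deadtime < 2*keepalive"; ok=no; }
  [ "$warntime" -ge "$deadtime" ] && { echo "warntime >= deadtime"; ok=no; }
  [ "$ok" = yes ] && echo "ha.cf timings look sane"
}
```

With the values above (keepalive 1, deadtime 10, warntime 5) it prints "ha.cf timings look sane".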
Fencing
- FC hardware is not self-fencing. However, one can use:
- resetboards (a fencing agent needs to be written)
- serial-console-based SysRq/b (ditto)
- SNMP or similar to disable a port on the FC switch; an agent needs to be written; high probability of (human) error
- or we can use DRBD instead of shared FC storage
TODO
- get a "service ip address" from landb (for the time being I'll use 192.168.0.160)
- fencing? STONITH? how do we do that? (see Fencing above)
- get serial consoles working for the test nodes (argh...)
- test unicast and multicast heartbeat
- test some sort of service
--
AndrasHorvath - 26 Sep 2005
Topic revision: r3 - 2005-09-27