OS level stuff
- watch the SCSI disk order!! (system disks vs. FC disks). See how the itrac* cluster is set up.
- allow SPMA to install user packages (for the testbed anyway):
rpm -ivh heartbeat-2.0.1-1.cern.1.i386.rpm heartbeat-pils-2.0.1-1.cern.1.i386.rpm heartbeat-stonith-2.0.1-1.cern.1.i386.rpm
- I recommend /srv/somethingsomething as mount point for shared disks (as per the Filesystem Hierarchy Standard)
- set up internal (heartbeat) IP addresses! And don't forget to open the heartbeat port in iptables.
- do make sure that /etc/ha.d is the same across cluster members (and use multicast and/or different port numbers if multiple clusters share a network)
- if you want to stop or restart a service, make sure it's stopped on both nodes: stopping Linux-HA on the primary node will cause the service to fail over to the backup (as expected).
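The multicast/port advice above can be sketched as an ha.cf fragment (the group address and interface are example values, not something tested on this testbed; pick a distinct group/port per cluster):

```
# ha.cf fragment: multicast heartbeat instead of broadcast, so several
# clusters can share a network without hearing each other.
# syntax: mcast <dev> <mcast group> <udp port> <ttl> <loop>
mcast eth1 239.0.0.43 694 1 0
# and let the heartbeat traffic through iptables on the internal
# interface, e.g.:
# iptables -A INPUT -i eth1 -p udp --dport 694 -j ACCEPT
```

694/udp is heartbeat's default port; clusters that must stay separate should use different ports and/or multicast groups.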
Heartbeat (aka Linux-HA) setup
Q&A
- Q: How do I manually move the resources back after a failover? A:
/usr/lib/heartbeat/hb_standby failback
(to be issued on the node which is to return the resources to their rightful owner)
- Q: How to simulate a host failure? A:
[root@lxdev13 root]# sync ; echo b > /proc/sysrq-trigger
(SysRq 'b' reboots the box immediately, without syncing or unmounting filesystems; the sync beforehand just limits the collateral damage)
Basic test: one IP and one filesystem taken over by the other machine
Yes, this works: lxdev14 took over when lxdev13 had a simulated failure.
After rebooting, lxdev13 came back, and
hb_standby
successfully moved the mounted filesystem back. Then:
[root@lxdev13 root]# sync ; echo b > /proc/sysrq-trigger
and lxdev14 took over again.
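A quick way to see which node currently holds the test resources is to check for the service IP and the mountpoint from haresources (a sketch; adjust the address and path to whatever your haresources actually says):

```shell
# Sketch: does this node currently hold the test resources?
# 192.168.0.160 and /srv/hatest are the values from haresources.
holds_resources() {
  # service IP configured on some interface?
  ip addr show 2>/dev/null | grep -q '192\.168\.0\.160' || return 1
  # shared filesystem mounted?
  mountpoint -q /srv/hatest || return 1
  echo "this node holds the service IP and the filesystem"
}
```

Run it on both nodes before and after a failover; exactly one node should report holding the resources.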
Basic Test config
[root@lxdev14 ha.d]# chmod 600 authkeys
[root@lxdev14 ha.d]# cat authkeys
# authentication of nodes (shared pw)
auth 1
1 sha1 linuxgridhatestclusterauthpassword
[root@lxdev14 ha.d]# cat haresources
# resources: one service IP and one filesystem
lxdev13.cern.ch IPaddr::192.168.0.160 Filesystem::/dev/sda1::/srv/hatest::xfs
#
[root@lxdev14 ha.d]# cat ha.cf
# use syslog
logfacility local0
# heartbeat parameters
keepalive 1
deadtime 10
warntime 5
initdead 60
bcast eth1
# fail back to the original owners?
auto_failback off
# who's a cluster member? (uname -n)
node lxdev13.cern.ch
node lxdev14.cern.ch
[root@lxdev14 ha.d]#
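The keepalive/deadtime/warntime values above follow the usual rule of thumb (deadtime well above keepalive, warntime in between). A small sketch that checks this mechanically (check_hacf is a hypothetical helper, not part of heartbeat):

```shell
# Sketch: sanity-check the timing parameters in an ha.cf-style file.
# Rules of thumb: deadtime >= 2*keepalive, warntime < deadtime.
check_hacf() {
  f="$1"
  keepalive=$(awk '$1=="keepalive"{print $2}' "$f")
  deadtime=$(awk '$1=="deadtime"{print $2}' "$f")
  warntime=$(awk '$1=="warntime"{print $2}' "$f")
  ok=yes
  [ "$deadtime" -lt $((keepalive * 2)) ] && { echo "deadtime < 2*keepalive"; ok=no; }
  [ "$warntime" -ge "$deadtime" ] && { echo "warntime >= deadtime"; ok=no; }
  [ "$ok" = yes ] && echo "ha.cf timings look sane"
}
```

With the values above (keepalive 1, deadtime 10, warntime 5) it prints "ha.cf timings look sane".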
Fencing
- FC hardware is not self-fencing. However, one can use:
- resetboards (a fencing agent needs to be written)
- serial-console-based SysRq/b (ditto)
- SNMP or similar to disable a port on the FC switch; an agent needs to be written; high probability of (human) error
- or we can use DRBD instead of shared FC storage
TODO
- get a "service ip address" from landb (for the time being I'll use 192.168.0.160)
- fencing? STONITH? how do we do that? (see Fencing above)
- get serial consoles working for the test nodes (argh...)
- test unicast and multicast heartbeat
- test some sort of service
--
AndrasHorvath - 26 Sep 2005
Topic revision: r3 - 2005-09-27