OS level stuff

  • watch the SCSI disk order (system disks vs. FC disks)! See how the itrac* cluster is set up.
  • allow SPMA to install user packages (for the testbed anyway)
  • rpm -ivh heartbeat-2.0.1-1.cern.1.i386.rpm heartbeat-pils-2.0.1-1.cern.1.i386.rpm heartbeat-stonith-2.0.1-1.cern.1.i386.rpm
  • I recommend /srv/somethingsomething as mount point for shared disks (as per the Filesystem Hierarchy Standard)
  • set up internal (heartbeating) IP addresses, and don't forget about iptables.
  • do make sure that /etc/ha.d is the same across cluster members (and use multicast and/or different port numbers if multiple clusters share a network)
  • if you want to stop or restart a service make sure it's stopped on both nodes - stopping linux-ha on the primary node will cause the service to fail over to the backup (as expected).
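
One way to check that /etc/ha.d really is the same on both nodes is to compare checksums of the three configuration files. The following is only a sketch: the function names hb_conf_sums and hb_conf_check are made up for this page, and it assumes GNU md5sum and working ssh between the nodes.

```shell
# Hypothetical helpers (not part of heartbeat): compare the heartbeat
# configuration files of two nodes by checksum.

# Print md5 checksums of the three config files found in directory $1.
hb_conf_sums() {
    ( cd "$1" && md5sum ha.cf haresources authkeys )
}

# Read a checksum list on stdin and verify it against the files in
# directory $1; returns non-zero if any file differs or is missing.
hb_conf_check() {
    ( cd "$1" && md5sum -c - )
}

# Example, run on lxdev13 to compare against lxdev14:
#   ssh lxdev14 'cd /etc/ha.d && md5sum ha.cf haresources authkeys' \
#       | hb_conf_check /etc/ha.d
```

md5sum -c reports OK or FAILED per file, so a mismatch points directly at the file that has drifted.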

Heartbeat (aka Linux-HA) setup

Q&A

  • Q: How do I manually move the resources back after a failover? A: /usr/lib/heartbeat/hb_standby failback (to be issued on the node which is to return the resources to their rightful owner)
  • Q: How to simulate a host failure? A: [root@lxdev13 root]# sync ; echo b > /proc/sysrq-trigger
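
Because echo b > /proc/sysrq-trigger reboots the machine immediately, without unmounting filesystems, it is easy to hit the wrong node by accident. A small guard like the one below may help. It is a sketch only; the function name simulate_crash and the REALLY_CRASH variable are inventions for this example.

```shell
# Hypothetical guard around the failure simulation (names are made up).
# Usage: simulate_crash <expected output of uname -n>
# Only does a dry run unless REALLY_CRASH=yes is set in the environment.
simulate_crash() {
    expected="$1"
    actual="$(uname -n)"
    if [ "$actual" != "$expected" ]; then
        echo "refusing: this is $actual, not $expected" >&2
        return 1
    fi
    if [ "${REALLY_CRASH:-no}" != "yes" ]; then
        echo "dry run: would run: sync ; echo b > /proc/sysrq-trigger"
        return 0
    fi
    sync
    echo b > /proc/sysrq-trigger
}

# Example (as root on lxdev13):
#   REALLY_CRASH=yes simulate_crash lxdev13.cern.ch
```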

Basic test: one IP and one filesystem taken over by the other machine

Yes, this works: lxdev14 took over when lxdev13 had a simulated failure (which reboots lxdev13).
Once lxdev13 was back, hb_standby successfully moved the mounted filesystem back.
Running [root@lxdev13 root]# sync ; echo b > /proc/sysrq-trigger a second time made lxdev14 take over again.
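
To see which node currently holds the resources, one can look for the service IP and the mounted filesystem. A rough sketch, using the address and mount point from this test: the helper names has_ip and is_mounted are made up for illustration, and the ip command (iproute) is assumed to be installed.

```shell
# Hypothetical helpers for checking where the resources currently live.

# Reads "ip -o -4 addr show" output on stdin; succeeds if address $1 is present.
has_ip() {
    grep -qF "inet $1/"
}

# Succeeds if $1 is a mount point in the mount table $2 (default /proc/mounts).
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Example, to be run on either node:
#   ip -o -4 addr show | has_ip 192.168.0.160 && is_mounted /srv/hatest \
#       && echo "$(uname -n) holds the resources"
```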

Basic Test config


[root@lxdev14 ha.d]# chmod 600 authkeys 
[root@lxdev14 ha.d]# cat authkeys 
# authentication of nodes (shared pw)
auth 1
1 sha1 linuxgridhatestclusterauthpassword
[root@lxdev14 ha.d]# cat haresources 
# resources: one service IP and one filesystem
lxdev13.cern.ch IPaddr::192.168.0.160   Filesystem::/dev/sda1::/srv/hatest::xfs
#
[root@lxdev14 ha.d]# cat ha.cf 
# use syslog
logfacility     local0
# heartbeat parameters
keepalive 1
deadtime 10
warntime 5
initdead 60
bcast eth1
# fail back to the original owners?
auto_failback off
# who's a cluster member? (uname -n)
node lxdev13.cern.ch
node lxdev14.cern.ch
[root@lxdev14 ha.d]# 
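
A quick sanity check of these files can catch simple mistakes before heartbeat is started: each resource line in haresources should begin with a node name that is also declared in ha.cf. The script below is a sketch only; the function names are invented here, and it does not handle backslash-continued haresources lines.

```shell
# Hypothetical consistency check between ha.cf and haresources.

# Print the node names declared by "node" directives in ha.cf ($1).
hacf_nodes() {
    awk '$1 == "node" { for (i = 2; i <= NF; i++) print $i }' "$1"
}

# Check that the preferred owner (first field) of every resource line in
# haresources ($2) is declared in ha.cf ($1). Comment and blank lines are
# skipped; continuation lines are not handled.
check_owners() {
    rc=0
    for owner in $(awk 'NF && $1 !~ /^#/ { print $1 }' "$2"); do
        if ! hacf_nodes "$1" | grep -qxF "$owner"; then
            echo "haresources: $owner is not a node in ha.cf" >&2
            rc=1
        fi
    done
    return $rc
}

# Example: check_owners /etc/ha.d/ha.cf /etc/ha.d/haresources
```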

Fencing

  • FC hardware is not self-fencing. However, one can use
    • resetboards (a fencing agent needs to be written)
    • serial console based SysRq/b (ditto)
    • SNMP or similar to disable a port on the FC switch; agent needs to be written; high probability of (human) error
  • or we can use DRBD instead of shared FC disks

TODO

  • get a "service ip address" from landb (for the time being I'll use 192.168.0.160)
  • fencing? STONITH? how do we do that? (see Fencing above)
  • get serial consoles working for the test nodes (argh...)
  • test unicast and multicast heartbeat
  • test some sort of service

-- AndrasHorvath - 26 Sep 2005

Topic revision: r3 - 2005-09-27 - unknown
 