A look at the current status of clustering and cluster filesystems

OCFS2 from Oracle

Quote: It is an extent based, POSIX compliant file system. Unlike the previous release (OCFS), OCFS2 is a general-purpose file system [...] OCFS2 uses the journaling block device (as per ext3), and other ext3 code. It was meant as a cluster filesystem to hold Oracle system files in order to ease the administration of Oracle Real Application Clusters (RAC).

Requirements:

  • a (recent) 2.6 kernel (as in RHEL4, for example), kernel patches, and the userspace tools. The patches are in the -mm series.
  • shared block device (shared physical or DRBD)

(my) highlights:

  • POSIX compliant, open source under the GPL, freely available; commercial support only if you have Oracle support and keep Oracle running on it
  • As for physical limits, a filesystem with a cluster size of 4K can grow to 16TB. [...] However, the current codebase cannot handle filesystems as large as the disk format can. 4k block size on a 32bit platform gives 4TB fs size maximum, and half that on 64bit platforms.
  • Maximum 254 nodes per cluster
  • claimed to have high performance (metadata caching!)
  • metadata journaling for reliability
  • no separate metadata server
  • can be used as shared root filesystem (!)
  • Not supported yet: clusterwide flock(), F_NOTIFY, shared writeable mmap, readonly mount
  • no quotas either
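
To give an idea of what a setup involves (I have not tried this here, so take it as a sketch only): a minimal two-node configuration should look roughly like the following. The cluster name, node names, addresses, device and mount point are all made up, and the exact syntax (the file is indentation-sensitive) should be checked against the ocfs2-tools release you actually install.

An identical /etc/ocfs2/cluster.conf goes on every node:

cluster:
        node_count = 2
        name = testcluster

node:
        ip_port = 7777
        ip_address = 192.168.0.1
        number = 0
        name = node1
        cluster = testcluster

node:
        ip_port = 7777
        ip_address = 192.168.0.2
        number = 1
        name = node2
        cluster = testcluster

Then bring up the o2cb cluster stack on each node, create the filesystem once, and mount it everywhere:

/etc/init.d/o2cb load
/etc/init.d/o2cb online testcluster
mkfs.ocfs2 -N 2 -L testvol /dev/sdb1    # -N = number of node slots
mount -t ocfs2 /dev/sdb1 /mnt/ocfs2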

-- AndrasHorvath - 11 Aug 2005

Red Hat's Global File System GFS

Meant as a cluster filesystem. Comes with the Red Hat Cluster Suite, which gives you failover and load balancing for a maximum of 8 nodes (for NFS, for example). GFS itself scales up to ~256 nodes (see below).

Requirements:

  • RHEL3 or 4 on any supported architecture
  • shared block device (SAN or GNBD or iSCSI or whatever; can run on a GNBD server with limitations). Fencing agents (to prevent a failed node from rejoining the cluster) exist only for certain FC switches and for GNBD -> those are what Red Hat suggests.

(my) highlights:

  • POSIX compliant, GPL licensed; pricing (???) is unclear!
  • up to 256 nodes (on the manual page @ redhat, more than 300 nodes claimed to work; on the marketing page, 256 are "supported")
  • 8TB storage maximum on RHEL4 (this is new in 6.1)
  • quota support
  • LVM2 / CLVM (clustered LVM), multipath, distributed lock mgmt (=no "lock nodes") in GFS 6.1 RHEL4 (not on 6.0 @ RHEL3)
  • supports "freezing" of a volume (suspends I/O for blockdevice-based snapshots for example)
  • can do data journaling (!), metadata is always journaled
  • node failures are OK; storage failures call for fsck
  • easy to expand, cannot shrink; although clvm can probably rearrange blocks in the background (never tested, works with "normal" LVM)
  • Context-Dependent Path Names a la AFS
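
Again not tested here, but to show the moving parts: creating and mounting a GFS 6.1 filesystem on top of CLVM should look roughly like the commands below. The volume group, cluster name, size, mount point and journal count are made up, and the cluster infrastructure and clvmd are assumed to be running already.

# make a clustered logical volume out of the shared device
pvcreate /dev/sdb
vgcreate -c y vg_gfs /dev/sdb
lvcreate -L 100G -n lv_data vg_gfs

# one journal (-j) per node that will mount the FS; lock_dlm = distributed lock manager
gfs_mkfs -p lock_dlm -t mycluster:data -j 8 /dev/vg_gfs/lv_data

# then on each node:
mount -t gfs /dev/vg_gfs/lv_data /srv/data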

The RHCS (http://www.redhat.com/whitepapers/rha/RHA_ClusterSuiteWPPDF.pdf):

  • two uses: load balancer or HA
  • uses Linux Virtual Server for load balancing (front-end node)
  • maximum 8 nodes
  • needs GFS, uses shared disk as quorum device. Needs FC or SCSI shared storage.
  • multipath I/O supported
  • service state checks are built in (for e.g. NFS), or you can supply a script which will be called every few seconds (a sketch follows this list)
  • node heartbeat by pinging over Ethernet (also supporting bonding for HA of Ethernet links)
  • will restart a service on another node if it fails somewhere. The usual fencing and barriers are provided. State information is only preserved in the filesystem, of course (e.g. connections will be broken).
  • Oracle 9i RAC supported

-- AndrasHorvath - 11 Aug 2005

OpenSSI (Single System Image) clustering

Requirements:

  • kernel 2.4 (RHEL3 is OK) or 2.6
  • shared storage if you want shared FS

(my) highlights:

  • GPLv2, uses LVS and CI (Cluster Infrastructure), packaged in Debian Sarge (stable) and FC3, i386 only (including SMP)
  • "An SSI cluster can be managed like a giant SMP machine." And it also looks like a giant SMP machine. But the architectural details are unclear.
  • CI clusters of up to 10..50 nodes have been tested (out there)
  • CFS: sits on top of ext2/3 on a shared disk and then "all nodes see the mount point". Mounts are handled via FS labels or UUIDs.
  • CFS can be used as root, can fail over (HA-CFS). Can also use GFS underneath OpenSSI.

  • Process migration (alone or in groups) works! So does clusterwide signaling etc. (see the sketch after this list)
  • after the first machine is up, additional machines' install is aided by the cluster (PXE)
  • failover: tcp connections die (as usual), services get restarted
  • LSF only works with up to 2 nodes; others work better.
  • MySQL and Oracle 9i RAC supported, OpenAFS too (1.2.13 anyway, perhaps others can run on an OpenSSI machine)
  • "nodedown" detection is 20 seconds (!), expected to improve (CI)
  • uses devfs (which is obsolete)
  • does NOT support EVMS or CLVM (=> no volume management!)
  • read http://openssi.org/cgi-bin/view?page=features.html and http://wiki.openssi.org/go/FAQ
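
To illustrate the single-system-image feel around processes (see the migration point above): the snippet below is how I understand it from the OpenSSI documentation, not from my own testing. The command names (onnode, where_pid) and the /proc/<pid>/goto interface are recalled from their docs and may well differ in the version you get, so treat every name here as an assumption and check the FAQ linked above.

# run a command on a specific node (the node number is made up)
onnode 2 hostname

# start something locally, then ask the cluster where it is running
sleep 600 &
where_pid $!

# manually migrate it to node 2 via the /proc interface (if enabled)
echo 2 > /proc/$!/goto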

-- AndrasHorvath - 11 Aug 2005

Linux-HA (http://www.linux-ha.org/)

This is not a filesystem but the oldest HA solution for Linux, based on heartbeat and restarting services. It's HA only, but usable with LVS for load balancing.

Requirements:

  • any Linux version, basically (FreeBSD and Solaris work too)

(my) highlights:

  • heartbeats over network, serial line etc. to monitor machine status. Also "ping hosts" can be defined to monitor a node's network connectivity.
  • "services" monitored by scripts, some common cases provided
  • no requirements for shared disk, but supports DRBD and shared FC storage
  • the classic/usual "I take over the service IP address if you die" approach. TCP connections die, of course.
  • (now, in v2): cluster consensus/quorum support
  • fencing is provided in form of "Shoot The Other Node In The Head"
  • I used this ~5-6 years ago already and it worked (it's about as simple as HA solutions get)
  • used by IT-CS in production for DNS & DHCP services (contact: Nick)
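
The (v1-style) heartbeat configuration is essentially two small files plus /etc/ha.d/authkeys for authenticating the heartbeat traffic; a sketch with made-up node names, addresses and service (the new v2 CRM/XML configuration looks quite different):

# /etc/ha.d/ha.cf -- how the two nodes watch each other
keepalive 2           # heartbeat interval in seconds
deadtime 30           # declare the peer dead after this long
bcast eth0            # heartbeat over ethernet...
serial /dev/ttyS0     # ...and over a serial link
ping 192.168.0.254    # "ping host" used to judge our own connectivity
auto_failback on
node ha1
node ha2

# /etc/ha.d/haresources -- who normally owns what:
# ha1 prefers to hold the service IP and run the named init script
ha1 IPaddr::192.168.0.100/24 named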

-- AndrasHorvath - 11 Aug 2005

Lustre

This is a filesystem only. It is claimed to scale to thousands of nodes, and there are real-world examples of that.

Requirements:

  • kernel 2.4 or 2.6
  • heavy and intrusive kernel patching
  • separate machines for clients and OSTs ("servers") and metadata servers

(my) highlights:

  • it works (we've tested version 1.0.4). Currently, 1.2 is open source and 1.4 can be bought.
  • the version tested was pretty hard to kill, and mostly the recovery procedures actually worked too.
  • promising white paper but most features are missing:
  • user authentication is based on UIDs; servers are not authenticated. GSSAPI has been promised for years.
  • access control is based on unix rights (no ACLs)
  • no redundancy; the storage is expected to be perfectly available. When an OST crashes, only that portion of the FS is inaccessible; however, there is no control whatsoever over which data goes to which OST. OSTs can, however, be used in a failover configuration with shared storage.
  • consequently, no data migration available (but is promised for upcoming versions)
  • can 'stripe' data across multiple (a fixed number or all available) OSTs (but no RAID); for decent performance, some striping needs to be done (see the sketch after this list).
  • 1.4 features two metadata servers maximum in a failover configuration (with shared storage). Losing your metadata server is a Bad Thing.
  • backup is via clients only (can be cumbersome with large filesystems; see above about losing a metadata server...)
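
Striping is controlled per file or per directory from a client with lfs (see the striping point above). Only a sketch: the setstripe syntax has changed between Lustre versions (1.4-era lfs takes the values positionally, later releases use -s/-i/-c options), and the path and numbers below are made up.

# stripe files created under this directory across 4 OSTs with a 1 MB stripe
# size, letting the MDS choose the starting OST (-1); 1.4-era positional syntax
lfs setstripe /mnt/lustre/scratch 1048576 -1 4

# check what layout a file actually got
lfs getstripe /mnt/lustre/scratch/somefile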

Panasas PanFS (or some other name)

  • "open source" client (you get the source now; it is to be really open-sourced later), closed-source servers (including the software and their proprietary hardware built from commodity components)
  • strong authentication of clients and server components
  • object-storage extensions to an almost-POSIX filesystem: per-file RAID level settings (!), which can be changed online
  • also, online migration to balance storage
  • cache-coherent multiple metadata servers
  • "about 3-4 times the price of a disk server per gigabyte"
  • interconnect: GE inside the cluster, possible to go native Infiniband (iSER) /Quadrics/etc if a switch will convert

IBM GPFS (General Parallel Filesystem)

Documentation: http://publib.boulder.ibm.com/infocenter/clresctr/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfsbooks.html

  • the SLC3 kernel needs patching (for mmap'ed files and NFS), SLC4 (RHEL4) doesn't. I didn't patch it for a start, since the patch is of course for an ancient kernel version and needs manual hacking and understanding.
  • make sure all GPFS updates are applied
  • after all RPMs are installed the 'compiling from source' part still needs to be completed: see /usr/lpp/mmfs/src/README
  • for SLC3, set distro type to RedHat Advanced Server and edit /etc/redhat-release to include the string "Advanced Server" doh
  • general code quality looks horrid (#include'ing multiple C files in C files, overwhelming #defines etc.) and so does the documentation. Quoting (note that we are in 2005 and this is the documentation for the latest release...): The default remote shell command is rsh. This requires that a properly configured .rhosts file exist in the root user's home directory on each node in the GPFS cluster. Either way it needs to be able to log in as root (!) on all nodes; to use ssh instead, see the sketch after this list.
  • there's quite a lot of documentation but none of it is straightforward (single commands are explained, but there is no "step-by-step" guide or at least a "thou shalt run commands in the following order" list).
  • on the plus side: Supports quotas. Supports all features of cluster theory from the 1970's to today.
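
Since I'd rather not enable rsh and .rhosts, the practical try below points mmcrcluster at ssh/scp instead (the -r/-R flags). All that needs is passwordless root ssh between all nodes, e.g. roughly along these lines (plain OpenSSH, nothing GPFS-specific; done from lxdev13):

# create a passphrase-less key for root and distribute it to both nodes
ssh-keygen -t rsa -N '' -f /root/.ssh/id_rsa
cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
scp /root/.ssh/id_rsa /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys lxdev14:/root/.ssh/
# verify that both directions (and each node to itself) work without a password prompt
ssh lxdev13 date; ssh lxdev14 date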

If you're brave-hearted, a practical try:

  • scenario: we have two nodes (lxdev13 and 14) using a shared Fibre Channel disk at /dev/sdb.
  • have it installed on both nodes, including the source code compilation/installation part.
  • create a node file. It should look something like this (note the colons):
lxdev13:manager-quorum::
lxdev14
  • run mmcrcluster -n mmfs.nodes -p lxdev13 -r `which ssh` -R `which scp`
  • we're creating GPFS on a shared (FC) disk. So put into a file (say, mmfs.mmcrnsd.descfile) for example:

/dev/sdb:lxdev13::dataAndMetadata::mysdb

This file will be rewritten by GPFS (presumably by mmcrnsd, which needs to be run on it before mmcrfs), so don't put it in /tmp.

  • run mmstartup on all nodes (?)
  • run mmcrfs /srv/gpfstest mydisk -F mmfs.mmcrnsd.descfile (which will create a filesystem).
  • shoot yourself in the foot (in order to start getting used to it).
  • lo and behold, 'mount -a' mounts the disk on both nodes (I mean issuing this on both nodes gets it mounted on both nodes). I just hope this is right.
  • if you try and mmfsck mydisk it says it's mounted on 2 nodes, so probably it IS right...
  • reading, writing and locking (! see the attached flock.pl) work fine; a rough shell equivalent of the locking test is sketched after this list
  • echo b > /proc/sysrq-trigger on lxdev13 effectively stops lxdev14 from using the FS as well ("/srv/gpfstest: Stale NFS file handle"). Even after lxdev13 is up again and GPFS is restarted there, the service is unavailable. mmshutdown on lxdev14 then complains (and hangs).
  • mmstartup -a actually starts GPFS on all nodes. Before doing this, lxdev14 was designated secondary master in the cluster (using mmchcluster). Filesystem got mounted by GPFS.
  • another test: the usual flock.pl on both sides, then cutting the network connection by adding a blackhole route towards lxdev14 on lxdev13 (ip route add blackhole). Strange: after some timeout, both sides were allowed to proceed; then, when the network was re-enabled, the file in question had no corrupt records and the same sha1sum as seen from both nodes. However, the writes from lxdev14 did not actually occur! The node which held the lock (lxdev13) wrote everything, so GPFS on lxdev14 was lying about the writes! Or maybe not, need to check the return value. Okay, it's Perl which is lying.
  • so once again, echo b > /proc/sysrq-trigger on lxdev13 stopped the show. For a while. Then lxdev14 continued writing!
  • starting up lxdev13 again, mmstartup got it back and both nodes were able to happily access the filesystem. Files are consistent.
  • running another flock+"ip route add blackhole" test, this time in C (see the attached f.c), resulted in another 'Stale NFS file handle' even after restoring the network. Weird! mmshutdown, of course, fell belly up.
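
The attached flock.pl and f.c are the test programs actually used above. Just to illustrate the kind of test, a rough shell stand-in using util-linux's flock(1) (where available) would be something like the following; the paths are made up to match the mount point above, and note the explicit exit-status check, which is the part the Perl version turned out to be sloppy about:

# run this simultaneously on lxdev13 and lxdev14
touch /srv/gpfstest/testdata
i=0
while [ $i -lt 1000 ]; do
    # take an exclusive lock, append one record, release; complain if the write failed
    flock -x /srv/gpfstest/testdata -c \
        "echo $(hostname) $i $(date +%s) >> /srv/gpfstest/testdata" \
        || echo "iteration $i failed on $(hostname)" >&2
    i=$((i+1))
done
sha1sum /srv/gpfstest/testdata    # compare the result from both nodes afterwards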

-- AndrasHorvath - 26 Oct 2005

Topic attachments:

  • f.c: flock() test in C, with more control over what's happening (0.7 K, 2005-10-26)
  • flock.pl: file locking test (to be started on multiple nodes) (0.3 K, 2005-10-26)