Lustre Server Setup
I created a ~46 TB Lustre file system on 25 disk servers. The Lustre version is 1.6.4.2, using the official RPMs from lustre.org.
On every disk server a 2 TB file is created in /shift/xns/lustre/, formatted with mkfs.lustre and then mounted via a loopback device.
On the head node I created an 8 GB MDS file system inside the /pool directory, also mounted via a loopback device.
dd if=/dev/zero of=/pool/lustremds bs=1000 count=0 seek=8589934592
mkfs.lustre --fsname=tier2fs --mdt --mgs --backfstype=ldiskfs --device-size=8000000 --reformat /pool/lustremds
mkdir /mdt
mount -o loop -t lustre /pool/lustremds /mdt/
wassh -l root -e "lxb8971" -c fileserver/dpm-test "mkdir -p /shift/xns/lustre/"
wassh -l root -e "lxb8971" -c fileserver/dpm-test "dd if=/dev/zero of=/shift/xns/lustre/2TB bs=1024 count=0 seek=2000000000"
wassh -l root -e "lxb8971" -c fileserver/dpm-test "mkfs.lustre --fsname tier2fs --ost --mgsnode=lxb8971@tcp0 --backfstype=ldiskfs --device-size=2000000000 --reformat /shift/xns/lustre/2TB"
############################################################################################################
# if you want an EIO in case an OST goes down, add --param="failover.mode=failout" to the arguments above!!!!
############################################################################################################
wassh -l root -e "lxb8971" -c fileserver/dpm-test "mkdir -p /ost0"
wassh -l root -e "lxb8971" -c fileserver/dpm-test "mount -t lustre -o loop /shift/xns/lustre/2TB /ost0/"
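The dd invocations above create sparse files: with count=0 nothing is written, and bs x seek only sets the apparent size. A tiny runnable demo of the same pattern (the /tmp path and the small size are assumptions for the demo):

```shell
# Sparse-file pattern used for the OST/MDS backing files above:
# count=0 writes no data; bs * seek sets the apparent file size.
dd if=/dev/zero of=/tmp/sparse-demo bs=1024 count=0 seek=1000 2>/dev/null
stat -c %s /tmp/sparse-demo   # apparent size: 1024000 bytes
du -k /tmp/sparse-demo        # allocated space: typically 0 KB
rm -f /tmp/sparse-demo
```

The 2 TB backing files on the disk servers therefore cost no disk space until Lustre actually writes into them.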
# to mount the filesystem on a client machine do
mkdir /tier2fs
mount -t lustre -o user_xattr,acl lxb8971.cern.ch:/tier2fs /tier2fs
Lustre Client Setup
I have set up the clients lxb8901-8930. They run the default CERN kernel, and I recompiled Lustre against the CERN kernel on each machine.
Unpack the sources or use source rpm:
./configure --with-linux=/usr/src/kernels/2.6.9-67.0.1.EL.cern-smp-x86_64/
make
make install
The LustreFS /tier2fs is mounted like:
wassh -l root "mkdir /tier2fs; depmod; modprobe lustre"
wassh -l root -h lxb89[01-30] "mount -t lustre -o user_xattr,acl lxb8971.cern.ch:/tier2fs /tier2fs"
Trivial tests
File creation (file size 8 bytes) from a standard CERN desktop (100 Mbit): 250 files/s. File creation on one of the disk servers: 760 files/s.
Copying a 1 GB file from a neighbouring disk server takes 18 s. Copying a 1 GB file on the disk server which holds the file locally (via the mounted directory): 10 s.
General observation: directory listing with 'ls' is relatively slow for directories with many entries (> O(1000)), presumably because 'ls' stats every entry and the file sizes have to be fetched from the OSTs. A 'find' command gives very fast results even for thousands of files, since it only reads the MDS.
'ldiskfs'
ldiskfs allows mounting the Lustre device like a normal FS, with the Lustre internal FS structure visible. The same works on the MDS node, which stores the file metadata. 'getfattr' can dump all the metadata information and can be used to make a backup of a Lustre MDS.
mount -t ldiskfs -o loop /shift/xns/lustre/2TB /mnt/lustre
[root@lxfsrd4005 tier2fs]# ls -la /mnt/lustre/
total 52
drwxr-xr-x 5 root root 4096 Mar 3 18:56 .
drwxr-xr-x 3 root root 4096 Mar 3 22:42 ..
-rwx------ 1 root root 32 Mar 3 18:56 CATALOGS
drwxr-xr-x 2 root root 4096 Mar 3 19:38 CONFIGS
drwx------ 5 root root 4096 Mar 3 18:56 O
-rw-r--r-- 1 root root 4096 Mar 3 18:56 health_check
-rwx------ 1 root root 8704 Mar 3 18:56 last_rcvd
drwx------ 2 root root 16384 Mar 3 18:54 lost+found
Possible integration of xrootd/gridftp into Lustre FS
The Lustre API allows retrieving the file layout for each Lustre entry on the MDS. This information can be used e.g. by an xrootd redirector to direct a client to the disk server which stores a certain file. On this disk server the file could simply be opened using the mounted FS, although that is still a sort of local access via the local network. Alternatively one could mount the OST on each disk server with ldiskfs and access the file directly without going through the Lustre device. For striped files one would need to extend the xrootd client or return ENOTSUP.
A Lustre FS could be run without any authentication, mountable only by dedicated hosts, while (grid) jobs access the files via a remote protocol doing the usual GSI authentication.
The main requirement would be a clear mapping from DNs (or Kerberos principals) to real UID/GID pairs. An xrootd plugin for simple read/write should be doable in a few days.
If SRM is still wanted, StORM could be tried as the SRM interface.
Lustre and Quota
Quota can be set per user or per group. If a group quota exists on a filesystem it supersedes the user quota.
Quota is set for the complete filesystem - not for individual directories.
Enable quota on a file system:
lfs quotacheck -ug /tier2fs
Set quota for the user 'xprod':
lfs setquota -u xprod 3000000 3200000 10000 12000 /tier2fs/
This sets roughly a 3 GB soft and 3.2 GB hard block quota, plus a 10000 inodes soft and 12000 inodes hard quota. By default quota is granted in chunks of 100 MB, so you have to specify at least <X> = 100 MB x <number of disk servers/OSTs>; otherwise you see very strange effects and quota does not work properly.
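Given the 100 MB chunking, the smallest block quota that behaves sanely on this 25-OST setup can be computed; a small sketch (the chunk size is taken from the observation above):

```shell
# lfs setquota block limits are given in KB. Quota is handed out to
# the OSTs in ~100 MB chunks, so a limit below 100 MB x <number of
# OSTs> misbehaves. Numbers match the 25-server setup above.
OSTS=25
CHUNK_KB=$((100 * 1024))            # 100 MB chunk, expressed in KB
MIN_LIMIT_KB=$((OSTS * CHUNK_KB))
echo "minimum sensible block quota: ${MIN_LIMIT_KB} KB"   # 2560000 KB ~ 2.5 GB
```

So the 3000000 KB soft limit used above is just above that minimum.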
To list quota do:
lfs quota -u xprod
I did a very simple test exceeding the quota: the file copy stops when the quota is reached and '1' is returned in the shell. The incomplete file stays in the file system.
One can also set a grace limit for quota:
lfs setquota -t 100 2000 /tier2fs
100 is the block grace, 2000 the inode grace.
Lustre Tests
Single Client Machine Writes
One must distinguish cached and non-cached writes here. Every write test starts with very high performance until the write cache is exhausted and has to be committed.
Running a test with 100 clients writing 2000 files each, the performance goes asymptotically towards 40 files/s = 40 MB/s.
File Deletion
The deletion of 1.6 million files (8 TB) via 'rm -r ...' from a single client took 27m35s (~1000 files/s).
Disk Server Hotspots
In a test reading 0.5 GB files with 3000 clients (30 mounting clients running 100 parallel copies each) the FS created a hotspot disk server - i.e. after some stable running one disk server accumulated the requests of all clients. Once the equilibrium is broken we enter a vicious circle. If file streaming of large files is desired on a Lustre FS, the product <N clients> x <file stripe size> has to be in a healthy ratio to the I/O output of a single disk server. If many clients do streaming, the solution is to use more than one stripe per file. The maximum number of stripes in Lustre is 160. On the other hand, a single disk server failure will then do bigger damage to the file availability.
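This rule of thumb can be turned into a quick sizing calculation. A sketch with hypothetical numbers (the per-reader rate and per-server I/O capacity are assumptions, not measurements from this page):

```shell
# How many stripes does a hot file need so that its readers do not
# overload a single disk server? All figures are hypothetical.
STREAMS=100       # concurrent readers of the same file (assumed)
RATE=10           # MB/s per reader (assumed)
SERVER_IO=100     # MB/s one disk server sustains (assumed)
DEMAND=$((STREAMS * RATE))                          # 1000 MB/s on this file
STRIPES=$(( (DEMAND + SERVER_IO - 1) / SERVER_IO )) # ceiling division
if [ "$STRIPES" -gt 160 ]; then STRIPES=160; fi     # Lustre caps stripes at 160
echo "stripe the file over at least ${STRIPES} OSTs"   # -> 10
```

With these numbers the file should be spread over at least 10 OSTs to keep any single server below its sustainable output.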
Lustre FS Hangs
If one of the disk servers/OSTs crashes in a default setup, the complete FS gets blocked. During a write test one disk server crashed for a still unknown reason. As a result all 3000 copy processes stopped in status 'D' and the filesystem blocked.
The situation can be solved either by:
a) rebooting the disk server asap (I couldn't do that), or
b) deactivating the OST:
- identify the hanging server (look for a LustreError with dmesg on the MDS and get the OST name) - also useful is
cat /proc/fs/lustre/osc/*/ost_server_uuid | less
- get the device ID with 'lctl dl'
- deactivate the device with 'lctl --device <id> deactivate'
- the device has to be deactivated on the MDS and on ALL clients!
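The lookup in step b) can be scripted. A sketch that extracts the lctl device id for a given OST connection; the sample 'lctl dl' output and the OST name below are invented for illustration:

```shell
# Extract the device id of a failed OST from 'lctl dl' output so it
# can be fed to 'lctl --device <id> deactivate'. The sample output
# and the OST name are invented for this illustration.
lctl_dl='  0 UP mgc MGC128.142.x.y@tcp 11111111 5
  1 UP lov tier2fs-clilov 22222222 4
  7 UP osc tier2fs-OST0003-osc 33333333 5'
OST="tier2fs-OST0003"
# the first column of the matching line is the device id:
DEVID=$(echo "$lctl_dl" | awk -v ost="$OST" '$4 ~ ost { print $1 }')
echo "lctl --device ${DEVID} deactivate"   # -> lctl --device 7 deactivate
```

In real use, replace the here-string with `lctl dl` and run the generated command on the MDS and on every client.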
If one does not want blocking in case of an OST failure and prefers to see EIO on the clients, the OSTs should be created with --param="failover.mode=failout" or reconfigured with tunefs:
tunefs.lustre --writeconf --param="failover.mode=failout" <device>
On the other hand, blocking has the advantage that one can fix the problem and all processes and pending transactions will continue once the failed OST comes back.
Another useful pair of options to set on the OSTs is:
sysctl -w lnet.panic_on_lbug=1
sysctl -w kernel.panic_on_oops=1
This enables an automatic reboot of the machine whenever a bug is hit in the Lustre kernel modules or in the kernel itself, and reduces the downtime caused by an OST failure.
Lustre Write Test: 29 client machines (10 parallel cp's of 50 MB each)
The test runs for ~120 minutes. We write at a maximum rate of 2.6 GB/s (we have 25 disk servers and 30 client machines). We write 290000 x 50 MB files (14 TB) without any error.
The distribution of the time to write a file shows the effect of the caching on the client. If the cache is empty the write is much faster than the average of 5.745 s. If the cache is full the write has to wait until the cache has been flushed, which produces at least part of the tail. The missing 18 entries in the histogram got lost during the script processing of the logfiles.
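These figures are roughly self-consistent: 290 concurrent streams writing 50 MB files with an average write time of 5.745 s imply an aggregate rate close to the observed peak. A quick check (all numbers taken from the text above):

```shell
# Implied aggregate write rate from the measured average time per file.
STREAMS=290     # 29 clients x 10 parallel cp's
FILE_MB=50
AVG_S=5.745
awk -v s="$STREAMS" -v f="$FILE_MB" -v t="$AVG_S" \
    'BEGIN { printf "aggregate ~ %.1f MB/s\n", s * f / t }'
# -> aggregate ~ 2523.9 MB/s, i.e. ~2.5 GB/s, near the 2.6 GB/s peak
```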
Lustre Read Tests
Bonnie Performance on a disk server:
[root@lxfsrd4005 ext3]# /root/Bonnie -s 16000
File './Bonnie.30795', size: 16777216000
Writing with putc()...
[root@lxfsrd4005 ext3]# /root/Bonnie -s 32000
File './Bonnie.30804', size: 33554432000
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
32000 60691 98.9 268856 84.2 74771 21.1 59208 96.2 158751 15.3 77.9 0.3
I found that during the following tests two disk servers were doing a 'verify' operation after a reboot. These servers were evidently influencing the overall performance of the tests, since slow copy processes on these two disk servers were present on the clients all the time.
lxfsrc1201:
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-1 OK - - - 232.82 ON OFF
u1 RAID-10 VERIFYING - 98 64K 4656.51 ON OFF
u2 SPARE OK - - - 465.753 - OFF
u3 SPARE OK - - - 465.753 - OFF
lxfsrc1203:
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-1 OK - - - 232.82 ON OFF
u1 RAID-10 VERIFYING - 37 64K 4656.51 ON OFF
u2 SPARE VERIFYING - 34 - 465.753 - OFF
u3 SPARE OK - - - 465.753 - OFF
RAID Performance Tests with XFS
============================================================================
#write/#read 1/0 1/1 10/10 50/0 50/50
--------------------------------------------------------------------------------------
hwR10 350/0 135/78 56/38 130/0 46/34 (20 disks)
hwR1 80 50/25 13/6 22/0 0.1/24 ( 2 disks )
swR0hwR1 345/0 180/90 166/139 160/0 120/85 (20 disks)
swR0hwR5 320/0 158/91 103/78 100/0 93/68 (20 disks)
hwR10 = hardware R10 (~ 5TB)
hwR1 = hardware R1 (~ 5TB)
swR0hwR1 = software RAID0 on top of 10 hardware RAID1 (~ 5TB)
swR0hwR5 = software RAID0 on top of 4 hardware RAID5 (~ 7.8 TB)
Total Throughput Read 50M files - 2900 files in parallel
We get a stable ~500 Mb/s over 12 hours. A few disk servers always become hotspots. The disk I/O on all 25 machines is at 1 Gb/s while the network export is only 500 Mb/s. This is not fully understood yet, but it must be due to ineffective caching.
I/O seen on a single disk server during the test
Total Throughput Read 500 MB files (290 clients)
The peak rate was 1.2 Gb/s, then decayed due to the verify operation on two disk servers.
The default readahead for the 3ware controllers was '256'. In a second test run I changed it to the recommended setting of '16384'; the rate stays more stable.
wassh -t 10 -e lxb8971 -l root -c fileserver/dpm-test "blockdev --setra 16384 /dev/sdb"
Write Test with 8 Stripes
Writing 500 MB files with 8 stripes each, 290 clients. The peak performance shown in Lemon is 2.7 Gb/s. 29000 files =~ 14.5 TB.
Test Setup with native /dev/sdb device (no loopback) and Raid10 (20 disks)
wassh -l root -e "lxb8971" -c fileserver/dpm-test "mkfs.lustre --mkfsoption=\"-E stride=160 -i 1000000\" --param=\"failover.mode=failout\" --fsname tier2fs --ost --mgsnode=lxb8971@tcp0 --reformat /dev/sdb"
The setup without loopback devices drastically improved the performance.
This plot shows the write performance over the 24 disk servers. The read performance looks very similar (~ -10%).
This plot shows read and write in parallel. Lustre seems to provide a better balancing between read & write than the native device.
This plot shows xrootd running on top of the Lustre cluster. We get performance identical to native Lustre.
This plot summarizes the performance of the different setups:
- xrootd native (read + write) [read alone was 1.9 Gb/s - write 2.5 Gb/s]
- lustre native (write | read | read+write)
- xrootd over lustre
gridFTP setup on Lustre
On a client machine where you mount the Lustre FS, install the lustre-gridftp script in /etc/init.d
https://twiki.cern.ch/twiki/bin/viewfile/LCG/Scalability_Lustre?rev=1;filename=lustre-gsiftp and do
/etc/init.d/lustre-gridftp start (in the end this is just a standard gridftp server operating with the file: DSI plugin). One gets around 90 Mb/s for a 1 GB file with 16 parallel streams on the same network segment.
Performance test running CMS SuSY analysis via mounted LustreFS
The first 7 plots are realtime plots showing the load figures on client and server nodes. The last 7 plots are history plots and the test can be seen there from 11:00 ->11:20 (the last 20 minutes in the history).
We ran 450 jobs on 360 cores (9 jobs per 8 cores) on 45 client hosts against 24 disk servers. We got a total read rate of 1.7 GB/s. Each client node is ~80% CPU busy and 20% I/O wait.
These CMS jobs do about 4 MB/s of I/O each, reading 8 individual 1.1 GB files sequentially via the mounted LustreFS.
Setting up a Twin-LustreFS
The goal is to set up a redundant LustreFS which operates via xrootd like a network RAID-0.
Setup of Twin Lustre (all on single box)
- create more loopback devices
mknod -m 660 /dev/loop<dev#> b 7 <dev#>
- create a separate MGS
dd if=/dev/zero of=/scratch/twin/mds bs=1024 count=0 seek=8200
mkfs.lustre --mgs --device-size=2000000 --reformat /scratch/twin/mds
mkdir /mds
mount -o loop -t lustre /scratch/twin/mds /mds
- create two MDT
dd if=/dev/zero of=/scratch/twin/mdt1 bs=1000 count=0 seek=200000000
dd if=/dev/zero of=/scratch/twin/mdt2 bs=1000 count=0 seek=200000000
mkfs.lustre --fsname=twin1 --mdt --mgsnode=pcitsmd01@tcp0 --backfstype=ldiskfs --device-size=2000000 --reformat /scratch/twin/mdt1
mkfs.lustre --fsname=twin2 --mdt --mgsnode=pcitsmd01@tcp0 --backfstype=ldiskfs --device-size=2000000 --reformat /scratch/twin/mdt2
mount -o loop -t lustre /scratch/twin/mdt1 /mdt1
mount -o loop -t lustre /scratch/twin/mdt2 /mdt2
- create two times two OSTs (we should later also add the failnode parameter, see the beginning of this Twiki)
dd if=/dev/zero of=/scratch/twin/ost11 bs=1000 count=0 seek=200000000
dd if=/dev/zero of=/scratch/twin/ost12 bs=1000 count=0 seek=200000000
dd if=/dev/zero of=/scratch/twin/ost21 bs=1000 count=0 seek=200000000
dd if=/dev/zero of=/scratch/twin/ost22 bs=1000 count=0 seek=200000000
mkfs.lustre --fsname twin1 --ost --mgsnode=pcitsmd01@tcp0 --backfstype=ldiskfs --device-size=2000000 --reformat /scratch/twin/ost11
mkfs.lustre --fsname twin1 --ost --mgsnode=pcitsmd01@tcp0 --backfstype=ldiskfs --device-size=2000000 --reformat /scratch/twin/ost12
mkfs.lustre --fsname twin2 --ost --mgsnode=pcitsmd01@tcp0 --backfstype=ldiskfs --device-size=2000000 --reformat /scratch/twin/ost21
mkfs.lustre --fsname twin2 --ost --mgsnode=pcitsmd01@tcp0 --backfstype=ldiskfs --device-size=2000000 --reformat /scratch/twin/ost22
mount -o loop -t lustre /scratch/twin/ost11 /ost11
mount -o loop -t lustre /scratch/twin/ost12 /ost12
mount -o loop -t lustre /scratch/twin/ost21 /ost21
mount -o loop -t lustre /scratch/twin/ost22 /ost22
- mount the two Lustre-FS
mount -t lustre -o user_xattr,acl pcitsmd01@tcp0:/twin1 /twin1
mount -t lustre -o user_xattr,acl pcitsmd01@tcp0:/twin2 /twin2
Recovery Procedures
Find all files on an OST
lfs df
=========================================================================
pcitsmd01] /home/alientest/xrootd-lustre > lfs df
UUID 1K-blocks Used Available Use% Mounted on
twin1-MDT0000_UUID 1749776 185300 1564476 10% /twin1[MDT:0]
twin1-OST0000_UUID 1968528 586628 1381900 29% /twin1[OST:0]
twin1-OST0001_UUID 1968528 655548 1312980 33% /twin1[OST:1]
filesystem summary: 3937056 1242176 2694880 31% /twin1
UUID 1K-blocks Used Available Use% Mounted on
twin2-MDT0000_UUID 1749776 183468 1566308 10% /twin2[MDT:0]
twin2-OST0000_UUID 1968528 604624 1363904 30% /twin2[OST:0]
twin2-OST0001_UUID 1968528 590440 1378088 29% /twin2[OST:1]
filesystem summary: 3937056 1195064 2741992 30% /twin2
lfs find -O twin1-OST0000_UUID /twin1
----> list of filenames
Head Node Redundancy
Using drbd for device replication
Download, compile and install as usual, then 'modprobe drbd'.
Example configuration to replicate the devices /dev/sdb1 + /dev/sdb2 between two head nodes
global {
usage-count yes;
}
common {
protocol C;
}
resource cms-xl-1 {
on lxb8901.cern.ch {
device /dev/drbd1;
disk /dev/sdb1;
address 128.142.209.49:7789;
meta-disk /dev/sdb3[0];
}
on lxb8902.cern.ch {
device /dev/drbd1;
disk /dev/sdb1;
address 128.142.209.50:7789;
meta-disk /dev/sdb3[0];
}
syncer {
rate 40M;
}
}
resource cms-xl-2 {
on lxb8901.cern.ch {
device /dev/drbd2;
disk /dev/sdb2;
address 128.142.209.49:7790;
meta-disk /dev/sdb4[0];
}
on lxb8902.cern.ch {
device /dev/drbd2;
disk /dev/sdb2;
address 128.142.209.50:7790;
meta-disk /dev/sdb4[0];
}
syncer {
rate 40M;
}
}
Create/Attach/Connect/Synchronize the drbd devices
Execute these commands on both hosts, except the 'primary' commands which run only on the primary node.
drbdadm create-md cms-xl-1
drbdadm create-md cms-xl-2
drbdadm attach cms-xl-1
drbdadm attach cms-xl-2
drbdadm connect cms-xl-1
drbdadm connect cms-xl-2
drbdadm -- --overwrite-data-of-peer primary cms-xl-1 # only on primary node
drbdadm -- --overwrite-data-of-peer primary cms-xl-2 # only on primary node
cat /proc/drbd
version: 8.2.5 (api:88/proto:86-88)
GIT-hash: 9faf052fdae5ef0c61b4d03890e2d2eab550610c build by root@lxb8901.cern.ch, 2008-04-25 12:03:43
1: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
ns:7823623 nr:0 dw:0 dr:7823623 al:0 bm:478 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:488500 misses:478 starving:0 dirty:0 changed:478
act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
2: cs:Connected st:Secondary/Secondary ds:Inconsistent/Inconsistent C r---
ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
When devices are synced, create the needed filesystem on the primary. Mount the device only on the primary!
Dealing with Split-Brain Situation
If a device has been mounted on two nodes, a split brain is detected and drbd will refuse to reconnect.
You have to declare one node as secondary and discard the changes on that node by issuing:
drbdadm secondary mdscern
drbdadm -- --discard-my-data connect mdscern
Tuning Lustre for mgs failover node
tunefs.lustre --erase-params --mgsnode=128.142.167.202@tcp --mgsnode=128.142.167.201@tcp0 --param "failover.node=128.142.167.201" --param "mdt.quota_type=ug" --writeconf /dev/drbd4
IPMI Control
For HA configuration we need IPMI configuration to be able to power-cycle machines.
ipmitool -I open chassis status # see the status of a chassis
ipmitool -I open user list # see the configured users
ipmitool -I open user set name 6 lustreha # set a new user name for id 6
ipmitool -I open user set password 6 # set the password for user 6 in interactive mode
ipmitool user enable 6
ipmitool user priv 6 4 1 # set the user privilege level for channel 1 to administrator!
IPMI_PASSWORD=<your password here> <cmd> # to set the password from an application!
ipmitool -U lustreha -I lan -H x.x.x.x -E chassis power status #connect from another node !!! You have to use the IPMI address which you can find out with ipmitool -I open lan print
Building Heartbeat
wget http://linux-ha.org/download/heartbeat-2.1.3-1.fc7.src.rpm
rpm -i heartbeat-2.1.3-1.fc7.src.rpm
yum install libnet
rpmbuild -bb /usr/src/redhat/SPECS/heartbeat.spec
Testing failover:
/usr/lib64/heartbeat/hb_takeover local
--
AndreasPeters - 03 Mar 2008
- lustre-gsiftp: init.d script to startup a gridftp server on a lustre client