Goal

Analyze the stability of SLC4+XFS on 32-bit and 64-bit platforms and provide a comparison matrix of the various setups for HEPiX 2006.

Machines

Networked setups use servers and clients.

Servers

TYPE  HOST        KERNEL    ARCH    FS
t1    lxfsra2403  SLC3      i386    XFS
t1    lxfsra2404  2.6.15.5  x86_64  XFS
t1    lxfsra2405  SLC4      x86_64  XFS
t1    lxfsra2406  SLC4      x86_64  ext3
t1    lxfsra2407  SLC4      i386    XFS
t1    lxfsra2408  SLC4      i386    ext3

Clients

  • lxb5371 - lxb5393 (23 nodes, SLC3 i386)

Network between servers and clients

# traceroute lxfsra2406
traceroute to lxfsra2406.cern.ch (128.142.164.93), 30 hops max, 38 byte packets
 1  l513-c-rftec-3-ip103 (128.142.130.1)  0.345 ms  12.633 ms  0.228 ms
 2  lxfsra2406 (128.142.164.93)  0.232 ms  0.301 ms  0.237 ms

# iperf -c lxfsra2406 -i1 -t10 -fM
------------------------------------------------------------
Client connecting to lxfsra2406, TCP port 5001
TCP window size: 0.02 MByte (default)
------------------------------------------------------------
[  3] local 128.142.130.63 port 56886 connected with 128.142.164.93 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  91.2 MBytes  91.2 MBytes/sec
[  3]  1.0- 2.0 sec  91.6 MBytes  91.6 MBytes/sec
[  3]  2.0- 3.0 sec  91.6 MBytes  91.6 MBytes/sec
[  3]  3.0- 4.0 sec  91.6 MBytes  91.6 MBytes/sec
[  3]  4.0- 5.0 sec  91.6 MBytes  91.6 MBytes/sec
[  3]  5.0- 6.0 sec  91.6 MBytes  91.6 MBytes/sec
[  3]  6.0- 7.0 sec  92.5 MBytes  92.5 MBytes/sec
[  3]  7.0- 8.0 sec  91.6 MBytes  91.6 MBytes/sec
[  3]  8.0- 9.0 sec  91.6 MBytes  91.6 MBytes/sec
[  3]  9.0-10.0 sec  92.7 MBytes  92.7 MBytes/sec
[  3]  0.0-10.0 sec   918 MBytes  91.8 MBytes/sec

Methodology

Use expert vision. The phase space is huge; an exhaustive search is impossible, so a lot of uninteresting combinations have to be cut. The hardware we have is identical, which allows parallel testing of different setups. Runs are identified by their tuple, e.g. (SLC4, x86_64, XFS, expert, HWRAID5, RFIO+O_DIRECT, 3W) (notation may change); see the sketch after the dimension list below. Scheduling of such cases is arbitrary and will depend on hardware availability, time constraints and what we think is interesting to look at.

Dimensions

  • kernel (SLC4, SLC3, vanilla)
  • architecture (x86_64, i386)
  • filesystem (XFS, ext3)
  • filesystem options (expert, default)
  • hardware RAID (HWRAID5, HWRAID1, HWRAID10)
  • workload fashion (RFIO+O_DIRECT, RFIO, iozone+O_DIRECT, iozone)
  • number of streams (1R, 1W, 1R+1W, 1R+2W, 2R+1W, 3R, 3W, 3R+3W, 12R, 12W, 12R+12W, 100R, 100W, 50R+50W, 1000R, 1000W, 1000R+1000W as pathological)
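
A minimal sketch, in shell, of enumerating run tuples from the dimensions above (the dimension values are taken from the list; the script itself is illustrative, not part of the actual batch framework, and only a subset of the stream counts is shown):

#!/bin/sh
# Enumerate example run tuples; uninteresting combinations are
# still cut by hand before scheduling.
for kernel in SLC4 SLC3 vanilla; do
  for arch in x86_64 i386; do
    for fs in XFS ext3; do
      for fsopt in expert default; do
        for raid in HWRAID5 HWRAID1 HWRAID10; do
          for workload in RFIO+O_DIRECT RFIO iozone+O_DIRECT iozone; do
            for streams in 1R 1W 3R 3W 3R+3W 12R+12W; do
              echo "($kernel, $arch, $fs, $fsopt, $raid, $workload, $streams)"
            done
          done
        done
      done
    done
  done
done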

VM options might be tuned (existing SLC3 tunings are irrelevant on SLC4). Another goodie that might help with bidirectional streams is an external filesystem journal.
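
A sketch of the external-journal setup and a VM knob, assuming a hypothetical data device /dev/sda1, a dedicated log device /dev/sdb1 and mount point /srv (device names and values are placeholders, not the tested configuration):

# XFS with an external log on a separate spindle
mkfs.xfs -l logdev=/dev/sdb1,size=128m /dev/sda1
# the log device must be given again at mount time
mount -t xfs -o logdev=/dev/sdb1 /dev/sda1 /srv

# example 2.6 VM writeback knobs (values to be determined by testing)
sysctl -w vm.dirty_ratio=40
sysctl -w vm.dirty_background_ratio=10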

Recordings

  • Wall clock time
  • stream(s) start/finish timestamp
  • vmstat output
  • XFS stats (if applicable)
  • XFS fragmentation info (if applicable) [xfs_bmap -v, xfs_db -c frag]
  • controller state (disk configuration before/after test; see the capture sketch after this list)
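
A minimal capture wrapper sketch covering the recordings above (the log directory, the data device name and the tw_cli controller ID are assumptions; the real batch framework may do this differently):

#!/bin/sh
# record.sh -- run a test command and capture the recordings around it
LOG=/var/tmp/run-$$
mkdir -p $LOG

date +%s > $LOG/start                       # stream start timestamp
vmstat 5 > $LOG/vmstat.out &                # vmstat for the duration of the run
VMSTAT=$!
cat /proc/fs/xfs/stat > $LOG/xfs.before 2>/dev/null   # XFS stats, if applicable
tw_cli info c0 > $LOG/controller.before     # 3ware controller/disk state

/usr/bin/time -o $LOG/wallclock "$@"        # the test itself, wall clock time

tw_cli info c0 > $LOG/controller.after
cat /proc/fs/xfs/stat > $LOG/xfs.after 2>/dev/null
kill $VMSTAT
date +%s > $LOG/finish                      # stream finish timestamp

# XFS fragmentation info, read-only (per-file extent maps: xfs_bmap -v FILE)
xfs_db -r -c frag /dev/sda1 > $LOG/frag 2>/dev/null

Invoked as e.g. record.sh iozone ... so the test command and its arguments are passed through unchanged.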

Preparations

DONE

  • SLC4 installed, updated to rolling latest (i386: UP kernel is default?)
  • 3ware firmware upgraded (2.02.00.012 -> 2.08.00.006)
  • kernel-module-xfs-2.6.9-34.EL.cernsmp-0.1-1.i686.rpm
  • kernel-module-xfs-2.6.9-34.EL.cernsmp-0.1-1.x86_64.rpm
  • both the 32-bit and 64-bit builds survive the XFS QA tests
  • batch framework operational
  • Linux 2.6.16-rc5+XFSCVS on lxfsra2404
  • dug up the undocumented iozone -+n option to suppress rewrite/reread tests (see the sketch after this list)
  • installed RFIO (64-bit version)
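
A sketch of an iozone invocation along the lines of the workload dimensions above (file size, record size and stream count are illustrative, not the actual test parameters; -F takes one filename per thread):

# 3 streams, write then read, O_DIRECT, rewrite/reread retests suppressed
iozone -+n -I -i 0 -i 1 -t 3 -s 1536m -r 256k \
    -F /srv/f1 /srv/f2 /srv/f3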

Results so far

  • result plots are available as PNG and PostScript
  • ARECA has a problem with SLC4 2.6.9-34.EL: multistream reads are capped at 100 MiB/s
  • SLC3 2.4.21-40.EL XFS seems to be working, both buffered and O_DIRECT, although O_DIRECT I/O doesn't show up in vmstat
  • Observed system behaviour under HWRAID5 rebuild: basically no impact on I/O! (lxfsra2406, tests 131-136)
  • RFIO+O_DIRECT+XFS preallocation seems stable (no errors)
  • RFIO readers have hung after a couple of hours of operation -- not understood; the binary was hand-compiled, now trying the canonical build
  • deleting 2160 files of 1536 MiB each from three filesystems (time find /srv -type f | xargs rm):
    • SLC4 x86_64 XFS: real 0m0.271s, user 0m0.005s, sys 0m0.152s
    • SLC4 x86_64 ext3: ~90 minutes
    • SLC4 i386 XFS: real 0m0.276s, user 0m0.007s, sys 0m0.136s
    • SLC4 i386 ext3: ~90 minutes
  • SLC4 i386 XFS: removing the 2160 files in parallel leaves many rm processes hanging in D state -- SysRq-T was executed; it doesn't seem to be 4K-stack-related, rather a livelock (reproduction sketch below)
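
A minimal reproduction sketch of the parallel removal, assuming the three filesystems are mounted under /srv1, /srv2 and /srv3 (hypothetical mount points; the actual layout may differ):

#!/bin/sh
# delete the test files from all three filesystems in parallel;
# on SLC4 i386 XFS this leaves many rm processes stuck in D state
for mnt in /srv1 /srv2 /srv3; do
    ( find $mnt -type f | xargs rm ) &
done
wait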

-- PeterKelemen - 01 Apr 2006
