
    Coraid Odyssey: Part 5 (AoE vs iSCSI)

    The next phase of this project is choosing between AoE and iSCSI. The debate on the relative merits of each protocol continues to rage on the Internet, but in my particular case the criterion is pretty simple: which one performs better without causing excessive system load? Just from reading about the two protocols I am already leaning toward iSCSI for the simple fact that I can use all my TCP/IP management tools (routing, NAT, firewalling, etc.) on every iSCSI device. The only (potential) drawback is CPU load on the involved systems, since they have to calculate TCP checksums for all those packets. Yes, there are many, many other advantages of one protocol over the other. No, they don’t matter to me in this scenario :-) So here we go!


    In keeping with my character, the first thing I did was start all over again from scratch by reinstalling the operating system. This time around I set up /dev/md0 as /boot (255 MB) and /dev/md1 as an LVM physical volume (the remainder of the disk), within which /, /home, /usr and friends reside as logical volumes. It’s something I’ve wanted to start doing with all my systems for a long time now, and it shouldn’t have any bearing on the performance tests we are about to do.

    Regardless of which protocol will be used we need to enable jumbo frames on all the involved devices. For my setup that means the target (stor01), the initiator (node02), and the switch (a Cisco Catalyst 2970).

    First, we turn on jumbo frames for gigabit ethernet at the switch. Beware that this requires a reset (aka reboot) of the switch to take effect:

    c2970(config)# system mtu jumbo 9000

    Now we enable an MTU of 9000 on both the target and the initiator:

    root@stor01:~# ifconfig bond0 mtu 9000
    root@node02:~# ifconfig eth0 mtu 9000
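
    Setting the MTU on each interface does not by itself prove that 9000-byte frames survive the whole path. Here is a minimal sketch of one way to check, assuming Linux iputils ping (whose -M do option forbids fragmentation) and plain IPv4 with no IP options. The helper names are made up for illustration:

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Hypothetical helper: send three don't-fragment pings sized for the given MTU.
# Assumes iputils ping, where -M do sets the don't-fragment bit.
check_jumbo_path() {
    host=$1
    mtu=${2:-9000}
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$host"
}
```

    Running check_jumbo_path 65.171.150.4 from node02 should only get replies if every device in between passes the full-size frame. Errors or timeouts point at something still using a smaller MTU.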

    For the sake of comparison, here is an iperf test done between the target and initiator, first with the standard MTU of 1500 (connection 4) and then with an MTU of 9000 (connection 5):

    root@stor01:~# iperf -s
    ------------------------------------------------------------
    Server listening on TCP port 5001
    TCP window size: 1.00 MByte (default)
    ------------------------------------------------------------
    [  4] local 65.171.150.4 port 5001 connected with 65.171.150.161 port 58731
    [  4]  0.0-10.0 sec    780 MBytes    654 Mbits/sec
    [  5] local 65.171.150.4 port 5001 connected with 65.171.150.161 port 58732
    [  5]  0.0-10.0 sec    916 MBytes    768 Mbits/sec

    As you can see, just enabling jumbo frames produces a raw throughput increase of 17.43%. Nothing to sneeze at.
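
    The percentage comes straight from the two iperf results. A small sketch of the arithmetic, assuming a POSIX shell with awk (pct_increase is just an illustrative name):

```shell
#!/bin/sh
# Percentage change from a baseline rate to a new rate, two decimal places.
pct_increase() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", (new - old) / old * 100 }'
}

pct_increase 654 768   # Mbits/sec at MTU 1500, then at MTU 9000; prints 17.43
```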

    At this point I tried enabling flow control on the Catalyst switch (it is already enabled for both send and receive by default in the e1000 driver), but it did not have any effect on the iperf numbers. I turned it back off for now.

    So now we set up a 20GB LVM volume on the target and export it using vblade to be mounted on the initiator. We then run a simple dd test to check throughput:

    root@node02:~# dd if=/dev/zero of=/mnt/test oflag=direct bs=4M
    419+0 records in
    419+0 records out
    1757413376 bytes (1.8 GB) copied, 133.476 seconds, 13.2 MB/s

    CPU load on the target was 10-15% during the dd operation. Now we try writing directly to the (unmounted) block device to rule out any performance penalty from the filesystem itself…

    root@node02:~# dd if=/dev/zero of=/dev/etherd/e0.1 oflag=direct bs=4M
    513+0 records in
    512+0 records out
    2147483648 bytes (2.1 GB) copied, 170.991 seconds, 12.6 MB/s

    CPU usage was slightly higher in that test, running 15-20%. So some slight difference but nothing to be too concerned about.
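
    For reference, the AoE export and discovery steps used for these tests can be sketched roughly as follows. This assumes the vblade and aoetools packages. The volume group and volume names (vg0, aoetest) are hypothetical, and shelf 0, slot 1 is inferred from the /dev/etherd/e0.1 device above. By default the commands are only printed, not run:

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# AoE block devices appear on the initiator as /dev/etherd/e<shelf>.<slot>.
aoe_device_path() {
    echo "/dev/etherd/e$1.$2"
}

# Target side (stor01): create the volume and export it as shelf 0, slot 1.
# vg0 and aoetest are hypothetical names; bond0 is the target's interface.
# Note that vblade stays in the foreground until interrupted.
export_aoe() {
    run lvcreate -L 20G -n aoetest vg0
    run vblade 0 1 bond0 /dev/vg0/aoetest
}

# Initiator side (node02): load the driver and look for exported devices.
discover_aoe() {
    run modprobe aoe
    run aoe-discover
    run aoe-stat
}
```

    Calling export_aoe on the target and discover_aoe on the initiator with DRY_RUN=0 (as root) would run the commands for real. The device to format or test would then be the one printed by aoe_device_path 0 1.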

    Now we take that same LVM device and share it via iSCSI for the same dd tests:

    ladmin@node02:~$ dd if=/dev/zero of=/mnt/test oflag=direct bs=4M
    462+0 records in
    461+0 records out
    1933574144 bytes (1.9 GB) copied, 38.2375 seconds, 50.6 MB/s

    CPU load was 6-8% during that test. We also run that same test with flow control enabled at the switch:

    ladmin@node02:~$ dd if=/dev/zero of=/mnt/test oflag=direct bs=4M
    463+0 records in
    462+0 records out
    1937768448 bytes (1.9 GB) copied, 38.1851 seconds, 50.7 MB/s

    Essentially the same…
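
    As a sanity check on the figures so far, the rates dd prints can be recomputed from its byte counts and elapsed times. A minimal sketch, assuming a POSIX shell with awk (mb_per_sec is an illustrative helper name, and the numbers are copied from the first AoE and first iSCSI runs above):

```shell
#!/bin/sh
# Throughput in decimal MB/s (the unit dd reports) from bytes and seconds.
mb_per_sec() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes / secs / 1000000 }'
}

aoe=$(mb_per_sec 1757413376 133.476)     # AoE, write through the filesystem
iscsi=$(mb_per_sec 1933574144 38.2375)   # iSCSI, same dd command

echo "AoE: $aoe MB/s, iSCSI: $iscsi MB/s"
awk -v a="$aoe" -v i="$iscsi" 'BEGIN { printf "iSCSI is %.1fx faster\n", i / a }'
```

    On these numbers iSCSI comes out a little under four times faster than AoE for the same write test.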

    This raises the question of why AoE is so much slower than iSCSI on an essentially default install of Debian Etch. To AoE’s credit, many people report getting performance from AoE that is just as good (50MB/s or better) as what I’m seeing with iSCSI. I spent a large amount of time playing with flow control, kernel ring buffer values, filesystem options, etc. and was unable to determine why performance is so terrible for me. I did find a fair number (half a dozen at least) of recent posts to the AoE mailing list from other people having essentially identical problems, so I’m certainly not alone. In the interest of completing my testing, I’ve decided to move forward with iSCSI.
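
    For anyone following along, attaching the iSCSI volume on the initiator might look something like the sketch below. It assumes the open-iscsi initiator (iscsiadm), which may not match every setup. The IQN is a made-up placeholder, and the portal address is stor01's address from the iperf output. Commands are only printed unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

PORTAL=65.171.150.4                          # stor01, per the iperf output
TARGET=iqn.2001-04.com.example:stor01.test   # hypothetical IQN

# Ask the portal which targets it offers, then log in to one of them.
iscsi_attach() {
    run iscsiadm -m discovery -t sendtargets -p "$PORTAL"
    run iscsiadm -m node -T "$TARGET" -p "$PORTAL" --login
}
```

    After a successful login the volume shows up as an ordinary SCSI disk on the initiator, ready for mkfs and mount.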

    Now we try reformatting with the stride option to mkfs:

    mkfs.ext3 -E stride=16
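
    The stride value is the RAID chunk size divided by the filesystem block size. A quick sketch of that arithmetic, assuming a 64 KiB chunk and 4 KiB blocks (those two figures are my assumption here, chosen because they give 16; ext_stride is an illustrative name):

```shell
#!/bin/sh
# stride = RAID chunk size / filesystem block size, both in KiB.
ext_stride() {
    echo $(( $1 / $2 ))
}

ext_stride 64 4   # assumed 64 KiB chunk, 4 KiB blocks; prints 16
```

    On a Linux md array, mdadm --detail reports the real chunk size to plug in.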

    The results of several more tests are shown here…

    root@node02:~# dd if=/dev/zero of=/mnt/test oflag=direct bs=4M
    2038431744 bytes (2.0 GB) copied, 41.8938 seconds, 48.7 MB/s
    2038431744 bytes (2.0 GB) copied, 42.2633 seconds, 48.2 MB/s
    2038431744 bytes (2.0 GB) copied, 41.3756 seconds, 49.3 MB/s

    So we don’t see any appreciable difference when using a combination of the stride= option and flow control, at least with a simple dd test.

    Next we turn flow control back off, and reformat again without the stride= option. We are now back to our baseline setup for a new test with bonnie++.

    ladmin@node02:~$ /usr/sbin/bonnie++ -d /mnt -s 4096Mb -n 10 -x 5 -q

    This test produced block writes of about 82MB/s and block reads of about 37MB/s. The cause of the difference in write speed between the dd and bonnie++ tests is still unclear to me. There also appears to be a known issue where writes are much faster than reads, which is apparently due to interrupt handling. This is further evidenced by running a quick dd test that does a read instead of a write:

    ladmin@node02:~$ dd if=/mnt/test of=/dev/null bs=4M
    3602907136 bytes (3.6 GB) copied, 96.2053 seconds, 37.5 MB/s
    3602907136 bytes (3.6 GB) copied, 90.5524 seconds, 39.8 MB/s
    3602907136 bytes (3.6 GB) copied, 90.0425 seconds, 40.0 MB/s
    3602907136 bytes (3.6 GB) copied, 88.5708 seconds, 40.7 MB/s

    As you can see, read operations are about 20% slower than write operations, which goes against common thinking with regard to striped disk arrays.
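
    To put a number on that, here is a rough sketch that averages the four read runs and compares the result against the 50.6 MB/s dd write figure from the first iSCSI test. It assumes a POSIX shell with awk, and the helper names are illustrative:

```shell
#!/bin/sh
# Mean of the MB/s figures given as arguments, one decimal place.
mean() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum / NR }'
}

# How far the read rate falls below the write rate, as a percentage.
pct_slower() {
    awk -v w="$1" -v r="$2" 'BEGIN { printf "%.1f\n", (w - r) / w * 100 }'
}

read_avg=$(mean 37.5 39.8 40.0 40.7)   # the four dd read runs
pct_slower 50.6 "$read_avg"            # 50.6 MB/s is the earlier dd write rate
```

    The reads average 39.5 MB/s, which works out to roughly 22% below the write rate.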

    So there you have it. In my particular situation, with no tuning or optimizing done, iSCSI performs much better than AoE. Even if I were to go to the trouble of performance tuning AoE and got it as good as, or even better than, iSCSI, I would still be inclined to standardize on iSCSI. Authentication, routing, NAT, etc. can all be done very easily with iSCSI using all the standard TCP/IP tools that are out there. For me that’s a pretty big advantage.

    Next up will be our final piece of the puzzle, left over from the initial system setup - getting hot swap working with the sata_mv module!
