Playing with NFS & GlusterFS on Amazon cc1.4xlarge EC2 instance types

29 Jul 2010

Early single-client tests of shared ephemeral storage via NFS and parallel GlusterFS

We here at BioTeam have been kicking the tires and generally exploring around the edges of the new Amazon cc1.4xlarge “compute cluster” EC2 instance types. Much of our experimentation has centered on simple benchmarking as a way of slowly zeroing in on the methods, techniques and orchestration approaches most likely to have a significant effect on usability, performance or wallclock time to scientific results for the work we do professionally for ourselves and our clients.

[Image: glusterFS-002.png]

We are asking very broad questions and testing assumptions along the lines of:

  • Does the hot new 10 Gigabit non-blocking network fabric behind the new instance types really make “legacy” compute farm and HPC cluster architectures that rely heavily on network filesharing possible in the cloud?
  • How does filesharing between nodes look and feel on the new network and instance types?
  • Are the speedy ephemeral disks on the new instance types suitable for bundling into NFS shares or aggregating into parallel or clustered distributed filesystems?
  • Can we use the replication features in GlusterFS to mitigate some of the risks of using ephemeral disk for storage?
  • Should the shared storage built from ephemeral disk be assigned to “/scratch” or other non-critical duties due to the risks involved? What can we do to mitigate the risks?
  • At what scale is NFS the easiest and most suitable sharing option? What are the best NFS server and client tuning parameters to use?
  • When using parallel or cluster filesystems like GlusterFS, what rough metrics can we use to figure out how many data servers to dedicate to a particular cluster size or workflow profile?

GlusterFS & NFS Initial Testing

Over the past week we have been running tests on two types of network filesharing. We’ve only tested against a single client so obviously these results say nothing about at-scale performance or operation.

Types of tests:

  1. Take the pair of ~900GB ephemeral disks on the instance type, stripe them together as a RAID0 set, slap an XFS filesystem on top and export the entire volume out via NFS
  2. Take the pair of ~900GB ephemeral disks on the instance type, slap a single large partition on each drive, format each drive with an EXT3 filesystem and then use GlusterFS to create, mount and export the volume via the GlusterFS protocol (the disk prep for both setups is sketched just below)
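
For reference, the disk preparation behind those two setups looks roughly like the sketch below. The device names, mount points and brick paths are placeholders (the two ephemeral disks show up under different device names depending on the AMI), and the partitioning step is omitted for brevity:

    # Setup 1: RAID0 across both ephemeral disks, XFS on top, shared via NFS
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.xfs /dev/md0
    mkdir -p /nfs
    mount /dev/md0 /nfs

    # Setup 2: one ext3 filesystem per ephemeral disk, handed to GlusterFS as bricks
    mkfs.ext3 /dev/sdb     # this is the slow step; roughly 15-20 minutes per ~900GB disk
    mkfs.ext3 /dev/sdc
    mkdir -p /export/brick1 /export/brick2
    mount /dev/sdb /export/brick1
    mount /dev/sdc /export/brick2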

For each of the above two test types we repeatedly ran (at least four times) our standard bonnie++ benchmark tests (methodology described in the earlier blog posts). The tests were run on a single remote client that was either NFS mounting or GlusterFS mounting the file share.
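
The exact bonnie++ invocation is in the earlier methodology posts; purely as an illustration, a single-client run against the mounted share looks something like the line below (the size, label and user here are placeholders, not necessarily the values we used):

    # file size (-s) should be at least 2x the client's RAM to defeat caching;
    # -n 0 skips the small-file tests, -m just labels the output
    bonnie++ -d /nfs-scratch -s 46g -n 0 -m cc1-client -u nobody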

GlusterFS parameters

  • None really. We used the standard volume creation command and mounted the file share via the GlusterFS protocol over TCP; a rough sketch of the commands follows below. Eventually we want to ask some of our GlusterFS expert friends for additional tuning guidance.
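
The exact commands depend on the GlusterFS release: the series current at the time of writing generates volume files with the glusterfs-volgen helper, while the gluster CLI in later releases does the same thing in a couple of commands. A minimal sketch using the newer CLI syntax, with placeholder hostnames, brick paths and volume name:

    # distribute data across one brick on each server; swap in "replica 2"
    # to mirror bricks instead, which is how the replication idea mentioned
    # in this post would be exercised
    gluster volume create scratch transport tcp \
        server1:/export/brick1 server2:/export/brick1
    gluster volume start scratch

    # on the client, mount over the native GlusterFS protocol
    mkdir -p /mnt/scratch
    mount -t glusterfs server1:/scratch /mnt/scratch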

NFS parameters:

  • Server export file:  “/nfs    <host>(rw,async)”
  • NFS Server config: boosted the number of nfsd daemons to 16 via edits to /etc/sysconfig/nfs file
  • Client mount options:  “mount -t nfs -o rw,async,hard,intr,retrans=2,rsize=32768,wsize=32768,nfsvers=3,tcp <host>:/nfs /nfs-scratch”
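
Pulled together, the server side of that amounts to roughly the following (the export path and client host are placeholders; on the CentOS/RHEL-style setup we used, the nfsd thread count lives in /etc/sysconfig/nfs):

    # /etc/exports
    /nfs    <client-host>(rw,async)

    # /etc/sysconfig/nfs: raise the nfsd thread count
    RPCNFSDCOUNT=16

    # apply the changes
    service nfs restart     # or "exportfs -ra" if only /etc/exports changed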

Lessons Learned So Far – NFS vs. GlusterFS

  • GlusterFS was incredibly easy to install, and creating and exporting parallel filesystem shares was straightforward. The methods involved are easily scripted/automated or built into a server orchestration strategy. The process was so simple that initially we were thinking GlusterFS would be our default sharing option for all our work on the new compute cluster instances.
  • GlusterFS has ONE HUGE DOWNSIDE. It turns out that GlusterFS recommends that the participating disk volumes be formatted with an ext3 filesystem for best results. This is … problematic … with the 900GB ephemeral disks because formatting a 900 GB disk with ext3 takes damn near forever. We estimate about 15-20 minutes of wallclock time wasted while waiting for the “mkfs.ext3” command to complete.
  • The wallclock time lost to formatting ext3 volumes for GlusterFS usage is significant enough to affect how we may or may not use GlusterFS in the future. Maybe there is a different filesystem we can use with a faster formatting profile. Using XFS and software RAID we can normally stand up and export filesystems in a matter of a few seconds or a minute or two. Sadly, XFS is not recommended at all with current versions of GlusterFS.
  • Using GlusterFS with the recommended ext3 configuration seems to mean that we have to accept a minimum delay of 15 minutes or even more when standing up and exporting new storage. This is unacceptable for small deployments or workflows where you might only be running the EC2 instances for a short time.
  • The GlusterFS replication features could go a long way toward mitigating the risks of using ephemeral storage. We need to do more testing of this configuration.
  • Given the extensive wallclock time lost to waiting for ext3 filesystem formatting to complete in a GlusterFS scenario, it seems likely that we might default to a tuned NFS server setup for (a) small clusters & compute farms or (b) systems that we plan to stand up only for a few hours.
  • The overhead of provisioning GlusterFS becomes less significant when we have very large clusters that can benefit from its inherent scaling ability, or when we plan to stand up the clusters for longer periods of time.

Benchmark Results

In all the results shown below I’ve included data from a 2-disk RAID0 ephemeral storage setup. This is so that the network filesharing data can be contrasted against the results seen from running bonnie++ locally.


[Benchmark charts: glusterFS-004.png, glusterFS-005.png, glusterFS-006.png]

14 Comments
  • James Lowey
    Posted at 18:49h, 29 July

    This shows that on-demand HPC still has a ways to go. It is very unfortunate that only ext3 is supported in glusterfs; this would be a deal breaker for most of the data processing we do, because not only does ext3 take forever to format the file system, but the performance and capacity are sub-par as well. Without an “insta-format” FS like XFS, provisioning large (>100 TB) scratch file systems will not be acceptable for the average HPC user. Good overview; as usual your work is appreciated!

    • blogadmin
      Posted at 19:01h, 29 July

      So far my mental architecture is shaping up to be something like this:

      • Use the ephemeral disks for scratch space and live processing because they are fast compared to EBS
      • Use the ephemeral disks and the 10GbE network for building shared or parallel filesystems between nodes
      • Mount EBS or even RAID0 EBS volumes to a single server that re-exports over NFS. Use this persistent storage for result output

        Eventually the community will consolidate around some useful best practices. I’m actually kind of bullish on HPC workloads for AWS. Lots of success stories for HPC problems that are more CPU bound than storage or latency sensitive. I think the 10GbE network is going to be a huge win because it lets us build “legacy” compute farm architectures in the cloud with storage schemes that make sense to people using traditional clusters and farms. There is going to be a long transition period to “the cloud” and during this period there are still a ton of apps and use cases where the path of least resistance is to just stand up that Grid Engine, PBSPro or Platform LSF cluster in the cloud.

        -Chris

  • Anand Babu Periasamy
    Posted at 12:06h, 30 July

    Great post.

    Can you also try our native scale-out NFS protocol? The next 3.1 release of Gluster will include it by default. The NFS support is currently in beta. This beta release has a 64k block size set by default; the final release will have larger block support and higher performance.

    Source:
    http://ftp.gluster.com/pub/gluster/glusterfs/qa-releases/nfs-beta/nfs-beta-rc10/glusterfs-nfs_beta_rc10.tar.gz

    Install + Release Notes:
    http://ftp.gluster.com/pub/gluster/glusterfs/qa-releases/nfs-beta/nfs-beta-rc10/glusterfs-nfs_beta_rc10.tar.gz

    Regarding Ext3 as the backend: we recommend Ext3 because of its proven stability. I personally like Ext4 until Btrfs arrives. If you have 2.6.31 or higher, Ext4 is a better bet. XFS in the past had issues with recovery from crashes. If you have had good experience with XFS, try GlusterFS + XFS. All GlusterFS requires is a POSIX-compliant disk filesystem with extended attribute support.

    For a 10GigE environment, you should turn off TCP nodelay under the protocol/server and protocol/client sections:
    option transport.socket.nodelay off

    Did you try distribute or stripe? For most needs, distribute gives the best performance.

    Happy Hacking!
    — Anand Babu Periasamy

  • Vikas Gorur
    Posted at 18:09h, 03 August

    Let me add a few remarks about Ext3 as well.

    The reason we usually recommend Ext3 instead of XFS is that the implementation of extended attributes in Ext3 is significantly faster than in XFS. GlusterFS makes use of extended attributes quite a bit, especially in the replicate translator. In an environment with lots of small files and many creation/deletion operations, using XFS with replication will be slower than Ext3.

    However, if your workload consists of mostly large files and relatively fewer create/delete operations, you might find that the performance XFS delivers is acceptable. We have many successful deployments that use XFS in just this way.

    In summary, I’d say if start-up time is an issue for you, give XFS a shot. It is by no means “never recommended”.

    Cheers,
    – Vikas
    Engineer, Gluster, Inc.

  • Kurt Gray
    Posted at 16:14h, 16 August

    Instead of spending 20 minutes to format an ext3 volume every time you need one, you could just snapshot a blank ext3 volume and use that as your starting snapshot when creating a new volume.

  • Tom
    Posted at 04:24h, 10 October

    We have also done a lot of testing with NFS on the AWS cloud. We found that performance decreases dramatically if you use large buffer sizes like “rsize=32768,wsize=32768”. Using the default sizes improves performance by several times.

    -tom

    • blogadmin
      Posted at 07:48h, 10 October

      Thanks Tom! This is a great tip. We are starting a new round of low-level AWS testing this month. — Chris

  • Tom
    Posted at 08:06h, 11 October

    I have just done some simple local and NFS IO tests using ‘dd’.
    In our case the NFS server was a cluster compute instance, connected to regular m1.large client instances outside the cluster placement group. We found a significant improvement vs. a regular m1.large NFS server.

    We are using 8-way striped EBS volumes on the server, which seem to give better performance than just 2 ephemeral drives, and more importantly they are durable.

    Note that all reads are after paying the EBS first-use penalty.
    All tests were repeated about 5 times and caches were flushed on both the server and the clients. Test files were 1 GB.

    Local read from striped disks 400 MB/sec
    (This compares to 100 MB/sec on non-cluster servers)

    – single nfs client read 95 MB/s

    parallel NFS clients   MB/s per client   aggregate MB/s
    --------------------   ---------------   --------------
    1                      95                95
    2                      80                160
    3                      65                195
    4                      60                240
    5                      45                225

    -tom

  • Tom
    Posted at 08:11h, 11 October

    It looks like my table formatting got messed up in the previous post.
    Here it is in a clearer format:

    – 2 clients reading 2 files – 80 MB/s aggregate 160
    – 3 clients reading 3 files – 65 MB/s aggregate 195
    – 4 clients reading 4 files – 60 MB/s aggregate 240
    – 5 clients reading 5 files – 45 MB/s aggregate 225

    • blogadmin
      Posted at 08:45h, 11 October

      Tom – for the 8-way striped EBS volume have you found a method for snapshotting or backing it up? I’ve heard about using xfs_freeze followed by individual EBS volume snapshots in theory but have not tested it in any significant way. –Chris

  • Ben Golub
    Posted at 12:34h, 25 February

    Great Post.

    Gluster has recently released an Amazon Machine Image that should make it significantly easier to deploy in AWS, with specific optimizations for performance and replication. There is a free trial available at:

    http://www.gluster.com/products/virtual-storage-appliances/ami/

    Also, Gluster recently published an extensive doc on performance, including results of tests for AWS vs. bare metal, NFS vs. Native FUSE, etc.
    http://www.gluster.com/products/performance-in-a-gluster-system-white-paper/

  • Tom
    Posted at 03:30h, 07 April

    Sorry for the late reply, I just saw the updates.

    Yes, we snapshot the 8 volumes after an xfs_freeze. We have done many restores and it works like a charm every time.
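
    A minimal sketch of that freeze / snapshot / thaw sequence, using the old ec2-api-tools command names with placeholder volume IDs and mount point:

        # quiesce the XFS filesystem that spans the striped EBS volumes
        xfs_freeze -f /data

        # snapshot each member volume while the filesystem is frozen
        for vol in vol-11111111 vol-22222222; do   # ...and the remaining members
            ec2-create-snapshot "$vol"
        done

        # thaw the filesystem so writes can continue
        xfs_freeze -u /data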
