Apr 032014

Well, lately I’ve been doing some development work with OpenNebula.

We’ve recently deployed a 3-node Ceph cluster which we intend to use as our back-end storage for numerous things: among them being VM storage.  Initially I thought the throughput would be “good enough”, 3 hosts each with gigabit links supplying VM hosts with gigabit backhaul links.

It’d be comparable to typical HDDs, or so I thought.  What I didn’t count on in particular was the random-read latency introduced by round-tripping over the network and overheads.  When I tried Ceph with just libvirt, things weren’t too bad, I was close to saturating my 1Gbps link.  Put two VMs on and again, things hummed along.  Not blistering fast mind you but reasonable.

I got OpenNebula talking to it easy enough.  We’re running the stable version: 4.4.  There are a few things I learned about the way OpenNebula uses Ceph:

  • OpenNebula uses v1-format RBDs (the Ceph default actually)
  • Since v1 RBDs don’t support COW clones, instance images are copied.
  • Copying a 160GB image in triplicate over gigabit Ethernet takes a while, and brought our little cluster to a crawl.

Naturally, we’re looking into beefing up the network links and CPUs on the storage nodes, but I’ve also been looking at ways to reduce the load on the back-end cluster.  One is through caching.  There are a couple of projects out there which allow you to combine two types of storage, using a smaller, faster block device to act as a cache for a larger, slower device.  Two which immediately come to mind: FlashCache and bcache.

bcache is on the TODO list, it has a few more knobs and dials to be able to play with, and shares a single cache device with multiple back-end devices, so might yet be worth investing time in.

Sébastian Han posted a guide on doing RBD caching using FlashCache, and so my work has largely been based on this initial work.  I’ve been hacking up a OpenNebula datastore management and transfer management driver which harnesses FlashCache and the newer v2 RBD format to produce a flexible storage subsystem for OpenNebula.

The basic concept is simple enough:

  • Logical Volume Manager, is used to allocate slices of a SSD to use as cache for back-end RBDs.
  • For non-persistent images, a new copy-on-write clone of the base image is created
  • A flashcache composite device is produced using the LVM volume as cache and the RBD as the backend
  • KVM/QEMU/Xen uses this composite device like a regular disk

The initial attempt worked well for Linux VMs, read performance initially would be between 20MB/sec and 120MB/sec depending on network/storage cluster load.  Subsequent reads would then exceed 240MB/sec.  Write performance was limited to what the cluster could do, unless you used writeback mode at which point speed picked up dramatically.

Windows proved to be a puzzle, it seems some Windows images have an odd way of accessing the disk, and this impacts performance badly.  In many cases, the images were of a sparse nature, with most of the content being in the first 8GB.  So I made sure to allocate 8GB chunks of my SSD, and performed what I call pre-caching: seeding the contents of the SSD with the initial 8GB (or however big the SSD partition is) of the image.

That picks up the initial boot performance by a big margin, at the cost of the image taking a little longer to deploy in the PROLOG stage.

For those who are interested, some early code is available via git.

bcache might be worth a look-in as it has read-ahead caching.  I haven’t done so yet.  I’d like to split the caching subsystem out and have cache drivers much like we have for datastore managers and transfer managers alike.  The same concept would work for iSCSI/CLVM storage or Gluster storage as it does for Ceph.