May 01 2016

So, after putting aside the charge controller for now, I’ve taken some time to see if I can get the software side of things into shape.

In the midst of my development, I found a small wiring fault that was responsible for blowing a couple of fuses. A small nick in the sheath of the positive wire in a power cable was letting the crimp part of a DC barrel connector contact +12V. A tweak of that crimp and things are back to normal. I’ve swapped all the 10A fuses for 5A ones, since the regulators are only rated at 7.5A.

The VLANs are assigned now, and I have bonding going between the two pairs of Ethernet devices. Despite the switch officially supporting only 4 LAGs, it seems quite happy with me doing LACP on effectively 10 LAGs. I’ll see how it goes.
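
For the curious, the bonds themselves are nothing exotic; the equivalent iproute2 incantation for one node looks something like this (the interface names and VLAN ID are placeholders for whatever your hardware presents):

# ip link add bond0 type bond mode 802.3ad miimon 100
# ip link set enp1s0 down; ip link set enp1s0 master bond0
# ip link set enp2s0 down; ip link set enp2s0 master bond0
# ip link add link bond0 name bond0.10 type vlan id 10
# ip link set bond0 up; ip link set bond0.10 up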

The switch has 5 ports spare after plugging in all 5 nodes and a 16-port switch for the IPMI subnet. One will be used for a management interface so I can plug a laptop in, and the others will be paired with LACP for linking to my two existing Cisco SG200-8s.

One of the goals of this project is to try and push the performance of Ceph. In the office, we tried bare Ceph, and found that, while it’s fine for sequential I/O, it suffers a bit with random read/writes, and Windows-based HyperV images like to do a lot of random reads/writes.

Putting FlashCache in the mix really helped, but I note that it’s no longer maintained. EnhanceIO had only just forked from FlashCache when I tried the latter; now it seems EnhanceIO is the official successor.

There are two alternatives to FlashCache/EnhanceIO: bcache and dm-cache.

I’ll rule out bcache now, as it requires the backing device to be “formatted” for use. In other words, the backing image is not a raw image, but a bcache-specific format. This isn’t unworkable, but it raises portability concerns for me: if I migrate a VM, do I need to migrate its cache too, or is it sufficient to cleanly shut down and detach the bcache device before re-assembling it on the new host?
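
For reference, this is the “formatting” I’m referring to: make-bcache writes its own superblock to both devices, so the backing data ends up inside a bcache container rather than a raw image (device paths below are placeholders):

# make-bcache -C /dev/ssd-partition     # format the SSD as a cache device
# make-bcache -B /dev/backing-device    # format the backing device; the usable disk then appears as /dev/bcacheN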

By contrast, dm-cache and EnhanceIO/FlashCache work with raw backing images, making them much more attractive: flush the cache before migration, or use write-through mode, and all should be fine. dm-cache does, however, require a separate metadata device: messy, but not unworkable. We can provision the cache-related devices we need using LVM2, and use the kernel-mode Rados block device as our backing image.
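
As a rough sketch of how that assembly might look on a compute node (volume group, pool and image names here are invented, and I’m assuming the mapped RBD turns up as /dev/rbd0):

# lvcreate -L 8G  -n vm0_cache vg_ssd
# lvcreate -L 64M -n vm0_meta  vg_ssd
# rbd map one/vm0-disk0
# dmsetup create vm0_cached --table "0 $(blockdev --getsz /dev/rbd0) cache /dev/vg_ssd/vm0_meta /dev/vg_ssd/vm0_cache /dev/rbd0 512 1 writethrough default 0"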

So I think my caching subsystem is a two-horse race: dm-cache or EnhanceIO. I guess we’ll give them a try and see how they go.

For those following along at home, if you’re running a kernel newer than 4.3, you might want to use this fork of EnhanceIO due to changes in the kernel block I/O layer.

To manage the OpenNebula master node, I’ve installed corosync/pacemaker. Normally these are used with DRBD; however, I figure Ceph can fulfil that role, since the concept is similar: a replicated block device that the standby host can take over. I’m not sure at this point whether it’ll be LXC, Docker or a VM that “contains” the server, but whatever it is, it should be possible for it to have its root FS and data on Ceph.
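
As a very rough sketch of where I’m heading with that (crmsh syntax; resource names and paths are invented, the RBD is assumed to be mapped already, e.g. by the rbdmap service, and I’ve used VirtualDomain here as if it were a KVM guest; an LXC resource agent could be substituted):

primitive p_one_fs ocf:heartbeat:Filesystem \
    params device="/dev/rbd/rbd/one-master" directory="/var/lib/one" fstype="ext4" \
    op monitor interval="20s"
primitive p_one_vm ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/one-master.xml" \
    op monitor interval="30s"
group g_one p_one_fs p_one_vm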

I’m leaning towards LXC for this. Time for some more experimentation.

Apr 09 2016

One elephant in the room is how I’m going to house the system whilst it’s in operation.

The obvious solution is some sort of metal cabinet with provision for 19″ rack mounting and DIN rail equipment. Question is, how big?

A big consideration here is heat: when going flat out, there will be 100W-150W of heat being dissipated in there, so room for convection currents is a must!

Some decent fans on the top to suck the hot air out would also be a good idea, blowing upwards so that dust doesn’t get sucked down into the works.

I figured I’d sit everything roughly in situ to see how it fits. It turns out the DIN rail mounts don’t have to go on the bottom of these cases: if you remove the front panel, there are four holes for mounting the same DIN rail mounts on the front. So that’s what I’ve done, and I now have a DIN rail spare for future expansion.

If I try to pack everything up as densely as possible (not wise), this is what it looks like:

There’s room for possibly one more node to squeeze in, though I think that would be pushing it. Five is probably a good number, as it means we can space the units out a bit and let them draw air in through the gaps.

On top of the units I have my two switches. The old Netcomm 24-port switch was retired from our network when a lightning strike to a neighbour’s tree took out an 8-port switch, my Yaesu FT-897D radio transceiver, some ports on a wireless 3G router/switch, and an ADSL router. It also damaged some ports on the big Netcomm switch, so in short, I know it has issues.

Replacing its 3.3V PSU with one that steps down from 12V would cost me about as much as a brand-new 16-port 10/100Mbps switch.

When we replaced the switch (paid for by insurance) we decided to buy an 8-port and a 16-port switch. The 16-port switch, since retired due to an upgrade to gigabit, is sitting on top, and takes 12V 1A input. It’ll be perfect for the IPMI VLAN, where speed is not important, and it also accepts the DC plugs I bought by mistake.

The 8-port one takes 7.5V 1A, so it’s a little less convenient for this task: I’d need to make a DC-DC converter for it. Maybe later, if this works out.

So considering a cabinet for this, we have:

  • 5 nodes measuring 190mm in height: ~5 RU
  • A 24-port switch: 1 RU
  • A 16-port switch: 1 RU
  • Some power distribution electronics: ~3 RU

Yes, the battery and its charger are external to the cabinet.

Judging from this, the cabinet probably needs to be 10RU or 12RU to give us space to mount everything cleanly and to ensure good ventilation. Using 8-port IPMI switches and 24+2-port comms switches leaves sufficient port space for the 5 nodes, with one port left over for a small in-chassis monitoring device and 4 ports on the main switch free for an uplink trunk.

You could conceptually then consider these as homogeneous building blocks for larger networks, using Ceph’s CRUSH maps to ensure copies get distributed amongst these “cabinets”.
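
In CRUSH terms, each cabinet would become (say) a rack-level bucket, and the placement rule would spread replicas across them. A decompiled-map fragment might look like this, with bucket names, weights and the host definitions behind them all hypothetical:

rack cabinet1 {
        id -10
        alg straw
        hash 0
        item node1 weight 1.000
        item node2 weight 1.000
        item node3 weight 1.000
}

rule replicated_across_cabinets {
        ruleset 1
        type replicated
        min_size 2
        max_size 3
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}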

Mar 25 2016

Well, I figured I’d document this project here in case anyone is interested in doing something like this for personal amusement or for their workplace.

The list I’ve just chucked up is not complete, nor is it a prescription of exactly what’s needed; rather, it’s what I’ve either acquired or will acquire.

The basic architecture is as follows:

  • The cluster is built up on discrete nodes which are based around a very similar hardware stack and are tweaked for their function.
  • Persistent data storage is handled by the storage nodes using the Ceph object storage system. Ceph requires that a majority quorum of its monitors is maintained, so a minimum of 3 storage nodes is required.
  • Virtual machines run on the compute nodes.
  • Management nodes oversee co-ordination of the compute nodes: this ideally should be a separate pair of machines, but for my use case, I intend to use a virtual machine or container managed using active/passive failover techniques.
  • In order to reduce virtual disk latency, the compute nodes will implement a local disk cache using an SSD, backed by a Rados Block Device on Ceph.

I’ll be using KVM as the virtualisation technology with Gentoo Linux as the base OS for this experimental cluster. At my workplace, we evaluated a few different technologies including Proxmox VE, Ganeti, OpenStack and OpenNebula. For this project, I intend to build on OpenNebula as it’s the simplest to understand and the most suited to my workplace’s requirements.

Using Gentoo makes it very easy to splice in patches as I’ll be developing as I go along. If I come to implement this in the office, I’ll be porting everything across to Ubuntu. This will be building on some experimental work I’ve done in the past with OpenNebula.

For the base nodes themselves, I’ve based them around these components:

For the storage nodes, add to the list:

Other things you may want/need:

  • A managed switch. I ended up choosing the Linksys LGS-326AU, which U-Mart were selling at AU$294. If you’ve ever used Cisco’s small business offerings, this unit will make you feel right at home.
  • DIN rail. Jaycar sell this in 1m lengths, and I’ll grab some tomorrow.

Most of the above bits I have; the nodes are all basically built as of this afternoon, minus the SATA adaptors for the three storage nodes. All units power on and do what one would expect of a machine that’s trying to boot from a blank SSD.

I did put one of the compute nodes through its paces, network booting the machine via PXE/NFS root and installing Gentoo.
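
The PXE side is just the classic pxelinux + NFS-root recipe (assuming a kernel with NFS-root support built in); a minimal sketch of the boot entry, with the kernel name and NFS server address being whatever suits your LAN:

DEFAULT gentoo
LABEL gentoo
    KERNEL vmlinuz-4.4.6-gentoo
    APPEND root=/dev/nfs nfsroot=192.168.0.1:/srv/nfs/gentoo ip=dhcp rw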

Power draw was below 1.8A at a battery voltage of about 13.4V (so under about 24W), even when building the Linux 4.4.6 kernel (using make -j8), which it did in about 10 minutes. Watching this thing tackle compile jobs is a thing of beauty; I can’t wait to get distcc going and have 40 CPU cores tear into the bootstrap process. The initial boot also looks beautiful, with 8 penguins lined up representing the 8 cores. Don’t turn up here in a tuxedo!
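
The distcc side should be straightforward under Gentoo; roughly this on each node, with the hostnames being whatever the nodes end up called, and the usual caveat that all nodes need matching toolchains:

# In /etc/portage/make.conf:
FEATURES="distcc"
MAKEOPTS="-j40 -l8"

# Then point distcc at the other nodes:
# distcc-config --set-hosts "localhost/8 node2/8 node3/8 node4/8 node5/8"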

So hardware-wise, things are more or less together; what remains is mostly software. I’ll throw up some notes on how it’s all wired, but basically the short-term plan is that a 240V mains charger (surplus from a caravan) will keep the battery floated until I get the solar panel and controller set up.

When that happens, I plan to wire a relay in series with the 240V charger, controlled by a comparator, so that mains power only kicks in when the battery voltage drops below 12V.

The switch is a 240V device unfortunately (couldn’t find any 24-port 12V managed switches) so it’ll run from an inverter. Port space is tight, and I just got the one since they’re kinda pricey. Long term, I might look at a second for redundancy, although if a switch goes, I won’t lose existing data.

ADSL2+ will be handled by a small localised battery back-up and a small computer acting as a router, possibly a Raspberry Pi as I have an original Model B spare. That machine can temporarily store incoming SMTP traffic if the cluster does go down (heaven forbid!) and act as a management endpoint. There are a few other contenders here, including these industrial computers, for which I already maintain a modern Linux kernel port for my workplace.

Things are coming together, and I hope to bring more on this project as it moves ahead.

Apr 03 2014

Well, lately I’ve been doing some development work with OpenNebula.

We’ve recently deployed a 3-node Ceph cluster which we intend to use as back-end storage for numerous things, VM storage among them.  Initially I thought the throughput would be “good enough”: 3 hosts, each with a gigabit link, supplying VM hosts that also have gigabit backhaul links.

It’d be comparable to typical HDDs, or so I thought.  What I didn’t count on was the random-read latency introduced by round-tripping over the network, plus the other overheads.  When I tried Ceph with just libvirt, things weren’t too bad; I was close to saturating my 1Gbps link.  Put two VMs on and again things hummed along: not blisteringly fast, mind you, but reasonable.

I got OpenNebula talking to it easily enough.  We’re running the stable version, 4.4.  There are a few things I learned about the way OpenNebula uses Ceph:

  • OpenNebula uses v1-format RBDs (the Ceph default actually)
  • Since v1 RBDs don’t support COW clones, instance images are copied.
  • Copying a 160GB image in triplicate over gigabit Ethernet takes a while, and brought our little cluster to a crawl.
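
For contrast, v2-format images support layering, so deploying an instance can be a near-instant copy-on-write clone instead of a full copy. Pool and image names below are just examples (and --size is in MB):

# rbd create --image-format 2 --size 163840 one/base-image
# rbd snap create one/base-image@gold
# rbd snap protect one/base-image@gold
# rbd clone one/base-image@gold one/vm42-disk0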

Naturally, we’re looking into beefing up the network links and CPUs on the storage nodes, but I’ve also been looking at ways to reduce the load on the back-end cluster.  One is through caching.  There are a couple of projects out there which allow you to combine two types of storage, using a smaller, faster block device to act as a cache for a larger, slower device.  Two which immediately come to mind: FlashCache and bcache.

bcache is on the TODO list: it has a few more knobs and dials to play with, and it can share a single cache device among multiple back-end devices, so it might yet be worth investing time in.

Sébastien Han posted a guide on doing RBD caching using FlashCache, and so my work has largely been based on this initial work.  I’ve been hacking up an OpenNebula datastore management and transfer management driver which harnesses FlashCache and the newer v2 RBD format to produce a flexible storage subsystem for OpenNebula.

The basic concept is simple enough:

  • The Logical Volume Manager (LVM) is used to allocate slices of an SSD to use as cache for back-end RBDs.
  • For non-persistent images, a new copy-on-write clone of the base image is created
  • A flashcache composite device is produced using the LVM volume as cache and the RBD as the backend
  • KVM/QEMU/Xen uses this composite device like a regular disk
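
In command terms, the per-VM sequence looks roughly like the following; the driver wraps this sort of logic in scripts, the device, pool and volume names here are invented, and I’m assuming the mapped RBD appears as /dev/rbd0:

# lvcreate -L 8G -n cache_vm42 vg_ssd
# rbd clone one/base-image@gold one/vm42-disk0
# rbd map one/vm42-disk0
# flashcache_create -p thru cachedev_vm42 /dev/vg_ssd/cache_vm42 /dev/rbd0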

The initial attempt worked well for Linux VMs: initial read performance was between 20MB/sec and 120MB/sec depending on network/storage cluster load, and subsequent reads would then exceed 240MB/sec.  Write performance was limited to what the cluster could do, unless you used writeback mode, at which point speed picked up dramatically.

Windows proved to be a puzzle: it seems some Windows images have an odd way of accessing the disk, and this impacts performance badly.  In many cases the images were sparse in nature, with most of the content in the first 8GB.  So I made sure to allocate 8GB chunks of my SSD and performed what I call pre-caching: seeding the SSD with the initial 8GB (or however big the SSD partition is) of the image.
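
The seeding itself can be as simple as reading the start of the composite device, since flashcache populates the SSD as a side effect of the reads; something like this, using the same hypothetical device name as above and sized for an 8GB cache:

# dd if=/dev/mapper/cachedev_vm42 of=/dev/null bs=4M count=2048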

That picks up the initial boot performance by a big margin, at the cost of the image taking a little longer to deploy in the PROLOG stage.

For those who are interested, some early code is available via git.

bcache might be worth a look-in as it has read-ahead caching, but I haven’t tried it yet.  I’d like to split the caching subsystem out and have cache drivers, much like we have drivers for datastore managers and transfer managers.  The same concept would work for iSCSI/CLVM or Gluster storage just as it does for Ceph.

Feb 25 2014

Hi all,

This is more a note to myself on how to configure stgt to talk to a Ceph RBD. Everyone seems to recommend patching tgt-admin: this is simply not necessary. The challenge is just the lax way that tgt-admin parses its configuration file.

My scenario: a VMware ESXi virtual machine host needing to use storage on Ceph.
I have 3 storage nodes running ceph-mon and ceph-osd daemons. They also have a version of tgtd that supports Ceph. (See the ceph-extras repository.)

The targets live in the /etc/tgt/conf.d/${CLIENT}.conf configuration file. (I’m putting all the targets for ${CLIENT} here.)

# Target naming: iqn.yyyy-mm.backwards.domain.your:client.target
# where yyyy-mm: year and month of target creation
# backwards.domain.your: Your domain name; written backwards.
# client.target: A name for the target, since it's for one client here I name it
# as the client's host name then give the rest some descriptive title.
<target iqn.2014-02.domain.my:my-client.my-target-name>
    driver iscsi
    bs-type rbd
    backing-store pool-name/rbd-name
    initiator-address ip.of.my.client
</target>

For better or worse, I run the tgt daemon on the Ceph nodes themselves. Multipath I’m not sure about at this point: I’ve set up the targets on all of my Ceph nodes so I can connect to any of them, but I haven’t tested this yet.

To enable that target:

# tgt-admin -v -e

Then to verify:

# tgt-admin -s

You should see your LUNs listed.
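
If you want a quick sanity check from a Linux initiator before pointing ESXi at it (assuming that machine’s IP is also listed in initiator-address), the usual open-iscsi steps apply:

# iscsiadm -m discovery -t sendtargets -p ip.of.ceph.node
# iscsiadm -m node -T iqn.2014-02.domain.my:my-client.my-target-name -p ip.of.ceph.node --login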