OpenStack

Solar Cluster: WTF

So… with the new controller we’re able to see how much current we’re getting from the solar.  I note they omit the solar voltage, and I suspect the current is how much is coming out of the MPPT stage, but still, it’s more information than we had before.

With this, we noticed that on a good day, we were getting… 7A.

That’s about what we’d expect for one panel.  What’s going on?  Must be a wiring fault!

I’ll admit that when I made the mounting for the solar controller, I didn’t account for the bend radius of the 6-gauge wire I was using, and found it difficult to feed it into the controller properly.  No worries: this morning at 4AM I powered everything off, took the solar controller off, drilled 6 new holes a bit lower down, fed the wires through and screwed them back in.

Whilst it was all off, I decided I’d individually charge the batteries.  So, right-hand battery came first, I hook the mains charger directly up and let ‘er rip.  Less than 30 minutes later, it was done.

So, disconnect that, hook up the left hand battery.  45 minutes later the charger’s still grinding away.  WTF?

Feel the battery… it is hot!  Double WTF?

It would appear that this particular battery is stuffed.  I’ve got one good one though, so for now I pull the dud out and run with just the one.

I hook everything up,  do some final checks, then power the lot back up.

Things seem to go well… I do my usual post-blackout dance of connecting my laptop up to the virtual instance management VLAN, waiting for the OpenNebula VM to fire up, then log into its interface (because we’re too kewl to have a command line tool to re-start an instance), see my router and gitea instances are “powered off”, and instruct the system to boot them.

They come up… I’m composing an email, hit send… “Could not resolve hostname”… WTF?  I wander downstairs, note the LED on the main switch flashing furiously (as it does on power-up), and a chorus of POST beeps tells me the cluster got hard-power-cycled.  But why?  Okay, it’s up now; back upstairs, connect to the VLAN, re-start everything again.

About to send that email again… boompa!  Same error.  Sure enough, my router is down.  Wander downstairs, and as I get near, I hear the POST beeps again.  Battery voltage is good, about 13.2V.  WTF?

So, about to re-start everything, then I lose contact with my OpenNebula front-end.  Okay, something is definitely up.  Wander downstairs, and the hosts are booting again.  On a hunch I flick the off-switch to the mains charger.  Klunk, the whole lot goes off.  There’s no connection to the battery, and so when the charger drops its power to check the battery voltage, it brings the whole lot down.

WTF once more?  I jiggle some wires… no dice.  Unplug, plug back in, power blinks on then off again.  What is going on?

Finally, I pull the right-hand battery out (the left-hand one is already out and cooling off, still very warm at this point).  13.2V between the negative terminal and positive on the battery, good… 13.2V between negative and the battery side of the isolator switch… unscrew the fuse holder… 13.2V between the fuse holder terminal and the negative side… but 0V between the negative side on the battery and the positive terminal on the SB50 connector.

No apparent loose connections, so I grab one of my spares, swap it with the existing fuse.  Screw the holder back together, plug the battery back in, and away it all goes.

This is the offending culprit: a 40A 5AG fuse, bought for its current-carrying capacity, not for the “bling factor” (gold conductors).

If I put my multimeter in continuity test mode and hold a probe on each end cap, without moving the probes, I hear it go open-circuit, closed-circuit, open-circuit, closed-circuit.  Fuses don’t normally do that.

I have a few spares of these, thankfully, but I will be buying a couple more to replace the one that’s now dead.  Ohh, and it looks like I’m up for another pair of batteries; we will have a working spare 105Ah battery once the new ones are in.

On the RAM front… the firm I bought the last lot through did get back to me with some DDR3L ECC SO-DIMMs, again made by Kingston.  Sounded close enough; they were 20c apiece more (AU$855 for 6 vs AU$864.50).

Given that it was likely this would be an increasing problem, I thought I’d at least buy enough to ensure every node had two matched sticks in, so I told them to increase the quantity to 9 and to let me know what I owe them.

At first they sent me the updated invoice with the total amount (AU$1293.20).  No problems there.  It took a bit of back-and-forth before I finally confirmed they had received the previous amount I’d sent them.  Great, so into the bank I trundle on Thursday morning with the updated invoice, and I pay the remainder (AU$428.70).

Friday, I get the email to say that product was no longer available.  They instead suggested some Crucial modules, which were $60 apiece cheaper.  Well, when entering a gold mine, one must prepare themselves for the shaft.

Checking the link, I found it: these were non-ECC, 1Gbit×64, not 1Gbit×72 like I had ordered.  In any case, I was over it; I fired back an email telling them to cancel the order and return the money.  I was in no mood for Internet shopper Russian Roulette.

It turns out I can buy the original sticks through other suppliers, just not in the quantities I’m after.  So while I might be able to buy one or two from a given supplier, I can’t buy 9.  Kingston have stopped making them, so what’s left is whatever companies have in stock.

So I’ll have to move to something else.  It’d be worth buying one stick of the original type so I can pair it with one of the others, but no more than that.  I’m in no mood to do this again in a few years’ time when parts are likely to be even harder to source… so I think I’ll bite the bullet and go to 16GB modules.  Due to the limits on my debit card though, I’ll have to buy them two at a time (~AU$900 each go).  The plan is:

  1. Order in two 16GB modules and an 8GB module… take the existing 8GB module out of one of the compute nodes and install the two 16GB modules into that node.  Install the brand-new 8GB module and the recovered 8GB module into two of the storage nodes.  One compute node now has 32GB RAM, and two storage nodes are upgraded to 16GB each.  The remaining compute node and storage node each still have 8GB.
  2. Order in two more 16GB modules… pull the existing 8GB module out of the other compute node, install the two 16GB modules.  Then install the old 8GB module into the remaining storage node.  All three storage nodes now have 16GB each, both compute nodes have 32GB each.
  3. Order two more 16GB modules, install into one compute node, it now has 64GB.
  4. Order in last two 16GB modules, install into the other compute node.

Yes, expensive, but sod it.  Once I’ve done this, the two nodes doing all the work will be at their maximum capacity.  The storage nodes are doing just fine with 8GB, so 16GB should mean there’s plenty of RAM for caching.

As for virtual machine management… I’m pretty much over OpenNebula.  Dealing with libvirt directly is no fun, but at least once configured, it works!  OpenNebula has a habit of not differentiating between a VM being powered off (as in, me logging into the guest and issuing a shutdown), and a VM being forcefully turned off by the host’s power getting yanked!

With the former, there should be some event fired off by libvirt to tell OpenNebula that the VM has indeed turned itself off.  With the latter, it should observe that one moment the VM is there, and the next it isn’t… the inference being that it should still be there, and that perhaps that VM should be re-started.

This could be a libvirt limitation too.  I’ll have to research that.  If it is, then the answer is clear: we ditch libvirt and DIY.  I’ll have to research how I can establish a quorum and schedule where VMs get put, but it should be doable without the hassle that OpenNebula has been so far, and without going to the utter tedium that is OpenStack.
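For what it’s worth, as far as I can tell libvirt’s lifecycle events do carry a reason code (clean shutdown versus crashed versus destroyed) while libvirtd is running; a hard power-cycle of the host produces no event at all, which is where the “it was there a moment ago” inference has to come in.  A quick way to watch those events from a shell is:

virsh event --all --loop

And the DIY end of the spectrum could be as crude as a cron job comparing what should be running against what is.  A rough sketch only, assuming a hand-maintained list of domain names in /etc/vm-watchdog/should-run (both the path and the approach are hypothetical):

#!/bin/sh
# Naive restart watchdog: start any wanted domain that libvirt reports
# as shut off.  Note this deliberately cannot tell a clean guest
# shutdown from a host power-yank; that distinction is the crux of
# the complaint above.
WANTED=/etc/vm-watchdog/should-run

while read -r dom; do
        state=$(virsh domstate "$dom" 2>/dev/null)
        if [ "$state" = "shut off" ]; then
                echo "Restarting $dom"
                virsh start "$dom"
        fi
done < "$WANTED"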

Solar Cluster: Rack installed in-situ

So, there’s still some work to be done, for example making some extension leads for the run between the battery link harness, the load power distribution and the charger… and generally tidying things up, but it is now up and running.

On the floor are the 240V-12V power supply and the charger, which right now is hard-wired in boost mode.  In the bottom of the rack are the two 105Ah 12V AGM batteries, in boxes with fuses and isolation switches.

The nodes and switching are inside the rack, and resting on top is the load power distribution board, which I’ll have to rewire to make things a little neater.  One prospect is to mount some of this on the back.

I had a few introductions to make, introducing the existing pair of SG-200 switches to the newcomer and its VLANs, but now at least I’m able to SSH into the nodes, access the IPMI BMCs and generally configure the whole box and dice.

With the exception of the later upgrade to solar and the aforementioned wiring harness clean-ups, the hardware side of this dual hardware/software project is largely complete, and it now transitions to being a software project.

The plan from here:

  • Update the OSes… as all will be a little dated. (I might even blow away and re-load.)
  • Get Ceph storage up and running. It actually should be configured already; it’s just a matter of getting DNS hostnames sorted out so the nodes can find each other (a sketch of what I mean follows this list).
  • Investigate the block caching landscape: when I first started the project at work, it was a 3-horse race between Facebook’s FlashCache, bcache and dm-cache. Well, FlashCache is no more, replaced by EnhanceIO, and I’m not sure about the rest of the market. So this needs researching.
  • Management interfaces: at my workplace I tried Ganeti, OpenNebula and OpenStack. This, again, needs re-visiting. OpenNebula has moved a long way from where it was, and I haven’t looked at the others in a while. OpenStack had me running away screaming, but maybe things have improved.
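On the Ceph point: the “finding each other” part mostly comes down to every node resolving the monitor names consistently.  Matching /etc/hosts entries on each node would be the bluntest fix; the names and addresses below are made up purely for illustration:

# /etc/hosts (example entries only)
10.20.30.1      storage0
10.20.30.2      storage1
10.20.30.3      storage2

…with ceph.conf’s mon_initial_members and mon_host pointing at those same names and addresses.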

apt repository hell whilst installing mariadb-galera-server on Ubuntu

Hi all,

Not often I have a whinge about something, but this problem has been bugging me of late more than somewhat.  I’m in the process of setting up an OpenStack cluster at work.  Now, as the underlying OS we’ve chosen Ubuntu Linux, which is fine.  Ubuntu is quite a stable, reliable and well-supported platform.

One of my pet peeves, though, is when a package manager decides to get lazy.  Now, those of us who have been around the Linux scene for a while have probably discovered RPM dependency hell… and the smug Debian users who tell us that Debian doesn’t do this.

Ho ho, errm… no, when APT wants to go into dummy mode, it does so with style:

Nov 12 05:32:27 in-target: Setting up python3-update-manager (1:0.186.2) ...
Nov 12 05:32:27 in-target: Setting up python3-distupgrade (1:0.192.13) ...
Nov 12 05:32:27 in-target: Setting up ubuntu-release-upgrader-core 
(1:0.192.13) ...
Nov 12 05:32:27 in-target: Setting up update-manager-core (1:0.186.2) ...
Nov 12 05:32:27 in-target: Processing triggers for libc-bin ...
Nov 12 05:32:27 in-target: ldconfig deferred processing now taking place
Nov 12 05:32:27 in-target: Processing triggers for initramfs-tools ...
Nov 12 05:32:27 in-target: Processing triggers for ca-certificates ...
Nov 12 05:32:27 in-target: Updating certificates in /etc/ssl/certs... 
Nov 12 05:32:29 in-target: 158 added, 0 removed; done.
Nov 12 05:32:29 in-target: Running hooks in /etc/ca-certificates/update.d....
Nov 12 05:32:29 in-target: done.
Nov 12 05:32:29 in-target: Processing triggers for sgml-base ...
Nov 12 05:32:29 pkgsel: installing additional packages
Nov 12 05:32:29 in-target: Reading package lists...
Nov 12 05:32:29 in-target: 
Nov 12 05:32:29 in-target: Building dependency tree...
Nov 12 05:32:30 in-target: 
Nov 12 05:32:30 in-target: Reading state information...
Nov 12 05:32:30 in-target: 
Nov 12 05:32:30 in-target: openssh-server is already the newest version.
Nov 12 05:32:30 in-target: Some packages could not be installed. This may 
mean that you have
Nov 12 05:32:30 in-target: requested an impossible situation or if you are 
using the unstable
Nov 12 05:32:30 in-target: distribution that some required packages have not 
yet been created
Nov 12 05:32:30 in-target: or been moved out of Incoming.
Nov 12 05:32:30 in-target: The following information may help to resolve the 
situation:
Nov 12 05:32:30 in-target: 
Nov 12 05:32:30 in-target: The following packages have unmet dependencies:
Nov 12 05:32:30 in-target:  mariadb-galera-server : Depends: 
mariadb-galera-server-5.5 (= 5.5.33a+maria-1~raring) but it is not going to 
be installed

Mmmm, great, not going to be installed. May I ask why not? No, I’ll just drop to a shell and do it myself then.

Nov 12 05:32:30 in-target: E: Unable to correct problems, you have held 
broken packages.

Now, this is probably one of my most hated things in computing: when a software package accuses YOU of doing something that you haven’t. Excuse me… I have held broken packages? I simply performed a fresh install, then told you to do an install!

So let’s have a closer look.

Nov 12 05:32:30 main-menu[20801]: WARNING **: Configuring 'pkgsel' failed 
with error code 100
Nov 12 05:32:30 main-menu[20801]: WARNING **: Menu item 'pkgsel' failed.
Nov 12 05:37:38 main-menu[20801]: INFO: Modifying debconf priority limit from 
'high' to 'medium'
Nov 12 05:37:38 debconf: Setting debconf/priority to medium
Nov 12 05:37:38 main-menu[20801]: DEBUG: resolver (ext2-modules): package 
doesn't exist (ignored)
Nov 12 05:37:40 main-menu[20801]: INFO: Menu item 'di-utils-shell' selected
~ # chroot /target
chroot: can't execute '/bin/network-console': No such file or directory
~ # chroot /target bin/bash

We give it a shot ourselves to see the error more clearly.

root@test-mgmt0:/# apt-get install mariadb-galera-server
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 mariadb-galera-server : Depends: mariadb-galera-server-5.5 (= 
5.5.33a+maria-1~raring) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

Fine, so we’ll try installing that instead then.

root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 mariadb-galera-server-5.5 : Depends: mariadb-client-5.5 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
                             Depends: libmariadbclient18 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
                             PreDepends: mariadb-common but it is not going 
to be installed
E: Unable to correct problems, you have held broken packages.

Okay, closer, so we need to install those too. But hang on, isn’t it apt‘s responsibility to know this stuff? (Which it clearly does.)

Also note we don’t get told why it isn’t going to be installed. It refuses to install the packages, “just because”. No reason given.
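For what it’s worth, the closest you can get to a reason is to ask apt which candidate versions it is juggling, or to switch on the dependency resolver’s debug output:

root@test-mgmt0:/# apt-cache policy mysql-common libmysqlclient18
root@test-mgmt0:/# apt-get -o Debug::pkgProblemResolver=yes install mariadb-galera-server

Neither is exactly advertised, and the latter’s output is hardly friendly, but it should at least point at the packages that are fighting.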

We try adding in the deps to our list.

root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5 
mariadb-client-5.5
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 mariadb-client-5.5 : Depends: libdbd-mysql-perl (>= 1.2202) but it is not 
going to be installed
                      Depends: mariadb-common but it is not going to be 
installed
                      Depends: libmariadbclient18 (>= 5.5.33a+maria-1~raring) 
but it is not going to be installed
                      Depends: mariadb-client-core-5.5 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
 mariadb-galera-server-5.5 : Depends: libmariadbclient18 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
                             PreDepends: mariadb-common but it is not going 
to be installed
E: Unable to correct problems, you have held broken packages.

Okay, some more deps, we’ll add those…

root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5 
mariadb-client-5.5 libmariadbclient18
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 libmariadbclient18 : Depends: mariadb-common but it is not going to be 
installed
                      Depends: libmysqlclient18 (= 5.5.33a+maria-1~raring) 
but it is not going to be installed
 mariadb-client-5.5 : Depends: libdbd-mysql-perl (>= 1.2202) but it is not 
going to be installed
                      Depends: mariadb-common but it is not going to be 
installed
                      Depends: mariadb-client-core-5.5 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
 mariadb-galera-server-5.5 : PreDepends: mariadb-common but it is not going 
to be installed
E: Unable to correct problems, you have held broken packages.

Wash-rinse-repeat!

root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5 
mariadb-client-5.5 libmariadbclient18 mariadb-common
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 libmariadbclient18 : Depends: libmysqlclient18 (= 5.5.33a+maria-1~raring) 
but it is not going to be installed
 mariadb-client-5.5 : Depends: libdbd-mysql-perl (>= 1.2202) but it is not 
going to be installed
 mariadb-common : Depends: mysql-common but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5 
mariadb-client-5.5 libmariadbclient18 mariadb-common libdbd-mysql-perl 
mysql-common
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 libmariadbclient18 : Depends: libmysqlclient18 (= 5.5.33a+maria-1~raring) 
but 5.5.34-0ubuntu0.13.04.1 is to be installed
 mariadb-client-5.5 : Depends: mariadb-client-core-5.5 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
 mysql-common : Breaks: mysql-client-5.1
                Breaks: mysql-server-core-5.1
E: Unable to correct problems, you have held broken packages.

Aha, so there’s a newer version in the Ubuntu repository that’s overriding ours. Brilliant. Ohh, and there’s a mysql-client binary too, but it won’t tell me what version it’s trying for.

Looking in the repository myself I spot a package named mysql-common_5.5.33a+maria-1~raring_all.deb. That is likely our culprit, so I try version 5.5.33a+maria-1~raring.

root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5 
mariadb-client-5.5 libmariadbclient18 mariadb-common libdbd-mysql-perl 
mysql-common=5.5.33a+maria-1~raring libmysqlclient18=5.5.33a+maria-1~raring 
mariadb-client-core-5.5
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following extra packages will be installed:
  galera libaio1 libdbi-perl libhtml-template-perl libnet-daemon-perl 
libplrpc-perl

Bingo!
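In hindsight, pinning the MariaDB repository above the Ubuntu archive would probably have let apt work this out by itself.  Something along these lines in /etc/apt/preferences.d/mariadb might do it, though it’s untested from inside the installer environment:

Package: *
Pin: origin mirror.aarnet.edu.au
Pin-Priority: 1001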

So, for those wanting to pre-seed MariaDB Cluster 5.5, I used the following in my preseed file:

# MariaDB 5.5 repository list - created 2013-11-12 05:20 UTC
# http://mariadb.org/mariadb/repositories/
d-i apt-setup/local3/repository string \
        deb http://mirror.aarnet.edu.au/pub/MariaDB/repo/5.5/ubuntu raring main
d-i apt-setup/local3/comment string \
        "MariaDB repository"
d-i pkgsel/include string mariadb-galera-server-5.5 \
        mariadb-client-5.5 libmariadbclient18 mariadb-common \
        libdbd-mysql-perl mysql-common=5.5.33a+maria-1~raring \
        libmysqlclient18=5.5.33a+maria-1~raring mariadb-client-core-5.5 \
        galera

# For unattended installation, we set the password here
mysql-server mysql-server/root_password password DatabaseRootPassword
mysql-server mysql-server/root_password_again password DatabaseRootPassword

So yeah, next time someone mentions this:

Gentoo: Increasing blood pressure since 1999.

it doesn’t just apply to Gentoo!

OpenStack: My foray into cloud computing

I’ve been working with VRT Systems for a few years now. Originally brought in as a software engineer, my role shifted to include network administration duties.

This of course does not faze me; I’ve done network administration work before for charities.  There are some small differences: for example, back then it was a single do-everything box running Gentoo, hosting a Samba-based NT domain for about 5 Windows XP workstations; now it’s about 20 Windows 7 workstations, a Samba-based NT domain backed by LDAP, and a number of servers.

Part of this has been to move our aging infrastructure to a more modern “private cloud” setup.  In the following series, I plan to detail my notes on what I’ve learned through this process, so that others may benefit from my insight.  At this stage, I don’t have all the answers, and there are some things I may have wrong below.

Planning

The first stage with any such network development (this goes for “cloud”-like and traditional structures) is to consider how we want the network to operate, how it is going to be managed, and what skills we need.

Both my manager and I are Unix-oriented people.  In my case, I’ll be honest: I have a definite bias towards open source, and I’ll try to assess a solution on technical merit rather than via glossy brochures.

After looking at some commercial solutions, my manager more or less came to the conclusion that a lot of these highly expensive servers are not so magical: they are fundamentally just standard desktops in a small form factor.  While we could buy a whole heap of 1U rack servers, we might be better served by using more standard hardware.

The plan is to build a cluster of standard boxes, in as small a form factor as practical, which would be managed at a higher level for load balancing and redundancy.

Hardware: first attempt

One key factor we wanted to reduce was power consumption.  Our existing rack of hardware chews about 1.5kW.  Since we want to run a lot of virtual machines, we want to make them as efficient as possible.  We wanted a small building block that would handle a small handful of VMs, and store data across multiple nodes for redundancy.

After some research, we wound up with our first attempt at a compute node:

Motherboard: Intel DQ77KB Mini ITX
CPU: Intel Core i3-3220T 2.8GHz Dual-Core
RAM: 8GB SODIMM
Storage: Intel 520S 240GB SSD
Networking: Onboard dual gigabit for cluster, PCIe Realtek RTL8168 adaptor for client-facing network

The plan is that we’d have many of these, and they would pool their storage in a redundant fashion.  The two on-board NICs would be bonded together using LACP and would form a back-end storage network for the nodes to share data.  The one PCIe card would be the “public” face of the cluster and would connect it to the outside world using VLANs.
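For the record, the bonding side of this is nothing exotic; on Ubuntu of that vintage it’s the ifenslave package plus a stanza along these lines in /etc/network/interfaces (interface names and addressing below are illustrative only):

# LACP bond of the two on-board NICs for the storage network (sketch)
auto bond0
iface bond0 inet static
        address 192.168.100.11
        netmask 255.255.255.0
        bond-slaves eth0 eth1
        bond-mode 802.3ad
        bond-miimon 100
        bond-lacp-rate fast

…with a matching LACP channel group configured on the switch ports.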

For the OS, we threw on Ubuntu 12.04 LTS AMD64, and we ran the KVM hypervisor. We then tried throwing this on one of our power meters to see how much power the thing drew. At first my manager asked if the thing was even turned on … it was idling at 10W.

I loaded it up with a few virtual machines; eventually I had 6 VMs going on the thing, ranging across Linux, Windows 2000, Windows XP and a Windows 2008R2 P2V image for one of our customer projects.

The CPU load sat at about 6.0, and the power consumption did not budge above 30W. Our existing boxes drew 300W, so theoretically we could run 10 of these for just one of our old servers.

Management software

Running QEMU VMs from bash scripts is all very well, but in this case we need to be able to give non-technical users access to a subset of the cluster for projects.  I hardly expect them to write bash scripts to fire up KVM over SSH.
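To give an idea of what that looks like, a single hand-rolled VM boils down to an invocation not unlike the following (the name, paths, sizes and bridge are placeholders):

#!/bin/sh
# Hand-rolled VM: 2 vCPUs, 2GB RAM, a virtio disk, a virtio NIC bridged
# onto br0, and the console exposed on VNC display :1.
kvm -name testvm -m 2048 -smp 2 \
        -drive file=/var/lib/vms/testvm.img,if=virtio \
        -netdev bridge,id=net0,br=br0 \
        -device virtio-net-pci,netdev=net0 \
        -vnc :1

Workable for one or two machines; not something to hand to a project manager.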

We considered a few options: Ganeti, OpenNebula and OpenStack.

Ganeti looked good, but the lack of a template system and media library let it down for us, and OpenNebula proved a bit fiddly as well.  OpenStack is a big behemoth, however, and will take quite a bit of research.

Storage

One factor that stood out like a sore thumb: our initial infrastructure was going to have just compute nodes, with shared storage between them.  There were a couple of options for doing this, such as having the nodes in pairs with DRBD, or using Ceph or Sheepdog… but by far the most common approach was to have a storage back-end on a SAN.

SANs get very expensive very quickly.  Nice hardware, but overkill and over budget.  We figured we should plan for that eventuality, should the need arise, but it’d be a later addition.  We don’t need blistering speed; if we can sustain 160Mbps throughput, that’d probably be fine for most things.

Reading the literature, Ceph looked far and away the best choice, but it had a catch: you can’t run Ceph server daemons and Ceph in-kernel clients on the same host.  Doing so runs the risk of a deadlock, in much the same manner as NFS does when you mount from localhost.

OpenStack actually has 3 types of storage:

  • Ephemeral storage
  • Block storage
  • Image storage

Ephemeral storage is specific to a given virtual machine.  It often lives on the compute node with the VM, or on a back-end storage system, and stores data temporarily for the life of a virtual machine instance.  When a VM instance is created, new copies of ephemeral block devices are created from images stored in image storage.  Once the virtual machine is terminated, these ephemeral block devices are deleted.

Block storage is the persistent storage for a given VM.  Say you were running a mail server… your OS and configuration might exist on an ephemeral device, but your mail would sit on a block device.

Image storage is simply raw images of block devices.  Image storage cannot be mounted as a block device directly; rather, the storage area is used as a repository which is read from when creating the other two types of storage.

Ephemeral storage in OpenStack is managed by the compute node itself, often using LVM on a local block device.  There is no redundancy as it’s considered to be temporary data only.

For block storage, OpenStack provides a service called cinder.  This, at its heart, seems to use LVM as well, and exports the block devices over iSCSI.
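The LVM/iSCSI arrangement is driven by a handful of settings in cinder.conf; roughly speaking it looks like the excerpt below (the volume group name is the conventional default, the IP address is a placeholder):

# cinder.conf excerpt (sketch only)
[DEFAULT]
volume_driver = cinder.volume.drivers.lvm.LVMISCSIDriver
volume_group = cinder-volumes
iscsi_helper = tgtadm
iscsi_ip_address = 10.0.0.10

Each volume then becomes a logical volume in that group, exported as an iSCSI target for the compute node to attach.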

For image storage, OpenStack has a redundant storage system called swift.  The basis for this seems to be rsync, with a service called swift-proxy providing a REST interface over HTTP.  swift-proxy is very network-intensive, and benefits from high-speed networking hardware (e.g. 10Gbps Ethernet).

Hardware: second attempt

Having researched how storage works in OpenStack somewhat, it became clear that one single building block would not do.  There would in fact be two other types of node: storage nodes, and management nodes.

The storage nodes would contain largish spinning disks, with software maintaining copies and load balancing between all nodes.

The management nodes would contain the high-speed networking, and would provide services such as Ceph monitors (if we use Ceph), swift-proxy and other core functions.  RabbitMQ and the core database would run here for example.

Without the need for big storage, the compute nodes could be downsized in disk, and expanded in RAM.  So we now had a network that looked like this:

Compute node:
Motherboard: Intel DQ77KB Mini ITX
CPU: Intel Core i3-3220T 2.8GHz Dual-Core
RAM: 2*8GB SODIMM
Storage: Intel 520S 60GB SSD
Networking: Onboard dual gigabit for cluster, PCIe Realtek RTL8168 adaptor for client-facing network

Management node:
Motherboard: Intel DQ77MH Micro ATX
CPU: Intel Core i3-3220T 2.8GHz Dual-Core
RAM: 2*4GB DIMM
Storage: Intel 520S 60GB SSD
Networking: Onboard dual gigabit for management, PCIe 10GbE for cluster communications

Storage node:
Motherboard: Intel DQ77MH Micro ATX
CPU: Intel Core i3-3220T 2.8GHz Dual-Core
RAM: 2*4GB DIMM
Storage: Intel 520S 60GB SSD for OS, 2*Seagate ST3000VX000-1CU1 3TB HDDs for data
Networking: Onboard dual gigabit for cluster, PCIe Realtek RTL8168 adaptor for management

The management and storage nodes are slightly tweaked versions of what we use for compute nodes. The motherboard is basically the same chipset, but capable of taking larger PCIe cards and using a standard ATX power supply.

Since we’re not storing much on the compute nodes, we’ve gone for 60GB SSDs rather than 240GB SSDs to cut the cost down a little. We might have to look at 120GB SSDs in newer nodes, or maybe look at other options, as Intel seem to have discontinued the 60GB 520S … bless them! The Intel 520S SSDs were chosen due to the 5-year warranty offered.

The management and storage nodes, rather than going into small Mini-ITX media-centre style cases, are put in larger 2U rackmount cases. These cases have room for 4 HDDs, in theory.

Deployment

For testing purposes, we got two of each node type.  This lets us test what would happen if a node goes belly-up (by yanking its power), and test load balancing when things are working properly.

We haven’t bought the 10GbE cards at this stage, as we’re not sure exactly which ones to get (we have a Cisco SG500X switch to plug them into) and they’re expensive.

The final cluster will have at least 3 storage nodes, 3 management nodes and maybe as many as 16 compute nodes. I say at least 3 storage nodes — in buying the test hardware, I accidentally ordered 7 cases, and so we might decide to build an extra storage node.

Each of those gives us 6TB of storage, and the production plan is to load balance with a replica on at least 3 nodes… so we can survive any two going belly up. The disks also push close to 800Mbps throughput, so with 3 nodes serving up data, that should be enough to saturate the dual-gigabit link on the compute node. 4 nodes would give us 8TB of effective storage.

With so many nodes though, one problem remains: deploying the configuration and managing it all.  We’re using Ubuntu as our base platform, and so it makes sense to tap into their technologies for deployment.

We’ll be looking to use Ubuntu Cloud and Juju to manage the deployment.

Ubuntu Cloud itself is a packaged version of OpenStack.  The components of OpenStack are deployed with Juju.  Juju itself can deploy services either to “public clouds” like Amazon AWS, or to one’s own private cluster using Ubuntu MAAS (Metal As A Service).

Metal As a Service is basically a deployment system which network-boots client machines and automatically installs and configures Ubuntu on them.

The underlying technology is based on a few components: the dnsmasq DHCP/DNS server and the tftp-hpa TFTP server, with the configuration served up to the installer via a web service API.  There’s a web interface for managing it all.  Once machines are installed, you then deploy services onto them using Juju (the word juju apparently translates to “magic”).
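The workflow, as I understand it so far, is that Juju asks MAAS for machines and then drops charms onto them; something like the following (charm names are from the charm store, shown purely as an illustration rather than a recipe):

juju bootstrap
juju deploy mysql
juju deploy rabbitmq-server
juju deploy keystone
juju add-relation keystone mysql
juju status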

Further research

So, having worked out what hardware will likely be needed, I need to research a few things.

Firstly, the storage mechanism: we can either go with the pure OpenStack approach, with cinder managing LVM-based storage and exporting it over iSCSI, or we can get cinder to manage a Ceph back-end storage cluster.  This decision has not yet been made.  My two biggest concerns with cinder are:

  • Does cinder manage multiple replicas of block storage?
  • Does cinder try to load-balance between replicas?

With image storage, if we use Ceph, we have two choices.  We can either:

  • Install Swift on the storage nodes, partition the drives and use some of the storage for Swift, and the rest for Ceph… with Swift-proxy on the management nodes.
  • Install Rados Gateway on the management nodes in place of Swift

But which is the better approach?  My understanding is that Ceph doesn’t fully integrate into the OpenStack identity service (called keystone).  I need to find out if this matters much, or whether splitting storage between Swift and Ceph might be better.

Metal As a Service seems great in concept.  I’ve been researching OpenStack and Ceph for a few months now (with numerous interruptions), and I’m starting to get a picture as to how it all fits together.  Now the next step is to understand MAAS and Juju.  I don’t mind magic in entertainment, but I do not like it in my systems.  So my first step will be to get to understand MAAS and Juju on a low level.

Crucially, I want to figure out how one customises the image provided by MAAS… in particular, making sure it deploys to the 60GB SSD on each node, and not just the first block device it sees.

The storage nodes have their two 6Gbps SATA ports connected to the 3TB HDDs for performance, making them visible as /dev/sda and /dev/sdb; MAAS needs to understand that the disk it should deploy to is called /dev/sdc in this case.  I’d also prefer it to use XFS rather than EXT4, and a user called something other than “ubuntu”.  These are things I’d like to work out how to configure.
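In plain debian-installer terms, the sort of overrides I’m after would look like the fragment below; whether (and how) MAAS lets you inject these into its generated preseed is exactly what I need to find out.  The username is just a placeholder:

# d-i preseed fragments (sketch): target disk, filesystem and default user
d-i partman-auto/disk string /dev/sdc
d-i partman/default_filesystem string xfs
d-i passwd/username string someuser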

As for Juju, I need to work out exactly what it does when it “bootstraps” itself.  When I tried it last, it randomly picked a compute node.  I’d be happier if it deployed itself to the management node I ran it from.  I also need to figure out how it picks out nodes and deploys the application.  My quick testing with it had me asking it to deploy all the OpenStack components, only to have it sit there doing nothing… so clearly I missed something in the docs.  How is it supposed to work?  I’ll need to find out.  It certainly isn’t this simple.