Sep 29 2014
 

Well, between Heartbleed and Shellshock, it’s been a busy year so far for security vulnerabilities in open-source projects.  Not that those have been the only two bugs; they’re just two high-profile ones that are getting a lot of media attention.

Now, a number of us do take sheer delight in pointing and laughing when one of the big boys, whether they be based in Redmond or California, makes a security balls-up on a big scale.  After all, people pay big dollars to use some of that software, and many are dependent on it for their livelihoods.

The question does get raised though: which do you trust more?  A piece of software whose code is a complete secret, or a piece of software anyone can audit?  Some argue the former, because anyone can find the holes in the latter and exploit them.  Some argue the latter, since anyone can find the holes and fix them.  Not being able to see the code doesn’t guarantee a lack of security issues, however, and these last two headline-making bugs are definitely evidence that having the code isn’t a guarantee of a bug-free utopia.

There is no guarantee either way.

I’ve seen both open-source systems and high-end commercial systems perform well, and I’ve seen both fail dismally.  Bad code is bad code, no matter what the license, and even having the source available doesn’t mean you can fix it: first one must be able to understand what its intent is.  Information Technology in particular seems to attract the technologically inept but socially capable types who are able to talk their way into nearly any position, and so you wind up with the monstrosities you might see on The Daily WTF.  These same people lurk amongst open-source circles too, and then there are those who just make an honest mistake.  Security is hard, and it can be easy to overlook a possible hole.

I run Gentoo here, and have done so since 2004 (damn, 10 years already, but I digress…).  I’ve been building my own stage 3 tarballs from scratch since 2010.  In July 2010 I bought my current desktop, a 6-core AMD Phenom machine, and combined with the 512Kbps ADSL I had at the time, it was faster for me to compile stage 3 tarballs for the various systems (i386, AMD64 and about 6 different MIPS builds) myself than to download them.  If I wanted an up-to-date stage 3, I just took my last build, ran it through Gentoo Catalyst, and out came a freshly built tarball.

I still obtain my operating systems that way.  Even though I’ve upgraded the ADSL, I still use the same scripts that used to produce the official Gentoo/MIPS media.
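
For the curious, the Catalyst side of that boils down to a snapshot of the portage tree plus a spec file seeded from the previous build.  A rough sketch only; the dates, profile and paths here are illustrative, not my actual build configuration:

# Take a dated snapshot of the portage tree so the build is reproducible:
catalyst -s 20140929

# A minimal stage3 spec (stage3-amd64.spec), pointing at the previous stage 3:
#   subarch: amd64
#   target: stage3
#   version_stamp: 20140929
#   rel_type: default
#   profile: default/linux/amd64/13.0
#   snapshot: 20140929
#   source_subpath: default/stage3-amd64-20140825

# Then build it:
catalyst -f stage3-amd64.spec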

This means I could audit every piece of software that forms my core system.  I have the source code there, all of it.  Not many Linux users have this; most have it at arm’s reach (i.e. an apt-get source ${PACKAGE} away), or at worst, a polite email/letter to their supplier away (e.g. Netcomm will supply sources for their routers for a ~AU$10 fee).  I, however, already have it.

So have I done any audits?  No.  Ultimately I just blindly trust what comes down the wire, and to some, that is arguably no better than blindly trusting what Apple and Microsoft produce.

Those who say that do have a point.  I didn’t pick up on Heartbleed, nor on Shellshock, and I probably haven’t spotted what will become the next headline-grabbing bug.  There’s a lot of source code that goes into a GNU/Linux system, and if I were to sit there and audit it all myself, it’d take me a lifetime.  It’d cost me a fortune to pay a team to analyse it.

However, I at least have the choice of auditing parts of it.  I’ll never be able to audit my copies of Microsoft Windows, or the one copy of Apple MacOS X I have.  For those, I’m reliant on the upstream vendors to audit, test and patch their code; I cannot do it myself.

For the open-source software though, it’s ultimately my choice.  I can do it myself, or I can pay someone to do it; I’ve simply chosen not to at this time.  This is an important distinction that the anti-open-source camp seem to forget.

As for the quality factor: well, I’ve spent plenty of time arguing with some piece of proprietary software, having trouble getting it to do something I need it to do, or fixing up some cock-up caused by a bug in said software.  One option: I spend hours arguing with it to make it work, and have to pay good money for the privilege.  The other: the money stays in my pocket, and in theory I can re-build the software to make it work if needed.  One will place arbitrary restrictions on how I use the software as an end user, forcing me to spend money on more expensive licenses; the other will happily let me keep pushing it until I hit my system’s technical limits.

Neither offers me any kind of warranty regarding losses I might suffer as a result of their software (I’m sorry, but US$5.00 is as good as worthless), so the money might as well stay in my pocket while I learn something about the software I use.

I remain in control of my destiny that way, and that is the way I’d like to keep it.

Apr 03 2014
 

Well, lately I’ve been doing some development work with OpenNebula.

We’ve recently deployed a 3-node Ceph cluster which we intend to use as our back-end storage for numerous things: among them being VM storage.  Initially I thought the throughput would be “good enough”, 3 hosts each with gigabit links supplying VM hosts with gigabit backhaul links.

It’d be comparable to typical HDDs, or so I thought.  What I didn’t count on was the random-read latency introduced by network round-trips and protocol overheads.  When I tried Ceph with just libvirt, things weren’t too bad: I was close to saturating my 1Gbps link.  Put two VMs on and again, things hummed along.  Not blisteringly fast, mind you, but reasonable.

I got OpenNebula talking to it easily enough.  We’re running the stable version: 4.4.  There are a few things I learned about the way OpenNebula uses Ceph:

  • OpenNebula uses v1-format RBDs (the Ceph default actually)
  • Since v1 RBDs don’t support copy-on-write clones, instance images are copied in full (the v2 alternative is sketched after this list).
  • Copying a 160GB image in triplicate over gigabit Ethernet takes a while, and brought our little cluster to a crawl.
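
For contrast, the newer v2 image format supports layering, so deploying from a base image becomes a cheap copy-on-write clone rather than a full copy.  Roughly, with made-up pool and image names:

# Create a format-2 base image, snapshot it and protect the snapshot
# (older rbd releases spell the option --format 2):
rbd create --image-format 2 --size 163840 one/base-image
rbd snap create one/base-image@golden
rbd snap protect one/base-image@golden

# Each VM instance then gets a thin clone instead of a 160GB copy:
rbd clone one/base-image@golden one/vm-42-disk-0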

Naturally, we’re looking into beefing up the network links and CPUs on the storage nodes, but I’ve also been looking at ways to reduce the load on the back-end cluster.  One is through caching.  There are a couple of projects out there which allow you to combine two types of storage, using a smaller, faster block device to act as a cache for a larger, slower device.  Two which immediately come to mind: FlashCache and bcache.

bcache is on the TODO list; it has a few more knobs and dials to play with, and it can share a single cache device between multiple back-end devices, so it might yet be worth investing time in.

Sébastien Han posted a guide on doing RBD caching using FlashCache, and so my work has largely been based on his initial work.  I’ve been hacking up an OpenNebula datastore management and transfer management driver which harnesses FlashCache and the newer v2 RBD format to produce a flexible storage subsystem for OpenNebula.

The basic concept is simple enough (a rough command-level sketch follows the list):

  • The Logical Volume Manager is used to allocate slices of an SSD to use as cache for back-end RBDs.
  • For non-persistent images, a new copy-on-write clone of the base image is created
  • A flashcache composite device is produced using the LVM volume as cache and the RBD as the backend
  • KVM/QEMU/Xen uses this composite device like a regular disk
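
Strung together on a VM host, the steps look roughly like this.  The pool, volume group and device names are purely illustrative, not the driver’s actual naming scheme:

# 1. Copy-on-write clone of the (protected) base image for a non-persistent VM:
rbd clone one/base-image@golden one/vm-42-disk-0

# 2. Map the clone on the VM host (it typically appears under /dev/rbd/<pool>/<image>):
rbd map one/vm-42-disk-0

# 3. Carve an 8GB cache slice out of the SSD volume group:
lvcreate -L 8G -n cache-vm-42 ssd

# 4. Glue the two together with FlashCache (writethrough mode shown here):
flashcache_create -p thru vm-42-disk-0 /dev/ssd/cache-vm-42 /dev/rbd/one/vm-42-disk-0

# 5. The VM is then pointed at the composite device, /dev/mapper/vm-42-disk-0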

The initial attempt worked well for Linux VMs: read performance would initially be between 20MB/sec and 120MB/sec depending on network/storage cluster load, and subsequent reads would then exceed 240MB/sec.  Write performance was limited to what the cluster could do, unless you used writeback mode, at which point speed picked up dramatically.

Windows proved to be a puzzle: it seems some Windows images have an odd way of accessing the disk, and this impacts performance badly.  In many cases the images were sparse in nature, with most of the content being in the first 8GB.  So I made sure to allocate 8GB chunks of my SSD, and performed what I call pre-caching: seeding the cache with the initial 8GB (or however big the SSD partition is) of the image.

That picks up the initial boot performance by a big margin, at the cost of the image taking a little longer to deploy in the PROLOG stage.
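
One way to do that seeding, assuming the composite device from the sketch above, is simply to read the first chunk of the image back through the cache and throw it away:

# Warm the cache with the first 8GB of the image (4MB blocks x 2048 = 8GiB):
dd if=/dev/mapper/vm-42-disk-0 of=/dev/null bs=4M count=2048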

For those who are interested, some early code is available via git.

bcache might be worth a look-in as it has read-ahead caching; I haven’t tried it yet.  I’d like to split the caching subsystem out and have cache drivers, much like we have for datastore managers and transfer managers.  The same concept would work for iSCSI/CLVM storage or Gluster storage just as it does for Ceph.

Feb 25 2014
 

Hi all,

This is more a note to myself on how to configure stgt to talk to a Ceph RBD. Everyone seems to recommend patching tgt-admin; this is simply not necessary. The only challenge is the lax way that tgt-admin parses the configuration file.

My scenario: VMWare ESXi virtual machine host, needing to use storage on Ceph.
I have 3 storage nodes running ceph-mon and ceph-osd daemons. They also have a version of tgtd that supports Ceph. (See the ceph-extras repository.)
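
The backing RBD needs to exist before tgtd can export it; a minimal example (pool and image names match the config below, size in MB):

rbd create --size 102400 pool-name/rbd-name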

The targets go in the /etc/tgt/conf.d/${CLIENT}.conf configuration file. (I’m putting all the targets for ${CLIENT} here.)

# Target naming: iqn.yyyy-mm.backwards.domain.your:client.target
# where yyyy-mm: year and month of target creation
# backwards.domain.your: Your domain name; written backwards.
# client.target: A name for the target, since it's for one client here I name it
# as the client's host name then give the rest some descriptive title.
<target iqn.2014-02.domain.my:my-client.my-target-name>
    driver iscsi
    bs-type rbd
    backing-store pool-name/rbd-name
    initiator-address ip.of.my.client
</target>

For better or worse, I run the tgt daemon on the Ceph nodes themselves. Multipath I’m not sure about at this point; I’ve set up the targets on all of my Ceph nodes so I can connect to any of them, but I have not tested this yet.

To enable that target:

# tgt-admin -v -e

Then to verify:

# tgt-admin -s

You should see your LUNs listed.

Nov 12 2013
 

Hi all,

It’s not often I have a whinge about something, but this problem has been bugging me more than somewhat of late.  I’m in the process of setting up an OpenStack cluster at work.  As the underlying OS we’ve chosen Ubuntu Linux, which is fine: Ubuntu is a stable, reliable and well-supported platform.

One of my pet peeves, though, is when a package manager decides to get lazy.  Those of us who have been around the Linux scene for a while have probably discovered RPM dependency hell… and the smug Debian users who tell us that Debian doesn’t do this.

Ho ho, errm… no, when APT wants to go into dummy mode, it does so with style:

Nov 12 05:32:27 in-target: Setting up python3-update-manager (1:0.186.2) ...
Nov 12 05:32:27 in-target: Setting up python3-distupgrade (1:0.192.13) ...
Nov 12 05:32:27 in-target: Setting up ubuntu-release-upgrader-core 
(1:0.192.13) ...
Nov 12 05:32:27 in-target: Setting up update-manager-core (1:0.186.2) ...
Nov 12 05:32:27 in-target: Processing triggers for libc-bin ...
Nov 12 05:32:27 in-target: ldconfig deferred processing now taking place
Nov 12 05:32:27 in-target: Processing triggers for initramfs-tools ...
Nov 12 05:32:27 in-target: Processing triggers for ca-certificates ...
Nov 12 05:32:27 in-target: Updating certificates in /etc/ssl/certs... 
Nov 12 05:32:29 in-target: 158 added, 0 removed; done.
Nov 12 05:32:29 in-target: Running hooks in /etc/ca-certificates/update.d....
Nov 12 05:32:29 in-target: done.
Nov 12 05:32:29 in-target: Processing triggers for sgml-base ...
Nov 12 05:32:29 pkgsel: installing additional packages
Nov 12 05:32:29 in-target: Reading package lists...
Nov 12 05:32:29 in-target: 
Nov 12 05:32:29 in-target: Building dependency tree...
Nov 12 05:32:30 in-target: 
Nov 12 05:32:30 in-target: Reading state information...
Nov 12 05:32:30 in-target: 
Nov 12 05:32:30 in-target: openssh-server is already the newest version.
Nov 12 05:32:30 in-target: Some packages could not be installed. This may 
mean that you have
Nov 12 05:32:30 in-target: requested an impossible situation or if you are 
using the unstable
Nov 12 05:32:30 in-target: distribution that some required packages have not 
yet been created
Nov 12 05:32:30 in-target: or been moved out of Incoming.
Nov 12 05:32:30 in-target: The following information may help to resolve the 
situation:
Nov 12 05:32:30 in-target: 
Nov 12 05:32:30 in-target: The following packages have unmet dependencies:
Nov 12 05:32:30 in-target:  mariadb-galera-server : Depends: 
mariadb-galera-server-5.5 (= 5.5.33a+maria-1~raring) but it is not going to 
be installed

Mmmm, great, not going to be installed. May I ask why not? No, I’ll just drop to a shell and do it myself then.

Nov 12 05:32:30 in-target: E: Unable to correct problems, you have held 
broken packages.

Now, this is probably one of my most hated things in computing: when a piece of software accuses YOU of doing something that you haven’t. Excuse me… I have held broken packages? I simply performed a fresh install and then told you to install a package!

So let’s have a closer look.

Nov 12 05:32:30 main-menu[20801]: WARNING **: Configuring 'pkgsel' failed 
with error code 100
Nov 12 05:32:30 main-menu[20801]: WARNING **: Menu item 'pkgsel' failed.
Nov 12 05:37:38 main-menu[20801]: INFO: Modifying debconf priority limit from 
'high' to 'medium'
Nov 12 05:37:38 debconf: Setting debconf/priority to medium
Nov 12 05:37:38 main-menu[20801]: DEBUG: resolver (ext2-modules): package 
doesn't exist (ignored)
Nov 12 05:37:40 main-menu[20801]: INFO: Menu item 'di-utils-shell' selected
~ # chroot /target
chroot: can't execute '/bin/network-console': No such file or directory
~ # chroot /target bin/bash

We give it a shot ourselves to see the error more clearly.

root@test-mgmt0:/# apt-get install mariadb-galera-server
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 mariadb-galera-server : Depends: mariadb-galera-server-5.5 (= 
5.5.33a+maria-1~raring) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

Fine, so we’ll try installing that instead then.

root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 mariadb-galera-server-5.5 : Depends: mariadb-client-5.5 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
                             Depends: libmariadbclient18 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
                             PreDepends: mariadb-common but it is not going 
to be installed
E: Unable to correct problems, you have held broken packages.

Okay, closer, so we need to install those too. But hang on, isn’t that apt‘s responsibility to know this stuff? (which it clearly does).

Also note we don’t get told why it isn’t going to be installed. It refuses to install the packages, “just because”. No reason given.
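
In hindsight, apt can at least be coaxed into showing its working.  apt-cache policy lists the candidate versions it is weighing up, and the dependency resolver has a debug switch; neither shows up in the installer log, but from a shell they are handy:

# Which versions does apt consider available, and from which repository?
apt-cache policy libmysqlclient18 mysql-common

# Make the problem resolver explain its decisions:
apt-get -o Debug::pkgProblemResolver=yes install mariadb-galera-server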

We try adding in the deps to our list.

root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5 
mariadb-client-5.5
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 mariadb-client-5.5 : Depends: libdbd-mysql-perl (>= 1.2202) but it is not 
going to be installed
                      Depends: mariadb-common but it is not going to be 
installed
                      Depends: libmariadbclient18 (>= 5.5.33a+maria-1~raring) 
but it is not going to be installed
                      Depends: mariadb-client-core-5.5 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
 mariadb-galera-server-5.5 : Depends: libmariadbclient18 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
                             PreDepends: mariadb-common but it is not going 
to be installed
E: Unable to correct problems, you have held broken packages.

Okay, some more deps, we’ll add those…

root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5 
mariadb-client-5.5 libmariadbclient18
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 libmariadbclient18 : Depends: mariadb-common but it is not going to be 
installed
                      Depends: libmysqlclient18 (= 5.5.33a+maria-1~raring) 
but it is not going to be installed
 mariadb-client-5.5 : Depends: libdbd-mysql-perl (>= 1.2202) but it is not 
going to be installed
                      Depends: mariadb-common but it is not going to be 
installed
                      Depends: mariadb-client-core-5.5 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
 mariadb-galera-server-5.5 : PreDepends: mariadb-common but it is not going 
to be installed
E: Unable to correct problems, you have held broken packages.

Wash-rinse-repeat!

root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5 
mariadb-client-5.5 libmariadbclient18 mariadb-common
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 libmariadbclient18 : Depends: libmysqlclient18 (= 5.5.33a+maria-1~raring) 
but it is not going to be installed
 mariadb-client-5.5 : Depends: libdbd-mysql-perl (>= 1.2202) but it is not 
going to be installed
 mariadb-common : Depends: mysql-common but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5 
mariadb-client-5.5 libmariadbclient18 mariadb-common libdbd-mysql-perl 
mysql-common
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 libmariadbclient18 : Depends: libmysqlclient18 (= 5.5.33a+maria-1~raring) 
but 5.5.34-0ubuntu0.13.04.1 is to be installed
 mariadb-client-5.5 : Depends: mariadb-client-core-5.5 (>= 
5.5.33a+maria-1~raring) but it is not going to be installed
 mysql-common : Breaks: mysql-client-5.1
                Breaks: mysql-server-core-5.1
E: Unable to correct problems, you have held broken packages.

Aha, so there’s a newer version in the Ubuntu repository that’s overriding ours. Brilliant. Ohh, and there’s a mysql-client binary too, but it won’t tell me what version it’s trying for.

Looking in the repository myself I spot a package named mysql-common_5.5.33a+maria-1~raring_all.deb. That is likely our culprit, so I try version 5.5.33a+maria-1~raring.

root@test-mgmt0:/# apt-get install mariadb-galera-server-5.5 
mariadb-client-5.5 libmariadbclient18 mariadb-common libdbd-mysql-perl 
mysql-common=5.5.33a+maria-1~raring libmysqlclient18=5.5.33a+maria-1~raring 
mariadb-client-core-5.5
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following extra packages will be installed:
  galera libaio1 libdbi-perl libhtml-template-perl libnet-daemon-perl 
libplrpc-perl

Bingo!
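
With hindsight, pinning the MariaDB repository above the Ubuntu archive would probably have saved the whole game of whack-a-mole, since apt would then prefer the MariaDB versions of the overlapping packages.  Something along these lines in /etc/apt/preferences.d/mariadb (the exact Pin line depends on what the repository’s Release file declares):

Package: *
Pin: origin "mirror.aarnet.edu.au"
Pin-Priority: 1001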

So, for those wanting to pre-seed MariaDB Cluster 5.5, I used the following in my preseed file:

# MariaDB 5.5 repository list - created 2013-11-12 05:20 UTC
# http://mariadb.org/mariadb/repositories/
d-i apt-setup/local3/repository string \
        deb http://mirror.aarnet.edu.au/pub/MariaDB/repo/5.5/ubuntu raring main
d-i apt-setup/local3/comment string \
        "MariaDB repository"
d-i pkgsel/include string mariadb-galera-server-5.5 \
        mariadb-client-5.5 libmariadbclient18 mariadb-common \
        libdbd-mysql-perl mysql-common=5.5.33a+maria-1~raring \
        libmysqlclient18=5.5.33a+maria-1~raring mariadb-client-core-5.5 \
        galera

# For unattended installation, we set the password here
mysql-server mysql-server/root_password select DatabaseRootPassword
mysql-server mysql-server/root_password_again select DatabaseRootPassword

So yeah, next time someone mentions this:

Gentoo: Increasing blood pressure since 1999.

it doesn’t just apply to Gentoo!

Sep 08 2013
 

Well, some might remember my time with a cheap and nasty Android tablet (some might call these “landfill Android”).  The device packaging did not once even acknowledge the fact that there was GPL’ed software on board, let alone say how one obtains the source.

I discovered it was based around the Vimicro VC0882 SoC.  It turns out that’s the same SoC used in the ViewSonic ViewPad 10e, and ViewSonic do release their kernel sources on their knowledge base.

Thank you, ViewSonic, you have just helped me greatly!  Maybe I should track down one of your tablets and buy one in appreciation.

May 12 2013
 

I’ve been working with VRT Systems for a few years now. Originally brought in as a software engineer, my role shifted to include network administration duties.

This of course does not faze me; I’ve done network administration work before for charities. There are some small differences: for example, back then it was a single do-everything box running Gentoo, hosting a Samba-based NT domain for about 5 Windows XP workstations; now it’s about 20 Windows 7 workstations, a Samba-based NT domain backed by LDAP, and a number of servers.

Part of this has been to move our aging infrastructure to a more modern “private cloud” model. In the following series, I plan to detail my notes on what I’ve learned through this process, so that others may benefit from my insight.  At this stage I don’t have all the answers, and there are some things I may have wrong below.

Planning

The first stage with any such network development (this goes for “cloud”-like and traditional structures) is to consider how we want the network to operate, how it is going to be managed, and what skills we need.

Both my manager and I are Unix-oriented people. In my case I’ll be honest — I have a definite bias towards open source, and I’ll try to assess a solution on technical merit rather than via glossy brochures.

After looking at some commercial solutions, my manager more or less came to the conclusion that a lot of these highly expensive servers are not so magical: they are fundamentally just standard desktops in a small form factor. While we could buy a whole heap of 1U rack servers, we might be better served by using more standard hardware.

The plan is to build a cluster of standard boxes, in as small a form factor as practical, which would be managed at a higher level for load balancing and redundancy.

Hardware: first attempt

One key factor we wanted to reduce was power consumption. Our existing rack of hardware chews about 1.5kW. Since we want to run a lot of virtual machines, we want them to be as efficient as possible. We wanted a small building block that would handle a small handful of VMs, storing data across multiple nodes for redundancy.

After some research, we wound up with our first attempt at a compute node:

Motherboard: Intel DQ77KB Mini ITX
CPU: Intel Core i3-3220T 2.8GHz Dual-Core
RAM: 8GB SODIMM
Storage: Intel 520S 240GB SSD
Networking: Onboard dual gigabit for cluster, PCIe Realtek RTL8168 adaptor for client-facing network

The plan is that we’d have many of these, and they would pool their storage in a redundant fashion.  The two on-board NICs would be bonded together using LACP and would form a back-end storage network for the nodes to share data (a sketch of the bonding configuration follows).  The one PCIe card would be the “public” face of the cluster and would connect it to the outside world using VLANs.
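
For reference, the LACP bond on Ubuntu ends up as a few lines in /etc/network/interfaces, with the ifenslave package providing the bonding support.  The address and interface names below are illustrative only:

auto bond0
iface bond0 inet static
    address 192.168.100.11
    netmask 255.255.255.0
    bond-slaves eth0 eth1
    bond-mode 802.3ad
    bond-miimon 100
    bond-lacp-rate 1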

For the OS, we threw on Ubuntu 12.04 LTS AMD64, and we ran the KVM hypervisor. We then tried throwing this on one of our power meters to see how much power the thing drew. At first my manager asked if the thing was even turned on … it was idling at 10W.

I loaded it up with a few virtual machines; eventually I had 6 VMs going on the thing, a mix of Linux, Windows 2000, Windows XP and a Windows 2008R2 P2V image for one of our customer projects.

The CPU load sat at about 6.0, and the power consumption did not budge above 30W. Our existing boxes drew 300W, so theoretically we could run 10 of these in place of just one of our old servers.

Management software

Running QEMU VMs from bash scripts is all very well, but in this case we need to be able to give non-technical users access to a subset of the cluster for their projects.  I hardly expect them to write bash scripts to fire up KVM over SSH.

We considered a few options: Ganeti, OpenNebula and OpenStack.

Ganeti looked good, but the lack of a template system and media library let it down for us, and OpenNebula proved a bit fiddly as well.  OpenStack is a big behemoth, however, and will take quite a bit of research.

Storage

One factor stood out like a sore thumb: our initial infrastructure was going to be all compute nodes, with shared storage between them.  There were a couple of options for doing this, such as having the nodes in pairs with DRBD, or using Ceph or Sheepdog, etc… but by far the most common approach was to have a storage back-end on a SAN.

SANs get very expensive very quickly.  Nice hardware, but overkill and over budget.  We figured we should plan for that eventuality, should the need arise, but it’d be a later addition.  We don’t need blistering speed, if we can sustain 160Mbps throughput, that’d probably be fine for most things.

Reading the literature, Ceph looked by far and away the best choice, but it had a catch — you can’t run Ceph server daemons and Ceph in-kernel clients on the same host.  Doing so, you run the risk of a deadlock, in much the same manner as NFS does when you mount from localhost.

OpenStack actually has 3 types of storage:

  • Ephemeral storage
  • Block storage
  • Image storage

Ephemeral storage is specific to a given virtual machine.  It often lives on the compute node with the VM, or on a back-end storage system, and stores data temporarily for the life of a virtual machine instance.  When a VM instance is created, new copies of ephemeral block devices are created from images stored in image storage.  Once the virtual machine is terminated, these ephemeral block devices are deleted.

Block storage is the persistent storage for a given VM.  Say you were running a mail server… your OS and configuration might exist on an ephemeral device, but your mail would sit on a block device.

Image storage is simply a collection of raw images of block devices.  Image storage cannot be mounted as a block device directly; rather, the storage area is used as a repository which is read from when creating the other two types of storage.

Ephemeral storage in OpenStack is managed by the compute node itself, often using LVM on a local block device.  There is no redundancy as it’s considered to be temporary data only.

For block storage, OpenStack provides a service called cinder.  This, at its heart, seems to use LVM as well, and exports the block devices over iSCSI.
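
For the curious, the LVM/iSCSI arrangement boils down to a handful of settings in cinder.conf; something like the following (option names as per the OpenStack releases current at the time of writing, with the volume group being whatever you hand over to cinder):

volume_driver = cinder.volume.drivers.lvm.LVMISCSIDriver
volume_group = cinder-volumes
iscsi_helper = tgtadm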

For image storage, OpenStack has a redundant storage system called swift.  The basis for this seems to be rsync, with a service called swift-proxy providing a REST interface over HTTP.  swift-proxy is very network intensive, and benefits from high-speed networking hardware (e.g. 10Gbps Ethernet).

Hardware: second attempt

Having researched how storage works in OpenStack somewhat, it became clear that one single building block would not do.  There would in fact be two other types of node: storage nodes, and management nodes.

The storage nodes would contain largish spinning disks, with software maintaining copies and load balancing between all nodes.

The management nodes would contain the high-speed networking, and would provide services such as Ceph monitors (if we use Ceph), swift-proxy and other core functions.  RabbitMQ and the core database would run here for example.

Without the need for big storage, the compute nodes could be downsized in disk, and expanded in RAM.  So we now had a network that looked like this:

Compute node
Motherboard: Intel DQ77KB Mini ITX
CPU: Intel Core i3-3220T 2.8GHz Dual-Core
RAM: 2*8GB SODIMM
Storage: Intel 520S 60GB SSD
Networking: Onboard dual gigabit for cluster, PCIe Realtek RTL8168 adaptor for client-facing network

Management node
Motherboard: Intel DQ77MH Micro ATX
CPU: Intel Core i3-3220T 2.8GHz Dual-Core
RAM: 2*4GB DIMM
Storage: Intel 520S 60GB SSD
Networking: Onboard dual gigabit for management, PCIe 10GbE for cluster communications

Storage node
Motherboard: Intel DQ77MH Micro ATX
CPU: Intel Core i3-3220T 2.8GHz Dual-Core
RAM: 2*4GB DIMM
Storage: Intel 520S 60GB SSD for OS, 2*Seagate ST3000VX000-1CU1 3TB HDDs for data
Networking: Onboard dual gigabit for cluster, PCIe Realtek RTL8168 adaptor for management

The management and storage nodes are slightly tweaked versions of what we use for compute nodes. The motherboard is basically the same chipset, but capable of taking larger PCIe cards and using a standard ATX power supply.

Since we’re not storing much on the compute nodes, we’ve gone for 60GB SSDs rather than 240GB SSDs to cut the cost down a little. We might have to look at 120GB SSDs in newer nodes, or maybe look at other options, as Intel seem to have discontinued the 60GB 520S … bless them! The Intel 520S SSDs were chosen due to the 5-year warranty offered.

The management and storage nodes, rather than going into small Mini-ITX media-centre style cases, are put in larger 2U rackmount cases. These cases have room for 4 HDDs, in theory.

Deployment

For testing purposes, we got two of each node type. This allows us to test what would happen if a node went belly up (by yanking its power), and to test load balancing when things are working properly.

We haven’t bought the 10GbE cards at this stage, as we’re not sure exactly which ones to get (we have a Cisco SG500X switch to plug them into) and they’re expensive.

The final cluster will have at least 3 storage nodes, 3 management nodes and maybe as many as 16 compute nodes. I say at least 3 storage nodes — in buying the test hardware, I accidentally ordered 7 cases, and so we might decide to build an extra storage node.

Each of those gives us 6TB of storage, and the production plan is to load balance with a replica on at least 3 nodes… so we can survive any two going belly up. The disks also push close to 800Mbps throughput, so with 3 nodes serving up data, that should be enough to saturate the dual-gigabit link on the compute node. 4 nodes would give us 8TB of effective storage.

With so many nodes though, one problem remains: deploying the configuration and managing it all. We’re using Ubuntu as our base platform, and so it makes sense to tap into their technologies for deployment.

We’ll be looking to use Ubuntu Cloud and Juju to manage the deployment.

Ubuntu Cloud itself is a packaged version of OpenStack.  The components of OpenStack are deployed with Juju.  Juju itself can deploy services either to “public clouds” like Amazon AWS, or to one’s own private cluster using Ubuntu MAAS (Metal As A Service).

Metal As A Service itself is basically a deployment system which network-boots clients and automatically installs and configures Ubuntu on them.

The underlying technology is based on a few components: the dnsmasq DHCP/DNS server, the tftp-hpa TFTP server, and a web service API which serves the configuration up to the installer.  There’s a web interface for managing it all.  Once installed, you then deploy services using Juju (the word juju apparently translates to “magic”).
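
As I understand it so far, the Juju workflow is roughly the following; the charm names here are examples of what I expect to be deploying, not something I’ve tested yet:

juju bootstrap                      # claims a MAAS node and stands up the orchestration environment
juju deploy mysql                   # deploys a charm onto the next available node
juju deploy keystone
juju add-relation keystone mysql    # wires the two services together
juju status                         # shows what ended up where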

Further research

So having researched what hardware will likely be needed, I need to research a few things.

Firstly, the storage mechanism: we can either go with the pure OpenStack approach, with cinder managing LVM-based storage and exporting it over iSCSI, or we can get cinder to manage a Ceph back-end storage cluster.  This decision has not yet been made.  My two biggest concerns with cinder are:

  • Does cinder manage multiple replicas of block storage?
  • Does cinder try to load-balance between replicas?

With image storage, if we use Ceph, we have two choices.  We can either:

  • Install Swift on the storage nodes, partition the drives and use some of the storage for Swift, and the rest for Ceph… with Swift-proxy on the management nodes.
  • Install Rados Gateway on the management nodes in place of Swift

But which is the better approach?  My understanding is that Ceph doesn’t fully integrate into the OpenStack identity service (called keystone).  I need to find out if this matters much, or whether splitting storage between Swift and Ceph might be better.

Metal As a Service seems great in concept.  I’ve been researching OpenStack and Ceph for a few months now (with numerous interruptions), and I’m starting to get a picture as to how it all fits together.  Now the next step is to understand MAAS and Juju.  I don’t mind magic in entertainment, but I do not like it in my systems.  So my first step will be to get to understand MAAS and Juju on a low level.

Crucially, I want to figure out how one customises the image provided by MAAS… in particular, making sure it deploys to the 60GB SSD on each node, and not just the first block device it sees.

The storage nodes have their two 6Gbps SATA ports connected to the 3TB HDDs for performance, making them visible as /dev/sda and /dev/sdb — MAAS needs to understand that the disk it needs to deploy to is called /dev/sdc in this case.  I’d also prefer it to use XFS rather than EXT4, and a user called something other than “ubuntu”.  These are things I’d like to work out how to configure.

As for Juju, I need to work out exactly what it does when it “bootstraps” itself.  When I tried it last, it randomly picked a compute node; I’d be happier if it deployed itself to the management node I ran it from.  I also need to figure out how it picks nodes and deploys applications to them.  My quick testing had me asking it to deploy all the OpenStack components, only to have it sit there doing nothing… so clearly I missed something in the docs.  How is it supposed to work?  I’ll need to find out.  It certainly isn’t as simple as it looks.