virtualisation

Going low and slow with bochs

So, today I had a problem… I needed to solve a race condition in a test case for my workplace’s WideSky system.  The test case was meant to ensure that, if the AMQP broker crashed or was restarted, the system would re-connect and resume operations as quickly as possible.

On my desktop (an 8-core AMD Ryzen 7), the test case always passed.  On the CI server (a VM running on a dual-core Core i3), it failed.  I figured the desktop was running too quickly for the race to show up, so I needed a machine that ran more like the CI server.

Looking around, I couldn’t see any way to reliably slow down QEMU, KVM or VirtualBox… but I do remember one old project from the mid-late 90s that could: Bochs.

Bochs in action… emulating a P4 Prescott on a Ryzen 7

Turns out, far from what it could do back in 1998 when it was strictly a 386 emulator (and a slow one at that!), it now has AMD64 emulation capabilities.  Thus, I can run the software stack inside this VM, and have it throttle the CPU speed down so that, hopefully, the problem arises.

The first problem I needed to solve was getting the network running.  We have a PXE boot server which can serve up Ubuntu, no problem.  I just needed to bridge the Bochs VM onto the network somehow.

I already have bridge interfaces configured on my two physical network interfaces, and these work great with KVM.  Sadly, Bochs is rather primitive in what it supports… tap-mode networking just did not work, it complained that tap0 was not “running” even if created beforehand by iproute2, but I did find I could bind it directly to one of the enslaved network interfaces (enp36s0.200; yes, a VLAN interface).

e1000 worked for network booting, but then Ubuntu couldn’t retrieve an IP address for whatever reason. ne2k is working fine, and presently, I have the VM installing.  To make it network bootable, you need a boot ROM image, which you can download from the iPXE rom-o-matic service.  The magic PCI IDs you need are 10ec 8029 for ne2k, or (if it gets fixed) 8086 100e for e1000.

The following is my Bochs config file:

# configuration file generated by Bochs
plugin_ctrl: unmapped=1, biosdev=1, speaker=1, extfpuirq=1, parallel=1, serial=1, gameport=1, ne2k=1
config_interface: textconfig
display_library: x
debug: action=report
memory: host=2048, guest=2048
romimage: file="/usr/share/bochs/BIOS-bochs-latest", address=0x0, options=none
vgaromimage: file="/usr/share/bochs/VGABIOS-lgpl-latest"
boot: disk, network
floppy_bootsig_check: disabled=0
# no floppya
# no floppyb
ata0: enabled=1, ioaddr1=0x1f0, ioaddr2=0x3f0, irq=14
ata0-master: type=disk, path="/tmp/wstest.raw", mode=flat, cylinders=0, heads=0, spt=0, model="Generic 1234", biosdetect=auto, translation=auto
ata0-slave: type=none
ata1: enabled=1, ioaddr1=0x170, ioaddr2=0x370, irq=15
ata1-master: type=none
ata1-slave: type=none
ata2: enabled=0
ata3: enabled=0
optromimage1: file=none
optromimage2: file=none
optromimage3: file=none
optromimage4: file=none
optramimage1: file=none
optramimage2: file=none
optramimage3: file=none
optramimage4: file=none
pci: enabled=1, chipset=i440fx, slot1=ne2k, slot2=cirrus
vga: extension=cirrus, update_freq=5, realtime=1
cpu: count=1:1:1, ips=40000000, quantum=16, model=p4_prescott_celeron_336, reset_on_triple_fault=1, cpuid_limit_winnt=0, ignore_bad_msrs=1, mwait_is_nop=0
print_timestamps: enabled=0
port_e9_hack: enabled=0
private_colormap: enabled=0
clock: sync=none, time0=local, rtc_sync=0
# no cmosimage
# no loader
log: -
logprefix: %t%e%d
debug: action=ignore
info: action=report
error: action=report
panic: action=ask
keyboard: type=mf, serial_delay=250, paste_delay=100000, user_shortcut=none
mouse: type=ps2, enabled=0, toggle=ctrl+mbutton
speaker: enabled=1, mode=system
parport1: enabled=1, file=none
parport2: enabled=0
com1: enabled=1, mode=null
com2: enabled=0
com3: enabled=0
com4: enabled=0
ne2k: enabled=1, mac=fe:fd:de:ad:be:ef, ethmod=linux, ethdev=enp36s0.200, script=/bin/true, bootrom="/tmp/10ec8029.rom"

Create your hard drive image using qemu-img, then run bochs -f yourfile.cfg and it should, hopefully, work.
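If qemu-img isn’t to hand, a flat image is nothing more than a file of the right length, so it can be sketched in a few lines of Python.  This is only a sketch: the create_flat_image helper, the demo file name and the 16MiB size are made up for illustration, and it assumes Bochs works the geometry out for itself, which is what the cylinders=0, heads=0, spt=0 settings in the config ask for.

```python
import os
import tempfile


def create_flat_image(path, size_bytes):
    """Create a zero-filled raw ("flat") disk image of exactly size_bytes."""
    # Disk sectors are 512 bytes, so refuse sizes that are not whole sectors.
    if size_bytes <= 0 or size_bytes % 512 != 0:
        raise ValueError("size must be a positive multiple of 512 bytes")
    with open(path, "wb") as image:
        # truncate() extends the empty file to the requested length; the
        # added region reads back as zeros.
        image.truncate(size_bytes)
    return os.path.getsize(path)


if __name__ == "__main__":
    # Hypothetical demo: a 16MiB image in a throw-away directory.
    with tempfile.TemporaryDirectory() as workdir:
        demo_path = os.path.join(workdir, "demo.raw")
        print(create_flat_image(demo_path, 16 * 1024 * 1024))
```

For a real install you would point the path at wherever ata0-master expects the image, and pick a size big enough for the OS.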

Solar Cluster: WTF

So… with the new controller we’re able to see how much current we’re getting from the solar.  I note they omit the solar voltage, and I suspect the current is how much is coming out of the MPPT stage, but still, it’s more information than we had before.

With this, we noticed that on a good day, we were getting… 7A.

That’s about what we’d expect for one panel.  What’s going on?  Must be a wiring fault!

I’ll admit when I made the mounting for the solar controller, I didn’t account for the bend radius in the 6-gauge wire I was using, and found it was difficult to feed it into the controller properly.  No worries, this morning at 4AM I powered everything off, took the solar controller off, drilled 6 new holes a bit lower down, fed the wires through and screwed them back in.

Whilst it was all off, I decided I’d individually charge the batteries.  So, right-hand battery came first, I hook the mains charger directly up and let ‘er rip.  Less than 30 minutes later, it was done.

So, disconnect that, hook up the left hand battery.  45 minutes later the charger’s still grinding away.  WTF?

Feel the battery… it is hot!  Double WTF?

It would appear that this particular battery is stuffed.  I’ve got one good one though, so for now I pull the dud out and run with just the one.

I hook everything up, do some final checks, then power the lot back up.

Things seem to go well… I do my usual post-blackout dance of connecting my laptop up to the virtual instance management VLAN, waiting for the OpenNebula VM to fire up, then log into its interface (because we’re too kewl to have a command line tool to re-start an instance), see my router and gitea instances are “powered off”, and instruct the system to boot them.

They come up… I’m composing an email, hit send… “Could not resolve hostname”… WTF?  Wander downstairs, I note the LED on the main switch flashing furiously (as it does on power-up) and a chorus of POST beeps tells me the cluster got hard-power-cycled.  But why?  Okay, it’s up now, back upstairs, connect to the VLAN, re-start everything again.

About to send that email again… boompa!  Same error.  Sure enough, my router is down.  Wander downstairs, and as I get near, I hear the POST beeps again.  Battery voltage is good, about 13.2V.  WTF?

So, about to re-start everything, then I lose contact with my OpenNebula front-end.  Okay, something is definitely up.  Wander downstairs, and the hosts are booting again.  On a hunch I flick the off-switch to the mains charger.  Klunk, the whole lot goes off.  There’s no connection to the battery, and so when the charger drops its power to check the battery voltage, it brings the whole lot down.

WTF once more?  I jiggle some wires… no dice.  Unplug, plug back in, power blinks on then off again.  What is going on?

Finally, I pull right-hand battery out (the left-hand one is already out and cooling off, still very warm at this point), 13.2V between the negative terminal and positive on the battery, good… 13.2V between negative and the battery side of the isolator switch… unscrew the fuse holder… 13.2V between fuse holder terminal and the negative side…  but 0V between negative side on battery and the positive terminal on the SB50 connector.

No apparent loose connections, so I grab one of my spares, swap it with the existing fuse.  Screw the holder back together, plug the battery back in, and away it all goes.

This is the offending culprit.  It’s a 40A 5AG fuse.  Bought for its current carrying capacity, not for the “bling factor” (gold conductors).

If I put my multimeter in continuity test mode and hold a probe on each end cap, without moving the probes, I hear it go open-circuit, closed-circuit, open-circuit, closed-circuit.  Fuses don’t normally do that.

I have a few spares of these thankfully, but I will be buying a couple more to replace the one that’s now dead.  Ohh, and it looks like I’m up for another pair of batteries, and we will have a working spare 105Ah once I get the new ones in.

On the RAM front… the firm I bought the last lot through did get back to me, with some DDR3L ECC SO-DIMMs, again made by Kingston.  Sounded close enough, they were 20c a piece more (AU$855 for 6 vs AU$864.50).

Given that it was likely this would be an increasing problem, I thought I’d at least buy enough to ensure every node had two matched sticks in, so I told them to increase the quantity to 9 and to let me know what I owe them.

At first they sent me the updated invoice with the total amount (AU$1293.20).  No problems there.  It took a bit of back-and-forth before I finally confirmed they had the previous amount I sent them.  Great, so into the bank I trundle on Thursday morning with the updated invoice, and I pay the remainder (AU$428.70).

Friday, I get the email to say that product was no longer available.  Instead, they suggested some Crucial modules which were $60 a piece cheaper.  Well, when entering a gold mine, one must prepare themselves for the shaft.

Checking the link, I found it: these were non-ECC.  1Gbit×64, not 1Gbit×72 like I had ordered.  In any case I was over it.  I fired back an email telling them to cancel the order and return the money.  I was in no mood for Internet shopper Russian Roulette.

It turns out I can buy the original sticks through other suppliers, just not in the quantities I’m after.  So I might be able to buy one or two from a supplier, but I can’t buy 9.  Kingston have stopped making them and so what’s left is whatever companies have in stock.

So I’ll have to move to something else.  It’d be worth buying one stick of the original type so I can pair it with one of the others, but no more than that.  I’m in no mood to do this in a few years’ time when parts are likely to be even harder to source… so I think I’ll bite the bullet and go to 16GB modules.  Due to the limits on my debit card though, I’ll have to buy them two at a time (~AU$900 each go).  The plan is:

  1. Order in two 16GB modules and an 8GB module… take existing 8GB module out of one of the compute nodes and install the 16GB modules into that node.  Install the brand new 8GB module and the recovered 8GB module into two of the storage nodes.  One compute node now has 32GB RAM, and two storage nodes are now upgraded to 16GB each.  Remaining compute node and storage node each have 8GB.
  2. Order in two more 16GB modules… pull the existing 8GB module out of the other compute node, install the two 16GB modules.  Then install the old 8GB module into the remaining storage node.  All three storage nodes now have 16GB each, both compute nodes have 32GB each.
  3. Order two more 16GB modules, install into one compute node, it now has 64GB.
  4. Order in last two 16GB modules, install into the other compute node.
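As a sanity check on the shuffle above, the four purchases can be replayed in a few lines of Python.  This is a rough sketch only: the node names are placeholders, and it assumes every node starts out with a single 8GB module, which is how the plan reads.

```python
def run_upgrade_plan():
    """Replay the four purchases; return per-node RAM totals (GB) after each."""
    # Assumption: each of the five nodes starts with one 8GB module.
    nodes = {
        "compute0": [8], "compute1": [8],
        "storage0": [8], "storage1": [8], "storage2": [8],
    }
    history = []

    def snapshot():
        history.append({name: sum(sticks) for name, sticks in nodes.items()})

    # Step 1: two 16GB modules plus one new 8GB module.
    recovered = nodes["compute0"].pop()
    nodes["compute0"] += [16, 16]
    nodes["storage0"].append(8)          # the brand new 8GB module
    nodes["storage1"].append(recovered)  # the module pulled from compute0
    snapshot()

    # Step 2: two more 16GB modules, old 8GB goes to the last storage node.
    recovered = nodes["compute1"].pop()
    nodes["compute1"] += [16, 16]
    nodes["storage2"].append(recovered)
    snapshot()

    # Step 3: two more 16GB modules into the first compute node.
    nodes["compute0"] += [16, 16]
    snapshot()

    # Step 4: the last two 16GB modules into the other compute node.
    nodes["compute1"] += [16, 16]
    snapshot()
    return history


if __name__ == "__main__":
    for step, totals in enumerate(run_upgrade_plan(), start=1):
        print(step, totals)
```

Replaying it this way confirms no step leaves a node without RAM, and that the single new 8GB module is the only 8GB purchase needed.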

Yes, expensive, but sod it.  Once I’ve done this, the two nodes doing all the work will be at their maximum capacity.  The storage nodes are doing just fine with 8GB, so 16GB should mean there’s plenty of RAM for caching.

As for virtual machine management… I’m pretty much over OpenNebula.  Dealing with libvirt directly is no fun, but at least once configured, it works!  OpenNebula has a habit of not differentiating between a VM being powered off (as in, me logging into the guest and issuing a shutdown), and a VM being forcefully turned off by the host’s power getting yanked!

With the former, there should be some event fired off by libvirt to tell OpenNebula that the VM has indeed turned itself off.  With the latter, it should observe that one moment the VM is there, and next it isn’t… the inference being that it should still be there, and that perhaps that VM should be re-started.
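The distinction being asked for boils down to a very small decision table.  The sketch below is purely illustrative: next_action and its three flags are hypothetical names, not anything OpenNebula or libvirt actually provide, and it assumes something upstream can reliably tell whether a guest-initiated shutdown was seen.

```python
def next_action(was_running, is_running, saw_guest_shutdown):
    """Decide what a manager should do about one VM.

    was_running        -- the VM was up the last time we looked
    is_running         -- the VM is up right now
    saw_guest_shutdown -- a clean, guest-initiated shutdown was reported
    """
    if is_running:
        return "nothing to do"
    if not was_running:
        # It was already off before; someone wanted it that way.
        return "leave off"
    if saw_guest_shutdown:
        # The guest turned itself off deliberately.
        return "leave off"
    # It was there a moment ago, no shutdown was reported, and now it is
    # gone: most likely the host lost power, so bring it back.
    return "restart"


if __name__ == "__main__":
    print(next_action(True, False, True))
    print(next_action(True, False, False))
```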

This could be a libvirt limitation too.  I’ll have to research that.  If it is, then the answer is clear: we ditch libvirt and DIY.  I’ll have to research how I can establish a quorum and schedule where VMs get put, but it should be doable without the hassle that OpenNebula has been so far, and without going to the utter tedium that is OpenStack.

Solar Cluster: Solar Testing

So I’ve had the solar panels up for a month now… and so far, we’ve had a run of very overcast or wet days.

Figures… and we thought this was the “sunshine state”?

I still haven’t done the automatic switching, so right now the mains power supply powers the relay that switches solar to mains.  Thus the only time my cluster runs from solar is when either I switch off the mains power supply manually, or if there’s a power interruption.

The latter has not yet happened… mains electricity supply here is pretty good in this part of Brisbane, the only time I recall losing it for an extended period of time was back in 2008, and that was pretty exceptional circumstances that caused it.

That said, the political football of energy costs is being kicked around, and you can bet they’ll screw something up, even if for now we are better off this side of the Tweed river.

A few weeks back, with predictions of a sunny day, I tried switching off the mains PSU in the early morning and letting the system run off the solar.  I don’t have any battery voltage logging or current logging as yet, but the system went fine during the day.  That evening, I turned the mains back on… but the charger, a Redarc BCDC1225, seemingly didn’t get that memo.  It merrily let both batteries drain out completely.

The IPMI BMCs complained bitterly about the sinking 12V rail at about 2AM when I was sound asleep.  Luckily, I was due to get up at 4AM that day.  When I tried checking a few things on the Internet, I first noticed I didn’t have a link to the Internet.  Looked up at the switch in my room and saw the link LED for the cluster was out.

At that point, some choice words were quietly muttered, and I wandered downstairs with multimeter in hand to investigate.  The batteries had been drained to 4.5V!!!

I immediately performed some load-shedding (ripped out all the nodes’ power leads) and power-cycled the mains PSU.  That woke the charger up from its slumber, and after about 30 seconds, there was enough power to bring the two Ethernet switches in the rack online.  I let the voltage rise a little more, then gradually started re-connecting power to the nodes, each one coming up as it was plugged in.

The virtual machine instances I had running outside OpenNebula came up just fine without any interaction from me, but it seems OpenNebula didn’t see fit to re-start the VMs it was responsible for.  Not sure if that is a misconfiguration, or if I need to look at an alternate solution.

Truth be told, I’m not a fan of libvirt either… overly complicated for starting QEMU VMs.  I might DIY a solution here as there’s lots of things that QEMU can do which libvirt ignores or makes more difficult than it should be.

Anyway… since that fateful night, I have on two occasions run the cluster from solar without incident.  On the off-chance though, I have an alternate charger which I might install at some point.  The downside is it doesn’t boost the 12V input like the other one, so I’d be back to using that Xantrex charger to charge from mains power.

Already, I’m thinking about the criteria for selecting a power source.  It would appear there are a few approaches I can take: I can either purely look at the voltages seen at the solar input and on the battery, or I can look at current flow.

Voltage wise, I tried measuring the solar panel output whilst running the cluster today.  In broad daylight, I get 19V off the panels, and at dusk it’s about 16V.

Judging from that, having the solar “turn on” at 18V and “turn off” at 15V seems logical.  Using the comparator approach, I’d need to set a reference of 16.5V and tweak the hysteresis to give me a ±1.5V swing (3V between the two thresholds).
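To make the numbers concrete: the reference is the midpoint of the two thresholds, and the swing is half the gap between them, so an 18V/15V pair gives a 3V wide band sitting 1.5V either side of 16.5V.  The Python below is a software sketch of that behaviour only; the real thing would be a comparator with feedback resistors, and the function names are made up for illustration.

```python
def reference_and_swing(turn_on_v, turn_off_v):
    """Return (reference voltage, swing either side of it)."""
    reference = (turn_on_v + turn_off_v) / 2
    swing = (turn_on_v - turn_off_v) / 2
    return reference, swing


def make_hysteresis_switch(turn_on_v, turn_off_v):
    """Return a function that tracks the solar on/off state with hysteresis."""
    if turn_off_v >= turn_on_v:
        raise ValueError("turn-off threshold must be below turn-on threshold")
    on_solar = False

    def update(solar_v):
        nonlocal on_solar
        if solar_v >= turn_on_v:
            on_solar = True
        elif solar_v <= turn_off_v:
            on_solar = False
        # Between the thresholds the previous state is held.
        return on_solar

    return update


if __name__ == "__main__":
    print(reference_and_swing(18.0, 15.0))
    switch = make_hysteresis_switch(18.0, 15.0)
    for volts in (16.0, 19.0, 16.0, 14.5, 16.0):
        print(volts, switch(volts))
```

The point of the dead band is visible in the demo loop: 16V reads as “off” on the way up and “on” on the way down, so the relay doesn’t chatter around a single threshold.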

However, this ignores how much energy is actually being produced from solar in relation to how much is being consumed.  It is possible for a day to start off sunny, then for the weather to cloud over.  Solar voltage in that case might be sitting at the 16V mentioned.

If the current is too low though, the cluster will drain more power out than is going in, and this will result in the exact conditions I had a few weeks ago: a flat battery bank.  Thus I’m thinking of incorporating current shunts both on the “input” to the battery bank, and to the “output”.  If output is greater than input, we need mains power.

There’s plenty of literature about interfacing to current shunts.  I’ll have to do some research, but immediately I’m thinking an op-amp running from the battery configured as a non-inverting DC gain block with the inputs going to either side of the current shunt.
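For a feel of the signal levels involved, here is a back-of-envelope calculation in Python.  Every number in it is an assumption for illustration: a hypothetical shunt rated 75mV at 50A, an amplifier gain of 100, and the 7A charging current seen on a good day.

```python
def shunt_resistance(rated_current_a, rated_drop_v):
    """Resistance of a shunt from its rating (Ohm's law: R = V / I)."""
    return rated_drop_v / rated_current_a


def amplified_shunt_voltage(current_a, shunt_ohms, gain):
    """Voltage at the amplifier output for a given current through the shunt."""
    return current_a * shunt_ohms * gain


if __name__ == "__main__":
    # Hypothetical 50A / 75mV shunt, which works out to 1.5 milliohms.
    r_shunt = shunt_resistance(50.0, 0.075)
    print(r_shunt)
    # 7A through that shunt is only about 10.5mV, hence the gain stage.
    print(amplified_shunt_voltage(7.0, r_shunt, 100.0))
```

With those assumed values the raw drop is around 10mV, which is why some gain is needed before a comparator can do anything sensible with it.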

Combining the approaches is attractive.  So turn on when solar exceeds 18V, turn off when battery output current exceeds battery input current.  A dual op-amp, a dual comparator, two current shunts, an R-S flip-flop and a P-MOSFET for switching the relay, and no hysteresis calculations needed.
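The combined scheme is easy to model in software before committing it to hardware.  The following is a minimal sketch, not a design: the SolarLatch class is a made-up name, and making the reset input win when both conditions are true at once is my assumption, chosen because a flat battery is the worse failure.

```python
class SolarLatch:
    """Software model of the R-S flip-flop selecting solar or mains."""

    def __init__(self, solar_on_v=18.0):
        self.solar_on_v = solar_on_v
        self.on_solar = False  # start on mains until solar proves itself

    def update(self, solar_v, battery_in_a, battery_out_a):
        set_input = solar_v > self.solar_on_v
        reset_input = battery_out_a > battery_in_a
        if reset_input:
            # Assumption: reset dominates, so a draining battery always
            # forces the system back onto mains.
            self.on_solar = False
        elif set_input:
            self.on_solar = True
        # With neither input active the latch holds its state.
        return self.on_solar


if __name__ == "__main__":
    latch = SolarLatch()
    readings = [(16.0, 5.0, 3.0), (19.0, 5.0, 3.0), (16.0, 5.0, 3.0), (16.0, 2.0, 3.0)]
    for solar_v, amps_in, amps_out in readings:
        print(solar_v, amps_in, amps_out, latch.update(solar_v, amps_in, amps_out))
```

Note how the third reading stays on solar at 16V: once set, only the current comparison can turn it off, which is what removes the need for a hysteresis calculation.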

Solar Cluster: OpenNebula Front-end setup

So, the front-end for OpenNebula will be a VM that migrates between the two compute nodes in a HA arrangement.  Likewise with the core router, and border router, although I am also tossing up trying again with the little Advantech UNO-1150G I have lying around.

For now, I’ve not yet set up the HA part; I’ll come to that.  There are guides for using libvirt with corosync/heartbeat.  Most also call for DRBD as the block device for the VM, but we will not be using this as our block device (RADOS Block Device) is already redundant.

To host OpenNebula, I’ll use Gentoo with musl-libc since that’ll shrink the footprint down just a little bit.  We’ll run it on a MariaDB back-end.

Since we’re using musl, you’ll want to install layman and the musl overlay as not all packages build against musl out-of-the-box.  Also install gentoolkit, as you’ll need to set USE flags, and euse makes this easy:

# emerge layman
# layman -L
# layman -a musl
# emerge gentoolkit

Now that some basic packages are installed, we need to install OpenNebula’s prerequisites. They tell you that xmlrpc-c is amongst these. BUT, they don’t tell you that it needs support for abyss, and the scons build system they use will just give you a cryptic error saying it couldn’t find xmlrpc. The answer is not, as suggested, to specify the path to xmlrpc-c-config, which happens to be in ${PATH} anyway, as that will net the same result, and break things later when you fix the real issue.

# euse -p dev-libs/xmlrpc-c -E abyss

Now we can build the dependencies… this isn’t a full list, but includes everything that Gentoo ships in repositories; the remaining Ruby gems will have to be installed separately.

# emerge --ask dev-lang/ruby dev-db/sqlite dev-db/mariadb \
dev-ruby/sqlite3 dev-libs/xmlrpc-c dev-util/scons \
dev-ruby/json dev-ruby/sinatra dev-ruby/uuidtools \
dev-ruby/curb dev-ruby/nokogiri

With that done, create a user account for OpenNebula:

# useradd -d /opt/opennebula -m -r opennebula

Now you’re set to build OpenNebula itself:

# tar -xzvf opennebula-5.4.0.tar.gz
# cd opennebula-5.4.0
# scons mysql=yes

That’ll run for a bit, but should succeed. At the end:

# ./install.sh -d /opt/opennebula -u opennebula -g opennebula

That’s about where I’m at now… the link in the README for further documentation is broken; here is where they keep their current documentation.

Solar Cluster: First virtual instances running

So, since my last log, I’ve managed to tidy up the wiring on the cluster, making use of the plywood panel at the back to mount all my DC power electronics, and generally tidying everything up.

I had planned to use an SB50 connector to connect the cluster up to the power supply, so made provisions for this in the wiring harness. Turns out, this was not necessary; it was easier in the end to just pull apart the existing wiring and hard-wire the cluster up to the charger input.

So, I’ve now got a spare load socket hanging out the front, which will be handy if we wind up with unreliable mains power in the near future since it’s a convenient point to hook up 12V appliances.

There’s a solar power input there ready, and space to the left of that to build a little control circuit that monitors the solar voltage and switches in the mains if needed. For now though, the switching is done with a relay that’s hard-wired on.

Today though, I managed to get the Ceph clients set up on the two compute nodes.  virt-manager is buggy where it comes to RBD pools.  In particular, adding an RBD storage pool doesn’t work as there’s no way to define authentication keys, and even if you have the pool defined, you find that trying to use images from that pool causes virt-manager to complain it can’t find the image on your local machine. (Well duh! This is a known issue.)

I was able to find an XML cheat-sheet for defining a domain in libvirt, which I was then able to use with Ceph’s documentation.

A typical instance looks like this:

<domain type='kvm'>
  <!-- name of your instance -->
  <name>instancename</name>
  <!-- a UUID for your instance, use `uuidgen` to generate one -->
  <uuid>00ec9b97-c49a-45f8-befe-f74ad6bde2fe</uuid>
  <!-- memory is given in KiB, so this is 512MiB -->
  <memory>524288</memory>
  <vcpu>1</vcpu>
  <os>
    <type arch="x86_64">hvm</type>
  </os>
  <clock offset="utc"/>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='network' device='disk'>
      <source protocol='rbd' name="poolname/image.vda">
        <!-- the hostnames or IPs of your Ceph monitor nodes -->
        <host name="s0.internal.network" />
        <host name="s1.internal.network" />
        <host name="s2.internal.network" />
      </source>
      <target dev='vda'/>
      <auth username='libvirt'>
        <!-- the UUID here is what libvirt allocated when you did
             `virsh secret-define foo.xml`, use `virsh secret-list`
             if you've forgotten what that is. -->
        <secret type='ceph' uuid='23daf9f8-1e80-4e6d-97b6-7916aeb7cc62'/>
      </auth>
    </disk>
    <disk type='network' device='cdrom'>
      <source protocol='rbd' name="poolname/image.iso">
        <!-- the hostnames or IPs of your Ceph monitor nodes -->
        <host name="s0.internal.network" />
        <host name="s1.internal.network" />
        <host name="s2.internal.network" />
      </source>
      <target dev='hdd'/>
      <auth username='libvirt'>
        <secret type='ceph' uuid='23daf9f8-1e80-4e6d-97b6-7916aeb7cc62'/>
      </auth>
    </disk>
    <interface type='network'>
      <source network='default'/>
      <!-- must be a unicast address; 52:54:00 is the usual QEMU/KVM prefix -->
      <mac address='52:54:00:12:34:56'/>
    </interface>
    <graphics type='vnc' port='-1' keymap='en-us'/>
  </devices>
</domain>
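Two values in that template need to be unique per instance: the UUID and the MAC address.  uuidgen covers the first; the Python below is a small sketch that covers both.  The helper names are made up, and the 52:54:00 prefix is simply the one conventionally used for QEMU/KVM guests.  As far as I’m aware libvirt rejects multicast MAC addresses (first octet odd), which is what is_unicast checks for.

```python
import random
import uuid


def random_kvm_mac(rng=random):
    """Return a random MAC address under the conventional 52:54:00 prefix."""
    tail = [rng.randint(0, 255) for _ in range(3)]
    return "52:54:00:" + ":".join(f"{octet:02x}" for octet in tail)


def is_unicast(mac):
    """True if the least significant bit of the first octet is clear."""
    first_octet = int(mac.split(":")[0], 16)
    return first_octet & 1 == 0


def new_instance_ids():
    """Return a (uuid, mac) pair to paste into a new domain definition."""
    return str(uuid.uuid4()), random_kvm_mac()


if __name__ == "__main__":
    instance_uuid, mac = new_instance_ids()
    print(instance_uuid)
    print(mac, is_unicast(mac))
```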

Having defined the domain, you can then edit it at will in virt-manager. I was able to switch the network interface over to using virtio, plop it on a bridge so it was wired up to the correct VLAN and start the instance up.

I’ve since managed to migrate 3 instances over, namely an estate database, Brisbane Area WICEN’s OwnCloud site, and my own blog.

These are sufficient to try the system out. I’m already finding these instances much more responsive, using raw Ceph even, than the original server.

My next move I think will be to see if I can get corosync/heartbeat to manage a HA VM instance. That is, if one of the compute nodes goes offline, the instance restarts on the other compute node.

Two services come to mind where HA is concerned: terminating the PPPoE link for our Internet, and a virtual management node for a higher-level system such as OpenNebula. OpenNebula really needs something semi-HA, since it really gets its knickers in a twist if the master node goes down. I also want my border router to be HA, since I won’t necessarily be around to migrate it to a different node.

Everything else, well I suspect OpenNebula can itself manage those, and long term the instances I just liberated today from my old box, will become instances within OpenNebula.

The other option is I dip my toe into OpenStack (again), since it is inherently HA by design, but it is also a royal pain to get working.

Bootstrapping Gentoo Linux

So, in amongst my pile of crusty old hardware is the old netbook I used to use in the latter part of my university days. It is a Lemote Yeeloong, and sports a ~700MHz Loongson 2F CPU (MIPS III little endian ISA) and 1GB RAM.

Back in the day it was a brilliant little machine. It came out of the box running a localised (for China) version of Debian, and had pretty much everything you’d need. I naturally repartitioned the machine, setting up Gentoo and I had a separate partition for Debian, so I could actually dual-boot between them.

Fast forward 10 years, the machine runs, but the battery is dead, and Debian is dropping support for MIPS-III machines. Debian Jessie supports them, but Stretch, likely due for release some time this year, will not: if you haven’t got a CPU that supports mips32r2 or mips64r2, you’re stuffed.

I don’t want to throw this machine away.  Being as esoteric as it is, it is an unlikely target for theft, as to the casual observer, it’ll just be “some crappy netbook”.  If someone were to try and steal it, there’s a very high probability I’ll recover it with my data because the day its PMON2000 boot firmware successfully boots an x86-64 OS like Ubuntu or Windows without the assistance of a VM of some kind would be the day Satan puts a requisition order in for anti-freeze and winter mittens.

My use case is for a machine I can take with me on the bicycle.  My needs aren’t huge: I won’t be playing video on this thing, it’ll be largely for web browsing and email.  The web browser needs to support JavaScript, so that rules out options like ELinks or Dillo.  My preferred browser is Firefox but I’ll settle for something Webkit-based if that’s all that’s out there.

So what operating systems do I have for a machine that sports a MIPS-III CPU and 1GB RAM?  Fedora has a MIPS port, but that, like Debian, is for the newer MIPS systems.  Arch Linux too is for newer architectures.

I could bootstrap Alpine Linux… and maybe that’s worth looking into, they seem to be doing some nice work in producing a small and capable Linux distribution.  They don’t yet support MIPS though.

Linux From Scratch is an option, if a little labour intensive.  (Been there, done that.)

OpenBSD directly supports this machine, and so I gave OpenBSD 6.0 a try.  It’s a very capable OS, and while it isn’t Linux, there isn’t much that an experienced Linux user like myself needs to adapt to in order to effectively use the OS.  The ports collection is a great asset to OpenBSD, with a large selection of pre-built packages already available.  Using that, it is possible to get a workable environment up and running very quickly.  OpenBSD/loongson uses the n64 ABI.

Due to licensing worries, they use a particularly old version of binutils as their linker and assembler.  The plan seems to be they wish to wean themselves off the GNU toolchain in favour of LLVM.  At this time though, much of the system is built using the GNU toolchain with some custom patches.  I found that, on the Yeeloong, 1GB RAM was not sufficient for compiling LLVM, even after adding additional swap files, and some packages I needed weren’t available in ports, nor would they build with the version of GNU tools available.

Maybe as they iron out the kinks in their build environment with LLVM, this will be worth re-visiting.  They’ve done a nice job so far, but it’s not quite up to where I need it to be.

Gentoo actually gives me the choice of two possible ABIs: o32 and n32.

o32 is the old 32-bit ABI, and suffers a number of performance problems, but generally works.  It’s what Debian Jessie and earlier supplies, and what their mips32 port will produce from Stretch onwards.

n32 is the MIPS equivalent of what some of you may know as x32 on AMD64 platforms.  It is a 32-bit environment (32-bit long and pointers) that runs on the 64-bit instruction set… the idea being that very few applications actually benefit from 64-bit pointers, and so the usual quantities like int and long remain the same as what they’d be on o32, saving memory.  The long long data type gets a boost because, although the environment is “32-bit”, the 64-bit operations are still available for use.

The trouble is, some applications have problems with this mode.  Either the code sees “mips64” in the CHOST and assumes a full 64-bit system (aka n64), or it assumes the pointers are the same width as a long, or the build system makes silly assumptions as to where things get put.  (virtualenv comes to mind, which is what started me on this journey.  The same problem affects x32 on AMD64.)
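A quick reference for the three ABIs helps show why the “mips64 means 64-bit everything” assumption falls over.  The type widths in the table below are the standard ones for o32, n32 and n64; the two helper functions are hypothetical, written only to contrast the faulty check with a correct one.

```python
# Widths in bits for each MIPS ABI.
MIPS_ABIS = {
    "o32": {"int": 32, "long": 32, "pointer": 32, "long long": 64, "register": 32},
    "n32": {"int": 32, "long": 32, "pointer": 32, "long long": 64, "register": 64},
    "n64": {"int": 32, "long": 64, "pointer": 64, "long long": 64, "register": 64},
}


def naive_is_n64(chost):
    """The faulty assumption: a mips64 CHOST implies a full 64-bit userland."""
    return "mips64" in chost


def abi_is_full_64bit(abi):
    """The check that actually matters: how wide are the pointers?"""
    return MIPS_ABIS[abi]["pointer"] == 64


if __name__ == "__main__":
    # An n32 system can carry a mips64el CHOST, so the naive test says
    # "64-bit" while the pointers are still 32 bits wide.
    chost = "mips64el-unknown-linux-gnu"
    print(naive_is_n64(chost), abi_is_full_64bit("n32"), abi_is_full_64bit("n64"))
```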

So I thought, I’d give n64 a try.  I’d see if I can build a cross-compiler on my AMD64 host, and bootstrap Gentoo from that.

Step 1: Cross-compiler

For the cross-compiler, Gentoo has a killer feature that I have not seen in too many other distributions: crossdev.  This is a toolchain build tool that can generate cross-compiler toolchains for most processor architectures and environments.

This is installed by running emerge sys-devel/crossdev.

A gotcha with hardened

I run “hardened” AMD64 stages on my machines, and there’s a little gotcha to be aware of: the hardened USE flag gets set by crossdev, and that can cause fun and games if, like on MIPS, the hardening features haven’t been ported.  My first attempt at this produced a n64 userland where pretty much everything generated a segmentation fault, the one exception being Python 2.7.  If I booted with init=/bin/bash (or init=/bin/bb), my virtual environment died; if I booted with init=/usr/bin/python2.7, I’d be dropped straight into a Python shell, where I could import the subprocess module and try to run things.

Cleaning up, and forcing crossdev to leave off hardened support, got things working.

Building the toolchain

With the above gotcha in mind:

# crossdev --abis n64 \
           --env 'USE="-hardened"' \
           -s4 -t mips64el-unknown-linux-gnu

The --abis n64 tells crossdev you want a n64 ABI toolchain, and the --env will hopefully keep the hardened flag unset. Failing that, try this:

# cat > /etc/portage/package.use/mips64 <<EOF
cross-mips64el-unknown-linux-gnu/binutils -hardened
cross-mips64el-unknown-linux-gnu/gcc -hardened
cross-mips64el-unknown-linux-gnu/glibc -hardened
EOF

If you want a combination of specific toolchain components to try, I’m using:

  • Binutils: 2.28
  • GCC: 5.4.0-r3
  • glibc: 2.25
  • headers: 4.10

Step 2: Checking our toolchain

This is where I went wrong the first time: I tried building the entire OS, only to discover I had wasted hours of CPU time building non-functional binaries. Save yourself some frustration. Start with a small binary to test.

A good target for this is busybox. Run mips64el-unknown-linux-gnu-emerge busybox, and wait for a bit.

When it completes, you should hopefully have a busybox binary:

RC=0 stuartl@beast ~ $ file /usr/mips64el-unknown-linux-gnu/bin/busybox 
/usr/mips64el-unknown-linux-gnu/bin/busybox: ELF 64-bit LSB executable, MIPS, MIPS-III version 1 (SYSV), statically linked, for GNU/Linux 3.2.0, stripped
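If file isn’t available, the same basic facts can be read straight out of the first 20 bytes of the binary.  The Python below is a minimal sketch of that: the field offsets and the EM_MIPS value of 8 come from the ELF format itself, while the function names are made up.  A 64-bit ELF class is a reasonable hint that you got n64, since o32 and n32 binaries use the 32-bit ELF class.

```python
import struct
import sys

EM_MIPS = 8  # e_machine value for MIPS in the ELF specification


def describe_elf(header):
    """Decode word size, byte order and machine type from an ELF header."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    bits = {1: 32, 2: 64}.get(header[4])              # EI_CLASS
    endian = {1: "little", 2: "big"}.get(header[5])   # EI_DATA
    if bits is None or endian is None:
        raise ValueError("unrecognised ELF class or data encoding")
    # e_machine is a 16-bit field at offset 18 in both ELF32 and ELF64.
    fmt = "<H" if endian == "little" else ">H"
    (machine,) = struct.unpack_from(fmt, header, 18)
    return {"bits": bits, "endian": endian, "machine": machine}


def looks_like_mips64el(path):
    """True if the file is a 64-bit, little-endian MIPS ELF binary."""
    with open(path, "rb") as binary:
        info = describe_elf(binary.read(20))
    return info == {"bits": 64, "endian": "little", "machine": EM_MIPS}


if __name__ == "__main__" and len(sys.argv) > 1:
    print(looks_like_mips64el(sys.argv[1]))
```

Pass it the path to the freshly built busybox; it says nothing about whether the binary actually runs, which is what the next step is for.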

Testing busybox

There is qemu-user-mips64el, but last time I tried it, I found it broken. So an easier option is to use real hardware or QEMU emulating a full system. In either case, you’ll want to ensure you have your system-of-choice running with a working 64-bit kernel already.  If your real hardware isn’t already running a 64-bit Linux kernel, use QEMU.

For QEMU, the path-of-least-resistance I found was to use Debian. Aurélien Jarno has graciously provided QEMU images and corresponding kernels for a good number of ports, including little-endian MIPS.

Grab the Wheezy disk image and the corresponding kernel, then run the following command:

# qemu-system-mips64el -M malta \
    -kernel vmlinux-3.2.0-4-5kc-malta \
    -hda debian_wheezy_mipsel_standard.qcow2 \
    -append "root=/dev/sda1 console=ttyS0,115200" \
    -serial stdio -nographic -net nic -net user

Let it boot up, then log in with username root, password root.

Install openssh-client and rsync (these do not ship with the image):

# apt-get update
# apt-get install openssh-client rsync

Now, you can create a directory, and pull the relevant files from your host, then try the binary out:

# mkdir gentoo
# rsync -aP 10.0.2.2:/usr/mips64el-unknown-linux-gnu/ gentoo/
# chroot gentoo bin/busybox ash

With luck, you should be in the chroot now, using Busybox.

Step 3: Building the system

Having done a “hello world” test, we’re now ready to build everything else. Start by tweaking your /usr/mips64el-unknown-linux-gnu/etc/portage/make.conf to your liking then adjust /usr/mips64el-unknown-linux-gnu/etc/portage/make.profile to point to one of the MIPS profiles. For reference, on my system:

RC=0 stuartl@beast ~ $ ls -l /usr/mips64el-unknown-linux-gnu/etc/portage/make.profile
lrwxrwxrwx 1 root root 49 May  1 09:26 /usr/mips64el-unknown-linux-gnu/etc/portage/make.profile -> /usr/portage/profiles/default/linux/mips/13.0/n64
RC=0 stuartl@beast ~ $ cat /usr/mips64el-unknown-linux-gnu/etc/portage/make.conf 
CHOST=mips64el-unknown-linux-gnu
CBUILD=x86_64-pc-linux-gnu
ARCH=mips

HOSTCC=x86_64-pc-linux-gnu-gcc

ROOT=/usr/${CHOST}/

ACCEPT_KEYWORDS="mips ~mips"

USE="${ARCH} -pam"

CFLAGS="-O2 -pipe -fomit-frame-pointer"
CXXFLAGS="${CFLAGS}"

FEATURES="-collision-protect sandbox buildpkg noman noinfo nodoc"
# Be sure we dont overwrite pkgs from another repo..
PKGDIR=${ROOT}packages/
PORTAGE_TMPDIR=${ROOT}tmp/

ELIBC="glibc"

PKG_CONFIG_PATH="${ROOT}usr/lib/pkgconfig/"
#PORTDIR_OVERLAY="/usr/portage/local/"

Now, you should be ready to start building:

# mips64el-unknown-linux-gnu-emerge -e \
    --keep-going -j6 --load-average 12.0 @system
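
The -j6 and --load-average 12.0 values above suit a six-core machine. As a rough sketch, the same ratio (one job per core, load ceiling at twice the core count) can be derived from whatever machine you’re on. The ratio is simply the one used above, not a Portage recommendation, and emerge_job_opts is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: print emerge job options for a given core count,
# using one job per core and a load-average ceiling of twice the cores.
emerge_job_opts() {
    cores=$1
    printf '%s\n' "-j${cores} --load-average $((cores * 2)).0"
}

# Default to the number of online CPUs reported by getconf.
emerge_job_opts "$(getconf _NPROCESSORS_ONLN)"
```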

Now, go away, and do something else for several hours.  It’ll take that long, depending on the speed of your machine.  In my case, the machine is an AMD Phenom II x6 with 8GB RAM, which was brand new in 2010.  It took a good day or so.

Step 4: Testing our system

We should have enough that we can boot our QEMU VM with this image instead.  One way of trying it would be to copy across the userland tree the same way we did for pulling in busybox and chrooting back in again.

In my case, I took the opportunity to build a kernel specifically for the VM that I’m using, and made up a disk image using the new files.

Building a kernel

Your toolchain should be able to cross-build a kernel for the virtual machine.  To get you started, here’s a kernel config file.  Download it, decompress it, then drop it into your kernel source tree as .config.

Having done that, run make olddefconfig ARCH=mips to set the defaults, then make menuconfig ARCH=mips and customise to your heart’s content. When finished, run make -j6 vmlinux modules ARCH=mips CROSS_COMPILE=mips64el-unknown-linux-gnu- to build the kernel and modules.

Finally, run make modules_install firmware_install ARCH=mips INSTALL_MOD_PATH=$PWD/modules CROSS_COMPILE=mips64el-unknown-linux-gnu- to install the kernel modules and firmware into a convenient place.
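
Every one of those make invocations needs the same cross-build settings, and it’s easy to leave one off. A small wrapper helps; this is a sketch only, where kmake, build_kernel and DRY_RUN are names I’ve invented, and the targets are the ones mentioned above.

```shell
#!/bin/sh
# Target settings taken from the commands in the text.
ARCH=mips
CROSS_COMPILE=mips64el-unknown-linux-gnu-

# Hypothetical wrapper: run make with the cross-build settings applied.
# With DRY_RUN set to a non-empty value, print the command instead.
kmake() {
    ${DRY_RUN:+echo} make ARCH="$ARCH" CROSS_COMPILE="$CROSS_COMPILE" "$@"
}

# The build steps from the text, to be run from the kernel source tree.
# Drop firmware_install on newer kernels that no longer have that target.
build_kernel() {
    kmake olddefconfig &&
    kmake -j6 vmlinux modules &&
    kmake modules_install firmware_install INSTALL_MOD_PATH="$PWD/modules"
}
```

Run DRY_RUN=1 build_kernel first to see what would be executed, then build_kernel for real once the .config is in place.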

Making a root disk

Create a blank, raw disk image using qemu-img, then partition it as you like and mount it as a loopback device:

# qemu-img create -f raw gentoo.raw 8G
# fdisk gentoo.raw
(do your partitioning here)
# losetup -P /dev/loop0 $PWD/gentoo.raw

Now you can format the partitions /dev/loop0pX as you see fit, then mount them in some convenient place. I’ll assume that’s /mnt/vm for now. You’re ready to start copying everything in:

# rsync -aP /usr/mips64el-unknown-linux-gnu/ /mnt/vm/
# rsync -aP /path/to/kernel/tree/modules/ /mnt/vm/

You can use this opportunity to make some tweaks to configuration files, like updating etc/fstab, tweaking etc/portage/make.conf (changing ROOT, removing CBUILD), and setting up a getty on ttyS0. I originally also symlinked lib to lib64 at this point, as I used to in non-multilib environments such as this. Don’t symlink lib and lib64! See below for why. For the record, these are the commands I ran, which you should skip:

# cd /mnt/vm
# mv lib/* lib64
# rmdir lib
# ln -s lib64 lib
# cd usr
# mv lib/* lib64
# rmdir lib
# ln -s lib64 lib

When you’re done, unmount.

First boot

Run QEMU with the following arguments:

# qemu-system-mips64el -M malta \
    -kernel /path/to/your/kernel/vmlinux \
    -hda /path/to/your/gentoo.raw \
    -append "root=/dev/sda1 console=ttyS0,115200 init=/bin/bash" \
    -serial stdio -nographic -net nic -net user

It should boot straight to a bash prompt. Mount the root read/write, and then you can make any edits you need to do before boot, such as changing the root password. When done, re-mount the root as read-only, then exec /sbin/init.

# mount / -o rw,remount
# passwd
… etc
# mount / -o ro,remount
# exec /sbin/init

With luck, it should boot to completion.

Step 5: Making the VM a system service

Now, it’d be real nice if libvirt actually supported MIPS VMs, but it doesn’t appear that it does, or at least I couldn’t get it to work.  virt-manager certainly doesn’t support it.

No matter, we can make do with a telnet console (on loopback), and supervisord to daemonise QEMU.  I use the following supervisord configuration file to start my VMs:

[unix_http_server]
file=/tmp/supervisor.sock   ; (the path to the socket file)

[supervisord]
logfile=/tmp/supervisord.log ; (main log file;default $CWD/supervisord.log)
logfile_maxbytes=50MB        ; (max main logfile bytes b4 rotation;default 50MB)
logfile_backups=10           ; (num of main logfile rotation backups;default 10)
loglevel=info                ; (log level;default info; others: debug,warn,trace)
pidfile=/tmp/supervisord.pid ; (supervisord pidfile;default supervisord.pid)
nodaemon=false               ; (start in foreground if true;default false)
minfds=1024                  ; (min. avail startup file descriptors;default 1024)
minprocs=200                 ; (min. avail process descriptors;default 200)

; the below section must remain in the config file for RPC
; (supervisorctl/web interface) to work, additional interfaces may be
; added by defining them in separate rpcinterface: sections
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl=unix:///tmp/supervisor.sock ; use a unix:// URL  for a unix socket

[program:qemu-mips64el]
command=/usr/bin/qemu-system-mips64el -cpu MIPS64R2-generic -m 2G -spice disable-ticketing,port=5900 -M malta -kernel /home/stuartl/kernels/qemu-mips/vmlinux -hda /var/lib/libvirt/images/gentoo-mips64el.raw -append "mem=256m@0x0 mem=1792m@0x90000000 root=/dev/sda1 console=ttyS0,115200" -chardev socket,id=char0,port=65223,host=::1,server,telnet,nowait -chardev socket,id=char1,port=65224,host=::1,server,telnet,nowait -serial chardev:char0 -mon chardev=char1,mode=readline -net nic -net bridge,helper=/usr/libexec/qemu-bridge-helper,br=br0

The above creates two telnet sockets: port 65223 is the VM’s console, and 65224 is the QEMU control console. The VM has the maximum 2GB RAM possible and uses bridged networking to the network bridge br0. There is a graphical console available via SPICE.

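The telnet interfaces are bound to loopback (::1), so one must use SSH tunnelling to reach those ports from another host. Note that the -spice option above sets no addr=, and as far as I can tell from the QEMU documentation it then listens on all addresses, so add an addr= setting if you want SPICE restricted to loopback too. You can change the above command line to use VNC if that’s what you prefer.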
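
For reaching the two telnet consoles from another machine, an SSH local port forward does the job. A minimal sketch, assuming you have SSH access to the VM host; tunnel_cmd and vmhost.example.com are placeholders, not anything from my setup.

```shell
#!/bin/sh
# Hypothetical helper: print an ssh command that forwards the VM console
# (65223) and QEMU monitor (65224) ports from the VM host to this machine.
# -N tells ssh not to run a remote command, just forward ports.
tunnel_cmd() {
    host=$1
    printf 'ssh -N -L 65223:[::1]:65223 -L 65224:[::1]:65224 %s\n' "$host"
}

tunnel_cmd vmhost.example.com
```

Run the printed command in one terminal, then telnet localhost 65223 in another to get the VM’s serial console. On the VM host itself, supervisorctl status qemu-mips64el will tell you whether QEMU is actually running.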

At this point, the VM should be able to boot on its own. I’d start with installing some basic packages, and move on from there. You’ll find the environment is very sparse (my build had no Perl binary for example) but the basics for building everything should be there.

You may also find that what is there, isn’t quite installed right… I found that sshd wasn’t functional due to missing users… a problem soon fixed by doing an emerge -K openssh (the earlier step will have produced binary packages).

In my case, the basic packages are a decent text editor (vim) and GNU screen, so I can start a build, then detach.  Lastly, I’ll need catalyst, which is Gentoo’s release engineering tool.

At the moment, this is where I’m at.  GNU screen has indirectly pulled in Perl as a dependency, and that is building as I type this.  It is building faster than the little netbook does, and I have the bonus that I can throw more RAM at the problem than I can on the real hardware. The plan from here:

  1. emerge -ek @system, to build everything that got missed before.
  2. ROOT=/tmp/seed emerge -eK @system, to bundle everything up into a staging area
  3. populating /tmp/seed/dev with device files
  4. tar-ing up /tmp/seed to make my initial “seed” stage for catalyst.
  5. building the first n64 stages for Gentoo using catalyst
  6. building the packages I want for the netbook in a chroot
  7. transferring the chroot to the netbook
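
Steps 2 to 4 of that plan could look something like the following. Treat it as a sketch of the idea rather than a tested recipe: the choice of device nodes is my assumption of a bare minimum, the tarball name is whatever you pass in, and run, make_seed and DRY_RUN are invented names.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command, or just print it when DRY_RUN is set.
run() {
    ${DRY_RUN:+echo} "$@"
}

# Install binary packages into a staging root, add a minimal set of
# device nodes, then archive the result as a seed stage tarball.
make_seed() {
    seed=$1       # staging directory, /tmp/seed in the plan above
    tarball=$2    # output file name (an assumption, pick your own)

    run env ROOT="$seed" emerge -eK @system &&
    run mknod -m 600 "$seed/dev/console" c 5 1 &&
    run mknod -m 666 "$seed/dev/null" c 1 3 &&
    run mknod -m 666 "$seed/dev/zero" c 1 5 &&
    run tar -cjpf "$tarball" -C "$seed" .
}
```

DRY_RUN=1 make_seed /tmp/seed seed-stage.tar.bz2 prints the commands without touching anything, which is worth doing before running it as root.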

Symlinking lib and lib64… don’t do it!

So, I was doing this years ago when n32 was experimental.  I recall it being necessary then, as this was before Portage had proper multilib support.  The earlier mipsel n32 stages I built, which started out from kanaka’s even more experimental multilib stages, required this kludge to work around the lack of support in Portage.

Portage has changed: it now properly handles multilib, so the symlink kludge is not only unnecessary, it breaks things rather badly, as I discovered.  When packages merge files to /lib, rather than following the symlink, they’ll replace it with a directory.  At that point, all hell breaks loose, because stuff that “appeared” in /lib before is no longer there.

I was able to recover by rsync-ing /lib64 to /lib, which isn’t a pretty solution, but it’ll be enough to get an initial “seed” stage.  Running that seed stage through Catalyst will clean up the remnants of that bungle.
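
If you’ve inherited a root tree and aren’t sure whether it has this kludge in it, a quick check before letting Portage loose might look like this. A minimal sketch; lib_is_symlinked is a made-up name, and it only looks at lib and usr/lib.

```shell
#!/bin/sh
# Hypothetical check: succeeds if lib or usr/lib under the given root
# directory is a symlink, which is the layout this section warns against.
lib_is_symlinked() {
    root=$1
    [ -L "$root/lib" ] || [ -L "$root/usr/lib" ]
}
```

For example, lib_is_symlinked /mnt/vm && echo "fix this before merging packages".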

Solar cluster: Software stack beginning to take shape.

So, after putting aside the charge controller for now, I’ve taken some time to see if I can get the software side of things into shape.

In the midst of my development, I found a small wiring fault that was responsible for blowing a couple of fuses. A small nick in the sheath of the positive wire in a power cable was letting the crimp part of a DC barrel connector contact +12V. A tweak of that crimp and things are back to normal. I’ve swapped all the 10A fuses for 5A ones, since the regulators are only rated at 7.5A.

The VLANs are assigned now, and I have bonding going between the two pairs of Ethernet devices. In spite of the switch only supporting 4 LAGs, it seems fine with me doing LACP on effectively 10 LAGs. I’ll see how it goes.

The switch has 5 ports spare after plugging in all 5 nodes and a 16-port switch for the IPMI subnet. One will be used for a management interface so I can plug a laptop in, and the others will be paired with LACP for linking to my two existing Cisco SG200-8s.

One of the goals of this project is to try and push the performance of Ceph. In the office, we tried bare Ceph, and found that, while it’s fine for sequential I/O, it suffers a bit with random reads/writes, and Windows-based Hyper-V images like to do a lot of random reads/writes.

Putting FlashCache in the mix really helped, but I note that it’s no longer maintained. EnhanceIO had only just forked when I tried FlashCache; now it seems that’s the official successor.

There are two alternatives to FlashCache/EnhanceIO: bcache and dm-cache.

I’ll rule out bcache now as it requires the backing image be “formatted” for use. In other words, the backing image is not a raw image, but some proprietary (to bcache) format. This isn’t unworkable, but it raises concerns with me about portability: if I migrate a VM, do I need to migrate its cache too, or is it sufficient to cleanly shut down and detach the bcache device before re-assembling it on the new host?

By contrast, dm-cache and EnhanceIO/FlashCache work with raw backing images, making them much more attractive. Flush the cache before migration or use writethru mode, and all should be fine. dm-cache does however require a separate metadata device: messy, but not unworkable. We can provision the cache-related devices we need using LVM2, and use the kernel-mode Rados block device as our backing image.
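
To give an idea of what the dm-cache side involves, here’s a sketch that builds a device-mapper table line for the cache target in write-through mode. The table layout follows the kernel’s device-mapper cache documentation; the 512-sector block size and default policy come from that documentation’s example, and the device paths in the usage note are placeholders. dm_cache_table is an invented name.

```shell
#!/bin/sh
# Hypothetical helper: print a dm-cache table line.
# Arguments: origin size in 512-byte sectors, metadata device,
# cache (SSD) device, origin (backing) device.
# Layout: start length cache <meta> <cache> <origin> <block size>
#         <#features> <feature> <policy> <#policy args>
dm_cache_table() {
    sectors=$1 meta=$2 cache=$3 origin=$4
    printf '0 %s cache %s %s %s 512 1 writethrough default 0\n' \
        "$sectors" "$meta" "$cache" "$origin"
}
```

Usage would be along the lines of dmsetup create vm-cached --table "$(dm_cache_table "$(blockdev --getsz /dev/rbd0)" /dev/vg0/cache-meta /dev/vg0/cache-data /dev/rbd0)", with the two LVM2 volumes living on the local SSD and /dev/rbd0 being the mapped Rados block device.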

So I think my caching subsystem is a two-horse race: dm-cache or EnhanceIO. I guess we’ll give them a try and see how they go.

For those following along at home, if you’re running a kernel newer than 4.3, you might want to use this fork of EnhanceIO due to changes in the kernel block I/O layer.

To manage the OpenNebula master node, I’ve installed corosync/pacemaker. Normally these are used with DRBD, however I figure Ceph can fulfil that role. The concepts are similar: it’s a shared block device. I’m not sure if it’ll be LXC, Docker or a VM at this point that “contains” the server, but whatever it is, it should be possible for it to have its root FS and data on Ceph.

I’m leaning towards LXC for this. Time for some more experimentation.

Solar Cluster: Accumulating parts and planning the system

Well, figured I’d document this project here in case anyone was interested in doing this for personal amusement or for their workplace.

The list I’ve just chucked up is not a complete list, nor is it a prescribed list of exactly what’s needed, but rather is what I’ve either acquired, or will acquire.

The basic architecture is as follows:

  • The cluster is built up on discrete nodes which are based around a very similar hardware stack and are tweaked for their function.
  • Persistent data storage is handled by the storage nodes using the Ceph object storage system. This requires that a majority quorum is maintained, and so a minimum of 3 storage nodes is required.
  • Virtual machines run on the compute nodes.
  • Management nodes oversee co-ordination of the compute nodes: this ideally should be a separate pair of machines, but for my use case, I intend to use a virtual machine or container managed using active/passive failover techniques.
  • In order to reduce virtual disk latency, the compute nodes will implement a local disk cache using an SSD, backed by a Rados Block Device on Ceph.

I’ll be using KVM as the virtualisation technology with Gentoo Linux as the base OS for this experimental cluster. At my workplace, we evaluated a few different technologies including Proxmox VE, Ganeti, OpenStack and OpenNebula. For this project, I intend to build on OpenNebula as it’s the simplest to understand and the most suited to my workplace’s requirements.

Using Gentoo makes it very easy to splice in patches as I’ll be developing as I go along. If I come to implement this in the office, I’ll be porting everything across to Ubuntu. This will be building on some experimental work I’ve done in the past with OpenNebula.

For the base nodes themselves, I’ve based them around these components:

For the storage nodes, add to the list:

Other things you may want/need:

  • A managed switch. I ended up choosing the Linksys LGS-326AU which U-Mart were selling at AU$294. If you’ve ever used Cisco’s small business offerings, this unit will make you feel right at home.
  • DIN rail. Jaycar sell this in 1m lengths, and I’ll grab some tomorrow.

I have most of the above bits, and the nodes are all basically built as of this afternoon, minus the SATA adaptors for the three storage nodes. All units power on, and do what one would expect of a machine that’s trying to boot from a blank SSD.

I did put one of the compute nodes through its paces, network booting the machine via PXE/NFS root and installing Gentoo.

Power consumption was below 1.8A for a battery voltage of about 13.4V, even when building the Linux 4.4.6 kernel (using make -j8), which it did in about 10 minutes. Watching this thing tackle compile jobs is a thing of beauty, can’t wait to get distcc going and have 40 CPU cores tear into the bootstrap process. The initial boot also looks beautiful, with 8 penguins lined up representing the 8 cores — don’t turn up here in a tuxedo!
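
To put that current figure in more familiar terms, power is just current times voltage, so below 1.8A at about 13.4V works out to roughly 24W as an upper bound for that one compute node. A quick sketch of the arithmetic (watts is a made-up helper name):

```shell
#!/bin/sh
# Hypothetical helper: power in watts from current (A) and voltage (V),
# rounded to one decimal place.
watts() {
    awk -v a="$1" -v v="$2" 'BEGIN { printf "%.1f\n", a * v }'
}

watts 1.8 13.4
```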

So hardware wise, things are more or less together, and it’ll mostly be software. I’ll throw up some notes on how it’s all wired, but basically the plan in the short term is a 240V mains charger (surplus from a caravan) will keep the battery floated until I get the solar panel and controller set up.

When that happens, I plan to wire a relay, controlled by a comparator, in series with the 240V charger, so that mains is connected when the battery voltage drops below 12V.

The switch is a 240V device unfortunately (couldn’t find any 24-port 12V managed switches) so it’ll run from an inverter. Port space is tight, and I just got the one since they’re kinda pricey. Long term, I might look at a second for redundancy, although if a switch goes, I won’t lose existing data.

ADSL2+ will be managed by a small localised battery back-up and a small computer as router, possibly a Raspberry Pi as I have one spare (original B model), which can temporarily store incoming SMTP traffic if the cluster does go down (heaven forbid!) and act as a management endpoint. There are a few contenders here, including these industrial computers, for which I already maintain a modern Linux kernel port for my workplace.

Things are coming together, and I hope to bring more on this project as it moves ahead.