Jul 29, 2017

So, I had a go at getting OpenNebula actually running on my little VM.  Earlier I had installed it to the /opt directory by hand, and today, I tried launching it from that directory.

To get the initial user set up, you have to create the ${ONE_LOCATION}/.one/one_auth file (in my case, /opt/opennebula/.one/one_auth) with the username and password for your initial user separated by a colon.  The idea is that this file is used to create the initial user; you then change the password once you're successfully logged in.
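
Something like this does the trick (the password here is a placeholder; oneadmin is OpenNebula's conventional initial admin):

# mkdir -p /opt/opennebula/.one
# echo 'oneadmin:ChangeMe' > /opt/opennebula/.one/one_auth
# chown -R opennebula:opennebula /opt/opennebula/.one
# chmod 600 /opt/opennebula/.one/one_auth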

That got me a little further, but then it still failed… turns out it doesn't like some options specified by default in the configuration file.  I commented out some options, and that got me a little further again.  oned started, but then went into la-la land: accepting connections, but refusing to answer queries from other tools, leaving them to time out.

I’ve since managed to package OpenNebula into a Gentoo ebuild, which I have now published in a dedicated overlay.  I was able to automate a lot of the install process this way, but I was still no closer.

On a hunch, I tried installing the same ebuild on my laptop.  Bingo… in mere moments I was staring at the OpenNebula Sunstone UI in my browser: it was working.  The difference?  My laptop runs Gentoo with the standard glibc C library, not musl.  OpenNebula compiled just fine against musl, but perhaps a difference in how musl handles threads or some such (musl takes a hard-line POSIX approach) is causing a deadlock.

So, I’m rebuilding the VM using glibc now.  We shall see where that gets us.  At least now I have the install process automated. 🙂

Jul 23, 2017

So, the front-end for OpenNebula will be a VM that migrates between the two compute nodes in an HA arrangement.  Likewise with the core router and border router, although I am also tossing up trying again with the little Advantech UNO-1150G I have lying around.

For now, I’ve not yet set up the HA part; I’ll come to that.  There are guides for using libvirt with corosync/heartbeat; most also call for DRBD as the block device for the VM, but we won’t be using that, as our block device (Ceph’s RADOS Block Device) is already redundant.

To host OpenNebula, I’ll use Gentoo with musl-libc since that’ll shrink the footprint down just a little bit.  We’ll run it on a MariaDB back-end.

Since we’re using musl, you’ll want to install layman and the musl overlay as not all packages build against musl out-of-the-box.  Also install gentoolkit, as you’ll need to set USE flags, and euse makes this easy:

# emerge layman
# layman -L
# layman -a musl
# emerge gentoolkit

Now that some basic packages are installed, we need to install OpenNebula’s prerequisites.  Among these, they tell you, is xmlrpc-c.  But they don’t tell you that it needs support for the Abyss server: the scons-based build system they use will just give you a cryptic error saying it couldn’t find xmlrpc.  The answer is not, as suggested, to specify the path to xmlrpc-c-config (which happens to be in ${PATH} anyway), as that nets the same result, and breaks things later when you fix the real issue.  Instead, set the abyss USE flag:

# euse -p dev-util/xmlrpc-c -E abyss
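
That just drops a line into /etc/portage/package.use, so if you prefer to edit that by hand, the equivalent entry is:

dev-util/xmlrpc-c abyss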

Now we can build the dependencies… this isn’t a full list, but includes everything that Gentoo ships in repositories, the remaining Ruby gems will have to be installed separately.

# emerge --ask dev-lang/ruby dev-db/sqlite dev-db/mariadb \
dev-ruby/sqlite3 dev-util/xmlrpc-c dev-util/scons \
dev-ruby/json dev-ruby/sinatra dev-ruby/uuidtools \
dev-ruby/curb dev-ruby/nokogiri

With that done, create a user account for OpenNebula:

# useradd -d /opt/opennebula -m -r opennebula

Now you’re set to build OpenNebula itself:

# tar -xzvf opennebula-5.4.0.tar.gz
# cd opennebula-5.4.0
# scons mysql=yes

That’ll run for a bit, but should succeed. At the end:

# ./install.sh -d /opt/opennebula -u opennebula -g opennebula

That’s about where I’m at now… the link in the README for further documentation is broken; here is where they keep their current documentation.

Jul 23, 2017

So, having got some instances going… I thought I’d better sort out the networking issues properly.  While it was working, I wanted to do a few things:

  1. Bring a dedicated link down from my room into the rack directly for redundancy
  2. Define some more VLANs
  3. Sort out the intermittent faults being reported by Ceph

I decided to tackle (1) first.  I have two 8-port Cisco SG-200 switches linked via a length of Cat5E that snakes its way from our study, through the ceiling cavity then comes up through a small hole in the floor of my room, near where two brush-tail possums call home.

I drilled a new hole next to where the existing cable entered, then came the fun of trying to feed the new cable alongside the old one.  First attempt had the cable nearly coil itself just inside the cavity.  I tried to make a tool to grab the end of it, but it was well and truly out of reach.  I ended up getting the job done by taping the cable to a section of fibreglass tubing, feeding that in, taping another section of tubing to that, feeding that in, etc… but then I ran out of tubing.

Luckily, a rummage around turned up some rigid plastic that I was able to tape to the tubing, and that got me within half a metre of my target.  Brilliant, except I forgot to put a leader cable through for next time, didn’t I?

So more rummaging around for a length of suitable nylon rope, tape the rope to the Cat5E, haul the Cat5E out, then grab another length of rope and tape that to the end and use the nylon rope to haul everything back in.

The rope should be handy for when I come to install the solar panels.

I had one 16-way patch panel, so wound up terminating the rack end with that, and just putting an RJ-45 on the end in my room and plugging that directly into the switch.  So on the shopping list will be some RJ-45 wall jacks.

The cable tester tells me I possibly have brown and white-brown switched, but never mind; I’ll be re-terminating it properly when I get the parts, and that pair isn’t used anyway.

The upshot: I now have a nice 1Gbps ring between the two SG-200s and the LGS326 in the rack.  No animals were harmed in the formation of this ring, although two possums were mildly inconvenienced.  (I call that payback for the times they’ve held the Marsupial Olympics at 2AM when I’m trying to sleep!)

Having gotten the physical layer sorted out, I was able to introduce the upstairs SG-200 to the new switch, then remove the single-port LAG I had defined on the downstairs SG-200.  A bit more tinkering, and I had a nice redundant set-up: setting my laptop to ping one of the instances in the cluster over WiFi, I could unplug my upstairs trunk, wait a few seconds, plug it back in, wait some more, unplug the downstairs trunk, wait some more again, then plug it back in again, and not lose a single ICMP packet.

I moved my two switches and my AP over to the new management VLAN I had set up, alongside the IPMI interfaces on the nodes.  The SG-200s were easy; aside from them insisting on one port being configured with a PVID equal to the management VLAN (I guess they want to ensure you don’t get locked out), it all went smoothly.

The AP though, a Cisco WAP4410N… not so easy.  In their wisdom, and unlike the SG-200s, the management VLAN settings page is separate from the IP interface page, so you can’t change both at the same time.  I wound up changing the VLAN, only to find I had locked myself out.  Much swearing at the cantankerous AP, and wondering how someone could overlook such a fundamental requirement!  That, and the switch the AP plugs into helpfully didn’t add the management VLAN to the right port like I’d asked of it.

Once that was sorted out, I was able to configure an IP on the old subnet and move the AP across.

That just left dealing with the intermittent issues with Ceph.  My original intention with the cluster was to use 802.3ad so each node had two 2Gbps links.  Except: the LGS326-AU only supports 4 LAGs, and for me to do this, I’d need 10!

Thankfully, the bonding support in the Linux kernel has several other modes available.  Switching from 802.3ad to balance-tlb resolved the issue.  The relevant bits of each node’s configuration:

slaves_bond0="enp0s20f0 enp0s20f1"
slaves_bond1="enp0s20f2 enp0s20f3"
rc_net_bond0_need="net.enp0s20f0 net.enp0s20f1"
rc_net_bond1_need="net.enp0s20f2 net.enp0s20f3"
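
The mode itself is set the same way (a sketch, assuming these lines live in /etc/conf.d/net under OpenRC’s netifrc, which passes them through to the kernel bonding driver):

mode_bond0="balance-tlb"
mode_bond1="balance-tlb"

A quick way to confirm the running mode:

# grep "Bonding Mode" /proc/net/bonding/bond0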

I am now setting up a core router instance (with OpenBSD 6.1) and an OpenNebula instance (with Gentoo AMD64/musl libc).

Jul 15, 2017

So… last weekend I got to trying out the I/O modules. I rigged up wiring harnesses for all five push-buttons and their associated LED strings.

I used Cat5e since I’ve got loads of it around… and ran four strands up to each button, carrying +12V, MOSFET drain, GPIO and 0V.  I ran a resistor between +12V and GPIO to pull it high, with the switch’s NO and common contacts connected between GPIO and 0V.  The button’s illumination LEDs and the LED string both connect in parallel between +12V and MOSFET drain.

So far so good.  I have just a 3-pin connection on the I/O module, carrying all but the +12V.  That remaining wire I hooked to +12V directly on the input feed.  Not having a 0V feed going directly to the power supply, though, was my mistake.

If that connection is broken, the 0V reference on the I/O module floats, and the zeners, meant to protect the MCU from +12V, do nothing.  So I suspect an MC14066 copped a belt of +12V by mistake!

Not sure, but I thought I’d play it safe and build a new one anyway. I started on it earlier this week and finished it this afternoon. This time around I opted to put LEDs on the outputs of the 74HC574. No current-limiting resistors, since they’re meant to be ⅛ duty cycle anyway, and if they smoke, well, who cares?

This highlighted a glitch on the GPIO_EN signal during programming: the LEDs would illuminate while the MCU was being programmed.  A pull-up resistor helps here.

This time around I tested using a 9V supply from my electronics kit, which runs off (somewhat stale) AA cells.  Handily, the plugs for the LED strings have LEDs in them themselves, and those work at 9V, so we can use them to see what the LED string would do.

On the software side, I finally fixed polyphonics and a key-sticking issue. One of my I/O modules has stopped working for whatever reason, but the others are working fine, as can be seen here:

The other issue I have to chase is some leakage on the blue and white buttons: the plugs for them are not meant to be glowing unless pressed, but you’ll note they are glowing significantly, and get brighter when those buttons are pressed.  I might have to re-visit those connections.  Otherwise though, very good progress today.

Jul 6, 2017

So, since my last log, I’ve managed to tidy up the wiring on the cluster, making use of the plywood panel at the back to mount all my DC power electronics, and generally tidying everything up.

I had planned to use an SB50 connector to connect the cluster up to the power supply, so made provisions for this in the wiring harness.  Turns out, this was not necessary; it was easier in the end to just pull apart the existing wiring and hard-wire the cluster up to the charger input.

So, I’ve now got a spare load socket hanging out the front, which will be handy if we wind up with unreliable mains power in the near future since it’s a convenient point to hook up 12V appliances.

There’s a solar power input there ready, and space to the left of that to build a little control circuit that monitors the solar voltage and switches in the mains if needed. For now though, the switching is done with a relay that’s hard-wired on.

Today though, I managed to get the Ceph clients set up on the two compute nodes.  virt-manager, however, is buggy where it comes to RBD pools.  In particular, adding an RBD storage pool doesn’t work, as there’s no way to define authentication keys, and even if you have the pool defined, trying to use images from that pool causes virt-manager to complain it can’t find the image on your local machine.  (Well, duh!  This is a known issue.)

I was able to find an XML cheat-sheet for defining a domain in libvirt, which I was then able to use with Ceph’s documentation.
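
For reference, the Ceph secret side of this boils down to something like the following (cribbed from Ceph’s libvirt guide; the client.libvirt user and the file name are assumptions, and the UUID is whatever secret-define prints back at you):

# cat > ceph-secret.xml <<EOF
<secret ephemeral='no' private='no'>
  <usage type='ceph'>
    <name>client.libvirt secret</name>
  </usage>
</secret>
EOF
# virsh secret-define --file ceph-secret.xml
# virsh secret-set-value --secret 23daf9f8-1e80-4e6d-97b6-7916aeb7cc62 \
    --base64 "$(ceph auth get-key client.libvirt)"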

A typical instance looks like this:

<domain type='kvm'>
  <!-- name of your instance -->
  <name>myinstance</name>
  <!-- a UUID for your instance, use `uuidgen` to generate one -->
  <uuid>00000000-0000-0000-0000-000000000000</uuid>
  <!-- memory/vCPU sizes here are illustrative -->
  <memory unit='MiB'>1024</memory>
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64'>hvm</type>
  </os>
  <clock offset='utc'/>
  <devices>
    <disk type='network' device='disk'>
      <source protocol='rbd' name='poolname/image.vda'>
        <!-- the hostnames or IPs of your Ceph monitor nodes -->
        <host name='s0.internal.network' />
        <host name='s1.internal.network' />
        <host name='s2.internal.network' />
      </source>
      <target dev='vda'/>
      <auth username='libvirt'>
        <!-- the UUID here is what libvirt allocated when you did
             `virsh secret-define foo.xml`; use `virsh secret-list`
             if you've forgotten what that is. -->
        <secret type='ceph' uuid='23daf9f8-1e80-4e6d-97b6-7916aeb7cc62'/>
      </auth>
    </disk>
    <disk type='network' device='cdrom'>
      <source protocol='rbd' name='poolname/image.iso'>
        <!-- likewise, your Ceph monitor nodes -->
        <host name='s0.internal.network' />
        <host name='s1.internal.network' />
        <host name='s2.internal.network' />
      </source>
      <target dev='hdd'/>
      <auth username='libvirt'>
        <secret type='ceph' uuid='23daf9f8-1e80-4e6d-97b6-7916aeb7cc62'/>
      </auth>
    </disk>
    <interface type='network'>
      <source network='default'/>
      <mac address='11:22:33:44:55:66'/>
    </interface>
    <graphics type='vnc' port='-1' keymap='en-us'/>
  </devices>
</domain>
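
Save that out to a file and load it with virsh (the file name here is made up):

# virsh define myinstance.xml
# virsh start myinstance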

Having defined the domain, you can then edit it at will in virt-manager. I was able to switch the network interface over to using virtio, plop it on a bridge so it was wired up to the correct VLAN and start the instance up.
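
After that change, the interface stanza ends up looking something like this (br0 is a stand-in for whatever your bridge is called):

<interface type='bridge'>
  <source bridge='br0'/>
  <model type='virtio'/>
  <mac address='11:22:33:44:55:66'/>
</interface>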

I’ve since managed to migrate 3 instances over, namely an estate database, Brisbane Area WICEN’s OwnCloud site, and my own blog.

These are sufficient to try the system out.  I’m already finding these instances much more responsive, even using raw Ceph, than the original server.

My next move I think will be to see if I can get corosync/heartbeat to manage an HA VM instance; that is, if one of the compute nodes goes offline, the instance restarts on the other compute node.
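
If I go the pacemaker-on-corosync route, I’d expect the resource definition to look vaguely like this (an untested sketch using the ocf:heartbeat:VirtualDomain agent; the resource name and config path are made up):

# pcs resource create one-frontend ocf:heartbeat:VirtualDomain \
    config=/etc/libvirt/qemu/one-frontend.xml \
    migration_transport=ssh \
    meta allow-migrate=true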

Two services come to mind where HA is concerned: terminating the PPPoE link for our Internet, and a virtual management node for a higher-level system such as OpenNebula. OpenNebula really needs something semi-HA, since it really gets its knickers in a twist if the master node goes down. I also want my border router to be HA, since I won’t necessarily be around to migrate it to a different node.

Everything else, well, I suspect OpenNebula can manage those itself; long term, the instances I liberated today from my old box will become instances within OpenNebula.

The other option is I dip my toe into OpenStack (again), since it is inherently HA by design, but it is also a royal pain to get working.

Jul 2, 2017

So… earlier in the week I received some 74HC574s (the right chip) to replace the 74HC573s I tried to use.

My removal technique was not pretty: I wound up just cutting the legs off the hapless 74HC573 (my earlier hack had busted a pin on it anyway) and removing it that way.  Since the holes were full of solder, it was easier to just bend the pins on the ’574 and surface-mount it.

This afternoon, I gave it a try, and lo and behold, it worked.  I even tried hooking up one of the LED strings and driving that with the MOSFET… no problems at all.

I had only built up one of the modules at this stage, so I built another 5 on the same piece of strip board.

The requirement is for 5 channels… this meets that and adds an extra one (the board was wide enough).  For the full complement, and to have ICSP/networking via external connections, a second board like this could be made, omitting the MOSFETs on four of the channels to handle the ICSP control lines, and reducing the capacitances/resistances to suit.

Somewhere I have some TVS diodes for this board, but of course, they have legs, upon which they got up and ran away.  They haven’t resurfaced yet; I’m sure they will if I buy more though.  The spare footprint on the top-left of the main board is where one TVS diode goes; the others go on the I/O board, two for each channel.