ceph

Solar Cluster: Storage replacements and upgrades

So recently I was musing about how I might go about expanding the storage on the cluster. This was largely driven by the fact that I was about 80% full, and thus needed to increase capacity somehow.

I also was noting that the 5400RPM HDDs (HGST HTS541010A9E680), now with a bit of load, were starting to show signs of not keeping up. The cases I have can take two 2.5″ SATA HDDs, one spot is occupied by a boot drive (120GB SSD) and the other a HDD.

A few weeks ago, I had a node fail. That really did send the cluster into a spin, since due to space constraints, things weren’t as “redundant” as I would have liked, and with one disk down, I/O throughput which was already rivalling Microsoft Azure levels of slow, really took a bad downward turn.

I hastily bought two NUCs, which I’m working towards deploying… with those I also bought two 120GB M.2 SSDs (for boot drives) and two 2TB HDDs (WD Blues).

It was at that point I noticed that some of the working drives were giving off the odd read error which was throwing Ceph off, causing “inconsistent” placement groups. At that point, I decided I’d actually deploy one of the new drives (the old drive was connected to another node so I had nothing to lose), and I’ll probably deploy the other shortly. The WD Blue 2TB drives are also 5400RPM, but unlike the 1TB Hitachis I was using before, have 128MB of cache vs just 8MB.

That should boost the read performance just a little bit. We’ll see how they go. I figured this isn’t mutually exclusive with the plans for external storage upgrades. I can still buy and mod external enclosures like I planned, but with a bit more breathing room, the immediate need has passed.

I’ve since ordered another 3 of these drives, two will replace the existing 1TB drives, and a third will go back in the NUC I stole a 2TB drive from.

Thinking about the problem more, one big issue is that I don’t have room inside the case for 3 2.5″ HDDs, and the motherboards I have do not feature mSATA or M.2 SATA. I might cram a PCIe SSD in, but those are pricey.

The 120GB SSD is only there as a boot drive. If I could move that off to some other medium, I could possibly move to a bigger SSD in place of the 120GB SSD, maybe a ~500GB unit. These are reasonably priced. The issue is then where to put the OS.

An unattractive option is to shove a USB stick in and boot off that. There are no internal USB ports, but there are two front USB ports in the case I could rig up to an internal header so they’re not sticking out like a sore thumb(-drive) begging to be broken off by a sideways slap. The flash memory in these is usually the cheapest variety, so maybe if I went this route, I’d buy two: one for the root FS, the other for swap/logs.

The other option is a Disk-on-Module. The motherboards provide the necessary DC power connector for running these things, and there’s a chance I could cram one in there. They’re pricey, but not as bad as going NVMe SSDs, and there’s a greater chance of success squeezing this in.

Right now I’ve just bought a replacement motherboard and some RAM for it… this time the 16-core model, and it takes full-size DIMMs. It’ll go back in as a compute node with 32GB RAM (I can take it all the way to 256GB if I want to). Coupled with that and a purchase of some HDDs, I think I’ll let the bank account cool off before I go splurging more. 🙂

Solar Cluster: RIP hydrogen

Well, it had to happen some day, but I was hoping it’d be a few more years off… I’ve had the first node failure on the cluster.

One of my storage nodes decided to keel over this morning, some time between 5 and 8AM… sending the cluster into utter chaos. I tried power cycling the host a few times before finally yanking it from the DIN rail and trying it on the bench supply. After about 10 minutes of pulling SO-DIMMs and general mucking around trying to coax it to POST, I pulled the HDD out, put that in an external dock and connected that to one of the other storage nodes. After all, it was approaching 9AM and I needed to get to work!

A quick bit of work with ceph-bluestore-tool and I had the OSD mounted and running again. The cluster is moaning that it’s lost a monitor daemon… but it’s still got the other two so provided that I can keep O’Toole away (Murphy has already visited), I should be fine for now.

This evening I took a closer look and tried the RAM I had in different slots. Even with the RAM removed, there are no signs of life out of the host itself: I should get beep codes with no RAM installed. I ran my multimeter across the various power rails I could get at: the 5V and 12V rails look fine. The IPMI BMC works, but that’s about as much as I get. I guess once the board is replaced, I might take a closer look at that BMC, see how hackable it is.

I’ve bought a couple of spare nodes which will probably find themselves pressed into monitor node duty, two Intel NUC7I5BNHs have been ordered, and I’ll pick these up later in the week. Basically one is to temporarily replace the downed node until such time as I can procure a more suitable motherboard, and the other is a spare.

I have an M.2 SATA SSD I can drop in along with some DDR4 RAM I bought by mistake, and of course the HDD for that node is sitting in the dock. The NUCs are perfectly fine running anywhere from 10.8V right up to 19V (verified on a NUC6CAYS), so no 12V regulator is needed.

The only down-side with these units is the single Ethernet port, however I think this will be fine for monitor node duty, and two additional nodes should mean the storage cluster becomes more resilient.

The likely long-term plan may be an upgrade of one of the compute nodes. For ~$1600, I can get an A2SDi-16C-HLN4F, which sports 16 cores and takes full-size DDR4 DIMMs. I can then rotate the board out of that into the downed node.

The full-size DIMMs are much more readily available in ECC format, so that should make long-term support of this cluster much easier as the supplies of the SO-DIMMs are quickly drying up.

This probably means I should pull my finger out and actually do some of the maintenance I had been planning but put off… largely due to a lack of time. It’s just typical that everything has to happen when you are least free to deal with it.

Solar Cluster: Considering storage expansion

One problem I face with the cluster as it stands now is that 2.5″ HDDs are actually quite restrictive in terms of size options.

Right now the whole shebang runs on 1TB 5400RPM Hitachi laptop drives, which so far has been fine, but now that I’ve put my old server on as a VM, that’s chewed up a big chunk of space. I can survive a single drive crash, but not two.

I can buy 2TB HDDs, WD make some and Scorptec sell them. Seagate make some bigger capacity drives, however I have a policy of not buying Seagate.

At work we built a Ceph cluster on 3TB SV35 HDDs… 6 of them to be exact. Within 9 months, the drives started failing one-by-one. At first it was just the odd drive being intermittent, then the problem got worse. They all got RMAed, all 6 of them. Since we obviously needed drives to store data on until the RMAed drives returned, we bought identically sized consumer 5400RPM Hitachi drives. Those same drives are running happily in the same cluster today, some 3 years later.

We also had one SV35 in a 3.5″ external enclosure that formed my workplace’s “disaster recovery” back-up drive. The idea being that if the place was in great peril and it was safe enough to do so, someone could just yank this drive from the rack and run. (If we didn’t, we also had truly off-site back-up NAS boxes.) That wound up failing as well before its time was due. That got replaced with one of the RMAed disks and used until the 3TB no longer sufficed.

Anyway, enough of that diversion. Long story short, I don’t trust Seagate disks for 24/7 operation. I don’t see manufacturers other than Seagate (e.g. WD, Samsung, Hitachi) making >2TB HDDs in the 2.5″ form factor. They all seem to be going SSD.

I have a Samsung 850EVO 2TB in the laptop I’m writing this on, bought a couple of years ago now, and so far, it has been reliable. The cluster also uses 120GB 850EVOs as OS drives. There’s now a 4TB version as well.

The performance would be wonderful and they’d reduce the power consumption of the cluster, however, 3 4TB SSDs would cost $2700. That’s a big investment!

The other option is to bolt on a 3.5″ HDD somehow. A DIN-rail mounted case would be ideal for this. 3.5″ high-capacity drives are much more common, and use technology which is proven reliable and comparatively inexpensive.

In addition, by going to bigger external drives it also means I can potentially swap out those 2.5″ HDDs for SSDs at a later date. A WD Purple (5400RPM) 4TB sells for $166. I have one of these in my desktop at work, and so far its performance there has been fine. $3 more and I can get one of the WD Red (7200RPM) 4TB drives which are intended for NAS use. $265 buys a 6TB Toshiba 7200RPM HDD. In short, I have options.
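To put those options side by side, here is a quick cost-per-terabyte calculation using only the prices quoted in these posts. The helper name price_per_tb is made up for illustration, and the maths is integer-only so it runs in any POSIX shell.

```shell
# Print the price per terabyte in dollars and cents, using integer maths
# (the result is rounded down to the nearest cent).
price_per_tb() {
    price_dollars="$1"
    capacity_tb="$2"
    cents=$(( price_dollars * 100 / capacity_tb ))
    printf '%d.%02d\n' $(( cents / 100 )) $(( cents % 100 ))
}

price_per_tb 166 4     # WD Purple 4TB
price_per_tb 265 6     # Toshiba 6TB
price_per_tb 2700 12   # three 4TB SSDs
```

That works out to roughly $41.50, $44 and $225 per terabyte respectively, which is why the spinning disks still look attractive for bulk capacity.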

Now, mounting the drives in the rack is a problem. I could just make a shelf to sit the drive enclosures on, or I could buy a second rack and move the servers into that which would free up room for a second DIN rail for the HDDs to mount to. It’d be neat to DIN-rail mount the enclosures beside each Ceph node, but right now, there’s no room to do that.

I’d also either need to modify or scratch-make a HDD enclosure that can be DIN-rail mounted.

There’s then the thorny issue of interfacing. There are two options at my disposal: eSATA and USB3. (Thunderbolt and Firewire aren’t supported on these systems and adding a PCIe card would be tricky.)

The Supermicro motherboards I’m using have 6 SATA ports. If you’re prepared to live with reduced cable lengths, you can use a passive SATA to eSATA adaptor bracket — and this works just fine for my use case since the drives will be quite close. I will have to power down a node and cut a hole in the case to mount the bracket, but this is doable.

I haven’t tried this out yet, but I should be able to use the same type of adaptor inside the enclosure to connect the eSATA cable to the HDD. Trade-off will be further reduced cable distances, but again, they don’t need to go more than 30cm, it’ll most likely work fine.

The other interface option is USB 3.0. The motherboards have two back-panel USB 3.0 connectors and inside, two USB 3.0 ports I can potentially expose. This can be hot-plugged without changing my cluster as it stands now. The down-side is that USB incurs a greater CPU overhead than SATA.

During my migration to BlueStore, I used exactly this to provide a “temporary” OSD disk… a 1TB 7200RPM WD black in a HDD dock. The performance of that was fine, and in that case, I was willing to put up with the overhead as it was temporary.

External eSATA cases seem to be going the way of the dodo, I haven’t seen many available for sale from my usual suppliers. USB 3.0 seems to have taken over, probably because for most uses, it is “good enough”. I did ask about whether one is preferred over the other for Ceph OSD use on the Ceph mailing list, but heard nothing.

As it was, prior to undertaking the migration, I bought such a case, an el’cheapo Simplecom SE-325, along with a 4TB WD Blue for the actual drive. I was tossing up between that and a LaCie “Porsche” 4TB drive, but the winning factor was that I’d know what I was buying. The LaCie drive could have had anything in there: manufacturers can and sometimes do substitute components in different manufacturing runs. Buying the case and drive separately didn’t run that risk.

The case and drive did the job. I hooked the drive up to my laptop (I had forgotten xhci_hcd support in the storage nodes’ kernels, which I have since fixed) and pulled a snapshot of every VM disk (Rados block device) off the Ceph cluster onto this drive as a raw disk image so I would not lose data. The drive easily kept up with the GbE link I had to the downstairs switch, and a core in the Core i5-3320M in this laptop is probably on par with the ones in the Avoton C2750s running the show.

To DIN-rail mount this, I’d need to make a cradle to take the case, and I’d need to hack some forced-ventilation into the top cover, which isn’t a difficult job. (Drill some holes, then use a nibbler tool to cut slots, then mount a small fan.)

The original PSU for this case is a 12V 2A wall wart, easily substituted with a 12V 3A LDO such as the LM1085IT-12. I may even be able to squeeze it and a heatsink into the case. I presently use one of these with the border router with a small heatsink, and so far, no problems.

If I later want eSATA, I can unscrew the original PCB and should be able to hack that in.

Short term, I can place a temporary shelf atop the battery cases and sit the HDDs there until I figure out more permanent arrangements.

Right now I’ve been battling a few health problems (sharp-eyed readers may recognise the box of “gunk” in the background which is now empty and the accompanying documentation — I’ll know more next Friday morning), and so I’ll wait until I know the outcome of those tests as there’s no point in building something grand if I’m not going to be around to enjoy it.

Solar Cluster: Adventures in Ceph migration

My cloud computing cluster, like all cloud computing clusters, of course needs a storage back-end. There were a number of options I could have chosen, but the one I went with in the end was Ceph, and so far, it’s run pretty well.

Lately though, I was starting to get some odd crashes out of ceph-osd. I was running release 10.2.3, one of the earlier Jewel releases, which is quite dated now. Adding to the fun, I’m running btrfs as my filesystem on the OS and the OSD, and I’m running it all on Gentoo. On top of this, my monitor nodes are my OSDs as well.

Not exactly a “supported” configuration, never mind the hacks done at hardware level.

There was also a nagging issue about too many placement groups in the Ceph cluster. When I first established the cluster, I christened it by dragging a few of my lxc containers off the old server and making them VMs in the cluster. This was done using libvirt and virt-manager. These got thrown into a storage pool called transitional-inst, with a VLAN set aside for the VMs to use. When I threw OpenNebula on, I created another Ceph pool, one for its images. The configuration of these led to the “too many placement groups” warning, which until now, I just ignored.

This weekend was a long weekend, for controversial reasons… and so I thought I’d take a snapshot of all my VMs, download those snapshots to a HDD as raw images, then see if I can fix these issues, and migrate to Ceph Luminous (v12.2.10) at the same time.

Backing up

I was going to be doing some nasty things to the cluster, so I thought the first thing to do was to back up all images. This was done by using rbd snap create pool/image@date to create a snapshot of an image, then rbd export pool/image@date /path/to/storage/pool-image.img before blowing away the snapshot with rbd snap rm pool/image@date.

This was done for all images on the Ceph cluster, stashing them on a 4TB hard drive I had bought for the purpose.
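The three rbd steps above can be wrapped up per image. This is a minimal sketch rather than the exact script used at the time: the function name backup_image, its arguments and the date-stamped snapshot name are all assumptions for illustration.

```shell
# Sketch only: snapshot, export, then remove the snapshot for one RBD image.
# backup_image and its arguments are hypothetical names.
backup_image() {
    pool="$1"      # Ceph pool name
    image="$2"     # RBD image name within the pool
    dest="$3"      # directory on the backup drive
    snap="${4:-$(date +%Y%m%d)}"   # snapshot name, defaults to today's date

    # Stop at the first failure so a half-finished export is noticed.
    rbd snap create "${pool}/${image}@${snap}" || return 1
    rbd export "${pool}/${image}@${snap}" "${dest}/${pool}-${image}.img" || return 1
    rbd snap rm "${pool}/${image}@${snap}"
}

# Example (pool, image and mount point are placeholders):
#   backup_image one vm-disk-0 /mnt/backup
```

Running it once per image listed by rbd ls for each pool would cover the whole cluster.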

Getting things ready

My cluster is actually set up as a distcc cluster, with Apache HTTP server instances sharing out distfiles and binary package repositories, so if I build packages on one, I can have the others fetch the binary packages that it built. I started with a node, and got it to update all packages except Ceph. Made sure everything was up-to-date.

Then, I ran emerge -B =ceph-10.2.10-r2. This was the first step in my migration, I’d move to the absolute latest Jewel release available in Gentoo. Once it built, I told all three storage nodes to install it (emerge -g =ceph-10.2.10-r2). This was followed up by a re-start of the mon daemons on each node (one at a time), then the mds daemons, finally the osd daemons.

Resolving the “too many placement groups” warning

To resolve this, I first researched the problem. An Internet search led me to this Stack Overflow post. In it, it was suggested the problem could be alleviated by making a new pool with the correct settings, then copying the images over to it and blowing away the old one.

As it happens, I had an easier solution… move the “transitional” images to OpenNebula. I created empty data blocks in OpenNebula for the three images, then used qemu-img convert -p /path/to/image.img rbd:pool/image to upload the images.

It was then a case of creating a virtual machine template to boot them. I put them in a VLAN with the other servers, and when each one booted, edited the configuration with the new TCP/IP settings.

Once all those were moved across, I blew away the old VMs and the old pool. The warning disappeared, and I was left with a HEALTH_OK message out of Ceph.

The Luminous moment

At this point I was ready to try migrating. I had a good read of the instructions beforehand. They seemed simple enough. I prepared as I did before by updating everything on the system except Ceph, then, telling Portage to build a binary package of Ceph itself.

Then I deployed the binary to the three nodes.

First step was to re-start the monitors… this went smoothly, I just did a /etc/init.d/ceph-mon.${HOST} restart on each one individually, and after a brief moment, quorum was re-established. I then deployed a manager daemon to each one — basically I just “copied” my monitor symbolic link, changing mon to mgr, added it to OpenRC’s list, then started them. No problems.
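The "copy the symbolic link" step might look something like the following. It is a sketch under the assumption that the monitor service is a symlink named ceph-mon.<host> in the init script directory; the helper name add_mgr_link and the choice of the default runlevel are not from the original notes.

```shell
# Sketch: create a ceph-mgr init script link next to an existing ceph-mon one.
# Assumes ${initd}/ceph-mon.<host> is a symlink; add_mgr_link is a made-up name.
add_mgr_link() {
    initd="$1"   # normally /etc/init.d
    host="$2"    # the name used in the monitor's service link
    target="$(readlink "${initd}/ceph-mon.${host}")" || return 1
    ln -s "${target}" "${initd}/ceph-mgr.${host}"
}

# Then, on each node (runlevel "default" is an assumption):
#   add_mgr_link /etc/init.d "${HOST}"
#   rc-update add "ceph-mgr.${HOST}" default
#   rc-service "ceph-mgr.${HOST}" start
```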

The OSDs though were still running the Jewel release.

I proceeded as before, trying a re-start of the first OSD. After a while it hadn’t come back…

2019-01-27 14:42:59.745860 7f28fac06e00 -1 filestore(/var/lib/ceph/osd/ceph-0) _detect_fs(1197): deprecated btrfs support is not enabled

Ohh bugger, so no btrfs support. This is where the fun began. At this point I was a bit flustered and thought I’d have to either migrate these nodes to XFS, or to BlueStore. So immediately I started looking at the BlueStore migration documentation, as I did not want to risk re-starting the other two OSDs and losing access to my data!

A hasty BlueStore migration

So, I started this by running ceph osd out 0 to start my now downed OSD 0 on the path of migration. The fact it was already down didn’t click with me. I then tried running ceph osd safe-to-destroy 0, only to be told Error EINVAL: (22) Invalid argument.

Uhh ohh, this isn’t good. I waited a bit, but also part of me said: there should be a copy of everything on this node, on at least one of the other two nodes. I had configured it to maintain at least two copies of everything, so even if this node went up in smoke, the data should be recoverable.
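For an OSD that is still up and running, the usual pattern is to mark it out and then poll safe-to-destroy until the data has been copied elsewhere. A rough sketch of that wait follows; the function name drain_osd and the polling interval are illustrative only, and it would not have helped with an OSD that was already down.

```shell
# Sketch: mark an OSD out, then wait until Ceph reports it is safe to destroy.
# drain_osd and the default interval are illustrative names and values.
drain_osd() {
    id="$1"
    interval="${2:-60}"   # seconds between checks
    ceph osd out "${id}" || return 1
    # safe-to-destroy exits non-zero until no placement group depends on the OSD.
    until ceph osd safe-to-destroy "${id}"; do
        sleep "${interval}"
    done
}

# Example: drain_osd 1
```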

With great trepidation, I continued and tried destroying the OSD, then creating a BlueStore one in its place… only to have the ceph-volume command blow up. It couldn’t find the keyring, then when I got that sorted out, it was failing to talk to systemd, then when I found the --no-systemd argument, it still failed because of LVM. I therefore realised I needed two things:

  1. I needed the bootstrap-osd keyring that ceph-deploy normally creates.
  2. The lvmetad daemon must be running.

For (1), this is taken care of with the following commands:

# ceph auth add client.bootstrap-osd --cap mon 'profile bootstrap-osd'
# mkdir /var/lib/ceph/bootstrap-osd
# ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring

As for (2), install sys-fs/lvm2 and add lvmetad to your start-up services. Also add lvm, as you’ll want that at boot. (I learned this later.)
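Under OpenRC that amounts to something like the sketch below. The boot runlevel is an assumption (pick whichever runlevel suits your setup), and enable_lvm_services is just a made-up wrapper name.

```shell
# Sketch: enable and start the LVM services under OpenRC.
# The "boot" runlevel is an assumption; enable_lvm_services is a made-up name.
enable_lvm_services() {
    for svc in lvmetad lvm; do
        rc-update add "${svc}" boot || return 1
        rc-service "${svc}" start || return 1
    done
}

# Example (as root): enable_lvm_services
```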

After doing that, the following command worked:

ceph-volume lvm create --bluestore --data /dev/sdb \
--osd-id 0 --no-systemd

The --no-systemd is important on Gentoo with OpenRC as there is no systemctl binary. Once I did that, I found I could start my OSD again. Data recovery began at once. The data recovery was an overnight effort: on my hardware, it took until 3PM today to migrate all the placement groups over to the newly re-formatted OSD.

Migrating the other nodes

For now, they still run btrfs. In my “ohh crap” state, I didn’t see the little hint given:

2019-01-27 14:40:55.147888 7f8feb7a2e00 -1 *** experimental feature 'btrfs' is not enabled ***
This feature is marked as experimental, which means it
 - is untested
 - is unsupported
 - may corrupt your data
 - may break your cluster is an unrecoverable fashion
To enable this feature, add this to your ceph.conf:
  enable experimental unrecoverable data corrupting features = btrfs

2019-01-27 14:40:55.147901 7f8feb7a2e00 -1 filestore(/var/lib/ceph/osd/ceph-0) _detect_fs(1197): deprecated btrfs support is not enabled
2019-01-27 14:40:55.147906 7f8feb7a2e00 -1 filestore(/var/lib/ceph/osd/ceph-0) mount(1523): error in _detect_fs: (1) Operation not permitted
2019-01-27 14:40:55.147926 7f8feb7a2e00 -1 osd.0 0 OSD:init: unable to mount object store

Not feeling like a 24-hour wait, I did as it told me:

osd pool default size = 2  # Write an object n times.
osd pool default min size = 1 # Allow writing n copy in a degraded state.
osd pool default pg num = 128
osd pool default pgp num = 128
osd crush chooseleaf type = 1
osd max backfills = 10

# Allow btrfs to work:
enable experimental unrecoverable data corrupting features = btrfs

Now, my other OSDs re-started successfully, and I could finally finish off by restarting the metadata daemons and completing the migration. I’m now left with two OSDs with BTRFS and one with BlueStore.

For now, I’ll leave it that way. Next weekend, I might migrate a second node to BlueStore.

The reboot test

I needed to ensure the nodes would come back without my intervention. So starting with the two BTRFS nodes, I rebooted each one individually. The OSD on that node first went offline, then the monitor, finally the cluster noticed the metadata and manager services had gone. Then, upon successful boot, the services returned.

So far so good. Now the BlueStore node.

First reboot, my OSD didn’t come back. On investigation, I saw the following logs:

2019-01-28 16:25:59.312369 7fd58d4f0e00 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2019-01-28 16:26:14.865883 7fe92f942e00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2019-01-28 16:26:30.419863 7fd4fa026e00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory

/var/lib/ceph/osd/ceph-0 was completely empty! Bugger, do I have to endure those 24 hours again? As it happened, no. I don’t know how the files in that directory disappeared, I did observe a tmpfs pseudovolume mounted at that directory earlier when trying to create the OSD … maybe that didn’t get unmounted before OSD creation, anyway, the files were gone.

A bit of digging revealed a ceph-bluestore-tool utility, with options like repair. At first I tried to wing it using that, but no dice. Then looking at the man page I noticed the sub-command prime-osd-dir. BINGO.

At first I threw the raw device at it, but as it happens, ceph-volume had deployed LVM to the raw disk, then put BlueStore on top of that. Starting lvm got the volume group recognised, so I added that to my boot-up services (see why I mentioned it earlier). It had created a sym-link to the LVM volume in /dev/ceph-${UUID1}/osd-block-${UUID2}.

No idea where the two UUIDs came from, but I tried this:

# ceph-bluestore-tool prime-osd-dir \
    --dev /dev/ceph-d62d0d95-2e13-4c59-834d-03a87b88c85e/osd-block-62b4be3e-3935-4d51-ab5c-dde077f99ea3 \
    --path /var/lib/ceph/osd/ceph-0

That populated the directory with files, so I tried again starting the OSD.

2019-01-28 16:59:23.680039 7fd93fcbee00 -1 bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-0/block: (13) Permission denied
2019-01-28 16:59:23.680082 7fd93fcbee00 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2019-01-28 16:59:39.229888 7f4a585b4e00 -1 bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-0/block: (13) Permission denied
2019-01-28 16:59:39.229918 7f4a585b4e00 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory

Ah ha, chown -R ceph:ceph /var/lib/ceph/osd/ceph-0, and all sprang to life. The OSD came up.
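Pulling the recovery together, a sketch of the whole sequence might look like this. It assumes there is exactly one ceph-volume-created logical volume on the host (true for a one-OSD node like these); find_osd_block and recover_osd_dir are hypothetical helper names. Running ceph-volume lvm list should also report which logical volume belongs to which OSD, if there is more than one.

```shell
# Sketch: rebuild an empty BlueStore OSD directory from its LVM volume.
# Assumes a single ceph-*/osd-block-* volume on this host; the helper
# names are made up for illustration.
find_osd_block() {
    devroot="${1:-/dev}"
    for dev in "${devroot}"/ceph-*/osd-block-*; do
        [ -e "${dev}" ] || continue    # the glob matched nothing
        printf '%s\n' "${dev}"
        return 0
    done
    return 1
}

recover_osd_dir() {
    id="$1"
    dir="/var/lib/ceph/osd/ceph-${id}"
    dev="$(find_osd_block)" || return 1
    ceph-bluestore-tool prime-osd-dir --dev "${dev}" --path "${dir}" || return 1
    # ceph-osd runs as the ceph user, so it must own the primed files.
    chown -R ceph:ceph "${dir}"
}

# Example (as root, with lvm started): recover_osd_dir 0
```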

Testing the fixes, a second re-boot

Since the OSD now was starting, and working, I did a second re-boot test, only to have history partially repeat itself.

The files were still there this time, but it was failing with a permissions error opening the block device. Sure enough, it was now owned by root.

Changed the permissions, and the OSD came up.

Fixing this was a job for udev:

cat /etc/udev/rules.d/99ceph.rules
SUBSYSTEM=="block", KERNEL=="sda7", OWNER="ceph", GROUP="ceph", MODE="0600"
SUBSYSTEM=="block", ENV{DM_VG_NAME}=="ceph-*", OWNER="ceph", GROUP="ceph", MODE="0600"

The first line is left-over from when /dev/sda7 was my journal. Not sure what I’ll do with this partition now, I’ll think of something (maybe Docker). The second line tells udev to change the permissions on the volume group that Ceph created.

Having done this, I rebooted again. This time, all worked. The OSD came up without my intervention.
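A reboot is the real test, but the rule can probably be checked beforehand by asking udev to reload its rules and replay events for block devices. This is a sketch using standard udevadm sub-commands; the wrapper name is made up, and whether the ownership sticks is still best confirmed with ls -lL on the block device afterwards.

```shell
# Sketch: reload udev rules and re-trigger block device events so the new
# OWNER/GROUP/MODE settings are applied without rebooting.
reload_ceph_udev_rules() {
    udevadm control --reload-rules || return 1
    udevadm trigger --subsystem-match=block
}

# Example (as root): reload_ceph_udev_rules
```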

Recap

So, the pitfalls I ran across in my Jewel-Luminous migration on Gentoo.

btrfs OSDs

I had btrfs volumes for my OSDs, which are now frowned upon and considered experimental. It isn’t necessary to migrate to BlueStore or XFS straight away, but for the OSDs to boot, you will need the following line in your /etc/ceph/ceph.conf before restarting your OSDs:

enable experimental unrecoverable data corrupting features = btrfs

ceph-volume expects the bootstrap-osd key.

To use ceph-volume, it for some reason expects to see the bootstrap-osd key in a hard-coded location. It won’t work with the default admin key.

This bootstrap key can be generated as follows:

# ceph auth add client.bootstrap-osd --cap mon 'profile bootstrap-osd'
# mkdir /var/lib/ceph/bootstrap-osd
# ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring

Before creating a BlueStore OSD, make sure lvmetad and lvm are started (and set to start at boot)

You can get away with just lvmetad for the initial creation, but you’ll want lvm running at boot anyway to ensure all the logical volume groups get started at boot before Ceph goes looking for them.

So before attempting OSD creation, ensure LVM is installed, and set to start at boot.

ceph-osd runs as the ceph user

So your udev rules need to reflect that. Luckily, ceph-volume seems to prefer creating LVM volume groups named ceph-${UUID}. I don’t know what decides the UUID value, but thankfully udev supports globbing. The following udev rule (put it in /etc/udev/rules.d/99ceph.rules or wherever seems appropriate) will keep permissions in check:

SUBSYSTEM=="block", ENV{DM_VG_NAME}=="ceph-*", OWNER="ceph", GROUP="ceph", MODE="0600"

(The above should be all on one line.)

Before rebooting a BlueStore node, back up your OSD data directories

Shouldn’t be strictly necessary, but now I’ve been bitten, I’m going to be taking extra care of that data directory on my other two nodes when I migrate them. I don’t fancy playing around with ceph-bluestore-tool frantically trying to get an OSD back up again.

Solar Cluster: Next steps, better control of the charger

So, a few weeks ago I installed a new battery charger, and tweaked it so that the solar did most of the leg work during the day, and the charger kept the batteries topped up at night.

I also discussed the addition of a new industrial PC to perform routing and system monitoring functions… which was to run Gentoo Linux/musl. For now, that little PC is still running Debian Stretch, but for 45 days, it was rock solid. The addition of this box, and taking on the role of router to the management network meant I could finally achieve one of my long-term goals for the project: decommissioning the old server.

The old server is still set up with all my data and software… but now the back-up cron job calls /sbin/poweroff when it’s done, and the BIOS is set to wake the machine up in the evening ready to receive a back-up late at night.

In its place, a virtual machine clone of the box handles my email and all the old functions of that server. This was all done just prior to my father and I leaving for a 3 week holiday in the Snowy Mountains.

I did have a couple of hiccups with Ceph OSDs crashing … but basically re-starting the daemons (done remotely whilst travelling through Cowra) got everything back up. A bit of placement group cleaning, and everything was back online again. I had another similar hiccup coming out of Maitland, but once again, re-starting the daemons fixed it. No idea why it crashed, that’s something I’ll have to investigate.

Other than that, the cluster itself has run well.

One thing that did momentarily kill the industrial PC though: I wandered down to the rack with a small bus-powered 2.5″ HDD with the intent of re-starting my Gentoo builds. This HDD had the same content as the 3.5″ HDD I had plugged in before. I figured being bus powered, I would not be dependent on mains, and it could just chug away to its heart’s content.

No such luck, the moment I plugged that drive in, the little machine took great umbrage to the spinning rust now vacuuming the electrons away from its core functions, and shut down abruptly. I’ve now brought my 3.5″ drive and dock down, plugged that into the wall, and have my builds resuming. If power goes off, hopefully the machine handles the loss of swap gracefully. If it does crash, the watchdog will take care of it.

Thus, I have the little TS-7670 first attempting a build of gcc, to see how we go. Fingers crossed our power should remain up. There was at least one outage in the time we were away, but hopefully we should get through this next build!

The next step I think should be to add some control of the mains charger to allow the batteries to be boosted to full charge overnight. The thinking is a simple diode-OR arrangement. Many comparators such as the LM393 have an open-collector output, which gives us this for free.

The theory is this.

The battery bank powers a simple circuit which runs off a 5V regulator. That regulator powers a dual comparator IC and provides a reference voltage. The comparator draws bugger all power, so I’m happy to use a linear PSU here. It’s mainly there as a voltage reference.

Precision isn’t really the aim here, so adjustable pots will make life easier.

The voltages from the battery bank and the solar panel are fed through voltage dividers to bring the voltages down to below 5V, then those voltages are individually fed into separate pots that control the hysteresis. I can adjust all points of the system.

The idea is that should the batteries get too low, or the sun go down, one or the other (or both) comparators will go low and pull down on R2. If the batteries are high and the sun is up, nothing pulls on R2 so the REMOTE+ pin on the HEP-600C-12 is allowed to float to +5V, turning off the mains charger.
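The decision the two comparators make between them can be written out as a truth function. This is only a model of the logic (it ignores hysteresis), and the threshold figures in the example are placeholders rather than measured set-points.

```shell
# Model of the two open-collector comparators sharing one pull-up:
# prints "on" when either would pull REMOTE+ low, "off" when both let it float.
# All voltages are in millivolts; the thresholds are whatever the pots are set to.
charger_state() {
    batt_mv="$1"
    solar_mv="$2"
    batt_min_mv="$3"
    solar_min_mv="$4"
    if [ "${batt_mv}" -lt "${batt_min_mv}" ] || [ "${solar_mv}" -lt "${solar_min_mv}" ]; then
        echo on     # battery low, or sun down: mains charger runs
    else
        echo off    # battery healthy and sun up: leave it to the solar
    fi
}

# Example with placeholder thresholds of 12.0V (battery) and 14.0V (solar):
#   charger_state 12600 18000 12000 14000
```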

The advantage of this is there’s no programming of a microcontroller, it’s just analogue electronics. The LM393s are pretty hardy things, the datasheet says they’ll run at 36V and can accept a maximum voltage of VCC-1.5V; so if I run at 5V, 3.5V is my recommended maximum. The adjustment pots should let me set a threshold voltage that avoids going above this.
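As a sanity check on that input limit, the divider output can be worked out with a little integer arithmetic. The 33k/10k resistor values are hypothetical, chosen only to show the sum; the real values depend on the highest voltage each input will see.

```shell
# Output of a resistive divider in millivolts (integer maths, rounds down).
divider_mv() {
    vin_mv="$1"
    r_top="$2"      # ohms, between the input and the tap
    r_bottom="$3"   # ohms, between the tap and ground
    echo $(( vin_mv * r_bottom / (r_top + r_bottom) ))
}

# Highest comparator input for a given supply, using the VCC-1.5V figure.
max_input_mv() {
    echo $(( $1 - 1500 ))
}

divider_mv 14400 33000 10000   # 14.4V battery through a hypothetical 33k/10k divider
max_input_mv 5000              # limit with a 5V supply
```

With those example values a 14.4V battery gives about 3.35V at the comparator, just inside the 3.5V limit.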

I mainly need 5V for the HEP-600C-12, and for providing that stable known voltage reference. The LM78C05 should be fine for this.

Once I’ve done that, I should be able to wind that charger back up to its factory setting of 14.4V, which will mean that overnight the batteries will be charged back to full charge.

Solar Cluster: Solar controller replaced, upgrading RAM and VM management

So, this morning I decided to shut the whole lot down and switch to the new solar controller.  There’s some clean-up work to be done, but for now, it’ll do.  The new controller is a Powertech MP3735.  Supposedly this one can deliver 30A, and has programmable float and bulk charge voltages.  A nice feature is that it’ll disconnect the load when it drops below 11V, so finding the batteries at 6V should be a thing of the past!  We’ll see how it goes.

I also put in two current shunts, one on the feed into/out of the battery, and one to the load.  Nothing is connected to monitor these as yet, but some research suggested that while in theory it is just an op-amp needed, that op-amp has to deal with microvolt differences and noise.

There are instrumentation amplifiers designed for that, and a handy little package is TI’s INA219B.  This incorporates the aforementioned amplifier, but also adds to that an ADC with an I²C interface.  Downside is that I’ll need an MCU to poll it, upside is that by placing the ADC and instrumentation amp in one package, it should cut down noise, further reduced if I mount the chip on a board bolted to the current shunt concerned.  The ADC measures bus voltage as well.  Getting this to work shouldn’t be hard.  (Yes, famous last words I know.)
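Whatever ends up polling the INA219 will have to turn raw register values into volts and amps. A minimal sketch of that conversion follows, assuming the register scaling as I read it in the datasheet (10µV per bit for the shunt voltage register, 4mV per bit for the bus voltage register with the reading in the upper 13 bits). The shunt resistance is a hypothetical value, and the I²C read itself is left out.

```python
SHUNT_LSB_V = 10e-6   # shunt voltage register: 10 µV per bit
BUS_LSB_V = 4e-3      # bus voltage register: 4 mV per bit


def to_signed16(raw: int) -> int:
    """Interpret a 16-bit register value as two's complement."""
    return raw - 0x10000 if raw & 0x8000 else raw


def shunt_current(raw_shunt: int, shunt_ohms: float) -> float:
    """Current through the shunt in amps (negative means reverse flow)."""
    return to_signed16(raw_shunt) * SHUNT_LSB_V / shunt_ohms


def bus_voltage(raw_bus: int) -> float:
    """Bus voltage in volts; the low three bits are not part of the reading."""
    return (raw_bus >> 3) * BUS_LSB_V


# Hypothetical 0.5 milliohm shunt; 4000 counts is 40 mV across it.
print(shunt_current(4000, shunt_ohms=0.0005))
print(bus_voltage(24000))
```

The signed conversion matters for the battery shunt in particular, since current flows both into and out of the bank.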

A few days ago, I also placed an order for some more RAM for the two compute nodes.  I had thought 8GB would be enough, and in a way it is, except I’ve found some software really doesn’t work properly unless it has 2GB RAM available (Gitea being one, although it is otherwise a fantastic Git repository manager).  By bumping both these nodes to 32GB each (4×8GB) I can be less frugal about memory allocations.

I can in theory go to 16GB modules in these boxes, but those were hideously expensive last time I looked, and had to be imported.  My debit card maxes out at $AU999.99, and there’s GST payable on anything higher anyway, so there goes that idea.  64GB would be nice, but 32GB should be enough.

The fun bit though, Kingston no longer make DDR3 ECC SO-DIMMs.  The mob I bought the last lot through informed me that the product is no longer available, after I had sent them the B-Pay payment.  Ahh well, I’ve tossed the question back asking what they have available that is compatible.

Searching for ECC SODIMMs is fun, because the search engines will see ECC and find ECC DIMMs (i.e. full-size).  When looking at one of these ECC SODIMM unicorns, they’ll even suggest the full-size version as similar.  I’d love to see the salespeople try to fit the suggested full-size DIMM into the SODIMM socket and make it work!

The other thing that happens is the search engine sees ECC and sees that it’s a sub-string of non-ECC.  Errm, yeah, if I meant non-ECC, I’d have said so, and I wouldn’t have put ECC there.

Crucial and Micron both make it though, here’s hoping mixing and matching RAM from different suppliers in the same bank won’t cause grief, otherwise the other option is I pull the Kingston sticks out and completely replace them.

The other thing I’m looking at is an alternative to OpenNebula.  Something that isn’t a pain in the arse to deploy (like OpenStack is, been there, done that), that is decentralised, and will handle KVM with a Ceph back-end.

A nice bonus would be being able to handle cross-architecture QEMU VMs, in particular for ARM and MIPS targets.  This is something that libvirt-based solutions do not do well.

I’m starting to think about ways I can DIY that solution.  Blockchain was briefly looked at, and ruled out on the basis that while it’d be good for an audit log, there’s no easy way to index it: reading current values would mean a full-scan of the blockchain, so not a solution on its own.

CephFS is stable now, but I’m not sure how file locking works on it.  Then there’s object storage itself, librados.  I’m not sure if there’s a database engine that can interface to that, or maybe to Amazon S3 storage (radosgw can emulate that), that’ll be the next step.  Lots to think about.

Solar Cluster: OpenNebula Front-end setup

So, the front-end for OpenNebula will be a VM that migrates between the two compute nodes in a HA arrangement.  Likewise with the core router, and border router, although I am also tossing up trying again with the little Advantech UNO-1150G I have lying around.

For now, I’ve not yet set up the HA part; I’ll come to that.  There are guides for using libvirt with corosync/heartbeat.  Most also call for DRBD as the block device for the VM, but we won’t be using that, as our block device (RADOS Block Device) is already redundant.

To host OpenNebula, I’ll use Gentoo with musl-libc since that’ll shrink the footprint down just a little bit.  We’ll run it on a MariaDB back-end.

Since we’re using musl, you’ll want to install layman and the musl overlay as not all packages build against musl out-of-the-box.  Also install gentoolkit, as you’ll need to set USE flags, and euse makes this easy:

# emerge layman
# layman -L
# layman -a musl
# emerge gentoolkit

Now that some basic packages are installed, we need to install OpenNebula’s prerequisites. They tell you that xmlrpc-c is amongst these. BUT, they don’t tell you that it needs support for abyss, and the scons build system they use will just give you a cryptic error saying it couldn’t find xmlrpc. The answer is not, as suggested, to specify the path to xmlrpc-c-config, which happens to be in ${PATH} anyway, as that will net the same result, and break things later when you fix the real issue. Instead, enable the abyss USE flag:

# euse -p dev-libs/xmlrpc-c -E abyss

Now we can build the dependencies… this isn’t a full list, but includes everything that Gentoo ships in repositories, the remaining Ruby gems will have to be installed separately.

# emerge --ask dev-lang/ruby dev-db/sqlite dev-db/mariadb \
dev-ruby/sqlite3 dev-libs/xmlrpc-c dev-util/scons \
dev-ruby/json dev-ruby/sinatra dev-ruby/uuidtools \
dev-ruby/curb dev-ruby/nokogiri

With that done, create a user account for OpenNebula:

# useradd -d /opt/opennebula -m -r opennebula

Now you’re set to build OpenNebula itself:

# tar -xzvf opennebula-5.4.0.tar.gz
# cd opennebula-5.4.0
# scons mysql=yes

That’ll run for a bit, but should succeed. At the end:

# ./install.sh -d /opt/opennebula -u opennebula -g opennebula

That’s about where I’m at now… the link in the README for further documentation is broken; here is where they keep their current documentation.

Solar Cluster: Networking

So, having got some instances going… I thought I better sort out the networking issues proper.  While it was working, I wanted to do a few things:

  1. Bring a dedicated link down from my room into the rack directly for redundancy
  2. Define some more VLANs
  3. Sort out the intermittent faults being reported by Ceph

I decided to tackle (1) first.  I have two 8-port Cisco SG-200 switches linked via a length of Cat5E that snakes its way from our study, through the ceiling cavity then comes up through a small hole in the floor of my room, near where two brush-tail possums call home.

I drilled a new hole next to where the existing cable entered, then came the fun of trying to feed the new cable alongside the old one.  First attempt had the cable nearly coil itself just inside the cavity.  I tried to make a tool to grab the end of it, but it was well and truly out of reach.  I ended up getting the job done by taping the cable to a section of fibreglass tubing, feeding that in, taping another section of tubing to that, feeding that in, etc… but then I ran out of tubing.

Luckily, a rummage around, and I found some rigid plastic that I was able to tape to the tubing, and that got me within a half-metre of my target.  Brilliant, except I forgot to put a leader cable through for next time didn’t I?

So more rummaging around for a length of suitable nylon rope, tape the rope to the Cat5E, haul the Cat5E out, then grab another length of rope and tape that to the end and use the nylon rope to haul everything back in.

The rope should be handy for when I come to install the solar panels.

I had one 16-way patch panel, so wound up terminating the rack-end with that, and just putting a RJ-45 on the end in my room and plugging that directly into the switch.  So on the shopping list will be some RJ-45 wall jacks.

The cable tester tells me I possibly have brown and white-brown switched, but never mind, I’ll be re-terminating it properly when I get the parts, and that pair isn’t used anyway.

The upshot: I now have a nice 1Gbps ring loop between the two SG-200s and the LGS326 in the rack.  No animals were harmed in the formation of this ring, although two possums were mildly inconvenienced.  (I call that payback for the times they’ve held the Marsupial Olympics at 2AM when I’m trying to sleep!)

Having gotten the physical layer sorted out, I was able to introduce the upstairs SG-200 to the new switch, then remove the single-port LAG I had defined on the downstairs SG-200.  A bit more tinkering, and I had a nice redundant set-up: setting my laptop to ping one of the instances in the cluster over WiFi, I could unplug my upstairs trunk, wait a few seconds, plug it back in, wait some more, unplug the downstairs trunk, wait some more again, then plug it back in again, and not lose a single ICMP packet.

I moved my two switches and my AP over to the new management VLAN I had set up, alongside the IPMI interfaces on the nodes.  The SG-200s were easy: aside from them insisting on one port being configured with a PVID equal to the management VLAN (I guess they want to ensure you don’t get locked out), it all went smoothly.

The AP though, a Cisco WAP4410N… not so easy.  In their wisdom, and unlike the SG-200s, the management VLAN settings page is separate from the IP interface page, so you can’t change both at the same time.  I wound up changing the VLAN, only to find I had locked myself out of it.  Much swearing at the cantankerous AP and wondering how could someone overlook such a fundamental requirement!  That, and the switch where the AP plugs in, helpfully didn’t add the management VLAN to the right port like I asked of it.

Once that was sorted out, I was able to configure an IP on the old subnet and move the AP across.

That just left dealing with the intermittent issues with Ceph.  My original intention with the cluster was to use 802.3AD so each node had two 2Gbps links.  Except: the LGS326-AU only supports 4 LAGs.  For me to do this, I need 10!

Thankfully, the bonding support in the Linux kernel has several other options available.  Switching from 802.3ad to balance-tlb resolved the issue.

slaves_bond0="enp0s20f0 enp0s20f1"
slaves_bond1="enp0s20f2 enp0s20f3"
config_bond0="null"
config_bond1="null"
config_enp0s20f0="null"
config_enp0s20f1="null"
config_enp0s20f2="null"
config_enp0s20f3="null"
rc_net_bond0_need="net.enp0s20f0 net.enp0s20f1"
rc_net_bond1_need="net.enp0s20f2 net.enp0s20f3"
mode_bond0="balance-tlb"
mode_bond1="balance-tlb"
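Once the bonds are up, the kernel reports their state under /proc/net/bonding/. The following is a rough Python sketch for checking that a bond came up in the expected mode with both slaves alive. The function name is made up, and the "Key: value" layout it parses is how I remember that file looking, so treat the field names as assumptions to check against a real node.

```python
def parse_bonding_status(text: str) -> dict:
    """Pull the bonding mode and per-slave MII status out of the
    contents of a /proc/net/bonding/<bond> file."""
    status = {"mode": None, "slaves": {}}
    current_slave = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or no "Key: value" pair
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            status["mode"] = value
        elif key == "Slave Interface":
            current_slave = value
            status["slaves"][current_slave] = None
        elif key == "MII Status" and current_slave is not None:
            # Only record MII status lines that follow a slave header;
            # the earlier one describes the bond as a whole.
            status["slaves"][current_slave] = value
    return status
```

Feeding it the contents of /proc/net/bonding/bond0 should give a dictionary with the mode string and an up/down entry for each of enp0s20f0 and enp0s20f1.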

I am currently setting up a core router instance (with OpenBSD 6.1) and an OpenNebula instance (with Gentoo AMD64/musl libc).

Solar Cluster: First virtual instances running

So, since my last log, I’ve managed to tidy up the wiring on the cluster, making use of the plywood panel at the back to mount all my DC power electronics, and generally tidying everything up.

I had planned to use a SB50 connector to connect the cluster up to the power supply, so made provisions for this in the wiring harness. Turns out, this was not necessary, it was easier in the end to just pull apart the existing wiring and hard-wire the cluster up to the charger input.

So, I’ve now got a spare load socket hanging out the front, which will be handy if we wind up with unreliable mains power in the near future since it’s a convenient point to hook up 12V appliances.

There’s a solar power input there ready, and space to the left of that to build a little control circuit that monitors the solar voltage and switches in the mains if needed. For now though, the switching is done with a relay that’s hard-wired on.

Today though, I managed to get the Ceph clients set up on the two compute nodes, although virt-manager is buggy when it comes to RBD pools. In particular, adding an RBD storage pool doesn’t work as there’s no way to define authentication keys, and even if you have the pool defined, you find that trying to use images from that pool causes virt-manager to complain it can’t find the image on your local machine. (Well duh! This is a known issue.)

I was able to find an XML cheat-sheet for defining a domain in libvirt, which I was then able to use with Ceph’s documentation.

A typical instance looks like this:

<domain type='kvm'>
  <!-- name of your instance -->
  <name>instancename</name>
  <!-- a UUID for your instance, use `uuidgen` to generate one -->
  <uuid>00ec9b97-c49a-45f8-befe-f74ad6bde2fe</uuid>
  <memory>524288</memory>
  <vcpu>1</vcpu>
  <os>
    <type arch="x86_64">hvm</type>
  </os>
  <clock sync="utc"/>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='network' device='disk'>
      <source protocol='rbd' name="poolname/image.vda">
        <!-- the hostnames or IPs of your Ceph monitor nodes -->
        <host name="s0.internal.network" />
        <host name="s1.internal.network" />
        <host name="s2.internal.network" />
      </source>
      <target dev='vda'/>
      <auth username='libvirt'>
        <!-- the UUID here is what libvirt allocated when you did
	    `virsh secret-define foo.xml`, use `virsh secret-list`
	    if you've forgotten what that is. -->
        <secret type='ceph' uuid='23daf9f8-1e80-4e6d-97b6-7916aeb7cc62'/>
      </auth>
    </disk>
    <disk type='network' device='cdrom'>
      <source protocol='rbd' name="poolname/image.iso">
        <!-- the hostnames or IPs of your Ceph monitor nodes -->
        <host name="s0.internal.network" />
        <host name="s1.internal.network" />
        <host name="s2.internal.network" />
      </source>
      <target dev='hdd'/>
      <auth username='libvirt'>
        <secret type='ceph' uuid='23daf9f8-1e80-4e6d-97b6-7916aeb7cc62'/>
      </auth>
    </disk>
    <interface type='network'>
      <source network='default'/>
      <!-- placeholder; must be a unicast address, 52:54:00 is the usual QEMU/KVM prefix -->
      <mac address='52:54:00:44:55:66'/>
    </interface>
    <graphics type='vnc' port='-1' keymap='en-us'/>
  </devices>
</domain>
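The two disk stanzas above are nearly identical, so if you end up writing a few of these by hand it may be worth generating them. Here is a sketch using Python's standard xml.etree.ElementTree; the function name and its parameters are my own invention, and the element layout simply mirrors the example above.

```python
import xml.etree.ElementTree as ET
from typing import Sequence


def rbd_disk(image: str, target: str, monitors: Sequence[str],
             secret_uuid: str, username: str = "libvirt",
             device: str = "disk") -> ET.Element:
    """Build a <disk> element for an RBD-backed image."""
    disk = ET.Element("disk", type="network", device=device)
    source = ET.SubElement(disk, "source", protocol="rbd", name=image)
    for host in monitors:
        # One <host> entry per Ceph monitor.
        ET.SubElement(source, "host", name=host)
    ET.SubElement(disk, "target", dev=target)
    auth = ET.SubElement(disk, "auth", username=username)
    ET.SubElement(auth, "secret", type="ceph", uuid=secret_uuid)
    return disk


monitors = ["s0.internal.network", "s1.internal.network",
            "s2.internal.network"]
disk = rbd_disk("poolname/image.vda", "vda", monitors,
                "23daf9f8-1e80-4e6d-97b6-7916aeb7cc62")
print(ET.tostring(disk, encoding="unicode"))
```

Passing device="cdrom" and a target of "hdd" gives the second stanza; the output can be pasted into the devices section of the domain definition.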

Having defined the domain, you can then edit it at will in virt-manager. I was able to switch the network interface over to using virtio, plop it on a bridge so it was wired up to the correct VLAN and start the instance up.

I’ve since managed to migrate 3 instances over, namely an estate database, Brisbane Area WICEN’s OwnCloud site, and my own blog.

These are sufficient to try the system out. I’m already finding these instances much more responsive, using raw Ceph even, than the original server.

My next move I think will be to see if I can get corosync/heartbeat to manage a HA VM instance. That is, if one of the compute nodes goes offline, the instance restarts on the other compute node.

Two services come to mind where HA is concerned: terminating the PPPoE link for our Internet, and a virtual management node for a higher-level system such as OpenNebula. OpenNebula really needs something semi-HA, since it really gets its knickers in a twist if the master node goes down. I also want my border router to be HA, since I won’t necessarily be around to migrate it to a different node.

Everything else, well I suspect OpenNebula can itself manage those, and long term the instances I just liberated today from my old box, will become instances within OpenNebula.

The other option is I dip my toe into OpenStack (again), since it is inherently HA by design, but it is also a royal pain to get working.

Solar Cluster: Rack installed in-situ

So, there’s some work still to be done, for example making some extension leads for the run between the battery link harness, load power distribution and the charger… and to generally tidy things up, but it is now up and running.

On the floor, is the 240V-12V power supply and the charger, which right now is hard-wired in boost mode. In the bottom of the rack are the two 105Ah 12V AGM batteries, in boxes with fuses and isolation switches.

The nodes and switching are inside the rack, and resting on top is the load power distribution board, which I’ll have to rewire to make things a little neater. A prospect is to mount some of this on the back.

I had a few introductions to make, introducing the existing pair of SG-200 switches to the newcomer and its VLANs, but now at least, I’m able to SSH into the nodes, access the IPMI BMC and generally configure the whole box and dice.

With the exception of the later upgrade to solar, and the aforementioned wiring harness clean-ups, the hardware side of this dual hardware/software project is largely complete, and this project now transitions to being a software project.

The plan from here:

  • Update the OSes… as all will be a little dated. (I might even blow away and re-load.)
  • Get Ceph storage up and running. It actually should be configured already, just a matter of getting DNS hostnames sorted out so they can find each other.
  • Investigating the block caching landscape: when I first started the project at work, it was a 3-horse race between Facebook’s FlashCache, bcache and dmcache. Well, FlashCache is no more, replaced by EnhancedIO, and I’m not sure about the rest of the market. So this needs researching.
  • Management interfaces: at my workplace I tried Ganeti, OpenNebula and OpenStack. This again, needs re-visiting. OpenNebula has moved a long way from where it was and I haven’t looked at the others in a while. OpenStack had me running away screaming, but maybe things have improved.