May 292019
 

It’s been on my TO-DO list now for a long time to wire in some current shunts to monitor the solar input, replace the near useless Powertech solar controller with something better, and put in some more outlets.

Saturday, I finally got around to doing exactly that. I meant to also add a low-voltage disconnect to the rig … I’ve got the parts for this but haven’t yet built or tested it — I’d like to wait until I have done both,but I needed the power capacity. So I’m running a risk without the over-discharge protection, but I think I’ll take that gamble for now.

Right now:

  • The Powertech MP-3735 is permanently out, the Redarc BCDC-1225 is back in.
  • I have nearly a dozen spare 12V outlet points now.
  • There are current shunts on:
    • Raw solar input (50A)
    • Solar controller output (50A)
    • Battery (100A)
    • Load (100A)
  • The Meanwell HEP-600C-12 is mounted to the back of the server rack, freeing space from the top.
  • The janky spade lugs and undersized cable connecting the HEP-600C-12 to the battery has been replaced with a more substantial cable.

This is what it looks like now around the back:

Rear of the rack, after re-wiring

What difference has this made? I’ll let the graphs speak. This was the battery voltage this time last week:

Battery voltage for 2019-05-22

… and this was today…

Battery voltage 2019-05-29

Chalk-and-bloody-cheese! The weather has been quite consistent, and the solar output has greatly improved just replacing the controller. The panels actually got a bit overenthusiastic and overshot the 14.6V maximum… but not by much thankfully. I think once I get some more nodes on, it’ll come down a bit.

I’ve gone from about 8 hours off-grid to nearly 12! Expanding the battery capacity is an option, and could see the cluster possibly run overnight.

I need to get the two new nodes onto battery power (the two new NUCs) and the Netgear switch. Actually I’m waiting on a rack-mount kit for the Netgear as I have misplaced the one it came with, failing that I’ll hack one up out of aluminium angle — it doesn’t look hard!

A new motherboard is coming for the downed node, that will bring me back up to two compute nodes (one with 16 cores), and I have new 2TB HDDs to replace the aging 1TB drives. Once that’s done I’ll have:

  • 24 CPU cores and 64GB RAM in compute nodes
  • 28 CPU cores and 112GB RAM in storage nodes
  • 10TB of raw disk storage

I’ll have to pull my finger out on the power monitoring, there’s all the shunts in place now so I have no excuse but to make up those INA-219 boards and get everything going.

May 252019
 

So recently I was musing about how I might go about expanding the storage on the cluster. This was largely driven by the fact that I was about 80% full, and thus needed to increase capacity somehow.

I also was noting that the 5400RPM HDDs (HGST HTS541010A9E680), now with a bit of load, were starting to show signs of not keeping up. The cases I have can take two 2.5″ SATA HDDs, one spot is occupied by a boot drive (120GB SSD) and the other a HDD.

A few weeks ago, I had a node fail. That really did send the cluster into a spin, since due to space constraints, things weren’t as “redundant” as I would have liked, and with one disk down, I/O throughput which was already rivalling Microsoft Azure levels of slow, really took a bad downward turn.

I hastily bought two NUCs, which I’m working towards deploying… with those I also bought two 120GB M.2 SSDs (for boot drives) and two 2TB HDDs (WD Blues).

It was at that point I noticed that some of the working drives were giving off the odd read error which was throwing Ceph off, causing “inconsistent” placement groups. At that point, I decided I’d actually deploy one of the new drives (the old drive was connected to another node so I had nothing to lose), and I’ll probably deploy the other shortly. The WD Blue 2TB drives are also 5400RPM, but unlike the 1TB Hitachis I was using before, have 128MB of cache vs just 8MB.

That should boost the read performance just a little bit. We’ll see how they go. I figured this isn’t mutually exclusive to the plans of external storage upgrades, I can still buy and mod external enclosures like I planned, but perhaps with a bit more breathing room, the immediate need has passed.

I’ve since ordered another 3 of these drives, two will replace the existing 1TB drives, and a third will go back in the NUC I stole a 2TB drive from.

Thinking about the problem more, one big issue is that I don’t have room inside the case for 3 2.5″ HDDs, and the motherboards I have do not feature mSATA or M.2 SATA. I might cram a PCIe SSD in, but those are pricey.

The 120GB SSD is only there as a boot drive. If I could move that off to some other medium, I could possibly move to a bigger SSD in place of the 120GB SSD, maybe a ~500GB unit. These are reasonably priced. The issue is then where to put the OS.

An unattractive option is to shove a USB stick in and boot off that. There’s no internal USB ports, but there are two front USB ports in the case I could rig up to an internal header so they’re not sticking out like a sore thumb(-drive) begging to be broken off by a side-wards slap. The flash memory in these is usually the cheapest variety, so maybe if I went this route, I’d buy two: one for the root FS, the other for swap/logs.

The other option is a Disk-on-Module. The motherboards provide the necessary DC power connector for running these things, and there’s a chance I could cram one in there. They’re pricey, but not as bad as going NVMe SSDs, and there’s a greater chance of success squeezing this in.

Right now I’ve just bought a replacement motherboard and some RAM for it… this time the 16-core model, and it takes full-size DIMMs. It’ll go back in as a compute node with 32GB RAM (I can take it all the way to 256GB if I want to). Coupled with that and a purchase of some HDDs, I think I’ll let the bank account cool off before I go splurging more. 🙂

May 142019
 

Well, it had to happen some day, but I was hoping it’d be a few more years off… I’ve had the first node failure on the cluster.

One of my storage nodes decided to keel over this morning, some time between 5 and 8AM… sending the cluster into utter chaos. I tried power cycling the host a few times before finally yanking it from the DIN rail and trying it on the bench supply. After about 10 minutes of pulling SO-DIMMs and general mucking around trying to coax it to POST, I pulled the HDD out, put that in an external dock and connected that to one of the other storage nodes. After all, it was approaching 9AM and I needed to get to work!

A quick bit of work with ceph-bluestore-tool and I had the OSD mounted and running again. The cluster is moaning that it’s lost a monitor daemon… but it’s still got the other two so provided that I can keep O’Toole away (Murphy has already visited), I should be fine for now.

This evening I took a closer look, tried the RAM I had in different slots, even with the RAM removed, there’s no signs of life out of the host itself: I should get beep codes with no RAM installed. I ran my multimeter across the various power rails I could get at: the 5V and 12V rails look fine. The IPMI BMC works, but that’s about as much as I get. I guess once the board is replaced, I might take a closer look at that BMC, see how hackable it is.

I’ve bought a couple of spare nodes which will probably find themselves pressed into monitor node duty, two Intel NUC7I5BNHs have been ordered, and I’ll pick these up later in the week. Basically one is to temporarily replace the downed node until such time as I can procure a more suitable motherboard, and the other is a spare.

I have a M.2 SATA SSD I can drop in along with some DDR4 RAM I bought by mistake, and of course the HDD for that node is sitting in the dock. The NUCs are perfectly fine running between 10.8V right up to 19V — verified on a NUC6CAYS, so no 12V regulator is needed.

The only down-side with these units is the single Ethernet port, however I think this will be fine for monitor node duty, and two additional nodes should mean the storage cluster becomes more resilient.

The likely long-term plan may be an upgrade of one of the compute nodes. For ~$1600, I can get a A2SDi-16C-HLN4F, which sports 16 cores and takes full-size DDR4 DIMMs. I can then rotate the board out of that into the downed node.

The full-size DIMMS are much more readily available in ECC format, so that should make long-term support of this cluster much easier as the supplies of the SO-DIMMs are quickly drying up.

This probably means I should pull my finger out and actually do some of the maintenance I had been planning but put off… largely due to a lack of time. It’s just typical that everything has to happen when you are least free to deal with it.

Feb 152019
 

One problem I face with the cluster as it stands now is that 2.5″ HDDs are actually quite restrictive in terms of size options.

Right now the whole shebang runs on 1TB 5400RPM Hitachi laptop drives, which so far has been fine, but now that I’ve put my old server on as a VM, that’s chewed up a big chunk of space. I can survive a single drive crash, but not two.

I can buy 2TB HDDs, WD make some and Scorptec sell them. Seagate make some bigger capacity drives, however I have a policy of not buying Seagate.

At work we built a Ceph cluster on 3TB SV35 HDDs… 6 of them to be exact. Within 9 months, the drives started failing one-by-one. At first it was just the odd drive being intermittent, then the problem got worse. They all got RMAed, all 6 of them. Since we obviously needed drives to store data on until the RMAed drives returned, we bought identically sized consumer 5400RPM Hitachi drives. Those same drives are running happily in the same cluster today, some 3 years later.

We also had one SV35 in a 3.5″ external enclosure that formed my workplace’s “disaster recovery” back-up drive. The idea being that if the place was in great peril and it was safe enough to do so, someone could just yank this drive from the rack and run. (If we didn’t, we also had truly off-site back-up NAS boxes.) That wound up failing as well before its time was due. That got replaced with one of the RMAed disks and used until the 3TB no longer sufficed.

Anyway, enough of that diversion, long story short, I don’t trust Seagate disks for 24/7 operation. I don’t see other manufacturers (other than Seagate e.g. WD, Samsung, Hitachi) making >2TB HDDs in the 2.5″ form factor. They all seem to be going SSD.

I have a Samsung 850EVO 2TB in the laptop I’m writing this on, bought a couple of years ago now, and so far, it has been reliable. The cluster also uses 120GB 850EVOs as OS drives. There’s now a 4TB version as well.

The performance would be wonderful and they’d reduce the power consumption of the cluster, however, 3 4TB SSDs would cost $2700. That’s a big investment!

The other option is to bolt on a 3.5″ HDD somehow. A DIN-rail mounted case would be ideal for this. 3.5″ high-capacity drives are much more common, and is using technology which is proven reliable and is comparatively inexpensive.

In addition, by going to bigger external drives it also means I can potentially swap out those 2.5″ HDDs for SSDs at a later date. A WD Purple (5400RPM) 4TB sells for $166. I have one of these in my desktop at work, and so far its performance there has been fine. $3 more and I can get one of the WD Red (7200RPM) 4TB drives which are intended for NAS use. $265 buys a 6TB Toshiba 7200RPM HDD. In short, I have options.

Now, mounting the drives in the rack is a problem. I could just make a shelf to sit the drive enclosures on, or I could buy a second rack and move the servers into that which would free up room for a second DIN rail for the HDDs to mount to. It’d be neat to DIN-rail mount the enclosures beside each Ceph node, but right now, there’s no room to do that.

I’d also either need to modify or scratch-make a HDD enclosure that can be DIN-rail mounted.

There’s then the thorny issue of interfacing. There are two options at my disposal: eSATA and USB3. (Thunderbolt and Firewire aren’t supported on these systems and adding a PCIe card would be tricky.)

The Supermicro motherboards I’m using have 6 SATA ports. If you’re prepared to live with reduced cable lengths, you can use a passive SATA to eSATA adaptor bracket — and this works just fine for my use case since the drives will be quite close. I will have to power down a node and cut a hole in the case to mount the bracket, but this is doable.

I haven’t tried this out yet, but I should be able to use the same type of adaptor inside the enclosure to connect the eSATA cable to the HDD. Trade-off will be further reduced cable distances, but again, they don’t need to go more than 30cm, it’ll most likely work fine.

The other interface option is USB 3.0. The motherboards have two back-panel USB 3.0 connectors and inside, two USB 3.0 ports I can potentially expose. This can be hot-plugged without changing my cluster as it stands now. The down-side is that USB incurs a greater CPU overhead than SATA.

During my migration to BlueStore, I used exactly this to provide a “temporary” OSD disk… a 1TB 7200RPM WD black in a HDD dock. The performance of that was fine, and in that case, I was willing to put up with the overhead as it was temporary.

External eSATA cases seem to be going the way of the dodo, I haven’t seen many available for sale from my usual suppliers. USB 3.0 seems to have taken over, probably because for most uses, it is “good enough”. I did ask about whether one is preferred over the other for Ceph OSD use on the Ceph mailing list, but heard nothing.

As it was, prior to undertaking the migration, I bought such a case, an el’cheapo Simplecom SE-325, along with a 4TB WD Blue for the actual drive. I was tossing up between that, and a LaCiE “Porsche” 4TB drive, but the winning factor of this was that I’d know what I was buying — the LaCiE drive could have had anything in there, manufacturers can and sometimes do substitute components in different manufacturing runs, buying the case and drive separately didn’t run that risk.

The case and drive did the job. I hooked the drive up to my laptop (I had forgotten xhci_hcd support in the storage nodes’ kernels, which I have since fixed) and pulled a snapshot of every VM disk (Rados block device) off the Ceph cluster onto this drive as a raw disk image so I would not lose data. The drive easily kept up with the GbE link I had to the downstairs switch, and a core in the Core i5-3320M in this laptop is probably on par with the ones in the Avoton C2750s running the show.

To DIN-rail mount this, I’d need to make a cradle to take the case, and I’d need to hack some forced-ventilation into the top cover, which isn’t a difficult job. (Drill some holes, then use a nibbler tool to cut slots, then mount a small fan.)

The original PSU for this case is a 12V 2A wall wart, easily substituted with a 12V 3A LDO such as the LM1085IT-12. I may even be able to squeeze it and a heatsink into the case. I presently use one of these with the border router with a small heatsink, and so far, no problems.

If I later want eSATA, I can unscrew the original PCB and should be able to hack that in.

Short term, I can place a temporary shelf atop the battery cases and sit the HDDs there until I figure out more permanent arrangements.

Right now I’ve been battling a few health problems (sharp-eyed readers may recognise the box of “gunk” in the background which is now empty and the accompanying documentation — I’ll know more next Friday morning), and so I’ll wait until I know the outcome of those tests as there’s no point in building something grand if I’m not going to be around to enjoy it.

Jan 282019
 

My cloud computing cluster like all cloud computing clusters of course needs a storage back-end. There were a number of options I could have chosen, but the one I went with in the end was Ceph, and so far, it’s ran pretty well.

Lately though, I was starting to get some odd crashes out of ceph-osd. I was running release 10.2.3, which is quite dated now, this is one of the earlier Jewel releases. Adding to the fun, I’m running btrfs as my filesystem on the OS and the OSD, and I’m running it all on Gentoo. On top of this, my monitor nodes are my OSDs as well.

Not exactly a “supported” configuration, never mind the hacks done at hardware level.

There was also a nagging issue about too many placement groups in the Ceph cluster. When I first established the cluster, I christened it by dragging a few of my lxc containers off the old server and making them VMs in the cluster. This was done using libvirt and virt-manager. These got thrown into a storage pool called transitional-inst, with a VLAN set aside for the VMs to use. When I threw OpenNebula on, I created another Ceph pool, one for its images. The configuration of these lead to the “too many placement groups” warning, which until now, I just ignored.

This weekend was a long weekend, for controversial reasons… and so I thought I’ll take a snapshot of all my VMs, download those snapshots to a HDD as raw images, then see if I can fix these issues, and migrate to Ceph Luminous (v12.2.10) at the same time.

Backing up

I was going to be doing some nasty things to the cluster, so I thought the first thing to do was to back up all images. This was done by using rbd snap create pool/image@date to create a snapshot of an image, then rbd export pool/image@date /path/to/storage/pool-image.img before blowing away the snapshot with rbd snap rm pool/image@date.

This was done for all images on the Ceph cluster, stashing them on a 4TB hard drive I had bought for the purpose.

Getting things ready

My cluster is actually set up as a distcc cluster, with Apache HTTP server instances sharing out distfiles and binary package repositories, so if I build packages on one, I can have the others fetch the binary packages that it built. I started with a node, and got it to update all packages except Ceph. Made sure everything was up-to-date.

Then, I ran emerge -B =ceph-10.2.10-r2. This was the first step in my migration, I’d move to the absolute latest Jewel release available in Gentoo. Once it built, I told all three storage nodes to install it (emerge -g =ceph-10.2.10-r2). This was followed up by a re-start of the mon daemons on each node (one at a time), then the mds daemons, finally the osd daemons.

Resolving the “too many placement groups” warning

To resolve this, I first researched the problem. An Internet search lead me to this Stack Overflow post. In it, it was suggested the problem could be alleviated by making a new pool with the correct settings, then copying the images over to it and blowing away the old one.

As it happens, I had an easier solution… move the “transitional” images to OpenNebula. I created empty data blocks in OpenNebula for the three images, then used qemu-img convert -p /path/to/image.img rbd:pool/image to upload the images.

It was then a case of creating a virtual machine template to boot them. I put them in a VLAN with the other servers, and when each one booted, edited the configuration with the new TCP/IP settings.

Once all those were moved across, I blew away the old VMs and the old pool. The warning disappeared, and I was left with a HEALTH_OK message out of Ceph.

The Luminous moment

At this point I was ready to try migrating. I had a good read of the instructions beforehand. They seemed simple enough. I prepared as I did before by updating everything on the system except Ceph, then, telling Portage to build a binary package of Ceph itself.

Then I deployed the binary to the three nodes.

First step was to re-start the monitors… this went smoothly, I just did a /etc/init.d/ceph-mon.${HOST} restart on each one individually, and after a brief moment, quorum was re-established. I then deployed a manager daemon to each one — basically I just “copied” my monitor symbolic link, changing mon to mgr, added it to OpenRC’s list, then started them. No problems.

The OSDs though were still running the Jewel release.

I proceeded as before, trying a re-start of the first OSD. After a while it hadn’t come back…

2019-01-27 14:42:59.745860 7f28fac06e00 -1 filestore(/var/lib/ceph/osd/ceph-0) _detect_fs(1197): deprecated btrfs support is not ena
bled

Ohh bugger, so no btrfs support. This is where the fun began. At this point I was a bit flustered and thought I’d have to either migrate these nodes to XFS, or to BlueStore. So immediately I started looking at the BlueStore migration documentation, as I did not want to risk re-starting the other two OSDs and losing access to my data!

A hasty BlueStore migration

So, I started this by doing the ceph osd set out 0 to start my now downed OSD 0 on the path of migration. The fact it was already down didn’t click with me. I then tried running ceph osd safe-to-destroy 0, only to be told Error EINVAL: (22) Invalid argument.

Uhh ohh, this isn’t good. I waited a bit, but also part of me said: there should be a copy of everything on this node, on at least one of the other two nodes. I had configured it to maintain at least two copies of everything, so even if this node went up in smoke, the data should be recoverable.

With great trepidation, I continued and tried destroying the OSD, then creating a BlueStore one in its place… only to have the ceph-volume command blow up. It couldn’t find the keyring, then when I got that sorted out, it was failing to talk to systemd, then when I found the --no-systemd argument, it still failed because of LVM. I therefore realised I needed two things:

  1. I needed the bootstrap-osd keyring that ceph-deploy normally creates.
  2. The lvmetad daemon must be running.

For (1), this is taken care of with the following commands:

# ceph auth add client.bootstrap-osd --cap mon 'profile bootstrap-osd
# mkdir /var/lib/ceph/bootstrap-osd
# ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring

As for (2), install sys-fs/lvm and add lvmetad to your start-up services. Also add lvm, as you’ll want that at boot. (I learned this later.)

After doing that, the following command worked:

ceph-volume lvm create --bluestore --data /dev/sdb \
--osd-id 0 --no-systemd

The --no-systemd is important on Gentoo with OpenRC as there is no systemctl binary. Once I did that, I found I could start my OSD again. Data recovery began at once. The data recovery was an overnight effort — it took with my hardware until 3PM today to migrate all the placement groups over to the newly re-formatted OSD.

Migrating the other nodes

For now, they still run btrfs. In my “ohh crap” state, I didn’t see the little hint given:

2019-01-27 14:40:55.147888 7f8feb7a2e00 -1 *** experimental feature 'btrfs' is not enabled ***
This feature is marked as experimental, which means it
 - is untested
 - is unsupported
 - may corrupt your data
 - may break your cluster is an unrecoverable fashion
To enable this feature, add this to your ceph.conf:
  enable experimental unrecoverable data corrupting features = btrfs

2019-01-27 14:40:55.147901 7f8feb7a2e00 -1 filestore(/var/lib/ceph/osd/ceph-0) _detect_fs(1197): deprecated btrfs support is not enabled
2019-01-27 14:40:55.147906 7f8feb7a2e00 -1 filestore(/var/lib/ceph/osd/ceph-0) mount(1523): error in _detect_fs: (1) Operation not permitted
2019-01-27 14:40:55.147926 7f8feb7a2e00 -1 osd.0 0 OSD:init: unable to mount object store

Not feeling like a 24-hour wait, I did as it told me:

osd pool default size = 2  # Write an object n times.
osd pool default min size = 1 # Allow writing n copy in a degraded state.
osd pool default pg num = 128
osd pool default pgp num = 128
osd crush chooseleaf type = 1
osd max backfills = 10

# Allow btrfs to work:
enable experimental unrecoverable data corrupting features = btrfs

Now, my other OSDs re-started successfully, and I could finally finish off by restarting the metadata daemons and completing the migration. I’m now left with two OSDs with BTRFS and one with BlueStore.

For now, I’ll leave it that way, next week end, I might migrate a second node to BlueStore.

The reboot test

I needed to ensure the nodes would come back without my intervention. So starting with the two BTRFS nodes, I rebooted each one individually. The OSD on that node first went offline, then the monitor, finally the cluster noticed the metadata and manager services had gone. Then, upon successful boot, the services returned.

So far so good. Now the BlueStore node.

First reboot, my OSD didn’t come back. On investigation, I saw the following logs:

2019-01-28 16:25:59.312369 7fd58d4f0e00 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or
directory
2019-01-28 16:26:14.865883 7fe92f942e00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or
directory
2019-01-28 16:26:30.419863 7fd4fa026e00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory

/var/lib/ceph/osd/ceph-0 was completely empty! Bugger, do I have to endure those 24 hours again? As it happened, no. I don’t know how the files in that directory disappeared, I did observe a tmpfs pseudovolume mounted at that directory earlier when trying to create the OSD … maybe that didn’t get unmounted before OSD creation, anyway, the files were gone.

A bit of digging revealed a ceph-bluestore-tool utility, with options like repair. At first I tried to wing it using that, but no dice. Then looking at the man page I noticed the sub-command prime-osd-dir. BINGO.

At first I threw the raw device at it, but as it happens, ceph-volume had deployed LVM to the raw disk, then put BlueStore on top of that. Starting lvm got the volume group recognised, so I added that to my boot-up services (see why I mentioned it earlier). It had created a sym-link to the LVM volume in /dev/ceph-${UUID1}/osd-block-${UUID2}.

No idea where the two UUIDs came from, but I tried this:

# ceph-bluestore-tool prime-osd-dir \
    --dev /dev/ceph-d62d0d95-2e13-4c59-834d-03a87b88c85e/osd-block-62b4be3e-3935-4d51-ab5c-dde077f99ea3 \
    --path /var/lib/ceph/osd/ceph-0

That populated the directory with files, so I tried again starting the OSD.

2019-01-28 16:59:23.680039 7fd93fcbee00 -1 bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-0/block: (13) Permission denied
2019-01-28 16:59:23.680082 7fd93fcbee00 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2019-01-28 16:59:39.229888 7f4a585b4e00 -1 bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-0/block: (13) Permission denied
2019-01-28 16:59:39.229918 7f4a585b4e00 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory

Ah ha, chown -R ceph:ceph /var/lib/ceph/osd/ceph-0, and all sprang to life. The OSD came up.

Testing the fixes, a second re-boot

Since the OSD now was starting, and working, I did a second re-boot test, only to have history partially repeat itself.

The files were still there this time, but it was failing with a permissions error opening the block device. Sure enough, it was now owned by root.

Changed the permissions, and the OSD came up.

Fixing this was a job for udev:

cat /etc/udev/rules.d/99ceph.rules
SUBSYSTEM=="block", KERNEL=="sda7", OWNER="ceph", GROUP="ceph", MODE="0600"
SUBSYSTEM=="block", ENV{DM_VG_NAME}=="ceph-*", OWNER="ceph", GROUP="ceph", MODE="0600"

The first line is left-over from when /dev/sda7 was my journal. Not sure what I’ll do with this partition now, I’ll think of something (maybe Docker). The second line tells udev to change the permissions on the volume group that Ceph created.

Having done this, I rebooted again. This time, all worked. The OSD came up without my intervention.

Recap

So, the pitfalls I ran across in my Jewel-Luminous migration on Gentoo.

btrfs OSDs

I had btrfs volumes for my OSDs, which are now frowned upon and considered experimental. It isn’t necessary to migrate to BlueStore or XFS straight away, but for the OSDs to boot, you will need the following line in your /etc/ceph/ceph.conf before restarting your OSDs:

enable experimental unrecoverable data corrupting features = btrfs

ceph-volume expects the bootstrap-osd key.

To use ceph-volume, it for some reason expects to see the bootstrap-osd key in a hard-coded location. It won’t work with the default admin key.

This bootstrap key can be generated as follows:

# ceph auth add client.bootstrap-osd --cap mon 'profile bootstrap-osd
# mkdir /var/lib/ceph/bootstrap-osd
# ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring

Before creating a BlueStore OSD, make sure lvmetad and lvm are started (and set to start at boot)

You can get away with just lvmetad for the initial creation, but you’ll want lvm running at boot anyway to ensure all the logical volume groups get started at boot before Ceph goes looking for them.

So before attempting OSD creation, ensure LVM is installed, and set to start at boot.

ceph-osd runs as the ceph user

So your udev rules need to reflect that. Luckily, ceph-volume seems to prefer creating LVM volume groups named ceph-${UUID}. I don’t know what decides the UUID value, but thankfully udev supports globbing. The following udev rule (put it in /etc/udev/rules.d/99ceph.rules or wherever seems appropriate) will keep permissions in check:

SUBSYSTEM=="block", ENV{DM_VG_NAME}=="ceph-*", OWNER="ceph", GROUP="ceph", MODE="0600"

(The above should be all on one line.)

Before rebooting a BlueStore node, back up your OSD data directories

Shouldn’t be strictly necessary, but now I’ve been bitten, I’m going to be taking extra care of that data directory on my other two nodes when I migrate them. I don’t fancy playing around with ceph-bluestore-tool frantically trying to get an OSD back up again.

Dec 142018
 

So recently, I had a melt-down with some of the monitor wiring on the cluster… to counteract that, I have some parts on order (RS Components annoyingly seem to have changed their shipping policies, so I suspect I’ll get them Monday)… namely some thermocouple extension cable, some small 250mA fast-blow fuses and suitable in-line holders.

In the meantime, I’m doing without the power controller, just turning the voltage down on the mains charger so the solar controller did most of the charging.

This, isn’t terribly reliable… and for a few days now my battery voltage has just sat at a flat 12.9V, which is the “boost” voltage set on the mains charger.

Last night we had a little rain, and today I see this:

Battery voltage today… the solar charger is doing some work.

Something got up and boogied this morning, and it was nothing I did to make that happen.  I’ll re-instate that charger, or maybe a control-only version of the #High-power DC-DC power supply which I have the parts for, but haven’t yet built.

Nov 302018
 

It’s been a while since I posted about this project… I haven’t had time to do many changes, just maintaining the current system as it is keeps me busy.

One thing I noticed is that I started getting poor performance out of the solar system late last week.  This was about the time that Sydney was getting the dust storms from Broken Hill.

Last week’s battery voltages (40s moving average)

Now, being in Brisbane, I didn’t think that this was the cause, and the days were largely clear, I was a bit miffed why I was getting such poor performance.  When I checked on the solar system itself on Sunday, I was getting mixed messages looking at the LEDs on the Redarc BCDC-1225.

I thought it was actually playing up, so I tried switching over to the other solar controller to see if that was better (even if I know it’s crap), but same thing.  Neither was charging, yet I had a full 20V available at the solar terminals.  It was a clear day, I couldn’t make sense of it.  On a whim, I checked the fuses on the panels.  All fuses were intact, but one fuse holder had melted!  The fuse holders are these ones from Jaycar.  10A fuses were installed, and they were connected to the terminal blocks using a ~20mm long length of stranded wire about 6mm thick!

This should not have gotten hot.  I looked around on Mouser/RS/Element14, and came up with an order for 3 of these DIN-rail mounted fuse holders, some terminal blocks, and some 10A “midget” fuses.  I figured I’d install these one evening (when the solar was not live).

These arrived yesterday afternoon.

New fuse holders, terminal blocks, and fuses.

However, it was yesterday morning whilst I was having breakfast, I could hear a smoke alarm going off.  At first I didn’t twig to it being our smoke alarm.  I wandered downstairs and caught a whiff of something.  Not silicon, thankfully, but something had burned, and the smoke alarm above the cluster was going berserk.

I took that alarm down off the wall and shoved it it under a doonah to muffle it (seems they don’t test the functionality of the “hush” button on these things), switched the mains off and yanked the solar power.  Checking the cluster, all nodes were up, the switches were both on, there didn’t seem to be anything wrong there.  The cluster itself was fine, running happily.

My power controller was off, at first I thought this odd.  Maybe something burned out there, perhaps the 5V LDO?  A few wires sprang out of the terminal blocks.  A frequent annoyance, as the terminal blocks were not designed for CAT5e-sized wire.

By chance, I happened to run my hand along the sense cable (the unsheathed green pair of a dissected CAT5e cable) to the solar input, and noticed it got hot near the solar socket on the wall.  High current was flowing where high current was not planned for or expected, and the wire’s insulation had melted!  How that happened, I’m not quite sure.  I got some side-cutters, cut the wires at the wall-end of the patch cable and disconnected the power controller.  I’ll investigate it later.

Power controller with crispy wiring

With that rendered safe, I disconnected the mains charger from the battery and wound its float voltage back to about 12.2V, then plugged everything back in and turned everything on.  Things went fine, the solar even behaved itself (in-spite of the melty fuse holder on one panel).

Last night, I tore down the old fuse box, hacked off a length of DIN rail, and set about mounting the new holders.  I had to do away with the backing plate due to clearance issues with the holders and re-locate my isolation switch, but things went okay.

This is the installation of the fuses now:

Fuse holders installed

The re-located isolation switch has left some ugly holes, but we’ll plug those up with time (unless a friendly mud wasp does it for us).

Solar isolation switch re-located, and some holes wanting some putty.

For interest’s sake, this was the old installation, partially dismantled.

Old installation, terminal strips and fuse holders.You can see how the holders were mounted to that plate.  The holder closest to the camera has melted rather badly.  The fuse case itself also melted (but the fuse is still intact).

Melted fuse holder detail

The new holders are rated at 690V AC, 30A, and the fuses are rated to 500V, so I don’t expect to have the same problems.

As for the controller, maybe it’s time to retire that design.  The high-power DC-DC converter project ultimately is the future replacement and a first step may be to build an ATTiny24A-based controller that can poll the current shunt sensors and switch the mains charger on and off that way.

Nov 102018
 

Right now, the cluster is running happily with a Redarc BCDC-1225 solar controller, a Meanwell HEP-600C-12 acting as back-up supply, a small custom-made ATTiny24A-based power controller which manages the Meanwell charger.

The earlier purchased controller, a Powertech MP-3735 now is relegated to the function of over-discharge protection relay.  The device is many times the physical size of a VSR, and isn’t a particularly attractive device for that purpose.  I had tried it recently as a solar controller, but it’s fair to say, it’s rubbish at it.  On a good day, it struggles to keep the battery above “rock bottom” and by about 2PM, I’ll have Grafana pestering me about the battery slipping below the 12V minimum voltage threshold.

Actually, I’d dearly love t rip that Powertech controller apart and see what makes it tick (or not in this case).  It’d be an interesting study in what they did wrong to give such terrible results.

So, if I pull that out, the question is, what will prevent an over-discharge event from taking place?  First, I wish to set some criteria, namely:

  1. it must be able to sustain a continuous load of 30A
  2. it should not induce back-EMF into either the upstream supply or the downstream load when activated or activated
  3. it must disconnect before the battery reaches 10.5V (ideally it should cut off somewhere around 11-11.5V)
  4. it must not draw excessive power whilst in operation at the full load

With that in mind, I started looking at options.  One of the first places I looked was of course, Redarc.  They do have a VSR product, the VS12 which has a small relay in it, rated for 10A, so fails on (1).  I asked on their forums though, and it was suggested that for this task, a contactor, the SBI12, be used to do the actual load shedding.

Now, deep inside the heart of the SBI12 is a big electromechanical contactor.  Many moons ago, working on an electric harvester platform out at Laidley for Mulgowie Farming Company, I recall we were using these to switch the 48V supply to the traction motors in the harvester platform.  The contactors there could switch 400A and the coils were driven from a 12V 7Ah battery, which in the initial phases, were connected using spade lugs.

One day I was a little slow getting the spade lug on, so I was making-breaking-making-breaking contact.  *WHACK*… the contactor told me in no uncertain terms it was not happy with my hesitation and hit me with a nice big back-EMF spike!  I had a tingling arm for about 10 minutes.  Who knows how high that spike was… but it probably is higher than the 20V absolute maximum rating of the MIC29712s used for power regulation.  In fact, there’s a real risk they’ll happily let such a rapidly rising spike straight through to the motherboards, frying about $12000 worth of computers in the process!

Hence why I’m keen to avoid a high back-EMF.  Supposedly the SBI12 “neutralises” this … not sure how, maybe there’s a flywheel diode or MOV in there (like this), or maybe instead of just removing power in a step function, they ramp the current down over a few seconds so that the back-EMF is reduced.  So this isn’t an issue for the SBI12, but may be for other electromechanical contactors.

The other concern is the power consumption needed to keep such a beast activated.  The other factor was how much power these things need to stay actuated.  There’s an initial spike as the magnetic field ramps up and starts drawing the armature of the contactor closed, then it can drop down once contact has been made.  The figures on the SBI12 are ~600mA initially, then ~160mA when holding… give or take a bit.

I don’t expect this to be turned on frequently… my nodes currently have up-times around 172 days.  So while 600mA (7~8W at 12V nominal) is high, that’ll only be for a second at most.  Much of the current will be holding current at, let’s call it 200mA to be safe, so about 2~3W.

That 2-3W is going to be the same, whether my nodes collectively draw 10mA, 10A or 100A.

It seemed like a lot, but then I thought, what about a SSR?  You can buy a 100A DC SSR like this for a lot less money than the big contactors.  Whack a nice big heat-sink on it, and you’re set.  Well, why the heat-sink?  These things have a voltage drop and on resistance.  In the case of the Jaycar one, it’s about 350mV and the on resistance is about 7mΩ.

Suppose we were running flat chat at our predicted 30A maximum…

  • MOSFET switch voltage drop: 30A × 350mV = 10.5W
  • Ron resistance voltage drop: (30A)² × 7mΩ = 6.3W
  • Total power dissipation: 10.5W + 6.3W = 16.8W OUCH!

16.8W is basically the power of an idle compute node.  The 3W of the SBI12 isn’t looking so bad now!  But can we do better?

The function of a solid-state relay, amongst other things, is to provide electrical isolation between the control and switching components.  The two are usually galvanically isolated.  This is a feature I really don’t need, so I could reduce costs by just using a bare MOSFET.

The earlier issues I had with the body diode won’t be a problem here as there’s a definite “source” and “load”, there’ll be no current to flow out of the load back to the source to confuse some sensing circuit on the source side.  This same body diode might be an issue for dual-battery systems, as the auxiliary battery can effectively supply current to a starter motor via this body diode, but in my case, it’s strictly switching a load.

I also don’t have inductive loads on my system, so a P-channel MOSFET is an option.  One candidate for this is the Infineon AUIRFS3004-7P.  The Ron on these is supposedly in the realm of 900µΩ-1.25mΩ, and of course, being that it’s a bare MOSFET and not a SSR, there’s no voltage drop.  Thus my power dissipation at 30A is predicted to be a little over 1W.

There are others too with even smaller Ron values, but they are in teeny tiny 5mm square surface-mount packages.  The AUIRFS3004-7P looks dead-buggable, just bend up the gate pin so I can solder direct to it, and treat the others as single “pins”, then strap the sucker to a big heatsink (maybe an old PIII heatsink will do the trick).

I can either drive this MOSFET with something of my own creation, or with the aforementioned Redarc VS12.  The VS12 still does contain a (much smaller) electromechanical relay, but at 30mA (~400mW), it’s bugger all.

The question though was what else could be done?  @WIRING_SOLUTIONS suggested some units made by Victron Energy.  These do have a nice feature in that they also have over-voltage protection, and conveniently, it’s 16V, which is the maximum recommended for the MIC29712s I’m using.  They’re not badly priced, and are solid-state.

However, what’s the Ron, what’s the voltage drop?  Victron don’t know.  They tell me it’s “minimal”, but is that 100nV, 100mV, 1V?  At 30A, 100mV drop equates to 3W, on par with the SBI12.  A 500mV drop would equate to a whopping 15W!

I had a look at the suppliers for Victron Energy products, and via those, found a few other contenders such as this one by Baintech and the Projecta LVD30.  I haven’t asked about these, but again, like the Victron BatteryProtect, neither of these list a voltage drop or Ron.

There’s also this one from Jaycar, but given this is the same place that sold me the Powertech MP-3735, and sold me the original Powertech MP-3089, provided a replacement for that first one, then also replaced the replacement under RMA.  The Jaycar VSR also has practically no specs… yeah, I think I’ll pass!

Whitworths marine sell this, it might be worth looking at but the cut-out voltage is a little high, and they don’t actually give the holding current (330mA “engage” current sounds like it’s electromechanical), so no idea how much power this would dissipate either.

The power controller isn’t doing a job dissimilar to a VSR… in fact it could be repurposed as one, although I note its voltage readings seem to drift quite a lot.  I suspect this is due to the choice of 5% tolerance resistors on the voltage sensing circuit and my use of the ~1.1V internal voltage reference.  The resistors will drift a little bit, and the voltage reference can be anywhere from 1.0 to 1.2V.

Would a LM311N with good quality 1% resistors and a quality voltage reference be “better”?  Who knows?  Maybe I should try an experiment, see if I can get minimal drift out of a LM311N.  It’s either the resistors, the voltage reference, or a combination of the two that’s responsible for the power controller’s drift.

Perhaps I need to investigate which is causing the problem and see what can be done in the design to reduce it.  If I can get acceptable results, then maybe the VS12 can be dispensed with.  I may be able to do it with another ATTiny24A, or even just a simple LM311N.

Oct 272018
 

So, for the past few weeks I’ve been running a Redarc BCDC-1225 solar controller to keep the batteries charged.  I initially found I had to make my little power controller back off on the mains charger a bit, but was finally able to prove conclusively that the Redarc was able to operate in both boost and float modes.

In the interests of science, I have plugged the Powertech back in.  I have changed nothing else.  What I’m interested to see, is if the Powertech in fact behaves itself, or whether it will go back to its usual tricks.

The following is the last 6 hours.

Next week, particularly Thursday and Friday, are predicted to have similar weather patterns to today. Today’s not a good test, since the battery started at a much higher voltage, so I expect that the solar controller will be doing little more than keeping the battery voltage up to the float set-point.

For reference, the settings on the MP-3735 are: Boost voltage 14.6V, Float voltage 13.8V. These are the recommended settings according to Century’s datasheets for the batteries concerned.

Interestingly, no sooner do I wire this up, but the power controller reaches for the mains. The MP-3735 definitely likes to flip-flop. Here’s a video of its behaviour shortly after connecting up the solar (and after I turned off the mains charger at the wall).

Now looking, it’s producing about 10A, much better than the 2A it was doing whilst filming.  So it can charge properly, when it wants to, but it’s intermittent, and inside you can sometimes hear a quiet clicking noise as if it’s switching a relay.  At 2A it’s wasting time, as the cluster draws nearly 5× that.

The hesitation was so bad, the power controller kicked the mains charger in for about 30 minutes, after that, the MP-3735 seems to be behaving itself.  I guess the answer is, see what it does tomorrow, and later this week without me intervening.

If it behaves itself, I’m happy to leave it there, otherwise I’ll be ordering a VSR, pulling out the Powertech MP-3735 and re-instating the Redarc BCDC-1225 with the VSR to protect against over-discharge.


Update 2018-10-28… okay, overcast for a few hours this morning, but by 11AM it had fined up.  The solar performance however was abysmal.

Let’s see how it goes this week… but I think I might be ordering that VSR and installing the Redarc permanently now.


Today’s effort:

Each one of those vertical lines was accompanied by a warning email.

Oct 062018
 

So, since the last power bill, our energy usage has gone down even further.

No idea what the month-on-month usage is (I haven’t spotted it), but this is a scan from our last bill:

GreenPower? We need no stinkin’ GreenPower!

This won’t take into consideration my tweaks to the controller where I now just bring the mains power in to do top-ups of the battery. These other changes should see yet further reductions in the power bill.