gentoo

Solar Cluster: Adventures in Ceph migration

My cloud computing cluster like all cloud computing clusters of course needs a storage back-end. There were a number of options I could have chosen, but the one I went with in the end was Ceph, and so far, it’s ran pretty well.

Lately though, I was starting to get some odd crashes out of ceph-osd. I was running release 10.2.3, which is quite dated now, this is one of the earlier Jewel releases. Adding to the fun, I’m running btrfs as my filesystem on the OS and the OSD, and I’m running it all on Gentoo. On top of this, my monitor nodes are my OSDs as well.

Not exactly a “supported” configuration, never mind the hacks done at hardware level.

There was also a nagging issue about too many placement groups in the Ceph cluster. When I first established the cluster, I christened it by dragging a few of my lxc containers off the old server and making them VMs in the cluster. This was done using libvirt and virt-manager. These got thrown into a storage pool called transitional-inst, with a VLAN set aside for the VMs to use. When I threw OpenNebula on, I created another Ceph pool, one for its images. The configuration of these lead to the “too many placement groups” warning, which until now, I just ignored.

This weekend was a long weekend, for controversial reasons… and so I thought I’ll take a snapshot of all my VMs, download those snapshots to a HDD as raw images, then see if I can fix these issues, and migrate to Ceph Luminous (v12.2.10) at the same time.

Backing up

I was going to be doing some nasty things to the cluster, so I thought the first thing to do was to back up all images. This was done by using rbd snap create pool/image@date to create a snapshot of an image, then rbd export pool/image@date /path/to/storage/pool-image.img before blowing away the snapshot with rbd snap rm pool/image@date.

This was done for all images on the Ceph cluster, stashing them on a 4TB hard drive I had bought for the purpose.

Getting things ready

My cluster is actually set up as a distcc cluster, with Apache HTTP server instances sharing out distfiles and binary package repositories, so if I build packages on one, I can have the others fetch the binary packages that it built. I started with a node, and got it to update all packages except Ceph. Made sure everything was up-to-date.

Then, I ran emerge -B =ceph-10.2.10-r2. This was the first step in my migration, I’d move to the absolute latest Jewel release available in Gentoo. Once it built, I told all three storage nodes to install it (emerge -g =ceph-10.2.10-r2). This was followed up by a re-start of the mon daemons on each node (one at a time), then the mds daemons, finally the osd daemons.

Resolving the “too many placement groups” warning

To resolve this, I first researched the problem. An Internet search lead me to this Stack Overflow post. In it, it was suggested the problem could be alleviated by making a new pool with the correct settings, then copying the images over to it and blowing away the old one.

As it happens, I had an easier solution… move the “transitional” images to OpenNebula. I created empty data blocks in OpenNebula for the three images, then used qemu-img convert -p /path/to/image.img rbd:pool/image to upload the images.

It was then a case of creating a virtual machine template to boot them. I put them in a VLAN with the other servers, and when each one booted, edited the configuration with the new TCP/IP settings.

Once all those were moved across, I blew away the old VMs and the old pool. The warning disappeared, and I was left with a HEALTH_OK message out of Ceph.

The Luminous moment

At this point I was ready to try migrating. I had a good read of the instructions beforehand. They seemed simple enough. I prepared as I did before by updating everything on the system except Ceph, then, telling Portage to build a binary package of Ceph itself.

Then I deployed the binary to the three nodes.

First step was to re-start the monitors… this went smoothly, I just did a /etc/init.d/ceph-mon.${HOST} restart on each one individually, and after a brief moment, quorum was re-established. I then deployed a manager daemon to each one — basically I just “copied” my monitor symbolic link, changing mon to mgr, added it to OpenRC’s list, then started them. No problems.

The OSDs though were still running the Jewel release.

I proceeded as before, trying a re-start of the first OSD. After a while it hadn’t come back…

2019-01-27 14:42:59.745860 7f28fac06e00 -1 filestore(/var/lib/ceph/osd/ceph-0) _detect_fs(1197): deprecated btrfs support is not ena
bled

Ohh bugger, so no btrfs support. This is where the fun began. At this point I was a bit flustered and thought I’d have to either migrate these nodes to XFS, or to BlueStore. So immediately I started looking at the BlueStore migration documentation, as I did not want to risk re-starting the other two OSDs and losing access to my data!

A hasty BlueStore migration

So, I started this by doing the ceph osd set out 0 to start my now downed OSD 0 on the path of migration. The fact it was already down didn’t click with me. I then tried running ceph osd safe-to-destroy 0, only to be told Error EINVAL: (22) Invalid argument.

Uhh ohh, this isn’t good. I waited a bit, but also part of me said: there should be a copy of everything on this node, on at least one of the other two nodes. I had configured it to maintain at least two copies of everything, so even if this node went up in smoke, the data should be recoverable.

With great trepidation, I continued and tried destroying the OSD, then creating a BlueStore one in its place… only to have the ceph-volume command blow up. It couldn’t find the keyring, then when I got that sorted out, it was failing to talk to systemd, then when I found the --no-systemd argument, it still failed because of LVM. I therefore realised I needed two things:

  1. I needed the bootstrap-osd keyring that ceph-deploy normally creates.
  2. The lvmetad daemon must be running.

For (1), this is taken care of with the following commands:

# ceph auth add client.bootstrap-osd --cap mon 'profile bootstrap-osd
# mkdir /var/lib/ceph/bootstrap-osd
# ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring

As for (2), install sys-fs/lvm and add lvmetad to your start-up services. Also add lvm, as you’ll want that at boot. (I learned this later.)

After doing that, the following command worked:

ceph-volume lvm create --bluestore --data /dev/sdb \
--osd-id 0 --no-systemd

The --no-systemd is important on Gentoo with OpenRC as there is no systemctl binary. Once I did that, I found I could start my OSD again. Data recovery began at once. The data recovery was an overnight effort — it took with my hardware until 3PM today to migrate all the placement groups over to the newly re-formatted OSD.

Migrating the other nodes

For now, they still run btrfs. In my “ohh crap” state, I didn’t see the little hint given:

2019-01-27 14:40:55.147888 7f8feb7a2e00 -1 *** experimental feature 'btrfs' is not enabled ***
This feature is marked as experimental, which means it
 - is untested
 - is unsupported
 - may corrupt your data
 - may break your cluster is an unrecoverable fashion
To enable this feature, add this to your ceph.conf:
  enable experimental unrecoverable data corrupting features = btrfs

2019-01-27 14:40:55.147901 7f8feb7a2e00 -1 filestore(/var/lib/ceph/osd/ceph-0) _detect_fs(1197): deprecated btrfs support is not enabled
2019-01-27 14:40:55.147906 7f8feb7a2e00 -1 filestore(/var/lib/ceph/osd/ceph-0) mount(1523): error in _detect_fs: (1) Operation not permitted
2019-01-27 14:40:55.147926 7f8feb7a2e00 -1 osd.0 0 OSD:init: unable to mount object store

Not feeling like a 24-hour wait, I did as it told me:

osd pool default size = 2  # Write an object n times.
osd pool default min size = 1 # Allow writing n copy in a degraded state.
osd pool default pg num = 128
osd pool default pgp num = 128
osd crush chooseleaf type = 1
osd max backfills = 10

# Allow btrfs to work:
enable experimental unrecoverable data corrupting features = btrfs

Now, my other OSDs re-started successfully, and I could finally finish off by restarting the metadata daemons and completing the migration. I’m now left with two OSDs with BTRFS and one with BlueStore.

For now, I’ll leave it that way, next week end, I might migrate a second node to BlueStore.

The reboot test

I needed to ensure the nodes would come back without my intervention. So starting with the two BTRFS nodes, I rebooted each one individually. The OSD on that node first went offline, then the monitor, finally the cluster noticed the metadata and manager services had gone. Then, upon successful boot, the services returned.

So far so good. Now the BlueStore node.

First reboot, my OSD didn’t come back. On investigation, I saw the following logs:

2019-01-28 16:25:59.312369 7fd58d4f0e00 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or
directory
2019-01-28 16:26:14.865883 7fe92f942e00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or
directory
2019-01-28 16:26:30.419863 7fd4fa026e00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory

/var/lib/ceph/osd/ceph-0 was completely empty! Bugger, do I have to endure those 24 hours again? As it happened, no. I don’t know how the files in that directory disappeared, I did observe a tmpfs pseudovolume mounted at that directory earlier when trying to create the OSD … maybe that didn’t get unmounted before OSD creation, anyway, the files were gone.

A bit of digging revealed a ceph-bluestore-tool utility, with options like repair. At first I tried to wing it using that, but no dice. Then looking at the man page I noticed the sub-command prime-osd-dir. BINGO.

At first I threw the raw device at it, but as it happens, ceph-volume had deployed LVM to the raw disk, then put BlueStore on top of that. Starting lvm got the volume group recognised, so I added that to my boot-up services (see why I mentioned it earlier). It had created a sym-link to the LVM volume in /dev/ceph-${UUID1}/osd-block-${UUID2}.

No idea where the two UUIDs came from, but I tried this:

# ceph-bluestore-tool prime-osd-dir \
    --dev /dev/ceph-d62d0d95-2e13-4c59-834d-03a87b88c85e/osd-block-62b4be3e-3935-4d51-ab5c-dde077f99ea3 \
    --path /var/lib/ceph/osd/ceph-0

That populated the directory with files, so I tried again starting the OSD.

2019-01-28 16:59:23.680039 7fd93fcbee00 -1 bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-0/block: (13) Permission denied
2019-01-28 16:59:23.680082 7fd93fcbee00 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2019-01-28 16:59:39.229888 7f4a585b4e00 -1 bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-0/block: (13) Permission denied
2019-01-28 16:59:39.229918 7f4a585b4e00 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory

Ah ha, chown -R ceph:ceph /var/lib/ceph/osd/ceph-0, and all sprang to life. The OSD came up.

Testing the fixes, a second re-boot

Since the OSD now was starting, and working, I did a second re-boot test, only to have history partially repeat itself.

The files were still there this time, but it was failing with a permissions error opening the block device. Sure enough, it was now owned by root.

Changed the permissions, and the OSD came up.

Fixing this was a job for udev:

cat /etc/udev/rules.d/99ceph.rules
SUBSYSTEM=="block", KERNEL=="sda7", OWNER="ceph", GROUP="ceph", MODE="0600"
SUBSYSTEM=="block", ENV{DM_VG_NAME}=="ceph-*", OWNER="ceph", GROUP="ceph", MODE="0600"

The first line is left-over from when /dev/sda7 was my journal. Not sure what I’ll do with this partition now, I’ll think of something (maybe Docker). The second line tells udev to change the permissions on the volume group that Ceph created.

Having done this, I rebooted again. This time, all worked. The OSD came up without my intervention.

Recap

So, the pitfalls I ran across in my Jewel-Luminous migration on Gentoo.

btrfs OSDs

I had btrfs volumes for my OSDs, which are now frowned upon and considered experimental. It isn’t necessary to migrate to BlueStore or XFS straight away, but for the OSDs to boot, you will need the following line in your /etc/ceph/ceph.conf before restarting your OSDs:

enable experimental unrecoverable data corrupting features = btrfs

ceph-volume expects the bootstrap-osd key.

To use ceph-volume, it for some reason expects to see the bootstrap-osd key in a hard-coded location. It won’t work with the default admin key.

This bootstrap key can be generated as follows:

# ceph auth add client.bootstrap-osd --cap mon 'profile bootstrap-osd
# mkdir /var/lib/ceph/bootstrap-osd
# ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring

Before creating a BlueStore OSD, make sure lvmetad and lvm are started (and set to start at boot)

You can get away with just lvmetad for the initial creation, but you’ll want lvm running at boot anyway to ensure all the logical volume groups get started at boot before Ceph goes looking for them.

So before attempting OSD creation, ensure LVM is installed, and set to start at boot.

ceph-osd runs as the ceph user

So your udev rules need to reflect that. Luckily, ceph-volume seems to prefer creating LVM volume groups named ceph-${UUID}. I don’t know what decides the UUID value, but thankfully udev supports globbing. The following udev rule (put it in /etc/udev/rules.d/99ceph.rules or wherever seems appropriate) will keep permissions in check:

SUBSYSTEM=="block", ENV{DM_VG_NAME}=="ceph-*", OWNER="ceph", GROUP="ceph", MODE="0600"

(The above should be all on one line.)

Before rebooting a BlueStore node, back up your OSD data directories

Shouldn’t be strictly necessary, but now I’ve been bitten, I’m going to be taking extra care of that data directory on my other two nodes when I migrate them. I don’t fancy playing around with ceph-bluestore-tool frantically trying to get an OSD back up again.

6LoWHAM: Digging into LinBPQ

So today I was meant to be helping re-build a deck, but that got postponed to next weekend.  Thus, I had an extra free day I wasn’t counting on.

I wound up looking at LinBPQ in detail, to see if I can get it to run.  I downloaded the sources, and sure enough, they do compile on my x86-64 laptop, but does it work?  Not a chance.  Starts parsing the configuration file, then boompa, SEGFAULT.

I run the binary through gdb, and see this:

GNU gdb (Gentoo 8.1 p1) 8.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://bugs.gentoo.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/stuartl/projects/6lowham/linbpq/linbpq...done.
(gdb) r
Starting program: /home/stuartl/projects/6lowham/linbpq/linbpq 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
G8BPQ AX25 Packet Switch System Version 6.0.17.1 November 2018
Copyright � 2001-2018 John Wiseman G8BPQ
Current Directory is /var/lib/linbpq

Configuration file Preprocessor.
Using Configuration file /var/lib/linbpq/bpq32.cfg
Conversion (probably) successful


Program received signal SIGSEGV, Segmentation fault.
0x00005555555f8a7b in Start () at cMain.c:1190
1190                    *(ptr3++) = *(ptr2++);
(gdb) bt full
#0  0x00005555555f8a7b in Start () at cMain.c:1190
        cfg = 0x555555b91c40
        ptr1 = 0x555555ba60c0
        PORT = 0x5555558f6aa0 
        FULLPORT = 0x558f7928
        NEXTPORT = 0x5555558f6de0 <DATAAREA+832>
        EXTPORT = 0x7ffff6eb7953 <_IO_file_overflow+291>
        APPL = 0x5555558f49e0 
        ROUTE = 0x559085e8
        DEST = 0x870b07e2ddd5f300
        CMD = 0x5555558d79e0 
        PortSlot = 2
        ptr2 = 0x555555ba6849 "K4MSL Test station \r"
        ptr3 = 0x55912549 
        ptr4 = 0x5555558d7183 <COMMANDS+1667> "         \003"
        CWPTR = 0x5555558f6b18 <DATAAREA+120>
        i = 0
        n = 119
        int3 = 1435466024
#1  0x000055555563e35c in main (argc=1, argv=0x7fffffffe518) at LinBPQ.c:598
        i = 1
        user = 0x0
        conn = 0x7ffff7ffa298
        STAT = {st_dev = 140737354131120, st_ino = 140737488347784, st_nlink = 140737488347780, st_mode = 4160741648, 
          st_uid = 32767, st_gid = 4143745959, __pad0 = 32767, st_rdev = 140737488348192, st_size = 140737488347784, 
          st_blksize = 1700966438, st_blocks = 26577600, st_atim = {tv_sec = 140737354113688, tv_nsec = 140737488348000}, 
          st_mtim = {tv_sec = 140737354113448, tv_nsec = 140737488347780}, st_ctim = {tv_sec = 140737488347984, 
            tv_nsec = 140737354131160}, __glibc_reserved = {1, 4150715120, 0}}
        PORTVEC = 0x7ffff7ffe6b0

Ookay then… so invalid pointers, what fun!  More to the point, have a close look at the underlined addresses… I’m beginning to understand why it was called BPQ32.

The culprit for this wound up being little gems like this:

			//	Round to word boundary (for ARM5 etc)

			int3 = (int)ptr3;
			int3 += 3;
			int3 &= 0xfffffffc;
			ptr3 = (UCHAR *)int3;

There were a few other instances of this, and variations on the theme too, but one way or the other, linbpq basically assumes that all pointers are 32-bits, and so are ints.

Four hours later, I finally had something that started, but there are probably lots of landmines for anyone running the binary to inadvertently stomp on.  The code is pointer-arithmetic city!  Much of the time, code is casting pointers to unsigned int, or back again.  If I submitted code like that at work, they’d have me hauled ’round the back of the building and shot!

I’m left wondering if it’s worth getting to understand, or should I shove it in a VM, write some code based on my understanding of the protocols, do some integration testing with it, then abandon LinBPQ for something I can have confidence in.

The use and re-use of certain variables makes me wonder if the code is actually a port from the DOS-based BPQCode which was likely written in 8086 assembler.  This would make a lot of sense as to why I’m seeing the sorts of software coding patterns I’m seeing in that code.  The logic seems to have been ported to C just enough to get it to compile and work like the assembly version.

Reasonable enough… but there’s a lot of technical debt there still waiting to be paid back.  On paper, there’s a lot of benefit in using LinBPQ as the back-end, and I am thankful that John Wiseman made the decision to release the code under the GPLv3 so that I can at least investigate the possibility of using that code here.

I’ve thrown what I’ve got up on Github for now, and there’s a Gentoo overlay for installing it.  Add the overlay and run emerge linbpq, and you should find yourself with an installation of LinBPQ that just needs some OpenRC scripts and some work with an editor on /var/lib/linbpq/bpq32.cfg to get going.

If I get further on the code front, I might look at some init scripts, both OpenRC and systemd ones, then I can produce a few Debian binaries so you can run apt-get install linbpq on your Raspberry Pi and have a packet station going quickly.

Rescuing a btrfs partition from a failed “move”

Today it seems, the IT gremlins have been out to get me.  At my work I have a desktop computer (personal hardware) consisting of a Rysen 7 1700, 16GB RAM, a 240GB Intel M.2 SATA SSD (540 series) and a 4TB Western Digital HDD.

The machine has been, pretty reliable, not rock-solid, in particular, compiling gcc sometimes segfaulted for reasons unknown (the RAM checks out okay according to memtest86), but for what I was doing, it mostly ran fine.  I put up with the minor niggles with the view of solving those another day.  Today though, I come in and find X has crashed.

Okay, no big deal, re-start the display manager, except that crashed too.

Hmm, okay, log in under my regular user account and try startx:  No dice, there’s no room on /.

Ahh, that might explain a few things, we clean up some log files, truncate a 500MB file, manage to free up 50GB (!).

The machine dual-boots two OSes: Debian 9 and Gentoo.  It’s been running the latter for about 12 months now, I used Debian 9 to get things rolling so I could use the machine at work (did try Ubuntu 16.04, but it didn’t like my machine), and later, used that to get Gentoo running before switching over.  So there was a 40GB partition on the SSD that had a year-old install of Debian that wasn’t being used.  I figured I’d ditch it, and re-locate my Gentoo partition to occupy that space.

So I pull out an Ubuntu 18.04 disc, boot that up, and get gparted going.  It’s happily copying, until WHAM, I was hit with an I/O error:

Failed re-location of partition (click to enlarge)

Clicking any of the three buttons resulted in the same message.  Brilliant.  I had just copied over the first 15GB of the partition, so the Debian install would be hosed (I was deleting it anyway), but my Gentoo root partition should still be there intact at its old location.  Of course the partition table was updated, so no rolling back there.  At this point, I couldn’t do anything with the SSD, it had completely stalled, and I just had to cut my losses and kill gparted.

I managed to make some room on the 4TB drive shuffling some partitions around so I could install Ubuntu 18.04 there.  My /home partition was btrfs on the 4TB drive (first partition), the rest of that drive was LVM.  I just shrank my /home down by 40GB and slipped it in there.  The boot-loader didn’t install (no EFI partition), but who cares, I know enough of grub to boot from the DVD and bootstrap the machine that way.  At first it wouldn’t boot because in their wisdom, they created the root partition with a @ subvolume.  I worked around that by making the @ subvolume the default.

Then there was momentary panic when the /home partition I had specified lacked my usual files.  Turned out, they had created a @home subvolume on my existing /home partition.  Why? Who knows?  Debian/Ubuntu seem to do strange things with btrfs which do nothing but complicate matters and I do not understand the reasoning.  Editing /etc/fstab to remove the subvolume argument for /home and re-booting fixed that.

I set up a LVM volume that would receive a DD dump of the mangled partition to see what could be saved.  GNU’s ddrescue managed to recover most of the raw partition, and so now I just had to find where the start was.  If I had the output of fdisk -l before I started, I’d be right, but I didn’t have that foresight.  (Maybe if I had just formatted a LVM volume and DD’d the root fs before involving gparted?  Never mind!)

I figured there’d be some kind of magic bytes I could “grep” for.  Something that would tell me “BTRFS was here”.  Sure enough, the information is stashed in the superblock.  At 0x00010040 from the start of the partition, I should see the magic bytes 5f 42 47 52 66 53 5f 4d.  I just needed to grep for these.  To speed things up I made an educated guess on the start-location.  The screenshot says the old partition was about 37.25GB in size, so that was a hint to maybe try skipping that bit and see what could be seen.

Sure enough, I found what looked to be the superblock:

root@vk4msl-ws:~# dd if=/dev/mapper/scratch-rootbackup skip=38100 count=200 bs=1M | hexdump -C | grep '5f 42 48 52 66 53 5f 4d'
02e10040  5f 42 48 52 66 53 5f 4d  9d 30 0d 02 00 00 00 00  |_BHRfS_M.0......|
06e00040  5f 42 48 52 66 53 5f 4d  9d 30 0d 02 00 00 00 00  |_BHRfS_M.0......|
200+0 records in
200+0 records out

Some other probes seem to confirm this, my quarry seemed to start 38146MB into the now-merged partition.  I start copying that to a new LVM volume with the hope of being able to mount it:

root@vk4msl-ws:~# dd if=/dev/mapper/scratch-rootbackup of=/dev/mapper/scratch-gentoo--root bs=1M skip=38146

Whilst waiting for this to complete, I double-checked my findings, by inspecting the other fields. From the screenshot, I know my filesystem UUID was 6513-682e-7182-4474-89e6-c0d1c71866ad. Looking at the superblock, sure enough I see that listed:

root@vk4msl-ws:~# dd if=/dev/scratch/gentoo-root bs=$(( 0x10000 )) skip=1 count=1 | hexdump -C
1+0 records in
1+0 records out
00000000  5f f9 98 90 00 00 00 00  00 00 00 00 00 00 00 00  |_...............|
65536 bytes (66 kB, 64 KiB) copied, 0.000116268 s, 564 MB/s
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  65 13 68 2e 71 82 44 74  89 e6 c0 d1 c7 19 66 ad  |e.h.q.Dt......f.|
00000030  00 00 01 00 00 00 00 00  01 00 00 00 00 00 00 00  |................|
00000040  5f 42 48 52 66 53 5f 4d  9d 30 0d 02 00 00 00 00  |_BHRfS_M.0......|
00000050  00 00 32 da 32 00 00 00  00 00 02 00 00 00 00 00  |..2.2...........|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Looks promising! After an agonising wait, the dd finishes. I can check the filesystem:

root@vk4msl-ws:~# btrfsck /dev/scratch/gentoo-root 
Checking filesystem on /dev/scratch/gentoo-root
UUID: 6513682e-7182-4474-89e6-c0d1c71966ad
checking extents
checking free space cache
block group 111690121216 has wrong amount of free space
failed to load free space cache for block group 111690121216
block group 161082245120 has wrong amount of free space
failed to load free space cache for block group 161082245120
checking fs roots
checking csums
checking root refs
found 107544387643 bytes used, no error found
total csum bytes: 99132872
total tree bytes: 6008504320
total fs tree bytes: 5592694784
total extent tree bytes: 271663104
btree space waste bytes: 1142962475
file data blocks allocated: 195274670080
 referenced 162067775488

Okay, it complained that the free space was wrong (which I’ll blame on gparted prematurely growing the partition), but the data is there!  This is confirmed by mounting the volume and doing a ls:

root@vk4msl-ws:~# mount /dev/scratch/gentoo-root /mnt/
root@vk4msl-ws:~# ls /mnt/ -l
total 4
drwxr-xr-x 1 root root 1020 Oct  7 14:13 bin
drwxr-xr-x 1 root root   18 Jul 21  2017 boot
drwxr-xr-x 1 root root   16 May 28 10:29 dbus-1
drwxr-xr-x 1 root root 1686 May 31  2017 dev
drwxr-xr-x 1 root root 3620 Oct 19 18:53 etc
drwxr-xr-x 1 root root    0 Jul 14  2017 home
lrwxrwxrwx 1 root root    5 Sep 17 09:20 lib -> lib64
drwxr-xr-x 1 root root 1156 Oct  7 13:59 lib32
drwxr-xr-x 1 root root 4926 Oct 13 05:13 lib64
drwxr-xr-x 1 root root   70 Oct 19 11:52 media
drwxr-xr-x 1 root root   28 Apr 23 13:18 mnt
drwxr-xr-x 1 root root  336 Oct  9 07:27 opt
drwxr-xr-x 1 root root    0 May 31  2017 proc
drwx------ 1 root root  390 Oct 22 06:07 root
drwxr-xr-x 1 root root   10 Jul  6  2017 run
drwxr-xr-x 1 root root 4170 Oct  9 07:57 sbin
drwxr-xr-x 1 root root   10 May 31  2017 sys
drwxrwxrwt 1 root root 6140 Oct 22 06:07 tmp
drwxr-xr-x 1 root root  304 Oct 19 18:20 usr
drwxr-xr-x 1 root root  142 May 17 12:36 var
root@vk4msl-ws:~# cat /mnt/etc/gentoo-release 
Gentoo Base System release 2.4.1

Yes, I’ll be backing this up properly RIGHT NOW. But, my data is back, and I’ll be storing this little data recovery technique for next time.

The real lesson here is:

  1. KEEP RELIABLE BACKUPS! You never know when something will fail.
  2. Catch the copy process before it starts overwriting your source data! If there’s no overlap between the old and new locations, you’re fine, but if there is and it starts overwriting the start of your original volume, it’s all over red rover! You might be lucky with a superblock back-up, but don’t bet on it!
  3. Make note of the filesystem type and its approximate location. The fact that I knew roughly where to look, and what sort of filesystem I was looking for meant I could look for magic bytes that say “I’m a BTRFS filesystem”. The magic bytes for EXT4, XFS, etc will differ, but the same concepts are there, you just have to look up the documentation on how your particular filesystem structures its data.

Solar Cluster: Battery monitor PC now running Gentoo

So, after some argument, and a bit of sitting on a concrete floor with the netbook, I managed to get Gentoo loaded onto the TS-7670.  Right now it’s running off the MicroSD card, I’ll get things right, then shift it across to eMMC.

ts7670 ~ # emerge --info
Portage 2.3.40 (python 3.5.5-final-0, default/linux/musl/arm/armv7a, gcc-6.4.0, musl-1.1.19, 4.14.15-vrt-ts7670-00031-g1a006273f907-dirty armv5tejl)
=================================================================
System uname: Linux-4.14.15-vrt-ts7670-00031-g1a006273f907-dirty-armv5tejl-ARM926EJ-S_rev_5_-v5l-with-gentoo-2.4.1
KiB Mem:      111532 total,     13136 free
KiB Swap:    4194300 total,   4191228 free
Timestamp of repository gentoo: Fri, 17 Aug 2018 16:45:01 +0000
Head commit of repository gentoo: 563622899f514c21f5b7808cb50f6e88dbd7d7de
sh bash 4.4_p12
ld GNU ld (Gentoo 2.30 p2) 2.30.0
app-shells/bash:          4.4_p12::gentoo
dev-lang/perl:            5.24.3-r1::gentoo
dev-lang/python:          2.7.14-r1::gentoo, 3.5.5::gentoo
dev-util/pkgconfig:       0.29.2::gentoo
sys-apps/baselayout:      2.4.1-r2::gentoo
sys-apps/openrc:          0.34.11::gentoo
sys-apps/sandbox:         2.13::musl
sys-devel/autoconf:       2.69-r4::gentoo
sys-devel/automake:       1.15.1-r2::gentoo
sys-devel/binutils:       2.30-r2::gentoo
sys-devel/gcc:            6.4.0-r1::musl
sys-devel/gcc-config:     1.8-r1::gentoo
sys-devel/libtool:        2.4.6-r3::gentoo
sys-devel/make:           4.2.1::gentoo
sys-kernel/linux-headers: 4.13::musl (virtual/os-headers)
sys-libs/musl:            1.1.19::gentoo
Repositories:

gentoo
    location: /usr/portage
    sync-type: rsync
    sync-uri: rsync://virtatomos.longlandclan.id.au/gentoo-portage
    priority: -1000
    sync-rsync-verify-jobs: 1
    sync-rsync-extra-opts: 
    sync-rsync-verify-metamanifest: yes
    sync-rsync-verify-max-age: 24

ACCEPT_KEYWORDS="arm"
ACCEPT_LICENSE="* -@EULA"
CBUILD="arm-unknown-linux-musleabi"
CFLAGS="-Os -pipe -march=armv5te -mtune=arm926ej-s -mfloat-abi=soft"
CHOST="arm-unknown-linux-musleabi"
CONFIG_PROTECT="/etc /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/gconf /etc/gentoo-release /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-Os -pipe -march=armv5te -mtune=arm926ej-s -mfloat-abi=soft"
DISTDIR="/home/portage/distfiles"
ENV_UNSET="DBUS_SESSION_BUS_ADDRESS DISPLAY PERL5LIB PERL5OPT PERLPREFIX PERL_CORE PERL_MB_OPT PERL_MM_OPT XAUTHORITY XDG_CACHE_HOME XDG_CONFIG_HOME XDG_DATA_HOME XDG_RUNTIME_DIR"
FCFLAGS="-O2 -pipe -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=hard"
FEATURES="assume-digests binpkg-logs config-protect-if-modified distlocks ebuild-locks fixlafiles merge-sync multilib-strict news parallel-fetch preserve-libs protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync xattr"
FFLAGS="-O2 -pipe -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=hard"
GENTOO_MIRRORS=" http://virtatomos.longlandclan.id.au/portage http://mirror.internode.on.net/pub/gentoo http://ftp.swin.edu.au/gentoo http://mirror.aarnet.edu.au/pub/gentoo"
INSTALL_MASK="charset.alias"
LANG="en_AU.UTF-8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages --exclude=/.git"
PORTAGE_TMPDIR="/var/tmp"
USE="arm bindist cli crypt cxx dri fortran iconv ipv6 modules ncurses nls nptl openmp pam pcre readline seccomp ssl tcpd unicode xattr zlib" APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="karbon plan sheets stage words" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="musl" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock isync itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf skytraq superstar2 timing tsip tripmate tnt ublox ubx" INPUT_DEVICES="libinput keyboard mouse" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer" OFFICE_IMPLEMENTATION="libreoffice" PHP_TARGETS="php5-6 php7-0" POSTGRES_TARGETS="postgres9_5 postgres10" PYTHON_SINGLE_TARGET="python3_6" PYTHON_TARGETS="python2_7 python3_6" RUBY_TARGETS="ruby23" USERLAND="GNU" VIDEO_CARDS="dummy fbdev v4l" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CC, CPPFLAGS, CTARGET, CXX, EMERGE_DEFAULT_OPTS, LC_ALL, LINGUAS, MAKEOPTS, PORTAGE_BINHOST, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS

I still have to update the kernel.  I actually did get kernel 4.18 to boot, but I forgot to add in support for the watchdog, so U-Boot tickled it, then the watchdog got hungry and kicked the reset half way through the boot sequence.

Rolling back to my older 4.14 kernel works.  I’ll try again with 4.18.5 in a moment.  Failing that, I have also brought the 4.14 patches up to 4.14.69 which is the latest LTS release of the kernel.

I’ve started looking at the power supply sysfs device class, with a view to exposing the supply voltage via sysfs.  The thinking here is that collectd supports reading this via the “battery” module (and realistically, it is a battery that is being measured: two 105Ah AGMs).

Worst case is I do something a little proprietary and deal with it in user space.  I’ll have to dig up the Linux kernel tree I did for Jacques Electronics all those years ago, as that had some examples of interfacing sysfs to a Cypress PSOC device that was acting as an I²C slave.  Rather than using an off-the-shelf solution, they programmed up a MCU that did power management, touchscreen sensing, keypad sensing, RGB LED control and others, all in one chip.  (Fun to try and interface that to the Linux kernel.)

Technologic Systems appear to have done something similar.  The device ID 0x78 implies a 10-bit device, but I think they’re just squatting on that 7-bit address.  They hail 0x78 then read out 4 bytes, which the last two bytes are the supply voltage ADC readings.  They do their own byte swapping before scaling the value to get mV.

Solar Cluster: Gentoo Linux MUSL for ARMv5 finally built

It’s taken several months and had a few false starts, but at long last I have some stage tarballs for Gentoo Linux MUSL for ARMv5 processors.  I’m not the only one wanting such a port, even looking for my earlier thread on the matter, I stumbled on this post.  (Google translate is hopeless with Russian, but I can get the gist of what’s being said.)

This was natively built on the TS-7670 using an external hard drive connected over USB 2.0 as both swap and the chroot.  It took two passes to clean everything up and get something that’s in a “release-able” state.

I think my next steps now will be:

  • Build an updated kernel … I might see if I can expose that I²C register via a sysfs file or something that collectd can pick up whilst I’m at it.  I have the kernel sources and bootloader sources.
  • Prepare the 32GB MicroSD card I bought a few weeks back with the needed partitions and load Gentoo onto that.
  • Install the MicroSD card and boot off it.
  • Back up the eMMC
  • Re-format the eMMC and copy the MicroSD card to it.

It’s supposed to be wet this weekend, so it sounds like a good project for indoors.

Solar Cluster: arm-unknown-linux-musleabi… saga part IV

So, at long last, I finally saw this in my chroot‘s /var/log/emerge.log:

1524887925: Started emerge on: Apr 28, 2018 03:58:45
1524887926:  *** emerge --oneshot sys-devel/gcc::musl
1524888211:  >>> emerge (1 of 1) sys-devel/gcc-7.3.0 to /
1524888212:  === (1 of 1) Cleaning (sys-devel/gcc-7.3.0::/root/musl/sys-devel/gcc/gcc-7.3.0.ebuild)
1524888307:  === (1 of 1) Compiling/Packaging (sys-devel/gcc-7.3.0::/root/musl/sys-devel/gcc/gcc-7.3.0.ebuild)
1525472690:  === (1 of 1) Merging (sys-devel/gcc-7.3.0::/root/musl/sys-devel/gcc/gcc-7.3.0.ebuild)
1525472838:  >>> AUTOCLEAN: sys-devel/gcc:7.3.0
1525473358:  === (1 of 1) Post-Build Cleaning (sys-devel/gcc-7.3.0::/root/musl/sys-devel/gcc/gcc-7.3.0.ebuild)
1525473358:  ::: completed emerge (1 of 1) sys-devel/gcc-7.3.0 to /
1525473360:  *** Finished. Cleaning up...
1525473373:  *** exiting successfully.

That’s 6 days, 18 hours and 32 minutes, of solid compiling. BUT WE GOT THERE!

What’s left? This:

Calculating dependencies... done!
[ebuild     U  ] sys-libs/musl-1.1.19 [1.1.18]
[binary   R    ] sys-libs/zlib-1.2.11-r1
[binary   R    ] app-arch/xz-utils-5.2.3
[ebuild     U  ] sys-libs/ncurses-6.1-r2 [6.0-r1]
[binary   R    ] sys-libs/readline-7.0_p3
[binary   R    ] virtual/libintl-0-r2
[binary   R    ] dev-lang/python-exec-2.4.5
[binary   R    ] virtual/libiconv-0-r2
[binary   R    ] sys-apps/gentoo-functions-0.12
[binary   R    ] dev-libs/libpcre-8.41-r1
[binary   R    ] sys-apps/sed-4.2.2
[binary   R    ] app-arch/bzip2-1.0.6-r8
[binary   R    ] dev-libs/gmp-6.1.2
[binary   R    ] app-shells/bash-4.4_p12
[binary   R    ] sys-apps/file-5.32
[binary   R    ] sys-devel/gnuconfig-20170101
[binary   R    ] dev-libs/mpfr-3.1.6
[binary   R    ] app-misc/c_rehash-1.7-r1
[binary   R    ] app-misc/mime-types-9
[binary   R    ] app-arch/tar-1.29-r3
[binary   R    ] app-arch/gzip-1.8
[binary   R    ] dev-libs/mpc-1.0.3
[binary   R    ] sys-devel/gcc-config-1.8-r1
[binary   R    ] app-misc/editor-wrapper-4
[binary   R    ] sys-apps/less-529
[binary   R    ] sys-apps/debianutils-4.8.3
[binary   R    ] net-libs/libmnl-1.0.4
[binary   R    ] sys-libs/libseccomp-2.3.2
[binary   R    ] dev-libs/popt-1.16-r2
[binary   R    ] sys-libs/e2fsprogs-libs-1.43.6
[binary   R    ] sys-devel/binutils-config-5-r4
[binary   R    ] dev-libs/libffi-3.2.1
[binary   R    ] virtual/libffi-3.0.13-r1
[binary   R    ] sys-apps/sysvinit-2.88-r9
[binary   R    ] sys-apps/opentmpfiles-0.1.3
[binary   R    ] virtual/tmpfiles-0
[binary   R    ] app-text/manpager-1
[binary   R    ] sys-libs/cracklib-2.9.6-r1
[binary   R    ] sys-apps/install-xattr-0.5
[binary   R    ] app-editors/nano-2.8.7
[binary   R    ] app-portage/elt-patches-20170815
[binary   R    ] sys-devel/m4-1.4.17
[binary   R    ] app-arch/unzip-6.0_p21-r2
[binary   R    ] sys-devel/autoconf-wrapper-13
[binary   R    ] sys-devel/bison-3.0.4-r1
[binary   R    ] sys-devel/flex-2.6.4-r1
[binary   R    ] dev-libs/libltdl-2.4.6
[binary   R    ] sys-devel/automake-wrapper-10
[binary   R    ] app-text/sgml-common-0.6.3-r6
[binary   R    ] dev-libs/libgpg-error-1.27-r1
[ebuild  N     ] dev-lang/perl-5.24.3-r1  USE="-berkdb -debug -doc -gdbm -ithreads"
[ebuild  N     ] sys-kernel/linux-headers-4.13  USE="-headers-only"
[ebuild  N     ] virtual/perl-Data-Dumper-2.160.0-r1
[ebuild  N     ] virtual/perl-Test-Harness-3.360.100_rc-r3
[ebuild  N     ] perl-core/File-Temp-0.230.400-r1
[ebuild  N     ] virtual/perl-File-Temp-0.230.400-r5
[ebuild  N     ] perl-core/File-Path-2.130.0
[ebuild  N     ] virtual/perl-File-Path-2.130.0
[binary   R    ] virtual/os-headers-0
[ebuild  N     ] sys-devel/autoconf-2.69-r4  USE="-emacs"
[ebuild  N     ] sys-apps/attr-2.4.47-r2  USE="-nls -static-libs"
[ebuild   R    ] sys-apps/coreutils-8.28-r1
[ebuild     U  ] app-admin/eselect-1.4.12 [1.4.8]
[ebuild     U  ] app-eselect/eselect-python-20171204 [20160516]
[ebuild     U  ] sys-devel/patch-2.7.6-r1 [2.7.5]
[ebuild  N     ] sys-apps/shadow-4.5  USE="cracklib xattr -acl -audit -nls -pam (-selinux) -skey"
[binary   R    ] virtual/shadow-0
[ebuild  N     ] virtual/perl-ExtUtils-MakeMaker-7.100.200_rc-r4
[ebuild  N     ] sys-libs/libcap-2.24-r2  USE="-pam -static-libs"
[ebuild  N     ] dev-perl/Text-Unidecode-1.270.0
[ebuild  N     ] dev-perl/libintl-perl-1.240.0-r2
[ebuild  N     ] sys-apps/help2man-1.47.4  USE="-nls"
[ebuild  N     ] sys-devel/automake-1.15.1-r2  USE="{-test}"
[ebuild  N     ] sys-devel/libtool-2.4.6-r3  USE="-vanilla"
[ebuild  N     ] dev-libs/expat-2.2.5  USE="unicode -examples -static-libs"
[ebuild   R    ] sys-process/psmisc-22.21-r3
[ebuild  N     ] sys-libs/gdbm-1.13-r2  USE="readline -berkdb -exporter -nls -static-libs"
[ebuild  N     ] sys-apps/groff-1.22.2  USE="-X -examples" L10N="-ja"
[ebuild  N     ] dev-libs/libelf-0.8.13-r2  USE="-debug -nls"
[ebuild  N     ] virtual/libelf-2
[ebuild  N     ] dev-libs/libgcrypt-1.8.1  USE="-doc -static-libs"
[ebuild  N     ] dev-perl/XML-Parser-2.440.0
[ebuild  N     ] virtual/perl-File-Spec-3.630.100_rc-r4
[ebuild  N     ] dev-perl/Unicode-EastAsianWidth-1.330.0-r1
[ebuild  N     ] sys-apps/texinfo-6.3  USE="-nls -static"
[ebuild  N     ] dev-libs/iniparser-3.1-r1  USE="-doc -examples -static-libs"
[ebuild  N     ] app-portage/portage-utils-0.64  USE="-nls -static"
[ebuild  N     ] dev-libs/openssl-1.0.2o  USE="asm sslv3 tls-heartbeat zlib -bindist -gmp -kerberos -rfc3779 -sctp -sslv2 -static-libs {-test} -vanilla"
[binary  N     ] dev-lang/python-2.7.14-r1  USE="ipv6 ncurses readline ssl (threads) (wide-unicode) xml (-berkdb) -build -doc -examples -gdbm -hardened -libressl -sqlite -tk -wininst"
[binary  N     ] sys-apps/openrc-0.34.11  USE="ncurses netifrc unicode -audit -debug -newnet -pam (-prefix) (-selinux) -static-libs" 
[ebuild  N     ] net-misc/netifrc-0.5.1
[binary   R    ] sys-apps/grep-3.0
[binary   R    ] sys-apps/findutils-4.6.0-r1
[binary   R    ] sys-apps/kbd-2.0.4
[ebuild  N     ] sys-apps/busybox-1.28.0  USE="ipv6 static -debug -livecd -make-symlinks -math -mdev -pam -savedconfig (-selinux) -sep-usr -syslog (-systemd)"
[binary   R    ] virtual/service-manager-0
[binary   R    ] sys-devel/binutils-2.29.1-r1
[ebuild  N     ] sys-apps/net-tools-1.60_p20161110235919  USE="arp hostname ipv6 -nis -nls -plipconfig (-selinux) -slattach -static" 
[binary   R    ] sys-apps/gawk-4.1.4
[binary   R    ] virtual/editor-0
[binary   R    ] sys-devel/make-4.2.1
[binary   R    ] sys-process/procps-3.3.12-r1
[binary   R    ] virtual/dev-manager-0-r1
[binary   R    ] sys-apps/which-2.21
[ebuild  N     ] net-misc/iputils-20171016_pre  USE="arping filecaps ipv6 openssl ssl -SECURITY_HAZARD -caps -clockdiff -doc -gcrypt
 (-idn) -libressl -nettle -rarpd -rdisc -static -tftpd -tracepath -traceroute"
[binary   R    ] virtual/pager-0
[binary   R    ] sys-apps/diffutils-3.5
[binary   R    ] sys-apps/baselayout-2.4.1-r2
[binary   R    ] virtual/libc-1
[binary   R   ~] sys-devel/gcc-7.3.0
[binary   R    ] virtual/pkgconfig-0-r1
[ebuild  N     ] dev-lang/python-3.5.5  USE="ipv6 ncurses readline ssl (threads) xml -build -examples -gdbm -hardened -libressl -sqlite {-test} -tk -wininst"
[ebuild  N     ] app-misc/ca-certificates-20170717.3.36.1  USE="-cacert -insecure_certs"
[ebuild  N     ] sys-apps/util-linux-2.30.2-r1  USE="cramfs ncurses readline suid unicode -build -caps -fdformat -kill -nls -pam -python (-selinux) -slang -static-libs (-systemd) {-test} -tty-helpers -udev" PYTHON_SINGLE_TARGET="python3_5 -python2_7 -python3_4 -python3_6" PYTHON_TARGETS="python2_7 python3_5 -python3_4 -python3_6"
[ebuild     U  ] app-misc/pax-utils-1.2.3 [1.1.7]
[ebuild     U  ] sys-apps/sandbox-2.13 [2.10-r4]
[ebuild     U  ] net-misc/rsync-3.1.3 [3.1.2-r2]
[ebuild  N     ] net-firewall/iptables-1.6.1-r3  USE="ipv6 -conntrack -netlink -nftables -pcap -static-libs"
[ebuild     U  ] dev-libs/libpipeline-1.4.2 [1.4.0]
[ebuild  N     ] sys-apps/man-db-2.7.6.1-r2  USE="gdbm manpager zlib -berkdb -nls (-selinux) -static-libs"
[ebuild     U  ] sys-apps/kmod-24 [23] PYTHON_TARGETS="-python3_6%"
[ebuild  N     ] dev-python/pyblake2-1.1.0  PYTHON_TARGETS="python2_7 python3_5 (-pypy) -python3_4 -python3_6"
[ebuild  N     ] net-misc/openssh-7.5_p1-r4  USE="hpn pie ssl -X -X509 -audit -bindist -debug -kerberos -ldap -ldns -libedit -libressl -livecd -pam -sctp (-selinux) -skey -ssh1 -static {-test}"
[ebuild  N     ] dev-util/gtk-doc-am-1.25-r1
[ebuild  N     ] dev-libs/libxml2-2.9.7  USE="ipv6 readline -debug -examples -icu -lzma -python -static-libs {-test}" PYTHON_TARGETS="python2_7 python3_5 -python3_4 -python3_6"
[ebuild  N     ] sys-devel/gettext-0.19.8.1  USE="cxx ncurses openmp -acl -cvs -doc -emacs -git -java (-nls) -static-libs"
[ebuild  N     ] app-text/build-docbook-catalog-1.19.1
[ebuild  N     ] dev-libs/libxslt-1.1.30-r2  USE="crypt -debug -examples -python -static-libs" PYTHON_TARGETS="python2_7"
[ebuild  N     ] app-text/docbook-xsl-stylesheets-1.79.1-r2  USE="-ruby"
[ebuild  N     ] app-text/docbook-xml-dtd-4.1.2-r6
[ebuild  N     ] dev-util/intltool-0.51.0-r2
[ebuild  N     ] dev-libs/glib-2.52.3  USE="mime xattr -dbus -debug (-fam) (-selinux) -static-libs -systemtap {-test} -utils" PYTHON_TARGETS="python2_7"
[ebuild  N     ] x11-misc/shared-mime-info-1.9  USE="{-test}"
[ebuild  N     ] dev-python/setuptools-36.7.2  USE="{-test}" PYTHON_TARGETS="python2_7 python3_5 (-pypy) (-pypy3) -python3_4 -python3_6"
[ebuild  N     ] dev-python/certifi-2017.4.17  PYTHON_TARGETS="python2_7 python3_5 (-pypy) (-pypy3) -python3_4 -python3_6"
[ebuild  N     ] dev-python/pyxattr-0.5.5  USE="-doc {-test}" PYTHON_TARGETS="python2_7 python3_5 (-pypy) -python3_4"
[ebuild  N     ] sys-apps/portage-2.3.24-r1  USE="(ipc) native-extensions xattr -build -doc -epydoc -gentoo-dev (-rsync-verify) (-selinux)" PYTHON_TARGETS="python2_7 python3_5 (-pypy) -python3_4 -python3_6"
[ebuild  N     ] app-admin/perl-cleaner-2.25
[binary   R    ] virtual/man-0-r1
[binary   R    ] virtual/modutils-0
[ebuild  N     ] sys-fs/e2fsprogs-1.43.6  USE="-fuse (-nls) -static-libs"
[ebuild     U  ] virtual/package-manager-1 [0]
[ebuild  N     ] sys-apps/iproute2-4.14.1-r2  USE="iptables ipv6 -atm -berkdb -minimal (-selinux)"
[binary   R    ] virtual/ssh-0
[ebuild  N     ] net-misc/wget-1.19.1-r2  USE="ipv6 pcre ssl zlib -debug -gnutls -idn -libressl -nls -ntlm -static {-test} -uuid"
[ebuild   R    ] dev-util/pkgconfig-0.29.2  USE="-internal-glib*"

!!! The following binary packages have been ignored due to non matching USE:

    =dev-util/pkgconfig-0.29.2 internal-glib
    =sys-apps/attr-2.4.47-r2 nls
    =sys-apps/man-db-2.7.6.1-r2 nls
    =dev-libs/libelf-0.8.13-r2 nls
    =sys-apps/shadow-4.5 -linguas_cs -linguas_da -linguas_de -linguas_es -linguas_fi -linguas_fr -linguas_hu -linguas_id -linguas_it -linguas_ja -linguas_ko -linguas_pl -linguas_pt_BR -linguas_ru -linguas_sv -linguas_tr -linguas_zh_CN -linguas_zh_TW nls

NOTE: The --binpkg-respect-use=n option will prevent emerge
      from ignoring these binary packages if possible.
      Using --binpkg-respect-use=y will silence this warning.

I think that’s broken the back of the job.  Of course when I come to running Catalyst, I’ll have to do it all over again, but at least now the environment is clean.

Solar Cluster: Next steps, better control of the charger

So, a few weeks ago I installed a new battery charger, and tweaked it so that the solar did most of the leg work during the day, and the charger kept the batteries topped up at night.

I also discussed the addition of a new industrial PC to perform routing and system monitoring functions… which was to run Gentoo Linux/musl. For now, that little PC is still running Debian Stretch, but for 45 days, it was rock solid. The addition of this box, and taking on the role of router to the management network meant I could finally achieve one of my long-term goals for the project: decommissioning the old server.

The old server is still set up with all my data and software… but now the back-up cron job calls /sbin/poweroff when it’s done, and the BIOS is set to wake the machine up in the evening ready to receive a back-up late at night.

In its place, a virtual machine clone of the box, handles my email and all the old functions of that server. This was all done just prior to my father and I leaving for a 3 week holiday in the Snowy Mountains.

I did have a couple of hiccups with Ceph OSDs crashing … but basically re-starting the daemons (done remotely whilst travelling through Cowra) got everything back up. A bit of placement group cleaning, and everything was back online again. I had another similar hiccup coming out of Maitland, but once again, re-starting the daemons fixed it. No idea why it crashed, that’s something I’ll have to investigate.

Other than that, the cluster itself has run well.

One thing that did momentarily kill the industrial PC though: I wandered down to the rack with a small bus-powered 2.5″ HDD with the intent of re-starting my Gentoo builds. This HDD had the same content as the 3.5″ HDD I had plugged in before. I figured being bus powered, I would not be dependent on mains, and it could just chug away to its heart’s content.

No such luck, the moment I plugged that drive in, the little machine took great umbrage to the spinning rust now vacuuming the electrons away from its core functions, and shut down abruptly. I’ve now brought my 3.5″ drive and dock down, plugged that into the wall, and have my builds resuming. If power goes off, hopefully the machine either handles the loss of swap gracefully. If it does crash, the watchdog will take care of it.

Thus, I have the little TS-7670 first attempting a build of gcc, to see how we go. Finger’s crossed our power should remain up. There was at least one outage in the time we were away, but hopefully we should get though this next build!

The next step I think should be to add some control of the mains charger to allow the batteries to be boosted to full charge overnight. The thinking is a simple diode-OR arrangement. Many comparators such as the LM393 have an open-collector output, which gives us this for free.

The theory is this.

The battery bank powers a simple circuit which runs of a 5V regulator. That regulator powers a dual comparator IC and provides a reference voltage. The comparator draws bugger all power, so I’m happy to use a linear PSU here. It’s mainly there as a voltage reference.

Precision isn’t really the aim here, so adjustable pots will make life easier.

The voltages from the battery bank and the solar panel are fed through voltage dividers to bring the voltages down to below 5V, then those voltages are individually fed into separate pots that control the hysteresis. I can adjust all points of the system.

The idea is that should the batteries get too low, or the sun go down, one or the other (or both) comparators will go low and pull down on R2. If the batteries are high and the sun is up, nothing pulls on R2 so the REMOTE+ pin on the HEP-600C-12 is allowed to float to +5V, turning off the mains charger.

The advantage of this is there’s no programming of a microcontroller, it’s just analogue electronics. The LM393s are pretty hardy things, the datasheet says they’ll run at 36V and can accept a maximum voltage of VCC-1.5V; so if I run at 5V, 3.5V is my recommended maximum. The adjustment pots should let me set a threshold voltage that avoids going above this.

I mainly need 5V for the HEP-600C-12, and for providing that stable known voltage reference. The LM78C05 should be fine for this.

Once I’ve done that, I should be able to wind that charger back up to its factory setting of 14.4V, which will mean that overnight the batteries will be charged back to full charge.

Solar Cluster: Beware, I’m ARMed

Last night, I got home, having made a detour on my way into work past Jaycar Wooloongabba to replace the faulty PSU.
It was a pretty open-and-shut case, we took it out of the box, plugged it in, and sure enough, no fan.  After the saleswoman asked the advice of a co-worker, it was confirmed that the fan should be running.
It took some digging, but they found a replacement, and so it was boxed up (in the box I supplied, they didn’t have one), and I walked out the door with PSU No. 3.
I had to go straight to work, so took the PSU with me, and that evening, I loaded it into the top box to transport home on the bicycle.
I get home, and it’s first thing on my mind.  I unlock the top box, get it out, and still decked out in my cycling gear, helmet and all (needed the headlight to see down the back of the rack anyway), I get to work.
I put the ring lugs on, plug it into the wall socket and flick the switch.
Nothing.
Toggle the switch on the front, still nothing.
Tried the other socket on the outlet, unplugging the load, still nothing.  Did the 10km trip from Milton to The Gap kill it?
Frustrated, I figure I’ll switch a light on.  Funny… no lights.
I wander into the study… sure enough, the router, modem and switch are dead as doornails.  Wander out to the MDB outside, saw the main breaker was still on, and tried hitting the test button.  Nothing.
I wander back inside, switching the bike helmet for my old hard hat, since it looks as if I’ll need the headlight a bit longer, then take a sticky beak down the road to see if anyone else is facing the same issue.
Sure enough, I look down the street, everyone’s out.
So there goes my second attempt at bootstrapping Gentoo, and my old server’s uptime.
The power did return about an hour or so later.  The PSU was fine, you don’t think of the mains being out as the cause of your problems.
I’ll re-start my build, but I’m not going to lose another build to failing power.  Nope, had enough of that for a joke.
I could have rigged up a UPS to the TS-7670, but I already have one, and it’s in the very rack where it’ll get installed anyway.  Thus, no time like the present to install it.
I’ll have to configure the switch to present the right VLANs to the TS-7670, but once I do that, it’ll be able to take over the role of routing between the management VLAN and the main network.
I didn’t want to do this in a VM because that means exposing the hosts and the VMs to the management VLAN, meaning anyone who managed to compromise a host would have direct access to the BMCs on the other nodes.
This is not a network with high bandwidth demands, and so the TS-7670 with its 100Mbps Ethernet (built into the SoC; not via USB) is an ideal machine for this task.
Having done this, all that’s left to do is to create a 2GB dual-core VM which will receive the contents of the old server, then that server can be shut down, after 8 years of good service.  I’ll keep it around for storing the on-site backups, but now I can keep it asleep and just wake it up with Wake-on-LAN when I want to make a back-up.
This should make a dint in our electricity bill!
Other changes…

  • Looks like we’ll be upgrading the solar with the addition of another 120W panel.
  • I will be hooking up my other network switches, the ADSL router and ADSL modem up to the battery bank on the cluster, just got to get some suitable cable for doing so.
  • I have no faith in this third PSU, so already, I have a MeanWell HEP-600C coming.  We’ll wire up a suicide lead to it, and that can replace the Powertech MP-3089 + Redarc BCDC1225, as the MeanWell has a remote on/off feature I can use to control it.

Solar Cluster: History repeats: another PSU fan bites the dust

Perhaps literally… it has bitten the dust.  Although I wouldn’t call its installed location, dusty.  Once again, the fan in the mains power supply has carked it.

Long-term followers of this project may remember that the last PSU failed the same way.

The reason has me miffed.  All I did with the replacement, was take the PSU out of its box, loosen the two nuts for the terminals, slip the ring lugs for my power lead over the terminals, returned the nuts, plugged it in and turned it on.

While it is running 24×7, there is nothing in the documentation to say this PSU can’t run that way.  This is what the installation looks like.

If it were dusty, I’d expect to be seeing hardware failures in my nodes.

This PSU is barely 4 months old, and earlier this week, the fan started making noises, and requiring percussive maintenance to get started. Tonight, it failed. Completely, no taps on the case will convince it to go.

Now, I need to keep things running until the weekend. I need it to run without burning the house down.

Many moons ago, my father bought a 12V fan for the caravan. Cheap and nasty. It has a slider switch to select between two speeds; “fast” and “slow”, which would be better named “scream like a banshee” and “scream slightly less like a banshee”. The speed reduction is achieved by passing current through a 10W resistor, and achieves maybe a 2% reduction in motor RPM. As you can gather, it proved to be a rather unwelcome room mate, and has seen its last day in the caravan.

This fan, given it runs off 12V, has proven quite handy with the cluster. I’ve got my SB-50 “load” socket hanging out the front of the cluster. A little adaptor to bring that out to a cigarette lighter socket, and I can run it off the cluster batteries. When a build job has gotten a node hot and bothered, sitting this down the bottom of the cluster and aiming it at a node has cooled things down well.

Tonight, it has another task … to try and suck the hot air out of the PSU.

That’s the offending power supply.  A PowerTech MP-3089.  It powers the RedARC BCDC-1225 right above it.  And you can see my kludge around the cooling problem.  Not great, but it should hold for the next 24 hours.

Tomorrow, I think we’ll call past Aspley and pick up another replacement.  I’m leery of another now, but I literally have no choice … I need it now.  Sadly, >250W 12V switchmode PSUs are somewhat rare beasts here in Brisbane.  Altronics don’t sell them that big.  The grinning glasses are no more, and I’m not risking it with the Xantrex charger again.

Long term, I’m already looking at the MeanWell SP-480-12.  This is a PSU module, and will need its own case and mains wiring… but I have no faith in the MP-3089 to not fail and cremate my home of 34 years.

The nice feature of the SP-480-12 is that it does have a remote +12V power-off feature.  Presumably I can drive this with a comparator/output MOSFET, so that when the battery voltage drops below some critical threshold, it kicks in, and when it rises above a high set-point, it drops out.  Simple control, with no MCU involved.  I don’t see a reason to get more fancy than that on the control side, anything more is a liability.

On other news, my gcc build on the TS-7670 failed … so much for the wait.  We’ll try another version and see how we go.

Solar Cluster: Resuming a build

So the house got momentarily power-cycled this morning… I’m at work, minding my own business, next thing the access point emails me this:

Mar 13 09:04:23 Syslogd start up

Now, it only does that for two reasons.  Either someone told it to reboot (not I), or it got hard reset.  Sure enough, log into the old server, and it’s reporting an uptime of 15 minutes.  I get home this evening, and clocks all around are on the blink … literally.

The cluster course is going, power outage?  What power outage?

I did consider wiring up the ADSL modem, router, study switch, and the TS-7670 up to the cluster’s power rails, but haven’t gotten around to doing that.  Alas, I’m not quite there yet.

In any case, even if the TS-7670 had been powered from the solar, I’d have still have temporarily lost the build as the HDD dock I have the hard drive sitting in is mains powered.  It also doesn’t remember its state after a power cycling.  I’d have re-started the build from work, but the HDD remained off when the power came back on.

Never mind.  The downside is now I get to re-start a multi-day build.  The good news though, is that knowing the ebuild file that Portage picked out for compiling gcc; I can resume where it left off.  In this case, it’s using an ebuild from the musl overlay; /root/musl/sys-devel/gcc/gcc-6.4.0-r1.ebuild.

ebuild /root/musl/sys-devel/gcc/gcc-6.4.0-r1.ebuild package will preserve the current working tree and will resume where it was, hopefully without incident.  I’ll be left with a .tbz2; which will be picked up when I run emerge –keep-going -ekv @system.