Jul 182019
 

So, a few months back I had the failure of one of my storage nodes. Since I need 3 storage nodes to operate, but can get away with a single compute node, I did a board-shuffle. I just evacuated lithium of all its virtual machines, slapped the SSD, HDD and cover from hydrogen in/on it, and it became the new storage node.

Actually I took the opportunity to upgrade to 2TB HDDs at the same time, as well as adding two new storage nodes (Intel NUCs). I then ordered a new motherboard to get lithium back up again. Again, there was an opportunity to upgrade, so ~$1500 later I ordered a SuperMicro A2SDi-16C-HLN4F. 16 cores, and full-size DDR4 DIMMs, so much easier to get bits for. It also takes M.2 SATA.

The new board arrived a few weeks ago, but I was heavily snowed under with activities surrounding Brisbane Area WICEN Group and their efforts to assist the Stirling’s Crossing Endurance Club running the Tom Quilty 2019. So it got shoved to the side with the RAM I had purchased to be dealt with another day.

I found time on Monday to assemble the hardware, then had fun and games with the UEFI firmware on this board. Put simply, the legacy BIOS support on this board is totally and utterly broken. The UEFI shell is also riddled with bugs (e.g. ifconfig help describes how to bring up an interface via DHCP or statically, but doing so fails). And of course, PXE is not PXE when UEFI is involved.

I ended up using Ubuntu’s GRUB binary and netboot image to boot-strap the machine, after which I could copy my Gentoo install back in. I now have the machine back in the rack, and whilst I haven’t deployed any VMs to it yet, I will do so soon. I did however, give it a burn-in test updating the kernel:

  LD [M]  security/keys/encrypted-keys/encrypted-keys.ko
  MKPIGGY arch/x86/boot/compressed/piggy.S
  AS      arch/x86/boot/compressed/piggy.o
  LD      arch/x86/boot/compressed/vmlinux
ld: arch/x86/boot/compressed/head_64.o: warning: relocation in read-only section `.head.text'
ld: warning: creating a DT_TEXTREL in object.
  ZOFFSET arch/x86/boot/zoffset.h
  OBJCOPY arch/x86/boot/vmlinux.bin
  AS      arch/x86/boot/header.o
  LD      arch/x86/boot/setup.elf
  OBJCOPY arch/x86/boot/setup.bin
  BUILD   arch/x86/boot/bzImage
Setup is 16444 bytes (padded to 16896 bytes).
System is 6273 kB
CRC ca5d7cb3
Kernel: arch/x86/boot/bzImage is ready  (#1)

real    7m7.727s
user    62m6.396s
sys     5m8.970s
lithium /usr/src/linux-stable # git describe
v5.1.11

7m for make -j 17 to build a current Linux kernel is not bad at all!

May 292019
 

It’s been on my TO-DO list now for a long time to wire in some current shunts to monitor the solar input, replace the near useless Powertech solar controller with something better, and put in some more outlets.

Saturday, I finally got around to doing exactly that. I meant to also add a low-voltage disconnect to the rig … I’ve got the parts for this but haven’t yet built or tested it — I’d like to wait until I have done both,but I needed the power capacity. So I’m running a risk without the over-discharge protection, but I think I’ll take that gamble for now.

Right now:

  • The Powertech MP-3735 is permanently out, the Redarc BCDC-1225 is back in.
  • I have nearly a dozen spare 12V outlet points now.
  • There are current shunts on:
    • Raw solar input (50A)
    • Solar controller output (50A)
    • Battery (100A)
    • Load (100A)
  • The Meanwell HEP-600C-12 is mounted to the back of the server rack, freeing space from the top.
  • The janky spade lugs and undersized cable connecting the HEP-600C-12 to the battery has been replaced with a more substantial cable.

This is what it looks like now around the back:

Rear of the rack, after re-wiring

What difference has this made? I’ll let the graphs speak. This was the battery voltage this time last week:

Battery voltage for 2019-05-22

… and this was today…

Battery voltage 2019-05-29

Chalk-and-bloody-cheese! The weather has been quite consistent, and the solar output has greatly improved just replacing the controller. The panels actually got a bit overenthusiastic and overshot the 14.6V maximum… but not by much thankfully. I think once I get some more nodes on, it’ll come down a bit.

I’ve gone from about 8 hours off-grid to nearly 12! Expanding the battery capacity is an option, and could see the cluster possibly run overnight.

I need to get the two new nodes onto battery power (the two new NUCs) and the Netgear switch. Actually I’m waiting on a rack-mount kit for the Netgear as I have misplaced the one it came with, failing that I’ll hack one up out of aluminium angle — it doesn’t look hard!

A new motherboard is coming for the downed node, that will bring me back up to two compute nodes (one with 16 cores), and I have new 2TB HDDs to replace the aging 1TB drives. Once that’s done I’ll have:

  • 24 CPU cores and 64GB RAM in compute nodes
  • 28 CPU cores and 112GB RAM in storage nodes
  • 10TB of raw disk storage

I’ll have to pull my finger out on the power monitoring, there’s all the shunts in place now so I have no excuse but to make up those INA-219 boards and get everything going.

May 242019
 

Recently, I had a failure in the cluster, namely one of my nodes deciding to go the way of the dodo. I think I’ve mostly recovered everything from that episode.

I bought some new nodes which I can theoretically deploy as spare nodes, Core i5 Intel NUCs, and for now I’ve temporarily decommissioned one of my compute nodes (lithium) to re-purpose its motherboard to get the downed storage node back on-line. Whilst I was there, I went and put a new 2TB HDD in… and of course I left the 32GB RAM in, so it’s pretty much maxxed out.

I’d like to actually make use of these two new nodes, however I am out of switch capacity, with all 26 ports of the Linksys LGS-326AU occupied or otherwise reserved. I did buy a Netgear GS748T with the intention of moving across to it, but never got around to doing so.

The principle matter here being that the Netgear requires a wee bit more power. AC power ratings are 100-250V, 1.5A max. Now, presumably the 1.5A applies at the 100V scale, that’s ~150W. Some research suggested that internally, they run 12V, that corresponds to about 8.5A maximum current.

This is a bit beyond the capabilities of the MIC29712s.

I wound up buying a DC-DC power supply, an isolated one as that’s all I could get: the Meanwell SD-100A-12. This theoretically can take 9-18V in, and put out 12V at up to 8.5A. Perfect.

Due to lack of time, it sat there. Last week-end though, I realised I’d probably need to consider putting this thing to use. I started by popping open the cover and having a squiz inside. (Who needs warranties?)

The innards of the GS-748Tv5, ruler for scale

I identified the power connections. A probe around with the multimeter revealed that, like the Linksys, it too had paralleled conductors. There were no markings on the PSU module, but un-plugging it from the mainboard and hooking up the multimeter whilst powering it up confirmed it was a 12V output, and verified the polarity. The colour scheme was more sane: Red/Yellow were positive, Black/Blue were negative.

I made a note of the pin-out inside the case.

There’s further DC-DC converters on-board near the connector, what their input range is I have no idea. The connector on the mainboard intrigued me though… I had seen that sort of connector before on ATX power supplies.

The power supply connector, close up.

At the other end of the cable was a simple 4-pole “KK”-like connector with a wider pin spacing (I think ~3mm). Clearly designed with power capacity in mind. I figured I had three options:

  1. Find a mating connector for the mainboard socket.
  2. Find a mating header for the PSU connector.
  3. Ram wires into the plug and hot-glue in place.

As it happens, option (1) turned out easier than I thought it would be. When I first bought the parts for the cluster, the PicoPSU modules came with two cables: one had the standard SATA and Molex power connectors for powering disk drives, the other came out to a 4-pin connector not unlike the 6-pole version being used in the switch.

Now you’ll note of those 6 poles, only 4 are actually populated. I still had the 4-pole connectors, so I went digging, and found them this evening.

One of my 4-pole 12V connectors, with the target in the background.

As it happens, the connectors do fit un-modified, into the wrong 4 holes — if used unmodified, they would only make contact with 2 of the 4 pins. To make it fit, I had to do a slight modification, putting a small chamfer on one of the pins with a sharp knife.

After a slight modification, the connector fits where it is needed.

The wire gauge is close to that used by the original cable, and the colour coding is perfect… black corresponds to 0V, yellow to +12V. I snipped off the JST-style connector at the other end.

I thought about pulling out the original PSU, but then realised that there was a small hole meant for a Kensington-style lock which I wasn’t using. No sharp edges, perfect for feeding the DC cables through. I left the original PSU in-situ, and just unplugged its DC output.

The DC input leads snake through the hole that Netgear helpfully provided.

Bringing the DC power input to the outside.

Before putting the screws in, I decided to give this a test on the bench supply. The switch current fluctuates a bit when booting, but it seems to settle on about 1.75A or so. Not bad.

Testing the switch running on 12V

Terminating this, I decided to use XT-60 connectors. I wanted something other than the 30A “powerpoles” and their larger 50A cousins that are dotted throughout the cluster, as this needed to be regulated 12V. I did not want to get it mixed up with the raw 12V feed from the batteries.

I ran some heavier gauge cable to the DC-DC PSU, terminated with the mating XT-60 connector and hooked that up to my PSU. Providing it with 12V, I dialled the output to 12V exactly. I then gave it a no-load test: it held the output voltage pretty good.

Next, I hooked the switch up to the new PSU. It fired up and I measured the voltage now under load: it still remained at 12V. I wound the voltage down to 9V, then up to 15V… the voltage output never shifted. At 9V, the current consumption jumps up to about 3.5A, as one would expect.

Otherwise, it seemed to be content to draw under 2A so the efficiency of the DC-DC converter is pretty good.

I’ll need to wire in a new fuse box to power everything, but likely the plan will be to decommission the 16-port 100Mbps switch I use for the management network, slide the 48-port switch in its place, then gradually migrate everything across to the new switch.

Overall, the modding of this model switch was even less invasive than that of the Linksys. It’s 100% reversible. I dare say having posted this, there’ll be a GS748Tv6 that’ll move the 240V PSU to the mainboard, but for now at least, this is definitely a switch worth looking at if 12V operation is needed.

May 242019
 

So, in my workplace we’re developing a small energy/water metering device, which runs on a 6LoWPAN network and runs OpenThread-based firmware. The device itself is largely platform-agnostic, requiring a simple CoAP gateway to provide it with a configuration blob and history write end-point. The gateway service I’m running is intended to talk to WideSky.

One thorny issue we need to solve before deploying these things in the wild, is over-the-air updates. So we need a way to transfer the firmware image to the device over the mesh network. Obviously, this firmware image needs to be digitally signed and hashed using a strong cryptographic hash — I’ve taken care of this already. My problem is downloading an image that will be up to 512kB in size.

Thankfully, the IETF has thought of this, the solution to big(gish) files over CoAP is Block-wise transfers (RFC-7959). This specification gives you the ability to break up a large payload into smaller chunks that are powers of two in size between 16 and 2048 bytes.

6LoWPAN itself though has a limitation: the IEEE 802.15.4 radio specification it is built on cannot send frames bigger than 128 bytes. Thus, any message sent via this network must be that size or smaller. IPv6 has a minimum MTU of 1280 bytes, so how do they manage? They fragment the IPv6 datagram into multiple 802.15.4 frames. The end device re-assembles the fragments as it receives them.

The catch is, if a fragment is lost, you lose the entire datagram, there’s no repeats of individual fragments, the entire datagram must be re-sent. The question in my mind was this: Is it faster to rely on block-wise transfers to break the payload up and make lots of small requests, or is it faster to rely on 6LoWPAN fragmentation?

The test network here has a few parts:

  • The target device, which will be downloading a 512kB firmware image to a separate SPI flash chip.
  • The border router, which provides a secure IPv6 tunnel to a cloud host.
  • The cloud server which runs the CoAP service we’ll be talking to.

The latency between the office (in Brisbane) and the cloud server (in Sydney) isn’t bad, about 30~50ms. The CoAP service is built using node-coap and coap-polka.

My CoAP requests have some overheads:

  • The path being downloaded is about 19 bytes long.
  • There’s an authentication token given as a query string, so this adds an additional 12 bytes.

The data link is not 100% reliable, with the device itself dropping some messages. This is leading to some retransmits. The packet loss is not terrible, but probably in the region of around 5%. Over this slightly lossy link, I timed the download of my 512kB firmware image by my device with varying block size settings.

Note that node-coap seems to report a “Bad Option” error for szx=0 and szx=7, even though both are legitimately within specification. (I’d expect node-coap to pass szx=7 through and allow the application to clamp it to 6, but it seems node-coap‘s behaviour is to report “Bad option”, but then pass the payload through anyway.)

Size Exponent (szx)Block sizeStart time (UTC)End time (UTC)Effective data rate
6102403:27:0403:37:52809B/s
551203:41:2503:53:40713B/s
425603:57:1504:16:16458B/s
312804:17:4604:54:17239B/s
26404:56:0905:54:53150B/s
13223:31:3301:39:4468B/s

It remains to be seen how much multiple hops and outdoor atmospherics affect the situation. A factor here is how quickly the device can turn-around between processing a response and sending the next request, which in this case is governed by the speed of the SPI flash and its driver.

Effects on “busy” networks

So, I haven’t actually done any hard measurements here, but doing testing on a busier network with about 30 nodes, the block size equation tips more in favour of a smaller block size.

I’ll try to quantify the delays at some point, but right now 256 byte blocks are the clear winner, with 512 and 1024 byte block transfers proving highly unreliable. The speed advantage between 1k and 512 bytes in ideal conditions was a big over 10%… which really doesn’t count for much. At 256 bytes, the speed difference was about 43%, quite significant. You’re better off using 512-byte blocks if the network is quiet.

On a busy network, with all the retransmissions, smaller is better. No hard numbers yet, but right now at 256 byte blocks, the effective rate is around 118 bytes/sec. I’ll have to analyse the logs a bit to see where 512/1024 byte block sizes sat, but the fact they rarely completed says it all IMO.

Slow and steady beats fast and flakey!

May 182019
 

Seriously, if you think this is a good way to earn some yuan, think again. I just got this email this afternoon:


Dear CEO,
(It’s very urgent, please transfer this email to your CEO. If this email affects you, we are very sorry, please ignore this email. Thanks)
We are a Network Service Company which is the domain name registration center in China.
We received an application from Hua Hai Ltd on May 14
, 2019. They want to register ” stuartl.longlandclan ” as their Internet Keyword and ” stuartl.longlandclan .cn “、” stuartl.longlandclan .com.cn ” 、” stuartl.longlandclan .net.cn “、” stuartl.longlandclan .org.cn ” 、” stuartl.longlandclan .asia “、domain names, they are in China and Asia domain names. But after checking it, we find ” stuartl.longlandclan ” conflicts with your company. In order to deal with this matter better, so we send you email and confirm whether this company is your distributor or business partner in China or not?
 


Best Regards
**************************************
Mike Zhang | Service Manager
Cn YG Domain (Head Office)
Contact details censored as I do not wish to promote their business
*************************************

The wording is identical to that seen in this article on squelchdesign. Knowing this to be a scam, I did two things:

  1. As per my standard policy, I forwarded it to SpamCop. The source of the email was Baidu’s own network.
  2. I figured since it’s obviously a scam and since these people seemingly do not learn from the skirmishes with others, I’d have some fun with them:

On 18/5/19 11:46 am, Mike Zhang wrote:

Dear CEO, (It’s very urgent, please transfer this email to your CEO. If this email affects you, we are very sorry, please ignore this email. Thanks)

You want this to go to my CEO? Does every individual person in China have their own personal CEO? Is that why they have such a big population? Please keep in mind what the .id.au domain suffix is for: INDIVIDUALS.

We are a Network Service Company which is the domain name registration center in China.

Ahh, so you must know the rules around domain registrations, like the .id.au domain suffix being non-commercial.

We received an application from Hua Hai Ltd on May 14, 2019. They want to register ” stuartl.longlandclan ” as their Internet Keyword and ” stuartl.longlandclan .cn “、” stuartl.longlandclan .com.cn ” 、” stuartl.longlandclan .net.cn “、” stuartl.longlandclan .org.cn ” 、” stuartl.longlandclan .asia “、domain names, they are in China and Asia domain names.

They must be rich. They also wanted bellavitosi .cn, bellavitosi.com.cn, bellavitosi.net.cn, bellavitosi.org.cn, bellavitosi.asia, formula1-dictionary.cn, formula1-dictionary.com.cn, formula1-dictionary.net.cn, formula1-dictionary.org.cn and formula1-dictionary.asia.

What does this group do? Are they a subsiduary of BaoYuan Ltd? I hear pan xiaohong has wealth that rivals Jack Ma.

But after checking it, we find ” stuartl.longlandclan ” conflicts with your company. In order to deal with this matter better, so we send you email and confirm whether this company is your distributor or business partner in China or not?

Well, this “company” does not exist, so can’t possibly have a partner in China. I say to them, go ahead and register those domain names, I dare you, it’ll cost you a lot more than it will cost me.

Errm, yeah… the SEO spammers are slowly learning not to mess with me as I’ll just report the email as spam and will tweak mail server settings to ensure you stay blocked. Or I may choose to publicly ridicule you like I have done here.

The worst they can do is actually follow through and register all those domains, which will cost them an absolute bloody fortune (.asia domains are not cheap!) and my content is already well known with the search engines — it’s not like I rely on my online presence for an income anyway as I have a day job. Anything I do here is for self-education and training.

All this mob is doing, is destroying the image of some innocent company in Hong Kong, which are likely nothing to do with this scam. Seriously guys, get a real job!

Apr 112019
 

Lately, I had a need for a library that would talk to a KISS TNC and allow me to exchange UI frames over an AX.25 network.

This is part of a project being undertaken by Brisbane Area WICEN Group. We’ve been tasked with the job of reporting scans from RFID tag readers back to base… and naturally we’ll be using the AX.25 network we’re already familiar with. The plan is to use APRS messaging (to keep things simple) to submit the location, time and hardware address of each RFID read.

For this, I needed something I also need for this project, a tool to encode and decode the UI frames. I had initially thought of just using LinBPQ or similar to provide the interface to AX.25, but in the end, it was easier for me to write my own simple AX.25 stack from scratch.

aioax25 obviously is nowhere near a replacement for other AX.25 stacks in that it only encodes and decodes frames, but it’s a first step in that journey. This library is written for Python 3.4 and up using the asyncio module and pyserial. At the moment I have used it to somewhat crudely send and receive APRS messages, and so with a bit of work, it’ll suffice for the WICEN project.

That does mean I’m not shackled in terms of what bits I can set in my AX.25 headers. One limitation I have with my mapping of 6LoWHAM addresses to AX.25 addresses is that I cannot represent all characters or the “group” bit.

This lead to the limitation that if I defined a group called VK4BWI-0, that group may not have a participant with the call-sign of VK4BWI-0 because I would not be able to differentiate group messages from direct messages.

By writing my own AX.25 stack, I potentially can side-step that limitation: I can utilise the reserved bits in a call-sign/SSID to represent this information. I avoided their use before because the interfaces I planned on using did not expose them, but doing it myself means they’re directly accessible. The AX.25 protocol documentation states:

The bits marked “r” are reserved bits. They may be used in an agreed-upon manner in individual networks. When not implemented, they should be set to one.

https://www.tapr.org/pub_ax25.html

Now, the question is, if I set one to 0, would it reach the far end as a 0? If so, this could be a stand-in for the group bit — stored inverted so that a 1 represents a unicast destination and 0 represents a group.

The other option is to just prepend the left-over bits to the start of the message payload. This has the bonus that I can encode the full-callsign even if that call-sign does not fit in a standard AX.25 message.

So a message sent to VK4FACE-6 (let’s pretend F-calls can use packet for the sake of an example) would be sent to AX.25 SSID VK4FAC-6, and the first few bytes would encode the missing E and the group/unicast bit. If the station VK4FAC were also on frequency, the software stack at their end would need to filter based on those initial payload bytes.

We support 8-character call-signs, so we need to represent 2 left-over characters plus a group bit. Add space for two-more characters for the source call-sign (which may not be a group), we require about 3 bytes.

At this point we might as well use 4, store the extra bytes as 7-bit ASCII, with the spare MSBs of each byte encoding the group bit and one spare bit. An extra 8 bits is bugger all really even at 1200 baud.

Obviously, NET/ROM has no knowledge of this. Stations that are on the other side of a non-6LoWHAM digipeater need to explicitly source-route their hops to reach the rest of a mesh network, and the nodes the other side need to “remember” this source route.

This latter scheme also won’t work for connected mode, as there’s no scope to shoehorn those bytes in the information field and still remain AX.25 compatible — it will only work for 6LoWHAM UI frames.

Anyway, it’s food for thought.

Apr 032019
 

I got this pair of text messages today…

Before I answer, I have 3 questions of my own:

  1. Why are you calling me “BILL”?  Pretty sure that is not printed anywhere on my birth certificate.
  2. Who on earth is “Claire” that I allegedly had a “service experience” with.  (I do hope that isn’t a euphemism!)
  3. The phone number this was sent to is a Telstra service: first activated in 2001.  0439 is a Telstra prefix.  Why the F### ask me?

I do have a SIM card that is an Optus pre-paid, that’s installed in the Kite, along-side a Telstra pre-paid.  I haven’t turned that phone on in a good month or so (I should fire it up just to keep the battery fresh though).

I’d never install that Optus phone in this phone however: pretty sure the phone I received this SMS on is locked to Telstra so it wouldn’t get a service if I tried.

Given you can’t get my name right, and you clearly haven’t checked your records properly, let me finish with this: Why would I recommend you?

Feb 152019
 

One problem I face with the cluster as it stands now is that 2.5″ HDDs are actually quite restrictive in terms of size options.

Right now the whole shebang runs on 1TB 5400RPM Hitachi laptop drives, which so far has been fine, but now that I’ve put my old server on as a VM, that’s chewed up a big chunk of space. I can survive a single drive crash, but not two.

I can buy 2TB HDDs, WD make some and Scorptec sell them. Seagate make some bigger capacity drives, however I have a policy of not buying Seagate.

At work we built a Ceph cluster on 3TB SV35 HDDs… 6 of them to be exact. Within 9 months, the drives started failing one-by-one. At first it was just the odd drive being intermittent, then the problem got worse. They all got RMAed, all 6 of them. Since we obviously needed drives to store data on until the RMAed drives returned, we bought identically sized consumer 5400RPM Hitachi drives. Those same drives are running happily in the same cluster today, some 3 years later.

We also had one SV35 in a 3.5″ external enclosure that formed my workplace’s “disaster recovery” back-up drive. The idea being that if the place was in great peril and it was safe enough to do so, someone could just yank this drive from the rack and run. (If we didn’t, we also had truly off-site back-up NAS boxes.) That wound up failing as well before its time was due. That got replaced with one of the RMAed disks and used until the 3TB no longer sufficed.

Anyway, enough of that diversion, long story short, I don’t trust Seagate disks for 24/7 operation. I don’t see other manufacturers (other than Seagate e.g. WD, Samsung, Hitachi) making >2TB HDDs in the 2.5″ form factor. They all seem to be going SSD.

I have a Samsung 850EVO 2TB in the laptop I’m writing this on, bought a couple of years ago now, and so far, it has been reliable. The cluster also uses 120GB 850EVOs as OS drives. There’s now a 4TB version as well.

The performance would be wonderful and they’d reduce the power consumption of the cluster, however, 3 4TB SSDs would cost $2700. That’s a big investment!

The other option is to bolt on a 3.5″ HDD somehow. A DIN-rail mounted case would be ideal for this. 3.5″ high-capacity drives are much more common, and is using technology which is proven reliable and is comparatively inexpensive.

In addition, by going to bigger external drives it also means I can potentially swap out those 2.5″ HDDs for SSDs at a later date. A WD Purple (5400RPM) 4TB sells for $166. I have one of these in my desktop at work, and so far its performance there has been fine. $3 more and I can get one of the WD Red (7200RPM) 4TB drives which are intended for NAS use. $265 buys a 6TB Toshiba 7200RPM HDD. In short, I have options.

Now, mounting the drives in the rack is a problem. I could just make a shelf to sit the drive enclosures on, or I could buy a second rack and move the servers into that which would free up room for a second DIN rail for the HDDs to mount to. It’d be neat to DIN-rail mount the enclosures beside each Ceph node, but right now, there’s no room to do that.

I’d also either need to modify or scratch-make a HDD enclosure that can be DIN-rail mounted.

There’s then the thorny issue of interfacing. There are two options at my disposal: eSATA and USB3. (Thunderbolt and Firewire aren’t supported on these systems and adding a PCIe card would be tricky.)

The Supermicro motherboards I’m using have 6 SATA ports. If you’re prepared to live with reduced cable lengths, you can use a passive SATA to eSATA adaptor bracket — and this works just fine for my use case since the drives will be quite close. I will have to power down a node and cut a hole in the case to mount the bracket, but this is doable.

I haven’t tried this out yet, but I should be able to use the same type of adaptor inside the enclosure to connect the eSATA cable to the HDD. Trade-off will be further reduced cable distances, but again, they don’t need to go more than 30cm, it’ll most likely work fine.

The other interface option is USB 3.0. The motherboards have two back-panel USB 3.0 connectors and inside, two USB 3.0 ports I can potentially expose. This can be hot-plugged without changing my cluster as it stands now. The down-side is that USB incurs a greater CPU overhead than SATA.

During my migration to BlueStore, I used exactly this to provide a “temporary” OSD disk… a 1TB 7200RPM WD black in a HDD dock. The performance of that was fine, and in that case, I was willing to put up with the overhead as it was temporary.

External eSATA cases seem to be going the way of the dodo, I haven’t seen many available for sale from my usual suppliers. USB 3.0 seems to have taken over, probably because for most uses, it is “good enough”. I did ask about whether one is preferred over the other for Ceph OSD use on the Ceph mailing list, but heard nothing.

As it was, prior to undertaking the migration, I bought such a case, an el’cheapo Simplecom SE-325, along with a 4TB WD Blue for the actual drive. I was tossing up between that, and a LaCiE “Porsche” 4TB drive, but the winning factor of this was that I’d know what I was buying — the LaCiE drive could have had anything in there, manufacturers can and sometimes do substitute components in different manufacturing runs, buying the case and drive separately didn’t run that risk.

The case and drive did the job. I hooked the drive up to my laptop (I had forgotten xhci_hcd support in the storage nodes’ kernels, which I have since fixed) and pulled a snapshot of every VM disk (Rados block device) off the Ceph cluster onto this drive as a raw disk image so I would not lose data. The drive easily kept up with the GbE link I had to the downstairs switch, and a core in the Core i5-3320M in this laptop is probably on par with the ones in the Avoton C2750s running the show.

To DIN-rail mount this, I’d need to make a cradle to take the case, and I’d need to hack some forced-ventilation into the top cover, which isn’t a difficult job. (Drill some holes, then use a nibbler tool to cut slots, then mount a small fan.)

The original PSU for this case is a 12V 2A wall wart, easily substituted with a 12V 3A LDO such as the LM1085IT-12. I may even be able to squeeze it and a heatsink into the case. I presently use one of these with the border router with a small heatsink, and so far, no problems.

If I later want eSATA, I can unscrew the original PCB and should be able to hack that in.

Short term, I can place a temporary shelf atop the battery cases and sit the HDDs there until I figure out more permanent arrangements.

Right now I’ve been battling a few health problems (sharp-eyed readers may recognise the box of “gunk” in the background which is now empty and the accompanying documentation — I’ll know more next Friday morning), and so I’ll wait until I know the outcome of those tests as there’s no point in building something grand if I’m not going to be around to enjoy it.

Jan 282019
 

My cloud computing cluster like all cloud computing clusters of course needs a storage back-end. There were a number of options I could have chosen, but the one I went with in the end was Ceph, and so far, it’s ran pretty well.

Lately though, I was starting to get some odd crashes out of ceph-osd. I was running release 10.2.3, which is quite dated now, this is one of the earlier Jewel releases. Adding to the fun, I’m running btrfs as my filesystem on the OS and the OSD, and I’m running it all on Gentoo. On top of this, my monitor nodes are my OSDs as well.

Not exactly a “supported” configuration, never mind the hacks done at hardware level.

There was also a nagging issue about too many placement groups in the Ceph cluster. When I first established the cluster, I christened it by dragging a few of my lxc containers off the old server and making them VMs in the cluster. This was done using libvirt and virt-manager. These got thrown into a storage pool called transitional-inst, with a VLAN set aside for the VMs to use. When I threw OpenNebula on, I created another Ceph pool, one for its images. The configuration of these lead to the “too many placement groups” warning, which until now, I just ignored.

This weekend was a long weekend, for controversial reasons… and so I thought I’ll take a snapshot of all my VMs, download those snapshots to a HDD as raw images, then see if I can fix these issues, and migrate to Ceph Luminous (v12.2.10) at the same time.

Backing up

I was going to be doing some nasty things to the cluster, so I thought the first thing to do was to back up all images. This was done by using rbd snap create pool/image@date to create a snapshot of an image, then rbd export pool/image@date /path/to/storage/pool-image.img before blowing away the snapshot with rbd snap rm pool/image@date.

This was done for all images on the Ceph cluster, stashing them on a 4TB hard drive I had bought for the purpose.

Getting things ready

My cluster is actually set up as a distcc cluster, with Apache HTTP server instances sharing out distfiles and binary package repositories, so if I build packages on one, I can have the others fetch the binary packages that it built. I started with a node, and got it to update all packages except Ceph. Made sure everything was up-to-date.

Then, I ran emerge -B =ceph-10.2.10-r2. This was the first step in my migration, I’d move to the absolute latest Jewel release available in Gentoo. Once it built, I told all three storage nodes to install it (emerge -g =ceph-10.2.10-r2). This was followed up by a re-start of the mon daemons on each node (one at a time), then the mds daemons, finally the osd daemons.

Resolving the “too many placement groups” warning

To resolve this, I first researched the problem. An Internet search lead me to this Stack Overflow post. In it, it was suggested the problem could be alleviated by making a new pool with the correct settings, then copying the images over to it and blowing away the old one.

As it happens, I had an easier solution… move the “transitional” images to OpenNebula. I created empty data blocks in OpenNebula for the three images, then used qemu-img convert -p /path/to/image.img rbd:pool/image to upload the images.

It was then a case of creating a virtual machine template to boot them. I put them in a VLAN with the other servers, and when each one booted, edited the configuration with the new TCP/IP settings.

Once all those were moved across, I blew away the old VMs and the old pool. The warning disappeared, and I was left with a HEALTH_OK message out of Ceph.

The Luminous moment

At this point I was ready to try migrating. I had a good read of the instructions beforehand. They seemed simple enough. I prepared as I did before by updating everything on the system except Ceph, then, telling Portage to build a binary package of Ceph itself.

Then I deployed the binary to the three nodes.

First step was to re-start the monitors… this went smoothly, I just did a /etc/init.d/ceph-mon.${HOST} restart on each one individually, and after a brief moment, quorum was re-established. I then deployed a manager daemon to each one — basically I just “copied” my monitor symbolic link, changing mon to mgr, added it to OpenRC’s list, then started them. No problems.

The OSDs though were still running the Jewel release.

I proceeded as before, trying a re-start of the first OSD. After a while it hadn’t come back…

2019-01-27 14:42:59.745860 7f28fac06e00 -1 filestore(/var/lib/ceph/osd/ceph-0) _detect_fs(1197): deprecated btrfs support is not ena
bled

Ohh bugger, so no btrfs support. This is where the fun began. At this point I was a bit flustered and thought I’d have to either migrate these nodes to XFS, or to BlueStore. So immediately I started looking at the BlueStore migration documentation, as I did not want to risk re-starting the other two OSDs and losing access to my data!

A hasty BlueStore migration

So, I started this by doing the ceph osd set out 0 to start my now downed OSD 0 on the path of migration. The fact it was already down didn’t click with me. I then tried running ceph osd safe-to-destroy 0, only to be told Error EINVAL: (22) Invalid argument.

Uhh ohh, this isn’t good. I waited a bit, but also part of me said: there should be a copy of everything on this node, on at least one of the other two nodes. I had configured it to maintain at least two copies of everything, so even if this node went up in smoke, the data should be recoverable.

With great trepidation, I continued and tried destroying the OSD, then creating a BlueStore one in its place… only to have the ceph-volume command blow up. It couldn’t find the keyring, then when I got that sorted out, it was failing to talk to systemd, then when I found the --no-systemd argument, it still failed because of LVM. I therefore realised I needed two things:

  1. I needed the bootstrap-osd keyring that ceph-deploy normally creates.
  2. The lvmetad daemon must be running.

For (1), this is taken care of with the following commands:

# ceph auth add client.bootstrap-osd --cap mon 'profile bootstrap-osd
# mkdir /var/lib/ceph/bootstrap-osd
# ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring

As for (2), install sys-fs/lvm and add lvmetad to your start-up services. Also add lvm, as you’ll want that at boot. (I learned this later.)

After doing that, the following command worked:

ceph-volume lvm create --bluestore --data /dev/sdb \
--osd-id 0 --no-systemd

The --no-systemd is important on Gentoo with OpenRC as there is no systemctl binary. Once I did that, I found I could start my OSD again. Data recovery began at once. The data recovery was an overnight effort — it took with my hardware until 3PM today to migrate all the placement groups over to the newly re-formatted OSD.

Migrating the other nodes

For now, they still run btrfs. In my “ohh crap” state, I didn’t see the little hint given:

2019-01-27 14:40:55.147888 7f8feb7a2e00 -1 *** experimental feature 'btrfs' is not enabled ***
This feature is marked as experimental, which means it
 - is untested
 - is unsupported
 - may corrupt your data
 - may break your cluster is an unrecoverable fashion
To enable this feature, add this to your ceph.conf:
  enable experimental unrecoverable data corrupting features = btrfs

2019-01-27 14:40:55.147901 7f8feb7a2e00 -1 filestore(/var/lib/ceph/osd/ceph-0) _detect_fs(1197): deprecated btrfs support is not enabled
2019-01-27 14:40:55.147906 7f8feb7a2e00 -1 filestore(/var/lib/ceph/osd/ceph-0) mount(1523): error in _detect_fs: (1) Operation not permitted
2019-01-27 14:40:55.147926 7f8feb7a2e00 -1 osd.0 0 OSD:init: unable to mount object store

Not feeling like a 24-hour wait, I did as it told me:

osd pool default size = 2  # Write an object n times.
osd pool default min size = 1 # Allow writing n copy in a degraded state.
osd pool default pg num = 128
osd pool default pgp num = 128
osd crush chooseleaf type = 1
osd max backfills = 10

# Allow btrfs to work:
enable experimental unrecoverable data corrupting features = btrfs

Now, my other OSDs re-started successfully, and I could finally finish off by restarting the metadata daemons and completing the migration. I’m now left with two OSDs with BTRFS and one with BlueStore.

For now, I’ll leave it that way, next week end, I might migrate a second node to BlueStore.

The reboot test

I needed to ensure the nodes would come back without my intervention. So starting with the two BTRFS nodes, I rebooted each one individually. The OSD on that node first went offline, then the monitor, finally the cluster noticed the metadata and manager services had gone. Then, upon successful boot, the services returned.

So far so good. Now the BlueStore node.

First reboot, my OSD didn’t come back. On investigation, I saw the following logs:

2019-01-28 16:25:59.312369 7fd58d4f0e00 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or
directory
2019-01-28 16:26:14.865883 7fe92f942e00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or
directory
2019-01-28 16:26:30.419863 7fd4fa026e00 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory

/var/lib/ceph/osd/ceph-0 was completely empty! Bugger, do I have to endure those 24 hours again? As it happened, no. I don’t know how the files in that directory disappeared, I did observe a tmpfs pseudovolume mounted at that directory earlier when trying to create the OSD … maybe that didn’t get unmounted before OSD creation, anyway, the files were gone.

A bit of digging revealed a ceph-bluestore-tool utility, with options like repair. At first I tried to wing it using that, but no dice. Then looking at the man page I noticed the sub-command prime-osd-dir. BINGO.

At first I threw the raw device at it, but as it happens, ceph-volume had deployed LVM to the raw disk, then put BlueStore on top of that. Starting lvm got the volume group recognised, so I added that to my boot-up services (see why I mentioned it earlier). It had created a sym-link to the LVM volume in /dev/ceph-${UUID1}/osd-block-${UUID2}.

No idea where the two UUIDs came from, but I tried this:

# ceph-bluestore-tool prime-osd-dir \
    --dev /dev/ceph-d62d0d95-2e13-4c59-834d-03a87b88c85e/osd-block-62b4be3e-3935-4d51-ab5c-dde077f99ea3 \
    --path /var/lib/ceph/osd/ceph-0

That populated the directory with files, so I tried again starting the OSD.

2019-01-28 16:59:23.680039 7fd93fcbee00 -1 bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-0/block: (13) Permission denied
2019-01-28 16:59:23.680082 7fd93fcbee00 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory
2019-01-28 16:59:39.229888 7f4a585b4e00 -1 bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-0/block: (13) Permission denied
2019-01-28 16:59:39.229918 7f4a585b4e00 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-0: (2) No such file or directory

Ah ha, chown -R ceph:ceph /var/lib/ceph/osd/ceph-0, and all sprang to life. The OSD came up.

Testing the fixes, a second re-boot

Since the OSD now was starting, and working, I did a second re-boot test, only to have history partially repeat itself.

The files were still there this time, but it was failing with a permissions error opening the block device. Sure enough, it was now owned by root.

Changed the permissions, and the OSD came up.

Fixing this was a job for udev:

cat /etc/udev/rules.d/99ceph.rules
SUBSYSTEM=="block", KERNEL=="sda7", OWNER="ceph", GROUP="ceph", MODE="0600"
SUBSYSTEM=="block", ENV{DM_VG_NAME}=="ceph-*", OWNER="ceph", GROUP="ceph", MODE="0600"

The first line is left-over from when /dev/sda7 was my journal. Not sure what I’ll do with this partition now, I’ll think of something (maybe Docker). The second line tells udev to change the permissions on the volume group that Ceph created.

Having done this, I rebooted again. This time, all worked. The OSD came up without my intervention.

Recap

So, the pitfalls I ran across in my Jewel-Luminous migration on Gentoo.

btrfs OSDs

I had btrfs volumes for my OSDs, which are now frowned upon and considered experimental. It isn’t necessary to migrate to BlueStore or XFS straight away, but for the OSDs to boot, you will need the following line in your /etc/ceph/ceph.conf before restarting your OSDs:

enable experimental unrecoverable data corrupting features = btrfs

ceph-volume expects the bootstrap-osd key.

To use ceph-volume, it for some reason expects to see the bootstrap-osd key in a hard-coded location. It won’t work with the default admin key.

This bootstrap key can be generated as follows:

# ceph auth add client.bootstrap-osd --cap mon 'profile bootstrap-osd
# mkdir /var/lib/ceph/bootstrap-osd
# ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring

Before creating a BlueStore OSD, make sure lvmetad and lvm are started (and set to start at boot)

You can get away with just lvmetad for the initial creation, but you’ll want lvm running at boot anyway to ensure all the logical volume groups get started at boot before Ceph goes looking for them.

So before attempting OSD creation, ensure LVM is installed, and set to start at boot.

ceph-osd runs as the ceph user

So your udev rules need to reflect that. Luckily, ceph-volume seems to prefer creating LVM volume groups named ceph-${UUID}. I don’t know what decides the UUID value, but thankfully udev supports globbing. The following udev rule (put it in /etc/udev/rules.d/99ceph.rules or wherever seems appropriate) will keep permissions in check:

SUBSYSTEM=="block", ENV{DM_VG_NAME}=="ceph-*", OWNER="ceph", GROUP="ceph", MODE="0600"

(The above should be all on one line.)

Before rebooting a BlueStore node, back up your OSD data directories

Shouldn’t be strictly necessary, but now I’ve been bitten, I’m going to be taking extra care of that data directory on my other two nodes when I migrate them. I don’t fancy playing around with ceph-bluestore-tool frantically trying to get an OSD back up again.

Jan 192019
 

Recently, I’ve been looking at the problem of how to retrieve IPv6 traffic from the network stack of my workstation and manipulate it for transmission over AX.25.

My last experiments focussed on the TUN/TAP interface in Linux. Using this interface, I could create a virtual network interface that piped its traffic to a file descriptor in a program written in C.

One advantage of using the C language for this is that, as binding to the TAP interface requires root privileges, the binary could be installed setuid root. Thus, any time it started, it would be running as root. From there, it could do what it needed, then drop privileges back to a regular user.

The program would just run as a child process… when there was traffic received from the kernel, it would just spit that out to stdout. If my parent application had something to send, it would feed that into stdin.

6lhagent is an implementation of that idea. It’s pretty rough, but it seems to work. It uses a simple protocol to frame the Ethernet packets so that it can maintain synchronisation with the parent process. All frames are ACKed or NAKed, depending on whether they were understood or not. The protocol is analogous to KISS or SLIP in concept. The framing is very different to these protocols, but the concept is that of frames delimited by a byte sequence, with occurrences of the special byte sequences replaced with place-holders to prevent the parser getting confused.

I then wrote this Python script which uses the asyncio IO loop to run 6lhagent and dump the packets it receives:

$ python3 demo/dumper.py 
Interface data: b'V\xc7\x05\\yA\x05\x00\x00\x00\x00\xca\x04tap0'
Interface: MAC=[86, 199, 5, 92, 121, 65] MTU=1280 IDX=202 NAME=tap0
Ethernet traffic: b'33330000001656c7055c794186dd600000000024000100000000000000000000000000000000ff0200000000000000000000000000163a000502000001008f00f5ec0000000104000000ff0200000000000000000001ff5c7941'
From: 33:33:00:00:00:16
To:   56:c7:05:5c:79:41
Protocol: 86dd
IPv6: Priority 0, Flow 000000
From: ::
To:   ff02::16
Length: 36, Next header: 0, Hop Limit: 1
Payload: b':\x00\x05\x02\x00\x00\x01\x00\x8f\x00\xf5\xec\x00\x00\x00\x01\x04\x00\x00\x00\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xff\\yA'
Ethernet traffic: b'33330000001656c7055c794186dd600000000024000100000000000000000000000000000000ff0200000000000000000000000000163a000502000001008f00f5ec0000000104000000ff0200000000000000000001ff5c7941'
From: 33:33:00:00:00:16
To:   56:c7:05:5c:79:41
Protocol: 86dd
IPv6: Priority 0, Flow 000000
From: ::
To:   ff02::16
Length: 36, Next header: 0, Hop Limit: 1
Payload: b':\x00\x05\x02\x00\x00\x01\x00\x8f\x00\xf5\xec\x00\x00\x00\x01\x04\x00\x00\x00\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xff\\yA'
Ethernet traffic: b'3333ff5c794156c7055c794186dd6000000000203aff00000000000000000000000000000000ff0200000000000000000001ff5c79418700bebb00000000fe8000000000000054c705fffe5c79410e01a02d5c9a6698'
From: 33:33:ff:5c:79:41
To:   56:c7:05:5c:79:41
Protocol: 86dd
IPv6: Priority 0, Flow 000000
From: ::
To:   ff02::1:ff5c:7941
Length: 32, Next header: 58, Hop Limit: 255
ICMP Type 135, Code 0, Checksum bebb
Data: b'\x00\x00\x00\x00\xfe\x80\x00\x00'
Payload: b'\x00\x00\x00\x00T\xc7\x05\xff\xfe\\yA\x0e\x01\xa0-\\\x9af\x98'
Ethernet traffic: b'33330000001656c7055c794186dd6000000000240001fe8000000000000054c705fffe5c7941ff0200000000000000000000000000163a000502000001008f0025070000000104000000ff0200000000000000000001ff5c7941'
From: 33:33:00:00:00:16
To:   56:c7:05:5c:79:41
Protocol: 86dd
IPv6: Priority 0, Flow 000000
From: fe80::54c7:5ff:fe5c:7941
To:   ff02::16
Length: 36, Next header: 0, Hop Limit: 1
Payload: b':\x00\x05\x02\x00\x00\x01\x00\x8f\x00%\x07\x00\x00\x00\x01\x04\x00\x00\x00\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xff\\yA'
Ethernet traffic: b'33330000001656c7055c794186dd6000000000240001fe8000000000000054c705fffe5c7941ff0200000000000000000000000000163a000502000001008f009cab0000000104000000ff0200000000000000000000000000fb'
From: 33:33:00:00:00:16
To:   56:c7:05:5c:79:41
Protocol: 86dd
IPv6: Priority 0, Flow 000000
From: fe80::54c7:5ff:fe5c:7941
To:   ff02::16
Length: 36, Next header: 0, Hop Limit: 1
Payload: b':\x00\x05\x02\x00\x00\x01\x00\x8f\x00\x9c\xab\x00\x00\x00\x01\x04\x00\x00\x00\xff\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xfb'

The thinking is that the bulk of the proof-of-concept will be done in Python. My reasoning for this is that it’s usually easier to prototype in a higher-level language than in C, and in this application, speed is not important. At best our network interface will be running at 9600 baud — Python will keep up just fine. Most of it will be at 1200 baud.

The Python code will do some packet filtering (e.g. filtering out the multicast NS messages, which are a no-no in RFC-6775) and to add options where required. It’ll also be responsible for rate-limiting the firehose-like output of the tap interface from the host so the AX.25 network doesn’t get flooded.

The proof of concept is coming together. Next steps are to implement an IPv6 stack of sorts in Python to dissect the datagrams.