Nov 252017
 

Seems I’ve vindicated my decision to chase ECC memory modules for my servers, inspite of ECC DDR3 SODIMMs being harder to find. (Pro tip, a 8GB ECC module will have an organisation of 1Gbit×72.)

Specifically mentioned, is that ECC memory is more resistant to these problems.  (Thanks to Sebastian Pipping for forwarding this.)

Nov 192017
 

So, this weekend I did plan to run from solar full time to see how it’d go.

Mother nature did not co-operate.  I think there was about 2 hours of sunlight!  This is what the 24 hour rain map looks like from the local weather radar (image credit: Bureau of Meteorology):

In the end, I opted to crimp SB50 connectors onto the old Redarc BCDC1225 and hook it up between the battery harness and the 40A power supply. It’s happily keeping the batteries sitting at about 13.2V, which is fine. The cluster ran for months off this very same power supply without issue: it’s when I introduced the solar panels that the problems started. With a separate controller doing the solar that has over-discharge protection to boot, we should be fine.

I also have mostly built-up some monitoring boards based on the TI INA219Bs hooked up to NXP LPC810s. I have not powered these up yet, plan is to try them out with a 1ohm resistor as the stand-in for the shunt and a 3V rail… develop the firmware for reporting voltage/current… then try 9V and check nothing smokes.

If all is well, then I’ll package them up and move them to the cluster. Not sure of protocols just yet. Modbus/RTU is tempting and is a protocol I’m familiar with at work and would work well for this application, given I just need to represent voltage and current… both of which can be scaled to fit 16-bit registers easy (voltage in mV, current in mA would be fine).

I just need some connectors to interface the boards to the outside world and testing will begin. I’ve ordered these and they’ll probably turn up some time this week.

Nov 132017
 

So, at present I’ve been using a two-charger solution to keep the batteries at full voltage.  On the solar side is the Powertech MP3735, which also does over-discharge protection.  On the mains side, I’m using a Xantrex TC2012.

One thing I’ve observed is that the TC2012, despite being configured for AGM batteries, despite the handbook saying it charges AGM batteries to a maximum 14.3V, has a happy knack of applying quite high charging voltages to the batteries.

I’ve verified this… every meter I’ve put across it has reported it at one time or another, more than 15V across the terminals of the charger.  I’m using SB50 connectors rated at 50A and short runs of 6G cable to the batteries.  So a nice low-resistance path.

The literature I’ve read says 14.8V is the maximum.  I think something has gone out of calibration!

This, and the fact that the previous set-up over-discharged the batteries twice, are the factors that lead to the early failure of both batteries.

The two new batteries (Century C12-105DA) are now sitting in the battery cases replacing the two Giant Energy batteries, which will probably find themselves on a trip to the Upper Kedron recycling facility in the near future.

The Century batteries were chosen as I needed the replacements right now and couldn’t wait for shipping.  This just happened to be what SuperCheap Auto at Keperra sell.

The Giant Energy batteries took a number of weeks to arrive: likely because the seller (who’s about 2 hours drive from me) had run out of stock and needed to order them in (from China).  If things weren’t so critical, I might’ve given those batteries another shot, but I really didn’t have the time to order in new ones.

I have disconnected the Xantrex TC2012.  I really am leery about using it, having had one bad experience with it now.  The replacement batteries cost me $1000.  I don’t want to be repeating the exercise.

I have a couple of options:

  1. Ditch the idea of using mains power and just go solar.
  2. Dig out the Redarc BCDC1225 charger I was using before and hook that up to a regulated power supply.
  3. Source a new 20A mains charger to hook in parallel with the batteries.
  4. Hook a dumb fixed-voltage power supply in parallel with the batteries.
  5. Hook a dumb fixed-voltage power supply in parallel with the solar panel.

Option (1) sounds good, but what if there’s a run of cloudy days?  This really is only an option once I get some supervisory monitoring going.  I have the current shunts fitted and the TI INA219Bs for measuring those shunts arrived a week or so back, just haven’t had the time to put that into service.  This will need engineering time.

Option (2) could be done right now… and let’s face it, its problem was switching from solar to mains.  In this application, it’d be permanently wired up in boost mode.  Moreover, it’s theoretically impossible to over-discharge the batteries now as the MP3735 should be looking after that.

Option (3) would need some research as to what would do the job.  More money to spend, and no guarantee that the result will be any better than what I have now.

Option (4) I’m leery about, as there’s every possibility that the power supply could be overloaded by inrush current to the battery.  I could rig up a PWM circuit in concert with the monitoring I’m planning on putting in, but this requires engineering time to figure out.

Option (5) I’m also leery about, not sure how the panels will react to having a DC supply in parallel to them.  The MP3735 allegedly can take an input DC supply as low as 9V and boost that up, so might see a 13.8V switchmode PSU as a solar panel on a really cloudy day.  I’m not sure though.  I can experiment, plug it in and see how it reacts.  Research gives mixed advice, with this Stack Exchange post saying yes and this Reddit thread suggesting no.

I know now that the cluster averages about 7A.  In theory, I should have 30 hours capacity in the batteries I have now, if I get enough sun to keep them charged.

This I think, will be a week-end experiment, and maybe something I’ll try this weekend.  Right now, the cluster itself is running from my 40A switchmode PSU, and for now, it can stay there.

I’ll let the solar charger top the batteries up from the sun this week.  With no load, the batteries should be nice and full, ready come Friday evening, when I can, gracefully, bring the cluster down and hook it up to the solar charger load output.  If, at the end of the weekend, it’s looking healthy, I might be inclined to leave it that way.

Nov 122017
 

So, yesterday I had this idea of building an IC storage unit to solve a problem I was facing with the storage of IC tubes, and to solve an identical problem faced at HSBNE.

It turned out to be a longish project, and by 11:30PM, I had gotten far, but still had a bit of work to do.  Rather than slog it out overnight, I thought I’d head home and resume it the next day.  Instead of carting the lot home, and back again, I decided to leave my bicycle trailer with all the project gear and my laptop, stashed at HSBNE’s wood shop.

By the time I had closed up the shop and gotten going, it was after midnight.  That said, the hour of day was a blessing: there was practically no traffic, so I road on the road most of the way, including the notorious Kingsford-Smith Drive.  I made it home in record time: 1 hour, 20 minutes.  A record that stood until later this morning coming the other way, doing the run in 1:10.

I was exhausted, and was thinking about bed, but wheeling the bicycle up the drive way and opening the garage door, I caught a whiff.  What’s that smell?  Sulphur??

Remember last post I had battery trouble, so isolated the crook battery and left the “good” one connected?

The charger was going flat chat, and the battery case was hot!  I took no chances, I switched the charger off at the wall and yanked the connection to the battery wiring harness.  I grabbed some chemical handling gloves and heaved the battery case out.  Yep, that battery was steaming!  Literally!

This was the last thing I wanted to deal with at nearly 2AM on a Sunday morning.  I did have two new batteries, but hadn’t yet installed them.  I swapped the one I had pulled out last fortnight, and put in one of the new ones.  I wanted to give them a maintenance charge before letting them loose on the cluster.

The other dud battery posed a risk though, with the battery so hot and under high pressure, there was a good chance that it could rupture if it hadn’t already.  A shower of sulphuric acid was not something I wanted.

I decided there was nothing running on the cluster that I needed until I got up later today, so left the whole kit off, figuring I’d wait for that battery to cool down.

5AM, I woke up, checked the battery, still warm.  Playing it safe, I dusted off the 40A switchmode PSU I had originally used to power the Redarc controller, and plugged it directly into the cluster, bypassing the batteries and controller.  That would at least get the system up.

This evening, I get home (getting a lift), and sure enough, the battery has cooled down, so I swap it out with another of the new batteries.  One of the new batteries is charging from the mains now, and I’ll do the second tomorrow.

See if you can pick out which one is which…

Nov 062017
 

So, I’m doing some development on a Cortex M3-based device with access to only one serial port, and that serial port is doing double-duty as serial console and polling a Modbus energy meter.  How do I get log messages out?

My code actually implements the stubs to direct stdout and stderr transparently to the serial port, however this has to go to /dev/null when the Modbus port is in use.  That said, _write_r still gets called, in my code, it is possible to set a breakpoint inside the _write_r function when traffic is identified for the console.

As it happens, gdb can be told to not only break there, but to perform a series of actions.  In my case serial.c:659 is the file and line number inside an if branch that handles the console code.  Setting up gdb to print this data out requires the following statements:

(gdb) break serial.c:659
(gdb) commands
Type commands for breakpoint(s) 3, one per line.
End with a line saying just "end".
>set ((char*)buf)[cnt] = 0
>print (char*)buf
>continue
>end
(gdb) c

The result:

Breakpoint 3, _write_r (ptr=, fd=0, buf=0x200068c0, cnt=78) at /home/stuartl/vrt/projects/widesky/hub/hal/src/serial.c:659
659			if (serial_console_target.port) {
$51 = 0x200068c0 "/home/stuartl/vrt/projects/widesky/hub/hal/demo/main.c:226 Registration sent\r\n"

Breakpoint 3, _write_r (ptr=, fd=0, buf=0x200068c0, cnt=46) at /home/stuartl/vrt/projects/widesky/hub/hal/src/serial.c:659
659			if (serial_console_target.port) {
$52 = 0x200068c0 "Received NTP time is Mon Nov  6 04:13:41 2017\n"

Breakpoint 3, _write_r (ptr=, fd=0, buf=0x200068c0, cnt=2) at /home/stuartl/vrt/projects/widesky/hub/hal/src/serial.c:659
659			if (serial_console_target.port) {
$53 = 0x200068c0 "\r\n"

Breakpoint 3, _write_r (ptr=, fd=0, buf=0x200068c0, cnt=89) at /home/stuartl/vrt/projects/widesky/hub/hal/src/serial.c:659
659			if (serial_console_target.port) {
$54 = 0x200068c0 "/home/stuartl/vrt/projects/widesky/hub/hal/demo/main.c:115 Registration timeout: 30 sec\r\n"

Breakpoint 3, _write_r (ptr=, fd=0, buf=0x200068c0, cnt=83) at /home/stuartl/vrt/projects/widesky/hub/hal/src/serial.c:659
659			if (serial_console_target.port) {
$55 = 0x200068c0 "/home/stuartl/vrt/projects/widesky/hub/hal/demo/main.c:130 Select source address:\r\n"

Breakpoint 3, _write_r (ptr=, fd=0, buf=0x200068c0, cnt=53) at /home/stuartl/vrt/projects/widesky/hub/hal/src/serial.c:659
659			if (serial_console_target.port) {
$56 = 0x200068c0 " ? fdde:ad00:beef:0:0:ff:fe00:a400 Pref=Y Valid=Y\r\n"

Breakpoint 3, _write_r (ptr=, fd=0, buf=0x200068c0, cnt=15) at /home/stuartl/vrt/projects/widesky/hub/hal/src/serial.c:659
659			if (serial_console_target.port) {
$57 = 0x200068c0 " ? Selected\r\n"

---Type  to continue, or q  to quit---
Breakpoint 3, _write_r (ptr=, fd=0, buf=0x200068c0, cnt=78) at /home/stuartl/vrt/projects/widesky/hub/hal/src/serial.c:659
659			if (serial_console_target.port) {
$58 = 0x200068c0 "/home/stuartl/vrt/projects/widesky/hub/hal/demo/main.c:226 Registration sent\r\n"

Not as nice as having a dedicated port, but better than nothing.

Nov 052017
 

So… with the new controller we’re able to see how much current we’re getting from the solar.  I note they omit the solar voltage, and I suspect the current is how much is coming out of the MPPT stage, but still, it’s more information than we had before.

With this, we noticed that on a good day, we were getting… 7A.

That’s about what we’d expect for one panel.  What’s going on?  Must be a wiring fault!

I’ll admit when I made the mounting for the solar controller, I didn’t account for the bend radius in the 6gauge wire I was using, and found it was difficult to feed it into the controller properly.  No worries, this morning at 4AM I powered everything off, took the solar controller off, drilled 6 new holes a bit lower down, fed the wires through and screwed them back in.

Whilst it was all off, I decided I’d individually charge the batteries.  So, right-hand battery came first, I hook the mains charger directly up and let ‘er rip.  Less than 30 minutes later, it was done.

So, disconnect that, hook up the left hand battery.  45 minutes later the charger’s still grinding away.  WTF?

Feel the battery… it is hot!  Double WTF?

It would appear that this particular battery is stuffed.  I’ve got one good one though, so for now I pull the dud out and run with just the one.

I hook everything up,  do some final checks, then power the lot back up.

Things seem to go well… I do my usual post-blackout dance of connecting my laptop up to the virtual instance management VLAN, waiting for the OpenNebula VM to fire up, then log into its interface (because we’re too kewl to have a command line tool to re-start an instance), see my router and gitea instances are “powered off”, and instruct the system to boot them.

They come up… I’m composing an email, hit send… “Could not resolve hostname”… WTF?  Wander downstairs, I note the LED on the main switch flashing furiously (as it does on power-up) and a chorus of POST beeps tells me the cluster got hard-power-cycled.  But why?  Okay, it’s up now, back up stairs, connect to the VLAN, re-start everything again.

About to send that email again… boompa!  Same error.  Sure enough, my router is down.  Wander downstairs, and as I get near, I hear the POST beeps again.  Battery voltage is good, about 13.2V.  WTF?

So, about to re-start everything, then I lose contact with my OpenNebula front-end.  Okay, something is definitely up.  Wander downstairs, and the hosts are booting again.  On a hunch I flick the off-switch to the mains charger.  Klunk, the whole lot goes off.  There’s no connection to the battery, and so when the charger drops its power to check the battery voltage, it brings the whole lot down.

WTF once more?  I jiggle some wires… no dice.  Unplug, plug back in, power blinks on then off again.  What is going on?

Finally, I pull right-hand battery out (the left-hand one is already out and cooling off, still very warm at this point), 13.2V between the negative terminal and positive on the battery, good… 13.2V between negative and the battery side of the isolator switch… unscrew the fuse holder… 13.2V between fuse holder terminal and the negative side…  but 0V between negative side on battery and the positive terminal on the SB50 connector.

No apparent loose connections, so I grab one of my spares, swap it with the existing fuse.  Screw the holder back together, plug the battery back in, and away it all goes.

This is the offending culprit.  It’s a 40A 5AG fuse.  Bought for its current carrying capacity, not for the “bling factor” (gold conductors).

If I put my multimeter in continuance test mode and hold a probe on each end cap, without moving the probes, I hear it go open-circuit, closed-circuit, open-circuit, closed-circuit.  Fuses don’t normally do that.

I have a few spares of these thankfully, but I will be buying a couple more to replace the one that’s now dead.  Ohh, and it looks like I’m up for another pair of batteries, and we will have a working spare 105Ah once I get the new ones in.

On the RAM front… the firm I bought the last lot through did get back to me, with some DDR3L ECC SO-DIMMs, again made by Kingston.  Sounded close enough, they were 20c a piece more (AU$855 for 6 vs $AU864.50).

Given that it was likely this would be an increasing problem, I thought I’d at least buy enough to ensure every node had two matched sticks in, so I told them to increase the quantity to 9 and to let me know what I owe them.

At first they sent me the updated invoice with the total amount (AU$1293.20).  No problems there.  It took a bit of back-and-forth before I finally confirmed they had the previous amount I sent them.  Great, so into the bank I trundle on Thursday morning with the updated invoice, and I pay the remainder (AU$428.70).

Friday, I get the email to say that product was no longer available.  They instead, suggested some Crucial modules which were $60 a piece cheaper.  Well, when entering a gold mine, one must prepare themselves for the shaft.

Checking the link, I found it: these were non-ECC.  1Gbit×64, not 1Gbit×72 like I had ordered.  In any case I was over it, I fired back an email telling them to cancel the order and return the money.  I was in no mood for Internet shopper Russian Roulette.

It turns out I can buy the original sticks through other suppliers, just not in the quantities I’m after.  So I might be able to buy one or two from a supplier, I can’t buy 9.  Kingston have stopped making them and so what’s left is whatever companies have in stock.

So I’ll have to move to something else.  It’d be worth buying one stick of the original type so I can pair it with one of the others, but no more than that.  I’m in no mood to do this in a few years time when parts are likely to be even harder to source… so I think I’ll bite the bullet and go 16GB modules.  Due to the limits on my debit card though, I’ll have to buy them two at a time (~$900AUD each go).  The plan is:

  1. Order in two 16GB modules and an 8GB module… take existing 8GB module out of one of the compute nodes and install the 16GB modules into that node.  Install the brand new 8GB module and the recovered 8GB module into two of the storage nodes.  One compute node now has 32GB RAM, and two storage nodes are now upgraded to 16GB each.  Remaining compute node and storage node each have 8GB.
  2. Order in two more 16GB modules… pull the existing 8GB module out of the other compute node, install the two 16GB modules.  Then install the old 8GB module into the remaining storage node.  All three storage nodes now have 16GB each, both compute nodes have 32GB each.
  3. Order two more 16GB modules, install into one compute node, it now has 64GB.
  4. Order in last two 16GB modules, install into the other compute node.

Yes, expensive, but sod it.  Once I’ve done this, the two nodes doing all the work will be at their maximum capacity.  The storage nodes are doing just fine with 8GB, so 16GB should mean there’s plenty of RAM for caching.

As for virtual machine management… I’m pretty much over OpenNebula.  Dealing with libvirt directly is no fun, but at least once configured, it works!  OpenNebula has a habit of not differentiating between a VM being powered off (as in, me logging into the guest and issuing a shutdown), and a VM being forcefully turned off by the host’s power getting yanked!

With one, there should be some event fired off by libvirt to tell OpenNebula that the VM has indeed turned itself off.  With the latter, it should observe that one moment the VM is there, and next it isn’t… the inference being that it should still be there, and that perhaps that VM should be re-started.

This could be a libvirt limitation too.  I’ll have to research that.  If it is, then the answer is clear: we ditch libvirt and DIY.  I’ll have to research how I can establish a quorum and schedule where VMs get put, but it should be doable without the hassle that OpenNebula has been so far, and without going to the utter tedium that is OpenStack.