
Making InfluxDB’s init script more patient

Recently, I noticed my network monitoring was down… I hadn’t worried about it because I had other things to keep me busy, and thankfully, my network monitoring, whilst important, isn’t mission critical.

I took a look at it today. The symptom was an odd one: influxd was running and listening on the backup/RPC port 8088, but not on port 8086 for queries.
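
For what it’s worth, a quick way to confirm that sort of state is to ask the kernel which sockets the daemon actually has open:

# Show listening TCP sockets and the processes that own them
ss -tlnp | grep influxd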

It was otherwise generating logs as if it were online. What gives?

Tried some different settings, nothing… nada… zilch. Nothing would make it listen on port 8086.

Tried updating to 1.8 (was 1.1), still nothing.

Tried manually running it as root… sure enough, if I waited long enough, it came up on its own and did begin listening on port 8086. Hmmm, I wonder. I had a look at the init script:

#!/bin/bash -e

/usr/bin/influxd -config /etc/influxdb/influxdb.conf $INFLUXD_OPTS &
PID=$!
echo $PID > /var/lib/influxdb/influxd.pid

PROTOCOL="http"
BIND_ADDRESS=$(influxd config | grep -A5 "\[http\]" | grep '^  bind-address' | cut -d ' ' -f5 | tr -d '"')
HTTPS_ENABLED_FOUND=$(influxd config | grep "https-enabled = true" | cut -d ' ' -f5)
HTTPS_ENABLED=${HTTPS_ENABLED_FOUND:-"false"}
if [ $HTTPS_ENABLED = "true" ]; then
  HTTPS_CERT=$(influxd config | grep "https-certificate" | cut -d ' ' -f5 | tr -d '"')
  if [ ! -f "${HTTPS_CERT}" ]; then
    echo "${HTTPS_CERT} not found! Exiting..."
    exit 1
  fi
  echo "$HTTPS_CERT found"
  PROTOCOL="https"
fi
HOST=${BIND_ADDRESS%%:*}
HOST=${HOST:-"localhost"}
PORT=${BIND_ADDRESS##*:}

set +e
max_attempts=10
url="$PROTOCOL://$HOST:$PORT/health"
result=$(curl -k -s -o /dev/null $url -w %{http_code})
while [ "$result" != "200" ]; do
  sleep 1
  result=$(curl -k -s -o /dev/null $url -w %{http_code})
  max_attempts=$(($max_attempts-1))
  if [ $max_attempts -le 0 ]; then
    echo "Failed to reach influxdb $PROTOCOL endpoint at $url"
    exit 1
  fi
done
set -e

Ahh right, so: start the server, check once a second to see if it’s up, and if it still hasn’t answered after ten attempts, just abort and let systemd restart the whole shebang. Because turning the power on-off-on-off-on-off is going to make it go faster, right?

I changed max_attempts to 360 and the sleep to 10.
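
That part of the loop then looks like this, giving influxd up to an hour (360 × 10 seconds) to come up before the script gives up:

set +e
max_attempts=360                 # was 10
url="$PROTOCOL://$HOST:$PORT/health"
result=$(curl -k -s -o /dev/null $url -w %{http_code})
while [ "$result" != "200" ]; do
  sleep 10                       # was 1
  result=$(curl -k -s -o /dev/null $url -w %{http_code})
  max_attempts=$(($max_attempts-1))
  if [ $max_attempts -le 0 ]; then
    echo "Failed to reach influxdb $PROTOCOL endpoint at $url"
    exit 1
  fi
done
set -e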

Having fixed this, I am now getting data back into my system.

Solar Cluster: Thank goodness for good monitoring

A few months back now, I had the misfortune of overshooting my Internet quota and winding up with an AU$380 bill for the month (and that was the capped amount… in truth it was more like AU$3000). In fact, it happened a couple of times before I finally nailed down the cause.

Part of it was NTP traffic (it seems lots of cowboys write SNTP clients now and point them at pool.ntp.org), and some was the #Hackaday.io Spambot Hunter Project and related activity. In short, I invested some money into upping the quota, and some time into better monitoring.

I wanted to do the monitoring anyway to keep an eye on operations, as well as things like the solar panel voltages. Since I got it in place, I’ve been able to get much faster notification when things go awry, much sooner than the 120% quota usage alarm that Internode sends you.

I’m glad I did that now: last night I left a few tabs open on the Hackaday.io site. This evening I noticed they were still trying to load something and got suspicious… then I saw this:

Double checking, sure enough, something on one of those pages had made Chromium get its knickers in a twist and chew through all that data.

It took me a bit of tinkering to get the right query to extract the above chart. Essentially there was a sustained download of about 1.5MB/sec for over 21 hours, which would account for the 113.1GB that Internode recorded (1.5MB/sec × 3600 sec/hour × 21 hours ≈ 113GB).
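
I won’t reproduce the exact query here, but against InfluxDB’s 1.x HTTP query API the general shape is something like this (the measurement and field names below are examples; the real ones depend on how collectd writes into the database):

# Example only: derive a bytes/sec rate from a collectd interface counter.
curl -G 'http://localhost:8086/query' --data-urlencode 'db=collectd' \
  --data-urlencode 'q=SELECT non_negative_derivative(mean("value"), 1s) FROM "interface_rx" WHERE time > now() - 24h GROUP BY time(5m)'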

It’s a bit coincidental that the usage dropped the moment I restarted Chromium. Not sure why it was continually reloading pages, but never mind.

The above data is collected using a combination of collectd and InfluxDB, with Grafana doing the dashboarding and alarms, and a small Perl script pulling the usage data off Internode’s API.
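
The Perl script isn’t reproduced here, but the write side of it boils down to a single line-protocol POST to InfluxDB’s 1.x write API; roughly, in shell terms (the database and measurement names here are made up, and fetching the figure from Internode’s API is left as a placeholder):

# Rough sketch only: the real script is in Perl and talks to Internode's API.
USAGE_MB=12345        # placeholder: usage figure fetched from the ISP's API
curl -XPOST 'http://localhost:8086/write?db=internode' \
  --data-binary "quota_usage value=${USAGE_MB}"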

Solar Cluster: Getting the battery voltage into Grafana

I’ve succeeded in getting a working battery monitor kernel module. It basically takes the application note by Technologic Systems and spins it into a power-supply class driver that reports the voltage via sysfs.

As it happens, the battery module in collectd does not see this as a “battery”, something I’ll look at later. For now the exec plugin works well enough, and this eventually feeds through to an InfluxDB database with Grafana sitting on top.
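
For anyone wanting to do similar, the exec plugin side can be a small shell script that emits PUTVAL lines; the sketch below assumes the driver registers under /sys/class/power_supply/battery, which will vary with the driver:

#!/bin/bash
# Hypothetical collectd exec script: report battery voltage from sysfs.
# COLLECTD_HOSTNAME and COLLECTD_INTERVAL are set by collectd for exec plugins.
HOST="${COLLECTD_HOSTNAME:-$(hostname -f)}"
INTERVAL="${COLLECTD_INTERVAL:-60}"
while sleep "$INTERVAL"; do
  # power-supply class drivers expose voltage_now in microvolts
  uv=$(cat /sys/class/power_supply/battery/voltage_now)
  volts=$(awk "BEGIN { printf \"%.3f\", $uv / 1000000 }")
  echo "PUTVAL \"$HOST/exec-battery/voltage\" interval=$INTERVAL N:$volts"
done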

https://netmon.longlandclan.id.au/d/IyZP-V2mk/battery-voltage?orgId=1