Oct 07 2021
 

Recently, I noticed my network monitoring was down… I hadn’t worried about it because I had other things to keep me busy, and thankfully, my network monitoring, whilst important, isn’t mission critical.

I took a look at it today. The symptom was an odd one: influxd was running and listening on the back-up/RPC port 8088, but not on port 8086 for queries.

It otherwise was generating logs as if it were online. What gives?

Tried some different settings, nothing… nada… zilch. Nothing would make it listen on port 8086.

Tried updating to 1.8 (was 1.1), still nothing.

Tried manually running it as root… sure enough, if I waited long enough, it started on its own, and did begin listening on port 8086. Hmmm, I wonder. I had a look at the init scripts:

#!/bin/bash -e

/usr/bin/influxd -config /etc/influxdb/influxdb.conf $INFLUXD_OPTS &
PID=$!
echo $PID > /var/lib/influxdb/influxd.pid

PROTOCOL="http"
BIND_ADDRESS=$(influxd config | grep -A5 "\[http\]" | grep '^  bind-address' | cut -d ' ' -f5 | tr -d '"')
HTTPS_ENABLED_FOUND=$(influxd config | grep "https-enabled = true" | cut -d ' ' -f5)
HTTPS_ENABLED=${HTTPS_ENABLED_FOUND:-"false"}
if [ $HTTPS_ENABLED = "true" ]; then
  HTTPS_CERT=$(influxd config | grep "https-certificate" | cut -d ' ' -f5 | tr -d '"')
  if [ ! -f "${HTTPS_CERT}" ]; then
    echo "${HTTPS_CERT} not found! Exiting..."
    exit 1
  fi
  echo "$HTTPS_CERT found"
  PROTOCOL="https"
fi
HOST=${BIND_ADDRESS%%:*}
HOST=${HOST:-"localhost"}
PORT=${BIND_ADDRESS##*:}

set +e
max_attempts=10
url="$PROTOCOL://$HOST:$PORT/health"
result=$(curl -k -s -o /dev/null $url -w %{http_code})
while [ "$result" != "200" ]; do
  sleep 1
  result=$(curl -k -s -o /dev/null $url -w %{http_code})
  max_attempts=$(($max_attempts-1))
  if [ $max_attempts -le 0 ]; then
    echo "Failed to reach influxdb $PROTOCOL endpoint at $url"
    exit 1
  fi
done
set -e

Ahh right, so start the server, check every second to see if it’s up, and if it still isn’t after 10 attempts, just abort and let systemd restart the whole shebang. Because turning the power on-off-on-off-on-off is going to make it go faster, right?

I changed max_attempts to 360 and the sleep to 10, which gives influxd roughly an hour (360 × 10 seconds) to come up before the script gives up.
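
For illustration, here is a minimal sketch of the same wait loop with the two tuned values pulled out as parameters. The function name wait_for_health and its argument order are hypothetical and not part of the packaged script; the defaults simply mirror the 360 attempts and 10-second sleep mentioned above.

```shell
#!/bin/bash
# Hypothetical helper (not part of the packaged init script):
# poll a health URL until it answers with HTTP 200, or give up.
# Usage: wait_for_health URL [MAX_ATTEMPTS] [SLEEP_SECONDS]
wait_for_health() {
  local url=$1
  local max_attempts=${2:-360}   # default mirrors the edited script
  local interval=${3:-10}        # seconds to wait between attempts
  local result

  while :; do
    # -k: skip certificate verification, -s: silent,
    # -w: print only the HTTP status code
    result=$(curl -k -s -o /dev/null -w '%{http_code}' "$url")
    [ "$result" = "200" ] && return 0

    max_attempts=$((max_attempts - 1))
    if [ "$max_attempts" -le 0 ]; then
      echo "Failed to reach influxdb endpoint at $url" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

Called as `wait_for_health "$PROTOCOL://$HOST:$PORT/health" 360 10 || exit 1`, this would behave much like the edited loop, assuming the variables are set as in the script above.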

Having fixed this, I am now getting data back into my system.