Oct 22 2018
 

Today, it seems, the IT gremlins have been out to get me.  At my work I have a desktop computer (personal hardware) consisting of a Ryzen 7 1700, 16GB RAM, a 240GB Intel M.2 SATA SSD (540 series) and a 4TB Western Digital HDD.

The machine has been pretty reliable, if not rock-solid; in particular, compiling gcc sometimes segfaulted for reasons unknown (the RAM checks out okay according to memtest86), but for what I was doing, it mostly ran fine.  I put up with the minor niggles with the view of solving those another day.  Today though, I came in and found X had crashed.

Okay, no big deal, re-start the display manager, except that crashed too.

Hmm, okay, log in under my regular user account and try startx:  No dice, there’s no room on /.

Ahh, that might explain a few things.  We clean up some log files, truncate a 500MB file, and manage to free up 50GB (!).

The machine dual-boots two OSes: Debian 9 and Gentoo.  It’s been running the latter for about 12 months now; I used Debian 9 to get things rolling so I could use the machine at work (I did try Ubuntu 16.04, but it didn’t like my machine), and later used that to get Gentoo running before switching over.  So there was a 40GB partition on the SSD with a year-old install of Debian that wasn’t being used.  I figured I’d ditch it, and re-locate my Gentoo partition to occupy that space.

So I pull out an Ubuntu 18.04 disc, boot that up, and get gparted going.  It’s happily copying along until, WHAM, I’m hit with an I/O error:

Failed re-location of partition

Clicking any of the three buttons resulted in the same message.  Brilliant.  I had just copied over the first 15GB of the partition, so the Debian install would be hosed (I was deleting it anyway), but my Gentoo root partition should still be there intact at its old location.  Of course the partition table was updated, so no rolling back there.  At this point, I couldn’t do anything with the SSD, it had completely stalled, and I just had to cut my losses and kill gparted.

I managed to make some room on the 4TB drive by shuffling some partitions around so I could install Ubuntu 18.04 there.  My /home partition was btrfs on the 4TB drive (first partition), the rest of that drive was LVM.  I just shrank my /home down by 40GB and slipped it in there.  The boot-loader didn’t install (no EFI partition), but who cares, I know enough of grub to boot from the DVD and bootstrap the machine that way.  At first it wouldn’t boot because, in their wisdom, the installer had created the root partition with a @ subvolume.  I worked around that by making the @ subvolume the default.
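In case it helps anyone else, the dance for making @ the default subvolume is roughly as follows; the device name and subvolume ID shown here are illustrative, not necessarily what you’ll see:

mount /dev/sda5 /mnt                  # the new Ubuntu root partition (assumed name)
btrfs subvolume list /mnt
# ID 257 gen 1234 top level 5 path @
btrfs subvolume set-default 257 /mnt  # pass whatever ID is reported for @
umount /mnt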

Then there was momentary panic when the /home partition I had specified lacked my usual files.  Turned out, they had created a @home subvolume on my existing /home partition.  Why? Who knows?  Debian/Ubuntu seem to do strange things with btrfs which do nothing but complicate matters and I do not understand the reasoning.  Editing /etc/fstab to remove the subvolume argument for /home and re-booting fixed that.
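For reference, the fix amounted to a one-line change along these lines; the device name is an assumption for illustration, with the installer’s original line shown commented out:

# /etc/fstab
#/dev/sda1  /home  btrfs  defaults,subvol=@home  0  2
/dev/sda1   /home  btrfs  defaults               0  2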

I set up an LVM volume that would receive a DD dump of the mangled partition to see what could be saved.  GNU’s ddrescue managed to recover most of the raw partition, and so now I just had to find where the start was.  If I’d had the output of fdisk -l from before I started, I’d be right, but I didn’t have that foresight.  (Maybe if I had just formatted a LVM volume and DD’d the root fs before involving gparted?  Never mind!)
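Something along these lines does the job; the source device name is an assumption, and the map file lets ddrescue resume if the drive stalls again:

ddrescue -f /dev/sdb2 /dev/mapper/scratch-rootbackup /root/rootbackup.map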

I figured there’d be some kind of magic bytes I could “grep” for.  Something that would tell me “BTRFS was here”.  Sure enough, the information is stashed in the superblock.  At 0x00010040 from the start of the partition, I should see the magic bytes 5f 42 48 52 66 53 5f 4d (“_BHRfS_M”).  I just needed to grep for these.  To speed things up I made an educated guess on the start location.  The screenshot says the old partition was about 37.25GB in size, so that was a hint to maybe try skipping that bit and see what could be seen.

Sure enough, I found what looked to be the superblock:

root@vk4msl-ws:~# dd if=/dev/mapper/scratch-rootbackup skip=38100 count=200 bs=1M | hexdump -C | grep '5f 42 48 52 66 53 5f 4d'
02e10040  5f 42 48 52 66 53 5f 4d  9d 30 0d 02 00 00 00 00  |_BHRfS_M.0......|
06e00040  5f 42 48 52 66 53 5f 4d  9d 30 0d 02 00 00 00 00  |_BHRfS_M.0......|
200+0 records in
200+0 records out
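
Doing the arithmetic: the match sits 0x02e10040 bytes into a dump that itself skipped the first 38100 blocks of 1MiB, and the magic lives 0x10040 bytes into the filesystem, so the filesystem start works out like so:

# offset of the match, minus the superblock magic offset, in MiB
echo $(( (0x02e10040 - 0x10040) / 1048576 ))   # 46
echo $(( 38100 + 46 ))                         # 38146 -> the skip= value used below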

Some other probes seemed to confirm this: my quarry seemed to start 38146MB into the now-merged partition.  I started copying that to a new LVM volume in the hope of being able to mount it:

root@vk4msl-ws:~# dd if=/dev/mapper/scratch-rootbackup of=/dev/mapper/scratch-gentoo--root bs=1M skip=38146

Whilst waiting for this to complete, I double-checked my findings by inspecting the other fields. From the screenshot, I know my filesystem UUID was 6513682e-7182-4474-89e6-c0d1c71966ad. Looking at the superblock, sure enough I see that listed:

root@vk4msl-ws:~# dd if=/dev/scratch/gentoo-root bs=$(( 0x10000 )) skip=1 count=1 | hexdump -C
1+0 records in
1+0 records out
00000000  5f f9 98 90 00 00 00 00  00 00 00 00 00 00 00 00  |_...............|
65536 bytes (66 kB, 64 KiB) copied, 0.000116268 s, 564 MB/s
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  65 13 68 2e 71 82 44 74  89 e6 c0 d1 c7 19 66 ad  |e.h.q.Dt......f.|
00000030  00 00 01 00 00 00 00 00  01 00 00 00 00 00 00 00  |................|
00000040  5f 42 48 52 66 53 5f 4d  9d 30 0d 02 00 00 00 00  |_BHRfS_M.0......|
00000050  00 00 32 da 32 00 00 00  00 00 02 00 00 00 00 00  |..2.2...........|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

Looks promising! After an agonising wait, the dd finishes. I can check the filesystem:

root@vk4msl-ws:~# btrfsck /dev/scratch/gentoo-root 
Checking filesystem on /dev/scratch/gentoo-root
UUID: 6513682e-7182-4474-89e6-c0d1c71966ad
checking extents
checking free space cache
block group 111690121216 has wrong amount of free space
failed to load free space cache for block group 111690121216
block group 161082245120 has wrong amount of free space
failed to load free space cache for block group 161082245120
checking fs roots
checking csums
checking root refs
found 107544387643 bytes used, no error found
total csum bytes: 99132872
total tree bytes: 6008504320
total fs tree bytes: 5592694784
total extent tree bytes: 271663104
btree space waste bytes: 1142962475
file data blocks allocated: 195274670080
 referenced 162067775488

Okay, it complained that the free space was wrong (which I’ll blame on gparted prematurely growing the partition), but the data is there!  This is confirmed by mounting the volume and doing an ls:

root@vk4msl-ws:~# mount /dev/scratch/gentoo-root /mnt/
root@vk4msl-ws:~# ls /mnt/ -l
total 4
drwxr-xr-x 1 root root 1020 Oct  7 14:13 bin
drwxr-xr-x 1 root root   18 Jul 21  2017 boot
drwxr-xr-x 1 root root   16 May 28 10:29 dbus-1
drwxr-xr-x 1 root root 1686 May 31  2017 dev
drwxr-xr-x 1 root root 3620 Oct 19 18:53 etc
drwxr-xr-x 1 root root    0 Jul 14  2017 home
lrwxrwxrwx 1 root root    5 Sep 17 09:20 lib -> lib64
drwxr-xr-x 1 root root 1156 Oct  7 13:59 lib32
drwxr-xr-x 1 root root 4926 Oct 13 05:13 lib64
drwxr-xr-x 1 root root   70 Oct 19 11:52 media
drwxr-xr-x 1 root root   28 Apr 23 13:18 mnt
drwxr-xr-x 1 root root  336 Oct  9 07:27 opt
drwxr-xr-x 1 root root    0 May 31  2017 proc
drwx------ 1 root root  390 Oct 22 06:07 root
drwxr-xr-x 1 root root   10 Jul  6  2017 run
drwxr-xr-x 1 root root 4170 Oct  9 07:57 sbin
drwxr-xr-x 1 root root   10 May 31  2017 sys
drwxrwxrwt 1 root root 6140 Oct 22 06:07 tmp
drwxr-xr-x 1 root root  304 Oct 19 18:20 usr
drwxr-xr-x 1 root root  142 May 17 12:36 var
root@vk4msl-ws:~# cat /mnt/etc/gentoo-release 
Gentoo Base System release 2.4.1

Yes, I’ll be backing this up properly RIGHT NOW. But, my data is back, and I’ll be storing this little data recovery technique for next time.

The real lessons here are:

  1. KEEP RELIABLE BACKUPS! You never know when something will fail.
  2. Catch the copy process before it starts overwriting your source data! If there’s no overlap between the old and new locations, you’re fine, but if there is and it starts overwriting the start of your original volume, it’s all over red rover! You might be lucky with a superblock back-up, but don’t bet on it!
  3. Make note of the filesystem type and its approximate location. The fact that I knew roughly where to look, and what sort of filesystem I was looking for, meant I could look for magic bytes that say “I’m a BTRFS filesystem”. The magic bytes for EXT4, XFS, etc. will differ, but the same concepts apply; you just have to look up the documentation on how your particular filesystem structures its data (see the sketch below).
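
As a rough sketch of that last point, here’s where a couple of common filesystems keep their magic numbers and how you might probe for them; /dev/sdXN is a placeholder for whatever device or image you’re searching:

# btrfs: ASCII "_BHRfS_M" at 0x10040; ext2/3/4: bytes 53 ef at 0x438; XFS: "XFSB" at 0x0
dd if=/dev/sdXN bs=64K skip=1 count=1 2>/dev/null | hexdump -C | grep '_BHRfS_M'
dd if=/dev/sdXN bs=512 skip=2 count=1 2>/dev/null | hexdump -C | grep '53 ef'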
Jan 30 2018
 

So, today I had a problem… I needed to solve a race condition in a test case for my workplace’s WideSky system.  The test case was meant to ensure that, if the AMQP broker crashed or was restarted, it would re-connect and resume operations as quickly as possible.

On my desktop (an 8-core AMD Ryzen 7), the test case always passed.  On the CI server (a VM running on a dual-core Core i3), it failed.  I figured the desktop here was running too quickly for me to see the problem.  I needed a machine that ran more like the CI server to see the problem.

Looking around, I couldn’t see any way to reliably slow down QEMU, KVM or VirtualBox… but I do remember one old project from the mid-late 90s that could: Bochs.

Bochs in action… emulating a P4 Prescott on a Ryzen 7

Turns out, far from what it could do back in 1998 when it was strictly a 386 emulator (and a slow one at that!), it now has AMD64 emulation capabilities.  Thus, I can run the software stack inside this VM, and have it throttle the CPU speed down so that hopefully, the problem arises.  The first problem I needed to solve was trying to get the network running.  We have a PXE boot server which can serve up Ubuntu, no problem.  I just needed to bridge the Bochs VM onto the network somehow.

I already have bridge interfaces configured on my two physical network interfaces, and these work great with KVM.  Sadly, Bochs is rather primitive in what it supports… tap-mode networking just did not work, it complained that tap0 was not “running” even if created beforehand by iproute2, but I did find I could bind it directly to one of the enslaved network interfaces (enp36s0.200; yes, a VLAN interface).

e1000 worked for network booting, but then Ubuntu couldn’t retrieve an IP address for whatever reason. ne2k is working fine, and presently, I have the VM installing.  To make it network bootable, you need a boot ROM image, which you can download from the iPXE rom-o-matic service.  The magic PCI IDs you need are 10ec 8029 for ne2k, or (if it gets fixed) 8086 100e for e1000.

The following is my Bochs config file:

# configuration file generated by Bochs
plugin_ctrl: unmapped=1, biosdev=1, speaker=1, extfpuirq=1, parallel=1, serial=1, gameport=1, ne2k=1
config_interface: textconfig
display_library: x
debug: action=report
memory: host=2048, guest=2048
romimage: file="/usr/share/bochs/BIOS-bochs-latest", address=0x0, options=none
vgaromimage: file="/usr/share/bochs/VGABIOS-lgpl-latest"
boot: disk, network
floppy_bootsig_check: disabled=0
# no floppya
# no floppyb
ata0: enabled=1, ioaddr1=0x1f0, ioaddr2=0x3f0, irq=14
ata0-master: type=disk, path="/tmp/wstest.raw", mode=flat, cylinders=0, heads=0, spt=0, model="Generic 1234", biosdetect=auto, translation=auto
ata0-slave: type=none
ata1: enabled=1, ioaddr1=0x170, ioaddr2=0x370, irq=15
ata1-master: type=none
ata1-slave: type=none
ata2: enabled=0
ata3: enabled=0
optromimage1: file=none
optromimage2: file=none
optromimage3: file=none
optromimage4: file=none
optramimage1: file=none
optramimage2: file=none
optramimage3: file=none
optramimage4: file=none
pci: enabled=1, chipset=i440fx, slot1=ne2k, slot2=cirrus
vga: extension=cirrus, update_freq=5, realtime=1
cpu: count=1:1:1, ips=40000000, quantum=16, model=p4_prescott_celeron_336, reset_on_triple_fault=1, cpuid_limit_winnt=0, ignore_bad_msrs=1, mwait_is_nop=0
print_timestamps: enabled=0
port_e9_hack: enabled=0
private_colormap: enabled=0
clock: sync=none, time0=local, rtc_sync=0
# no cmosimage
# no loader
log: -
logprefix: %t%e%d
debug: action=ignore
info: action=report
error: action=report
panic: action=ask
keyboard: type=mf, serial_delay=250, paste_delay=100000, user_shortcut=none
mouse: type=ps2, enabled=0, toggle=ctrl+mbutton
speaker: enabled=1, mode=system
parport1: enabled=1, file=none
parport2: enabled=0
com1: enabled=1, mode=null
com2: enabled=0
com3: enabled=0
com4: enabled=0
ne2k: enabled=1, mac=fe:fd:de:ad:be:ef, ethmod=linux, ethdev=enp36s0.200, script=/bin/true, bootrom="/tmp/10ec8029.rom"

Create your hard drive image using qemu-img, then run bochs -f yourfile.cfg and it should, hopefully, work.
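For what it’s worth, the disk image and launch commands look something like this; the image size is an assumption, and -q just skips Bochs’ interactive start menu:

qemu-img create -f raw /tmp/wstest.raw 20G
bochs -f wstest.cfg -q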

Aug 20 2017
 

OpenNebula is running now… I ended up re-loading my VM with Ubuntu Linux and throwing OpenNebula on that.  That works… and I can debug the issue with Gentoo later.

I still have to figure out corosync/heartbeat for two VMs, the one running OpenNebula, and the core router.  For now, the VMs are only set up to run on one node, but I can configure them on the other too… it’s then a matter of configuring libvirt to not start the instances at boot, and setting up the Linux-HA tools to figure out which node gets to fire up which VM.
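The libvirt side of that is straightforward; a sketch, with “core-router” standing in for whatever the domain is actually called:

# define the guest on both nodes, but leave starting it to the HA tooling
virsh define core-router.xml
virsh autostart --disable core-router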

The VM hosts are still running Gentoo however, and so far I’ve managed to get them to behave with OpenNebula.  A big part was disabling the authentication in libvirt, otherwise polkit generally made a mess of things from OpenNebula’s point of view.
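“Disabling the authentication” amounts to settings along these lines in /etc/libvirt/libvirtd.conf (a sketch; the group and permissions are assumptions, and you’ll want the socket permissions locked down to compensate):

auth_unix_ro = "none"
auth_unix_rw = "none"
unix_sock_group = "libvirt"
unix_sock_rw_perms = "0770"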

That, and firewalld had to be told to open up ports for VNC/spice… I allocated 5900-6900… I doubt I’ll have that many VMs.
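With firewalld that’s a one-liner (plus a reload), assuming the default public zone:

firewall-cmd --permanent --zone=public --add-port=5900-6900/tcp
firewall-cmd --reload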

Last weekend I replaced the border router… previously this was a function of my aging web server, but now I have an ex-RAAF-base Advantech UNO-1150G industrial PC which is performing the routing function.  I tried to set it up with Gentoo, and while it worked, I found it wasn’t particularly stable due to limited memory (it only has 256MB RAM).  In the end, I managed to get OpenBSD 6.1/i386 running sweetly, so for now, it’s staying that way.

While the AMD Geode LX800 is no speed demon, a nice feature of this machine is it’s happy with any voltage between 9 and 32V.

The border router was also given the responsibility of managing the domain: I did this by installing ISC BIND9 from ports and copying across the config from Linux.  This seemed to be working, and so I left it.  Big mistake, turns out bind9 didn’t think it was authoritative, and so refused to handle AXFRs with my slaves.
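For the record, the sort of master zone declaration the slaves need to see is below; the zone name, file path and slave addresses are purely illustrative, not my actual config:

// named.conf excerpt
zone "example.com" {
        type master;
        file "/var/named/master/example.com.db";
        allow-transfer { 192.0.2.1; 203.0.113.53; };
        also-notify { 192.0.2.1; 203.0.113.53; };
};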

I was using two different slave DNS providers, puck.nether.net and Roller Network, both at the time of subscription being freebies.  Turns out, when your DNS goes offline, puck.nether.net responds by disabling your domain then emailing you about it.  I received that email Friday morning… and so I wound up in a mad rush trying to figure out why BIND9 didn’t consider itself authoritative.

Since I was in a rush, I decided to tell the border router to just port-forward to the old server, which got things going until I could look into it properly.  It took a bit of tinkering with pf.conf, but eventually got that going, and the crisis was averted.  Re-enabling the domains on puck.nether.net worked, and they stayed enabled.
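The pf.conf rule was something like the following sketch; the interface name and the old server’s address are assumptions:

ext_if = "vr0"
old_server = "192.168.0.10"
pass in on $ext_if inet proto { tcp, udp } from any to ($ext_if) port domain rdr-to $old_server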

It was at that time I discovered that Roller Network had decided to make their slave DNS a paid offering.  Fair enough, these things do cost money… At first I thought, well, I’ll just pay for an account with them, until I realised their personal plans were US$5/month.  My workplace uses Vultr for hosting instances of their WideSky platform for customers… and aside from the odd hiccup, they’ve been fine.  US$5/month VPS which can run almost anything trumps US$5/month that only does secondary DNS, so out came the debit card for a new instance in their Sydney data centre.

Later I might use it to act as a caching front-end and as a secondary mail exchanger… but for now, it’s a DIY secondary DNS.  I used their ISO library to install an OpenBSD 6.1 server, and managed to nut out nsd to act as a secondary name server.
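For anyone wanting to do the same, the nsd side is only a few lines; a sketch, with the zone name and the border router’s address made up for illustration:

# /var/nsd/etc/nsd.conf
zone:
        name: "example.com"
        allow-notify: 192.0.2.1 NOKEY
        request-xfr: 192.0.2.1 NOKEY

Then rcctl enable nsd && rcctl start nsd and it’s away.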

Getting that going this morning, I was able to figure out my DNS woes on the border router and got that running, so after removing the port forward entries, I was able to trigger my secondary DNS at Vultr to re-transfer the domain and debug it until I got it working.

With most of the physical stuff worked out, it was time to turn my attention to getting virtual instances working.  Up until now, everything running on the VM was through hand-crafted VMs using libvirt directly.  This is painful and tedious… but for whatever reason, OpenNebula was not successfully deploying VMs.  It’d get part way, then barf trying to set up 802.1Q network interfaces.

In the end, I knew OpenNebula worked fine with bridges that were already defined… but I didn’t want to have to hand-configure each VLAN… so I turned to another automation tool in my toolkit… Ansible:

- hosts: compute
  tasks:
  - name: Configure networking
    template: src=compute-net.j2 dest=/etc/conf.d/net
# …
- hosts: compute
  tasks:
# …
  - name: Add symbolic links (instance VLAN interfaces)
    file: src=net.lo dest=/etc/init.d/net.bond0.{{item}} state=link
    with_sequence: start=128 end=193
  - name: Add symbolic links (instance VLAN bridges)
    file: src=net.lo dest=/etc/init.d/net.vlan{{item}} state=link
    with_sequence: start=128 end=193
# …
  - name: Make services start at boot (instance VLAN bridges)
    command: rc-update add net.vlan{{item}} default
    with_sequence: start=128 end=193 

That’s a snippet of the playbook… and it basically creates symbolic links from Gentoo’s net.lo for all the VLAN ports and bridges, then sets them up to start at boot.

In the compute-net.j2 file referenced above, I put in the following to enumerate all the configuration bits.

# Instance VLANs
{% for vlan in range(128,193) %}
config_vlan{{vlan}}="null"
config_bond0_{{vlan}}="null"
rc_net_vlan{{vlan}}_need="net.bond0.{{vlan}}"
{% endfor %}
# …
vlans_bond0="5 8 10{% for vlan in range(128,193) %} {{vlan}} {% endfor %}248 249 250 251 252"
vlans_bond1="253"
# …
# Instance VLANs
{% for vlan in range(128,193) %}
bridge_vlan{{vlan}}="bond0.{{vlan}}"
{% endfor %} 

The start and end ranges are a little off, but it saved a lot of work.

This naturally took a while for OpenRC to bring up… but it worked. Going back to OpenNebula, I told it what bridges to use, and before long I had my first instance… an OpenBSD router to link my personal VLAN to the DMZ.
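For reference, pointing OpenNebula at a pre-existing bridge is just a virtual network template along these lines (names and sizes are illustrative, and this is a sketch rather than my actual template), fed to onevnet create:

NAME   = "dmz"
VN_MAD = "bridge"
BRIDGE = "vlan128"
AR     = [ TYPE = "ETHER", SIZE = "250" ]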

I spent a bit of time re-working my routing tables after that… in fact, my network is getting big enough now I have to write some details down.  I spent a few hours documenting the effort:

That’s page 1 of about 15… yes my hand is sore… but at least now should I get run over by a bus, others have a fighting chance doing anything with the network without my technical input.

Mar 25 2016
 

Well, figured I’d document this project here in case anyone was interested in doing this for personal amusement or for their workplace.

The list I’ve just chucked up is not a complete list, nor is it a prescribed list of exactly what’s needed, but rather is what I’ve either acquired, or will acquire.

The basic architecture is as follows:

  • The cluster is built up on discrete nodes which are based around a very similar hardware stack and are tweaked for their function.
  • Persistent data storage is handled by the storage nodes using the Ceph object storage system. This requires that a majority quorum is maintained, and so a minimum of 3 storage nodes are required.
  • Virtual machines run on the compute nodes.
  • Management nodes oversee co-ordination of the compute nodes: this ideally should be a separate pair of machines, but for my use case, I intend to use a virtual machine or container managed using active/passive failover techniques.
  • In order to reduce virtual disk latency, the compute nodes will implement a local disk cache using an SSD, backed by a Rados Block Device on Ceph.

I’ll be using KVM as the virtualisation technology with Gentoo Linux as the base OS for this experimental cluster. At my workplace, we evaluated a few different technologies including Proxmox VE, Ganeti, OpenStack and OpenNebula. For this project, I intend to build on OpenNebula as it’s the simplest to understand and the most suited to my workplace’s requirements.

Using Gentoo makes it very easy to splice in patches as I’ll be developing as I go along. If I come to implement this in the office, I’ll be porting everything across to Ubuntu. This will be building on some experimental work I’ve done in the past with OpenNebula.

For the base nodes themselves, I’ve based them around these components:

For the storage nodes, add to the list:

Other things you may want/need:

  • A managed switch, I ended up choosing the Linksys LGS-326AU which U-Mart were selling at AU$294. If you’ve ever used Cisco’s small business offerings, this unit will make you feel right at home.
  • DIN rail. Jaycar sell this in 1m lengths, and I’ll grab some tomorrow.

Most of the above bits I have; the nodes are all basically built as of this afternoon, minus the SATA adaptors for the three storage nodes.  All units power on, and do what one would expect of a machine that’s trying to boot from a blank SSD.

I did put one of the compute nodes through its paces, network booting the machine via PXE/NFS root and installing Gentoo.

Power consumption was below 1.8A for a battery voltage of about 13.4V, even when building the Linux 4.4.6 kernel (using make -j8), which it did in about 10 minutes. Watching this thing tackle compile jobs is a thing of beauty, can’t wait to get distcc going and have 40 CPU cores tear into the bootstrap process. The initial boot also looks beautiful, with 8 penguins lined up representing the 8 cores — don’t turn up here in a tuxedo!

So hardware wise, things are more or less together, and it’ll mostly be software. I’ll throw up some notes on how it’s all wired, but basically the plan in the short term is a 240V mains charger (surplus from a caravan) will keep the battery floated until I get the solar panel and controller set up.

When that happens, I plan to wire a relay in series with the 240V charger controlled by a comparator to connect mains when the battery voltage drops below 12V.

The switch is a 240V device unfortunately (couldn’t find any 24-port 12V managed switches) so it’ll run from an inverter. Port space is tight, and I just got the one since they’re kinda pricey. Long term, I might look at a second for redundancy, although if a switch goes, I won’t lose existing data.

ADSL2+ will be managed by a small localised battery back-up and a small computer as router, possibly a Raspberry Pi as I have one spare (original B model), which can temporarily store incoming SMTP traffic if the cluster does go down (heaven forbid!) and act as a management endpoint. There are a few contenders here, including these industrial computers, for which I already maintain a modern Linux kernel port for my workplace.

Things are coming together, and I hope to bring more on this project as it moves ahead.

Oct 31 2015
 

Well, it seems the updates to Microsoft’s latest aren’t going as its maker planned. A few people have asked me about my personal opinion of this OS, and I’ll admit, I have no direct experience with it.  I also haven’t had much contact with Windows 8 either.

That said, I do keep up with the news, and a few things do concern me.

The good news

It’s not all bad of course.  Windows 8 saw a big shrink in the footprint of a typical Windows install, and Windows 10 continues to be fairly lightweight.  The UI disaster from Windows 8 has been somewhat pared back to provide a more traditional desktop with a start menu that combines features from the start screen.

There are some limitations with the new start menu, but from what I understand, it behaves mostly like the one from Windows 7.  The tiled section still has some rough edges though, something that is likely to be addressed in future updates of Windows 10.

If this is all that had changed though, I’d be happily accepting it.  Sadly, this is not the case.

Rolling-release updates

Windows has, since day one, been on a long-term support release model.  That is, they bring out a release, then they support it for X years.  Windows XP was released in 2001 and was supported until last year, for example.  Windows Vista is still on extended support, and Windows 7 recently entered extended support.

Now, in the Linux world, we’ve had both long-term support releases and rolling release distributions for years.  Most of the current Linux users know about it, and the distribution makers have had many years to get it right.  Ubuntu have been doing this since 2004, Debian since 1998 and Red Hat since 1994.  Rolling releases can be a bumpy ride if not managed correctly, which is why the long-term support releases exist.  The community has recognised the need, and meets it accordingly.

Ubuntu are even predictable with their releases.  They release on a schedule.  Anything not ready for release is pushed back to the next release.  They do a release every 6 months, in April and October, and every 2 years the April release is a long-term support release.  That is: 8.04, 10.04, 12.04, 14.04 are all LTS releases.  The LTS releases get supported for about 5 years, the regular releases for about 9 months.

Debian releases are basically LTS, unless you run Debian Testing or Debian Unstable.  Then you’re running rolling-release.

Some distributions like Gentoo are always rolling-release.  I’ve been running Gentoo for more than 10 years now, and I find the rolling releases rarely give me problems.  We’ve had our hiccups, but these days, things are smooth.  Updating an older Gentoo box to the latest release used to be a fight, but these days, is comparatively painless.

It took most of that 10 years to get to that point, and this is where I worry about Microsoft forcing the vast majority of Windows users onto a rolling-release model, as they will be doing this for the first time.  As I understand it, there will be four branches:

  1. Windows Insiders programme is like Debian Unstable.  The very latest features are pushed out to them first.  They are effectively running a beta version of Windows, and can expect many updates, many breakages, lots of things changing.  For some users, this will be fine, others it’ll be a headache.  There’s no option to skip updates, but you probably will have the option of resigning from the Windows Insiders programme.
  2. Home users basically get something like Debian Testing.  After updates have been thrashed out by the insiders, it gets force-fed to the general public.  The Home version of Windows 10 will not have an option to defer an update.
  3. Professional users get something more like the standard releases of Debian.  They’ll have the option of deferring an update for up to 30 days, so things can change less frequently.  It’s still rolling-release, but they can at least plan their updates to take place once a month, hopefully without disrupting too much.
  4. Enterprise users get something like the old-stable release of Debian.  Security updates, and they have the option to defer updates for a year.

Enterprise isn’t available unless you’re a large company buying lots of licenses.  If people must buy a Windows 10 machine, my recommendation would be to go for the Professional version; then you have some right of veto, as not all the updates are purely security-related, some will be changing the UI and adding/removing features.

I can see this being a major headache for anyone who has to support hardware or software on Windows 10, however, since it’s essentially the build number that becomes important: different release builds will behave differently.  Possibly different enough that things need much more testing and maintenance than what vendors are used to.

Some are very poor at supporting Linux right now due to the rolling-release model of things like the Linux kernel, so I can see Windows 10 being a nightmare for some.

Privacy concerns

One of the big issues to be raised with Windows 10 is the inclusion of telemetry to “improve the user experience” and other features that are seen as an invasion of privacy.  Many things can be turned off, but it will take someone who’s familiar with the OS or good at researching the problem to turn them off.

Probably the biggest concern from my perspective as a network administrator is the WiFi Sense feature.  This is a feature in Windows 10 (and Windows 8 Phone), turned on by default, that allows you to share WiFi passwords with other contacts.

If one of that person’s contacts then comes into range of your AP, their device contacts Microsoft’s servers which have the password on file, and can provide it to that person’s device (hopefully in a secured manner).  The password is never shown to the user themselves, but I believe it’s only a matter of time before someone figures out how to retrieve that password from WiFi Sense.  (A rogue AP would probably do the trick.)

We have discussed this at work where we have two WiFi networks: one WPA2 enterprise one for staff, and a WPA2 Personal one for guests.  Since we cannot control whether the users have this feature turned on or not, or whether they might accidentally “share” the password with world + dog, we’re considering two options:

  1. Banning the use of Windows 10 devices (and Windows 8 Phone) from being used on our guest WiFi network.
  2. Implementing a cron job to regularly change the guest WiFi password.  (The Cisco AP we have can be hit with SSH; automating this shouldn’t be difficult.)

There are some nasty points in the end user license agreement too that seem to give Microsoft free rein to make copies of any of the data on the system.  They say personal information will be removed, but even with the best of intentions, it is likely that some personal information will get caught in the net cast by the telemetry software.

Forced “upgrades” to Windows 10

This is the bit about Windows 10 that really bugs me.  Okay, Microsoft is pushing a deal where they’ll provide it to you for free for a year.  Free upgrades, yaay!  But wait: how do you know if your hardware and software is compatible?  Maybe you’re not ready to jump on the bandwagon just yet, or maybe you’ve heard news about the privacy issues or rolling release updates and decided to hold back.

Many users of Windows 7, 8 and 8.1 are now being force-fed the new release, whether we asked for it or not.

Now the problem with this is it completely ignores the fact that some do not run with an always-on Internet connection with a large quota.  I know people who only have a 3G connection, with a very small (1GB) quota.  Windows 10 weighs in at nearly 3GB, so for them, they’ll be paying for 2GB worth of overuse charges just for the OS, never mind what web browsing, emailing and other things they might have actually bought their Internet connection for.

Microsoft employees have been outed for showing such contempt before.  It seems so many there are used to the idea of an Internet connection that is always there and has a big enough quota to be considered “unlimited” that they have forgotten that some parts of the world do not have such luxuries.  The computer and the Internet are just tools: we do not buy an Internet connection just for the sake of having one.

Stopping updates

There are a couple of tools that exist for managing this.  I have not tested any of them, and cannot vouch for their safety or reliability.

  • BlockWindows (on GitHub) is a set of scripts that, when executed, uninstall and disable most of the Windows 10-related updates on Windows 7 and 8/8.1.
  • GWX Control Panel is a (proprietary?) tool for controlling the GWX process.

My recommendation is to keep good backups.  Find a tool that will do a raw partition back-up of your Windows partition, and keep your personal files on a separate partition.  Then, if Microsoft does come a-knocking, you can easily roll back.  Hopefully after the “free upgrade” offer has expired (about this time next year), they will cease and desist from this practise.
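If you need a starting point for the raw partition back-up, something as simple as the sketch below will do in a pinch; the device name and destination are assumptions, and the target needs enough free space to hold the whole partition:

dd if=/dev/sda2 of=/mnt/backup/windows-system.img bs=4M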