Oct 22 2018

Today, it seems, the IT gremlins have been out to get me.  At my work I have a desktop computer (personal hardware) consisting of a Ryzen 7 1700, 16GB RAM, a 240GB Intel M.2 SATA SSD (540 series) and a 4TB Western Digital HDD.

The machine has been pretty reliable, if not rock-solid.  In particular, compiling gcc sometimes segfaulted for reasons unknown (the RAM checks out okay according to memtest86), but for what I was doing, it mostly ran fine.  I put up with the minor niggles with a view to solving them another day.  Today though, I come in and find X has crashed.

Okay, no big deal, re-start the display manager, except that crashed too.

Hmm, okay, log in under my regular user account and try startx:  No dice, there’s no room on /.

Ahh, that might explain a few things, we clean up some log files, truncate a 500MB file, manage to free up 50GB (!).

The machine dual-boots two OSes: Debian 9 and Gentoo.  It’s been running the latter for about 12 months now, I used Debian 9 to get things rolling so I could use the machine at work (did try Ubuntu 16.04, but it didn’t like my machine), and later, used that to get Gentoo running before switching over.  So there was a 40GB partition on the SSD that had a year-old install of Debian that wasn’t being used.  I figured I’d ditch it, and re-locate my Gentoo partition to occupy that space.

So I pull out an Ubuntu 18.04 disc, boot that up, and get gparted going.  It’s happily copying, until WHAM, I was hit with an I/O error:

Failed re-location of partition

Clicking any of the three buttons resulted in the same message.  Brilliant.  I had just copied over the first 15GB of the partition, so the Debian install would be hosed (I was deleting it anyway), but my Gentoo root partition should still be there intact at its old location.  Of course the partition table was updated, so no rolling back there.  At this point, I couldn’t do anything with the SSD, it had completely stalled, and I just had to cut my losses and kill gparted.

I managed to make some room on the 4TB drive shuffling some partitions around so I could install Ubuntu 18.04 there.  My /home partition was btrfs on the 4TB drive (first partition), the rest of that drive was LVM.  I just shrank my /home down by 40GB and slipped it in there.  The boot-loader didn’t install (no EFI partition), but who cares, I know enough of grub to boot from the DVD and bootstrap the machine that way.  At first it wouldn’t boot because in their wisdom, they created the root partition with a @ subvolume.  I worked around that by making the @ subvolume the default.

Then there was momentary panic when the /home partition I had specified lacked my usual files.  Turned out, they had created a @home subvolume on my existing /home partition.  Why? Who knows?  Debian/Ubuntu seem to do strange things with btrfs which do nothing but complicate matters and I do not understand the reasoning.  Editing /etc/fstab to remove the subvolume argument for /home and re-booting fixed that.

I set up an LVM volume that would receive a dd dump of the mangled partition to see what could be saved.  GNU ddrescue managed to recover most of the raw partition, so now I just had to find where the start was.  If I had kept the output of fdisk -l from before I started, I’d be right, but I didn’t have that foresight.  (Maybe if I had just formatted an LVM volume and dd’d the root fs before involving gparted?  Never mind!)

I figured there’d be some kind of magic bytes I could “grep” for.  Something that would tell me “BTRFS was here”.  Sure enough, the information is stashed in the superblock.  At 0x00010040 from the start of the partition, I should see the magic bytes 5f 42 48 52 66 53 5f 4d (ASCII “_BHRfS_M”).  I just needed to grep for these.  To speed things up I made an educated guess at the start location.  The screenshot says the old partition was about 37.25GB in size, so that was a hint to try skipping that much and see what turned up.

Sure enough, I found what looked to be the superblock:

root@vk4msl-ws:~# dd if=/dev/mapper/scratch-rootbackup skip=38100 count=200 bs=1M | hexdump -C | grep '5f 42 48 52 66 53 5f 4d'
02e10040  5f 42 48 52 66 53 5f 4d  9d 30 0d 02 00 00 00 00  |_BHRfS_M.0......|
06e00040  5f 42 48 52 66 53 5f 4d  9d 30 0d 02 00 00 00 00  |_BHRfS_M.0......|
200+0 records in
200+0 records out
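The same search can be done without going through hexdump. This is only a minimal sketch, assuming GNU grep (for the -o, -b and -a options); find_btrfs_magic is a made-up helper name, not part of any btrfs tool. It prints the absolute byte offset of each hit and drops hits that could not be a superblock of a sector-aligned partition:

```shell
# find_btrfs_magic FILE: list byte offsets of plausible btrfs magic strings.
# Hypothetical helper; FILE may be an image file or a block device.
find_btrfs_magic() {
    # -o: print only the match, -b: prefix it with its byte offset,
    # -a: treat binary data as text, -F: fixed string, not a regex.
    LC_ALL=C grep -obaF '_BHRfS_M' "$1" | while IFS=: read -r offset rest; do
        # The magic sits 0x40 bytes into the superblock, and partitions
        # start on 512-byte sector boundaries, so reject unaligned hits.
        if [ $(( (offset - 0x40) % 512 )) -eq 0 ]; then
            printf '%d\n' "$offset"
        fi
    done
}
```

Something like find_btrfs_magic /dev/mapper/scratch-rootbackup would scan the whole volume. One caveat: grep reads "lines", so a very long run of data with no newline byte in it may cost a lot of memory.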

Some other probes seemed to confirm this: my quarry started 38146MB into the now-merged partition.  I start copying that to a new LVM volume in the hope of being able to mount it:

root@vk4msl-ws:~# dd if=/dev/mapper/scratch-rootbackup of=/dev/mapper/scratch-gentoo--root bs=1M skip=38146
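For the record, the 38146 figure falls out of the two hexdump hits with a little arithmetic. The sketch below assumes the usual btrfs layout, with the primary superblock 64 KiB into the filesystem, the first mirror 64 MiB in, and the magic 0x40 bytes into each; fs_start_bytes is a hypothetical helper name:

```shell
# fs_start_bytes SKIP_MIB HIT SB_OFFSET
#   SKIP_MIB:  the skip= value given to dd (with bs=1M)
#   HIT:       offset hexdump printed for the magic bytes
#   SB_OFFSET: where that superblock copy lives inside the filesystem
fs_start_bytes() {
    echo $(( $1 * 1048576 + $2 - $3 - 0x40 ))
}

primary=$(fs_start_bytes 38100 0x02e10040 0x10000)    # primary, 64 KiB in
mirror=$(fs_start_bytes 38100 0x06e00040 0x4000000)   # first mirror, 64 MiB in

echo "primary hit: $(( primary / 1048576 )) MiB (+$(( primary % 1048576 )) bytes)"
echo "mirror hit:  $(( mirror / 1048576 )) MiB (+$(( mirror % 1048576 )) bytes)"
```

Both lines come out at 38146 MiB with no remainder, so the second hit is the mirror of the same filesystem rather than a stray match, and skip=38146 with bs=1M lands exactly on the start.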

Whilst waiting for this to complete, I double-checked my findings by inspecting the other fields. From the screenshot, I know my filesystem UUID was 6513682e-7182-4474-89e6-c0d1c71966ad. Looking at the superblock, sure enough I see that listed:

root@vk4msl-ws:~# dd if=/dev/scratch/gentoo-root bs=$(( 0x10000 )) skip=1 count=1 | hexdump -C
1+0 records in
1+0 records out
00000000  5f f9 98 90 00 00 00 00  00 00 00 00 00 00 00 00  |_...............|
65536 bytes (66 kB, 64 KiB) copied, 0.000116268 s, 564 MB/s
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000020  65 13 68 2e 71 82 44 74  89 e6 c0 d1 c7 19 66 ad  |e.h.q.Dt......f.|
00000030  00 00 01 00 00 00 00 00  01 00 00 00 00 00 00 00  |................|
00000040  5f 42 48 52 66 53 5f 4d  9d 30 0d 02 00 00 00 00  |_BHRfS_M.0......|
00000050  00 00 32 da 32 00 00 00  00 00 02 00 00 00 00 00  |..2.2...........|
00000060  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
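Rather than picking the UUID out of the hexdump by eye, the 16 bytes at 0x20 into the superblock can be pulled out and formatted directly. A rough sketch, with btrfs_fsid being a name invented here and the offsets taken from the dump above:

```shell
# btrfs_fsid FILE: print the filesystem UUID stored in the primary superblock.
# Hypothetical helper; FILE must start at the beginning of the filesystem.
btrfs_fsid() {
    # Superblock is 0x10000 into the filesystem, the UUID 0x20 into that.
    hex=$(dd if="$1" bs=1 skip=$(( 0x10020 )) count=16 2>/dev/null \
            | od -An -tx1 | tr -d ' \n')
    # Insert the dashes in the usual 8-4-4-4-12 grouping.
    printf '%s\n' "$hex" \
        | sed -E 's/^(.{8})(.{4})(.{4})(.{4})(.{12})$/\1-\2-\3-\4-\5/'
}
```

Run against /dev/scratch/gentoo-root, this should print the same UUID that btrfsck reports.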

Looks promising! After an agonising wait, the dd finishes. I can check the filesystem:

root@vk4msl-ws:~# btrfsck /dev/scratch/gentoo-root 
Checking filesystem on /dev/scratch/gentoo-root
UUID: 6513682e-7182-4474-89e6-c0d1c71966ad
checking extents
checking free space cache
block group 111690121216 has wrong amount of free space
failed to load free space cache for block group 111690121216
block group 161082245120 has wrong amount of free space
failed to load free space cache for block group 161082245120
checking fs roots
checking csums
checking root refs
found 107544387643 bytes used, no error found
total csum bytes: 99132872
total tree bytes: 6008504320
total fs tree bytes: 5592694784
total extent tree bytes: 271663104
btree space waste bytes: 1142962475
file data blocks allocated: 195274670080
 referenced 162067775488

Okay, it complained that the free space was wrong (which I’ll blame on gparted prematurely growing the partition), but the data is there!  This is confirmed by mounting the volume and doing a ls:

root@vk4msl-ws:~# mount /dev/scratch/gentoo-root /mnt/
root@vk4msl-ws:~# ls /mnt/ -l
total 4
drwxr-xr-x 1 root root 1020 Oct  7 14:13 bin
drwxr-xr-x 1 root root   18 Jul 21  2017 boot
drwxr-xr-x 1 root root   16 May 28 10:29 dbus-1
drwxr-xr-x 1 root root 1686 May 31  2017 dev
drwxr-xr-x 1 root root 3620 Oct 19 18:53 etc
drwxr-xr-x 1 root root    0 Jul 14  2017 home
lrwxrwxrwx 1 root root    5 Sep 17 09:20 lib -> lib64
drwxr-xr-x 1 root root 1156 Oct  7 13:59 lib32
drwxr-xr-x 1 root root 4926 Oct 13 05:13 lib64
drwxr-xr-x 1 root root   70 Oct 19 11:52 media
drwxr-xr-x 1 root root   28 Apr 23 13:18 mnt
drwxr-xr-x 1 root root  336 Oct  9 07:27 opt
drwxr-xr-x 1 root root    0 May 31  2017 proc
drwx------ 1 root root  390 Oct 22 06:07 root
drwxr-xr-x 1 root root   10 Jul  6  2017 run
drwxr-xr-x 1 root root 4170 Oct  9 07:57 sbin
drwxr-xr-x 1 root root   10 May 31  2017 sys
drwxrwxrwt 1 root root 6140 Oct 22 06:07 tmp
drwxr-xr-x 1 root root  304 Oct 19 18:20 usr
drwxr-xr-x 1 root root  142 May 17 12:36 var
root@vk4msl-ws:~# cat /mnt/etc/gentoo-release 
Gentoo Base System release 2.4.1

Yes, I’ll be backing this up properly RIGHT NOW. But, my data is back, and I’ll be storing this little data recovery technique for next time.

The real lessons here are:

  1. KEEP RELIABLE BACKUPS! You never know when something will fail.
  2. Catch the copy process before it starts overwriting your source data! If there’s no overlap between the old and new locations, you’re fine, but if there is and it starts overwriting the start of your original volume, it’s all over red rover! You might be lucky with a superblock back-up, but don’t bet on it!
  3. Make note of the filesystem type and its approximate location. The fact that I knew roughly where to look, and what sort of filesystem I was looking for meant I could look for magic bytes that say “I’m a BTRFS filesystem”. The magic bytes for EXT4, XFS, etc will differ, but the same concepts are there, you just have to look up the documentation on how your particular filesystem structures its data.
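As a starting point for that last lesson, here is a small sketch that checks a candidate start offset against a few well-known magic values. The helper names read_at and guess_fs are made up for illustration, and the offsets are the commonly documented ones (btrfs “_BHRfS_M” at 0x10040, ext2/3/4 0xEF53 at 0x438, XFS “XFSB” at 0); check them against your filesystem's documentation before trusting the result:

```shell
# read_at FILE OFFSET COUNT: dump COUNT bytes at OFFSET as lowercase hex.
read_at() {
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null | od -An -tx1 | tr -d ' \n'
}

# guess_fs FILE START: guess the filesystem beginning at byte START.
guess_fs() {
    if [ "$(read_at "$1" $(( $2 + 0x10040 )) 8)" = "5f42485266535f4d" ]; then
        echo btrfs      # "_BHRfS_M"
    elif [ "$(read_at "$1" $(( $2 + 0x438 )) 2)" = "53ef" ]; then
        echo ext        # 0xEF53 stored little-endian; shared by ext2/3/4
    elif [ "$(read_at "$1" "$2" 4)" = "58465342" ]; then
        echo xfs        # "XFSB"
    else
        echo unknown
    fi
}
```

For my case, guess_fs /dev/mapper/scratch-rootbackup 39998980096 (38146MiB in bytes) would be the check to run before committing to a long dd.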
Aug 26 2018

So, I had a brief look after getting kernel 4.18.5 booting… sure enough the problem was I had forgotten the watchdog, although I did see btrfs trigger a deadlock warning, so I may not be out of the woods yet.  I’ve posted the relevant kernel output to the linux-btrfs list.

Anyway, as it happens, that watchdog driver looks like it’ll need some re-factoring as a multi-function device.  At the moment, ts-wdt.c claims it based on this binding.

If I try to add a second driver, they’ll clash, and I expect the same if I try to access it via userspace.  So the sensible thing to do here, is to add a ts-companion.c MFD driver here, then re-factor ts-wdt.c to use it.  From there, I can write a ts-psu.c module which will go right here.

I think I’ll definitely be digging into those older sources to remind myself how that all worked.