So, I’ve got a big music collection…
RC=0 stuartl@rikishi ~ $ find /mnt/music-archive/.by-uuid/ -type f -name \*.flac | wc -l
7624
I keep a few copies of it. Between three of my machines and two USB drives (one HDD, one SSD), I keep a copy of the lossless archive. This is a recent addition since (1) I’ve got the space to do it, and (2) some experimentation with Ogg/Vorbis metadata corrupting files necessitated me re-ripping everything, so I thought I’d save future-me the hassle by keeping a lossless copy on hand.
Actually, this is not the first time I’ve done a re-rip of the whole collection. The previous re-rip was done back in 2005 when I moved from MP3 to Ogg/Vorbis (and ditched a lot of illegally obtained MP3s while I was at it — leaving me with just the recordings that I had licenses for). But, back then, storing a lossless copy of every file as I re-ripped everything would have been prohibitively expensive in terms of required storage. When even my near-10-year-old laptop sports a 2TB SSD, this isn’t a problem.
The working copy that I generally do my listening from uses the Ogg/Vorbis format today. I haven’t quite re-ripped everything, there’s a stack of records that are waiting for me to put them back on the turntable … one day I’ll get to those … but every CD, DVD and digital download (which were FLAC to begin with) is losslessly stored in FLAC.
If I make a change to the files, I really want to synchronise my changes between the two copies. Notably, if I change the file data, I need to re-encode the FLAC file to Ogg/Vorbis — but if I simply change its metadata (i.e. cover art or tags), I merely need to re-write the metadata on the destination file and can save some processing cycles.
The thinking is, if I can “fingerprint” the various parts of the file, I can determine what bits changed and what to convert. Obviously when I transcode the audio data itself, the audio data bytes will bear little resemblance to the ones that were fed into the transcoder — that’s fine — I have other metadata which can link the two files. The aim of this exercise is to store the hashes for the audio data and tags, and detect when one of those things changes on the source side, so the change can be copied across to the destination.
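The change-detection rule just described can be sketched as a small decision function. This is a minimal sketch with hypothetical names (`Fingerprint`, `action_needed`); how the hashes themselves get computed comes later in the post.

```python
# Sketch of the change-detection logic: compare stored vs. current
# fingerprints of a source file and decide what work the destination needs.
from dataclasses import dataclass

@dataclass
class Fingerprint:
    audio_hash: str  # hash of the audio stream only
    tag_hash: str    # hash of the tags + cover art

def action_needed(stored: Fingerprint, current: Fingerprint) -> str:
    """Decide what to do with the destination copy of a track."""
    if current.audio_hash != stored.audio_hash:
        return "transcode"   # audio changed: re-encode FLAC -> Ogg/Vorbis
    if current.tag_hash != stored.tag_hash:
        return "retag"       # only metadata changed: rewrite tags in place
    return "skip"            # nothing changed
```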
Existing option: MD5 hash
FLAC actually does store a hash of its source input as part of the stream metadata. It uses the MD5 hashing algorithm which, while good enough for a rough check and certainly better than linear codes like CRC, is really quite dated as a cryptographic hash.
I’d prefer to use SHA-256 for this since it’s generally regarded as being a “secure” hash algorithm that is less vulnerable to collisions than MD5 or SHA-1.
Naïve approach: decode and compare
The naïve approach would be to just decode to raw audio data and compare the raw audio files. I could do this via a pipe to avoid writing the files out to disk just to delete them moments later. The following will output a raw file:
RC=0 stuartl@rikishi ~ $ time flac -d -f -o /tmp/test.raw /mnt/music-archive/by-album-artist/Traveling\ Wilburys/The\ Traveling\ Wilburys\ Collection/d1o001t001\ Traveling\ Wilburys\ -\ Handle\ With\ Care.flac

flac 1.3.4
Copyright (C) 2000-2009  Josh Coalson, 2011-2016  Xiph.Org Foundation
flac comes with ABSOLUTELY NO WARRANTY.  This is free software, and you are
welcome to redistribute it under certain conditions.  Type `flac' for details.

d1o001t001 Traveling Wilburys - Handle With Care.flac: done

real	0m0.457s
user	0m0.300s
sys	0m0.065s
On my laptop, it takes about 200~500ms to decode a single file to raw audio. Multiply that by 7624 and you get something that will take nearly an hour to complete. I think we can do better!
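The pipe idea might look something like the following sketch, assuming the `flac` CLI is on `$PATH` (`hash_stream` and `hash_decoded_audio` are hypothetical helper names; the raw-output flags mirror flac’s documented options):

```python
# Decode-and-hash via a pipe, so no temporary raw file is written to disk.
import hashlib
import subprocess

def hash_stream(stream, algorithm="sha256", chunk_size=65536):
    """Hash a file-like object incrementally without reading it all at once."""
    h = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()

def hash_decoded_audio(path):
    """Decode a FLAC file to raw audio on stdout and hash the pipe."""
    proc = subprocess.Popen(
        ["flac", "-d", "-c", "-s", "--force-raw-format",
         "--endian=little", "--sign=signed", path],
        stdout=subprocess.PIPE,
    )
    digest = hash_stream(proc.stdout)
    proc.wait()
    return digest
```

This avoids the disk round-trip, but it still pays the full decode cost per file, which is exactly the hour-long problem above.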
Alternate naïve approach: Copy the file then strip metadata
Making a copy of the file without the metadata is certainly an option. Something like this will do that:
RC=0 stuartl@rikishi ~ $ time ffmpeg -y -i \
	/mnt/music-archive/by-album-artist/Traveling\ Wilburys/The\ Traveling\ Wilburys\ Collection/d1o001t001\ Traveling\ Wilburys\ -\ Handle\ With\ Care.flac \
	-c:a copy -c:v copy -map_metadata -1 \
	/tmp/test.flac

… snip lots of output …

Output #0, flac, to '/tmp/test.flac':
  Metadata:
    encoder         : Lavf58.76.100
  Stream #0:0: Video: mjpeg (Progressive), yuvj420p(pc, bt470bg/unknown/unknown), 1000x1000 [SAR 72:72 DAR 1:1], q=2-31, 90k tbr, 90k tbn, 90k tbc (attached pic)
  Stream #0:1: Audio: flac, 44100 Hz, stereo, s16
    Side data:
      replaygain: track gain - -8.320000, track peak - 0.000023, album gain - -8.320000, album peak - 0.000023,
Stream mapping:
  Stream #0:1 -> #0:0 (copy)
  Stream #0:0 -> #0:1 (copy)
Press [q] to stop, [?] for help
frame=    1 fps=0.0 q=-1.0 Lsize=   24671kB time=00:03:19.50 bitrate=1013.0kbits/s speed=4.77e+03x
video:114kB audio:24549kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.033000%

real	0m0.139s
user	0m0.105s
sys	0m0.032s
This is a big improvement, but just because the audio blocks are the same does not mean the file itself won’t change in other ways — FLAC files can include “padding blocks” anywhere after the STREAMINFO block, which will change the hash value without having any meaningful effect on the file content.
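To see why padding is a problem, here’s a minimal sketch that walks a FLAC file’s metadata blocks per the FLAC format spec (each block has a 4-byte header: a last-block flag and type in the first byte, then a 24-bit big-endian length; type 1 is PADDING). A padding block’s bytes count towards a whole-file hash while carrying no content:

```python
# Walk the metadata blocks of a FLAC stream, showing where PADDING blocks
# (type 1) sit alongside the blocks we actually care about.
BLOCK_TYPES = {0: "STREAMINFO", 1: "PADDING", 2: "APPLICATION",
               3: "SEEKTABLE", 4: "VORBIS_COMMENT", 5: "CUESHEET",
               6: "PICTURE"}

def list_metadata_blocks(stream):
    """Yield (type_name, length) for each metadata block in a FLAC stream."""
    if stream.read(4) != b"fLaC":
        raise ValueError("not a FLAC stream")
    last = False
    while not last:
        header = stream.read(4)
        last = bool(header[0] & 0x80)          # top bit: last-metadata-block
        block_type = header[0] & 0x7F          # low 7 bits: block type
        length = int.from_bytes(header[1:4], "big")
        stream.seek(length, 1)                 # skip the payload itself
        yield BLOCK_TYPES.get(block_type, "UNKNOWN"), length
```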
So this may not be as stable as I’d like. However, ffmpeg is on the right track…
MusicBrainz actually has an audio fingerprinting library whose fingerprints can be matched to a specific recording, and which is reasonably “stable” across different audio compression formats. Great for the intended purpose, but in this case it’s likely going to be computationally expensive since it has to analyse the audio in terms of frequency components, try to extract tempo information, etc. I don’t need this level of detail.
It may also miss that one file might, for example, be preceded by lots of silence — multi-track exports out of Audacity are a prime example. Audacity used to just export the multiple tracks “as-is” so you could re-construct the full recording by concatenating the files, but some bright spark thought it would be a good idea to prepend the exported tracks with silence by default so that, if re-imported, their relative positions were “preserved”. Consequently, I’ve got some record rips that I need to fix because of the extra “silence”!
Getting hashes out of ffmpeg
It turns out that ffmpeg can output any hash you’d like of whatever input data you give it:
RC=0 stuartl@rikishi ~ $ time ffmpeg \
	-loglevel quiet \
	-i /tmp/test.flac \
	-c:a copy -vn -map_metadata -1 -f hash -hash sha256 -
SHA256=31e38749daa1061e6a2008ea61e841e5bc05b8b9ec1f0dfc54d8cd70f18fee3f

real	0m0.248s
user	0m0.234s
sys	0m0.014s

RC=0 stuartl@rikishi ~ $ time ffmpeg \
	-loglevel quiet \
	-i /mnt/music-archive/by-album-artist/Traveling\ Wilburys/The\ Traveling\ Wilburys\ Collection/d1o001t001\ Traveling\ Wilburys\ -\ Handle\ With\ Care.flac \
	-c:a copy -vn -map_metadata -1 -f hash -hash sha256 -
SHA256=31e38749daa1061e6a2008ea61e841e5bc05b8b9ec1f0dfc54d8cd70f18fee3f

real	0m0.242s
user	0m0.226s
sys	0m0.016s
Notice the hashes are the same, yet the first copy of the file we hashed does not contain the tags or cover art present in the file it was generated from. Speed isn’t as good as just stripping the metadata, but on the flip-side, it’s not as expensive as decoding the file to raw format, and should be more stable than naïvely hashing the whole file after metadata stripping.
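Wrapping that ffmpeg invocation in a script might look like the following sketch (the command mirrors the transcript above; `audio_hash` and `parse_hash_output` are hypothetical names, and ffmpeg is assumed to be on `$PATH`):

```python
# Compute a metadata-independent hash of a file's audio stream by letting
# ffmpeg's hash muxer do the work, then parsing its "NAME=digest" output.
import subprocess

def ffmpeg_audio_hash_cmd(path, algorithm="sha256"):
    """Build the ffmpeg command that hashes only the audio stream."""
    return ["ffmpeg", "-loglevel", "quiet", "-i", path,
            "-c:a", "copy", "-vn", "-map_metadata", "-1",
            "-f", "hash", "-hash", algorithm, "-"]

def parse_hash_output(output):
    """Extract the hex digest from ffmpeg's 'NAME=digest' output line."""
    for line in output.splitlines():
        if "=" in line:
            return line.split("=", 1)[1].strip()
    raise ValueError("no hash found in ffmpeg output")

def audio_hash(path, algorithm="sha256"):
    """Run ffmpeg and return the audio-only hash of the given file."""
    out = subprocess.run(ffmpeg_audio_hash_cmd(path, algorithm),
                         capture_output=True, text=True, check=True).stdout
    return parse_hash_output(out)
```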
Where to from here?
Well, having a hash, I can store this elsewhere (I’m thinking SQLite3 or LMDB), then compare it later to know if the audio has changed. It’s not difficult or expensive using mutagen or similar to extract the tags and images; those can be hashed using conventional means to generate a hash of that information. I can also store the mtime and a complete file hash for an even faster “quick check”.
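As a sketch of that idea with SQLite (the schema, table and function names here are all hypothetical; the audio and tag hashes would come from ffmpeg and mutagen as described, and are just strings to this layer):

```python
# Store per-track fingerprints in SQLite and do the fast "quick check"
# (mtime + whole-file hash) before falling back to the expensive hashes.
import sqlite3

def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS tracks (
                    path       TEXT PRIMARY KEY,
                    mtime      REAL,
                    file_hash  TEXT,
                    audio_hash TEXT,
                    tag_hash   TEXT)""")
    return db

def record(db, path, mtime, file_hash, audio_hash, tag_hash):
    """Insert or update the stored fingerprints for one track."""
    db.execute("INSERT OR REPLACE INTO tracks VALUES (?, ?, ?, ?, ?)",
               (path, mtime, file_hash, audio_hash, tag_hash))

def quick_check(db, path, mtime, file_hash):
    """Return 'new', 'unchanged' or 'changed' from mtime + file hash alone."""
    row = db.execute("SELECT mtime, file_hash FROM tracks WHERE path = ?",
                     (path,)).fetchone()
    if row is None:
        return "new"
    return "unchanged" if row == (mtime, file_hash) else "changed"
```

Only files that fail the quick check would need their audio and tag hashes recomputed and compared to decide between a full transcode and a metadata rewrite.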