Blogospheric Refraction

The life and times of Stuart Longland (VK4MSL)

Hashing the audio *data* in a file

So, I’ve got a big music collection…

RC=0 stuartl@rikishi ~ $ find /mnt/music-archive/.by-uuid/ -type f -name \*.flac | wc -l
7624

I keep a few copies of it: the lossless archive lives on three of my machines and two USB drives (one HDD, one SSD). This is a recent addition since (1) I’ve got the space to do it, and (2) some experimentation with Ogg/Vorbis metadata corrupted files and forced me to re-rip everything, so I thought I’d save future-me the hassle by keeping a lossless copy on hand.

Actually, this is not the first time I’ve done a re-rip of the whole collection. The previous re-rip was done back in 2005 when I moved from MP3 to Ogg/Vorbis (and ditched a lot of illegally obtained MP3s while I was at it, leaving me with just the recordings that I had licenses for). But, back then, storing a lossless copy of every file as I re-ripped everything would have been prohibitively expensive in terms of required storage. Now that even my near-10-year-old laptop sports a 2TB SSD, this isn’t a problem.

The working copy that I generally do my listening from uses the Ogg/Vorbis format today. I haven’t quite re-ripped everything: there’s a stack of records waiting for me to put them back on the turntable (one day I’ll get to those), but every CD, DVD and digital download (which were FLAC to begin with) is losslessly stored in FLAC.

If I make a change to the files, I really want to synchronise my changes between the two copies. Notably, if I change the file data, I need to re-encode the FLAC file to Ogg/Vorbis — but if I simply change its metadata (i.e. cover art or tags), I merely need to re-write the metadata on the destination file and can save some processing cycles.

The thinking is, if I can “fingerprint” the various parts of the file, I can determine what bits changed and what to convert. Obviously when I transcode the audio data itself, the audio data bytes will bear little resemblance to the ones that were fed into the transcoder — that’s fine — I have other metadata which can link the two files. The aim of this exercise is to store the hashes for the audio data and tags, and detect when one of those things changes on the source side, so the change can be copied across to the destination.
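A minimal sketch of that decision logic in Python might look like the following. The `Fingerprint` type, the `plan_sync` function and the action names are all made up for illustration; how the hashes themselves get computed is left open here.

```python
from typing import NamedTuple, Optional


class Fingerprint(NamedTuple):
    """Hashes recorded for one source file (hypothetical structure)."""
    audio_hash: str  # hash of the audio data only
    tag_hash: str    # hash of the tags and cover art


def plan_sync(previous: Optional[Fingerprint], current: Fingerprint) -> str:
    """Decide what the destination copy needs.

    Returns "transcode" when the audio changed (or the file is new),
    "retag" when only the metadata changed, and "skip" otherwise.
    """
    if previous is None or previous.audio_hash != current.audio_hash:
        # New or changed audio: a full re-encode also rewrites the tags.
        return "transcode"
    if previous.tag_hash != current.tag_hash:
        return "retag"
    return "skip"
```

The audio check comes first because a re-encode has to write the tags to the new file anyway, so a tag change on top of an audio change needs no separate step.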

Existing option: MD5 hash

FLAC actually does store a hash of its source input as part of the stream metadata. It uses the MD5 hashing algorithm which, while good enough for a rough check and certainly better than linear codes like CRC, is really quite dated as a cryptographic hash.

I’d prefer to use SHA-256 for this since it’s generally regarded as being a “secure” hash algorithm that is less vulnerable to collisions than MD5 or SHA-1.
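For reference, the stored MD5 can be read without decoding anything, since it occupies the last 16 bytes of the 34-byte STREAMINFO block that follows the `fLaC` marker. The sketch below assumes the file starts directly with that marker (no ID3 tag prepended); `streaminfo_md5` is a hypothetical helper name.

```python
def streaminfo_md5(header: bytes) -> str:
    """Return the MD5 stored in STREAMINFO as a hex string.

    `header` must hold at least the first 42 bytes of the file:
    4 bytes of marker, a 4-byte block header, and 34 bytes of STREAMINFO.
    """
    if header[:4] != b"fLaC":
        raise ValueError("missing fLaC marker")
    if len(header) < 42:
        raise ValueError("need at least 42 bytes")
    block_type = header[4] & 0x7F               # low 7 bits: block type
    length = int.from_bytes(header[5:8], "big")  # 24-bit big-endian length
    if block_type != 0 or length != 34:
        raise ValueError("first metadata block is not a valid STREAMINFO")
    streaminfo = header[8:42]
    # The MD5 signature of the unencoded audio is the final 16 bytes.
    return streaminfo[18:34].hex()
```

Usage would be along the lines of `streaminfo_md5(open(path, "rb").read(42))`. An all-zero value means the encoder did not record a signature.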

Naïve approach: decode and compare

The naïve approach would be to just decode to raw audio data and compare the raw audio files. I could do this via a pipe to avoid writing the files out to disk just to delete them moments later. The following will output a raw file:

RC=0 stuartl@rikishi ~ $ time flac -d -f -o /tmp/test.raw /mnt/music-archive/by-album-artist/Traveling\ Wilburys/The\ Traveling\ Wilburys\ Collection/d1o001t001\ Traveling\ Wilburys\ -\ Handle\ With\ Care.flac 

flac 1.3.4
Copyright (C) 2000-2009  Josh Coalson, 2011-2016  Xiph.Org Foundation
flac comes with ABSOLUTELY NO WARRANTY.  This is free software, and you are
welcome to redistribute it under certain conditions.  Type `flac' for details.

d1o001t001 Traveling Wilburys - Handle With Care.flac: done         

real    0m0.457s
user    0m0.300s
sys     0m0.065s

On my laptop, it takes about 200~500ms to decode a single file to raw audio. Multiply that by 7624 and, at the 457ms measured above, you get something that will take nearly an hour (about 58 minutes) to complete. I think we can do better!
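For what it’s worth, the pipe variant could be sketched in Python roughly as follows. This assumes the `flac` binary is on the PATH; `hash_chunks` and `decoded_audio_sha256` are hypothetical helper names, and the raw sample format chosen (signed, little-endian) is arbitrary but has to stay the same between runs. It avoids the temporary file, not the decoding cost.

```python
import hashlib
import subprocess


def hash_chunks(chunks) -> str:
    """SHA-256 over an iterable of byte chunks, as a hex string."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def decoded_audio_sha256(path: str) -> str:
    """Decode a FLAC file to raw PCM on stdout and hash it on the fly."""
    cmd = [
        "flac", "-d", "-c", "-s",          # decode, to stdout, silently
        "--force-raw-format",              # headerless PCM rather than WAV
        "--endian=little", "--sign=signed",
        path,
    ]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        # Read the pipe in 64 KiB pieces until flac closes it.
        result = hash_chunks(iter(lambda: proc.stdout.read(65536), b""))
    if proc.returncode != 0:
        raise RuntimeError(f"flac exited with status {proc.returncode}")
    return result
```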

Alternate naïve approach: Copy the file then strip metadata

Making a copy of the file without the metadata is certainly an option. Something like this will do that:

RC=0 stuartl@rikishi ~ $ time ffmpeg -y -i \
    /mnt/music-archive/by-album-artist/Traveling\ Wilburys/The\ Traveling\ Wilburys\ Collection/d1o001t001\ Traveling\ Wilburys\ -\ Handle\ With\ Care.flac \
    -c:a copy -c:v copy -map_metadata -1 \
    /tmp/test.flac
… snip lots of output …
Output #0, flac, to '/tmp/test.flac':
  Metadata:
    encoder         : Lavf58.76.100
  Stream #0:0: Video: mjpeg (Progressive), yuvj420p(pc, bt470bg/unknown/unknown), 1000x1000 [SAR 72:72 DAR 1:1], q=2-31, 90k tbr, 90k tbn, 90k tbc (attached pic)
  Stream #0:1: Audio: flac, 44100 Hz, stereo, s16
    Side data:
      replaygain: track gain - -8.320000, track peak - 0.000023, album gain - -8.320000, album peak - 0.000023, 
Stream mapping:
  Stream #0:1 -> #0:0 (copy)
  Stream #0:0 -> #0:1 (copy)
Press [q] to stop, [?] for help
frame=    1 fps=0.0 q=-1.0 Lsize=   24671kB time=00:03:19.50 bitrate=1013.0kbits/s speed=4.77e+03x    
video:114kB audio:24549kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.033000%

real    0m0.139s
user    0m0.105s
sys     0m0.032s

This is a big improvement, but just because the audio blocks are the same does not mean the file itself won’t change in other ways — FLAC files can include “padding blocks” anywhere after the STREAMINFO block which will change the hash value without having any meaningful effect on the file content.
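To check what a given file actually carries, the metadata block headers can be walked directly: each one is a single byte (a “last block” flag plus a 7-bit type) followed by a 24-bit big-endian length. The following is a sketch only, assuming a file that begins with the `fLaC` marker; `list_metadata_blocks` is a hypothetical name.

```python
import io

BLOCK_TYPES = {
    0: "STREAMINFO", 1: "PADDING", 2: "APPLICATION", 3: "SEEKTABLE",
    4: "VORBIS_COMMENT", 5: "CUESHEET", 6: "PICTURE",
}


def list_metadata_blocks(stream):
    """Return [(block type name, length in bytes), ...] for a FLAC stream."""
    if stream.read(4) != b"fLaC":
        raise ValueError("missing fLaC marker")
    blocks = []
    while True:
        header = stream.read(4)
        if len(header) != 4:
            raise ValueError("truncated metadata block header")
        is_last = bool(header[0] & 0x80)   # top bit marks the final block
        block_type = header[0] & 0x7F
        length = int.from_bytes(header[1:], "big")
        name = BLOCK_TYPES.get(block_type, f"UNKNOWN({block_type})")
        blocks.append((name, length))
        stream.seek(length, io.SEEK_CUR)   # skip the block body
        if is_last:
            return blocks
```

Called as `list_metadata_blocks(open(path, "rb"))`, any `PADDING` entries in the result show where a whole-file hash could change without the audio changing.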

So this may not be as stable as I’d like. However, ffmpeg is on the right track…

Audio fingerprinting

MusicBrainz actually has an audio fingerprinting library whose fingerprints can be matched to a specific recording and are reasonably “stable” across different audio compression formats. Great for the intended purpose, but in this case it’s likely going to be computationally expensive since it has to analyse the audio in terms of frequency components, try to extract tempo information, etc. I don’t need this level of detail.

It may also miss that one file might, for example, be preceded by lots of silence; multi-track exports out of Audacity are a prime example. Audacity used to just export the multiple tracks “as-is” so you could re-construct the full recording by concatenating the files, but some bright-spark thought it would be a good idea to prepend the exported tracks with silence by default so if re-imported, their relative positions were “preserved”. Consequently, I’ve got some record rips that I need to fix because of the extra “silence”!

Getting hashes out of ffmpeg

It turns out that ffmpeg can output any hash you’d like of whatever input data you give it:

RC=0 stuartl@rikishi ~ $ time ffmpeg \
    -loglevel quiet \
    -i /tmp/test.flac \
    -c:a copy -vn -map_metadata -1 -f hash -hash sha256 -
SHA256=31e38749daa1061e6a2008ea61e841e5bc05b8b9ec1f0dfc54d8cd70f18fee3f

real    0m0.248s
user    0m0.234s
sys     0m0.014s
RC=0 stuartl@rikishi ~ $ time ffmpeg \
    -loglevel quiet \
    -i /mnt/music-archive/by-album-artist/Traveling\ Wilburys/The\ Traveling\ Wilburys\ Collection/d1o001t001\ Traveling\ Wilburys\ -\ Handle\ With\ Care.flac \
    -c:a copy -vn -map_metadata -1 -f hash -hash sha256 -
SHA256=31e38749daa1061e6a2008ea61e841e5bc05b8b9ec1f0dfc54d8cd70f18fee3f

real    0m0.242s
user    0m0.226s
sys     0m0.016s

Notice the hashes are the same, yet the first copy of the file we hashed does not contain the tags or cover art present in the file it was generated from. Speed isn’t as good as just stripping the metadata, but on the flip-side, it’s not as expensive as decoding the file to raw format, and should be more stable than naïvely hashing the whole file after metadata stripping.
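Wrapped up for scripting, the same invocation could look something like this in Python. The ffmpeg arguments are the ones used above; the function names are hypothetical, and the parser simply assumes the `NAME=hexdigest` line format seen in the output above.

```python
import subprocess


def build_hash_command(path, algorithm="sha256"):
    """ffmpeg arguments that hash the audio stream only (no decode)."""
    return [
        "ffmpeg", "-loglevel", "quiet",
        "-i", path,
        "-c:a", "copy", "-vn", "-map_metadata", "-1",
        "-f", "hash", "-hash", algorithm, "-",
    ]


def parse_hash_line(output):
    """Turn 'SHA256=31e3...' into ('sha256', '31e3...')."""
    for line in output.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and name and value:
            return name.lower(), value.lower()
    raise ValueError("no NAME=value line found in ffmpeg output")


def audio_stream_hash(path, algorithm="sha256"):
    """Run ffmpeg and return (algorithm, hex digest) for the audio stream."""
    completed = subprocess.run(
        build_hash_command(path, algorithm),
        check=True, capture_output=True, text=True,
    )
    return parse_hash_line(completed.stdout)
```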

Where to from here?

Well, having a hash, I can store this elsewhere (I’m thinking SQLite3 or LMDB), then compare it later to know if the audio has changed. Extracting the tags and images with mutagen or similar is neither difficult nor expensive, and those can be hashed using conventional means. I can also store the mtime and a complete file hash for an even faster “quick check”.
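As a rough illustration of the SQLite option, here is a sketch using only the Python standard library. The table layout, the function names and the canonical form used for tag hashing are all assumptions; the tags are taken as a plain mapping of field name to list of values, however they were extracted.

```python
import hashlib
import json
import sqlite3


def hash_tags(tags):
    """Hash a {field name: [values]} mapping in a canonical form.

    Field names are upper-cased because Vorbis comment names are
    case-insensitive; value order within a field is preserved.
    """
    canonical = json.dumps(
        sorted((name.upper(), list(values)) for name, values in tags.items()),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def open_store(path=":memory:"):
    """Open (and if needed create) the fingerprint database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS fingerprint ("
        " path TEXT PRIMARY KEY,"
        " mtime_ns INTEGER NOT NULL,"
        " audio_hash TEXT NOT NULL,"
        " tag_hash TEXT NOT NULL)"
    )
    return db


def record(db, path, mtime_ns, audio_hash, tag_hash):
    """Insert or overwrite the stored fingerprint for one file."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO fingerprint VALUES (?, ?, ?, ?)",
            (path, mtime_ns, audio_hash, tag_hash),
        )


def needs_rehash(db, path, mtime_ns):
    """Quick check: only re-hash files that are new or whose mtime moved."""
    row = db.execute(
        "SELECT mtime_ns FROM fingerprint WHERE path = ?", (path,)
    ).fetchone()
    return row is None or row[0] != mtime_ns
```

The `mtime_ns` value would come from `os.stat(path).st_mtime_ns`; files that pass the quick check never need to be opened at all.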

Tags: flac, metadata, music

2022/12/10 by Redhatter (VK4MSL). Categories: Computing, Music, Public Syndication, Thinktank.