Thoughts on a forward-erasure-coded optical disc filesystem

So, in the last 12 months or so, I’ve grown my music collection in a big way. Basically over the Christmas – New Year break, I was stuck at home, coughing and spluttering due to the bushfire smoke in the area (and yes, I realise it was nowhere near as bad in Brisbane as it was in other parts of the country).

I spent a lot of time listening to the radio, and one of the local radio stations was doing a “25 years in 25 days” feature, covering many iconic tracks from the latter part of last decade. Now, I’ve always been a big music listener. Admittedly, I’m very much a music luddite, with the vast majority of my music spanning 1965~1995… with some spill over as far back as 1955 and going as forward as 2005 (maybe slightly further).

Trouble is, I’m not overly familiar with the names, and the moment I walk into a music shop, I’m like the hungry patron walking into a food court: I want to eat something, but what? My mind goes blank as it is bombarded with all kinds of possibilities.

So when this count-down appeared on the radio, naturally, I found myself looking up the play list, and I came away with a long “shopping list” of songs I’d look for. Since then, a decent amount has been obtained as CDs from the likes of Amazon and Sanity… however, for some songs, I found it was easiest to obtain them as a digital download in FLAC format.

Now, for me, my music is a long-term investment. An investment that transcends changes in media formats. I do agree with ensuring that the “creators” of these works are suitably compensated for their efforts, but I do not agree with paying for the same thing multiple times.

A few people have had to perform in a studio (or on stage), someone’s had to collect the recordings, mix them, work with the creators to assemble those into an album, work with other creative people to come up with cover art, marketing… all that costs money, and I’m happy to contribute to that. The rest is simply an act of duplication: and yes, that has a cost, but it’s minimal and highly automated compared to the process of creating the initial work in the first place.

To me, the physical media represents one “license”, to perform that work, in private, on one device. Even if I make a few million copies myself, so long as I only play one of those copies at a time, I am keeping in the spirit of that license.

Thus, I work on the principle of keeping an “archival” copy, from which I can derive working copies that get day-to-day playback. The day-to-day copy will be in some lossy format for convenience.

A decade ago that was MP3, but due to licensing issues, that became awkward, so I switched over to Ogg/Vorbis, which also reduced the storage requirements by 40% whilst not having much audible impact on the sound quality (if anything, it improved). Since I also had to ditch the illegally downloaded MP3s in the process, that also had a “cleaning” effect: from then on, I insisted on having a “license” for each song, whether that be wax cylinder, tape reel, 8-track, cassette tape, vinyl record, CD, whatever.

This year saw the first time I returned to music downloads, but this time, downloading legally purchased FLAC files. This leads to an interesting problem: how do you store these files in a manner that will last?

Audio archiving and CDs

I’m far from the first person with this problem, and the problem isn’t specific to audio. The archiving business is big money, and sometimes it does go wrong, whether it be old media being re-purposed (e.g. old tapes of “The Goon Show” being re-recorded with other material by the BBC), destruction (e.g. Universal Studios fire), or just old-fashioned media degradation.

The procedure for film-based media (whether it be optical film, or magnetic media) usually involves temperature and humidity control, along with periodic inspection. Time-consuming, expensive, error prone.

CDs are reasonably resilient, particularly proper audio CDs made to the Red Book audio disc standard. In the CD-DA standard, uncompressed PCM audio is Reed-Solomon encoded to achieve forward error correction of the PCM data. Thus, if a minor surface defect develops on the media, there is hopefully enough data intact to recover the audio samples and play on as if nothing had happened.

The fact that one can take a disc purchased decades ago, and still play it, is testament to this design feature.

I’m not sure what features exist in DVDs along the same lines. While there is the “video object” container format, the purpose of this seems to be more about copyright protection than about resiliency of the content.

Much of the above applies to pressed media. Recordable media (CD-Rs) sadly isn’t as resilient. In particular, the quality of blanks varies, with some able to withstand years of abuse, and others degrading after 18 months. Notably, the dye fades, and so you start to experience data loss beginning with the edge of the disc.

Pressed discs work great for stuff I’ve purchased on CD. Vinyl records, if looked after, will also age well, although it’d be nice to have a CD back-up in case my record player packs it in. However, none of this helps with my digital downloads.

At the moment, my strategy is to download the files to a directory, save a copy of the email receipt with them, place my GPG public key alongside, take SHA-256 hashes of all of the files, then digitally sign the hashes. I then place a copy on an old 1TB HDD, and burn a copy to CD-R or DVD-R. This will get me by for the next few years, but I’ve been “burned” by recordable media failing, and HDDs are not infallible either.
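The hashing step could be sketched along these lines. This is only a minimal illustration, not the exact procedure used here: the function names and the `SHA256SUMS` manifest file name are assumptions, and the output imitates the two-space `sha256sum` line format so that standard tools can check it later.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path, manifest_name: str = "SHA256SUMS") -> str:
    """Build sha256sum-style lines for every file below root.

    The manifest itself and its detached signature (hypothetical names)
    are skipped so that re-running does not hash its own output.
    """
    skip = {manifest_name, manifest_name + ".asc"}
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path.name in skip:
            continue
        relative = path.relative_to(root).as_posix()
        lines.append(f"{sha256_file(path)}  {relative}\n")
    return "".join(lines)
```

Writing the result of `build_manifest()` to `SHA256SUMS` in the download directory and then running something like `gpg --armor --detach-sign SHA256SUMS` would give a signed manifest that `sha256sum -c SHA256SUMS` can later verify. Note that this only detects damage; it does nothing to repair it.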

Getting discs pressed only makes sense when you need thousands of copies. I just need one or two. So I need some media that will last the distance, but can be produced in small quantities at home from readily available blanks.

Archiving formats

So, there are a few options out there for archival storage. Let’s consider a few:

Magnetic tape

Professional outfits seem to work on tape storage. Magnetic media, with all the overheads that implies. The newest drive in the house is a DDS-2 DAT drive, the media for which has not been produced in years, so that’s a lame duck. LTO is the new kid on the block, and LTO-6 drives are pricey!

Magneto-Optical

MO drives are another option from the past… we do have a 5¼” SCSI MO drive sitting in the cupboard, which takes 2GB cartridges, but where do you get the media from? Moreover, what do I do when this unit croaks (if it hasn’t already)?

Flash

Flash media sounds tempting, but then one must remember how flash works. It’s a capacitor on the gate of a MOSFET, storing a charge. The dielectric material around this capacitor has a finite resistance, which will cause “leakage” of the charge, meaning over time, your data “rots” much like it does on magnetic media. No one is quite sure what the retention truly is. NOR flash is better for endurance than NAND, but if it’s a recent device with more than about 32MB of storage, it’ll likely be NAND.

PROM

I did consider whether PROMs could be used for this, the idea being you’d work out what you wanted to store, burn a PROM with the data as ISO9660, then package it up with a small MCU that presents it as CD-ROM. The concept could work since it worked great for game consoles from the 80s. In practice they don’t make PROMs big enough. Best I can do is about 1 floppy’s worth: maybe 8 seconds of audio.
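As a rough check on that figure, here is the arithmetic, assuming “one floppy” means a 1440 KiB (“1.44 MB”) disk and that the audio is uncompressed CD-quality PCM (44.1 kHz, 16-bit, stereo). Both are assumptions made for illustration; FLAC would stretch it a little further.

```python
# Assumed capacity: a 3.5" high-density floppy, 1440 KiB.
FLOPPY_BYTES = 1440 * 1024

# CD-quality PCM parameters.
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2   # 16-bit samples
CHANNELS = 2           # stereo

bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
seconds = FLOPPY_BYTES / bytes_per_second

print(f"{bytes_per_second} bytes/s -> {seconds:.1f} s of audio")
```

That works out to 176400 bytes per second, so a floppy’s worth of PROM holds a bit over 8 seconds.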

Hard drives

HDDs are an option, and for now that’s half my present interim solution. I have a 1TB drive formatted UDF which I store my downloads on. The drive is one of the old object storage drives from the server cluster after I upgraded to 2TB drives. So not a long-term solution. I am presently also recovering data from an old 500GB drive (PATA!), and observing what age does to these disks when they’re not exercised frequently. In short, I can’t rely on this alone.

CDs, DVDs and Bluray

So, we’re back to optical media. All three of these are available as blank recordable media, and even Bluray drives can read CDs. (Unlike LTO: where an LTO-$X drive might be backward compatible with LTO-$(X-2) but no further.)

There are blanks out there that are designed for archival use, notably the M-Disc DVD media, which are allegedly capable of lasting 1000 years.

I don’t plan to wait that long to see if their claims stack up.

All of these formats normally use the same file systems, either ISO-9660 or UDF. Neither of these file systems offers any kind of forward error correction of data, so if the dye fades, or the disc gets scratched, you can potentially lose data.

Right now, my other mechanism is to use CDs and DVDs, burned with the same material I put on the aforementioned 1TB HDD. The optical media is formatted ISO-9660 with Joliet and Rock-Ridge extensions. It works for now, but I know from hard experience that CD-Rs and DVD-Rs aren’t forever. Question is, can they be improved?

File system thoughts

Obviously, genuinely better quality media will help in this archiving endeavour, but the question is: can I improve the odds? Can I sacrifice some storage capacity to achieve data resilience?

Audio CDs, as I mentioned, use Reed-Solomon encoding. Specifically, Cross-Interleaved Reed-Solomon encoding. ISO-9660 is a file system that supports extensions on the base standard.

I mentioned two before, Rock-Ridge and Joliet. On top of Rock-Ridge, there’s also zisofs, which adds transparent decompression to a Rock-Ridge file system. What if I could make an RS-encoded copy of each file’s blocks, and place those blocks around the disc surface so that if the original file was unreadable, we could turn to the forward-error-corrected copy?

There is some precedent for such a proposal. The “parchive” format was popularised as a way of adding FEC to files distributed on Usenet. That at least has the concept of what I’m wishing to achieve.
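To make the “recover a lost block from redundancy” idea concrete, here is a deliberately simplified sketch. It is not Reed-Solomon and not the parchive format: it uses a single XOR parity block, which can rebuild exactly one missing block per group, whereas RS codes can be tuned to survive several. All function names are made up for illustration.

```python
from __future__ import annotations

from functools import reduce


def split_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Split data into fixed-size blocks, zero-padding the last one."""
    padded = data + b"\x00" * (-len(data) % block_size)
    return [padded[i:i + block_size] for i in range(0, len(padded), block_size)]


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(blocks: list[bytes | None], parity: bytes) -> list[bytes]:
    """Rebuild at most one missing block (marked None) from the parity block."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost block")
    repaired = list(blocks)
    if missing:
        survivors = [b for b in blocks if b is not None]
        # XOR of all surviving blocks and the parity yields the lost block.
        repaired[missing[0]] = xor_blocks(survivors + [parity])
    return repaired


# Usage: lose the middle block of three, then rebuild it.
blocks = split_blocks(b"forward error correction", 8)
parity = xor_blocks(blocks)
damaged = [blocks[0], None, blocks[2]]
restored = recover(damaged, parity)
```

Here the cost is one extra block per group of three, i.e. a third more storage, in exchange for surviving one unreadable block in that group. That is the capacity-for-resilience trade in miniature.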

The other area of research is how I can make the ISO-9660 filesystem metadata more resilient. No good the files surviving if the filesystem metadata that records where they are is dead.

Video DVDs are often dual UDF/ISO-9660 file systems, the so-called “UDF Bridge” format. Thus, it must be possible for a foreign file system to live amongst the blocks of an ISO-9660 file system. Conceptually, if we could take a copy of the ISO-9660 filesystem metadata, FEC-encode those blocks, and map them around the disc, we can make the file system resilient too.
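One way to picture “mapping them around the disc” is a simple stride layout, where consecutive blocks of the same FEC group are spaced apart so that a localised defect only takes one block from each group. The sketch below is purely illustrative: the layout, the function names, and the assumption that damage arrives as a contiguous run of sectors are all mine, and real disc damage may not be that tidy.

```python
def placement(group: int, index: int, group_count: int) -> int:
    """Disc position of block `index` of FEC group `group` under a stride layout."""
    return index * group_count + group


def worst_group_damage(start: int, length: int, group_count: int) -> int:
    """Largest number of blocks any single group loses to a damaged run."""
    hits = {}
    for position in range(start, start + length):
        group = position % group_count  # inverse of placement()
        hits[group] = hits.get(group, 0) + 1
    return max(hits.values(), default=0)


# With 8 groups, any damaged run of up to 8 blocks costs each group
# at most one block; a run of 9 costs some group two.
GROUPS = 8
short_run = worst_group_damage(start=5, length=8, group_count=GROUPS)
long_run = worst_group_damage(start=5, length=9, group_count=GROUPS)
```

If each group can tolerate one lost block, this layout survives any single damaged run no longer than the number of groups. Stronger codes or wider spacing would extend that.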

FEC algorithms are another consideration. RS is a tempting prospect for two reasons:

zfec used in Tahoe-LAFS is another option, as is Golay, and many others. They’ll need to be assessed on their merits.

Anyway, there are some ideas… I’ll ponder further details in another post.