1,000 Year Personal Bit Storage

Since the 90’s I’ve had a filesystem which I call “Vault”. It contains all the bits I really don’t want to lose: documents, code (including my first efforts from the 80’s), audio and video recordings and, most importantly, my photo archive. (both digitally originated photos and scans of film from the pre-digital camera era) I keep a copy of the Vault on a live, network-accessible filesystem, usually on a software RAID. This is handy both for accessing the bits and as a place to easily drop new data that should eventually reach a replicated offline state. Suffice it to say, I periodically make multiple backups of the Vault and store them on different kinds of media in various geographically distributed locations.

Initially the Vault was backed up as a tar archive on 8mm tape, but in the late 90’s I started writing it onto CDs instead because tape drives were becoming harder to find. I also wanted something other than magnetic media holding these bits should a big magnetic event, either local or global, ever happen.

I had two general problems after migrating to writable CDs. While access was better, the bits rotted much faster than I expected, with some discs becoming completely unreadable within 6 years. That wasn’t great, but it was easily mitigated by upping the pace of the backups. In practice it never hurt me because I’ve never actually had to resort to a backup — the live version of the filesystem has never crashed.

The other problem was the size of the discs. At only 650 megabytes per CD, I had to get creative as the average file size in the video directory exploded. It got to the point where some discs contained one file each, which (aside from just being janky) turned backups into a nightmarishly manual physical task of shuttling discs in and out of drives. (and therefore cutting down on my overall motivation — not good)

M-DISC

Although I missed it in 2010, a now-defunct company called Millenniata released a storage technology called M-DISC which claimed properly stored media should last 1,000 years. Writing one requires a drive with a much more powerful laser, and the recording layer is an inorganic, rock-like material rather than the organic dye used in ordinary writable discs, which makes the media a bit more expensive. The kicker is that an M-DISC, once burned, is functionally identical to a DVD or Blu-ray disc and readable by conventional drives. The media is sold by a variety of companies, notably Verbatim, and the writer (sporting the required more powerful laser) by LG, notably the WH16NS40.

LG WH16NS40 Blu-ray M-DISC Writer — $130
Verbatim BDXL M-DISC (pack of 10) — $60

While I’ll probably never escape having to repeatedly swap discs, the fact that one M-DISC (the BDXL type) can hold 100 gigabytes puts us in much better shape. Currently the Vault is pushing 5.7 terabytes, so we’re coming in around 57 discs. At $6 per disc, that is $342 to burn one complete backup — not bad at all for a lifetime of readability. Then all we need to do is save deltas. (all changes to the Vault are additions in my case)
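The arithmetic is simple enough to sanity-check in the shell (using the rough sizes and prices above):

```shell
vault_gb=5700   # the Vault is pushing 5.7 TB
disc_gb=100     # capacity of one BDXL M-DISC
price_usd=6     # approximate cost per disc
discs=$(( (vault_gb + disc_gb - 1) / disc_gb ))   # round up to whole discs
echo "$discs discs, \$$(( discs * price_usd )) per full backup"
```
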

Data Layout

That solves half of the issue. The other half is a bit more complex. Cutting sets of files down on 100GB boundaries creates inefficiencies. I’ll probably end up using a few more discs (not the end of the world), but I’m also limited to a maximum file size of 100GB. That may not seem like a big deal, but it isn’t far from where some of my biggest files are now. Because of that, I’d probably end up with several discs sporting a single file. Time to think creatively.

I could just cut an entire raw filesystem on 100GB boundaries and write each chunk as a single .bin file, one per disc. That would work, but it adds an intermediate step to recovery: reconstituting the entire filesystem onto a sizable partition and mounting it. This may be a necessary evil, but it also probably requires being able to read two filesystem formats: ISO 9660 and whatever filesystem you use for your data — in my case, ext4. (I suppose you could do ISO inside of ISO too)

Let’s say we decide we can live with two different filesystems. That gives us all the features of a true unix filesystem. (executable bits, ownership, compression, redundancy) It turns out RAID technology is fairly well developed and, if we’re already doing this, gives us a few nice features — namely better redundancy. Say we striped the data across 58 discs using RAID5. That would build in some redundancy, so should one of the discs be unreadable, we wouldn’t actually lose data.

Now of course we have to have access to all but one of the discs to be able to read any one file. That’s not great from a usability perspective, but I don’t anticipate ever touching these discs unless I’m trying to recover from a catastrophic data loss, so I’m willing to work with this constraint.

We could up the replication factor so we can tolerate losing two discs by moving to RAID6, but I’ll leave that as an exercise for the reader. We’re out of luck with RAID if we want to survive a 3-disc failure though.

For that we could use zfs and make a RAIDZ3. zfs is nothing short of awesome from a flexibility standpoint, but it is stuck in a licensing quagmire which makes me worry that the software for reconstructing a RAIDZ3 in the future might be somewhat hard to get (fixable if we leave a copy on the discs) and poorly understood. (harder to mitigate, although a README.md would help) It seems to me there is room here for a RAID-like option with an arbitrary replication factor, although you could jury-rig something by layering RAIDs on top of RAIDs. (too complicated for archival needs I would argue)

RAID5

So how does one actually make a RAID5 for archival purposes? We can create a bunch of empty files, attach them to loopback devices and use those to construct a standard RAID5 in Linux. Let’s look at a simple example with only 4 disks. First we’ll use dd to make 4 empty 99GB files. (we want to go slightly under 100GB so they fit on the ISO filesystem along with a README.md and any necessary software)

$ dd if=/dev/zero of=raid5-01.bin bs=1G count=99
$ dd if=/dev/zero of=raid5-02.bin bs=1G count=99
$ dd if=/dev/zero of=raid5-03.bin bs=1G count=99
$ dd if=/dev/zero of=raid5-04.bin bs=1G count=99
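As an aside, since these images start out as all zeros, you can create them near-instantly as sparse files with truncate (assuming GNU coreutils) instead of streaming 99GB of zeros through dd; they only consume real space as the RAID writes to them:

```shell
cd "$(mktemp -d)"
# Same logical size as the dd version, but no blocks allocated up front:
for i in 01 02 03 04; do
    truncate -s 99G "raid5-$i.bin"
done
# Logical size in bytes vs. blocks actually allocated on disk:
stat -c '%s bytes, %b blocks allocated' raid5-01.bin
```
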

Next, we’ll make sure we have enough loopback block devices for the number of disks we have. If you don’t find them under /dev/loop* you can create them. I started at /dev/loop10 because I was already using the lower-numbered ones for docker, which tends to make heavy use of loopback devices.

$ mknod /dev/loop10 b 7 10
$ mknod /dev/loop11 b 7 11
$ mknod /dev/loop12 b 7 12
$ mknod /dev/loop13 b 7 13

Next, we’ll attach the files to the loopback devices so the kernel treats them as block devices:

$ losetup loop10 raid5-01.bin
$ losetup loop11 raid5-02.bin
$ losetup loop12 raid5-03.bin
$ losetup loop13 raid5-04.bin

Now let’s create a RAID5 across all of our devices:

$ mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/loop10 /dev/loop11 /dev/loop12 /dev/loop13

Now we have /dev/md0 which is a RAID5 striping data across the four files. Let’s make a filesystem on this device:

$ mkfs.ext4 /dev/md0

Now we have a filesystem so let’s mount it somewhere and start adding some files:

$ mkdir raid
$ mount /dev/md0 raid/
$ cp -r vault/* raid/

Now that we have some data on the RAID, let’s see how it is functioning:

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 loop13[4] loop12[2] loop11[1] loop10[0]
      3035136 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>

Notice the [UUUU] near the end? That’s showing us all four “disks” are U (up) and working. Looks like it’s happy.

Now that we have some data in there, we’re going to want to shut it down and unwind everything. We’ll start by un-mounting the RAID.

$ umount raid/

Then we’ll stop the RAID. (this also removes /dev/md0)

$ mdadm --stop /dev/md0

We also want to unlink the loopbacks. We can see what we have with:

$ losetup -l

And then we’ll unlink the loops. (sometimes stopping the RAID does this for us but it can’t hurt)

$ losetup -d /dev/loop10
$ losetup -d /dev/loop11
$ losetup -d /dev/loop12
$ losetup -d /dev/loop13

Now we have 4 files of equal size that can be burned onto M-DISC media. As mentioned before, in reality we would also include a text file with instructions on what the big file is and how to reconstruct everything. If we’re feeling especially conservative, we might also leave a bootable Linux system with the RAID tools installed and possibly the associated source code.
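A minimal sketch of such a README.md, written from the shell. (the disc numbering, filenames and device names here are placeholders to adapt to your own archive)

```shell
cd "$(mktemp -d)"
# Recovery instructions to burn alongside each image file:
cat > README.md <<'EOF'
# Vault archive -- disc 1 of 4 (raid5-01.bin)
These discs hold a Linux md RAID5 striped across 4 image files.
To reconstruct:
  1. Copy every raid5-*.bin file onto one filesystem.
  2. Attach each image read-only:  losetup -r /dev/loopN raid5-NN.bin
  3. Assemble and mount:
       mdadm --assemble --run /dev/md0 /dev/loop10 /dev/loop11 ...
       mount -o ro /dev/md0 /mnt
RAID5 tolerates the loss of any one disc.
EOF
wc -l README.md
```
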

Burning takes a while. In my experience, a 100GB M-DISC takes about 3 hours, which doubles if you want to verify the disc as well. I don’t mind so much as long as I can cut an entire backup within a month or so. Once burned, you are going to want to make sure it is stored in a dark, cool and dry place. Basements are good provided the house doesn’t burn down. ;)
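Verification is also cheap to set up yourself: before burning, write a checksum manifest for the image files, and after burning (or years later) re-run it against the disc contents. A sketch with sha256sum, using small stand-in files so it runs anywhere:

```shell
cd "$(mktemp -d)"
# Stand-ins for the real 99GB image files:
dd if=/dev/urandom of=raid5-01.bin bs=1K count=64 2>/dev/null
dd if=/dev/urandom of=raid5-02.bin bs=1K count=64 2>/dev/null
# Manifest to burn alongside the images:
sha256sum raid5-*.bin > SHA256SUMS
# Later, run this against the mounted disc to verify every image:
sha256sum -c SHA256SUMS
```
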

Let’s simulate a lost or damaged disc. We’ll just leave out the last image file and see if we can still get at all the data given the remaining images. You’ll see this turns out to be aggressively automated by the kernel. As soon as we attach the first image to a loopback device: (the -r here attaches it in read-only mode)

$ losetup -r loop10 raid5-01.bin

the kernel will notice that this device is part of a RAID and will attempt to bring it up.

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive loop10[0](S)
      1011712 blocks super 1.2

unused devices: <none>

We’re currently inactive but we’ve recognized this as md0 so we’re on our way. Let’s add the other two images. (leaving off the last one to simulate loss)

$ losetup -r loop11 raid5-02.bin
$ losetup -r loop12 raid5-03.bin

Now we have an active (but degraded) RAID:

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active (read-only) raid5 loop12[2] loop11[1] loop10[0]
      3035136 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]

unused devices: <none>

Notice the [UUU_] there showing us we have three disks (U) up and one (_) missing? The RAID is marked active so let’s mount it:

$ mount /dev/md0 raid/
mount: /home/anders/raid: WARNING: device write-protected, mounted read-only.

A quick check in the raid/ directory shows us all our files. Looks like we’re in business! Not bad for a repurposed technology, huh?

zfs RAIDZ3

We could also do this with zfs which is incredibly flexible for this kind of thing, ignoring the licensing debacle. RAIDZ3 gives us replication that will survive three lost disks. Additionally, you can make data more redundant by upping the number of copies of each block. (called “ditto blocks”) zfs also supports compression, deduplication and streaming updates which are nice in a backup scenario.

First we’ll want to make sure we have the tools. In Ubuntu and similar you can install them with:

$ apt install zfsutils-linux

We’ll start with 5 files this time:

$ dd if=/dev/zero of=raidz3-01.bin bs=1G count=99
$ dd if=/dev/zero of=raidz3-02.bin bs=1G count=99
$ dd if=/dev/zero of=raidz3-03.bin bs=1G count=99
$ dd if=/dev/zero of=raidz3-04.bin bs=1G count=99
$ dd if=/dev/zero of=raidz3-05.bin bs=1G count=99

but we won’t need to do loopback devices because zfs supports files directly. (you have to refer to them by full path though or zpool looks for devices under /dev/) Let’s create the RAIDZ3 pool:

$ zpool create pool-raidz3 raidz3 /root/raidz3-01.bin /root/raidz3-02.bin /root/raidz3-03.bin /root/raidz3-04.bin /root/raidz3-05.bin

We can check the status with:

$ zpool status
  pool: pool-raidz3
 state: ONLINE
  scan: none requested
config:

    NAME                     STATE     READ WRITE CKSUM
    pool-raidz3              ONLINE       0     0     0
      raidz3-0               ONLINE       0     0     0
        /root/raidz3-01.bin  ONLINE       0     0     0
        /root/raidz3-02.bin  ONLINE       0     0     0
        /root/raidz3-03.bin  ONLINE       0     0     0
        /root/raidz3-04.bin  ONLINE       0     0     0
        /root/raidz3-05.bin  ONLINE       0     0     0

errors: No known data errors

Looks good. By default that is mounted at /pool-raidz3 so we can dive right in if we want to.

But let’s flip on compression for the fun of it:

$ zfs set compression=lz4 pool-raidz3

We could also keep 2 copies of each block if we feel we have the space:

$ zfs set copies=2 pool-raidz3

OK, moving on. Let’s copy some files in there:

$ cp -r vault/* /pool-raidz3

Have we saved any space using compression?

$ zfs get compressratio
NAME PROPERTY VALUE SOURCE
pool-raidz3 compressratio 1.31x -

Looks like we’ve saved a little bit there. Most of my files are already compressed so I wouldn’t expect too much of a win for my stuff. There is obviously a performance hit using compression but I think it is clearly worth it in the backup context.

OK, let’s tear that down so we can burn the files.

$ zpool export pool-raidz3

This writes out any unwritten data and marks the pool as exported. We’ll want to do this so we can re-import it later on. (as opposed to zpool destroy pool-raidz3 which will do exactly what you think)

That’s pretty much it. We wrote to the files directly so no need to undo loopbacks or anything. We can just burn the files to disk as-is.

Let’s simulate two lost disks and see how we can recover things.

$ mkdir hold/
$ mv raidz3-04.bin hold/
$ mv raidz3-05.bin hold/

Now let’s see what zfs thinks can be imported. We’ll point it at our /root directory to search for virtual device files.

$ zpool import -d /root/
   pool: pool-raidz3
     id: 9730183507039862411
  state: DEGRADED
 status: One or more devices are missing from the system.
 action: The pool can be imported despite missing or damaged devices. The
         fault tolerance of the pool may be compromised if imported.
    see: http://zfsonlinux.org/msg/ZFS-8000-2Q
 config:

    pool-raidz3                 DEGRADED
      raidz3-0                  DEGRADED
        /root/raidz3-01.bin     ONLINE
        /root/raidz3-02.bin     ONLINE
        /root/raidz3-03.bin     ONLINE
        14777853700595681385    UNAVAIL  cannot open
        7227292630874512686     UNAVAIL  cannot open

So the state is DEGRADED but this should still work. Let’s mount it by adding the pool name at the end of that same command.

$ zpool import -d /root/ -o readonly=on pool-raidz3

And that does it — we can see all our files in /pool-raidz3. That’s it — fairly straightforward. (and compressed!)

Bootable Partition

Lastly, I want something that my future self or others will hopefully find reasonably actionable. The whole point of backing things up is being able to recover them effectively, so I want to leave some instructions in the form of a README.md file, an index of all the files and ideally a bootable partition with the software necessary to get the data back. I’m going to ignore a world where x86 computers with Blu-ray drives don’t exist. (although this would be a major concern if archiving something for hundreds or thousands of years — see GitHub’s Arctic Code Vault which, unrelated, contains several of my projects!)

Lately I’ve been messing with getting a small bootable partition going based on an Ubuntu LiveCD release. The idea is to have all the tools needed to decode the archive built into every disc to streamline reconstruction. Of course you either need 57 Blu-ray drives connected or a filesystem big enough to hold what is on them. Either way, it is a big ask in a system like this. I happen to have some reasonably large filesystems that could hold copies of the files, so this isn’t currently an issue.

Wishlist

I could be unaware of open source options already out there, but something similar to this — without the requirement to have most copies of the backup online just to get one file — would be a great enhancement. I’m thinking of something like a tar that is striped across multiple pieces of media. It would keep the unix file attributes and other features like compression but be more space efficient. (not restricted to file boundaries when they don’t completely fill the disc) If you striped in a staggered fashion across 3 sets of discs, you could probably survive a random loss of half of the discs with no data loss and still recover at least something if degraded past that.
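The core of that idea — a tar stream cut into media-sized pieces with no regard for file boundaries — can be approximated today with tar and split (parity could then be layered on with a tool like par2, not shown here). A small-scale sketch, with a 1K chunk size standing in for a 100GB disc:

```shell
cd "$(mktemp -d)"
# A toy "vault" to archive:
mkdir vault && dd if=/dev/urandom of=vault/big.bin bs=1K count=10 2>/dev/null
# Stream the tar straight into disc-sized pieces:
tar cf - vault | split -b 1K - vault.tar.
ls vault.tar.*            # vault.tar.aa, vault.tar.ab, ...
# Recovery: concatenate the pieces in order and untar:
cat vault.tar.* | tar xf - -C "$(mktemp -d)"
```

The obvious drawback, as with the RAID approach, is that a single lost piece without parity corrupts everything after it in the stream.
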

Some of the data storage locations would be 1,000-year-timescale, write-once media like M-DISCs, but others could be local or cloud-based filesystems. They might vary not only in size, speed or expected longevity but also in the level of privacy expected at the location where they are stored. For example, while you might cut data onto an M-DISC, you might store it with a third party such as in a bank deposit box. Encryption seems like an option for this, but you could also elect to group multiple locations with the same concern and distribute the data so as not to actually have all the bits in any one location. The other issue, if you go the encryption route, is where and how to store the keys.

Ideally you would come to the software with a filesystem to be backed up saying you had 5 data storage locations with these attributes and a requested replication factor. Then the software would return you a new set of files that you could cut to your backup media which would maximize the amount of data likely retrievable from random events at each location. The software would also produce an index as well. An interesting implementation of the index would be a new filesystem type that could be mounted (via FUSE?) where you could list the files and roam around the directories but any actual file access would hang while another user process asks you to insert a specific piece of media.

Arctic Code Vault 2D Data Format (source: YouTube)

Another thought on writable media is using giant 2D codes printed on paper with a laser printer, perhaps similar in look to the way the Arctic Code Vault does things. (pic above) We have had data on paper (also called “books”) for thousands of years, so presumably surviving that scale of time is well understood in this medium. If you got a pile of acid-free paper and printed chunks of data on it, presumably storage and retrieval (via scanner / digital camera) would be reasonably straightforward. The clear downside here is information density — we’re talking millions of pieces of paper for a bigger file. This might make sense for smaller high-value files like private keys or text documents though.

Conclusion

While I haven’t tested this out for 1,000 years or anything, M-DISC has fairly drastically impacted my long-term bit storage strategy. In particular, it has made the process far less manual given that I don’t think the bits will rot in my lifetime. (to be confirmed though) At the expense of some de-archiving complexity and a reliance on the readability of almost all my backup media, I get much more efficient file storage which retains all the native features of a real filesystem, plus other streaming backup options as well.

This is disaster-scenario backup and not intended for random file access. If convenience is the priority, it is clearly better to write discrete files onto backup media and accept the slight inefficiencies as a small price to pay. With a filesystem above the ISO 9660 layer you have to buy the tradeoffs, such as needing most discs online to reconstruct even a single file. Because all the media is probably stored together, it either all gets destroyed or all survives for the same reason, so I don’t think of this too negatively — but you might not agree.

In hindsight, I might just keep the live copy of my Vault on a zfs pool and make the backups using snapshots. I’m sure the first one would be a headache, but incremental backups from there would be very convenient. One could stream the changes (all writes in my case — I don’t alter things already written) to a file, and when that file approached the maximum size of the backup media (100GB for M-DISC BDXL), a checkpoint would be flushed to disk and the changes could be archived. Of course, to get a copy of the filesystem in its most recent state, you would need to read all media back and replay the changes. As this is a disaster-scenario system that I hope never needs to be accessed, I think this is a reasonable tradeoff.
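The chunking half of that flow can be sketched with generic tools. In real use the stream below would come from zfs send -i old-snapshot new-snapshot (the pool and snapshot names being your own); here a random file stands in for it so the pipeline runs anywhere, with 100K standing in for the 100GB media size:

```shell
cd "$(mktemp -d)"
# Stand-in for: zfs send -i vault@previous vault@current
dd if=/dev/urandom of=stream.bin bs=1K count=300 2>/dev/null
# Cut the replication stream into media-sized pieces:
split -b 100K stream.bin vault-incr.
ls vault-incr.*           # vault-incr.aa, vault-incr.ab, ...
# Restore order matters: concatenate the pieces and pipe into `zfs receive`.
cat vault-incr.* | cmp - stream.bin && echo "stream reassembles intact"
```
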

If you have delved into this and have other ideas, I’d love to hear them. Please leave a comment.

Written by

Applied CBDC Research @ the Federal Reserve — fmr Circle.com, Bandwidth.com. MIT / Podcaster / Runner / Helicopter Pilot
