Digital Photo Archiving

This came up in comments on TOP, and I realized I’d written enough that I wanted to make an article of it and keep it here where I could refer to it easily.

Craig Norris referred to this article about digital bit-rot that he had suffered, and that got me thinking about whether I’m covered against that sort of problem. He says he’s getting a stream of email from people who have had similar problems. I’ve never seen anything like that in my own collection—but I’m doing quite a few things to cover myself against such situations.

Here are things I’m doing to ensure the integrity of my digital photo archive:

  • ECC RAM especially in my file server. This memory (and associated software in the OS) can detect up to two bit errors in a word, and correct up to one bit error in a word.
  • No overclocking. I’m not taking risks with data integrity on this system.
  • Storing the images on a ZFS filesystem. ZFS keeps data block checksums independent of the hardware error protection, so it can detect more errors than just relying on the hardware. (Also, the data is mirrored on two disks.) (The ZFS checksums are larger than the hardware checksums, and so will detect more error cases. No checksum system will detect all possible changes to a block of data, though.)
  • Running weekly “scrubs”, where ZFS reads all the blocks on the disks and verifies their checksums. This means errors will be detected within a week, rather than waiting until the next time I look at an image. This makes it more likely that I’ll have a valid backup somewhere. (I have not yet detected any error on a scrub.) The early detection, and detection not depending on a human eye, are very valuable I think.

(I believe the BTRFS and NILFS filesystems for Linux also do block checksums.  ZFS is available in Linux and BSD ports, but none of these  are mainstream or considered production-ready in the Linux world (the original Solaris ZFS that I’m running is production-grade).  You could simulate block checksums with a fairly simple script and the md5sum utility, making a list of the MD5 checksums of all files in a directory and then checking it each week.)
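
The checksum-list idea in the previous paragraph could look something like the following. This is a minimal sketch, assuming Python and its standard hashlib module in place of a shell script around md5sum; the function names (file_md5, build_manifest, compare_manifests) are made up for illustration and are not from any existing tool.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its MD5 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            manifest[os.path.relpath(full_path, root)] = file_md5(full_path)
    return manifest


def compare_manifests(old, new):
    """Return (changed, missing, added) path lists between two manifests."""
    changed = sorted(p for p in old if p in new and old[p] != new[p])
    missing = sorted(p for p in old if p not in new)
    added = sorted(p for p in new if p not in old)
    return changed, missing, added
```

The intended use would be to save the output of build_manifest once (somewhere other than the directory being checked), rebuild it each week, and look at what compare_manifests reports. Unlike a filesystem scrub, this only detects a change; it cannot tell a deliberate edit from corruption, and it cannot repair anything.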

  • For many of the older directories, I’ve run PAR2 to create redundant bits and checksums of the files in the directory (I choose about 15% overhead).  This gives me yet another way to detect and possibly fix errors.  I should really go through and do more of this.
  • Multiple backups on optical and magnetic media, including off-site copies.
  • Using high-quality optical media for backups (Kodak Gold Ultima, MAM Gold archival).
  • I have a program for analyzing the state of optical disks, which can tell how much error correction is going on to make it readable.  This should give me early warning before a disk becomes unreadable.  I need to run this again on some of my older samples.
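
The ECC behavior in the first item of this list (correct one flipped bit per word, detect two) can be illustrated with a toy code. The following is a minimal sketch, assuming an extended Hamming(8,4) code that protects a 4-bit value; real ECC memory typically protects 64-bit words with 8 check bits, and the names encode and decode here are purely illustrative.

```python
from functools import reduce
from operator import xor

# Positions 1, 2 and 4 hold Hamming parity bits; position 0 holds overall parity.
DATA_POSITIONS = (3, 5, 6, 7)


def encode(nibble):
    """Encode a 4-bit value as an 8-bit extended Hamming codeword (list of bits)."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = reduce(xor, bits[1:])  # makes the parity of all 8 bits even
    return bits


def decode(bits):
    """Return (status, value): status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    overall = reduce(xor, bits)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one, at the position the syndrome names.
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but non-zero syndrome: two bits flipped, cannot locate them.
        return "uncorrectable", None
    value = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return status, value
```

Flipping any one bit of encode(n) decodes back to n with status "corrected", and flipping any two gives "uncorrectable". Three or more flips can be silently miscorrected, which is one reason to keep the independent filesystem checksums as well.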

You’ll notice I can’t achieve these things with white-box hardware and mainstream commercial software.  And that ongoing work is needed.  And that I’m behind on a couple of aspects.

I won’t say my digital photos are perfectly protected; I know they’re not. But I do think that I’m less likely to lose a year of my digital photos than I am of my film photos. A flood or fire in my house would be quite likely to do all the film in, while my digital photos would be fine (due to off-site backups).  (So would the scans I’ve made of film photos.)

Furthermore, I realized recently that I’ve been storing my film in plastic tubs, nearly air-tight, without any silica gel in there. I’m working to fix this, but that kind of oversight can be serious in a more humid climate. (If I lived in a more humid climate, I might have had enough bad experiences in the past that I wouldn’t make that kind of mistake!)

Anyway—the real lesson here is “archiving is hard”. Archiving with a multi-century lifespan in mind is especially hard.

Film, especially B&W film, tolerates benign neglect much more gracefully than digital data—it degrades slowly, and can often be restored to near-perfect condition (with considerable effort) after decades in an attic or garage, say.

Most people storing film are not doing it terribly “archivally”, though. Almost nobody is using temperature-controlled cold storage.  Most people store negatives in the materials they came back from the lab in, which includes plastics of uncertain quality and paper that’s almost certainly acidic.

Digital archives are rather ‘brittle’—they tend to seem perfect for a while, and then suddenly shatter when the error correction mechanism reaches its limits. But through copying and physical separation of copies, they can survive disasters that would totally destroy a film archive.

A digital archive requires constant attention; but it can store stuff perfectly for as long as it gets that attention. My digital archive gets that attention from me, and is unlikely to outlast me by as much as 50 years (though quite possibly individual pictures will live on online for a long time, like the Heinlein photo).

Planning a New Computer

I’m starting to feel the urge to upgrade my computer at home (partly because the one they gave me at work is really pretty good). And of course I’m all out-of-date on what’s good.

I don’t play 3d-graphics-intensive games on this computer, and do very little video editing. The challenging application for this computer is photographic work, with Photo Mechanic, Bibble Pro, Photoshop, and Thumbs Plus being the main consumers of cycles. I do multi-task a lot, running Thunderbird and Firefox pretty much full-time, and often playing music through WinAmp while working on photos.

Photo Mechanic, Thumbs Plus, and Photoshop make Linux out of the question for this box (Bibble runs excellently on Linux, but there are no good alternatives for at least two of the other three). Dual-boot is not satisfactory. And besides, I’ve got a Solaris box with lots of spare cycles up on the same shelf (it’s primarily the file server), so stuff I preferred to run on Unix I could just run there. So virtualization doesn’t really seem to be a big win for this new box. One thing it might buy me, though, is living within the 32-bit Windows memory restrictions. If the Windows virtual slice *only* ran the photo programs, not any of the multi-tasking, having only 3.3GB of memory available might be tolerable for longer, which might keep me from having to run Vista (64 Ultimate). (I’m not doing stitching to produce gigapixel images, or any of the professional-level work that makes 500MB files normal; my PSD files rarely exceed 150MB.)

I have the general impression that memory throughput is the biggest single thing that can help Photoshop. Anybody have any clear information on that, or pointers to good benchmarks that would confirm or contradict this hypothesis?

Some Photoshop functions, and Photo Mechanic, and Bibble, and apparently the new version of Thumbs Plus, are pretty heavily multi-threaded (making use of more than two cores, I mean). Plus the multi-tasking. So I’m inclined to look at a quad-core rather than a slightly faster dual-core chip. It may lose a bit on some slow Photoshop filters, but it will win in the things I most commonly run 500 photos through.

I’m keeping all my photos (and the rest of my Windows Documents) on the file server, not on a local disk at all. I need a local disk for the basic software installation, some Windows pagefile (I expect I’ll start with 8GB of RAM, so it probably won’t page all *that* much), and temp files (and the Photoshop pagefile).

I started out thinking of maybe getting two of the 10kRPM SATA drives (small ones, 80GB or whatever the smallest size is), one to be the software install disk and the other to hold temp files and pagefiles. Then it occurred to me that small solid-state disks (flash memory packaged with an [S]ATA interface) were in that price range too, and they’re faster, or should be. I’m looking for clearer specs on this.

Also, the argument for splitting those uses across two disks is mostly access arm motion, and SSDs don’t have access arms. They still have bandwidth limitations, but they’re pretty good. So one 64GB SSD would probably perform better than *two* 10K magnetic drives, and would cost considerably less.

Then the really creeping wacko idea hit me: Two (or more) SSDs, striped (RAID0). Double the read and write bandwidth available for any large transfer, and no access arms so the competing uses would compete *only* for bandwidth, not for access arm position. I think perhaps this is too creeping wacko, but I do find it amusing to think about. Maybe I’ll benchmark it. (The lowered reliability is not a problem since it’s just installed software, doesn’t change that often, and the underlying drive reliability is much higher than with spinning platters.)
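
To make the striping trade-off concrete, here is a minimal sketch, assuming a simple round-robin RAID0 layout and independent drive failures. The function names and the stripe_blocks parameter are hypothetical, and real controllers differ in the details.

```python
def stripe_location(logical_block, num_disks, stripe_blocks=1):
    """Map a logical block number to (disk index, block number on that disk)."""
    stripe_index, offset = divmod(logical_block, stripe_blocks)
    disk = stripe_index % num_disks  # stripes rotate across the disks
    block_on_disk = (stripe_index // num_disks) * stripe_blocks + offset
    return disk, block_on_disk


def array_failure_probability(per_disk_failure, num_disks):
    """Chance of losing the array if any one disk failing loses everything."""
    return 1.0 - (1.0 - per_disk_failure) ** num_disks
```

With two disks, consecutive stripes alternate between the drives, which is where the doubled bandwidth on large transfers comes from. The cost is that the array fails if either drive does: for a made-up per-drive failure chance of 2% over some period, array_failure_probability(0.02, 2) comes to just under 4%.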

My impression is that Intel is the one a bit ahead at the moment in the chip race.

The graphics card doesn’t matter much any more; a third-tier mainstream card will do fine for Photoshop. I don’t have, and won’t be getting this decade, a full-Adobe-RGB 16-bit monitor or anything like that. I do need dual DVI out (well, only one this instant, but the old 17″ VGA tube is getting a bit long in the tooth).

Prices seem weird. ZipZoomfly will sell me 8GB of DDR2 800 memory for half what General Nanosystems will. Best Buy wants three times more for a DVD drive than General Nano does (I guess that’s not a total surprise). Motherboard availability is spotty overall. Hard drive prices, too. Best Buy has a dual-DVI graphics board a lot cheaper than General Nano.

I’m having trouble finding any motherboard reviews that really address anything I care about for current products. I’m largely looking at the Asus P5K family, and some at the P5Q. I do want IEEE-1394 (though in fact all my external devices are dual-mode), I don’t want WiFi. I want at least four RAM slots, but I don’t want to pay for a server-level motherboard, so I’m unlikely to get more. (Yeah, room for 4 dual-core fast chips would be attractive, and room for 64GB of RAM, but it’s also kinda expensive.)

The case doesn’t have to have room for lots of disk drives, finally, so I’m looking at a mini-tower for the first time in ages, and maybe even at one of the miniature (“shuttle”-type) cases. I do still care a lot about cooling, and a little about noise (there’s the file server with 8 spinning disks right next to it, and a window air-conditioner).

The current system is an Athlon XP “Thoroughbred” (family 6(7), model 8(8)). I think it was a “2800+” model. Single core, single thread. Currently with 2GB of PC3200 DDR RAM. So, really, it’s not going to be hard for the new system to seem considerably better.

Still in research mode. Input welcome.