Archiving Minneapa

Or, for those not from this part of science-fiction fandom, just think of it as some rather challenging scanning and OCR issues. (Read about APAs.)

I sorted through three boxes from upstairs and got this:

The first three banker’s boxes, plus extras
Back cover on top of the stack of extras

And there are four more boxes up there waiting.

Now, there is probably a lot of duplication (mine plus Pamela’s copies).

Minn-StF owns an Epson duplex auto-feed scanner, which is pretty much tailor-made for this job (“duplex” means it scans both sides of the sheet in one pass). And it’s amazing how good sheet-fed scanners have gotten at handling individual sheets of paper using just a few plastic rollers. Still, when the paper is 40 or so years old and the stack mixes many different kinds of paper, it can be a challenge. (Most Minneapas included at least offset paper, mimeo paper, and ditto paper, and often twilltone; and the covers are sometimes card stock.) Luckily, restarting after a jam is easy, so long as you didn’t let the scanner automatically reset the page numbering to 1.

I made some test scans at 300 dpi and 400 dpi, and tried saving them as JPEG and TIFF files. The scanner was nearly twice as fast at 300 dpi as at 400 dpi, so I left the resolution there. I was pleased, though a bit surprised, to find essentially no visible JPEG artifacts (at 80% quality) on all this text. You’re seeing a lot of the paper texture at full resolution, and it’s enough to satisfy the OCR software; and where the JPEG file is 2 MB or less, the TIFF is about 34 MB. So I stored the images as JPEGs. (Saturday’s session produced nearly 5000 pages, it looks like; that was one of those three banker’s boxes.)
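The size difference is roughly what back-of-envelope arithmetic predicts. A minimal sketch, using an assumed letter-size page at 300 dpi in 24-bit color (the real TIFF will differ somewhat depending on scan area, driver overhead, and any compression applied):

```python
# Rough arithmetic for one letter-size page scanned at 300 dpi, 24-bit RGB.
# These are illustrative assumptions, not measurements from the scanner.

width_px = int(8.5 * 300)    # 2550 pixels across
height_px = 11 * 300         # 3300 pixels down
bytes_per_px = 3             # 24-bit color

raw_bytes = width_px * height_px * bytes_per_px
raw_mb = raw_bytes / 1_000_000
print(f"uncompressed image: ~{raw_mb:.0f} MB")   # ~25 MB before TIFF overhead

jpeg_mb = 2                  # observed JPEG size from the post
print(f"JPEG is roughly {raw_mb / jpeg_mb:.0f}x smaller than raw")
```

So a 2 MB JPEG is better than a tenfold saving over the raw pixels, and the gap to the observed 34 MB TIFF is larger still.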

Many pages show some browning around the edges. It’s interesting how much variation there is among the different kinds of paper people used.

The print density and clarity varied quite a lot to begin with, as I remember. It certainly varies a lot today.  Here are some examples at 100% size.

OCR of this sort of material ranges from chancy to hopeless. The volume involved is such that no real quality-control or proofreading pass on the OCR is possible, either. However, a clever PDF feature lets us produce “PDF/A” files which, when opened, show you the image of the page, but let the computer search the OCR output behind it (including the image of every page does make the files big, though). Even when the OCR is bad, it catches words correctly a lot of the time, so searching for a name or a topic keyword will find you many of the references. And important words in a discussion tend to be repeated, so you’ll be brought to most of the pages the discussion occurs on. (My OCR work on this is being done with an old version of ABBYY FineReader.)
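The point about repeated words can be made precise with a toy model. Assuming (hypothetically) that the OCR recognizes any single occurrence of a word with probability p, and occurrences fail independently, a page where the word appears k times will match a search with probability 1 − (1 − p)^k:

```python
# Toy model: probability that a search finds a page, given that OCR gets
# any one occurrence of the search word right with probability p, and the
# word appears k times on the page. The value of p is made up for
# illustration; real OCR errors are not independent like this.

def p_page_found(p: float, k: int) -> float:
    return 1 - (1 - p) ** k

for k in (1, 3, 5):
    print(f"k={k}: {p_page_found(0.7, k):.3f}")
```

Even at a mediocre p = 0.7, three occurrences on a page push the chance of finding it above 97%, which is why keyword search stays useful despite poor OCR.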

There are legal and privacy issues that make it unlikely that the collection of scans will be posted publicly. They may well be available to people who were in Minneapa. Scanning them gives us backup copies and protection against further deterioration, and some convenience for some people with access to the collection.

Anyway…17 down, 383 to go!

Starting From a Book

(Continuing the story of self-publishing a new edition of Pamela’s The Dubious Hills from here.)

This book was written on a computer to begin with, so ideally we’d still have access to the files with the canonical text. For a book first published more than 20 years ago, however, that frequently doesn’t work out.

There are at least three ways this doesn’t work out:

The files may simply be lost.

The files may not be readable with any software we currently have (this can happen even if you’re using the same brand-name word processing program).

And finally, there may never have been files reflecting the final state of the book.  In fact, this is a near-certainty for anything first published in 1994, because at that time the copy-editing process depended entirely on marks on paper.  So, unless the author bothered to update the files to reflect changes made at that stage, there never were files with what we really want in them.

Hence the title of this article; we’re going to recover the text for our edition of the book from a printed copy.

There are at least two ways to approach this, but I’ve only ever used one, because it’s so clearly better. We could simply retype the printed copy into some word processor (or dictate it into a voice-typing package). Instead, I scan the pages and then run OCR software on them. I’ve done this more than half a dozen times over the years, and it’s surprisingly easy (compared to retyping, I mean; it still takes a number of hours).

The particular way I did this one bothers some people, I know; it involves destroying the physical copy of the book I scan, to save some time and effort (though not as spectacularly as the book-shredding digitization in Vernor Vinge’s Rainbows End). I’m going to show pictures below the cut; you have been warned!

Yes, I myself do suffer from the delusion that physical books are nearly sacred objects. However, books issued in the modern era in many thousands of copies are very rarely in really short supply; I’m willing to sacrifice one copy of such a book in the service of producing a good e-text, especially since that will help make the book available to a modern generation of readers. If the book were rare or valuable, I’d handle things differently: I’d take the extra time and effort to get a scan without destroying the book, and then to correct that scan (a good chunk of the destructive method’s benefit is that a better scan leads to better OCR, which leads to fewer hours of correction).

(There are fancy scanners that will scan a book, turning the pages themselves, without even breaking the spine. However, we do not own one. We’re doing this on our own, with outlay of time but only the absolute minimum outlay of money, since we have more time than money at the moment.)

Okay; that cut with the pictures below it coming up now….
