Ever wonder how Google makes searchable text available from the page images of all those books? The answer is Ocropus, the open-source optical character recognition (OCR) software whose development Google has funded. Put together with book-scanning hardware, Ocropus is a key part of why ordinary, nontechnical people can do full-text search on the Google Books collection. So it’s good, it’s freely modifiable and redistributable, and it can be scripted and automated in ways that commercial OCR software (at least the kind students can afford) can’t.
I have a theory that Ocropus could be useful for some of my research images, namely early 20th-century typescript correspondence and print works. I don’t need perfectly OCRed images, though high accuracy would be great; mostly, what I need is a rough-and-ready way to give Spotlight something to search. My early experiments have been promising, but the recognition quality needs to be a little better, and the learning curve on how to make that happen is steep. (I’d welcome suggestions; the best guide I’ve found is the IUPR course on Ocropus, but it’s targeted at CS researchers and is very slow going for me.) The examples in the extras directory show how to train Ocropus on particular text corpora, but they have no comments at all. (That may change if I decide to comment them as part of understanding how they work.)
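To make "rough-and-ready" a bit more concrete, here is a minimal sketch of the kind of batch script I have in mind: it runs an OCR command over every page image in a folder and saves the recognized text in a plain-text file beside each image, which Spotlight can then index. Everything here is an assumption for illustration. The function name `ocr_folder` is made up, and the OCR command is passed in as a parameter rather than hard-coded, because I don't want to guess at the exact Ocropus invocation; the sketch only assumes the command takes an image path as its last argument and prints the recognized text to standard output.

```python
import subprocess
import sys
from pathlib import Path

# File extensions treated as page images (an assumption; adjust as needed).
IMAGE_SUFFIXES = {".png", ".tif", ".tiff", ".jpg", ".jpeg"}


def ocr_folder(folder, ocr_command):
    """Run an OCR command on each image in `folder`, saving the text beside it.

    `ocr_command` is a list of command-line arguments; the image path is
    appended to it. The command is assumed to print the recognized text to
    standard output. Returns the list of text files written.
    """
    written = []
    # sorted() builds the full listing first, so the .txt files created
    # below do not interfere with the loop.
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        text_file = image.with_suffix(".txt")
        if text_file.exists():
            continue  # already recognized; skipping keeps reruns cheap
        result = subprocess.run(
            list(ocr_command) + [str(image)],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            # Report the failure and move on to the next page.
            print(f"OCR failed for {image.name}: {result.stderr.strip()}",
                  file=sys.stderr)
            continue
        text_file.write_text(result.stdout, encoding="utf-8")
        written.append(text_file)
    return written
```

Calling `ocr_folder("scans", [...])` with whatever command line actually drives the OCR engine would leave a `.txt` file next to each scan. Keeping the text in a separate file means the original images are never touched, and a bad recognition pass can be undone just by deleting the text files and running it again.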
I’d have started figuring out that part of Ocropus much sooner if I hadn’t spent big chunks of several days trying to install all its dependencies on OS X so that it would compile. There’s a pretty good compile guide, but the time I spent on that, which hasn’t yet resulted in a functional installation, is time I’ll never get back. There’s another option: someone put together a package for OS X called TakOCR, which bundles Ocropus with the libraries it requires. Unfortunately, it doesn’t work on Snow Leopard, and the developer isn’t maintaining TakOCR any longer because he doesn’t have Snow Leopard himself.
Ocropus is meant to be cross-platform for Unix-like OSes, but it’s developed on Ubuntu Linux. Fortunately, I have an Ubuntu machine at home (an aging laptop given to me by a friend specifically as a development machine), and my installation of Ocropus on it Just Works. (I followed the development team’s installation transcript, cut and pasted into a shell script.)
I’m writing this up mostly as a cautionary tale: if you’re on OS X and are interested in using Ocropus, be prepared to put some time into compiling it (or, better yet, into improving the installation process).