History Research Hacks

Exploring digital tools & methods

Command-line OCR of JPGs to PDFs?


The more I use DEVONthink to organize my research, the more my little MacBook Pro (2.4GHz, 8GB RAM) gets bogged down running OCR jobs on my archival images.

I’m starting to wish for a more powerful OCR option that I can run on one of my university’s servers. (“Computer, here’s a folder hierarchy with 5GB of JPGs. Run a background job that turns them all into OCRed PDFs, and email me when you’re done.”) That isn’t a task my university’s servers routinely do, so I need to figure out a set of tools to accomplish it before I go asking a sysadmin for disk space and processor time.[1]
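To make that wish concrete, here is a minimal Python sketch of the batch job, offered as an illustration rather than a tested recommendation. It walks a folder hierarchy, finds every JPG that doesn't already have a PDF next to it, and runs one OCR command per image. The names `find_jobs`, `run_jobs`, and `OCR_COMMAND` are hypothetical. The default command assumes a Tesseract build with PDF output support is installed, which is an assumption of this sketch and not one of the tools discussed in this post; any command-line OCR tool that takes an image and an output name could be swapped in. The "email me when you're done" step is left out.

```python
import subprocess
import sys
from pathlib import Path

# Assumption: Tesseract with PDF output is on the PATH. Replace this list with
# whatever OCR command is actually available on the server.
OCR_COMMAND = ["tesseract", "{image}", "{outbase}", "pdf"]


def find_jobs(root):
    """Return (image, pdf) path pairs for every JPG under root that lacks a PDF."""
    jobs = []
    for image in sorted(Path(root).rglob("*")):
        if image.is_file() and image.suffix.lower() in {".jpg", ".jpeg"}:
            pdf = image.with_suffix(".pdf")
            if not pdf.exists():  # skip images finished by an earlier run
                jobs.append((image, pdf))
    return jobs


def run_jobs(jobs, command=OCR_COMMAND):
    """Run the OCR command once per image and return the images that failed."""
    failed = []
    for image, pdf in jobs:
        # The command receives the output name without ".pdf"; with the default
        # command, Tesseract appends the extension itself.
        outbase = str(pdf.with_suffix(""))
        argv = [part.format(image=image, outbase=outbase) for part in command]
        result = subprocess.run(argv, capture_output=True)
        if result.returncode != 0:
            failed.append(image)
    return failed


if __name__ == "__main__" and len(sys.argv) > 1:
    jobs = find_jobs(sys.argv[1])
    failed = run_jobs(jobs)
    print(f"{len(jobs) - len(failed)} converted, {len(failed)} failed")
```

Because finished images are skipped, the script can be stopped and restarted without redoing work, which matters for a 5GB job that might run for hours. If it were saved as `ocr_batch.py` (a made-up filename), something like `nohup python3 ocr_batch.py ~/archive-images &` would keep it running in the background after logout.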

There are commercial options for this sort of thing, but I don’t have an industrial-scale budget. DEVONthink uses ABBYY FineReader’s OCR libraries under the hood. If ABBYY would cheaply license a command-line version that I could install in my home directory on a university server, I’d consider doing that.

In the open-source world, OCRopus does most of this, but it takes images as input and produces HTML output, not OCRed PDFs. (I guess that would work, but many of my images don’t OCR cleanly enough for me to use HTML output for research without also looking at the source image. I use OCR as a rough-and-ready indexing strategy.) pyocrhelper will take PDFs as input, but it converts them back to images first, which wastes processor time.

Are there other open-source OCR packages out there that will do what I’m looking for? What should I be looking into? (WatchOCR seems like one option, but I don’t have time for dead ends and wild goose chases. I just want to know what works.)

  1. Realistically, I won’t get this in time for my dissertation-related needs, so I’ll probably just run big OCR jobs overnight on my MacBook. Consider this post a long-term question for my research.



History Research Hacks, © 2010 Shane Landrum. Some Rights Reserved.
