History Research Hacks

Exploring digital tools & methods

Command-line OCR of JPGs to PDFs?

The more I use DEVONThink to organize my research, the more my little MacBook Pro (2.4GHz, 8GB RAM) gets bogged down running OCR jobs on my archival images.

I’m starting to wish for a more powerful OCR option that I can run on one of my university’s servers. (“Computer, here’s a folder hierarchy with 5GB of JPGs. Run a background job that turns them all into OCRed PDFs, and email me when you’re done.”) That isn’t a task my university’s servers routinely do, so I need to figure out a set of tools to accomplish it before I go asking a sysadmin for disk space and processor time. [1]
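A minimal sketch of the batch job described above, assuming Python 3 is available on the server. The function names (`plan_jobs`, `run_jobs`, `tesseract_command`) are hypothetical, and the OCR step is left pluggable because the choice of engine is exactly the open question here. The one concrete command shown assumes the Tesseract command-line tool is installed and can write PDF output; that is an illustration, not a tool this post has evaluated.

```python
import subprocess
from pathlib import Path

JPG_SUFFIXES = {".jpg", ".jpeg"}


def plan_jobs(root, out_root):
    """Pair every JPG under root with a PDF path that mirrors the folder hierarchy."""
    root, out_root = Path(root), Path(out_root)
    jobs = []
    for image in sorted(root.rglob("*")):
        if image.is_file() and image.suffix.lower() in JPG_SUFFIXES:
            pdf = (out_root / image.relative_to(root)).with_suffix(".pdf")
            jobs.append((image, pdf))
    return jobs


def run_jobs(jobs, ocr_command):
    """Run the OCR command for each image that has no PDF yet.

    ocr_command(image, pdf) must return an argument list for subprocess.run.
    Returns the list of images whose OCR command exited with an error.
    """
    failures = []
    for image, pdf in jobs:
        if pdf.exists():
            continue  # already done, so an interrupted run can be resumed
        pdf.parent.mkdir(parents=True, exist_ok=True)
        result = subprocess.run(ocr_command(image, pdf))
        if result.returncode != 0:
            failures.append(image)
    return failures


def tesseract_command(image, pdf):
    # Assumption: Tesseract is installed. Invoked as
    # `tesseract IMAGE OUTPUTBASE pdf`, it writes OUTPUTBASE.pdf.
    return ["tesseract", str(image), str(pdf.with_suffix("")), "pdf"]
```

Calling `run_jobs(plan_jobs("scans", "ocr_pdfs"), tesseract_command)` would then work through a hypothetical `scans` folder. The "email me when you’re done" step is left out, since how a server sends mail varies from one site to the next.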

There are commercial options for this sort of thing, but I don’t have an industrial-scale budget. DEVONThink uses ABBYY FineReader’s OCR libraries under the hood. If ABBYY would cheaply license a command-line version that I could install in my home directory on a university server, I’d consider doing that.

In the open-source world, OCRopus does most of this, but it takes images as input and produces HTML output, not OCRed PDFs. (I mean, I guess that would work, but many of my images don’t OCR cleanly enough for me to be able to use HTML output for research without also looking at the source image. I use OCR as a rough-and-ready indexing strategy.) pyocrhelper will take PDFs as input, but it converts them back to images first, which wastes processor time.

Are there other open-source OCR packages out there that will do what I’m looking for? What should I be looking into? (WatchOCR seems like one option, but I don’t have time for dead-ends and wild goose chases. I just want to know what works.)

  1. Realistically, I won’t get this in time for my dissertation-related needs, so I’ll probably just run big OCR jobs overnight on my MacBook. Consider this post a long-term question for my research.

Mapping the spread of birth registration with Protovis

Yesterday, I was reading Lauren Klein’s recently posted talk on her work about Thomas Jefferson and his enslaved cook, James Hemings, when I found a tool I wish I’d known about sooner. Lauren made her images with Protovis, a JavaScript visualization toolkit produced by the Stanford Visualization Group.

Protovis can do all kinds of things, but I was particularly taken with the choropleth map example, which bears a striking resemblance to a problem I’ve been wanting to work up a visualization for. It took a while to figure out how to adapt their example to the data I’ve got, but here’s what I ended up with.

This image is for 1921. Gray states didn’t meet the federal standards for birth registration or death registration; yellow states registered deaths well; purple states registered births and deaths well. You can click on the image for an interactive version that’ll let you set what year to examine.

US death & birth registration areas, 1921

Being able to play with this has helped me think about the periodization of my story in ways that the source data (rendered in a table on page 59 of this PDF) just didn’t allow. (Note the lack of good coverage in heavily rural states, especially in the Southeast and Southwest. These states developed better birth registration by the late 1920s as a result of federal funding from the Sheppard-Towner Act.)

Want to do this yourself?

For those of you who have a similar research problem that might be able to use some visualization, I’ve posted the code on GitHub. You don’t need a web server to run it; all you’ll need is to modify the data file with your own state-level data. If you change the names of any variables in that file, you’ll need to search-and-replace them in the main page’s scripts as well.
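For anyone adapting the map to their own data, the color logic is the core of it, and it can be sketched independently of Protovis. This is a minimal Python sketch, not the code from the GitHub repository. The function name, the shape of the data, and the placeholder states and years are all hypothetical. It also assumes that a state registering births but not deaths, a case the legend above does not cover, is drawn gray.

```python
def registration_color(death_year, birth_year, year):
    """Return the map color for one state in a given year.

    death_year and birth_year are the years the state began meeting the
    federal standard for each kind of registration, or None if it never did.
    """
    registers_deaths = death_year is not None and death_year <= year
    registers_births = birth_year is not None and birth_year <= year
    if registers_deaths and registers_births:
        return "purple"  # registers births and deaths well
    if registers_deaths:
        return "yellow"  # registers deaths well
    return "gray"        # assumption: everything else is drawn gray


# Placeholder data for illustration only; these are not real states or years.
states = {
    "State A": (1900, 1915),
    "State B": (1910, None),
    "State C": (None, None),
}

for name, (death_year, birth_year) in states.items():
    print(name, registration_color(death_year, birth_year, 1921))
```

Changing the year passed to the function is the equivalent of moving the year control on the interactive map.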

A brief note on GeoCommons

Yesterday, Digital Humanities Answers helped me find an answer to a problem I’ve been wondering about for a long time: how to map some data easily, without having to know a lot about GIS.


OCRopus on OS X: frustrations

A promising bit of OCR software seems like more trouble than it’s worth, at least on OS X.


History Research Hacks, © 2010 Shane Landrum. Some Rights Reserved.
