History Research Hacks

Command-line OCR of JPGs to PDFs?

Shane Landrum — Fri, 14 Oct 2011 15:23:19 +0000

The more I use DEVONThink to organize my research, the more my little MacBook Pro (2.4 GHz, 8 GB RAM) gets bogged down running OCR jobs on my archival images.

I’m starting to wish for a more powerful OCR option that I can run on one of my university’s servers. (“Computer, here’s a folder hierarchy with 5GB of JPGs. Run a background job that turns them all into OCRed PDFs, and email me when you’re done.”) That isn’t a task my university’s servers routinely do, so I need to figure out a set of tools to accomplish it before I go asking a sysadmin for disk space and processor time.¹

There are commercial options for this sort of thing, but I don’t have an industrial-scale budget. DEVONThink uses ABBYY FineReader’s OCR libraries under the hood. If ABBYY would cheaply license a command-line version that I could install in my home directory on a university server, I’d consider doing that.

In the open-source world, OCRopus does most of this, but it takes images as input and produces HTML output, not OCRed PDFs. (I mean, I guess that would work, but many of my images don’t OCR cleanly enough for me to be able to use HTML output for research without also looking at the source image. I use OCR as a rough-and-ready indexing strategy.) pyocrhelper will take PDFs as input, but it converts them back to images first, which is a waste of processor time.

Are there other open-source OCR packages out there that will do what I’m looking for? What should I be looking into? (WatchOCR seems like one option, but I don’t have time for dead-ends and wild goose chases. I just want to know what works.)
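The batch job described above ("here's a folder hierarchy of JPGs, turn them all into OCRed PDFs") can be sketched in a few lines of Python. This is only a rough outline under a loud assumption: that the OCR engine is a command-line tool invoked as "tool input-image output-base pdf". The Tesseract CLI works this way, but this post does not mention it, so treat the command as a placeholder for whatever engine turns out to work. The function names plan_jobs and run_jobs are made up for illustration.

```python
import subprocess
import sys
from pathlib import Path

# ASSUMPTION: the OCR engine is a CLI invoked as
#   <tool> <input image> <output base name> pdf
# (the Tesseract CLI follows this pattern; swap in another tool as needed).
OCR_COMMAND = "tesseract"


def plan_jobs(root):
    """Return (image, output_base) pairs for every JPG under root
    that does not already have a PDF sitting next to it."""
    jobs = []
    for image in sorted(Path(root).rglob("*")):
        if image.is_file() and image.suffix.lower() in {".jpg", ".jpeg"}:
            if not image.with_suffix(".pdf").exists():
                # The tool is expected to append ".pdf" to the base name.
                jobs.append((image, image.with_suffix("")))
    return jobs


def run_jobs(jobs, command=OCR_COMMAND):
    """Run the OCR tool once per image and return the images that failed."""
    failed = []
    for image, output_base in jobs:
        result = subprocess.run([command, str(image), str(output_base), "pdf"])
        if result.returncode != 0:
            failed.append(image)
    return failed


if __name__ == "__main__":
    failures = run_jobs(plan_jobs(sys.argv[1]))
    print(f"{len(failures)} image(s) failed")
```

Skipping images that already have a PDF means an interrupted overnight run can simply be restarted. Sending the "email me when you're done" message is left out, since how to do that depends on the server.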

¹ Realistically, I won’t get this in time for my dissertation-related needs, so I’ll probably just run big OCR jobs overnight on my MacBook. Consider this post a long-term question for my research.

Mapping the spread of birth registration with Protovis

Shane Landrum — Sat, 19 Mar 2011 12:34:23 +0000

Yesterday, I was reading Lauren Klein’s recently-posted talk on her work about Thomas Jefferson and his enslaved cook, James Hemings, when I found a tool I wish I’d known about sooner. Lauren made her images with Protovis, a JavaScript visualization toolkit produced by the Stanford Visualization Group.

Protovis can do all kinds of things, but I was particularly taken with the choropleth map example, which bears a striking resemblance to a problem I’ve been wanting to work up a visualization for. It took a while to figure out how to adapt their example to the data I’ve got, but here’s what I ended up with.

This image is for 1921. Gray states didn’t meet the federal standards for birth registration or death registration; yellow states registered deaths well; purple states registered births and deaths well. You can click on the image for an interactive version that’ll let you set what year to examine.

The ability to play with this has helped me think about the periodization of my story in ways that the source data (rendered in a table on page 59 of this PDF) just didn’t. (Note the lack of good coverage in heavily rural states, especially the southeast and southwest. These states developed better birth registration by the late 1920s as a result of federal funding from the Sheppard-Towner Act.)

Want to do this yourself?

For those of you who have a similar research problem that might be able to use some visualization, I’ve posted the code on GitHub. You don’t need a web server to run it; all you’ll need is to modify the data file with your own state-level data. If you change the names of any variables in that file, you’ll need to search-and-replace them in the main page’s scripts as well.
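The coloring rule behind the map is simple enough to sketch separately from the mapping code. The Python below is not the code in the GitHub repository (which is JavaScript for Protovis). It is a minimal illustration that assumes the data records, for each state, the first year it met each federal standard. The field names deaths_from and births_from are hypothetical, and the example states and years are invented placeholders, not real registration dates.

```python
# Colors follow the legend described in the post.
GRAY, YELLOW, PURPLE = "gray", "yellow", "purple"


def status_color(year, deaths_from, births_from):
    """Pick a map color for one state in one year.

    deaths_from / births_from: first year the state met the federal
    standard for death / birth registration, or None if it never did.
    """
    deaths_ok = deaths_from is not None and year >= deaths_from
    births_ok = births_from is not None and year >= births_from
    if births_ok and deaths_ok:
        return PURPLE
    if deaths_ok:
        return YELLOW
    # ASSUMPTION: the legend has no color for "births only", so that
    # case falls through to gray along with "neither".
    return GRAY


# PLACEHOLDER data: invented names and years, for illustration only.
EXAMPLE_STATES = {
    "State A": {"deaths_from": 1900, "births_from": 1915},
    "State B": {"deaths_from": 1919, "births_from": None},
    "State C": {"deaths_from": None, "births_from": None},
}


def colors_for_year(year, states=EXAMPLE_STATES):
    """Return a {state name: color} mapping for the chosen year."""
    return {
        name: status_color(year, s["deaths_from"], s["births_from"])
        for name, s in states.items()
    }


if __name__ == "__main__":
    print(colors_for_year(1921))
```

Computing the color from a year each time, rather than storing one color per state per year, is what makes a "set the year" control cheap to add.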

A brief note on GeoCommons

Shane Landrum — Wed, 16 Mar 2011 15:49:25 +0000

Yesterday, Digital Humanities Answers helped me find an answer to a problem I’ve been wondering about for a long time: how to map some data easily, without having to know a lot about GIS.

DHAnswers, a project of the Association for Computers and the Humanities and ProfHacker, is a much-better-than-average implementation of the message-board concept, with really smart people who answer questions there. When I saw Bethany Nowviskie’s reference to GeoCommons, I decided to play with it. (I’d just listened to an older podcast from the Scholars’ Lab, Andrew Turner’s November 2011 “Neogeography: from Tower to Town Hall.” Andrew is the CTO of GeoCommons, and that talk’s a good introduction to mapping for non-experts, even if the sound quality’s not great.)

In any case: if you’ve ever wondered how to map some data, and especially if you already have a spreadsheet of it with state names, other place names, or latitude/longitude columns, go play with GeoCommons. Once I clean up my maps a little, maybe I’ll post them here.

I’m finding some annoyances with GeoCommons, largely around how it handles date-formatted data, but overall it’s more useful than frustrating.

Ocropus on OS X: frustrations

Shane Landrum — Wed, 23 Jun 2010 02:43:09 +0000

Ever wonder how Google makes searchable text available from the page images of all those books? The answer is Ocropus, the open-source optical character recognition (OCR) software whose development Google has funded. Put together with book-scanning hardware, Ocropus is a key part of why ordinary, nontechnical people can do full-text search on the Google Books collection. So it’s good, it’s freely modifiable and redistributable, and it can be scripted and automated in ways that commercial OCR software priced for students can’t.

I have a theory that Ocropus could be useful for some of my research images, namely early 20th century typescript correspondence and print works. I don’t need perfectly OCRed images, though high accuracy would be great; mostly, what I need is a rough-and-ready way to give Spotlight something to search. My early experiments have been promising, but the recognition quality needs to be a little better, and the learning curve on how to make that happen is steep. (I’d welcome suggestions; the best guide I’ve found is the IUPR course on Ocropus, but it’s targeted at CS researchers and is very slow going for me. The examples in the extras directory show how to train Ocropus on particular text corpora, but they have no comments at all. That may change if I decide to comment them as part of understanding how they work.)

I’d have been working on figuring out that part of Ocropus much sooner if I hadn’t spent big chunks of several days trying to install all its dependencies on OS X so that it’ll compile. There’s a pretty good compile guide, but the time I spent on that— which hasn’t yet resulted in a functional installation— is time I’ll never get back. There’s another option; someone put together a package for OS X called TakOCR, which is Ocropus together with the libraries it requires. Unfortunately, it doesn’t work on Snow Leopard, and the developer isn’t maintaining TakOCR any longer because he doesn’t have Snow Leopard himself.

Ocropus is meant to be cross-platform for Unix-like OSes, but it’s developed on Ubuntu Linux. Fortunately, I have an Ubuntu machine at home— an aging laptop given to me by a friend specifically as a development machine— and my installation of Ocropus on it Just Works. (I followed the developer team’s installation transcript, cut-and-pasted into a shell script.)

I’m writing this up mostly as a cautionary tale: if you’re on OS X and are interested in using Ocropus, be prepared to put some time into compiling it (or, better yet, into improving the installation process).

Naming archival reference photos

Shane Landrum — Sun, 20 Jun 2010 17:39:14 +0000

Now that summer is well underway, many of us are deeply ensconced in archival research. Some of us, particularly the newly-ABD, are looking at our digital cameras and carts full of Hollinger boxes and wondering how to make sense of it all.

I’ve written before about hacking better tools for the short research trip and taming the all-digital research collection, but here’s a more specific tip:

When you’re shooting reference-quality photos, think about your file names like accession numbers. When you take notes, you’re going to want to refer to individual images, and it’s easier to take notes from photos if you know that every image you ever shoot has a unique file name.

On the Mac, I use a simple file renaming tool, Hazel, to make this easier. Hazel watches whatever directories you tell it to, then executes a set of rules on the files it finds. It’s free for a 14-day trial and US$21.95 to register. Here’s what it looks like:

So I have a folder called incoming images, and I can tell Hazel to pick up all JPG files whose names begin with DSCN (which is how my camera names files) and rename them according to a pattern I choose. Here’s the rule-setting dialog:

For archival reference images, it’s important that the pattern be sortable and that it be based on the file creation time of the image as it comes from your camera. That way, when you sort the files by name, you see them in the order that you shot them.

When I started doing this several years ago, I wanted something human-readable, so I chose to use a naming convention that contained a written-out month. By this rule, DSCN0316.JPG became 08Jun2008-113648 DSCN0316.JPG, an image that I shot on June 8, 2008, at 11:36:48. That way, however I moved the file (to put it into folders by collection), it would retain its sortability.
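For readers without Hazel, the same renaming rule can be approximated in Python. This is a sketch, not the Hazel ruleset. It assumes the file's modification time is an acceptable stand-in for the moment the photo was shot (true for files copied straight off a camera, but not guaranteed), and the function names are invented for illustration.

```python
from datetime import datetime
from pathlib import Path

# Spelled out explicitly so the result does not depend on the system locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def timestamp_name(original_name, shot_at):
    """Build a name like '08Jun2008-113648 DSCN0316.JPG'.

    Every numeric field is zero-padded to a fixed width.
    """
    stamp = (f"{shot_at.day:02d}{MONTHS[shot_at.month - 1]}"
             f"{shot_at.year:04d}-{shot_at:%H%M%S}")
    return f"{stamp} {original_name}"


def rename_incoming(folder):
    """Rename every DSCN*.JPG in folder and return the new paths.

    ASSUMPTION: modification time approximates the time of the shot.
    """
    renamed = []
    for path in sorted(Path(folder).glob("DSCN*.JPG")):
        shot_at = datetime.fromtimestamp(path.stat().st_mtime)
        target = path.with_name(timestamp_name(path.name, shot_at))
        path.rename(target)
        renamed.append(target)
    return renamed
```

Calling timestamp_name("DSCN0316.JPG", datetime(2008, 6, 8, 11, 36, 48)) gives "08Jun2008-113648 DSCN0316.JPG", the example above. Keeping the camera's original name at the end preserves a link back to the card the image came from.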

Note that the numbers are all padded with leading zeros; this is what makes the sorting work properly. To make sure this happens in Hazel, you’ll need to edit the date pattern:

Then, for every element of the date where you’re using a number, set it so that all digits of the number will be displayed, like this:

If I were setting up a naming convention again, I’d do it differently, more like museums and archives do their accession numbers: year.month.day.serialnumber, where serialnumber starts at 00001 and goes up for each photo I shoot that day. Unfortunately, this is a kind of renaming that Hazel doesn’t support well, and it requires slightly more complex scripting. In the meantime, this works fine.
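The accession-number scheme is the kind of thing a short script handles more easily than a rename rule. The Python below is one possible sketch, assuming you already have each photo's original name and the time it was shot. The function name accession_names and the choice to lowercase the extension are my own illustration, not part of any existing tool.

```python
from collections import defaultdict
from datetime import datetime
from pathlib import Path


def accession_names(shots):
    """Map original names to 'YYYY.MM.DD.NNNNN.ext' names.

    shots is an iterable of (original_name, shot_at) pairs. Photos are
    numbered in the order they were shot, and the serial number restarts
    at 00001 for each calendar day.
    """
    counters = defaultdict(int)  # photos seen so far, per day
    names = {}
    for original, shot_at in sorted(shots, key=lambda s: (s[1], s[0])):
        day = shot_at.date()
        counters[day] += 1
        suffix = Path(original).suffix.lower()
        names[original] = f"{shot_at:%Y.%m.%d}.{counters[day]:05d}{suffix}"
    return names


if __name__ == "__main__":
    example = [
        ("DSCN0317.JPG", datetime(2008, 6, 8, 11, 40, 0)),
        ("DSCN0316.JPG", datetime(2008, 6, 8, 11, 36, 48)),
        ("DSCN0400.JPG", datetime(2008, 6, 9, 9, 0, 0)),
    ]
    for old, new in accession_names(example).items():
        print(old, "->", new)
```

Because the serial number depends on every other photo taken that day, the whole day's batch has to be renamed together. That is probably why a rule that looks at one file at a time has trouble with it.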

I’ve posted the Hazel ruleset I use to my diy-archives-tools package, which is hosted at GitHub. (GitHub is a social version-control site; it lets me share revisions for software that I’m working on. For more on version control systems like Git and how to use them, read Julie Meloni’s A Gentle Introduction to Version Control.)