OCRing archival research photos with DEVONThink Pro Office

2011 October 11
Alice swimming alone, from Sir John Tenniel's illustrations of Alice in Wonderland

Feeling like you're in over your head with that research project? You're not alone. Here's one tool for learning to swim.

Recently, I stumbled on Rachel Leow’s series of posts (part 1, part 2, part 3) about DEVONThink Pro Office (DTPO) and how she used it for organizing her dissertation sources and writing (on decolonization in British Malaya). Chad Black, who studies early Latin America, is also a big fan of DEVONThink.

I’ve known about DTPO for years, but I held out on buying it. It’s relatively pricey for a grad student (over $100 even with a 25% student/educator discount), although the versions of DEVONThink without optical character recognition (OCR) features are cheaper.[1] I experimented with an earlier version of DTPO a few years ago and didn’t see anything overwhelmingly awesome about it, plus it was slow on the hardware I had at the time. But Marta Rivera Monclova’s recent effusive praise of DTPO led me to try it again (there’s a 30-day free trial version), and the current version (2.3 as of this writing) is much better than the one I tried before. I now wish I’d gone for it sooner.[2]


DTPO doesn’t have any major features that you can’t find in various other Mac software (OCR with Adobe Acrobat Pro or ABBYY FineReader; file management and full-text search with Finder/Spotlight; tagging and notetaking with Evernote; prose composition with the editor/word-processor of your choice). Nor does it precisely replace them; if you do a lot of PDF manipulation or image processing, you’ll still need a dedicated tool for that work. Where DTPO shines is in integrating all those features extraordinarily well, in making them AppleScriptable, and in handling large quantities of data without slowing to a crawl.[3]

Since I started using DTPO about a month ago, I’ve been putting more and more of my archival research photos into it (not to mention PDFs from Google Books, JSTOR, etc.). I have a lot of JPG photos of typescript material from the early 20th century, so DTPO’s ability to convert those images en masse into searchable PDFs is a huge time-saver. The resulting PDFs are also a lot smaller than the JPGs, and my “Archival Photos” folder of JPGs was getting just too big and unwieldy: over 30GB.[4]

Because some of my images aren’t great quality to begin with, the OCR isn’t perfect, but it’s good enough to index typed material well.[5] In particular, because the OCR picks up common words reliably, I can search on state names to answer some of my questions about birth registration in particular states. Because the government collections I use have interleaved pages of manuscript letters and typescript replies, I don’t sweat a lot about the fact that OCR won’t catch old handwriting. The typescript replies are a good-enough index for the correspondence, and if I need to transcribe something, I can add an Annotation to that image which will sit, searchably, in the same folder.
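
To illustrate why imperfect OCR can still work as an index, here is a rough Python sketch. It is not how DTPO works internally and uses none of its features; the file names, the sample text, and the `pages_mentioning` helper are all made up for illustration. The idea is only that a search term such as a state name needs to be recognized correctly once on a page for that page to turn up, even if other words on the page are garbled.

```python
import re

def pages_mentioning(pages, terms):
    """Map each search term to the sorted names of pages whose OCR text contains it.

    `pages` is a dict of {page name: OCR text}. Matching ignores case and
    allows any whitespace (including line breaks) between the words of a term,
    since OCR text often wraps in the middle of a phrase.
    """
    hits = {}
    for term in terms:
        words = [re.escape(word) for word in term.split()]
        pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
        hits[term] = sorted(
            name for name, text in pages.items() if pattern.search(text)
        )
    return hits

# Hypothetical OCR output: one clean page, one with garbled words,
# and one handwritten page where OCR produced nothing useful.
pages = {
    "box12_0001.jpg": "Bureau of Vital Statistics, State of New\nYork. Dear Sir:",
    "box12_0002.jpg": "rcgistration of b1rths in Ohio has improved",
    "box12_0003.jpg": "",
}

if __name__ == "__main__":
    for term, names in pages_mentioning(pages, ["New York", "Ohio"]).items():
        print(term, names)
```

The second sample page has two misread words, but it is still found by a search for "Ohio". The handwritten page is never found, which is why the typescript replies end up serving as the index for the letters filed next to them.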

I like DEVONThink Pro Office, and there’s a lot about how I’m using it that I’m not detailing in this post. If you’re working on a major research project, you might want to try it. Whether or not it works for you, I’d be interested in hearing the details.

  1. For me, OCR is worth the money. One thing I’ve learned in graduate school: sometimes throwing money at a problem is cheaper than throwing time at it. This is a great example of the financial barriers to doing research with experimental digital methods; my institution doesn’t site-license most of the software research tools I’ve found most useful, so I’ve been paying out-of-pocket using my student loans. Not everyone’s able to make that choice or comfortable doing so.
  2. In case you’re wondering: I am not being paid by DEVONTechnologies, the makers of DEVONThink Pro Office. My effusive praise for DTPO springs solely from the fact that when you’ve been bashing your head against a (research-methods) wall for months, you feel so good when the pain stops.
  3. I use a MacBook Pro which was on sale in the summer of 2010, with the RAM maxed out to 8GB. Right now, I’ve got less than 10GB in my DEVONThink databases. Your mileage may vary, especially if you run something processor-intensive like speech-recognition software at the same time. I can’t easily run Scrivener, Bookends, DTPO, and Dragon Dictate at the same time unless I’ve got several GB of free disk space.
  4. I’ve been keeping my original JPG images on external hard drives/DVDs in case I need the high-res images later. Sometimes you just need to zoom in to decipher some original handwriting.
  5. When I say “not great quality,” I mean that some are 3-5 megapixel images, taken with poor lighting or in a library where I couldn’t use my table-mount monopod to stabilize the camera. I haven’t yet tried it on the batch of images I took with an iPhone 3G’s camera.