{"id":37,"date":"2011-10-14T11:23:19","date_gmt":"2011-10-14T15:23:19","guid":{"rendered":"http:\/\/cliotropic.org\/wip\/?p=37"},"modified":"2011-10-14T11:27:24","modified_gmt":"2011-10-14T15:27:24","slug":"command-line-ocr-of-pdfs","status":"publish","type":"post","link":"http:\/\/cliotropic.org\/wip\/2011\/10\/14\/command-line-ocr-of-pdfs\/","title":{"rendered":"Command-line OCR of JPGs to PDFs?"},"content":{"rendered":"<p>The more I <a href=\"http:\/\/cliotropic.org\/blog\/2011\/10\/ocring-archival-research-photos-with-devonthink\/\">use DEVONThink to organize my research<\/a>, the more my little Macbook Pro (2.4Ghz, 8GB RAM) gets bound up in running OCR jobs on my archival images. <\/p>\n<p>I&#8217;m starting to wish for a more powerful OCR option that I can run on one of my university&#8217;s servers. (&#8220;Computer, here&#8217;s a folder hierarchy with 5GB of JPGs. Run a background job that turns them all into OCRed PDFs, and email me when you&#8217;re done.&#8221;) That isn&#8217;t a task my university&#8217;s servers routinely do, so I need to figure out a set of tools to accomplish it before I go asking a sysadmin for disk space and processor time.<sup class='footnote'><a href='#fn-37-1' id='fnref-37-1' onclick='return fdfootnote_show(37)'>1<\/a><\/sup><\/p>\n<p>There are commercial options for this sort of thing, but I don&#8217;t have an industrial-scale budget. DEVONThink uses <a href=\"http:\/\/www.abbyy.com\/\">ABBYY FineReader<\/a>&#8217;s OCR libraries under the hood. If ABBYY would license a command-line version cheaply that I could install in my home directory on a university server, I&#8217;d consider doing that.<\/p>\n<p>In the open-source world, <a href=\"http:\/\/code.google.com\/p\/ocropus\/\">OCRopus<\/a> does most of this, but it takes images as input and produces HTML output, not OCRed PDFs. 
(I mean, I guess that would work, but many of my images don&#8217;t OCR cleanly enough for me to be able to use HTML output for research without also looking at the source image. I use OCR as a rough-and-ready indexing strategy.) <a href=\"http:\/\/code.google.com\/p\/pyocrhelper\/\">pyocrhelper<\/a> will take PDFs as input, but it converts them back to images first, which is a waste of processor.<\/p>\n<p>Are there other open-source OCR packages out there that will do what I&#8217;m looking for? What should I be looking into? (<a href=\"http:\/\/www.watchocr.com\/\">WatchOCR<\/a> seems like one option, but I don&#8217;t have time for dead-ends and wild goose chases. I just want to know what works.)<\/p>\n<div class='footnotes' id='footnotes-37'>\n<div class='footnotedivider'><\/div>\n<ol>\n<li id='fn-37-1'> Realistically, I won&#8217;t get this in time for my dissertation-related needs, so I&#8217;ll probably just run big OCR jobs overnight on my Macbook. Consider this post as a long-term question for my research. <span class='footnotereverse'><a href='#fnref-37-1'>&#8617;<\/a><\/span><\/li>\n<\/ol>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>The more I use DEVONThink to organize my research, the more my little Macbook Pro (2.4Ghz, 8GB RAM) gets bound up in running OCR jobs on my archival images. I&#8217;m starting to wish for a more powerful OCR option that I can run on one of my university&#8217;s servers. 
(&#8220;Computer, here&#8217;s a folder hierarchy with [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[25],"tags":[54,29,8,52,53],"class_list":["post-37","post","type-post","status-publish","format-standard","hentry","category-tools-hacking","tag-devonthink","tag-ocr","tag-ocropus","tag-pyocrhelper","tag-watchocr"],"_links":{"self":[{"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/posts\/37","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/comments?post=37"}],"version-history":[{"count":0,"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/posts\/37\/revisions"}],"wp:attachment":[{"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/media?parent=37"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/categories?post=37"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/tags?post=37"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}