{"id":20,"date":"2010-06-22T22:43:09","date_gmt":"2010-06-23T02:43:09","guid":{"rendered":"http:\/\/cliotropic.org\/wip\/?p=20"},"modified":"2010-06-22T22:48:06","modified_gmt":"2010-06-23T02:48:06","slug":"ocropus-on-os-x-frustrations","status":"publish","type":"post","link":"http:\/\/cliotropic.org\/wip\/2010\/06\/22\/ocropus-on-os-x-frustrations\/","title":{"rendered":"Ocropus on OS X: frustrations"},"content":{"rendered":"<p>Ever wonder how Google makes searchable text available from the page images of all those <a href=\"http:\/\/books.google.com\">books<\/a>? The answer is <a href=\"http:\/\/code.google.com\/p\/ocropus\/\">Ocropus<\/a>, the open-source optical character recognition (OCR) software whose development Google has funded. Put together with book-scanning hardware, Ocropus is a key part of why ordinary, nontechnical people can do full-text search on the Google Books collection. So it&#8217;s good, it&#8217;s freely modifiable and redistributable, and it can be scripted and automated in ways that commercial OCR software in a student&#8217;s price range can&#8217;t.<\/p>\n<p>I have a theory that Ocropus could be useful for some of my research images: namely, early-20th-century typescript correspondence and print works. I don&#8217;t need perfectly OCRed images, though high accuracy would be great; mostly, what I need is a rough-and-ready way to give <a href=\"http:\/\/en.wikipedia.org\/wiki\/Spotlight_(software)\">Spotlight<\/a> something to search. My early experiments have been promising, but the recognition quality needs to be a little better, and the learning curve on how to make that happen is steep. (I&#8217;d welcome suggestions; the best guide I&#8217;ve found is the <a href=\"http:\/\/ocrocourse.iupr.com\/Home\">IUPR course on Ocropus<\/a>, but it&#8217;s targeted at CS researchers and is very slow going for me.) 
The examples in the <a href=\"http:\/\/code.google.com\/p\/ocropus\/source\/browse\/extras\">extras<\/a> directory show how to train Ocropus on particular text corpora, but they have <em>no comments at all<\/em>. (That may change, if I decide to comment them as part of understanding how they work.)<\/p>\n<p>I&#8217;d have started figuring out that part of Ocropus much sooner if I hadn&#8217;t spent big chunks of several days trying to install all its dependencies on OS X so that it&#8217;ll compile. There&#8217;s a pretty good <a href=\"http:\/\/groups.google.com\/group\/ocropus\/web\/compiling-ocropus-on-mac-os-x?pli=1\">compile guide<\/a>, but the time I spent on that, which hasn&#8217;t yet resulted in a functional installation, is time I&#8217;ll never get back. There&#8217;s another option: someone put together a package for OS X called <a href=\"http:\/\/stuporglue.org\/takocr\/\">TakOCR<\/a>, which is Ocropus together with the libraries it requires. Unfortunately, it doesn&#8217;t work on <a href=\"http:\/\/en.wikipedia.org\/wiki\/Mac_OS_X_Snow_Leopard\">Snow Leopard<\/a>, and the developer isn&#8217;t maintaining TakOCR any longer because he doesn&#8217;t have Snow Leopard himself.<\/p>\n<p>Ocropus is meant to be cross-platform for Unix-like OSes, but it&#8217;s developed on Ubuntu Linux. Fortunately, I have an Ubuntu machine at home (an aging laptop given to me by a friend specifically as a development machine), and my installation of Ocropus on it Just Works. 
(I followed the developer team&#8217;s <a href=\"http:\/\/code.google.com\/p\/ocropus\/wiki\/InstallTranscript\">installation transcript<\/a>, cut-and-pasted into a shell script.)<\/p>\n<p>I&#8217;m writing this up mostly as a cautionary tale: if you&#8217;re on OS X and are interested in using Ocropus, be prepared to put some time into compiling it (or, better yet, improving the installation process).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A promising bit of OCR software seems like more trouble than it&#8217;s worth, at least on OS X.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[25],"tags":[28,29,8,13,26,27],"class_list":["post-20","post","type-post","status-publish","format-standard","hentry","category-tools-hacking","tag-installation","tag-ocr","tag-ocropus","tag-os-x","tag-rants","tag-ubuntu"],"_links":{"self":[{"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/posts\/20","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/comments?post=20"}],"version-history":[{"count":0,"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/posts\/20\/revisions"}],"wp:attachment":[{"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/media?parent=20"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/categories?post=20"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/cliotropic.org\/wip\/wp-json\/wp\/v2\/tags?post=20"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}