<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>History Research Hacks</title>
	<atom:link href="http://cliotropic.org/wip/feed/" rel="self" type="application/rss+xml" />
	<link>http://cliotropic.org/wip</link>
	<description>Exploring digital tools &#38; methods</description>
	<lastBuildDate>Fri, 14 Oct 2011 15:27:24 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Command-line OCR of JPGs to PDFs?</title>
		<link>http://cliotropic.org/wip/2011/10/14/command-line-ocr-of-pdfs/</link>
		<comments>http://cliotropic.org/wip/2011/10/14/command-line-ocr-of-pdfs/#comments</comments>
		<pubDate>Fri, 14 Oct 2011 15:23:19 +0000</pubDate>
		<dc:creator>Shane Landrum</dc:creator>
				<category><![CDATA[Tools Hacking]]></category>
		<category><![CDATA[devonthink]]></category>
		<category><![CDATA[ocr]]></category>
		<category><![CDATA[ocropus]]></category>
		<category><![CDATA[pyocrhelper]]></category>
		<category><![CDATA[watchocr]]></category>

		<guid isPermaLink="false">http://cliotropic.org/wip/?p=37</guid>
		<description><![CDATA[The more I use DEVONThink to organize my research, the more my little Macbook Pro (2.4Ghz, 8GB RAM) gets bound up in running OCR jobs on my archival images. I&#8217;m starting to wish for a more powerful OCR option that I can run on one of my university&#8217;s servers. (&#8220;Computer, here&#8217;s a folder hierarchy with [...]]]></description>
				<content:encoded><![CDATA[<p>The more I <a href="http://cliotropic.org/blog/2011/10/ocring-archival-research-photos-with-devonthink/">use DEVONThink to organize my research</a>, the more my little Macbook Pro (2.4Ghz, 8GB RAM) gets bound up in running OCR jobs on my archival images. </p>
<p>I&#8217;m starting to wish for a more powerful OCR option that I can run on one of my university&#8217;s servers. (&#8220;Computer, here&#8217;s a folder hierarchy with 5GB of JPGs. Run a background job that turns them all into OCRed PDFs, and email me when you&#8217;re done.&#8221;) That isn&#8217;t a task my university&#8217;s servers routinely do, so I need to figure out a set of tools to accomplish it before I go asking a sysadmin for disk space and processor time.<sup class='footnote'><a href='#fn-37-1' id='fnref-37-1' onclick='return fdfootnote_show(37)'>1</a></sup></p>
<p>There are commercial options for this sort of thing, but I don&#8217;t have an industrial-scale budget. DEVONThink uses <a href="http://www.abbyy.com/">ABBYY FineReader</a>&#8217;s OCR libraries under the hood. If ABBYY would cheaply license a command-line version that I could install in my home directory on a university server, I&#8217;d consider doing that.</p>
<p>In the open-source world, <a href="http://code.google.com/p/ocropus/">OCRopus</a> does most of this, but it takes images as input and produces HTML output, not OCRed PDFs. (I mean, I guess that would work, but many of my images don&#8217;t OCR cleanly enough for me to use HTML output for research without also looking at the source image. I use OCR as a rough-and-ready indexing strategy.) <a href="http://code.google.com/p/pyocrhelper/">pyocrhelper</a> will take PDFs as input, but it converts them back to images first, which is a waste of processor time.</p>
<p>Are there other open-source OCR packages out there that will do what I&#8217;m looking for? What should I be looking into? (<a href="http://www.watchocr.com/">WatchOCR</a> seems like one option, but I don&#8217;t have time for dead-ends and wild goose chases. I just want to know what works.)</p>
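<p>Whatever OCR engine ends up doing the work, the batch job itself (walk a folder hierarchy, turn each JPG into a PDF next to it) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names <tt>plan_jobs</tt> and <tt>run_jobs</tt> are made up for illustration, and the OCR step assumes a command that is invoked as <tt>IMAGE OUTPUT_BASE pdf</tt>, which is how recent Tesseract releases work. Swap in whatever tool is actually installed on the server.</p>

```python
import subprocess
from pathlib import Path

JPG_SUFFIXES = {".jpg", ".jpeg"}


def plan_jobs(root):
    """Return (image, pdf) path pairs for every JPG found under root."""
    jobs = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in JPG_SUFFIXES:
            # The PDF lands next to its source image, with the same base name.
            jobs.append((path, path.with_suffix(".pdf")))
    return jobs


def run_jobs(jobs, ocr_command=("tesseract",)):
    """Run the OCR command once per image, skipping PDFs that already exist.

    Assumption: the command accepts `IMAGE OUTPUT_BASE pdf` and appends
    ".pdf" to the output base itself, as recent Tesseract releases do.
    """
    for image, pdf in jobs:
        if pdf.exists():
            continue  # lets an interrupted overnight run pick up where it stopped
        output_base = pdf.with_suffix("")
        subprocess.run(
            [*ocr_command, str(image), str(output_base), "pdf"],
            check=True,
        )
```

<p>Calling <tt>run_jobs(plan_jobs("archive_photos"))</tt> (a hypothetical folder name) would process the whole hierarchy. Keeping the planning step separate from the running step makes it possible to check the list of files before handing anything to a sysadmin&#8217;s queue.</p>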
<div class='footnotes' id='footnotes-37'>
<div class='footnotedivider'></div>
<ol>
<li id='fn-37-1'>Realistically, I won&#8217;t get this in time for my dissertation-related needs, so I&#8217;ll probably just run big OCR jobs overnight on my Macbook. Consider this post as a long-term question for my research. <span class='footnotereverse'><a href='#fnref-37-1'>&#8617;</a></span></li>
</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://cliotropic.org/wip/2011/10/14/command-line-ocr-of-pdfs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Mapping the spread of birth registration with Protovis</title>
		<link>http://cliotropic.org/wip/2011/03/19/mapping-birth-registration-with-protovis/</link>
		<comments>http://cliotropic.org/wip/2011/03/19/mapping-birth-registration-with-protovis/#comments</comments>
		<pubDate>Sat, 19 Mar 2011 12:34:23 +0000</pubDate>
		<dc:creator>Shane Landrum</dc:creator>
				<category><![CDATA[Tools Hacking]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[mapping]]></category>
		<category><![CDATA[protovis]]></category>
		<category><![CDATA[timelines]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://cliotropic.org/wip/?p=35</guid>
		<description><![CDATA[Yesterday, I was reading Lauren Klein&#8217;s recently-posted talk on her work about Thomas Jefferson and his enslaved cook, James Hemings, when I found a tool I wish I&#8217;d known about sooner. Lauren made her images with Protovis, a Javascript visualization toolkit produced by the Stanford Visualization Group. Protovis can do all kinds of things, but [...]]]></description>
				<content:encoded><![CDATA[<p>Yesterday, I was reading Lauren Klein&#8217;s recently-posted <a href="http://macaulay.cuny.edu/eportfolios/lklein/2011/03/17/how-we-know-what-we-know/">talk on her work about Thomas Jefferson and his enslaved cook, James Hemings,</a> when I found a tool I wish I&#8217;d known about sooner. Lauren made her images with <a href="http://vis.stanford.edu/protovis/">Protovis,</a> a Javascript visualization toolkit produced by the <a href="http://vis.stanford.edu/">Stanford Visualization Group.</a></p>
<p>Protovis can do <a href="http://vis.stanford.edu/protovis/ex/">all kinds of things</a>, but I was particularly taken with the <a href="http://vis.stanford.edu/protovis/ex/choropleth.html">choropleth map</a> example, which bears a <a href="http://cliotropic.org/wip/2011/03/16/a-brief-note-on-geocommons/?preview=true&#038;preview_id=32&#038;preview_nonce=0fdc50d0ed#comment-245">striking resemblance</a> to a problem I&#8217;ve been wanting to work up a visualization for. It took a while to figure out how to adapt their example to the data I&#8217;ve got, but here&#8217;s what I ended up with. </p>
<p>This image is for 1921. Gray states didn&#8217;t meet the federal standards for birth registration or death registration; yellow states registered deaths well; purple states registered births and deaths well. You can click on the image for an interactive version that&#8217;ll let you set what year to examine. </p>
<p><a href="http://cliotropic.org/sandbox/regarea_maps/"><img style="display:block; margin-left:auto; margin-right:auto;" src="http://cliotropic.org/wip/wp-content/uploads/2011/03/Screen-shot-2011-03-19-at-8.18.07-AM.png" alt="US death &#038; birth registration areas, 1921" title="registration areas screenshot.png" border="0" width="500" height="349" /></a></p>
<p>The ability to play with this has helped me think about the periodization of my story in ways that the source data (rendered in a table on page 59 of <a href="http://www.cdc.gov/nchs/data/misc/usvss.pdf">this PDF</a>) just didn&#8217;t. (Note the lack of good coverage in heavily rural states, especially in the Southeast and Southwest. These states developed better birth registration by the late 1920s as a result of federal funding from the <a href="http://womenshistory.about.com/od/laws/a/sheppard-towner.htm">Sheppard-Towner Act.</a>)</p>
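<p>The coloring rule behind the map (gray, yellow, purple) is simple enough to sketch outside the browser. The following is Python rather than the Protovis JavaScript the map actually uses, and it assumes the data boils down to one admission year per state for each registration area. The state names and years are invented placeholders, not values from the source table.</p>

```python
# Hypothetical admission years, for illustration only.
DEATH_AREA_YEAR = {"Examplia": 1906, "Samplevania": 1919, "Testana": 1927}
BIRTH_AREA_YEAR = {"Examplia": 1915, "Samplevania": 1924, "Testana": 1927}

NEVER = float("inf")  # stands in for "not admitted yet"


def state_color(state, year):
    """Return the map color for one state in one year."""
    in_death_area = year >= DEATH_AREA_YEAR.get(state, NEVER)
    in_birth_area = year >= BIRTH_AREA_YEAR.get(state, NEVER)
    if in_birth_area and in_death_area:
        return "purple"  # registered births and deaths well
    if in_death_area:
        return "yellow"  # registered deaths well
    # Assumption: the legend has no births-only color, so that case
    # falls through to gray along with states in neither area.
    return "gray"


for name in ("Examplia", "Samplevania", "Testana"):
    print(name, state_color(name, 1921))
```

<p>Changing the year passed to <tt>state_color</tt> is the same operation the interactive version performs when you pick a different year.</p>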
<h2>Want to do this yourself?</h2>
<p>For those of you who have a similar research problem that might be able to use some visualization, I&#8217;ve posted the code on <a href="https://github.com/cliotropic/sandbox">GitHub</a>.  You don&#8217;t need a web server to run it; all you&#8217;ll need is to modify the <a href="https://github.com/cliotropic/sandbox/raw/master/regarea_maps/birthreg_grid_simple.js">data file</a> with your own state-level data. If you change the names of any variables in that file, you&#8217;ll need to search-and-replace them in the <a href="https://github.com/cliotropic/sandbox/raw/master/regarea_maps/index.html">main page&#8217;s scripts</a> as well.</p>
]]></content:encoded>
			<wfw:commentRss>http://cliotropic.org/wip/2011/03/19/mapping-birth-registration-with-protovis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A brief note on GeoCommons</title>
		<link>http://cliotropic.org/wip/2011/03/16/a-brief-note-on-geocommons/</link>
		<comments>http://cliotropic.org/wip/2011/03/16/a-brief-note-on-geocommons/#comments</comments>
		<pubDate>Wed, 16 Mar 2011 15:49:25 +0000</pubDate>
		<dc:creator>Shane Landrum</dc:creator>
				<category><![CDATA[Tools Hacking]]></category>
		<category><![CDATA[dhanswers]]></category>
		<category><![CDATA[gis]]></category>
		<category><![CDATA[mapping]]></category>

		<guid isPermaLink="false">http://cliotropic.org/wip/?p=32</guid>
		<description><![CDATA[Yesterday, Digital Humanities Answers helped me find an answer to a problem I&#8217;ve been wondering about for a long time: how to map some data easily, without having to know a lot about GIS. DHAnswers, a project of the Association for Computers and the Humanities and ProfHacker, is a much-better-than-average implementation of the message-board concept, with really [...]]]></description>
				<content:encoded><![CDATA[<p>Yesterday, <a href="http://digitalhumanities.org/answers/">Digital Humanities Answers</a> helped me find an answer to a problem I&#8217;ve been wondering about for a long time: <a href="http://digitalhumanities.org/answers/topic/open-source-historical-gis-tools#post-162">how to map some data easily,</a> without having to know a lot about GIS.</p>
<p><span id="more-32"></span></p>
<p>DHAnswers, a project of the <a href="http://www.ach.org/">Association for Computers and the Humanities</a> and <a href="http://chronicle.com/blog/ProfHacker/27/">ProfHacker</a>, is a much-better-than-average implementation of the message-board concept, with really smart people who answer questions there. When I saw <a href="http://digitalhumanities.org/answers/topic/open-source-historical-gis-tools#post-222" target="_blank">Bethany Nowviskie&#8217;s reference</a> to <a href="http://geocommons.com/" target="_blank">GeoCommons</a>, I decided to play with it. (I&#8217;d just listened to an older podcast from the <a href="http://www.scholarslab.org/blog/" target="_blank">Scholars&#8217; Lab</a>, <a href="http://gisvirginia.blogspot.com/2009/11/neogeography-andrew-turner-uva-gis-day.html">Andrew Turner&#8217;s</a> November 2009 &#8220;<a title="Permanent Link to Neogeography: from Tower to Town Hall" rel="bookmark" href="http://www.scholarslab.org/podcasts/neogeography-from-tower-to-town-hall/">Neogeography: from Tower to Town Hall</a>.&#8221; Andrew is the CTO of GeoCommons, and that talk&#8217;s a good introduction to mapping for non-experts, even if the sound quality&#8217;s not great.)</p>
<p>In any case: if you&#8217;ve ever wondered how to map some data, and especially if you already have a spreadsheet of it with state names, other place names, or latitude/longitude columns, go play with <a href="http://geocommons.com/">GeoCommons.</a> Once I clean up my maps a little, maybe I&#8217;ll post them here. </p>
<p>I&#8217;m finding some annoyances with GeoCommons, largely around how it handles date-formatted data, but overall it&#8217;s more useful than frustrating.</p>
]]></content:encoded>
			<wfw:commentRss>http://cliotropic.org/wip/2011/03/16/a-brief-note-on-geocommons/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Ocropus on OS X: frustrations</title>
		<link>http://cliotropic.org/wip/2010/06/22/ocropus-on-os-x-frustrations/</link>
		<comments>http://cliotropic.org/wip/2010/06/22/ocropus-on-os-x-frustrations/#comments</comments>
		<pubDate>Wed, 23 Jun 2010 02:43:09 +0000</pubDate>
		<dc:creator>Shane Landrum</dc:creator>
				<category><![CDATA[Tools Hacking]]></category>
		<category><![CDATA[installation]]></category>
		<category><![CDATA[ocr]]></category>
		<category><![CDATA[ocropus]]></category>
		<category><![CDATA[os x]]></category>
		<category><![CDATA[rants]]></category>
		<category><![CDATA[ubuntu]]></category>

		<guid isPermaLink="false">http://cliotropic.org/wip/?p=20</guid>
		<description><![CDATA[A promising bit of OCR software seems like more trouble than it's worth, at least on OS X.]]></description>
				<content:encoded><![CDATA[<p>Ever wonder how Google makes searchable text available from the page images of all those <a href="http://books.google.com">books</a>? The answer is <a href="http://code.google.com/p/ocropus/">Ocropus</a>, the open-source optical character recognition (OCR) software whose development Google has funded. Put together with book-scanning hardware, Ocropus is a key part of why ordinary, nontechnical people can do full-text search on the Google Books collection. So it&#8217;s good, it&#8217;s freely modifiable and redistributable, and it can be scripted and automated in ways that commercial OCR software (at least the kind students can afford) can&#8217;t.</p>
<p>I have a theory that Ocropus could be useful for some of my research images&#8212; namely, early-20th-century typescript correspondence and print works. I don&#8217;t need perfectly OCRed images, though high accuracy would be great; mostly, what I need is a rough-and-ready way to give <a href="http://en.wikipedia.org/wiki/Spotlight_(software)">Spotlight</a> something to search. My early experiments have been promising, but the recognition quality needs to be a little better, and the learning curve on how to make that happen is steep. (I&#8217;d welcome suggestions; the best guide I&#8217;ve found is the <a href="http://ocrocourse.iupr.com/Home">IUPR course on Ocropus,</a> but it&#8217;s targeted at CS researchers and is very slow going for me. The examples in the <a href="http://code.google.com/p/ocropus/source/browse/extras">extras</a> directory show how to train Ocropus on particular text corpora, but they have <em>no comments at all</em>. That may change, if I decide to comment them as part of understanding how they work.)</p>
<p>I&#8217;d have been working on figuring out that part of Ocropus much sooner if I hadn&#8217;t spent big chunks of several days trying to install all its dependencies on OS X so that it&#8217;ll compile. There&#8217;s a pretty good <a href="http://groups.google.com/group/ocropus/web/compiling-ocropus-on-mac-os-x?pli=1">compile guide</a>, but the time I spent on that&#8212; which hasn&#8217;t yet resulted in a functional installation&#8212; is time I&#8217;ll never get back. There&#8217;s another option; someone put together a package for OS X called <a href="http://stuporglue.org/takocr/">TakOCR</a>, which is Ocropus together with the libraries it requires. Unfortunately, it doesn&#8217;t work on <a href="http://en.wikipedia.org/wiki/Mac_OS_X_Snow_Leopard">Snow Leopard</a>, and the developer isn&#8217;t maintaining TakOCR any longer because he doesn&#8217;t have Snow Leopard himself.</p>
<p>Ocropus is meant to be cross-platform for Unix-like OSes, but it&#8217;s developed on Ubuntu Linux. Fortunately, I have an Ubuntu machine at home&#8212; an aging laptop given to me by a friend specifically as a development machine&#8212; and my installation of Ocropus on it Just Works. (I followed the developer team&#8217;s <a href="http://code.google.com/p/ocropus/wiki/InstallTranscript">installation transcript,</a> cut-and-pasted into a shell script.)</p>
<p>I&#8217;m writing this up mostly as a cautionary tale: if you&#8217;re on OS X and are interested in using Ocropus, be prepared to put some time into compiling it (or, better yet, into improving the installation process).</p>
]]></content:encoded>
			<wfw:commentRss>http://cliotropic.org/wip/2010/06/22/ocropus-on-os-x-frustrations/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Naming archival reference photos</title>
		<link>http://cliotropic.org/wip/2010/06/20/naming-archival-reference-photos/</link>
		<comments>http://cliotropic.org/wip/2010/06/20/naming-archival-reference-photos/#comments</comments>
		<pubDate>Sun, 20 Jun 2010 17:39:14 +0000</pubDate>
		<dc:creator>Shane Landrum</dc:creator>
				<category><![CDATA[Digital Archives]]></category>
		<category><![CDATA[archives]]></category>
		<category><![CDATA[filing]]></category>
		<category><![CDATA[hazel]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[os x]]></category>
		<category><![CDATA[photos]]></category>
		<category><![CDATA[tools]]></category>

		<guid isPermaLink="false">http://cliotropic.org/wip/?p=11</guid>
		<description><![CDATA[Now that summer is well underway, many historians are deeply ensconced in archival research. Some of us, particularly the newly-ABD, are looking at our digital cameras and carts full of Hollinger boxes and wondering how to make sense of it all. The first step is being able to find an image reliably, and that requires good file naming practices.]]></description>
				<content:encoded><![CDATA[<p>Now that summer is well underway, many of us are deeply ensconced in archival research. Some of us, particularly the newly-<a href="http://www.noodlesoft.com/hazel.php">ABD</a>, are looking at our digital cameras and carts full of <a href="http://www.google.com/images?q=hollinger+boxes">Hollinger boxes</a> and wondering how to make sense of it all.</p>
<p>I&#8217;ve written before about <a href="http://cliotropic.org/blog/talks/camera-laptop-and-what-else/">hacking better tools for the short research trip</a> and  <a href="http://cliotropic.org/blog/2009/12/taming-the-all-digital-history-research-collection-1-tagging-and-filing/">taming the all-digital research collection</a>, but here&#8217;s a more specific tip:</p>
<p><strong>When you&#8217;re shooting reference-quality photos, think about your file names like accession numbers.</strong> When you take notes, you&#8217;re going to want to refer to individual images, and it&#8217;s easier to take notes from photos if you know that every image you ever shoot has a unique file name.</p>
<p>On the Mac, I use a simple file renaming tool, <a href="http://www.noodlesoft.com/hazel.php">Hazel</a>, to make this easier. Hazel watches whatever directories you tell it to, then executes a set of rules on the files it finds. It&#8217;s free for a 14-day trial and US$21.95 to register. Here&#8217;s what it looks like (click on the image to enlarge):</p>
<p style="text-align: center;"><a href="http://cliotropic.org/wip/wp-content/uploads/2010/06/Hazel-main-dialog.png" rel="lightbox[11]"><img class="size-large wp-image-14 aligncenter" title="Hazel main dialog" src="http://cliotropic.org/wip/wp-content/uploads/2010/06/Hazel-main-dialog-300x226.png" alt="" width="300" height="226" /></a></p>
<p>So I have a folder called <tt>incoming images</tt>, and I can tell Hazel to pick up all JPG files whose names begin with <tt>DSCN</tt>&#8212; which is how my camera names files&#8212; and rename them according to a pattern I choose. Here&#8217;s the rule-setting dialog:</p>
<p style="text-align: center;"><a href="http://cliotropic.org/wip/wp-content/uploads/2010/06/Hazel-rules-screen.png" rel="lightbox[11]"><img class="size-medium wp-image-15 aligncenter" title="Hazel rules screen" src="http://cliotropic.org/wip/wp-content/uploads/2010/06/Hazel-rules-screen-300x236.png" alt="" width="300" height="236" /></a></p>
<p>For archival reference images, it&#8217;s important that the pattern be sortable and that it be based on the file creation time of the image as it comes from your camera. That way, when you sort the files by name, you see them in the order that you shot them.</p>
<p>When I started doing this several years ago, I wanted something human-readable, so I chose to use a naming convention that contained a written-out month. By this rule, <tt>DSCN0316.JPG</tt> became <tt>08Jun2008-113648 DSCN0316.JPG</tt>— an image that I shot on June 8, 2008, at 11:36:48. That way, however I moved the file (to put it into folders by collection), it would retain its sortability.</p>
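<p>For anyone without Hazel, the same renaming pattern can be sketched in Python. This is an illustration of the pattern, not the Hazel rule itself; the function name <tt>archival_name</tt> is made up, and the shooting time is passed in directly rather than read from the file.</p>

```python
from datetime import datetime

# Fixed English abbreviations, so the result does not depend on the system locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def archival_name(original_name, shot_at):
    """Prefix a camera file name with a zero-padded date and time stamp."""
    stamp = "{:02d}{}{:04d}-{:02d}{:02d}{:02d}".format(
        shot_at.day, MONTHS[shot_at.month - 1], shot_at.year,
        shot_at.hour, shot_at.minute, shot_at.second,
    )
    return f"{stamp} {original_name}"


print(archival_name("DSCN0316.JPG", datetime(2008, 6, 8, 11, 36, 48)))
# 08Jun2008-113648 DSCN0316.JPG
```

<p>Keeping the camera&#8217;s original name at the end of the new one means the renamed file can always be traced back to what came off the card.</p>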
<p style="text-align: left;">Note that the numbers are all padded with <a href="http://en.wikipedia.org/wiki/Leading_zero">leading zeros</a>; this is what makes the sorting work properly. To make sure this happens in Hazel, you&#8217;ll need to edit the date pattern:<br />
<a href="http://cliotropic.org/wip/wp-content/uploads/2010/06/Screen-shot-2010-06-20-at-12.51.56-PM.png" rel="lightbox[11]"><img class="size-medium wp-image-12 aligncenter" title="Screen shot 2010-06-20 at 12.51.56 PM" src="http://cliotropic.org/wip/wp-content/uploads/2010/06/Screen-shot-2010-06-20-at-12.51.56-PM-300x237.png" alt="" width="300" height="237" /></a></p>
<p style="text-align: left;">Then, for every element of the date where you&#8217;re using a number, set it so that all digits of the number will be displayed, like this:<br />
<a href="http://cliotropic.org/wip/wp-content/uploads/2010/06/Screen-shot-2010-06-20-at-12.31.40-PM.png" rel="lightbox[11]"><img class="size-medium wp-image-13 aligncenter" title="Date pattern dialog showing leading zeros" src="http://cliotropic.org/wip/wp-content/uploads/2010/06/Screen-shot-2010-06-20-at-12.31.40-PM-300x206.png" alt="" width="300" height="206" /></a></p>
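<p>To see why the padding matters, here&#8217;s a small Python sketch that sorts the same three time stamps with and without leading zeros. The shooting times are hypothetical, and <tt>time_stamp</tt> is a helper invented for this example.</p>

```python
def time_stamp(hour, minute, second, pad=True):
    """Join the parts of a time, with or without leading zeros."""
    template = "{:02d}{:02d}{:02d}" if pad else "{}{}{}"
    return template.format(hour, minute, second)


# Hypothetical shooting times: 09:05:00, 11:36:48 and 10:00:07.
shots = [(9, 5, 0), (11, 36, 48), (10, 0, 7)]

padded = sorted(time_stamp(*shot) for shot in shots)
unpadded = sorted(time_stamp(*shot, pad=False) for shot in shots)

print(padded)    # ['090500', '100007', '113648']
print(unpadded)  # ['1007', '113648', '950']
```

<p>File names sort as text, one character at a time, so without the padding the 9 a.m. photo ends up after the ones shot at 10 and 11.</p>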
<p>If I were setting up a naming convention again, I&#8217;d do it differently, more like museums and archives do their accession numbers: <tt>year.month.day.serialnumber</tt>, where <tt>serialnumber</tt> starts at 00001 and goes up for each photo I shoot that day. Unfortunately, this is a kind of renaming that Hazel doesn&#8217;t support well, and it requires slightly more complex scripting. In the meantime, this works fine.</p>
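<p>The &#8220;slightly more complex scripting&#8221; for accession-style names might look something like the following Python sketch. It assumes you already have each file&#8217;s name and shooting time as a pair; the function name <tt>accession_names</tt> and the choice to keep a lowercased extension at the end are illustrative assumptions, not an established convention.</p>

```python
from collections import defaultdict
from datetime import datetime
from pathlib import Path


def accession_names(shots):
    """Map each original file name to year.month.day.serialnumber plus extension.

    `shots` is an iterable of (file_name, datetime) pairs. Serial numbers
    restart at 00001 each day and follow shooting order within the day.
    """
    counters = defaultdict(int)
    names = {}
    for file_name, shot_at in sorted(shots, key=lambda shot: shot[1]):
        day = shot_at.date()
        counters[day] += 1
        suffix = Path(file_name).suffix.lower()
        names[file_name] = "{:04d}.{:02d}.{:02d}.{:05d}{}".format(
            day.year, day.month, day.day, counters[day], suffix,
        )
    return names


example = accession_names([
    ("DSCN0317.JPG", datetime(2008, 6, 8, 11, 40, 0)),
    ("DSCN0316.JPG", datetime(2008, 6, 8, 11, 36, 48)),
    ("DSCN0400.JPG", datetime(2008, 6, 9, 9, 0, 0)),
])
print(example["DSCN0316.JPG"])  # 2008.06.08.00001.jpg
```

<p>The per-day counter is the part that a rule working on one file at a time can&#8217;t easily do, since each new name depends on how many photos were already shot that day.</p>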
<p>I&#8217;ve posted the Hazel ruleset I use to my <a href="http://github.com/cliotropic/diy-archives-tools">diy-archives-tools</a> package, which is hosted at <a href="http://github.com">GitHub</a>. (GitHub is a social version-control site; it lets me share revisions for software that I&#8217;m working on. For more on version control systems like Git and how to use them, read Julie Meloni&#8217;s <a href="http://chronicle.com/blogPost/A-Gentle-Introduction-to/23064/">A Gentle Introduction to Version Control</a>.)</p>
]]></content:encoded>
			<wfw:commentRss>http://cliotropic.org/wip/2010/06/20/naming-archival-reference-photos/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
