Taming the all-digital history research collection, 1: tagging and filing

2009 December 27

Ed Lazowska points to ways that computer science has changed information management in the last 40 years. Here’s the top of his list:

1. Search. Ten years ago, you would painstakingly organize things – label them and file them – so that you could find them. How 1990s! Today, you can search more than 500 Terabytes of the web (not to mention your own desktop) in 100 milliseconds.

From where I sit, he’s half right. I’ve been doing everything I can to have all my primary sources available in digital format, whether they’re photos of manuscripts and typescripts at a physical archive, PDFs from JSTOR, or PDFs from preservation microfilm. Like many historians, the vast majority of the sources I work with aren’t yet digitized (and may never be), but the benefits of always having my primary sources available on my laptop are huge. I get a lot out of web search and desktop search, but I still end up doing a lot of “painstakingly organiz[ing] things” by hand.

Since 2007 or so, I’ve come up with my own research-organization scheme, because I couldn’t find anyone who’d written in detail about how to keep an all-digital research collection for individual scholarly use. (Everyone I talked to said, “That’s always highly individualized, and different things work for different people. You can probably figure out something good on your own.”)

In the spirit of ProfHacker, my favorite academia-plus-technology howto website, here’s what I’ve found useful. These tricks aren’t perfect at all, but they’re a place to start. It’s Mac-centric because that’s what I use, but Mac and Windows users should feel free to comment with your experiences about what works for you.


One particularly helpful blog (that I’ve lost track of now) suggested using tag prefixes to allow more granularity. For example, a tag “SU) infant mortality” is a subject tag for materials about infant mortality. “PN) Abbott. Grace” means that this item relates to a person by name (American social reformer Grace Abbott). Not complicated, flexible, easy enough to remember. It’s not Dublin Core Metadata, but it’s good enough for personal use.1
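For readers who like to script things, here is a minimal sketch of how prefixed tags like these could be split into a type and a value. It assumes the prefix is everything before the first ")". The `PREFIXES` table and the function names are made up for illustration; only the "SU" and "PN" prefixes come from the examples above.

```python
# Hypothetical prefix table: only SU and PN appear in the post itself.
PREFIXES = {"SU": "subject", "PN": "person"}


def parse_tag(tag):
    """Split a prefixed tag such as 'SU) infant mortality' into (kind, value).

    Tags without a known prefix come back as ('untyped', tag).
    """
    prefix, sep, value = tag.partition(")")
    prefix = prefix.strip()
    if not sep or prefix not in PREFIXES:
        return ("untyped", tag.strip())
    return (PREFIXES[prefix], value.strip())


def group_tags(tags):
    """Group a list of tags by kind so each kind can be browsed separately."""
    grouped = {}
    for tag in tags:
        kind, value = parse_tag(tag)
        grouped.setdefault(kind, []).append(value)
    return grouped


if __name__ == "__main__":
    print(group_tags(["SU) infant mortality", "PN) Abbott. Grace", "to read"]))
```

The point of the prefix is that a plain text search for "PN)" lists every person tag at once, even in software that knows nothing about tag types.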

Early in my research, on the advice of the internet, I adopted a tag-based filing approach at the file level. I’ve used a bunch of different software that supports tagging and searching by tags, like Ironic Software’s TagIt, but ultimately filesystem-based tagging wasn’t granular enough for my purposes. These days, my main use of tags is in Evernote for individual notes I’ve taken about particular topics, or paragraph-length quote clippings out of longer PDFs.

Tagging my materials well requires a lot of discipline, and I haven’t mastered it yet, but when I do it, it works. One of the pitfalls I’ve had to avoid is the temptation to get every single item tagged properly, because I’m one person with a big project and a finite amount of time. Over time, I’ve moved away from tagging files in favor of verbose, search-friendly file names.

Search-friendly file and folder naming

When I take digital photos at an archive, I file them in a hierarchy by repository, collection, series, box, and folder. (Shooting images of the pull slips, the box label, and the folder label helps immensely with this, since I can identify a folder label from its thumbnail image.) I’ve built some AppleScripts and shell scripts to help with that filing process.
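The scripts themselves aren't shown here, so what follows is only a rough Python sketch of the same idea, not the actual AppleScripts or shell scripts. The function names, the "Box N" and "Folder N" folder labels, and the example repository and collection names are all assumptions for illustration.

```python
import shutil
from pathlib import Path


def archive_folder(root, repository, collection, series, box, folder):
    """Build the nested path repository/collection/series/Box N/Folder N."""
    parts = [repository, collection, series, f"Box {box}", f"Folder {folder}"]
    return Path(root).joinpath(*parts)


def file_photos(photos, destination):
    """Move photos into destination, creating it if needed.

    Refuses to overwrite an existing file with the same name.
    """
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    moved = []
    for photo in photos:
        target = destination / Path(photo).name
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        shutil.move(str(photo), str(target))
        moved.append(target)
    return moved


if __name__ == "__main__":
    # Hypothetical names; substitute your own repository and collection.
    print(archive_folder("Research", "Example Repository", "Example Collection",
                         "Central File 1941-44", 12, 3))
```

Mirroring the archive's own arrangement means the folder path doubles as most of the citation for any photo inside it.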

When I find an item in a set of photos that I want to remember, I rename the file by adding useful words to the end of the camera-generated filename. (I know it’s around here somewhere, it was a letter in the Children’s Bureau collections from a mother in California in the early 1940s…) A Spotlight search on “California” and “mother” in my folder for “Children’s Bureau Central File 1941-44” comes up with a manageable number of files, and I can browse them until I find what I was thinking of. The trick here is to pick obvious words or phrases that you’ll think of when you want to search.
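As a sketch of the renaming step, assuming the keywords are simply appended after the camera's filename and before the extension: the function names and the example filename `IMG_0412.JPG` are hypothetical, and `find_by_words` is only a crude stand-in for a Spotlight filename search, not Spotlight itself.

```python
from pathlib import Path


def add_keywords(photo, keywords):
    """Rename 'IMG_0412.JPG' to 'IMG_0412 letter mother California.JPG'.

    The camera-generated name is kept so the original sequence survives.
    """
    photo = Path(photo)
    new_name = f"{photo.stem} {' '.join(keywords)}{photo.suffix}"
    target = photo.with_name(new_name)
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    photo.rename(target)
    return target


def find_by_words(folder, words):
    """Return files under folder whose names contain every word (case-insensitive)."""
    wanted = [w.lower() for w in words]
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and all(w in p.name.lower() for w in wanted)
    )
```

For example, `find_by_words("Children’s Bureau Central File 1941-44", ["California", "mother"])` would list only the renamed photos whose names carry both words.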

Search-friendly filenames also are more likely to survive multiple generations of cross-platform backups than are most of the existing Mac file-tagging systems.

(I keep backups on local hard drives and online via JungleDisk. A redundant backup system is critical before trusting your career-making research to any computer.)

For PDFs, I use my citation management software’s filing system; whenever I can, I rename the files using a modified Chicago-style citation format. That solves the problem of finding Grace Abbott’s writings outside of my photo collection. When I’ve been to a library with a digital-camera ban and have returned with a stack of photocopies, I take advantage of my university’s bulk-feed PDF scanner (and its OCR software).
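The exact "modified Chicago-style" pattern isn't spelled out above, so this is just one possible sketch of a citation-style filename, assuming an "Author, First. Title. Year" layout. The function name, the layout, and the placeholder title and year in the example are all hypothetical.

```python
import re


def citation_filename(author_last, author_first, title, year, ext=".pdf"):
    """Build a filename like 'Abbott, Grace. Some Title. 1940.pdf'.

    Characters that are unsafe in filenames on Mac or Windows become '-'.
    """
    raw = f"{author_last}, {author_first}. {title}. {year}"
    safe = re.sub(r'[/\\:*?"<>|]', "-", raw)   # replace unsafe characters
    safe = re.sub(r"\s+", " ", safe).strip()    # collapse stray whitespace
    return safe + ext


if __name__ == "__main__":
    # Placeholder title and year, not a real citation.
    print(citation_filename("Abbott", "Grace", "Example Title: A Placeholder", 1940))
```

Putting the author's surname first means an alphabetical folder listing groups each author's writings together, and a filename search on "Abbott" finds them all.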

When I see a little quote I want to remember, I use Evernote’s screen-clip feature to create a note about it, and I type in a basic citation for that clipping. Evernote does some OCR to make the image searchable, which is a huge help.

Together, these tricks handle the “How do I keep track of it?” question, mostly.2 But they’re rough tools. They don’t solve the problem of creating intellectually meaningful ways to search, sort, and analyze digital-format sources. In my experience, that’s a much harder problem, and I’ll write more about it in an upcoming post.

  1. I have a little theory that the Omeka collections-management software—which does use the Dublin Core metadata standards—might now be up to handling many of the tasks I describe here, particularly with its bulk-loading plugins, but I haven’t had time to do anything more than install it experimentally.
  2. Connecting my digital-sources files with my citation management software is more work than it ought to be, which is a subject for another post. I’ve used EndNote and now use Bookends, and someday I’ll switch over to Zotero once the Mellel developers support it properly.
  1. December 28, 2009

    Nice post. I’m teaching an undergrad honors class on research methods this semester, and reading this reminds me that I need a week specifically on the topic of managing one’s own digital archive, as the students’ projects will invariably rely on digital collections to write theses locally. In my own case, I use DEVONthink Pro Office, which has only now added a tagging feature. I’m reluctant to even bother starting to use it. My book-length projects end up with thousands of files in the database, and millions of words. DT’s search function is so robust, I’m not certain that tagging would do that much for me.

    I don’t rename the files of my digital photos, because I have tens of thousands of them. In June, I collected 19,000 (about 70 GB) more digital images of archival docs. That brings the total for my second project to somewhere around 50,000 individual images. Instead of using file names, I rely on constructing a file structure that reproduces the archive. I think with a batch file-renaming utility, I would be interested in reconsidering the file name issue. To make those photos useful, I construct (or cut/paste if there is a PDF finder’s guide) inventories of the files, and create specialized inventories for particular subjects (i.e., adultery, murder, tobacco, etc.). These inventories live in DTPO, but the photos do not.

    Finally, I’ll add that with ABBYY OCR built in to DTPO, I’m even photocopying/scanning important books, or sections of those books, to reside in the database, and taking notes from those files.

  2. December 28, 2009

    Correct file naming is all-important. It took me a long, long time to realise this.

    Since most of my resources are published material (albeit from the nineteenth century), I have found that naming things “author_year” has made my research life a whole lot easier.

  3. January 8, 2010

    Thanks for explaining your system. It’s helpful to see how someone else keeps research materials organized. I’ve only recently gotten better at using search-friendly file names, and there is still plenty of room for improvement.


