Camera, laptop, and what else?: Hacking better tools for the short archival research trip
These are my prepared notes for a talk at Yale’s graduate conference, The Past’s Digital Presence: Database, Archive, and Knowledge Work in the Humanities, February 19-20, 2010. The version posted here doesn’t (yet) include slide images or links to the software I describe.
Today, I’m going to talk about the last two years I’ve spent in archives, both physical and digital, and how they’ve made my work possible. In 2006, when I started proposing my dissertation, I couldn’t have told you that digital strategies would be a core part of my research method. Everyone I talked to told me that notetaking and filing techniques are highly individual. So I’ll repeat that, with a difference: I’m going to actually explain what I do.
I’ve used digital camera strategies for reducing the length of archival trips. Instead of building a giant research budget to support months on end in a particular faraway collection, I’ve put together short travel grants, and I’ve been able to collect research materials that wouldn’t otherwise have been easy to amass. My travels have included visits to about 15 archival libraries; many are on the east coast, but I’ve also worked in libraries in Iowa and California. My digital-camera approach, combined with a PDF scanner for photocopies, means that I can carry nearly all of my primary sources with me on my laptop.
Today, I’ll talk about how I work in the archives, how I organize the materials I collect, and the problems I’ve identified as I work. Finally, I’ll discuss some new tools I think we need to do more intellectually meaningful work from digitized collections.
For this talk, I’ll focus on one particular collection: the Children’s Bureau Papers, which are held by the National Archives and Records Administration (NARA) in College Park, Maryland. Record Group 102 is a civilian collection with no access restrictions. It spans 1400 cubic feet, which by my calculation is somewhere close to 2000 Hollinger-style archival boxes. The bulk of the records are from 1912 to the mid-1940s.
The Children’s Bureau was the first federal agency to focus on the wellbeing of mothers and children, and this collection is a major source for US women’s history. It’s full of letters. Ordinary women, particularly mothers, wrote to the Children’s Bureau with all kinds of questions, and the women who worked at the Children’s Bureau wrote back. Researchers use this collection constantly, but only parts of it have been microfilmed, and they’re generally not the parts I work with.
The techniques I’m describing will work, with minor modifications, in any library that allows you to shoot digital images. That means they’ll be most useful for people working on periods after about the early 19th century, since rare-books libraries tend to have tighter controls. I’m happy to entertain comments about the usefulness and limitations of these approaches.
What to bring
Unlike many private and some public libraries, the National Archives allows researchers to bring in digital cameras, tripods, and even flatbed scanners. There’s a basic process in place to ensure that you’re not shooting classified materials; it’s about as open as I could imagine for an archival library. When I go, I take the following materials:
- Nikon Coolpix compact camera, 7 megapixels ($130); if you’re purchasing one, check reviews to make sure that the camera supports a macro (close range) mode. You may also want to check how much barrel distortion the camera has when it zooms in. Most photos will be at wide angles, but it’s nice to be able to zoom in on details of text without the image being too distorted. It’s also good to have a camera that can handle low light levels well without blurring; not every library is well-lit for photography, and some libraries won’t allow tripods.
- Sharpics table-mount monopod (www.sharpics.com, $50). This is the one piece of equipment people ask me about constantly, because it’s vastly easier to work with than a tripod when you’re shooting straight down. It’s also compact and lightweight.
- My laptop, an early-2006 MacBook with upgraded RAM and hard drive; initially $1500, and I’ve spent about $200 per year on upgrading to larger hard drives as prices drop. It’s nice to have a computer with user-replaceable hard drives and RAM, so that you don’t have to pay the manufacturer’s high prices.
- A USB flash-media reader ($5 on eBay); faster than plugging a cable into your camera.
- Extra flash-media cards, 2GB size ($25 total on eBay).
- Rechargeable AA batteries (4) and AC charger ($30). I also have Nikon’s AC adapter, which is nice for a day when I plan to plug the camera in, put it on the tripod, and spend all day shooting papers. Otherwise, batteries work just fine, are more versatile for other electronics I own, and take up less room in luggage.
- Blank CDs or DVDs for backups.
- Headphones and an iPod, to lighten a very repetitive task.
In the archives: my process
- I consult the finding aids, which for the Children’s Bureau are paper-only, and shoot images of them. I write call slips for each box I want and wait for them to be delivered. (For ease of filing requests, I take extra call slips, then use my digital images of the finding aids to fill out the call slips later, after closing time, so that I can turn them in the next morning.)
- While waiting for box delivery, I find a desk that’s well-lit and where daylight shadows won’t adversely affect my images. (There will be shadows; these aren’t preservation-quality images, but they’re good enough for research.) I set up the tripod, confirm that I’ve got enough battery life left, and look through my notes while I wait. At some libraries, I’ve had to sign a permission-to-photograph agreement; I shoot an image of it, with my signature, so that I have a record of it in digital form near the images it relates to.
- When the boxes are delivered, they’re delivered with copies of the pull slip clipped to each one. I shoot in a particular order. The pull slip gets its own image, as does the end of the box with its filing information. Each time I pull out a folder, I shoot an image of it, with the folder label angled diagonally. Because these three kinds of images are distinctive even in thumbnail size, they come in handy later.
- When I set up the camera on its mount, I try to get as far above the document as my tripod will allow, and to set it in a position that will allow me to open a folder underneath it and shoot without moving the documents around. I don’t bother with framing the images precisely as long as the entire paper is in the frame.
- I shoot images of documents in relatively high resolution and full color; 5 or 7 megapixel is wise for standard 8.5×11” documents. (This can pose some problems for disk space, but I’ll talk about those in a moment.) If I’m in doubt about whether an item may be useful for me, I shoot it; disk space is cheaper than another research trip.
- I work with automation in mind: I try to orient the camera consistently on every shot, so that I can easily know how to rotate the resulting batch of images without having to examine each one individually.
- If I need to leave myself a note, like “this box only partially examined; start tomorrow at folder 3” I take a piece of scrap paper and a pencil, write the note, and take a picture of it. The image in my files is more useful for me than a separate notes file would be. (I do also sometimes keep a notes file to remind me about particular documents I see as I work.)
When this works well, I shoot until a memory card fills, then put the card in my card reader and start offloading the images to my computer while I shoot more images on the second memory card.
On the laptop: the challenges of image filing by the gigabyte
- Most Mac users would assume that iPhoto is the tool to use for image filing, but I don’t use it for anything other than organizing personal snapshots. I find that it’s slow and unwieldy for handling tens of gigabytes of images. Instead, I use OS X’s built-in Image Capture utility, which just dumps files from a memory card into the folder of your choice. That’s all I do while I’m actually at the library. I have a folder called “incoming images,” and I make a new subfolder with a basic description for each time I dump the memory card.
- Because the time sequence of photos is critical to how I work, I don’t like to trust the camera’s filenames, but I do trust its timestamps. I use a piece of automation software called Hazel (http://www.noodlesoft.com/hazel.php) to rename files with consistent names based on when they were first taken. Hazel watches my “incoming images” folder for new files and renames them automatically within a few minutes.
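For readers without Hazel, the renaming step can be approximated in a few lines of Python. This is a minimal sketch, not Hazel's behavior: it assumes the file's modification time matches the moment the photo was taken (usually true for files copied straight off a card), and the function names and the filename pattern are my own illustrative choices.

```python
from pathlib import Path
import time


def timestamp_stem(path):
    """Sortable name stem built from the file's modification time.

    Assumption: the modification time stands in for the capture time.
    """
    taken = time.localtime(path.stat().st_mtime)
    return time.strftime("%Y-%m-%d %H-%M-%S", taken)


def rename_by_timestamp(folder):
    """Rename every JPEG in `folder`; return the new paths in shooting order."""
    folder = Path(folder)
    images = sorted(
        (p for p in folder.iterdir() if p.suffix.lower() in {".jpg", ".jpeg"}),
        key=lambda p: p.stat().st_mtime,
    )
    renamed = []
    for image in images:
        base = timestamp_stem(image)
        suffix = image.suffix.lower()
        target = folder / f"{base}{suffix}"
        counter = 1
        # Two shots within the same second: add a counter instead of overwriting.
        while target.exists() and target != image:
            target = folder / f"{base}-{counter}{suffix}"
            counter += 1
        image.rename(target)
        renamed.append(target)
    return renamed
```

Calling `rename_by_timestamp("incoming images/day 1")` (a hypothetical subfolder name) would leave files whose alphabetical order matches the order in which they were shot.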
After I leave the library for the day, I try to make an hour while my brain’s still fresh for filing my day’s findings.
- Now I’ve got a big folder full of files. I break it up piece by piece, sorting the images by name and building a folder hierarchy that replicates the arrangement of the physical collection by series, box, and folder. This helps me retain all the citation information I need later when I start taking notes and writing.
- I’ve built a little script that helps me do this. It’s AppleScript with a little shell scripting. It walks me through the process of previewing images so that I can tell where each box and folder begins and ends, and it asks for the name of the new folder to put each group of files in.
- When I’m filing my images, I also use OS X’s Automator to rotate images into the proper orientation so that I can read them on my screen. Sometimes I add little bits of description in the filenames of particularly interesting items I want to remember.
- Once the images are in a basic state of order, I back them up to a CD or DVD, and I move them into my “dissertation” folder, which is backed up by a cloud backup service (JungleDisk).
- Once the backup’s done, I sometimes use Automator or command-line image tools to reduce the size of images in a particular folder, and if I do that I add a descriptive phrase to the filename: “DSCN0021 halfsize.jpg” so that I know that there’s a higher-resolution version somewhere if I need it.
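The filing step above, moving one run of images into a folder hierarchy that mirrors series, box, and folder, might look something like this in Python. It is a sketch under stated assumptions rather than my actual AppleScript: the `Box N` and `Folder N` naming, the function name, and the series label in the example are all hypothetical, and it leaves out the interactive previewing.

```python
import shutil
from pathlib import Path


def file_group(images, archive_root, series, box, folder):
    """Move one run of images into archive_root/series/Box N/Folder N.

    The directory naming scheme is an illustrative assumption.
    """
    destination = Path(archive_root) / series / f"Box {box}" / f"Folder {folder}"
    destination.mkdir(parents=True, exist_ok=True)
    moved = []
    for image in images:
        target = destination / Path(image).name
        if target.exists():
            # Refuse to overwrite: a clash usually means a filing mistake.
            raise FileExistsError(f"{target} already exists; not overwriting")
        shutil.move(str(image), str(target))
        moved.append(target)
    return moved
```

For example, `file_group(run_of_images, "dissertation/RG102", "Series 1", 265, 3)` would place the files under `dissertation/RG102/Series 1/Box 265/Folder 3`, so the path itself carries the citation information.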
That’s all I do while I’m traveling. I’m at the library as much as I need to be, but I can hammer through an entire archival cart in a day. As with other research marathons, it helps to eat, stay hydrated, and take breaks regularly throughout the day.
After the research trip: the problems of indexing and searching
Back at home, the real work begins, and here’s where the problems become more intellectual. If you pursue a digital-camera strategy, you inevitably have to tackle the problem of indexing your entire source base and doing it in a way that’s meaningful to your subject.
One possibility, which has become usable only in the last few months, is Evernote, cloud-based software that will do optical character recognition (OCR) on any image you store within it. This is particularly handy if you’re working with a collection that has lots of typed material. The Children’s Bureau papers contain lots of manuscript letters, which you’d need to transcribe or at least to provide keywords for if you were going to rely on Evernote. I’ve also found that Evernote is slower than I want it to be.
For finding what I’m looking for within photo collections, I’ve relied mostly on OS X’s built-in search software, called Spotlight. When I’m taking notes on a particular part of a collection, not every image will be useful. For the ones that are worth paying attention to, I add descriptive, search-friendly words to the filename. It’s hard to learn how much detail and time to put into this, but I try to make phrases distinctive enough to jog my memory when I read the list. For searchability, I also try to adopt consistent terminology: state abbreviations in capital letters, rather than a mix of “Georgia” and “GA”. How you describe your images will depend on what your research questions are.
A word about tags:
Evernote has tagging features built in, and there are several software packages that add tagging features on top of OS X. I’ve found them useful, but also slow. Because they use a single namespace, there’s no good way to separate author tags from subject tags. If you’re using a tag-based approach, I recommend tagging with prefixes: “SU) Mothers” for subjects, “PN) Grace Abbott” for a document where a person’s name appears, and so on.
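To show why prefixes help, here is a small Python sketch that splits a flat list of prefixed tags back into separate namespaces. The `SU)` and `PN)` prefixes come from the examples above; the mapping table, the function name, and the "unprefixed" bucket are illustrative assumptions, not features of Evernote or any tagging package.

```python
from collections import defaultdict

# Prefix-to-namespace table; extend it with whatever prefixes you adopt.
PREFIXES = {"SU": "subject", "PN": "person"}


def split_tags(tags):
    """Group tags like 'SU) Mothers' by the namespace their prefix names."""
    grouped = defaultdict(list)
    for tag in tags:
        prefix, separator, value = tag.partition(")")
        kind = PREFIXES.get(prefix.strip().upper()) if separator else None
        if kind is None:
            # No recognized prefix: keep the tag whole in its own bucket.
            grouped["unprefixed"].append(tag.strip())
        else:
            grouped[kind].append(value.strip())
    return dict(grouped)
```

Because the prefix sorts first, tags of the same kind also cluster together in any alphabetical tag list, even in software that knows nothing about namespaces.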
Research as usual?
From here on out, you could follow whatever processes you’ve learned for taking notes and compiling research. By and large, that’s what I’ve done, but I know it’s possible to do better. As I see them, historians’ intellectual needs fall into three categories, by type of software: desktop, cloud-based, and social.
- Customizable metadata fields at a filesystem level. I know that Spotlight can do this, but I haven’t had the time to code it. For letters, these might be fields like “author,” “author location,” “recipient,” “recipient location,” “date.” Standard photo metadata in OS X can include geocoded data for where the photo was taken; researchers who study correspondence networks might want to have geocoding fields so they can map correspondence over time.
- Better citation software for handling bulk source collections. Using this filesystem form of storage, it ought to be easy enough to generate most of the collection-location parts of a citation by walking through a folder hierarchy. What I want to be able to do is look at a file using Quick Look, hit a key, and have a citation appear in my word processor, just as I can with a print source stored in EndNote or Bookends. Most citation software really doesn’t play nicely with the formats required for citing archival materials, though Zotero is close.
- Better tools to organize materials by date. Without date-coding of individual letters arranged in a database, it’s difficult to do the kinds of analysis over time that historians rely on. The SIMILE project at MIT has some great timelining tools, and Zotero uses them a little, but not nearly enough.
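The citation idea in the list above can be sketched in Python: read the series, box, and folder back out of an image's path and assemble the location part of a citation. This is only an illustration of the principle. It assumes a series/box/folder directory layout, and the citation wording and function name are my own placeholders, not any style guide's required format.

```python
from pathlib import Path

# Collection description taken from this talk; the phrasing is illustrative.
COLLECTION = (
    "Children's Bureau Papers, Record Group 102, "
    "National Archives and Records Administration, College Park, Maryland"
)


def citation_from_path(image_path, archive_root, collection=COLLECTION):
    """Derive the collection-location part of a citation from a file path.

    Assumes the layout archive_root/<series>/<box>/<folder>/<image file>.
    """
    parts = Path(image_path).relative_to(archive_root).parts
    if len(parts) != 4:
        raise ValueError("expected series/box/folder/file under the archive root")
    series, box, folder, _filename = parts
    return f"{folder}, {box}, {series}, {collection}."
```

The author, recipient, and date of an individual letter would still have to come from notes or filenames; only the location within the collection falls out of the hierarchy for free.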
Evernote and similar information-organizing databases need more robust tagging and searching features, including the ability to separate namespaces for tags and to assign historical creation dates to particular items. In short, they need the ability to handle more structured data without compromising the freeform data-handling features they already provide.
Everything I’ve described up to this point is one individual’s work, very much as most historians are trained to do it. However, I think that social software has significant potential to change how historians do archival research. We, both research scholars and archivists, should be making much, much wider use of collaborative strategies. (I know that social software opens up questions of data trustworthiness, authority and control, particularly around organization and metadata, but I want to bracket those concerns for a moment.)
For a collection like the Children’s Bureau papers, and probably for many other large, publicly-held collections, there’s a core group of professional scholars who work with the material regularly. Depending on the collection, sometimes research scholars know as much or more about what we’re seeing as do the archivists who work with it. For the Children’s Bureau papers, the finding aids are more like box lists; because of the scale of the collection, even a folder-level description would take professionals years. The budget to do that work just doesn’t exist.
Open-government advocate Carl Malamud has argued that the federal government should fund archival digitization in a kind of “digital Works Projects Administration.” Until that happens, which is far from certain, I think that a crowdsourcing strategy for putting NARA collections online is far more viable than the commercial-outsourcing strategies NARA has pursued to date. (These have usually allowed vendors like Footnote.com to digitize major collections and sell access to those collections for a fee in exchange for providing free access to the digital collections from within NARA facilities.)
I’d like to see collaborative software for historical archives that lets historians and other collection users build group-authored finding aids and collection guides. If I’ve looked through Box 265, shot images of most of it, and know what’s there, I’d like to be able to share that information easily with other scholars—at least the metadata, if not the images themselves. They’re not preservation-quality, but they’re good enough for research, and NARA has no plans to digitize this collection anytime soon.
Towards the future
For the Children’s Bureau papers, a crowdsourcing strategy for digitization opens several other possibilities for innovative research:
- First, with adequately-curated metadata, researchers could mark off particular sets of documents for use in a number of ways, including sharing with a research group, exporting to citation software, exposing their sources to scholarly reviewers of a manuscript, classroom use, or large-scale quantitative and geographic analysis. (This last item is particularly important; everyone who’s written about the Children’s Bureau papers talks about how the size of the collection defies quantitative work. Over a number of years, an online scholarly edition of the Children’s Bureau papers could enable exactly those kinds of research.)
- Secondly, any social software for archival collections needs to allow users to set their own personalized metadata alongside the standard metadata offered by the collection software. Perhaps I want to collect a group of letters by mothers about infant mortality; I’d like to have a custom field, “cause of baby’s death,” so I can understand how mothers’ responses to children’s deaths varied. Existing social software for bibliography collection, like CiteULike and Zotero, allows personalized, private keywords; I’m describing an extension of that, with multiple personalized fields relevant to each user’s research.
- Finally, and most excitingly, social software for archives allows a collection’s value to be enhanced over time with other technologies. To enable fulltext searching of images, OCR or crowdsourced-transcription extensions could be added. Interlinking the collection software with Geographic Information Systems (GIS) tools could also allow historians to map the locations of letter-writing mothers and their social networks. Interlinking it with a database of manuscript census records could allow researchers to explore and share their discoveries about a particular letter-writer’s whereabouts and family conditions beyond what the letter reveals.
So, although I’ve described how to do individual forms of digital-photo research, I think we’ll reap greater rewards by building software infrastructures to tie these individual efforts together into a larger fabric of scholarly community.