HathiTrust Research Center

The HathiTrust is a multi-institutional effort to build a digital library based on images from the research libraries that were initially involved in the Google Books project. Those original institutions and newer partners, including Emory, will keep adding to it as they digitize more of their holdings. The goal is to create a well-organized, accessible resource that will be especially valuable to scholars and open to the world (the full mission statement is on the HathiTrust website).

As it currently exists, the HathiTrust is valuable for researchers looking for digital access to specific titles. With the creation of the HathiTrust Research Center (HTRC), however, scholars will soon be able to use a set of text-analysis tools to conduct cutting-edge research, such as what Stanford-based researcher Franco Moretti calls “distant reading.”

Traditional “close reading” focuses on minute details in a poem or novel: a word, a line, a couplet. Distant reading, by contrast, uses algorithms to look for patterns over hundreds, thousands, even millions of books. By looking at these patterns, scholars can begin to ask different kinds of questions. How does word usage evolve over time? How does one national literature differ from another?
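To make the first of those questions concrete, here is a minimal sketch in Python of how word usage over time might be measured. It is not an HTRC tool: the four-sentence corpus, the function names, and the 50-year grouping are all invented for illustration. A real study would run the same idea over thousands of volumes.

```python
from collections import defaultdict
import re

# Hypothetical toy corpus: (publication year, text) pairs invented for illustration.
corpus = [
    (1815, "The carriage waited while the horse rested by the road."),
    (1848, "A carriage and a railway train passed the old road."),
    (1892, "The railway carried them faster than any carriage could."),
    (1925, "The motor car left the railway station behind."),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def relative_frequency_by_period(corpus, word, period=50):
    """Return {period start year: share of all tokens that equal `word`}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in corpus:
        start = year - year % period  # e.g. 1848 falls in the period starting 1800
        tokens = tokenize(text)
        totals[start] += len(tokens)
        hits[start] += tokens.count(word)
    return {start: hits[start] / totals[start] for start in sorted(totals)}

trend = relative_frequency_by_period(corpus, "railway")
for start, share in trend.items():
    print(start, round(share, 3))
```

Dividing by the total number of words in each period matters: raw counts would mostly reflect how much text survives from each period, not how usage changed.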

Scholars interested in doing this kind of research face two challenges. First, they need a machine-readable “corpus” of text, and second, they need to learn how to use the text-analysis tools. Neither task is particularly simple. Creating a digital corpus often involves scanning and OCR-ing a selection of texts and, if you have time, proofreading them. Once that is complete, working with the variety of tools commonly used to analyze texts can be very complex. Existing software is sophisticated and often requires above-average ability with computer systems and statistical methods.

The HTRC aims to help scholars on both fronts. The HathiTrust is already composed of machine-readable text, so scholars simply choose which texts they want to analyze and add them to their “workset.” A workset can contain one text or hundreds; the only limit is server capacity. Next, scholars can choose from a number of analysis tools in the Meandre suite, such as Simile timelines, tag clouds, and topic modeling. Simply choose which algorithm you want to use, select your workset, set some limits, and press submit. Processing time depends on the complexity of the algorithm and the size of the workset. Very large and complex jobs can take several hours, but most will run in a matter of minutes.
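For readers curious about what happens after pressing submit, the sketch below shows in Python the kind of computation that sits behind a tag cloud: pick a workset, count words, keep the most frequent. It does not use the HTRC interface or any of its actual code. The volume IDs, the sample sentences, the stopword list, and the `tag_cloud_counts` function are all hypothetical.

```python
from collections import Counter
import re

# Hypothetical stand-in for machine-readable volumes; IDs and text are invented.
library = {
    "vol-001": "The whale surfaced and the whale dived beneath the ship.",
    "vol-002": "The ship sailed on; the captain watched the sea.",
    "vol-003": "A garden in spring, far from any sea or ship.",
}

# A tiny illustrative stopword list; real tools use much longer ones.
STOPWORDS = {"the", "and", "a", "on", "in", "from", "any", "or"}

def tag_cloud_counts(library, workset, top_n=3):
    """Count non-stopword terms across the volumes named in the workset."""
    counts = Counter()
    for volume_id in workset:
        words = re.findall(r"[a-z]+", library[volume_id].lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)  # the "limit" a user might set

# The workset is just the list of volumes chosen for analysis.
workset = ["vol-001", "vol-002"]
top_terms = tag_cloud_counts(library, workset, top_n=2)
print(top_terms)
```

A tag cloud would then draw each term at a size proportional to its count. Because every volume in the workset has to be read and tokenized, run time grows with workset size, which is consistent with larger jobs taking longer.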

The HTRC is currently in beta testing and should be available to the public sometime next year.
