HathiTrust and Text Mining

Working with HathiTrust

As members of the HathiTrust community, George Washington University faculty, students, and staff may use the HathiTrust corpus of digitized books for research and educational computational analysis. The full corpus can be searched using the HathiTrust Digital Library (HTDL). Many of the books are in the public domain, and their full text is readily available. For books still in copyright, HTDL makes available only the book's descriptive metadata (though there is a way to work with in-copyright materials, described below). HTDL uses your GW UserID and password for access. At the HTDL site, click the Login button and select The George Washington University (note the initial "The").

The HathiTrust Research Center (HTRC) supports researchers performing computational analysis on the corpus. HTRC requires you to create a separate, free account. At a basic level, you can create a workset of books in HTDL and import it into HTRC to run basic algorithms. At the advanced level, you can work with HTRC to gain access to the entire HathiTrust corpus, including materials still in copyright, for use in non-consumptive research** activities.

Scroll down to read more about the following options:

Web-based Algorithms
Datasets for Non-Consumptive Research**
Data Capsules for Non-Consumptive Research

** From the 2010 Authors Guild v. Google amended settlement agreement: "Non-Consumptive Research means research in which computational analysis is performed on one or more Books, but not research in which a researcher reads or displays substantial portions of a Book to understand the intellectual content presented within the Book." Non-consumptive analytics includes image analysis, text extraction, textual analysis and information extraction, linguistic analysis, automated translation, and indexing and search. Read more in HathiTrust's Non-Consumptive Use Research Policy.

Getting Started Guide

HTRC's documentation and FAQ to get you started.

Introduction to the HathiTrust Research Center (2019) (video)

1. Web-based Algorithms (Public Domain Books)

At a basic level, you can run prebuilt ("canned") algorithms for quick analysis on small worksets of books you have gathered from the HathiTrust Digital Library.

Open HTDL and HTRC and log in to both.
In HTDL, build a collection from the public domain volumes in the HathiTrust Digital Library. Upload your workset into HTRC.
In HTRC, choose one of the web-based algorithms and execute it. You will be prompted to select a workset (your own or a publicly available one).

Note: This approach does NOT include in-copyright works.

HathiTrust Digital Library

This is the digital preservation repository and access platform. It provides long-term preservation and access services for public domain and in-copyright content from a variety of sources, including Google, the Internet Archive, Microsoft, and in-house partner institution initiatives.

HathiTrust Research Center (HTRC) Analytics

Supports large-scale computational analysis of works in the HathiTrust Digital Library to facilitate non-profit and educational research. Sign up for a free account.

2. Research Datasets

HTRC releases research datasets to facilitate text analysis using the HathiTrust Digital Library. While copyright-protected texts are not available for download from HathiTrust, research can still be performed through non-consumptive analysis of features extracted from the full text. For example, n-grams from over 13 million volumes in the HTDL are available to analyze in the computing environment of your choice.

Extracted features include volume-level metadata, page-level metadata, part-of-speech-tagged tokens, and token counts.
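
As a rough illustration of what working with page-level token counts can look like, the Python sketch below sums per-page, part-of-speech-tagged token counts into volume-level totals. The record layout and field names (metadata, pages, seq, tokenPosCount) and the sample values are assumptions made up for this example, not the official HTRC schema; consult the HTRC Derived Datasets documentation for the actual file format.

```python
from collections import Counter

# Hypothetical, simplified volume record. The field names and values are
# assumptions for illustration only, not the official HTRC schema.
volume = {
    "metadata": {"title": "Example Volume", "pubDate": "1890"},
    "pages": [
        {"seq": 1, "tokenPosCount": {"whale": {"NN": 3}, "the": {"DT": 7}}},
        {"seq": 2, "tokenPosCount": {"whale": {"NN": 1, "VB": 1}, "sea": {"NN": 2}}},
    ],
}


def volume_token_counts(vol):
    """Sum page-level token counts into one volume-level Counter."""
    totals = Counter()
    for page in vol["pages"]:
        for token, pos_counts in page["tokenPosCount"].items():
            # Collapse the part-of-speech breakdown into a single count.
            totals[token.lower()] += sum(pos_counts.values())
    return totals


def volume_pos_counts(vol):
    """Sum counts per part-of-speech tag across all pages."""
    totals = Counter()
    for page in vol["pages"]:
        for pos_counts in page["tokenPosCount"].values():
            totals.update(pos_counts)
    return totals


token_totals = volume_token_counts(volume)
pos_totals = volume_pos_counts(volume)
print(token_totals.most_common(3))
print(pos_totals.most_common(3))
```

Because only counts are stored, and not the running text, this kind of analysis stays within the non-consumptive model: word frequencies can be compared across volumes without reading or displaying the books themselves.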

HathiTrust Research Center (HTRC) Research Datasets

HTRC Derived Datasets: Information about Extracted Features, including use cases.

3. Data Capsules

The HTRC Data Capsule gives a researcher a secure virtual computer for non-consumptive analytical access to the full OCR text of the works in the HathiTrust Digital Library. Data Capsules are restricted environments, particularly in how and when products created by analysis tools can leave the capsule: any data product must undergo results review before it is released. To get started with Data Capsule, check out the tutorial below.

Data Capsule Tutorial

Hands-on instructions to introduce the HTRC Data Capsule tool.

Text modified from HathiTrust and Text Mining Guide -- UC Santa Cruz.
