HTRC API

HathiTrust Digital Library
HathiTrust Research Center
- The HathiTrust Research Center (HTRC) enables computational analysis of the HathiTrust corpus.
- Log in with your partner institution (SMU) account to access the largest number of volumes and features.
- A list of HTRC tutorials walks you through the steps for using HTRC tools and data.

HTRC for Text Analysis#

HTRC Algorithms are web-based, click-and-run tools to perform computational text analysis on volumes in the HathiTrust Digital Library.

In a basic text analysis workflow, a researcher:
- Gathers digitized text (text that has been scanned and OCR-ed). Note: OCR (optical character recognition) is the mechanical or electronic conversion of images of typed, handwritten, or printed text into machine-encoded text.
- Applies computational methods to that text, such as word counts, classification techniques, and topic modeling
- Analyzes the results generated by the algorithm or technique
The HathiTrust Research Center (HTRC) enters this workflow at two points: it provides digitized text at scale from the HathiTrust Digital Library (HTDL), and it provides tools and services that enable computational research.
The researcher, of course, still brings her own analysis to bear on the results.
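
The three workflow steps above can be sketched in a few lines of Python. This is only an illustration, not an HTRC tool: the sample sentence stands in for OCR-ed text, and `count_words` is a hypothetical helper that performs the simplest method named above, a word count.

```python
import re
from collections import Counter


def count_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Step 1: gather digitized text (a hard-coded stand-in for OCR output).
sample_text = "The whale surfaced. The crew watched the whale."

# Step 2: apply a computational method (here, a simple word count).
counts = count_words(sample_text)

# Step 3: hand the results to the researcher for interpretation.
for word, n in counts.most_common(3):
    print(word, n)
```

Real OCR text is noisier than this sample, so a production workflow would usually add cleanup steps such as removing running headers or fixing hyphenation before counting.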

API (application programming interface)#

An application programming interface (API) is a way for two or more computer programs or components to communicate with each other.

Additional explanations from technology companies:

Postman: video explanation of APIs, text explanation of APIs
IBM: text explanation of APIs

Example APIs: loc.gov

For example, look at these pages from the Library of Congress:
The main page is meant for people to interact with.
The APIs for LoC.gov page describes how the site “makes information available via a series of application programming interfaces (APIs)”. These APIs retrieve information from loc.gov in a format meant for machines and programs rather than for human readers.
Library of Congress Data Exploration
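
To make the difference between the human-facing page and the API concrete, here is a minimal sketch that builds a loc.gov request asking for JSON instead of HTML. It assumes the `fo=json` query parameter described in the loc.gov API documentation; the helper names `build_loc_url` and `fetch_json` and the search term are illustrative only. The network call is defined but not executed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_loc_url(path: str, **params: str) -> str:
    """Build a loc.gov URL that requests JSON (fo=json) rather than an HTML page."""
    query = urlencode({**params, "fo": "json"})
    return f"https://www.loc.gov/{path.strip('/')}/?{query}"


def fetch_json(url: str) -> dict:
    """Download the URL and parse the response body as JSON (needs network access)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


# Same search a person could run in the browser, but formatted for a program.
url = build_loc_url("search", q="hathitrust")
print(url)

# Uncomment to send the request:
# data = fetch_json(url)
# print(list(data.keys()))
```

Opening the printed URL without `fo=json` shows the page meant for people; with it, the same information comes back as structured data a program can parse.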

HTRC Extracted Features (EF) API#

The HTRC Extracted Features Dataset v.2.0 is composed of page-level features for 17.1 million volumes in the HathiTrust Digital Library. This version contains non-consumptive features for both public-domain and in-copyright books. Features include part-of-speech tagged term token counts, header/footer identification, marginal character counts, and much more.
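
As a rough illustration of what "page-level, part-of-speech tagged token counts" means in practice, the sketch below sums token counts across pages. The `sample_pages` data is hand-built, and its layout (a `body` section holding a `tokenPosCount` mapping of token to part-of-speech tag to count) is an assumption modeled on the EF documentation, so check the documentation before relying on field names. `volume_token_counts` is a hypothetical helper, not part of any HTRC library.

```python
from collections import Counter

# Hand-built stand-in for two pages of one volume (assumed EF-like layout).
sample_pages = [
    {"seq": "00000001",
     "body": {"tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 3}}}},
    {"seq": "00000002",
     "body": {"tokenPosCount": {"whale": {"NN": 1}, "Whale": {"NNP": 1},
                                "swim": {"VB": 1, "NN": 1}}}},
]


def volume_token_counts(pages, section="body", case_sensitive=False):
    """Sum per-page token counts into volume-level totals, ignoring POS tags."""
    totals = Counter()
    for page in pages:
        # Pages may lack a section or have an empty one, so fall back to {}.
        token_pos = (page.get(section) or {}).get("tokenPosCount") or {}
        for token, pos_counts in token_pos.items():
            key = token if case_sensitive else token.lower()
            totals[key] += sum(pos_counts.values())
    return totals


totals = volume_token_counts(sample_pages)
for token, n in totals.most_common():
    print(token, n)
```

Because the features are counts rather than running text, this kind of analysis works the same way for public-domain and in-copyright volumes.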

HTRC Extracted Features (EF) Dataset
A full explanation of the dataset’s features, motivation, and creation is available at the EF Dataset documentation page: HTRC Extracted Features (EF) Documentation
EF API