HTRC API#
The HathiTrust Research Center (HTRC) enables computational analysis of the HathiTrust corpus.
Log in with your partner institution (SMU) account to access the largest number of volumes and features.
A list of HTRC tutorials is available to walk you through the steps for using HTRC tools and data.
HTRC for Text Analysis#
HTRC Algorithms are web-based, click-and-run tools to perform computational text analysis on volumes in the HathiTrust Digital Library.
In a basic text analysis workflow, a researcher:
Gathers digitized text (text that has been scanned and OCR-ed). Note: OCR (optical character recognition) refers to the mechanical or electronic conversion of images of typed, handwritten, or printed text into machine-encoded text.
Applies computational methods to that text, such as word counts, classification techniques, and topic modeling.
Analyzes the results generated by the algorithm or technique.
The HathiTrust Research Center (HTRC) enters this workflow at two points: it provides digitized text at scale from the HathiTrust Digital Library (HTDL), and it provides tools and services that enable computational research.
The researcher, of course, still brings her own analysis to bear on the results.
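The three workflow steps above can be sketched in a few lines of Python. This is a minimal illustration, not an HTRC algorithm: the sample sentence stands in for OCR output, and `count_words` is a hypothetical helper name chosen for this example.

```python
import re
from collections import Counter


def count_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Step 1: gather digitized text (a short stand-in for OCR output).
sample_text = "The whale, the sea, and the ship. The sea was calm."

# Step 2: apply a computational method (here, simple word counts).
counts = count_words(sample_text)

# Step 3: inspect the results; interpretation is still up to the researcher.
for word, n in counts.most_common(3):
    print(word, n)
```

Real OCR text is noisier than this sample, so a production workflow would usually add cleanup steps (for example, removing running headers or rejoining hyphenated words) before counting.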
API (application programming interface)#
An application programming interface (API) is a way for two or more computer programs or components to communicate with each other.
Additional explanation from technology companies:
Example APIs: loc.gov
For example, look at these pages from the Library of Congress:
The main page is meant for people to interact with.
The APIs for loc.gov page says the site “makes information available via a series of application programming interfaces (APIs)”. It is meant for retrieving information from loc.gov in a format suited to machines and programming interfaces.
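The difference between the human-facing page and the machine-facing API can be seen in the URLs themselves. The sketch below assumes the loc.gov convention of adding `fo=json` to a request to get JSON instead of HTML; `build_loc_url` and `fetch_json` are hypothetical helper names, and the search term is only an example. Nothing is downloaded unless you call `fetch_json` yourself.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

LOC_BASE = "https://www.loc.gov"


def build_loc_url(path: str, query: str, as_json: bool) -> str:
    """Build a loc.gov URL; with as_json=True, request the JSON view."""
    params = {"q": query}
    if as_json:
        # Assumption: loc.gov returns JSON when fo=json is in the query string.
        params["fo"] = "json"
    return f"{LOC_BASE}/{path.strip('/')}/?{urlencode(params)}"


def fetch_json(url: str) -> dict:
    """Download a URL and parse the response body as JSON (needs network access)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


# The same search, once for a person in a browser and once for a program.
human_url = build_loc_url("search", "baseball", as_json=False)
machine_url = build_loc_url("search", "baseball", as_json=True)
print(human_url)
print(machine_url)
```

Opening the first URL in a browser shows a page designed for reading. Passing the second to `fetch_json` would return structured data that a program can loop over.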
HTRC Extracted Features (EF) API#
The HTRC Extracted Features Dataset v.2.0 is composed of page-level features for 17.1 million volumes in the HathiTrust Digital Library. This version contains non-consumptive features for both public-domain and in-copyright books. Features include part-of-speech tagged term token counts, header/footer identification, marginal character counts, and much more.
A full explanation of the dataset’s features, motivation, and creation is available at the EF Dataset documentation page: HTRC Extracted Features (EF) Documentation
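To make "page-level features" concrete, here is a small sketch that adds up part-of-speech tagged token counts across the pages of one volume. The sample dictionary is invented, and its field names (`features`, `pages`, `body`, `tokenPosCount`) reflect an assumed layout of an EF file; check them against the EF documentation before relying on them. `volume_token_counts` is a hypothetical helper, not part of any HTRC library.

```python
from collections import Counter

# Invented sample shaped like an assumed EF volume record (not real data).
sample_volume = {
    "htid": "example.0001",  # hypothetical volume identifier
    "features": {
        "pages": [
            {"seq": "00000001",
             "body": {"tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 3}}}},
            {"seq": "00000002",
             "body": {"tokenPosCount": {"whale": {"NN": 1}, "swims": {"VBZ": 1}}}},
        ]
    },
}


def volume_token_counts(volume: dict) -> Counter:
    """Sum token counts over all pages, ignoring part-of-speech tags."""
    totals = Counter()
    for page in volume["features"]["pages"]:
        body = page.get("body") or {}
        token_pos = body.get("tokenPosCount") or {}
        for token, pos_counts in token_pos.items():
            # Each token maps to {POS tag: count}; collapse the tags.
            totals[token.lower()] += sum(pos_counts.values())
    return totals


totals = volume_token_counts(sample_volume)
print(totals.most_common())
```

Because only counts are stored, not the running text, this kind of analysis works the same way for public-domain and in-copyright volumes.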
Worksets#
TORCHLITE#
All HTRC Tutorials#
Further Learning#
APIs
Some content in this session is based on HTRC Digging Deeper, Reaching Further, used under a Creative Commons Attribution-NonCommercial 4.0 International License.