HTRC for Text Analysis#

  • HTRC Algorithms are web-based, click-and-run tools to perform computational text analysis on volumes in the HathiTrust Digital Library.

HTRC for text analysis

  • In a basic text analysis workflow, a researcher:

    • Gathers digitized text (text that has been scanned and OCR-ed). Note: OCR (optical character recognition) refers to the mechanical or electronic conversion of images of typed, handwritten, or printed text into machine-encoded text.

    • Applies computational methods to that text, such as word counts, classification techniques, and topic modeling

    • Analyzes the results generated by the algorithm or technique

  • The HathiTrust Research Center (HTRC) enters this workflow at two points: it provides digitized text at scale from the HathiTrust Digital Library (HTDL), and it provides tools and services that enable computational research.

  • The researcher, of course, still brings her own analysis to bear on the results.
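
To make the "applies computational methods" step concrete, here is a minimal sketch of the simplest method mentioned above, a word count. The sample sentence and the `word_counts` helper are made up for illustration; they are not part of any HTRC tool, and real OCR-ed text would need more careful cleaning.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lowercase alphabetic tokens in a string of digitized text."""
    # Lowercase first so "The" and "the" are counted together.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Illustrative sample only; a real workflow would read OCR-ed volume text.
sample = "The whale, the white whale! The ship sailed on."
counts = word_counts(sample)

# Show the three most frequent words for the researcher to interpret.
print(counts.most_common(3))
```

The counts are only the output of the method; deciding what a frequent word means for the research question is still the researcher's job.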

API (application programming interface)#

An application programming interface (API) is a way for two or more computer programs or components to communicate with each other.

Additional explanation from technology companies:

Example APIs:

  • For example, look at these pages from the Library of Congress

  • The main page is meant for people to interact with.

  • The APIs page, where the Library of Congress “makes information available via a series of application programming interfaces (APIs),” is meant for programs: it returns information in a format that a machine can read and a programming interface can work with.

  • Library of Congress Data Exploration
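
The difference between a page for people and an API for programs can be sketched in a few lines of Python. This example assumes that loc.gov returns JSON when `fo=json` is added to a URL, which is our understanding of the Library of Congress API; check their documentation before relying on it. The helper names `build_loc_url` and `fetch_json` are hypothetical, and the live request is left commented out so the sketch runs without a network connection.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_loc_url(path: str, **params: str) -> str:
    """Build a loc.gov URL that asks for JSON instead of an HTML page."""
    # Assumption: "fo=json" is the parameter that requests machine-readable output.
    query = urlencode({"fo": "json", **params})
    return f"https://www.loc.gov/{path.strip('/')}/?{query}"


def fetch_json(url: str) -> dict:
    """Send the request and parse the response body as JSON."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


url = build_loc_url("collections")
print(url)

# Uncomment to make a live request (requires internet access):
# data = fetch_json(url)
# print(list(data.keys()))
```

Opening the same address without the `fo=json` part in a browser shows the human-facing page, which is a quick way to compare the two views.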

HTRC Extracted Features (EF) API#

The HTRC Extracted Features Dataset v.2.0 is composed of page-level features for 17.1 million volumes in the HathiTrust Digital Library. This version contains non-consumptive features for both public-domain and in-copyright books. Features include part-of-speech tagged term token counts, header/footer identification, marginal character counts, and much more.
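
To give a feel for what "page-level features" look like, the sketch below adds up part-of-speech tagged token counts across the pages of one volume. The `pages` data is a hypothetical, heavily simplified stand-in written by hand for illustration: the field names (`seq`, `body`, `tokenPosCount`) follow our understanding of the Extracted Features format, but real files contain many more fields and should be checked against the HTRC documentation.

```python
from collections import Counter

# Hypothetical, simplified stand-in for two pages of one volume.
# Each token maps to its counts per part-of-speech tag on that page.
pages = [
    {"seq": "00000001",
     "body": {"tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 3}}}},
    {"seq": "00000002",
     "body": {"tokenPosCount": {"whale": {"NN": 1}, "sailed": {"VBD": 1}}}},
]


def volume_token_counts(pages: list) -> Counter:
    """Sum token counts over all pages, ignoring part-of-speech tags."""
    totals = Counter()
    for page in pages:
        for token, pos_counts in page["body"]["tokenPosCount"].items():
            totals[token] += sum(pos_counts.values())
    return totals


totals = volume_token_counts(pages)
print(totals)
```

Because only counts are stored, not the running text, this kind of analysis works the same way for public-domain and in-copyright volumes.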



All HTRC Tutorials#

Further Learning#

  • APIs

Some content in this session is based on HTRC Digging Deeper, Reaching Further, used under a Creative Commons Attribution-NonCommercial 4.0 International License.