Text data mining (TDM)#

* What is Data?
* What is Text data mining (TDM)?
* What is Text as Data?
* What are Types of Text Mining?

What is Data?#

  • We begin without data. Then it is observed, or made, or imagined, or generated. After that, it goes through further transformations.

    • “What are the differences between data, a dataset, and a database?

      • Data are observations or measurements (unprocessed or processed) represented as text, numbers, or multimedia.

      • A dataset is a structured collection of data generally associated with a unique body of work.

      • A database is an organized collection of data stored as multiple datasets. Those datasets are generally stored and accessed electronically from a computer system that allows the data to be easily accessed, manipulated, and updated.” - Definition via USGS

  • For the purposes of this workshop, we recommend finding an already existing data set for your project, as creating, cleaning and/or structuring a new dataset is often time and labor intensive.

    • Remember that just because data may be available digitally, it does not automatically exist as a dataset. You may have to do work manually (copying and pasting into a spreadsheet) or computationally (scraping the data) to create a dataset usable for computational analysis.

  • Read more about Data Prep and Cleaning and Cleaning Text Data

data life cycle

Forms of data#

There are many ways to represent data, just as there are many sources of data. For the purposes of this series we are focusing on already digitized text.

Raw data#

  • Raw data is yet to be processed, meaning it has not yet been manipulated by a human or computer. Received or collected data could exist in any number of forms, formats, and locations.

  • But “raw data” is a relative term: when one person finishes processing data and presents it as a finished product, another person may take that product and work on it further, and for them that data is again “raw data”.

    • For example, is “big data” “raw data”? How do we understand data that we have “scraped”?

Processed/transformed#

  • Processing data puts it into a state more readily available for analysis, and makes the data legible. For instance it could be rendered as structured data. This can also take many forms, e.g., a table. We will discuss what this means for text data (OCR, Tokenizing, etc.).

Data mining#

  • Data mining is looking for patterns using computational methods, often across large data sets.

  • Computer algorithms can discern patterns in bodies of (often unstructured) text.

    • “Unstructured” means that little is known about the semantic meaning of the text data and that it does not fit a defined data model or database.

    • An algorithm is simply a computational process that creates an output from an input.

      • In text analysis, the input would be the unstructured text, and the output would be indicators to help you reveal different things about the text.

What is TDM?#

“The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts.”

  • What Is Text Mining? by Marti Hearst.

  • Text analysis can be used for a variety of purposes. It can be used for exploratory and analytical research, such as seeking out patterns in scientific literature to pick up trends in medical research otherwise difficult to see from the vantage point of an individual reader. It can also be used in developing tools that we use in our daily lives, for example creating spam filters to identify spam e-mail.

TDM projects#

For a TDM project you need to complete the following steps:

  • Identify a research question or topic of interest.

  • Data gathering or Data Search

    • Search for and identify a data set that is available for use and relevant to your research interests.

    • For an introductory project, we recommend you find and use an already existing data set, rather than creating and/or cleaning a new dataset.

  • Data preparation

    • The text needs to be transformed from a form that human readers are familiar with to something that the computer can “read”.

    • The text is broken into smaller pieces and abstracted (reduced) into things that a computer can process.

  • Text analysis

  • Counting is often what happens next.

    • Some of the things that are often counted include words, phrases, and parts of speech (POS).

    • The number of these counts can be used to identify characteristics of texts.

  • Then, researchers can apply computational statistics to the counts of textual features, and develop hypotheses based on these counts.

    • Do you have a specific question you are trying to answer or are you doing an exploratory project?

    • What type of analysis are you interested in? It may be useful to look for an article in your field on a topic you are interested in to see what the process will be like. The data and methodology sections are always useful to read to see where people are finding data and how they are analyzing it.

  • Visualization

  • What are the stages in text analysis? -AWS

How does it impact research?#

  • In a general sense, the shift in the researcher’s perspective leads to shifts in research questions. Text analysis techniques are sometimes called “distant reading.”

    • This is a term attributed to Franco Moretti, meaning “reading” literature not by studying particular texts, but by aggregating and analyzing massive amounts of texts and “reading” them at a “distance”. This scaling-up and “distancing” can bring out more insights from a very different vantage point.

    • It is also worth mentioning that text analysis doesn’t have to be the primary method in a research project. It may be just one step in the entire process, or it can be combined with close reading. This approach has been called “intermediate reading” or “distant-close reading”.

  • This shift in research perspective allows for new kinds of research questions to be asked, or for old questions to be “answered” in new ways. Here are some of the possibilities that text analysis can bring to researchers:

    • It can explore questions not provable by human reading alone

    • It allows larger corpora for analysis

    • It allows studies that cover longer time spans

  • Text analysis techniques are often best when combined with qualitative assessment and theoretical context.

  • Text analysis research questions explore a wide range of topics, from biomedical discovery to literary history. Research questions that are conducive for text analysis methods may involve these characteristics:

    • Change over time

    • Pattern recognition

    • Comparative analysis

    • Text analysis research examples from HTRC

      • How do the projects involve change over time, pattern recognition, or comparative analysis?

      • What kind of text data do they use (time period, source, etc.)?

      • What are their findings?

What kind of texts do I need for text analysis?#

Text analysis depends on having a large number of texts in an accessible format. Since many text analysis methods rely on statistical models, it is generally true that having more texts (a text corpus) will improve the outcomes of your analysis.

  • It is also generally true that an ideal set of texts—or corpus—will be:

    • Full-text

    • Easily readable, such as plaintext files or Python strings

In practice, “easily readable” means that you could hypothetically copy and paste the text. This can be complicated when working with works that are in copyright. If it is not possible to access “full-text” due to applicable copyright laws, the ideal corpus will give readers access to metadata or n-gram counts. (N-grams are contiguous chains of n items (i.e., words), where n is the number of items in the chain.)

  • While having the full texts for the documents in your corpus is ideal, a great deal can still be discovered through the use of unigrams. Even when researchers have access to the full texts of a corpus, it is common for them to create a list of n-gram counts for analysis.

What is Text as Data?#

When approaching text as data, here are some things to keep in mind:

  • First, having textual data of sufficient quality is important. Textual data quality is determined by how it’s created. Hand-keyed text is often of the best quality, while text obtained by OCR (Optical Character Recognition) can vary in quality. Raw, uncorrected OCR text is dirty; it only becomes clean once it is corrected. (Please note that HathiTrust OCR is dirty and uncorrected.)

  • When viewing text as data, we usually analyze it by corpus or corpora; text corpora are bodies of text. As mentioned in previous modules, a “corpus” of text can refer to both a digital collection and an individual’s research text dataset.

  • When preparing text, one can think in terms of what Geoffrey Rockwell has called text decomposition or re-composition. The text will be split, combined, or represented in ways that distinguish it from human readable text. It may involve discarding some text, and requires the researcher to shift their understanding of the text from human-legible object to data. What stays, what goes, and how things are manipulated is a researcher’s choice. While there are emerging best practices, there isn’t a step-by-step guide to follow.

I have my own data. What will it take to get it ready?#

Finding, cleaning, and structuring data is often time consuming and labor intensive. One of the most significant benefits of using a database such as HathiTrust Digital Library, Constellate by ITHAKA, or any database with a dataset builder is that it removes the vast majority of that effort from doing text analysis.

  • For a major text analysis project, such as UNC Chapel Hill’s On the Books: Jim Crow and Algorithms of Resistance, about 90% of the labor is creating the corpus. For your initial project we suggest assembling a data set from one of the previously mentioned databases.

If you have your own data, you will need to assess what it will take to make it ready for analysis. Here are some questions you should ask:

  • Do I need to use Optical Character Recognition (OCR) to convert my data into plain text?

  • Do I need to tokenize (break up and separate the words in) my texts?

Consider the data’s current form as well as your current skill level or the size and skill of your project staff. The corpus creation process could take anywhere from a few hours to many years of labor.

  • If there is a significant amount of labor, you may need to write a grant proposal to hire help.

    • If writing a grant, contact your library with questions about Research Data Management since funding agencies often require your corpus to be committed to a disciplinary or institutional repository.

  • In addition to the cleaned-up texts for your corpus, you will also need a strategy for dealing with textual metadata, information such as author, year, etc. It helps to have some experience with working with data at scale with either Excel or Python Pandas.
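As a rough illustration, here is a minimal pandas sketch for inspecting a metadata table at scale; the file name and column names (filename, author, year) are hypothetical.

    import pandas as pd

    # Hypothetical metadata file with one row per text in the corpus.
    metadata = pd.read_csv("corpus_metadata.csv")   # e.g. columns: filename, author, year

    print(metadata.head())                           # inspect the first few rows
    print(metadata["author"].value_counts())         # how many texts per author?
    print(metadata["year"].min(), metadata["year"].max())   # time span covered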

Searching for Text Data#

  • You can search for already existing datasets in the following:

Databases

Licensed content in Library databases

Policies for Mining Licensed Content

  • If you are thinking of basing a research project on data extracted from a library database, contact your subject librarian to discuss issues around permissions (copyright and licensing agreements), formats and fees.

  • In addition to copyright considerations, we must take into account what the database vendors’ own policies specify in regard to this type of use. When providing access to a database, the library enters into licensing agreements, which also dictate what types of data can be extracted and used. Many prohibit text and data mining and the use of software such as scripts, agents, or robots, but it may be possible to negotiate text mining rights.

Non-consumptive or non-expressive use

  • Research in which computational analysis is performed on text, but not research in which a researcher reads or displays substantial portions of the text to understand the expressive content presented within it.

  • Non-consumptive research complies with copyright law because of the distinction in law between “ideas” and “expressions”. It is sometimes called non-expressive use (because it works with “ideas” instead of specific “expressions”, hence the term “non-expressive”).

  • Foundation of HTRC work.

Open access (OA) or Public Domain information

You can search the open web using a web browser such as Chrome or Firefox, adding specific terms such as: data, datasets, API, file format (such as .csv). Advanced search options may also allow for searching for specific file types. Try searching:

Depending on the type of data, the collecting agency, or your field, there might be open access repositories with that data available.

Preparing Data (Cleaning and Transforming)#

  • After gathering the data needed for research and before conducting the actual analysis, data often requires preparation (also sometimes referred to as pre-processing the data). Preparing data can take a lot of time and effort.

  • Cleaning text data can involve…

  • Removing duplicates

  • Correcting errors

  • Removing outliers

  • Adjusting for missing data

  • Correcting OCR errors

  • Removing title and header information

  • Removing HTML or XML tags

  • Splitting or combining files

  • Removing certain words or punctuation marks

  • Making text lowercase
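As an illustration, here is a minimal sketch of a few of these cleaning steps in Python using only the standard library; the sample string and the tiny stop word list are illustrative, not a recommended pipeline.

    import re

    raw = "<p>Four score and seven years ago our fathers brought forth...</p>"
    stop_words = {"and", "our", "the"}               # an illustrative stop word list

    text = re.sub(r"<[^>]+>", " ", raw)              # remove HTML/XML tags
    text = text.lower()                              # make text lowercase
    text = re.sub(r"[^\w\s]", " ", text)             # remove punctuation marks
    tokens = [t for t in text.split() if t not in stop_words]   # remove certain words

    print(tokens)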

The importance of using open data formats#

A small detour to discuss data formats. For accessibility, future-proofing, and preservation, keep your data in open, sustainable formats.

  • Sustainable formats are generally unencrypted, uncompressed, and follow an open standard. A small list:

    • ASCII

    • PDF

    • .csv

    • FLAC

    • TIFF

    • JPEG2000

    • MPEG-4

    • XML

    • RDF

    • .txt

    • .r

How do you decide the formats to store your data when you transition from ‘raw’ to ‘processed/transformed’ data? What are some of your considerations?

Tidy data#

There are guidelines to the processing of data, sometimes referred to as Tidy Data [1]. One manifestation of these rules:

  1. Each variable is in a column.

  2. Each observation is a row.

  3. Each value is a cell.

{"Cats":[
    {"Calico":[
    { "firstName":"Smally", "lastName":"McTiny" },
    { "firstName":"Kitty", "lastName":"Kitty" }],
    "Tortoiseshell":[
    { "firstName":"Foots", "lastName":"Smith" }, 
    { "firstName":"Tiger", "lastName":"Jaws" }]}]}

[1] Wickham, Hadley. “Tidy Data”. Journal of Statistical Software.

Chunking text#

Chunking text

  • As mentioned, preparing text often involves splitting and combining files. In text analysis, splitting files is commonly referred to as chunking text. It means splitting text into smaller pieces before analysis. The text may be divided by paragraph, chapter, or a chosen number of words (e.g., 1000-word chunks). Let’s say that we have a whole text that consists of speeches of Abraham Lincoln. Before conducting analysis, the researcher may need to split the text into individual speeches. This process can be called chunking text.
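A minimal sketch of chunking, assuming the speeches sit in a single hypothetical plain-text file and we want pieces of roughly 1000 words:

    def chunk_words(text, size=1000):
        """Split a text into consecutive chunks of `size` words."""
        words = text.split()
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

    # Hypothetical file containing the collected speeches of Abraham Lincoln.
    with open("lincoln_speeches.txt", encoding="utf-8") as f:
        chunks = chunk_words(f.read(), size=1000)

    print(len(chunks), "chunks of roughly 1000 words each")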

Grouping text#

Grouping text

  • An opposite process that needs to be done just as often is combining text into larger pieces before analysis, which can be referred to as grouping text. Let’s look at political speeches as an example. Say that this time we have individual texts of various speeches made by Abraham Lincoln as well as George Washington. Before conducting our analysis, we may need to group the texts by combining all speeches by Lincoln into one group and all speeches by Washington into another group.

Both chunking and grouping are ways of modifying the unit of analysis, and the choice depends wholly on what the researcher wants to study. Maybe someone wants to compare all of Abraham Lincoln’s speeches to all of George Washington’s speeches; they could create two large “buckets” of data via grouping. Or maybe someone wants to compare only the chapters in John F. Kennedy’s “Profiles in Courage” to see how its descriptions of the figures it profiles are similar or different; then a researcher might split that single work out by chapter. Those are simplistic examples, but they highlight the kinds of splitting and combining that may happen.
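A minimal sketch of grouping, assuming a hypothetical folder layout with one subfolder of speech files per author (speeches/lincoln/*.txt, speeches/washington/*.txt):

    from pathlib import Path

    groups = {}
    for author_dir in Path("speeches").iterdir():
        if author_dir.is_dir():
            texts = [p.read_text(encoding="utf-8") for p in sorted(author_dir.glob("*.txt"))]
            groups[author_dir.name] = "\n".join(texts)    # one combined "bucket" per author

    # Word counts per bucket give a quick sanity check before analysis.
    print({author: len(text.split()) for author, text in groups.items()})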

Tokenization#

Tokenization

  • An additional step in preparation is called tokenization. Tokenization is simply the process of breaking text into pieces called tokens. Often certain characters, such as punctuation marks, are discarded in the process. Here’s a tokenized version of the beginning of The Gettysburg Address in the image above. The original text, which is in a human-readable form, has been translated into tokens. While the tokens can still be parsed by a human, they aren’t in a form we regularly read. They can now, however, be read and processed by a computer. (A simple tokenizer is sketched after this list.)

  • It is important to note that different choices in text preparation will affect the results of the analysis.

  • Depending on the amount of text and size of chunks, which stop words are removed and which characters are included, and whether to lowercase and normalize words, the eventual text that is ready for analysis can be very different. Additionally, preparation for analysis takes a lot of time and effort. This is where scripting becomes useful!

  • Additional information about how text preparation impacts results

  • An argument against cleaning data
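Here is the simple tokenizer mentioned above, using only the standard library; libraries such as NLTK or spaCy provide more careful tokenizers, but this shows the basic idea of breaking text into tokens and discarding punctuation along the way.

    import re

    text = "Four score and seven years ago our fathers brought forth, on this continent, a new nation"

    # Keep runs of word characters, lowercase them, and drop punctuation in the process.
    tokens = re.findall(r"\w+", text.lower())

    print(tokens)
    # ['four', 'score', 'and', 'seven', 'years', 'ago', ...]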

Types of Text Mining#

What disciplinary questions can text analysis answer?#

  • You can use text analysis to answer a wide variety of questions.

  • Here are a few that are common:

  1. What are these texts about?

  2. How are these texts connected?

  3. What emotions (or affects) are found within these texts?

  4. What names are used in these texts?

  5. Which of these texts are most similar?

1. What are these texts about?#

  • When it comes to a large body of texts, scholars tend to be most curious about the text’s contents. What are the words, topics, concepts, and significant terms in these documents? There are a number of methods often used which vary in complexity and difficulty.


Word Frequency#

  • Word frequency means counting how often a word appears in a given text. Relative frequency measures how often words appear relative to the other words in the text, typically as a percentage. Word frequency is common as a baseline for more sophisticated methods, including Bag of Words and TF-IDF. Example: “Which of these texts focus on women?” (A small counting sketch follows the word cloud below.)

  • If you search for digital humanities in Google image search, the most common result is a word cloud. A word cloud visualizes the most frequent content words in a text or corpus.

    • Before you can create a word cloud, however, you need to collect the word frequencies for all the words in your text. You may also need to use a stop words list to remove common function words (grammatical word constructions like “the”, “of”, and “or”).

word cloud
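Here is the counting sketch mentioned above: a minimal word-frequency and relative-frequency count with the standard library. The file name and the tiny stop word list are illustrative.

    import re
    from collections import Counter

    text = open("my_text.txt", encoding="utf-8").read()       # hypothetical plain-text file
    stop_words = {"the", "of", "and", "a", "to", "in", "or"}  # illustrative function words

    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in stop_words]
    counts = Counter(tokens)
    total = sum(counts.values())

    # Raw counts and relative frequencies for the ten most common content words.
    for word, n in counts.most_common(10):
        print(word, n, f"{n / total:.2%}")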


Collocation#

  • Examining where words occur close to one another. Example: “Where are women mentioned in relation to home ownership?”

Collocation


  • A type of collocation is n-gram counts.

  • N-grams are a contiguous chain of n-items (i.e. words) where n is the number of items in the chain.

N-gram

In the cases where a database (such as Constellate or HathiTrust) cannot supply full-text due to copyright laws, it may supply three kinds of n-gram counts:

  • Unigrams - A single-word construction, for example: “vegetable”.

  • Bigrams - A two-word construction, for example: “vegetable stock”.

    • Notice how in the bigram example above the window of the gram slides across the text, so in bigrams, a word will occur with both the word preceding and following it.

  • Trigrams - A three-word construction, for example: “homemade vegetable stock”.

While having the full texts for the documents in your corpus is ideal, a great deal can still be discovered through the use of unigrams. Even when researchers have access to the full texts of a corpus, it is common for them to create a list of n-gram counts for analysis.
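A minimal sketch of building n-gram counts from a token list, showing how the two-word window of a bigram slides across the text (the token list is illustrative):

    from collections import Counter

    tokens = ["homemade", "vegetable", "stock", "beats", "canned", "vegetable", "stock"]

    def ngrams(tokens, n):
        """Return every contiguous n-item window in a token list."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    bigram_counts = Counter(ngrams(tokens, 2))
    print(bigram_counts.most_common(3))
    # [(('vegetable', 'stock'), 2), ...]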

Read about using n-grams to get a sense of language usage and change on Reddit


TF/IDF#

  • TF-IDF (term frequency–inverse document frequency) is intended to reflect how important a word is to a document in a collection or corpus. Example: “Given a decade of board reports, are there seasonal issues that crop up in summer vs. winter?”

TF-IDF
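A minimal sketch using scikit-learn’s TfidfVectorizer, assuming scikit-learn is installed; the three short “documents” are illustrative stand-ins for full board reports.

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "board report summer pool maintenance",
        "board report winter heating costs",
        "board report summer pool staffing",
    ]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)       # rows = documents, columns = terms

    # Terms frequent in one document but rare across the corpus score highest;
    # words that appear everywhere ("board", "report") score low.
    print(vectorizer.get_feature_names_out())
    print(tfidf.toarray().round(2))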

Read more about how to do TF/IDF here or here.


Significant Terms#

  • Search engines use significant terms analysis to match a user query with a list of appropriate documents. This method could be useful if you want to search your corpus for the most significant texts based on a word (or set of words). It can also be useful in reverse. For a given document, you could create a list of the ten most significant terms. This can be useful for summarizing the content of a document.


Topic Modeling#

  • Topic modeling is a method of using statistical models for discovering the abstract “topics” that occur in a collection of documents.

Topic modeling simple

  • For this kind of analysis, the text is chunked, and stop words (frequently used words such as “the”, “and”, “if”) are removed since they reveal little about the substance of a text.

  • Discovering the topics within a group of texts. Example: “What are the most frequent topics discussed in this newspaper?”

  • While significant terms analysis reveals terms commonly found in a given document, a topic analysis can tell us what words tend to cluster together across a corpus. For example, if we were to study newspapers, we would expect that certain words would cluster together into topics that match the sections of the newspaper. We might see something like:

  • Topic 1: baseball, ball, player, trade, score, win, defeat

  • Topic 2: market, dow, bull, trade, run, fund, stock

  • Topic 3: campaign, democratic, polls, red, vote, defeat, state

  • We can recognize that these words tend to cluster together within newspaper sections such as “Sports”, “Finance”, and “Politics”. If we have never read a set of documents, we might use a topic analysis to get a sense of what topics are in a given corpus. Given that Topic Analysis is an exploratory technique, it may require some expertise to fine-tune and get good results for a given corpus. However, if the topics can be discovered then they could potentially be used to train a model using Machine Learning to discover the topics in a given document automatically.

Topic modeling

Bag of words#

  • The computer treats the textual documents as bags of words, and guesses which words make up a “topic” based on their proximity to one another in the documents, with the idea that words that frequently co-occur are likely about the same thing. So the different colored groupings are the groups of words that the computer has statistically analyzed and determined are likely related to each other as a “topic”.

    • “Bag-of-words” is a concept where grammar and word order of the original text are disregarded and frequency is maintained. Here is an example of the beginning of The Gettysburg Address as a bag of words.

Bag Of Words
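A minimal sketch of the bag-of-words step feeding a topic model, using scikit-learn’s CountVectorizer and LatentDirichletAllocation (assuming scikit-learn is installed). The tiny corpus and the choice of two topics are purely illustrative; real topic models need far more text.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "baseball player trade score win defeat",
        "stock market fund trade run dow",
        "campaign vote polls state defeat democratic",
        "player score win ball trade",
    ]

    # Bag of words: word order is discarded, only frequencies are kept.
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(counts)

    # Show the words the model has clustered together into each "topic".
    words = vectorizer.get_feature_names_out()
    for i, topic in enumerate(lda.components_):
        top = [words[j] for j in topic.argsort()[-5:][::-1]]
        print(f"Topic {i}:", ", ".join(top))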

  • Here are some tips for topic modeling:

  • Treat topic modeling as one part of a larger analysis.

    • Understand that what you input, including how you set your parameters, will affect the output. Some points to note are:

    • Be careful with how you set the number of texts analyzed, as well as the number of topics generated

    • Be familiar with your input data

  • Know that changing your stop word list can have really interesting impacts on your topics, so tread carefully/wisely.

  • You’re going to want to go back to the text at some point. Make sure to examine your results to see if they make sense.

Read more: Keli Du’s A Survey on LDA Topic Modeling in Digital Humanities


2. How are these texts connected?#


Concordance#

  • Where is this word or phrase used in these documents? Example: “Which journal articles mention Maya Angelou’s phrase, ‘If you’re for the right thing, then you do it without thinking.’”

  • The concordance has a long history in humanities study and Roberto Busa’s concordance Index Thomisticus—started in 1946—is arguably the first digital humanities project. Before computers were common, concordances were printed in large volumes such as John Bartlett’s 1982 reference book A Complete Concordance to Shakespeare—it was 1909 pages long! A concordance gives the context of a given word or phrase in a body of texts. For example, a literary scholar might ask: how often and in what context does Shakespeare use the phrase “honest Iago” in Othello? A historian might examine a particular politician’s speeches, looking for examples of a particular “dog whistle”. See in the image below all instances of “cousin” in a Shakespeare concordance.

Shakespeare concordance
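A minimal keyword-in-context sketch using only the standard library (NLTK’s Text.concordance offers a ready-made version); the file name is hypothetical.

    import re

    def concordance(text, keyword, width=30):
        """Print each occurrence of `keyword` with `width` characters of context."""
        for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            print(f"...{left}[{match.group(0)}]{right}...")

    othello = open("othello.txt", encoding="utf-8").read()   # hypothetical plain-text file
    concordance(othello, "honest Iago")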


Network Analysis#

  • How are the authors of these texts connected? Example: “What local communities formed around civil rights in 1963?”

  • In the image below, the circles (nodes) in the graph represent individual newspapers or magazines, while the lines between them (edges) represent shared texts.

Network visualisation
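A minimal sketch with networkx (assuming it is installed): nodes are newspapers, and weighted edges record how many texts a pair of papers share. The names and counts are illustrative.

    import networkx as nx

    G = nx.Graph()
    # (paper A, paper B, number of shared texts)
    G.add_weighted_edges_from([
        ("Tribune", "Courier", 12),
        ("Tribune", "Gazette", 4),
        ("Courier", "Gazette", 7),
    ])

    # Degree centrality gives a rough sense of which papers share texts most widely.
    print(nx.degree_centrality(G))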

3. What emotions (or affects) are found within these texts?#

Sentiment Analysis#

Sentiment analysis uses computers to explore what emotions are present in a text. Does the author use positive or negative language? Example: “How do presidents describe gun control?” Sentiment analysis can help determine the emotions expressed in a given text. This can be done using rule-based algorithms, Machine Learning, or both.

Network visualisation incorporating sentiment analysis
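A minimal rule-based sketch using NLTK’s VADER analyzer, assuming nltk is installed; the vader_lexicon is downloaded on first use and the sentences are illustrative.

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")        # one-time download of the rule-based lexicon
    sia = SentimentIntensityAnalyzer()

    for sentence in [
        "This bill is a tremendous step forward.",
        "This bill is a dangerous failure.",
    ]:
        # The compound score ranges from -1 (most negative) to +1 (most positive).
        print(sentence, sia.polarity_scores(sentence))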

Read about:


4. What names are used in these texts?#


Named Entity Recognition (or NER)#

  • Named Entity Recognition (NER) automatically identifies entities within a text and can be helpful for extracting certain kinds of entities such as proper nouns. For example, NER could identify names of organizations, people, and places. It might also help identify things like dates, times, or dollar amounts.

  • You can use NER to list every example of a kind of entity from these texts. Example: “What are all of the geographic locations mentioned by Tolstoy?”

Named Entity Recognition
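A minimal sketch with spaCy, assuming spaCy and its small English model (en_core_web_sm) are installed; the sentence is illustrative.

    import spacy

    nlp = spacy.load("en_core_web_sm")    # small English pipeline with an NER component
    doc = nlp("Leo Tolstoy left Moscow for Yasnaya Polyana in 1869.")

    for ent in doc.ents:
        print(ent.text, ent.label_)       # e.g. PERSON, GPE (a place), DATE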

Read more


5. Which of these texts are most similar?#


Stylometrics and Authorship Attribution#

  • Stylometrics uses computers to speculate who wrote the text based on language style. You could use it to find the author of an anonymous document.

    • Example: “Who wrote The Federalist Papers?”

  • The digital humanities, and its precursor Humanities Computing, have a long history in the analysis of literature, particularly for analyzing genre and authorship. For example, the New Oxford Shakespeare surprised many scholars by assigning significant authorship of Shakespeare’s “Henry VI,” Parts 1, 2, and 3, to Christopher Marlowe. It also lists as co-authors many Shakespeare contemporaries such as Thomas Nashe, George Peele, Thomas Heywood, Ben Jonson, George Wilkins, Thomas Middleton, and John Fletcher.

Read more


Machine Learning(ML)#

  • Another key approach to text analysis is Machine Learning (ML). ML is training computers to recognize patterns in text.

  • Machine learning can either be unsupervised (with minimal human intervention) or supervised (with more human intervention).

ML

  • Here are some common, specific methods that are based on machine learning:

    • Topic modeling, which explores the thematic topics present in the text. Remember that topic modeling is a bag-of-words approach.

    • Naïve Bayes classification, which explores the categorization of texts, i.e., determining which of the categories the researcher has named a certain text belongs to.

  • One key approach in ML for TDM is Natural Language Processing (NLP), meaning using computers to understand the meaning, relationships, and semantics within human-language text. Generally for natural language processing, full text is needed. It is not a bag-of-words method.

  • Some common, specific methods under NLP are: Named entity extraction, Sentiment analysis and Stylometry.


Clustering#

  • Clustering is unsupervised machine learning.

  • Which texts are the most similar? Example: “Is this play closer to comedy or tragedy?”

clustering
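A minimal sketch clustering documents by their TF-IDF vectors with scikit-learn’s KMeans (assuming scikit-learn is installed); the tiny corpus and the choice of two clusters are illustrative.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "a comedy of mistaken identity and marriage",
        "a tragedy of jealousy murder and grief",
        "mistaken identity twins and a wedding",
        "betrayal grief and a bloody ending",
    ]   # illustrative stand-ins for full plays

    X = TfidfVectorizer().fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    print(labels)   # documents sharing a label were grouped as most similar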


Classification#

Read more


Text Visualization#

theoretical Workflow

Actual Workflow


Read more#


Attribution#