L13: Text as Data

Bogdan G. Popescu

John Cabot University

Introduction

Language underpins nearly all forms of social interaction:

  • Laws are codified
  • Political events are debated
  • Historical narratives are preserved
  • People exchange messages

Until recently, analyzing these interactions quantitatively was a challenge.

There are large quantities of text available in the form of digitized books and other documents.

Now, we also have powerful methods to analyze these texts.

Quantitative Text Analysis

Throughout the rest of the semester, we will use different methods for:

Assigning numbers to words and documents to measure latent concepts in text.

So, we want to assign numbers that enable us to measure latent concepts from large corpora of text.

Latent concepts are concepts that we cannot observe directly; rather, we have to infer them from the observed text.

Thus, we need to find strategies to score words and documents in a corpus.

Examples of Latent Concepts based on Observed Text

Examples:

  1. Aggression used in political communication
  2. Economic Topics in news articles
  3. Hate speech in online comments
  4. Ideological position in party manifestos

Quantitative Text Analysis Techniques

  1. Dictionaries
  • Some words get a score of 1 and others a score of 0
  • Documents are evaluated on whether they include words from the dictionary (see the sketch after this list)
  2. Supervised Learning
  • Words are assigned weights depending on how they are used across groups
  • Documents are evaluated on whether they include words associated with different groups
  3. Text Scaling
  • Words are given weights according to their use across groups
  • Documents receive different weights depending on how they use different words
  4. Topic Models
  • Words are assigned a vector of numbers, representing their relevance to a set of topics
  • Documents receive a vector of numbers, showing their relevance to specific topics
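
A minimal sketch of the dictionary idea, using a tiny hypothetical sentiment dictionary (the word list and documents below are purely illustrative):

Python
# Hypothetical dictionary: words in the list score 1, all other words score 0
positive_words = {"good", "great", "excellent"}

documents = [
    "the food was great and the service excellent",
    "the movie was long and boring",
]

# Score each document by the share of its words found in the dictionary
for doc in documents:
    tokens = doc.lower().split()
    score = sum(token in positive_words for token in tokens) / len(tokens)
    print(f"{score:.2f}  {doc}")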

Quantitative Text Analysis Techniques

  5. Word Embedding Models
  • Words are assigned a vector of numbers, representing the context in which they are used
  • Documents are characterised by some average of the vectors of the words they contain (see the sketch below)
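
A minimal sketch of the averaging idea with made-up three-dimensional vectors (real embeddings such as word2vec or GloVe are learned from large corpora and have many more dimensions):

Python
import numpy as np

# Made-up 3-dimensional "embeddings"; real models learn these from data
embeddings = {
    "tax":    np.array([0.9, 0.1, 0.0]),
    "budget": np.array([0.8, 0.2, 0.1]),
    "policy": np.array([0.5, 0.5, 0.2]),
}

document = ["tax", "budget", "policy"]

# The document vector is the average of its word vectors
doc_vector = np.mean([embeddings[w] for w in document], axis=0)
print(doc_vector)  # approximately [0.73 0.27 0.10]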

Applications of Quantitative Text Analysis

1. When did Western political thought start diverging from Islamic political thought?

2. How do central bankers make decisions on economic policy?

3. How has the cultural meaning of words changed over time?

4. How can we detect online hate speech?

5. Do men and women debate differently?

Other Applications of Quantitative Text Analysis

  1. Predicting whether the author of a tweet or text message is young or old
  • use of emojis, informal language, length of text
  2. Measuring the political content of news
  • presence of words related to politics
  3. Evaluating how complex texts are
  • counting the number of syllables and the use of adjectives, nouns, and verbs (see the sketch below)
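
A minimal sketch of a crude complexity measure, using only average sentence length and average word length (established readability scores such as Flesch-Kincaid also count syllables; the text below is just an illustration):

Python
import re

text = ("Machine learning is changing how we approach problems. "
        "It is fun. Let's learn how to code.")

# Naive sentence split on ., ! or ? followed by whitespace
sentences = [s for s in re.split(r'[.!?]+\s*', text) if s]
words = re.findall(r"[A-Za-z']+", text)

avg_sentence_length = len(words) / len(sentences)           # words per sentence
avg_word_length = sum(len(w) for w in words) / len(words)   # characters per word

print(f"{avg_sentence_length:.1f} words/sentence, {avg_word_length:.1f} characters/word")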

Caveats to Quantitative Text Analysis

1. Language models are wrong, but some are useful

  • The data-generating process for language is complex
  • We use methods that don’t provide an accurate representation of how the data were generated
  • Text analysis simplifies multidimensional text while preserving important aspects of meaning

2. We need to validate text-analysis insights with domain knowledge

  • The methods we use can be applied quickly and to large amounts of data
  • They can nevertheless lead to wrong inferences
  • It’s important to validate our inferences with our qualitative insights about a domain

Caveats to Quantitative Text Analysis

3. We need to combine quantitative and qualitative insights

  • Text analysis is useful for analyzing many texts, rather than for close readings of a few texts
  • Text analysis still entails qualitative insights about how the document-feature matrix is built
  • Text analysis still entails qualitative insights about interpreting statistical outputs

Stylized Workflow of Text Analysis

(figure: stylized workflow of text analysis)

Detailed Workflow of Text Analysis

Typically, text analysis entails more steps:

  1. Deciding on documents: speeches, books, tweets, etc.
  2. Digitizing documents (e.g. books, speeches) or scraping tweets
  3. Representing documents in a quantitative way
  4. Analyzing the data
  5. Validating the analysis
  6. Interpreting the results

Document-feature matrix

A document-feature matrix is a typical way of representing text in quantitative form.

The rows of the matrix represent the documents.

The columns indicate the features (e.g. words).

We need to make decisions about which documents and features are important to build a document-feature matrix.

Definitions

A document is the basic unit of text analysis

A corpus is a structured set of documents for analysis

Tokenization is the process of breaking down a piece of text, like a sentence or a paragraph, into individual words or “tokens.”
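
A minimal tokenization sketch using NLTK’s word_tokenize (assuming the punkt tokenizer data has been downloaded, as in the code later in this lecture):

Python
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models used by word_tokenize

sentence = "The duck swam across the pond."
tokens = word_tokenize(sentence)
print(tokens)  # ['The', 'duck', 'swam', 'across', 'the', 'pond', '.']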

What are documents?

Documents as the basic unit of analysis in text analysis can be:

1. A collection of literary works
2. A novel
3. A chapter
4. A tweet or a message
5. A customer review

What the unit is will depend on your research question.

What are features?

Features: Characteristics of text used to analyze and quantify meaning, structure, or style.

1. Words

  • Simplest feature type
  • Individual words can reveal themes, topics, or sentiment (e.g., “exciting,” “boring,” “predictable”)

2. N-grams

  • N-grams: sequences of N consecutive words
    • Example: the bigram (N=2) “text analysis”
  • Useful for identifying phrases or common expressions (see the sketch below)
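
A minimal sketch of extracting bigrams with NLTK’s ngrams helper (the sentence is just an illustration):

Python
from nltk import ngrams

tokens = "machine learning is changing how we approach problems".split()

# All consecutive pairs of tokens (bigrams)
bigrams = list(ngrams(tokens, 2))
print(bigrams[:3])  # [('machine', 'learning'), ('learning', 'is'), ('is', 'changing')]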

What are features?

3. Linguistic Features

  • Analyzes language structure and meaning beyond individual words.
  • Parts of Speech (POS)
    • Categorizes words (e.g., nouns, verbs, adjectives)
    • Helps identify sentiment words (e.g., “excellent,” “poor”).
  • Named Entities
    • Recognizes names in text (e.g., people, places, organizations, dates)
    • Useful for extracting key information from large texts.

What are features?

3. Linguistic Features

  • Analyzes language structure and meaning beyond individual words.
  • Dependency Parsing
    • Examines grammatical structure of sentences
    • Identifies relationships like subject, object, and verb (see the sketch below)
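
A minimal sketch of these three feature types using spaCy; this assumes spaCy and its small English model en_core_web_sm are installed (NLTK provides similar POS tagging and named-entity chunking):

Python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Christine Lagarde discussed inflation in Frankfurt in 2024.")

# Parts of speech
print([(token.text, token.pos_) for token in doc])

# Named entities (people, places, dates, ...)
print([(ent.text, ent.label_) for ent in doc.ents])

# Dependency relations (subject, object, ...)
print([(token.text, token.dep_, token.head.text) for token in doc])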

Bags of Words

The simplest way of analyzing text is to count words

For each document, we may count how many times each unique word appears

In this context, word order does not matter

Word combinations also do not matter.

Grammar does not matter.

Words are the only relevant features

Bags of Words Example

Examples:

  • Sentence 1: “I saw her duck.”
  • Sentence 2: “The duck swam across the pond.”

             across  duck  her  pond  saw  swam  the
Sentence 1        0     1    1     0    1     0    0
Sentence 2        1     1    0     1    0     1    2
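
The same matrix can be reproduced with scikit-learn’s CountVectorizer, which we use throughout this lecture (its default settings lowercase the text and drop one-character tokens, which is why “I” does not appear as a feature):

Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I saw her duck.", "The duck swam across the pond."]

# Default settings: lowercase the text, ignore one-character tokens
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)

print(pd.DataFrame(bow.toarray(),
                   columns=vectorizer.get_feature_names_out(),
                   index=["Sentence 1", "Sentence 2"]))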

What BoW Loses

  • Context of “duck”
    • In Sentence 1, “duck” means to lower one’s head
    • In Sentence 2, “duck” is the animal
  • Meaning of “saw”
    • With no context, BoW does not distinguish between: saw - the verb and saw - the tool
  • Grammar and Structure
    • The relationships between words (e.g., “saw her” vs. “her duck”) are lost in BoW.

Bags of Words Example

Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Assuming `sdg` is a DataFrame with a column 'long_description'
# Example DataFrame creation (skip this if you already have your `sdg` DataFrame)
data = {'long_description': [
  "The quick brown fox jumps over the lazy dog.",
  "Python programming is fun! Let's learn how to code.",
  "In 2024, we will have 50% more data and problems to analyze.",
  "Machine learning is changing how we approach problems in 5 different fields."
]}
sdg = pd.DataFrame(data)

# Convert the 'long_description' column into a list of documents
sdg_corpus = sdg['long_description'].tolist()

# Initialize CountVectorizer for tokenization and DFM creation
vectorizer = CountVectorizer()

# Fit and transform the corpus to create the Document-Feature Matrix (DFM)
sdg_dfm = vectorizer.fit_transform(sdg_corpus)

# Convert to DataFrame for better readability
sdg_dfm_df = pd.DataFrame(sdg_dfm.toarray(), columns=vectorizer.get_feature_names_out())

2024 50 analyze and approach brown changing code data different dog fields fox fun have how in is jumps lazy learn learning let machine more over problems programming python quick the to we will
0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 2 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0
1 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1
0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0

Bags of words

Python
# Number of words (features) in the DFM
num_words = sdg_dfm_df.shape[1]
print(f"Number of words: {num_words}")

# Number of documents in the DFM
num_documents = sdg_dfm_df.shape[0]
print(f"Number of documents: {num_documents}")

# Most common features (top 10 words) in the DFM
top_features = sdg_dfm_df.sum().sort_values(ascending=False).head(10)
print("Top 10 most common features:\n", top_features)
Number of words: 34
Number of documents: 4
Top 10 most common features:
 is          2
we          2
to          2
the         2
in          2
how         2
problems    2
machine     1
learn       1
learning    1
dtype: int64

Bags of words

Python
# Create a DataFrame for top features
top_features_df = pd.DataFrame(top_features).reset_index()
top_features_df.columns = ['word', 'frequency']
R
library(reticulate)
library(ggplot2)

df <-  reticulate::py$top_features_df

ggplot(df, aes(x = reorder(word, -frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  labs(title = "Most Common Words", x = "Words", y = "Frequency") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Word sequences/N-grams

N-grams are sequences of words from a document.

Word sequences/N-grams

Python
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
nltk.download('punkt')

# Convert the 'long_description' column into a list of documents
sdg_corpus = sdg['long_description'].tolist()
sdg_corpus_tokens = [" ".join(word_tokenize(doc.lower())) for doc in sdg_corpus]

# Create a document-feature matrix with unigrams and bigrams
vectorizer_bigram = CountVectorizer(ngram_range=(1, 2))
sdg_dfm_bigram = vectorizer_bigram.fit_transform(sdg_corpus_tokens)
sdg_dfm_bigram_df = pd.DataFrame(sdg_dfm_bigram.toarray(), columns=vectorizer_bigram.get_feature_names_out())

# Create a document-feature matrix with unigrams, bigrams, and trigrams
vectorizer_trigram = CountVectorizer(ngram_range=(1, 3))
sdg_dfm_trigram = vectorizer_trigram.fit_transform(sdg_corpus_tokens)
sdg_dfm_trigram_df = pd.DataFrame(sdg_dfm_trigram.toarray(), columns=vectorizer_trigram.get_feature_names_out())

Display the Document-Feature Matrix

Remember that our documents are the following:

Document 1: The quick brown fox jumps over the lazy dog.

Document 2: Python programming is fun! Let’s learn how to code.

Document 3: In 2024, we will have 50% more data and problems to analyze.

Document 4: Machine learning is changing how we approach problems in 5 different fields.

Display the Document-Feature Matrix

We can easily display the unigram-and-bigram matrix we produced:

Python
print(sdg_dfm_bigram_df)

(output: a 4 × 71 document-feature matrix of unigram and bigram counts, with columns ranging from “2024” and “2024 we” to “will” and “will have”; too wide to display in full)

Display the Document-Feature Matrix

We can easily display the matrix with trigrams included:

Python
print(sdg_dfm_trigram_df)

(output: a 4 × 104 document-feature matrix of unigram, bigram, and trigram counts; too wide to display in full)

Word sequences/N-grams

Python
# Print the results
# Column counts
ncol_dfm = sdg_dfm.shape[1]
ncol_bigram = sdg_dfm_bigram.shape[1]
ncol_trigram = sdg_dfm_trigram.shape[1]

print(ncol_dfm, ncol_bigram, ncol_trigram)
34 71 104

How many words are there in these documents?

Python
# Convert the 'long_description' column into a list of documents
sdg_corpus = sdg['long_description'].tolist()

# Initialize CountVectorizer for tokenization and DFM creation
vectorizer = CountVectorizer()

# Fit and transform the corpus to create the Document-Feature Matrix (DFM)
sdg_dfm = vectorizer.fit_transform(sdg_corpus)

# Convert to DataFrame for better readability
sdg_dfm_df = pd.DataFrame(sdg_dfm.toarray(), columns=vectorizer.get_feature_names_out())

How many words are there in these documents?

1. Count the number of words (features) in the DFM

Python
num_words = sdg_dfm.shape[1]
num_words
34

2. Count the number of documents in the DFM

Python
num_documents = sdg_dfm.shape[0]
num_documents
4

How many words are there in these documents?

3. Find the most common features (top 10)

Python
top_features = sdg_dfm_df.sum(axis=0).nlargest(10)
top_features
how         2
in          2
is          2
problems    2
the         2
to          2
we          2
2024        1
50          1
analyze     1
dtype: int64

Observations based on this Dataset

In our case, we deal with a small corpus

  • 4 documents
  • 34 unique words
  • 34 unique unigrams
  • 37 unique bigrams
  • 33 unique trigrams

Here is how we know:

Python
# Find unique 1-grams
unique_unigrams = [col for col in sdg_dfm_df.columns if len(col.split()) == 1]
unique_unigrams
['2024', '50', 'analyze', 'and', 'approach', 'brown', 'changing', 'code', 'data', 'different', 'dog', 'fields', 'fox', 'fun', 'have', 'how', 'in', 'is', 'jumps', 'lazy', 'learn', 'learning', 'let', 'machine', 'more', 'over', 'problems', 'programming', 'python', 'quick', 'the', 'to', 'we', 'will']


Python
# Find unique 2-grams
unique_bigrams = [col for col in sdg_dfm_bigram_df.columns if len(col.split()) == 2]
unique_bigrams
['2024 we', '50 more', 'and problems', 'approach problems', 'brown fox', 'changing how', 'data and', 'different fields', 'fox jumps', 'fun let', 'have 50', 'how to', 'how we', 'in 2024', 'in different', 'is changing', 'is fun', 'jumps over', 'lazy dog', 'learn how', 'learning is', 'let learn', 'machine learning', 'more data', 'over the', 'problems in', 'problems to', 'programming is', 'python programming', 'quick brown', 'the lazy', 'the quick', 'to analyze', 'to code', 'we approach', 'we will', 'will have']


Python
# Find unique 3-grams
unique_trigrams = [col for col in sdg_dfm_trigram_df.columns if len(col.split()) == 3]
unique_trigrams
['2024 we will', '50 more data', 'and problems to', 'approach problems in', 'brown fox jumps', 'changing how we', 'data and problems', 'fox jumps over', 'fun let learn', 'have 50 more', 'how to code', 'how we approach', 'in 2024 we', 'in different fields', 'is changing how', 'is fun let', 'jumps over the', 'learn how to', 'learning is changing', 'let learn how', 'machine learning is', 'more data and', 'over the lazy', 'problems in different', 'problems to analyze', 'programming is fun', 'python programming is', 'quick brown fox', 'the lazy dog', 'the quick brown', 'we approach problems', 'we will have', 'will have 50']

Sparsity

The resulting DFMs are not yet very sparse: only about 70% of their entries are zeros, which arise because most n-grams do not appear in most documents.

Python
a = (sdg_dfm_df.to_numpy() == 0).mean()
print(a)
0.7058823529411765
Python
b = (sdg_dfm_bigram_df.to_numpy() == 0).mean()
print(b)
0.7288732394366197
Python
c = (sdg_dfm_trigram_df.to_numpy() == 0).mean()
print(c)
0.7355769230769231

However, in most real applications we deal with far more zeros, which means that we will have very high sparsity.

Text Cleaning in Python

Step 1: Turning words to lowercase
Step 2: Removing punctuation
Step 3: Removing stopwords
Step 4: Removing numbers
Step 5: Tokenization
Step 6: Lemmatization
Step 7: Stemming

Text Cleaning in Python

Step 1: Turning words to lowercase

Python
import string
import re
import nltk
from nltk.tokenize import word_tokenize
# Create a DataFrame
df = pd.DataFrame(sdg_corpus, columns=['document'])
df['document_transformed'] = df['document'].str.lower()

document | document_transformed
The quick brown fox jumps over the lazy dog. | the quick brown fox jumps over the lazy dog.
Python programming is fun! Let's learn how to code. | python programming is fun! let's learn how to code.
In 2024, we will have 50% more data and problems to analyze. | in 2024, we will have 50% more data and problems to analyze.
Machine learning is changing how we approach problems in 5 different fields. | machine learning is changing how we approach problems in 5 different fields.

Text Cleaning in Python

Step 2: Removing punctuation

This is how we remove punctuation:

Python
df['document_transformed'] = df['document_transformed'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

document | document_transformed
The quick brown fox jumps over the lazy dog. | the quick brown fox jumps over the lazy dog
Python programming is fun! Let's learn how to code. | python programming is fun lets learn how to code
In 2024, we will have 50% more data and problems to analyze. | in 2024 we will have 50 more data and problems to analyze
Machine learning is changing how we approach problems in 5 different fields. | machine learning is changing how we approach problems in 5 different fields

Text Cleaning in Python

Step 3: Removing stopwords

This is how we remove stopwords

Python
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Text Cleaning in Python

Step 3: Removing stopwords

This is how we remove stopwords

Python
df['document_transformed'] = df['document_transformed'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

document | document_transformed
The quick brown fox jumps over the lazy dog. | quick brown fox jumps lazy dog
Python programming is fun! Let's learn how to code. | python programming fun lets learn code
In 2024, we will have 50% more data and problems to analyze. | 2024 50 data problems analyze
Machine learning is changing how we approach problems in 5 different fields. | machine learning changing approach problems 5 different fields

Text Cleaning in Python

Step 4: Removing numbers

Some of our documents contain numbers (e.g. 2024, 50, 5); this is how we remove them:

Python
df['document_transformed'] = df['document_transformed'].apply(lambda x: re.sub(r'\d+', '', x))

document | document_transformed
The quick brown fox jumps over the lazy dog. | quick brown fox jumps lazy dog
Python programming is fun! Let's learn how to code. | python programming fun lets learn code
In 2024, we will have 50% more data and problems to analyze. | data problems analyze
Machine learning is changing how we approach problems in 5 different fields. | machine learning changing approach problems different fields

Text Cleaning in Python

Step 5: Tokenization

This is what tokenization looks like:

Python
df['document_transformed'] = df['document_transformed'].apply(lambda x: ' '.join(word_tokenize(x)))  # tokenize, then re-join with spaces, so the text looks unchanged here

document | document_transformed
The quick brown fox jumps over the lazy dog. | quick brown fox jumps lazy dog
Python programming is fun! Let's learn how to code. | python programming fun lets learn code
In 2024, we will have 50% more data and problems to analyze. | data problems analyze
Machine learning is changing how we approach problems in 5 different fields. | machine learning changing approach problems different fields

Text Cleaning in Python

Step 6: Lemmatization

Lemmatization - algorithmic process of converting words to their lemmas. E.g.: am, are, is → be
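
Note that NLTK’s WordNetLemmatizer treats every word as a noun unless we pass a part of speech; a minimal sketch (assuming the WordNet data has been downloaded):

Python
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # WordNet data used by the lemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("problems"))        # 'problem' (default part of speech: noun)
print(lemmatizer.lemmatize("is", pos="v"))     # 'be' (verb part of speech gives the lemma 'be')
print(lemmatizer.lemmatize("are", pos="v"))    # 'be'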

This is how we lemmatize the column:

Python
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # WordNet data used by the lemmatizer

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize each word in each document (the default part of speech is noun)
df['document_transformed'] = df['document_transformed'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))

document | document_transformed
The quick brown fox jumps over the lazy dog. | quick brown fox jump lazy dog
Python programming is fun! Let's learn how to code. | python programming fun let learn code
In 2024, we will have 50% more data and problems to analyze. | data problem analyze
Machine learning is changing how we approach problems in 5 different fields. | machine learning changing approach problem different field

Text Cleaning in Python

Step 7: Stemming

Stemming - process for reducing inflected (or sometimes derived) words to their stem, base or root form. Stemmers operate on single words without knowledge of the context.

E.g. production, producer, produce, produces, produced → produc

Python
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()

# Stem each word in each document
df['document_transformed'] = df['document_transformed'].apply(lambda x: ' '.join([porter_stemmer.stem(word) for word in x.split()]))

document | document_transformed
The quick brown fox jumps over the lazy dog. | quick brown fox jump lazi dog
Python programming is fun! Let's learn how to code. | python program fun let learn code
In 2024, we will have 50% more data and problems to analyze. | data problem analyz
Machine learning is changing how we approach problems in 5 different fields. | machin learn chang approach problem differ field

Text Cleaning in Python

Step 7: Stemming

Python
from collections import Counter
# Split each document into words and count the frequency of each word
word_counts = Counter(" ".join(df['document_transformed']).split())

# Convert the word counts to a DataFrame and get the top 10 most frequent words
top_features = pd.DataFrame(word_counts.most_common(10), columns=['word', 'frequency'])
print(top_features)
      word  frequency
0    learn          2
1  problem          2
2    quick          1
3    brown          1
4      fox          1
5     jump          1
6     lazi          1
7      dog          1
8   python          1
9  program          1

Text Cleaning in Python

Step 7: Stemming

R
library(reticulate)
library(ggplot2)

df <-  reticulate::py$top_features

ggplot(df, aes(x = reorder(word, -frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  labs(title = "Most Common Words", x = "Words", y = "Frequency") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Deciding on the Cleaning

The optimal representation of a corpus will depend on the particular research task

  1. Would you want to remove stop words when trying to detect gendered hate speech?
  2. Would you want to stem if you wanted to measure future-oriented language?
  3. Would you want to discard rare words when calculating linguistic complexity?

Conclusion

Quantitative Text Analysis allows us to address a wide variety of important research questions.

There is no one right way to represent text for all research questions.

The representation we choose can be consequential for the results we present.