L13: Text as Data

Bogdan G. Popescu

John Cabot University

Introduction

Language underpins nearly all forms of social interaction:

  • Laws are codified
  • Political events are debated
  • Historical narratives are preserved
  • People exchange messages

Until recently, analyzing these interactions quantitatively was a challenge.

There are large quantities of text available in the form of digitized books and other documents.

Now, we also have powerful methods to analyze these texts.

Quantitative Text Analysis

Throughout the rest of the semester, we will use different methods for:

Assigning numbers to words and documents to measure latent concepts in text.

So, we want to assign numbers that enable us to measure latent concepts from large corpora of text.

Latent concepts are concepts that we cannot observe directly; rather, we have to infer them from the observed text.

Thus, we need to find strategies to score words and documents in a corpus.

Examples of Latent Concepts based on Observed Text

Examples:

  1. Aggression used in political communication
  2. Economic Topics in news articles
  3. Hate speech in online comments
  4. Ideological position in party manifestos

Quantitative Text Analysis Techniques

  1. Dictionaries
  • Some words get a score of 1 and others a score of 0
  • Documents are evaluated on whether they include words from the dictionary (see the sketch after this list)
  2. Supervised Learning
  • Words are assigned weights depending on how they are used across groups
  • Documents are evaluated on whether they include words associated with different groups
  3. Text Scaling
  • Words are given weights according to their use across groups
  • Documents receive different weights depending on how they use different words
  4. Topic Models
  • Words are assigned a vector of numbers, representing their relevance to a set of topics
  • Documents receive a vector of numbers, showing their relevance to specific topics
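
A minimal sketch of the dictionary idea, using a tiny hypothetical sentiment dictionary (the word list and documents below are purely illustrative):

Python
# Hypothetical dictionary: words in the list score 1, all other words score 0
positive_words = {"good", "great", "excellent"}

documents = [
    "the food was great and the service excellent",
    "the movie was long and boring",
]

# Score each document by the share of its words found in the dictionary
for doc in documents:
    tokens = doc.lower().split()
    score = sum(token in positive_words for token in tokens) / len(tokens)
    print(f"{score:.2f}  {doc}")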

Quantitative Text Analysis Techniques

  5. Word Embedding Models
  • Words are assigned a vector of numbers, representing the context in which they are used
  • Documents are characterised by some average of the vectors of the words they contain (see the sketch below)
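
A minimal sketch of the averaging idea with made-up three-dimensional vectors (real embeddings such as word2vec or GloVe are learned from large corpora and have many more dimensions):

Python
import numpy as np

# Made-up 3-dimensional "embeddings"; real models learn these from data
embeddings = {
    "tax":    np.array([0.9, 0.1, 0.0]),
    "budget": np.array([0.8, 0.2, 0.1]),
    "policy": np.array([0.5, 0.5, 0.2]),
}

document = ["tax", "budget", "policy"]

# The document vector is the average of its word vectors
doc_vector = np.mean([embeddings[w] for w in document], axis=0)
print(doc_vector)  # approximately [0.73 0.27 0.10]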

Applications of Quantitative Text Analysis

1. When did Western political thought start diverging from Islamic political thought?

2. How do central bankers make decisions on economic policy?

3. How has the cultural meaning of words changed over time?

4. How can we detect online hate speech?

5. Do men and women debate differently?

Other Applications of Quantitative Text Analysis

  1. Predicting whether the author of a tweet or text message is young or old
  • use of emojis, informal language, length of text
  2. Measuring the political content of news
  • presence of words related to politics
  3. Evaluating how complex texts are
  • counting the number of syllables and the use of adjectives, nouns, and verbs (see the sketch below)
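
A minimal sketch of a crude complexity measure, using only average sentence length and average word length (established readability scores such as Flesch-Kincaid also count syllables; the text below is just an illustration):

Python
import re

text = ("Machine learning is changing how we approach problems. "
        "It is fun. Let's learn how to code.")

# Naive sentence split on ., ! or ? followed by whitespace
sentences = [s for s in re.split(r'[.!?]+\s*', text) if s]
words = re.findall(r"[A-Za-z']+", text)

avg_sentence_length = len(words) / len(sentences)           # words per sentence
avg_word_length = sum(len(w) for w in words) / len(words)   # characters per word

print(f"{avg_sentence_length:.1f} words/sentence, {avg_word_length:.1f} characters/word")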

Caveats to Quantitative Text Analysis

1. Language models are wrong, but some are useful

  • The data-generating process for language is complex
  • We use methods that don’t provide an accurate representation of how the data were generated
  • Text analysis simplifies multidimensional text while preserving important aspects of meaning

2. We need to validate text-analysis insights with domain knowledge

  • The methods we use can be applied quickly and to large amounts of data
  • They can nevertheless lead to wrong inferences
  • It’s important to validate our inferences with our qualitative insights about a domain

Caveats to Quantitative Text Analysis

3. We need to combine quantitative and qualitative insights

  • Text analysis is useful for analyzing many texts, rather than for close readings of a few texts
  • Text analysis still entails qualitative insights about how the document-feature matrix is built
  • Text analysis still entails qualitative insights about interpreting statistical outputs

Stylized Workflow of Text Analysis

(figure: stylized workflow of text analysis)

Detailed Workflow of Text Analysis

Typically, text analysis entails more steps:

  1. Deciding on documents: speeches, books, tweets, etc.
  2. Digitizing documents (e.g. books, speeches) or scraping tweets
  3. Representing documents in a quantitative way
  4. Analyzing the data
  5. Validating the analysis
  6. Interpreting the results

Document-feature matrix

A document-feature matrix is a typical way of representing text in quantitative form.

The rows of the matrix represent the documents.

The columns indicate the features (e.g. words).

We need to make decisions about which documents and features are important to build a document-feature matrix.

Definitions

A document is the basic unit of text analysis

A corpus is a structured set of documents for analysis

Tokenization is the process of breaking down a piece of text, like a sentence or a paragraph, into individual words or “tokens.”
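
A minimal tokenization sketch using NLTK’s word_tokenize (assuming the punkt tokenizer data has been downloaded, as in the code later in this lecture):

Python
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models used by word_tokenize

sentence = "The duck swam across the pond."
tokens = word_tokenize(sentence)
print(tokens)  # ['The', 'duck', 'swam', 'across', 'the', 'pond', '.']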

What are documents?

Documents as the basic unit of analysis in text analysis can be:

1. A collection of literary works
2. A novel
3. A chapter
4. A tweet or a message
5. A customer review

What the unit is will depend on your research question.

What are features?

Features: Characteristics of text used to analyze and quantify meaning, structure, or style.

1. Words

  • Simplest feature type
  • Individual words can reveal themes, topics, or sentiment (e.g., “exciting,” “boring,” “predictable”)

2. N-grams

  • N-grams: sequences of N consecutive words
    • Example: the bigram (N=2) “text analysis”
  • Useful for identifying phrases or common expressions (see the sketch below)
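
A minimal sketch of extracting bigrams with NLTK’s ngrams helper (the sentence is just an illustration):

Python
from nltk import ngrams

tokens = "machine learning is changing how we approach problems".split()

# All consecutive pairs of tokens (bigrams)
bigrams = list(ngrams(tokens, 2))
print(bigrams[:3])  # [('machine', 'learning'), ('learning', 'is'), ('is', 'changing')]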

What are features?

3. Linguistic Features

  • Analyzes language structure and meaning beyond individual words.
  • Parts of Speech (POS)
    • Categorizes words (e.g., nouns, verbs, adjectives)
    • Helps identify sentiment words (e.g., “excellent,” “poor”).
  • Named Entities
    • Recognizes names in text (e.g., people, places, organizations, dates)
    • Useful for extracting key information from large texts.

What are features?

3. Linguistic Features

  • Analyzes language structure and meaning beyond individual words.
  • Dependency Parsing
    • Examines grammatical structure of sentences
    • Identifies relationships like subject, object, and verb (see the sketch below)
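
A minimal sketch of these three feature types using spaCy; this assumes spaCy and its small English model en_core_web_sm are installed (NLTK provides similar POS tagging and named-entity chunking):

Python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Christine Lagarde discussed inflation in Frankfurt in 2024.")

# Parts of speech
print([(token.text, token.pos_) for token in doc])

# Named entities (people, places, dates, ...)
print([(ent.text, ent.label_) for ent in doc.ents])

# Dependency relations (subject, object, ...)
print([(token.text, token.dep_, token.head.text) for token in doc])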

Bags of Words

The simplest way of analyzing text is to count words

For each document, we may count how many times each unique word appears

In this context, word order does not matter

Word combinations also do not matter.

Grammar does not matter.

Words are the only relevant features

Bags of Words Example

Examples:

  • Sentence 1: “I saw her duck.”
  • Sentence 2: “The duck swam across the pond.”

             across  duck  her  pond  saw  swam  the
Sentence 1        0     1    1     0    1     0    0
Sentence 2        1     1    0     1    0     1    2
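
The same matrix can be reproduced with scikit-learn’s CountVectorizer, which we use throughout this lecture (its default settings lowercase the text and drop one-character tokens, which is why “I” does not appear as a feature):

Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["I saw her duck.", "The duck swam across the pond."]

# Default settings: lowercase the text, ignore one-character tokens
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)

print(pd.DataFrame(bow.toarray(),
                   columns=vectorizer.get_feature_names_out(),
                   index=["Sentence 1", "Sentence 2"]))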

What BoW Loses

  • Context of “duck”
    • In Sentence 1, “duck” means to lower one’s head
    • In Sentence 2, “duck” is the animal
  • Meaning of “saw”
    • With no context, BoW does not distinguish between: saw - the verb and saw - the tool
  • Grammar and Structure
    • The relationships between words (e.g., “saw her” vs. “her duck”) are lost in BoW.

Bags of Words Example

Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Assuming `sdg` is a DataFrame with a column 'long_description'
# Example DataFrame creation (skip this if you already have your `sdg` DataFrame)
data = {'long_description': [
  "The quick brown fox jumps over the lazy dog.",
  "Python programming is fun! Let's learn how to code.",
  "In 2024, we will have 50% more data and problems to analyze.",
  "Machine learning is changing how we approach problems in 5 different fields."
]}
sdg = pd.DataFrame(data)

# Convert the 'long_description' column into a list of documents
sdg_corpus = sdg['long_description'].tolist()

# Initialize CountVectorizer for tokenization and DFM creation
vectorizer = CountVectorizer()

# Fit and transform the corpus to create the Document-Feature Matrix (DFM)
sdg_dfm = vectorizer.fit_transform(sdg_corpus)

# Convert to DataFrame for better readability
sdg_dfm_df = pd.DataFrame(sdg_dfm.toarray(), columns=vectorizer.get_feature_names_out())

2024 50 analyze and approach brown changing code data different dog fields fox fun have how in is jumps lazy learn learning let machine more over problems programming python quick the to we will
0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 2 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0
1 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 1
0 0 0 0 1 0 1 0 0 1 0 1 0 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0

Bags of words

Python
# Number of words (features) in the DFM
num_words = sdg_dfm_df.shape[1]
print(f"Number of words: {num_words}")

# Number of documents in the DFM
num_documents = sdg_dfm_df.shape[0]
print(f"Number of documents: {num_documents}")

# Most common features (top 10 words) in the DFM
top_features = sdg_dfm_df.sum().sort_values(ascending=False).head(10)
print("Top 10 most common features:\n", top_features)
Number of words: 34
Number of documents: 4
Top 10 most common features:
 is          2
we          2
to          2
the         2
in          2
how         2
problems    2
machine     1
learn       1
learning    1
dtype: int64

Bags of words

Python
# Create a DataFrame for top features
top_features_df = pd.DataFrame(top_features).reset_index()
top_features_df.columns = ['word', 'frequency']
R
library(reticulate)
library(ggplot2)

df <-  reticulate::py$top_features_df

ggplot(df, aes(x = reorder(word, -frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  labs(title = "Most Common Words", x = "Words", y = "Frequency") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Word sequences/N-grams

N-grams are sequences of words from a document.

Word sequences/N-grams

Python
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
nltk.download('punkt')

# Convert the 'long_description' column into a list of documents
sdg_corpus = sdg['long_description'].tolist()
sdg_corpus_tokens = [" ".join(word_tokenize(doc.lower())) for doc in sdg_corpus]

# Create a document-feature matrix with unigrams and bigrams
vectorizer_bigram = CountVectorizer(ngram_range=(1, 2))
sdg_dfm_bigram = vectorizer_bigram.fit_transform(sdg_corpus_tokens)
sdg_dfm_bigram_df = pd.DataFrame(sdg_dfm_bigram.toarray(), columns=vectorizer_bigram.get_feature_names_out())

# Create a document-feature matrix with unigrams, bigrams, and trigrams
vectorizer_trigram = CountVectorizer(ngram_range=(1, 3))
sdg_dfm_trigram = vectorizer_trigram.fit_transform(sdg_corpus_tokens)
sdg_dfm_trigram_df = pd.DataFrame(sdg_dfm_trigram.toarray(), columns=vectorizer_trigram.get_feature_names_out())

Display the Document-Feature Matrix

Remember that our documents are the following:

Document 1: The quick brown fox jumps over the lazy dog.

Document 2: Python programming is fun! Let’s learn how to code.

Document 3: In 2024, we will have 50% more data and problems to analyze.

Document 4: Machine learning is changing how we approach problems in 5 different fields.

Display the Document-Feature Matrix

We can easily display the unigram-and-bigram matrix we produced:

Python
print(sdg_dfm_bigram_df)

(output: a 4 × 71 document-feature matrix of unigram and bigram counts, with columns ranging from “2024” and “2024 we” to “will” and “will have”; too wide to display in full)

Display the Document-Feature Matrix

We can easily display the matrix with trigrams included:

Python
print(sdg_dfm_trigram_df)

(output: a 4 × 104 document-feature matrix of unigram, bigram, and trigram counts; too wide to display in full)

Word sequences/N-grams

Python
# Print the results
# Column counts
ncol_dfm = sdg_dfm.shape[1]
ncol_bigram = sdg_dfm_bigram.shape[1]
ncol_trigram = sdg_dfm_trigram.shape[1]

print(ncol_dfm, ncol_bigram, ncol_trigram)
34 71 104

How many words are there in these documents?

Python
# Convert the 'long_description' column into a list of documents
sdg_corpus = sdg['long_description'].tolist()

# Initialize CountVectorizer for tokenization and DFM creation
vectorizer = CountVectorizer()

# Fit and transform the corpus to create the Document-Feature Matrix (DFM)
sdg_dfm = vectorizer.fit_transform(sdg_corpus)

# Convert to DataFrame for better readability
sdg_dfm_df = pd.DataFrame(sdg_dfm.toarray(), columns=vectorizer.get_feature_names_out())

How many words are there in these documents?

1. Count the number of words (features) in the DFM

Python
num_words = sdg_dfm.shape[1]
num_words
34

2. Count the number of documents in the DFM

Python
num_documents = sdg_dfm.shape[0]
num_documents
4

How many words are there in these documents?

3. Find the most common features (top 10)

Python
top_features = sdg_dfm_df.sum(axis=0).nlargest(10)
top_features
how         2
in          2
is          2
problems    2
the         2
to          2
we          2
2024        1
50          1
analyze     1
dtype: int64

Observations based on this Dataset

In our case, we deal with a small corpus

  • 4 documents
  • 34 unique words
  • 34 unique unigrams
  • 37 unique bigrams
  • 33 unique trigrams

Here is how we know:

Python
# Find unique 1-grams
unique_unigrams = [col for col in sdg_dfm_df.columns if len(col.split()) == 1]
unique_unigrams
['2024', '50', 'analyze', 'and', 'approach', 'brown', 'changing', 'code', 'data', 'different', 'dog', 'fields', 'fox', 'fun', 'have', 'how', 'in', 'is', 'jumps', 'lazy', 'learn', 'learning', 'let', 'machine', 'more', 'over', 'problems', 'programming', 'python', 'quick', 'the', 'to', 'we', 'will']


Python
# Find unique 2-grams
unique_bigrams = [col for col in sdg_dfm_bigram_df.columns if len(col.split()) == 2]
unique_bigrams
['2024 we', '50 more', 'and problems', 'approach problems', 'brown fox', 'changing how', 'data and', 'different fields', 'fox jumps', 'fun let', 'have 50', 'how to', 'how we', 'in 2024', 'in different', 'is changing', 'is fun', 'jumps over', 'lazy dog', 'learn how', 'learning is', 'let learn', 'machine learning', 'more data', 'over the', 'problems in', 'problems to', 'programming is', 'python programming', 'quick brown', 'the lazy', 'the quick', 'to analyze', 'to code', 'we approach', 'we will', 'will have']


Python
# Find unique 3-grams
unique_trigrams = [col for col in sdg_dfm_trigram_df.columns if len(col.split()) == 3]
unique_trigrams
['2024 we will', '50 more data', 'and problems to', 'approach problems in', 'brown fox jumps', 'changing how we', 'data and problems', 'fox jumps over', 'fun let learn', 'have 50 more', 'how to code', 'how we approach', 'in 2024 we', 'in different fields', 'is changing how', 'is fun let', 'jumps over the', 'learn how to', 'learning is changing', 'let learn how', 'machine learning is', 'more data and', 'over the lazy', 'problems in different', 'problems to analyze', 'programming is fun', 'python programming is', 'quick brown fox', 'the lazy dog', 'the quick brown', 'we approach problems', 'we will have', 'will have 50']

Sparsity

The resulting DFMs are not yet very sparse: only about 70% of their entries are zeros, which arise because most n-grams do not appear in most documents.

Python
a = (sdg_dfm_df.to_numpy() == 0).mean()
print(a)
0.7058823529411765
Python
b = (sdg_dfm_bigram_df.to_numpy() == 0).mean()
print(b)
0.7288732394366197
Python
c = (sdg_dfm_trigram_df.to_numpy() == 0).mean()
print(c)
0.7355769230769231

However, in most real applications we deal with far more zeros, which means that we will have very high sparsity.

Text Cleaning in Python

Step 1: Turning words to lowercase
Step 2: Removing punctuation
Step 3: Removing stopwords
Step 4: Removing numbers
Step 5: Tokenization
Step 6: Lemmatization
Step 7: Stemming

Text Cleaning in Python

Step 1: Turning words to lowercase

Python
import string
import re
import nltk
from nltk.tokenize import word_tokenize
# Create a DataFrame
df = pd.DataFrame(sdg_corpus, columns=['document'])
df['document_transformed'] = df['document'].str.lower()

document | document_transformed
The quick brown fox jumps over the lazy dog. | the quick brown fox jumps over the lazy dog.
Python programming is fun! Let's learn how to code. | python programming is fun! let's learn how to code.
In 2024, we will have 50% more data and problems to analyze. | in 2024, we will have 50% more data and problems to analyze.
Machine learning is changing how we approach problems in 5 different fields. | machine learning is changing how we approach problems in 5 different fields.

Text Cleaning in Python

Step 2: Removing punctuation

This is how we remove punctuation:

Python
df['document_transformed'] = df['document_transformed'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

document | document_transformed
The quick brown fox jumps over the lazy dog. | the quick brown fox jumps over the lazy dog
Python programming is fun! Let's learn how to code. | python programming is fun lets learn how to code
In 2024, we will have 50% more data and problems to analyze. | in 2024 we will have 50 more data and problems to analyze
Machine learning is changing how we approach problems in 5 different fields. | machine learning is changing how we approach problems in 5 different fields

Text Cleaning in Python

Step 3: Removing stopwords

This is how we remove stopwords

Python
from nltk.corpus import stopwords
stop = stopwords.words('english')
stop
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Text Cleaning in Python

Step 3: Removing stopwords

This is how we remove stopwords

Python
df['document_transformed'] = df['document_transformed'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

document | document_transformed
The quick brown fox jumps over the lazy dog. | quick brown fox jumps lazy dog
Python programming is fun! Let's learn how to code. | python programming fun lets learn code
In 2024, we will have 50% more data and problems to analyze. | 2024 50 data problems analyze
Machine learning is changing how we approach problems in 5 different fields. | machine learning changing approach problems 5 different fields

Text Cleaning in Python

Step 4: Removing numbers

Some of our documents contain numbers (e.g. 2024, 50, 5); this is how we remove them:

Python
df['document_transformed'] = df['document_transformed'].apply(lambda x: re.sub(r'\d+', '', x))

document | document_transformed
The quick brown fox jumps over the lazy dog. | quick brown fox jumps lazy dog
Python programming is fun! Let's learn how to code. | python programming fun lets learn code
In 2024, we will have 50% more data and problems to analyze. | data problems analyze
Machine learning is changing how we approach problems in 5 different fields. | machine learning changing approach problems different fields

Text Cleaning in Python

Step 5: Tokenization

This is what tokenization looks like:

Python
df['document_transformed'] = df['document_transformed'].apply(lambda x: ' '.join(word_tokenize(x)))  # tokenize, then re-join with spaces, so the text looks unchanged here

document | document_transformed
The quick brown fox jumps over the lazy dog. | quick brown fox jumps lazy dog
Python programming is fun! Let's learn how to code. | python programming fun lets learn code
In 2024, we will have 50% more data and problems to analyze. | data problems analyze
Machine learning is changing how we approach problems in 5 different fields. | machine learning changing approach problems different fields

Text Cleaning in Python

Step 6: Lemmatization

Lemmatization - algorithmic process of converting words to their lemmas. E.g.: am, are, is → be
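
Note that NLTK’s WordNetLemmatizer treats every word as a noun unless we pass a part of speech; a minimal sketch (assuming the WordNet data has been downloaded):

Python
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # WordNet data used by the lemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("problems"))        # 'problem' (default part of speech: noun)
print(lemmatizer.lemmatize("is", pos="v"))     # 'be' (verb part of speech gives the lemma 'be')
print(lemmatizer.lemmatize("are", pos="v"))    # 'be'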

This is how we lemmatize the column:

Python
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # WordNet data used by the lemmatizer

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize each word in each document (the default part of speech is noun)
df['document_transformed'] = df['document_transformed'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))

document | document_transformed
The quick brown fox jumps over the lazy dog. | quick brown fox jump lazy dog
Python programming is fun! Let's learn how to code. | python programming fun let learn code
In 2024, we will have 50% more data and problems to analyze. | data problem analyze
Machine learning is changing how we approach problems in 5 different fields. | machine learning changing approach problem different field

Text Cleaning in Python

Step 7: Stemming

Stemming - process for reducing inflected (or sometimes derived) words to their stem, base or root form. Stemmers operate on single words without knowledge of the context.

E.g. production, producer, produce, produces, produced → produc

Python
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()

# Stem each word in each document
df['document_transformed'] = df['document_transformed'].apply(lambda x: ' '.join([porter_stemmer.stem(word) for word in x.split()]))

document | document_transformed
The quick brown fox jumps over the lazy dog. | quick brown fox jump lazi dog
Python programming is fun! Let's learn how to code. | python program fun let learn code
In 2024, we will have 50% more data and problems to analyze. | data problem analyz
Machine learning is changing how we approach problems in 5 different fields. | machin learn chang approach problem differ field

Text Cleaning in Python

Step 7: Stemming

Python
from collections import Counter
# Split each document into words and count the frequency of each word
word_counts = Counter(" ".join(df['document_transformed']).split())

# Convert the word counts to a DataFrame and get the top 10 most frequent words
top_features = pd.DataFrame(word_counts.most_common(10), columns=['word', 'frequency'])
print(top_features)
      word  frequency
0    learn          2
1  problem          2
2    quick          1
3    brown          1
4      fox          1
5     jump          1
6     lazi          1
7      dog          1
8   python          1
9  program          1

Text Cleaning in Python

Step 7: Stemming

R
library(reticulate)
library(ggplot2)

df <-  reticulate::py$top_features

ggplot(df, aes(x = reorder(word, -frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  labs(title = "Most Common Words", x = "Words", y = "Frequency") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Deciding on the Cleaning

The optimal representation of a corpus will depend on the particular research task

  1. Would you want to remove stop words when trying to detect gendered hate speech?
  2. Would you want to stem if you wanted to measure future-oriented language?
  3. Would you want to discard rare words when calculating linguistic complexity?

Conclusion

Quantitative Text Analysis allows us to address a wide variety of important research questions.

There is no one right way to represent text for all research questions.

The representation we choose can be consequential for the results we present.