Language underpins nearly all forms of social interaction.
Until recently, analyzing these interactions quantitatively was a challenge.
Today, large quantities of text are available in the form of digitized books and other documents.
We now also have powerful methods to analyze these texts.
Throughout the rest of the semester, we will use different methods to:
Assign numbers to words and documents in order to measure latent concepts in text.
In other words, we want to assign numbers that enable us to measure latent concepts in large corpora of text.
Latent means that we cannot observe these concepts directly; rather, we have to infer them from the observed text.
Thus, we need strategies for scoring words and documents in a corpus.
Examples:
1. When did Western political thought start diverging from Islamic political thought?
2. How do central bankers make decisions on economic policy?
3. How has the cultural meaning of words changed over time?
4. How can we detect online hate speech?
5. Do men and women debate differently?
1. Language models are wrong, but some are useful
2. We need to validate text-analysis insights with domain knowledge
3. We need to combine quantitative and qualitative insights
Typically, text analysis entails several steps:
A document-feature matrix (DFM) is a typical way of representing text in quantitative form.
The rows of the matrix represent the documents.
The columns indicate the features (e.g., words).
We need to make decisions about which documents and features are important to build a document-feature matrix.
A document is the basic unit of text analysis
A corpus is a structured set of documents for analysis
Tokenization is the process of breaking down a piece of text, like a sentence or a paragraph, into individual words or “tokens.”
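As a minimal sketch (using NLTK's word_tokenize, which also appears later in these slides), tokenizing a single sentence looks like this:
Python
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # tokenizer models, downloaded once
# Break one sentence into individual tokens
print(word_tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']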
The document, as the basic unit of analysis in text analysis, can be:
1. A collection of literary works
2. A novel
3. A chapter
4. A tweet or a message
5. A customer review
What the unit is will depend on your research question.
Features: Characteristics of text used to analyze and quantify meaning, structure, or style.
1. Words
2. N-grams
3. Linguistic Features
The simplest way of analyzing text is to count words
For each document, we may count how many times each unique word appears
In this context, word order does not matter
Word combinations also do not matter.
Grammar does not matter.
Words are the only relevant features
Examples:
 | across | duck | her | pond | saw | swam | the |
---|---|---|---|---|---|---|---|
Sentence 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
Sentence 2 | 1 | 1 | 0 | 1 | 0 | 1 | 2 |
What BoW loses: word order, word combinations, and grammar.
Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Assuming `sdg` is a DataFrame with a column 'long_description'
# Example DataFrame creation (skip this if you already have your `sdg` DataFrame)
data = {'long_description': [
"The quick brown fox jumps over the lazy dog.",
"Python programming is fun! Let's learn how to code.",
"In 2024, we will have 50% more data and problems to analyze.",
"Machine learning is changing how we approach problems in 5 different fields."
]}
sdg = pd.DataFrame(data)
# Convert the 'long_description' column into a list of documents
sdg_corpus = sdg['long_description'].tolist()
# Initialize CountVectorizer for tokenization and DFM creation
vectorizer = CountVectorizer()
# Fit and transform the corpus to create the Document-Feature Matrix (DFM)
sdg_dfm = vectorizer.fit_transform(sdg_corpus)
# Convert to DataFrame for better readability
sdg_dfm_df = pd.DataFrame(sdg_dfm.toarray(), columns=vectorizer.get_feature_names_out())
2024 | 50 | analyze | and | approach | brown | changing | code | data | different | dog | fields | fox | fun | have | how | in | is | jumps | lazy | learn | learning | let | machine | more | over | problems | programming | python | quick | the | to | we | will |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Python
# Number of words (features) in the DFM
num_words = sdg_dfm_df.shape[1]
print(f"Number of words: {num_words}")
# Number of documents in the DFM
num_documents = sdg_dfm_df.shape[0]
print(f"Number of documents: {num_documents}")
# Most common features (top 10 words) in the DFM
top_features = sdg_dfm_df.sum().sort_values(ascending=False).head(10)
print("Top 10 most common features:\n", top_features)
# Also store the top features as a DataFrame with 'word' and 'frequency' columns,
# so the R chunk below can access it via reticulate as py$top_features_df
top_features_df = top_features.reset_index()
top_features_df.columns = ['word', 'frequency']
Number of words: 34
Number of documents: 4
Top 10 most common features:
is 2
we 2
to 2
the 2
in 2
how 2
problems 2
machine 1
learn 1
learning 1
dtype: int64
R
library(reticulate)
library(ggplot2)
df <- reticulate::py$top_features_df
ggplot(df, aes(x = reorder(word, -frequency), y = frequency)) +
  geom_bar(stat = "identity") +
  labs(title = "Most Common Words", x = "Words", y = "Frequency") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
N-grams are contiguous sequences of n words from a document (a bigram is a pair of adjacent words, a trigram a sequence of three).
Python
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
nltk.download('punkt')
# Convert the 'long_description' column into a list of documents
sdg_corpus = sdg['long_description'].tolist()
sdg_corpus_tokens = [" ".join(word_tokenize(doc.lower())) for doc in sdg_corpus]
# Create a document-feature matrix with unigrams and bigrams
vectorizer_bigram = CountVectorizer(ngram_range=(1, 2))
sdg_dfm_bigram = vectorizer_bigram.fit_transform(sdg_corpus_tokens)
sdg_dfm_bigram_df = pd.DataFrame(sdg_dfm_bigram.toarray(), columns=vectorizer_bigram.get_feature_names_out())
# Create a document-feature matrix with unigrams, bigrams, and trigrams
vectorizer_trigram = CountVectorizer(ngram_range=(1, 3))
sdg_dfm_trigram = vectorizer_trigram.fit_transform(sdg_corpus_tokens)
sdg_dfm_trigram_df = pd.DataFrame(sdg_dfm_trigram.toarray(), columns=vectorizer_trigram.get_feature_names_out())
Remember that our documents are the following:
Document 1: The quick brown fox jumps over the lazy dog.
Document 2: Python programming is fun! Let’s learn how to code.
Document 3: In 2024, we will have 50% more data and problems to analyze.
Document 4: Machine learning is changing how we approach problems in 5 different fields.
We can easily inspect the unigram-and-bigram matrix we produced:
2024 | 2024 we | 50 | 50 more | analyze | and | and problems | approach | approach problems | brown | brown fox | changing | changing how | code | data | data and | different | different fields | dog | fields | fox | fox jumps | fun | fun let | have | have 50 | how | how to | how we | in | in 2024 | in different | is | is changing | is fun | jumps | jumps over | lazy | lazy dog | learn | learn how | learning | learning is | let | let learn | machine | machine learning | more | more data | over | over the | problems | problems in | problems to | programming | programming is | python | python programming | quick | quick brown | the | the lazy | the quick | to | to analyze | to code | we | we approach | we will | will | will have |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
We can easily inspect the matrix with unigrams, bigrams, and trigrams:
2024 | 2024 we | 2024 we will | 50 | 50 more | 50 more data | analyze | and | and problems | and problems to | approach | approach problems | approach problems in | brown | brown fox | brown fox jumps | changing | changing how | changing how we | code | data | data and | data and problems | different | different fields | dog | fields | fox | fox jumps | fox jumps over | fun | fun let | fun let learn | have | have 50 | have 50 more | how | how to | how to code | how we | how we approach | in | in 2024 | in 2024 we | in different | in different fields | is | is changing | is changing how | is fun | is fun let | jumps | jumps over | jumps over the | lazy | lazy dog | learn | learn how | learn how to | learning | learning is | learning is changing | let | let learn | let learn how | machine | machine learning | machine learning is | more | more data | more data and | over | over the | over the lazy | problems | problems in | problems in different | problems to | problems to analyze | programming | programming is | programming is fun | python | python programming | python programming is | quick | quick brown | quick brown fox | the | the lazy | the lazy dog | the quick | the quick brown | to | to analyze | to code | we | we approach | we approach problems | we will | we will have | will | will have | will have 50 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
Python
# Convert the 'long_description' column into a list of documents
sdg_corpus = sdg['long_description'].tolist()
# Initialize CountVectorizer for tokenization and DFM creation
vectorizer = CountVectorizer()
# Fit and transform the corpus to create the Document-Feature Matrix (DFM)
sdg_dfm = vectorizer.fit_transform(sdg_corpus)
# Convert to DataFrame for better readability
sdg_dfm_df = pd.DataFrame(sdg_dfm.toarray(), columns=vectorizer.get_feature_names_out())
1. Count the number of words (features) in the DFM
2. Count the number of documents in the DFM
3. Find the most common features (top 10)
In our case, we deal with a small corpus; the full vocabulary fits on a single slide. These are the unigram features:
Python
['2024', '50', 'analyze', 'and', 'approach', 'brown', 'changing', 'code', 'data', 'different', 'dog', 'fields', 'fox', 'fun', 'have', 'how', 'in', 'is', 'jumps', 'lazy', 'learn', 'learning', 'let', 'machine', 'more', 'over', 'problems', 'programming', 'python', 'quick', 'the', 'to', 'we', 'will']
These are the bigram features:
Python
['2024 we', '50 more', 'and problems', 'approach problems', 'brown fox', 'changing how', 'data and', 'different fields', 'fox jumps', 'fun let', 'have 50', 'how to', 'how we', 'in 2024', 'in different', 'is changing', 'is fun', 'jumps over', 'lazy dog', 'learn how', 'learning is', 'let learn', 'machine learning', 'more data', 'over the', 'problems in', 'problems to', 'programming is', 'python programming', 'quick brown', 'the lazy', 'the quick', 'to analyze', 'to code', 'we approach', 'we will', 'will have']
And these are the trigram features:
Python
['2024 we will', '50 more data', 'and problems to', 'approach problems in', 'brown fox jumps', 'changing how we', 'data and problems', 'fox jumps over', 'fun let learn', 'have 50 more', 'how to code', 'how we approach', 'in 2024 we', 'in different fields', 'is changing how', 'is fun let', 'jumps over the', 'learn how to', 'learning is changing', 'let learn how', 'machine learning is', 'more data and', 'over the lazy', 'problems in different', 'problems to analyze', 'programming is fun', 'python programming is', 'quick brown fox', 'the lazy dog', 'the quick brown', 'we approach problems', 'we will have', 'will have 50']
In this toy example the resulting DFMs are not extremely sparse, because the corpus is tiny and every feature appears in at least one of the four documents.
In realistic corpora, however, most features do not appear in most documents, so the DFM contains a very high fraction of zeros – it is highly sparse.
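We can check this directly on the unigram DFM from above (a quick sketch, assuming sdg_dfm_df is still in memory):
Python
# Share of zero entries in the document-feature matrix
sparsity = (sdg_dfm_df == 0).sum().sum() / sdg_dfm_df.size
print(f"Sparsity: {sparsity:.1%}")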
Step 1: Turning Words to Lowercase
Step 2: Removing Punctuation
Step 3: Removing Stopwords
Step 4: Removing Numbers
Step 5: Tokenization
Step 6: Lemmatization
Step 7: Stemming
document | document_transformed |
---|---|
The quick brown fox jumps over the lazy dog. | the quick brown fox jumps over the lazy dog. |
Python programming is fun! Let's learn how to code. | python programming is fun! let's learn how to code. |
In 2024, we will have 50% more data and problems to analyze. | in 2024, we will have 50% more data and problems to analyze. |
Machine learning is changing how we approach problems in 5 different fields. | machine learning is changing how we approach problems in 5 different fields. |
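The tables in this section were produced separately; as a minimal sketch, the same steps could be implemented on a pandas DataFrame df (a name assumed here, matching the df used in the word-count snippet further below), starting with lowercasing:
Python
import pandas as pd
# The four example documents, stored in a 'document' column
df = pd.DataFrame({'document': [
    "The quick brown fox jumps over the lazy dog.",
    "Python programming is fun! Let's learn how to code.",
    "In 2024, we will have 50% more data and problems to analyze.",
    "Machine learning is changing how we approach problems in 5 different fields."
]})
# Step 1: turn every document to lowercase
df['document_transformed'] = df['document'].str.lower()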
This is how we remove punctuation:
document | document_transformed |
---|---|
The quick brown fox jumps over the lazy dog. | the quick brown fox jumps over the lazy dog |
Python programming is fun! Let's learn how to code. | python programming is fun lets learn how to code |
In 2024, we will have 50% more data and problems to analyze. | in 2024 we will have 50 more data and problems to analyze |
Machine learning is changing how we approach problems in 5 different fields. | machine learning is changing how we approach problems in 5 different fields |
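Continuing with the same df, one way to strip punctuation is a regular expression that keeps only word characters and whitespace:
Python
# Step 2: drop everything that is not a word character or whitespace
df['document_transformed'] = df['document_transformed'].str.replace(r'[^\w\s]', '', regex=True)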
These are the stopwords (NLTK's English stopword list) that we remove:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
This is how we remove stopwords:
document | document_transformed |
---|---|
The quick brown fox jumps over the lazy dog. | quick brown fox jumps lazy dog |
Python programming is fun! Let's learn how to code. | python programming fun lets learn code |
In 2024, we will have 50% more data and problems to analyze. | 2024 50 data problems analyze |
Machine learning is changing how we approach problems in 5 different fields. | machine learning changing approach problems 5 different fields |
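A sketch of this step, assuming the NLTK English stopword list shown above:
Python
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
# Step 3: keep only the words that are not in the stopword list
df['document_transformed'] = df['document_transformed'].apply(
    lambda doc: " ".join(w for w in doc.split() if w not in stop_words)
)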
Depending on the task, we may also remove numbers. This is how:
document | document_transformed |
---|---|
The quick brown fox jumps over the lazy dog. | quick brown fox jumps lazy dog |
Python programming is fun! Let's learn how to code. | python programming fun lets learn code |
In 2024, we will have 50% more data and problems to analyze. | data problems analyze |
Machine learning is changing how we approach problems in 5 different fields. | machine learning changing approach problems different fields |
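In code, this step can look as follows:
Python
# Step 4: delete digits, then collapse the extra whitespace this leaves behind
df['document_transformed'] = (
    df['document_transformed']
    .str.replace(r'\d+', '', regex=True)
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)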
This is what tokenization looks like:
document | document_transformed |
---|---|
The quick brown fox jumps over the lazy dog. | quick brown fox jumps lazy dog |
Python programming is fun! Let's learn how to code. | python programming fun lets learn code |
In 2024, we will have 50% more data and problems to analyze. | data problems analyze |
Machine learning is changing how we approach problems in 5 different fields. | machine learning changing approach problems different fields |
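In code, tokenization turns each cleaned document into a list of tokens (stored here in a hypothetical 'tokens' column; the table above joins the tokens back into a single string for display):
Python
from nltk.tokenize import word_tokenize
# Step 5: split each cleaned document into a list of tokens
df['tokens'] = df['document_transformed'].apply(word_tokenize)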
Lemmatization - algorithmic process of converting words to their lemmas. E.g.: am, are, is → be
This is how we lemmatize the column:
document | document_transformed |
---|---|
The quick brown fox jumps over the lazy dog. | quick brown fox jumps lazy dog |
Python programming is fun! Let's learn how to code. | python programming fun lets learn code |
In 2024, we will have 50% more data and problems to analyze. | data problems analyze |
Machine learning is changing how we approach problems in 5 different fields. | machine learning changing approach problems different fields |
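A sketch using NLTK's WordNetLemmatizer; the exact output depends on the lemmatizer and on the part-of-speech information it receives, so it may differ slightly from the table above:
Python
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# Step 6: map every token to its lemma (treated as a noun by default)
df['document_transformed'] = df['document_transformed'].apply(
    lambda doc: " ".join(lemmatizer.lemmatize(w) for w in doc.split())
)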
Stemming - process for reducing inflected (or sometimes derived) words to their stem, base or root form. Stemmers operate on single words without knowledge of the context.
E.g. production, producer, produce, produces, produced → produc
document | document_transformed |
---|---|
The quick brown fox jumps over the lazy dog. | quick brown fox jumps lazy dog |
Python programming is fun! Let's learn how to code. | python programming fun lets learn cod |
In 2024, we will have 50% more data and problems to analyze. | data problems analyz |
Machine learning is changing how we approach problems in 5 different fields. | machine learning changing approach problems different field |
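A sketch using NLTK's PorterStemmer; stemmers differ in how aggressive they are (Porter, Snowball, Lancaster), so the exact stems may not match the table above:
Python
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Step 7: reduce every token to its stem
df['document_transformed'] = df['document_transformed'].apply(
    lambda doc: " ".join(stemmer.stem(w) for w in doc.split())
)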
Python
from collections import Counter
# Split each document into words and count the frequency of each word
word_counts = Counter(" ".join(df['document_transformed']).split())
# Convert the word counts to a DataFrame and get the top 10 most frequent words
top_features = pd.DataFrame(word_counts.most_common(10), columns=['word', 'frequency'])
print(top_features)
word frequency
0 problems 2
1 quick 1
2 brown 1
3 fox 1
4 jumps 1
5 lazy 1
6 dog 1
7 python 1
8 programming 1
9 fun 1
The optimal representation of a corpus will depend on the particular research task
Quantitative Text Analysis allows us to address a wide variety of important research questions
There is no one right way to represent text for all research questions.
The representation we choose can be consequential for the results we present
Popescu (JCU): Lecture 13