Previously, we discussed the notion of similarity among documents.
Similarity relied on the assumption that each document can be represented by a vector of (weighted) feature counts.
Some of the ways we used to measure document similarity (illustrated in the sketch after this list) included:
Edit distances
Inner product
Euclidean distance
Cosine similarity
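As a quick refresher, here is a minimal sketch of how the vector-based measures compare two documents, using hypothetical toy count vectors rather than the lecture data (edit distance, by contrast, operates on the strings themselves):
Python
import numpy as np

# Two toy (weighted) feature-count vectors standing in for two documents
a = np.array([2, 0, 1, 3])
b = np.array([1, 1, 0, 2])

inner_product = np.dot(a, b)                  # higher when documents share (weighted) features
euclidean = np.linalg.norm(a - b)             # smaller means closer in feature space
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based, length-invariant

print(inner_product, euclidean, cosine)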
Application
Using the same dataset as before, we want to see how similar politicians are to one another.
Python
import pandas as pd

aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")

# Group the DataFrame by the 'name' column.
# For each group, take the 'body' column and concatenate all text entries into a single string, separated by spaces.
# Reset the index of the resulting DataFrame so that 'name' becomes a standard column instead of an index.
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()
aggression_texts_aggregated.head(8)
name body
0 Adam Afriyie I welcomed much of what was said by the honou...
1 Adam Holloway What recent discussions he has had on the futu...
2 Adam Ingram I think I said that it was an additional power...
3 Adam Price There has been much talk in the Chamber this a...
4 Adrian Bailey Given the failure of successive well-intention...
5 Adrian Sanders I do not know whether the honourable Gentleman...
6 Afzal Khan What recent progress the Government have made ...
7 Aidan Burley I should start with a declaration of interest ...
Application
Creating a Count-Vectorized Representation of Speeches
We first create a count-vectorized representation of the speeches:
Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])
We then select the politician of interest
Python
# Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity between Boris Johnson and all other politicians
# flatten() converts a 2D array into a 1D array.
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()
Application
Calculating the Cosine Similarity
We then create a friendlier DataFrame that we can visualize:
Python
# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from the aggregated data
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Show the top similarities
print(similarity_df.head())
name_politician cosine_similarity
145 Boris Johnson 1.000000
1590 William Hague 0.964526
283 David Miliband 0.964144
278 David Lidington 0.963183
1356 Philip Hammond 0.963094
Application
Comparing to other Speeches
Python
# Selecting the top 11
similarity_df2 = similarity_df.head(11)

# Selecting everything but Boris Johnson
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
R
library(dplyr)
library(ggplot2)

similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(title = "Cosine Similarity Scores with Boris Johnson",
       x = "Politicians",
       y = "Cosine Similarity") +
  theme_bw() +
  ylim(0, 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Application
Comparing it with Other Speeches
As you can see below, most speeches are similar to one another.
R
library(dplyr)

similarity_df2 <- reticulate::py$similarity_df

# Create the histogram
ggplot(similarity_df2, aes(x = cosine_similarity)) +
  geom_histogram(binwidth = 0.05, color = "white", alpha = 0.7) +
  labs(title = "Histogram of Cosine Similarities",
       x = "Cosine Similarity",
       y = "Frequency") +
  theme_bw()
Application
Misleading Word Counts
Let us compare the most common features:
Python
# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get the raw term frequencies for 'Boris Johnson'
text_term_frequencies = text_dfm[boris_index].toarray().flatten()

# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
print(top_features)
term freq
35017 the 85
35375 to 43
24497 of 34
35010 that 33
2684 and 28
19341 is 25
18172 in 22
38445 will 13
Application
Misleading Word Counts
Feature selection matters! Similarities here are being driven by substantively unimportant words.
One solution would be to remove stopwords and try again.
Application
Creating a Count-Vectorized Representation without Stopwords
We can remove stopwords.
Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
import nltk

# Download the NLTK stopword list if not already done
nltk.download('stopwords')

# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Vectorize the cleaned text
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Find the index of the selected politician (Boris Johnson)
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from the aggregated data
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_new_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Show the top similarities
print(similarity_new_df.head())
name_politician cosine_similarity
145 Boris Johnson 1.000000
256 David Cameron 0.569765
1018 Mr John Major 0.569098
278 David Lidington 0.566099
1564 Tony Blair 0.562866
Application
Creating a Vectorized representation of speeches
We then sort by similarity and keep the ten most similar politicians, excluding Boris Johnson himself:
Python
# Sort by cosine similarity in descending order
similarity_new_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Show the top similarities
print(similarity_new_df.head())

# Selecting the first 11
similarity_df2 = similarity_new_df.head(11)

# Excluding Boris Johnson
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
Application
After removing Misleading Word Counts
This is what the output looks like if we remove the stopwords.
R
library(dplyr)

similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(title = "Cosine Similarity Scores with Boris Johnson",
       x = "Politicians",
       y = "Cosine Similarity") +
  theme_bw() +
  ylim(0, 0.7) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Application
After removing Misleading Word Counts
Here is how the stopword-free results compare with the original DataFrame (before removing stopwords):
Python
import pandas as pd

aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from the aggregated data
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Select the first 11
similarity_df2 = similarity_df.head(11)

# Select everything but Boris Johnson
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
R
library(dplyr)
library(ggplot2)

similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(title = "Cosine Similarity Scores with Boris Johnson",
       x = "Politicians",
       y = "Cosine Similarity") +
  theme_bw() +
  ylim(0, 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Python
# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from the aggregated data
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Select the first 11
similarity_df2 = similarity_df.head(11)

# Select everything but Boris Johnson
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
R
library(dplyr)
library(ggplot2)

similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(title = "Cosine Similarity Scores with Boris Johnson",
       x = "Politicians",
       y = "Cosine Similarity") +
  theme_bw() +
  ylim(0, 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Application
After removing Misleading Word Counts
And here are the leading words.
Python
# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get the raw term frequencies for Boris Johnson
text_term_frequencies = text_dfm[boris_index].toarray().flatten()

# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
top_features
term freq
17343 honourable 11
38748 would 7
28951 referendum 6
22764 minister 6
14121 fields 5
14904 free 5
26525 playing 5
17490 house 4
Application
After removing Misleading Word Counts
And this is how the top words compare:
Python
import pandas as pd

aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get the raw term frequencies for 'Boris Johnson'
text_term_frequencies = text_dfm[boris_index].toarray().flatten()

# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
print(top_features)
term freq
35017 the 85
35375 to 43
24497 of 34
35010 that 33
2684 and 28
19341 is 25
18172 in 22
38445 will 13
Python
# Download the NLTK stopword list if not already done
# nltk.download('stopwords')

# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Vectorize the cleaned text
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Get the raw term frequencies for 'Boris Johnson'
text_term_frequencies = text_dfm[boris_index].toarray().flatten()

# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
print(top_features)
term freq
17343 honourable 11
38748 would 7
28951 referendum 6
22764 minister 6
14121 fields 5
14904 free 5
26525 playing 5
17490 house 4
Caveats
When comparing the speeches of Boris Johnson to those of other politicians, we characterized each speech by the raw count of each word.
We used raw term frequency to characterize similarity:
each word is considered equally important.
This way of counting words is called bag-of-words: a model that represents text as an unordered collection of words.
Caveats: bag-of-words
The following models a text document using bag-of-words. Here are two simple text documents:
(1) John likes to watch movies. Mary likes movies too.
(2) Mary also likes to watch football games.
Based on these two text documents, a list is constructed as follows for each document:
BoW1 = {"John": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "Mary": 1, "too": 1}
BoW2 = {"Mary": 1, "also": 1, "likes": 1, "to": 1, "watch": 1, "football": 1, "games": 1}
Each key is a word, and each value is the number of times that word occurs in the document. The order of the keys does not matter, so {"too": 1, "Mary": 1, "movies": 2, "John": 1, "watch": 1, "likes": 2, "to": 1} is also equivalent to BoW1. It is also what we expect from a strict JSON object representation.
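The same representation can be produced with scikit-learn's CountVectorizer; this is a minimal sketch using only the two toy documents above:
Python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "Mary also likes to watch football games.",
]

# One row per document, one column per vocabulary word; word order is discarded
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(bow.toarray())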
Caveats: bag-of-words
So, the bag-of-words representation characterizes documents according to the raw counts of each word.
The critical problem with using raw term frequency is that all terms are considered equally important when it comes to assessing similarity.
One way of avoiding this problem is to weigh the vectors of word counts in ways that make our text representations more informative.
The most common strategy is tf-idf weighting
Tf-idf intuition
Tf-idf stands for “term-frequency-inverse-document-frequency”
Tf-idf assigns higher weights to:
words that are common in a given document (“term-frequency”)
words that are rare in the corpus (i.e. multiple documents) (“inverse-document-frequency”)
Tf-idf intuition
TF (Term Frequency): rewards terms that occur often in the current document, since frequent terms are likely to be important in that context.
IDF (Inverse Document Frequency): Penalizes terms that are common across many documents (like “the” or “and”), and boosts those that appear in fewer documents, because they might be more discriminative or unique.
Thus, words that are downgraded include:
stopwords
terms that are domain specific and are frequently used across documents
Words that are upgraded include the more distinctive words, which are more helpful for characterizing a given text.
Tf-idf intuition
The term-frequency-inverse-document-frequency (tf-idf) weighting scheme assigns to feature \(j\) in document \(i\) the weight:
\[
\textrm{tf-idf}_{i,j} = W_{i,j} \times \log\left(\frac{N}{df_j}\right)
\]
where:
\(W_{i,j}\) - the number of times feature \(j\) appears in document \(i\)
\(df_{j}\) - the number of documents in the corpus that contain feature \(j\)
\(N\) - the total number of documents
We use \(\log(\frac{N}{df_j})\) rather than \(\frac{N}{df_j}\) in order to avoid placing very large weights on rare words.
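To make the formula concrete, here is a minimal sketch computing the weights by hand on a hypothetical three-document toy corpus (not the speech data):
Python
import numpy as np

# Toy corpus: N = 3 documents (rows), 3 features (columns); entries are W_{i,j}
W = np.array([
    [3, 0, 2],
    [1, 0, 2],
    [0, 4, 2],
])

N = W.shape[0]
df = (W > 0).sum(axis=0)   # df_j: number of documents containing feature j
idf = np.log(N / df)       # log(N / df_j)
tf_idf = W * idf           # W_{i,j} * log(N / df_j)

print(df)      # [2 1 3]
print(tf_idf)  # the third feature appears in every document, so its weight is 0 everywhere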
Tf-idf intuition
Characteristics
Tf-idf will be highest when feature \(j\) occurs many times in a small number of documents
Tf-idf will be lower when feature \(j\) occurs few times in a document, or occurs in many documents
Tf-idf will be lowest when feature \(j\) occurs in virtually all documents
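These characteristics can also be checked with scikit-learn's TfidfVectorizer on a toy corpus; note (as an aside) that by default it uses a smoothed idf, log((1 + N) / (1 + df_j)) + 1, and L2-normalizes each document, so the exact numbers differ from the formula above even though the same ordering holds:
Python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "trade trade policy common",
    "trade policy common",
    "referendum referendum common",
]

tfidf = TfidfVectorizer()   # defaults: smooth_idf=True, norm='l2'
X = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(tfidf.idf_)           # 'common' appears in every document and gets the lowest idf
print(X.toarray().round(2))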
Tf-idf Application
What are the highest tf-idf-weighted words in Boris Johnson's speeches?
Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
import nltk

# Download NLTK stopwords (if not already done)
# nltk.download('stopwords')

# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords function
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Load the dataset
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")

# Aggregate the texts by "name"
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Create a CountVectorizer for raw term frequencies
count_vectorizer = CountVectorizer()
text_dfm = count_vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Create a TfidfVectorizer for TF-IDF scores
tfidf_vectorizer = TfidfVectorizer()
tfidf_dfm = tfidf_vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Get the feature names (words) from the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Identify the index for 'Boris Johnson'
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Get the TF-IDF scores for 'Boris Johnson'
tfidf_scores = tfidf_dfm[boris_index].toarray().flatten()

# Create a DataFrame with terms and their TF-IDF scores for easy sorting
tfidf_df = pd.DataFrame({
    'term': feature_names,
    'tfidf': tfidf_scores
})

# Get the top 8 features by TF-IDF score
top_tfidf_features = tfidf_df.sort_values(by='tfidf', ascending=False).head(8)
print(top_tfidf_features)
term tfidf
14121 fields 0.262896
28951 referendum 0.196836
26525 playing 0.185771
19541 jcpoa 0.160847
19387 israel 0.140571
9169 criminalised 0.137572
17343 honourable 0.135140
23696 nato 0.133815
Tf-idf Application
We now look at similarities to other politicians
Python
# Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_tfidf = cosine_similarity(tfidf_dfm[boris_index], tfidf_dfm).flatten()

# Add cosine similarity to the DataFrame
aggression_texts_aggregated['cosine_similarity_to_boris'] = cosine_sim_tfidf

# Sort the DataFrame by cosine similarity (highest similarity first)
similarity_results_tfidf = aggression_texts_aggregated.sort_values(by='cosine_similarity_to_boris', ascending=False)

# Rename the columns
similarity_results_tfidf = similarity_results_tfidf.rename(columns={
    'name': 'name_politician',
    'cosine_similarity_to_boris': 'cosine_similarity'
})

# Selecting the first 11 observations
similarity_results_tfidf2 = similarity_results_tfidf.head(11)

# Dropping Boris Johnson
similarity_results_tfidf2 = similarity_results_tfidf2[similarity_results_tfidf2["name_politician"] != "Boris Johnson"]
R
library(dplyr)

similarity_results_tfidf2 <- reticulate::py$similarity_results_tfidf2

# Create the bar chart
ggplot(similarity_results_tfidf2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(title = "Cosine Similarity Scores with Boris Johnson",
       x = "Politicians",
       y = "Cosine Similarity") +
  theme_bw() +
  ylim(0, 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Application
After removing Misleading Word Counts
Here is how the results compare with the original DataFrame (before removing stopwords):
Python
import pandas as pd

aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from the aggregated data
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Selecting the first 11 observations
similarity_df2 = similarity_df.head(11)

# Dropping Boris Johnson
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
R
library(dplyr)

similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(title = "Cosine Similarity Scores with Boris Johnson",
       x = "Politicians",
       y = "Cosine Similarity") +
  theme_bw() +
  ylim(0, 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Python
# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from the aggregated data
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Selecting the first 11 observations
similarity_df2 = similarity_df.head(11)

# Dropping Boris Johnson
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
R
library(dplyr)

similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(title = "Cosine Similarity Scores with Boris Johnson",
       x = "Politicians",
       y = "Cosine Similarity") +
  theme_bw() +
  ylim(0, 1) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Caveats Associated with Tf-idf Cosine similarity
There are, however, some caveats associated with cosine similarity.
Note
Text 1: "Artificial intelligence has revolutionized text processing"
Text 2: "Progress in computational linguistics is dramatic"
The message of these two sentences is pretty much the same.
index    artificial  intelligence  revolutionized  text  processing  progress  in   computational  linguistics  is   dramatic
text_1   1.0         1.0           1.0             1.0   1.0         0.0       0.0  0.0            0.0          0.0  0.0
text_2   0.0         0.0           0.0             0.0   0.0         1.0       1.0  1.0            1.0          1.0  1.0
Yet the cosine similarity is 0: the two sets of words do not overlap.
\[cos(\theta) = \frac{a \cdot b}{||a|| ||b||}=0\]
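A minimal sketch confirming this with scikit-learn (a toy computation, separate from the speech analysis):
Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "Artificial intelligence has revolutionized text processing",
    "Progress in computational linguistics is dramatic",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# The two texts share no vocabulary, so the dot product (and hence the cosine) is 0
print(cosine_similarity(X[0], X[1]))  # [[0.]]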
Word-embedding approaches are the solution to this. More on this in future lectures.
Word Clouds
A common way to visualize differences is by using word clouds
Word clouds are visual representations of the frequency and importance of words in a given text.
The size of each word indicates its frequency or importance within the text.
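As a minimal sketch (hypothetical toy text, not the speech data), a word cloud can be generated directly in Python with the wordcloud package; the R code below uses quanteda's textplot_wordcloud() instead:
Python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = "trade trade trade policy policy referendum economy economy economy economy"

# Word size is driven by relative frequency in the input text
wc = WordCloud(width=400, height=300, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()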
Word Clouds: Johnson vs. Cameron
Python
import pandas as pd
import re
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Load the dataset
file_path = "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv"
aggression_texts = pd.read_csv(file_path)

text_johnson = aggression_texts.loc[aggression_texts['name'] == 'Boris Johnson']
text_cameron = aggression_texts.loc[aggression_texts['name'] == 'David Cameron']

# Load English stopwords
stop_words = set(stopwords.words('english'))

# Function to clean text by removing stopwords
def clean_text(text):
    # Remove punctuation using regex
    text = re.sub(r'[^\w\s]', '', str(text))
    # Remove stopwords
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])

# Apply text cleaning to the datasets
text_johnson['cleaned_text'] = text_johnson['body'].apply(clean_text)
text_cameron['cleaned_text'] = text_cameron['body'].apply(clean_text)
R
library(quanteda)
library(gridExtra)
library(png)
library(grid)

# Save the word clouds to temporary image files
png("johnson_wordcloud.png", width = 800, height = 800)
textplot_wordcloud(text_johnson2, max_words = 300)
garbage <- dev.off()

png("cameron_wordcloud.png", width = 800, height = 800)
textplot_wordcloud(text_cameron2, max_words = 300)
garbage <- dev.off()

# Read the images back as grobs
johnson_grob <- rasterGrob(readPNG("johnson_wordcloud.png"))
cameron_grob <- rasterGrob(readPNG("cameron_wordcloud.png"))

# Arrange the grobs side by side
grid.arrange(johnson_grob, cameron_grob, ncol = 2)
Even here, it is difficult to identify distinguishing words.
The primary difficulty is the fact that the X and Y axes are meaningless.
Fightin’ Words
One approach is to visualise the difference in word use across groups by using the Fightin' Words method (Monroe et al. 2008).
This starts with calculating the smoothed probability of observing word \(j\) in documents of category \(k\):
\[
\hat{\mu}_{j,k} = \frac{W^{*}_{j,k} + a_j}{n_k + \sum_{j'} a_{j'}}
\]
where:
\(W^{*}_{j,k}\) - the number of times feature \(j\) appears in documents in category \(k\)
\(n_k\) - the total number of tokens in documents in category \(k\)
\(a_j\) - a "regularization" parameter which shrinks differences in very common words towards 0
Fightin’ Words
We then take the difference in log-odds between category \(k\) and category \(k'\):
\[
\textrm{log-odds-ratio}_{j,k} = \log\frac{\hat{\mu}_{j,k}}{1-\hat{\mu}_{j,k}} - \log\frac{\hat{\mu}_{j,k'}}{1-\hat{\mu}_{j,k'}}
\]
This ratio estimates the relative probability of the use of word \(j\) between the two groups.
When the ratio is positive, group \(k\) uses the word more often; when it is negative, group \(k'\) uses it more often.
Fightin’ Words
The final step is to standardize the log-odds ratio by its variance:
\[
\textrm{Fightin' Words Score}_j =
\frac{\textrm{log-odds-ratio}_{j,k}}
{\sqrt{\mathrm{Var}(\textrm{log-odds-ratio}_{j,k})}}
\]
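To make the formula concrete, here is a minimal numeric sketch for a single word in a hypothetical two-group corpus (toy numbers, not the speech data), following the same steps as the implementation below:
Python
import numpy as np

# Toy quantities for one word j in groups k and k'
W_k, n_k = 30, 10_000     # word count and total tokens in group k
W_kp, n_kp = 10, 12_000   # word count and total tokens in group k'
a_j, a_0 = 0.5, 50.0      # prior for word j and total prior mass

# Smoothed probabilities
mu_k = (W_k + a_j) / (n_k + a_0)
mu_kp = (W_kp + a_j) / (n_kp + a_0)

# Log-odds ratio between the two groups
log_odds_ratio = np.log(mu_k / (1 - mu_k)) - np.log(mu_kp / (1 - mu_kp))

# Approximate variance and standardized Fightin' Words score
variance = 1 / (W_k + a_j) + 1 / (W_kp + a_j)
score = log_odds_ratio / np.sqrt(variance)

print(round(log_odds_ratio, 3), round(score, 3))  # positive: group k uses the word more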
Fightin’ Words
Here is how we implement this technique in Python:
Python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import re
import nltk

# Download NLTK stopwords if not already available
nltk.download('stopwords')

# Text preprocessing
def preprocess_text(text):
    # Remove punctuation, symbols, and numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation and symbols
    text = re.sub(r'\d+', '', text)      # Remove numbers
    # Convert to lowercase
    text = text.lower()
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    text = " ".join(word for word in text.split() if word not in stop_words)
    return text

# Apply preprocessing to the 'body' column
aggression_texts['body'] = aggression_texts['body'].apply(preprocess_text)

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Subset text data for Johnson and Cameron
text_group1 = aggression_texts.loc[aggression_texts['name'] == 'Boris Johnson']
text_group2 = aggression_texts.loc[aggression_texts['name'] == 'David Cameron']

# Create document-term matrices (DFMs) for each group
dfm_input1 = vectorizer.fit_transform(text_group1['body'])
dfm_input2 = vectorizer.transform(text_group2['body'])

# Get feature names
features = vectorizer.get_feature_names_out()

# Sum term frequencies for each group
counts_group1 = np.array(dfm_input1.sum(axis=0)).flatten()
counts_group2 = np.array(dfm_input2.sum(axis=0)).flatten()
Fightin’ Words
We then compute the smoothed probabilities, log-odds ratios, and standardized scores:
Python
# Combine raw counts into a single DataFrame
dfm_counts = pd.DataFrame({
    "Johnson": counts_group1,  # Token counts from group 1 (Johnson speeches)
    "Cameron": counts_group2,  # Token counts from group 2 (Cameron speeches)
    "feature": features        # Corresponding feature (word/token)
})

# Add Dirichlet priors to avoid zero probabilities and stabilize estimates
alpha_0 = 0.1  # Overall strength of the prior (smaller = weaker prior)
total_counts = counts_group1 + counts_group2
alpha_w = total_counts * (alpha_0 / total_counts.sum())  # Prior scaled to overall frequency
dfm_counts["Johnson"] += alpha_w  # Add prior to Johnson counts
dfm_counts["Cameron"] += alpha_w  # Add prior to Cameron counts

# Convert counts to probabilities (smoothed word distributions)
dfm_counts["mu_johnson"] = dfm_counts["Johnson"] / dfm_counts["Johnson"].sum()
dfm_counts["mu_cameron"] = dfm_counts["Cameron"] / dfm_counts["Cameron"].sum()

# Calculate log-odds for each word in each group
dfm_counts["log_odds_johnson"] = np.log(dfm_counts["mu_johnson"] / (1 - dfm_counts["mu_johnson"]))
dfm_counts["log_odds_cameron"] = np.log(dfm_counts["mu_cameron"] / (1 - dfm_counts["mu_cameron"]))

# Compute log-odds ratio between groups
dfm_counts["log_odds_ratio"] = dfm_counts["log_odds_johnson"] - dfm_counts["log_odds_cameron"]

# Estimate variance of the log-odds ratio (approximate)
dfm_counts["variance"] = 1 / dfm_counts["Johnson"] + 1 / dfm_counts["Cameron"]

# Calculate standardized score (z-score) for significance ranking
dfm_counts["score"] = dfm_counts["log_odds_ratio"] / np.sqrt(dfm_counts["variance"])

# Return results sorted by score, highest absolute difference at the top
result = dfm_counts[["feature", "score", "Johnson", "Cameron"]].sort_values(by="score", ascending=False)
Fightin’ Words
These are the top-scoring features:
feature        score      Johnson    Cameron
referendum     3.544026   6.000279   10.000279
playing        3.420876   5.000209   7.000209
free           3.139933   5.000244   9.000244
global         2.499159   3.000139   5.000139
african        2.472248   2.000052   1.000052
advance        2.472248   2.000052   1.000052
applications   2.472248   2.000052   1.000052
concrete       2.472248   2.000052   1.000052
sport          2.472248   2.000052   1.000052
provide        2.334536   2.00007    2.00007
proposal       2.112993   2.000087   3.000087
region         2.112993   2.000087   3.000087
partners       2.112993   2.000087   3.000087
iran           2.112993   2.000087   3.000087
arrest         2.112993   2.000087   3.000087
stability      2.112993   2.000087   3.000087
campaign       1.894895   2.000105   4.000105
learned        1.718199   3.000227   10.000227
decided        1.649506   1.000035   1.000035
previously     1.649506   1.000035   1.000035
Fightin’ Words
We then prepare the results for plotting:
Python
# Prepare `result` for ggplot
result['n'] = result['Johnson'] + result['Cameron']  # Total count for the word
result['log_n'] = np.log(result['n'])                # Log-transformed total count
result['cex'] = np.abs(result['score'])              # Text size
result['alpha'] = np.abs(result['score'])            # Opacity

result_df = result[['log_n', 'score', 'feature', 'cex', 'alpha', 'Johnson', 'Cameron']]
log_n      score      feature        cex        alpha      Johnson    Cameron
2.772624   3.544026   referendum     3.544026   3.544026   6.000279   10.000279
2.484942   3.420876   playing        3.420876   3.420876   5.000209   7.000209
2.639092   3.139933   free           3.139933   3.139933   5.000244   9.000244
2.079476   2.499159   global         2.499159   2.499159   3.000139   5.000139
1.098647   2.472248   african        2.472248   2.472248   2.000052   1.000052
1.098647   2.472248   advance        2.472248   2.472248   2.000052   1.000052
1.098647   2.472248   applications   2.472248   2.472248   2.000052   1.000052
1.098647   2.472248   concrete       2.472248   2.472248   2.000052   1.000052
1.098647   2.472248   sport          2.472248   2.472248   2.000052   1.000052
1.386329   2.334536   provide        2.334536   2.334536   2.00007    2.00007
Fightin’ Words
Finally, we plot the Fightin' Words scores:
R
library(dplyr)

result_df2 <- reticulate::py$result_df

ggplot(data = result_df2, aes(x = log_n,             # x-axis
                              y = score,             # y-axis
                              label = feature,       # text labels
                              cex = abs(score),      # text size
                              alpha = abs(score))) + # opacity
  geom_text() +                                  # plot text
  xlab("log(n)") +                               # x-axis label
  ylab("Fightin' words score") +                 # y-axis label
  theme_bw() +                                   # nice black and white theme
  # theme(panel.grid = element_blank()) +        # remove grid lines
  scale_size_continuous(guide = "none") +        # remove size legend
  scale_alpha_continuous(guide = "none")         # remove alpha legend
johnson      cameron
honourable   cherie
would        burdensome
referendum   peep
minister     accreted
free         acquire
playing      advancement
fields       afternoons
right        xtaggart
gentleman    guarantor
friend       alive
house        prosperous
know         balkans
last         priorities
believe      rid
uk           barricades
people       battalion
said         presented
government   annual
one          presence
us           owing
Fightin’ Words
Here is how we plot the results:
R
library(dplyr)

result_df2 <- reticulate::py$result_df

ggplot(data = result_df2, aes(x = log_n,             # x-axis
                              y = score,             # y-axis
                              label = feature,       # text labels
                              cex = abs(score),      # text size
                              alpha = abs(score))) + # opacity
  geom_text() +                                  # plot text
  xlab("log(n)") +                               # x-axis label
  ylab("Fightin' words score") +                 # y-axis label
  theme_bw() +                                   # nice black and white theme
  # theme(panel.grid = element_blank()) +        # remove grid lines
  scale_size_continuous(guide = "none") +        # remove size legend
  scale_alpha_continuous(guide = "none")         # remove alpha legend
johnson       cameron
cherie        honourable
rid           right
barricades    people
battalion     minister
presented     prime
presence      said
blaby         would
boycott       gentleman
priorities    friend
boycotts      think
vindicated    one
pence         government
consular      work
mornings      us
continent     get
inspiration   need
tillerson     point
washington    take
kinds         say
balkans       made
Conclusion
Text Representation:
The Vector Space Model transforms text into numerical vectors, enabling computation of similarity and other metrics.
TF-IDF weighting refines word importance by downweighting terms that are common across the corpus.
Similarity Metrics:
Measures like cosine similarity quantify relationships between documents or vectors.
Visualization Techniques:
Word clouds and “Fightin’ Words” highlight contrasts between corpora, leveraging both frequency and weighting.