L18: Other Ways of Measuring Similarity

Bogdan G. Popescu

John Cabot University

Similarity

Previously, we discussed the notion of similarity among documents.

Similarity relied on the assumption that each document can be represented by a vector of (weighted) feature counts

Some of the ways to measure document similarity included (see the sketch after this list):

  1. Edit distances
  2. Inner product
  3. Euclidean distance
  4. Cosine similarity
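
As a quick refresher, here is a minimal sketch (using NumPy and a hand-rolled edit-distance function; the vectors and strings are made up for illustration) of how each of these measures can be computed:

Python
import numpy as np

# Two toy documents as raw term-count vectors over a shared vocabulary
# vocabulary: ["economy", "tax", "health", "war"]
a = np.array([3, 1, 0, 2])
b = np.array([2, 0, 1, 2])

# 1. Edit (Levenshtein) distance works on the raw strings, not on the vectors
def levenshtein(s, t):
    # Classic dynamic-programming edit distance
    d = np.zeros((len(s) + 1, len(t) + 1), dtype=int)
    d[:, 0] = np.arange(len(s) + 1)
    d[0, :] = np.arange(len(t) + 1)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # substitution
    return d[len(s), len(t)]

print(levenshtein("economy tax", "economy war"))                # 1. edit distance
print(np.dot(a, b))                                             # 2. inner product
print(np.linalg.norm(a - b))                                    # 3. Euclidean distance
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))   # 4. cosine similarity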

Application

Using the same dataset as before, we want to see how similar politicians are to one another.

Python
import pandas as pd
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
# Group the DataFrame by the 'name' column.
# For each group, take the 'body' column and concatenate all text entries into a single string, separated by spaces.
# Reset the index of the resulting DataFrame so that 'name' becomes a standard column instead of an index.
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()
aggression_texts_aggregated.head(8)
             name                                               body
0    Adam Afriyie  I welcomed much of what was said by  the honou...
1   Adam Holloway  What recent discussions he has had on the futu...
2     Adam Ingram  I think I said that it was an additional power...
3      Adam Price  There has been much talk in the Chamber this a...
4   Adrian Bailey  Given the failure of successive well-intention...
5  Adrian Sanders  I do not know whether the honourable Gentleman...
6      Afzal Khan  What recent progress the Government have made ...
7    Aidan Burley  I should start with a declaration of interest ...

Application

Creating a Count-Vectorized Representation of Speeches

We first create the count-vectorized representation of the speeches:

Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

We then select the politician of interest

Python
#Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]
# Calculate the cosine similarity between Boris Johnson and all other politicians
# flatten() converts a 2D array into a 1D array.
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

Application

Calculating the Cosine Similarity

We then create a more friendly dataframe that we can visualize:

Python
# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from aggression_texts_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Show the top similarities
print(similarity_df.head())
      name_politician  cosine_similarity
145     Boris Johnson           1.000000
1590    William Hague           0.964526
283    David Miliband           0.964144
278   David Lidington           0.963183
1356   Philip Hammond           0.963094

Application

Comparing to other Speeches

Python
# Selecting the top 11
similarity_df2 = similarity_df.head(11)
# Selecting everything but Boris Johnson
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
See code
R
library(dplyr)
library(ggplot2)
similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Cosine Similarity Scores with Boris Johnson",
    x = "Politicians",
    y = "Cosine Similarity"
  ) +
    theme_bw()+
  ylim(0, 1)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Application

Comparing it with Other Speeches

As you can see below, most speeches are similar to one another.

See code
R
library(dplyr)
similarity_df2 <- reticulate::py$similarity_df

# Create the bar chart
ggplot(similarity_df2, aes(x = cosine_similarity)) +
  geom_histogram(binwidth = 0.05, color = "white", alpha = 0.7) +
  labs(title = "Histogram of Cosine Similarities",
       x = "Cosine Similarity",
       y = "Frequency")+
  theme_bw()

Application

Misleading Word Counts

Let us compare the most common features:

Python
# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()
# Get the raw term frequencies for 'Boris Johnson'
text_term_frequencies = text_dfm[boris_index].toarray().flatten()
# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by raw frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
print(top_features)
       term  freq
35017   the    85
35375    to    43
24497    of    34
35010  that    33
2684    and    28
19341    is    25
18172    in    22
38445  will    13

Application

Misleading Word Counts

Feature selection matters! Similarities here are being driven by substantively unimportant words.

One solution would be to remove stopwords and try again.

Application

Removing Stopwords Before Re-Vectorizing the Speeches

We can remove stopwords and recompute the similarities:

Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
import nltk
# Download the NLTK stopword list if not already done
nltk.download('stopwords')

# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Vectorize the cleaned text
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Find the index of the selected politician (Boris Johnson)
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from aggression_texts_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_new_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Show the top similarities
print(similarity_new_df.head())
      name_politician  cosine_similarity
145     Boris Johnson           1.000000
256     David Cameron           0.569765
1018    Mr John Major           0.569098
278   David Lidington           0.566099
1564       Tony Blair           0.562866

Application

Selecting the Most Similar Politicians

We then select the politicians most similar to Boris Johnson:

Python
# Sort by cosine similarity in descending order
similarity_new_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Show the top similarities
print(similarity_new_df.head())

# Selecting the first 11
similarity_df2 = similarity_new_df.head(11)

# Excluding Boris Johnson
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]

Application

After Removing Misleading Word Counts

This is what the output looks like if we remove the stopwords.

See code
R
library(dplyr)
similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Cosine Similarity Scores with Boris Johnson",
    x = "Politicians",
    y = "Cosine Similarity"
  ) +
    theme_bw()+
  ylim(0, 0.7)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Application

After Removing Misleading Word Counts

Here is how the new results compare with the original dataframe:

See code
Python
import pandas as pd
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
#aggression_texts.head(3)
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

#Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]
# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from aggression_texts_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

#Select the first 11
similarity_df2 = similarity_df.head(11)

#Select everything but Boris Johnson
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
See code
R
library(dplyr)
library(ggplot2)
similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Cosine Similarity Scores with Boris Johnson",
    x = "Politicians",
    y = "Cosine Similarity"
  ) +
    theme_bw()+
  ylim(0, 1)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

See code
Python
# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])


#Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]
# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from aggression_texts_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

#Select the first 11
similarity_df2 = similarity_df.head(11)

#Select everything but Boris Johnson
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
See code
R
library(dplyr)
library(ggplot2)
similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Cosine Similarity Scores with Boris Johnson",
    x = "Politicians",
    y = "Cosine Similarity"
  ) +
    theme_bw()+
  ylim(0, 1)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Application

After Removing Misleading Word Counts

And here are the leading words.

Python
# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()
# Get the raw term frequencies for Boris Johnson
text_term_frequencies = text_dfm[boris_index].toarray().flatten()
# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by raw frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
top_features
             term  freq
17343  honourable    11
38748       would     7
28951  referendum     6
22764    minister     6
14121      fields     5
14904        free     5
26525     playing     5
17490       house     4

Application

After Removing Misleading Word Counts

And this is how the top words compare:

See code
Python
import pandas as pd
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
#aggression_texts.head(3)
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()


# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()
# Get the raw term frequencies for 'Boris Johnson'
text_term_frequencies = text_dfm[boris_index].toarray().flatten()
# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by raw frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
print(top_features)
       term  freq
35017   the    85
35375    to    43
24497    of    34
35010  that    33
2684    and    28
19341    is    25
18172    in    22
38445  will    13
See code
Python
# Download the NLTK stopword list if not already done
#nltk.download('stopwords')
# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Vectorize the cleaned text
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()
# Get the raw term frequencies for 'Boris Johnson'
text_term_frequencies = text_dfm[boris_index].toarray().flatten()
# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by raw frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
print(top_features)
             term  freq
17343  honourable    11
38748       would     7
28951  referendum     6
22764    minister     6
14121      fields     5
14904        free     5
26525     playing     5
17490       house     4

Caveats

When comparing the speeches of Boris Johnson to those of other politicians, we characterized each speech by the raw counts of each word.

We used the raw term frequencies to characterize similarity:

  • each word is considered equally important.

This way of counting words is called the bag-of-words model: a representation of text as an unordered collection of word counts.

Caveats: bag-of-words

The following example models text documents using bag-of-words. Here are two simple text documents:

(1) John likes to watch movies. Mary likes movies too.
(2) Mary also likes to watch football games.

Based on these two text documents, a list of tokens is constructed for each document:

"John","likes","to","watch","movies","Mary","likes","movies","too"

"Mary","also","likes","to","watch","football","games"

Representing each bag-of-words as a JSON object (dictionary), we have the following:

BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
BoW2 = {"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};

Caveats: bag-of-words

BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
BoW2 = {"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};

Each key is the word, and each value is the number of occurrences of that word in the given text document.

The order of elements is free, so, for example:

{"too":1,"Mary":1,"movies":2,"John":1,"watch":1,"likes":2,"to":1}

is equivalent to BoW1. This is also what we expect from a strict JSON object representation, where key order carries no meaning.

Caveats: bag-of-words

So, the bag-of-words representation characterizes documents according to the raw counts of each word.

The critical problem with using raw term frequency is that all terms are considered equally important when it comes to assessing similarity.

One way of avoiding this problem is to weigh the vectors of word counts in ways that make our text representations more informative.

The most common strategy is tf-idf weighting

Tf-idf intuition

Tf-idf stands for “term-frequency-inverse-document-frequency”

Tf-idf assigns higher weights to:

  • words that are common in a given document (“term-frequency”)
  • words that are rare in the corpus (i.e., across the full set of documents) (“inverse-document-frequency”)

Tf-idf intuition

TF (Term Frequency): rewards terms that occur often in the current document, since such terms are likely to be important in that context.

IDF (Inverse Document Frequency): penalizes terms that are common across many documents (like “the” or “and”) and boosts those that appear in fewer documents, because they might be more discriminative or unique.

Thus, words that are downweighted include:

  • stopwords
  • domain-specific terms that are used frequently across documents

Words that are upweighted are the more distinctive ones: those that are more helpful for characterizing a given text.

Tf-idf intuition

The Term-frequency-inverse-document-frequency (tf-idf) weighting scheme assigns to feature j a weight in document i:

\[ \begin{aligned} \text{tf-idf}_{i,j} &= W_{i,j} \times \text{idf}_j \\ &= W_{i,j} \times \log\left(\frac{N}{df_j}\right) \end{aligned} \]

where:

  • \(W_{i,j}\) - the number of times feature \(j\) appears in document \(i\)
  • \(df_{j}\) - the number of documents in the corpus that contain feature \(j\)
  • \(N\) - the total number of documents

We use \(\log(\frac{N}{df_j})\) rather than \(\frac{N}{df_j}\) in order to avoid placing excessively large weights on rare words.
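
As a sketch of this formula on a toy document-feature matrix (the numbers are made up; note that sklearn's TfidfVectorizer, used later, applies a smoothed variant of the idf and l2-normalizes each document vector):

Python
import numpy as np

# Toy document-feature matrix W: 4 documents (rows) x 3 features (columns)
W = np.array([
    [5, 2, 1],
    [3, 0, 1],
    [0, 4, 1],
    [0, 0, 1],
])

N = W.shape[0]              # total number of documents
df = (W > 0).sum(axis=0)    # df_j: number of documents containing feature j
idf = np.log(N / df)        # idf_j = log(N / df_j)
tf_idf = W * idf            # tf-idf_{i,j} = W_{i,j} * idf_j

print(idf)     # the third feature appears in every document, so its idf is 0
print(tf_idf)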

Tf-idf intuition

Characteristics

Tf-idf will be highest when feature \(j\) occurs many times in a small number of documents

Tf-idf will be lower when feature \(j\) occurs few times in a document, or occurs in many documents

Tf-idf will be lowest when feature \(j\) occurs in virtually all documents
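
A quick numerical illustration of these three cases, with made-up numbers for a corpus of \(N = 1000\) documents:

\[ \begin{aligned} W_{i,j} = 20,\ df_j = 10 &: \quad 20 \times \log(1000/10) \approx 92.1 \\ W_{i,j} = 2,\ df_j = 500 &: \quad 2 \times \log(1000/500) \approx 1.4 \\ W_{i,j} = 20,\ df_j = 1000 &: \quad 20 \times \log(1000/1000) = 0 \end{aligned} \]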

Tf-idf Application

What are the most common words in Boris Johnson’s speeches?

See code
Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
import nltk

# Download NLTK stopwords (if not already done)
#nltk.download('stopwords')

# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords function
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Load the dataset
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")

# Aggregate the texts by "name"
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Create a CountVectorizer for raw term frequencies
count_vectorizer = CountVectorizer()
text_dfm = count_vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Create a TfidfVectorizer for TF-IDF scores
tfidf_vectorizer = TfidfVectorizer()
tfidf_dfm = tfidf_vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Get the feature names (words) from the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Identify the index for 'Boris Johnson'
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Get the TF-IDF scores for 'Boris Johnson'
tfidf_scores = tfidf_dfm[boris_index].toarray().flatten()

# Create a DataFrame with terms and their TF-IDF scores for easy sorting
tfidf_df = pd.DataFrame({
    'term': feature_names,
    'tfidf': tfidf_scores
})

# Get the top 8 features by TF-IDF score
top_tfidf_features = tfidf_df.sort_values(by='tfidf', ascending=False).head(8)
print(top_tfidf_features)
               term     tfidf
14121        fields  0.262896
28951    referendum  0.196836
26525       playing  0.185771
19541         jcpoa  0.160847
19387        israel  0.140571
9169   criminalised  0.137572
17343    honourable  0.135140
23696          nato  0.133815

Tf-idf Application

We now look at the similarities to other politicians:

See code
Python
# Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_tfidf = cosine_similarity(tfidf_dfm[boris_index], tfidf_dfm).flatten()
# Add cosine similarity to the DataFrame
aggression_texts_aggregated['cosine_similarity_to_boris'] = cosine_sim_tfidf

# Sort the DataFrame by cosine similarity (highest similarity first)
similarity_results_tfidf = aggression_texts_aggregated.sort_values(by='cosine_similarity_to_boris', ascending=False)

# Rename the columns
similarity_results_tfidf = similarity_results_tfidf.rename(columns={
    'name': 'name_politician',
    'cosine_similarity_to_boris': 'cosine_similarity'
})

# Selecting the first 11 observations
similarity_results_tfidf2 = similarity_results_tfidf.head(11)

# Dropping Boris Johnson out
similarity_results_tfidf2 = similarity_results_tfidf2[similarity_results_tfidf2["name_politician"] != "Boris Johnson"]
See code
R
library(dplyr)
similarity_results_tfidf2 <- reticulate::py$similarity_results_tfidf2

# Create the bar chart
ggplot(similarity_results_tfidf2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Cosine Similarity Scores with Boris Johnson",
    x = "Politicians",
    y = "Cosine Similarity"
  ) +
    theme_bw()+
  ylim(0, 1)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Application

After Removing Misleading Word Counts

Here is how this compares with the original dataframe:

See code
Python
import pandas as pd
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
#aggression_texts.head(3)
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])


#Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]
# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from conserv_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Selecting the first 11 observations
similarity_df2 = similarity_df.head(11)

# Dropping Boris Johnson out
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
See code
R
library(dplyr)
similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Cosine Similarity Scores with Boris Johnson",
    x = "Politicians",
    y = "Cosine Similarity"
  ) +
    theme_bw()+
  ylim(0, 1)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

See code
Python
# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])


#Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]
# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from conserv_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Selecting the first 11 observations
similarity_df2 = similarity_df.head(11)

# Dropping Boris Johnson out
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
See code
R
library(dplyr)
similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Cosine Similarity Scores with Boris Johnson",
    x = "Politicians",
    y = "Cosine Similarity"
  ) +
    theme_bw()+
  ylim(0, 1)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Caveats Associated with Tf-idf Cosine similarity

There are, however, some caveats associated with cosine similarity.

Note

Text 1: “Artificial intelligence has revolutionized text processing”
Text 2: “Progress in computational linguistics is dramatic”

The message of these two sentences is pretty much the same.

index artificial intelligence revolutionized text processing progress in computational linguistics is dramatic
text_1 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
text_2 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0

Yet the cosine similarity is 0: the two sentences use non-overlapping sets of words.

\[\cos(\theta) = \frac{a \cdot b}{\|a\|\,\|b\|} = 0\]

Caveats Associated with Tf-idf Cosine similarity

There are, however, some caveats associated with cosine similarity.

Note

Text 1: “Artificial intelligence has revolutionized text processing”
Text 2: “Progress in computational linguistics is dramatic”

The message of these two sentences is pretty much the same.

index artificial intelligence revolutionized text processing progress in computational linguistics is dramatic
text_1 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
text_2 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0

Yet the cosine similarity is 0: the two sentences use non-overlapping sets of words.

Word-embedding approaches are the solution to this. More on this in future lectures.
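
A minimal sketch (using the same hypothetical one-hot vectors as in the table above) confirms this numerically:

Python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Count vectors over the combined vocabulary of the two sentences
text_1 = np.array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]])
text_2 = np.array([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]])

# The dot product is 0 because the sentences share no words,
# so the cosine similarity is 0 regardless of any weighting
print(cosine_similarity(text_1, text_2))  # [[0.]]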

Word Clouds

A common way to visualize differences is by using word clouds

Word Clouds are visual representations of the frequency and importance of words in a given text.

  • the size of a word indicates its frequency or importance within the text

Word Clouds: Johnson vs. Cameron

Python
import pandas as pd
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import re

# Load the dataset
file_path = "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv"
aggression_texts = pd.read_csv(file_path)

# Use .copy() so that we can safely add a cleaned_text column below
text_johnson = aggression_texts.loc[aggression_texts['name'] == 'Boris Johnson'].copy()
text_cameron = aggression_texts.loc[aggression_texts['name'] == 'David Cameron'].copy()

# Load English stopwords
stop_words = set(stopwords.words('english'))

# Function to clean text by removing stopwords
def clean_text(text):
    # Remove punctuation using regex
    text = re.sub(r'[^\w\s]', '', str(text))
    # Remove stopwords
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])

# Apply text cleaning to the datasets
text_johnson['cleaned_text'] = text_johnson['body'].apply(clean_text)
text_cameron['cleaned_text'] = text_cameron['body'].apply(clean_text)

Word Clouds: Johnson vs. Cameron

R
library(quanteda.textplots)
library(quanteda)

text_johnson2 <- reticulate::py$text_johnson
text_johnson2 <- corpus(text_johnson2, text_field = "cleaned_text")
text_johnson2 <- tokens(text_johnson2)
text_johnson2 <- dfm(text_johnson2)

text_cameron2 <- reticulate::py$text_cameron
text_cameron2 <- corpus(text_cameron2, text_field = "cleaned_text")
text_cameron2 <- tokens(text_cameron2)
text_cameron2 <- dfm(text_cameron2)

Word Clouds: Johnson vs. Cameron

We see that there are still many common words present in the documents.

That is why it could be helpful to use tf-idf weighting.

Word Clouds: Using tf-idf weighting

Here is how we can use the tf-idf weighting

R
library(quanteda.textplots)
library(quanteda)

text_johnson2 <- reticulate::py$text_johnson
text_johnson2 <- corpus(text_johnson2, text_field = "cleaned_text")
text_johnson2 <- tokens(text_johnson2)
text_johnson2 <- dfm(text_johnson2)
text_johnson2 <- dfm_tfidf(text_johnson2)

text_cameron2 <- reticulate::py$text_cameron
text_cameron2 <- corpus(text_cameron2, text_field = "cleaned_text")
text_cameron2 <- tokens(text_cameron2)
text_cameron2 <- dfm(text_cameron2)
text_cameron2 <- dfm_tfidf(text_cameron2)

Word Clouds: Using tf-idf weighting

Here is how we can use the tf-idf weighting

See code
R
library(quanteda)
library(gridExtra)
library(png)
library(grid)

# Save the word clouds to temporary image files
png("johnson_wordcloud.png", width = 800, height = 800)
textplot_wordcloud(text_johnson2, max_words = 300)
garbage <- dev.off()

png("cameron_wordcloud.png", width = 800, height = 800)
textplot_wordcloud(text_cameron2, max_words = 300)
garbage <- dev.off()

# Read the images back as grobs
johnson_grob <- rasterGrob(readPNG("johnson_wordcloud.png"))
cameron_grob <- rasterGrob(readPNG("cameron_wordcloud.png"))

# Arrange the grobs side by side
grid.arrange(johnson_grob, cameron_grob, ncol = 2)

Even here, it is difficult to identify distinguishing words.

The primary difficulty is that the positions of the words (the X and Y axes) carry no meaning.

Fightin’ Words

One approach is to visualize the difference in word use across groups by using the Fightin’ Words method (Monroe et al. 2008)

This starts with calculating the probability of observing a given word for a given category of documents:

\[ \hat{\mu_{j,k}} = \frac{W^{*}_{j,k} + a_j}{n_k +\sum_{j=1}^{J} a_{j}} \]

where:

  • \(W^{*}_{j,k}\) - the number of times feature \(j\) appears in documents in category \(k\)
  • \(n_k\) - the total number of tokens in documents in category \(k\)
  • \(a_j\) - a “regularization” parameter which shrinks differences in very common words towards 0

Fightin’ Words

We then take the log-odds ratio between categories \(k\) and \(k'\):

\[ \text{log-odds-ratio}_{j,k} = \log\left(\frac{\hat{\mu_{j,k}}}{1-\hat{\mu_{j,k}}}\right) - \log\left(\frac{\hat{\mu_{j,k'}}}{1-\hat{\mu_{j,k'}}}\right) \]

This ratio estimates the relative probability of the use of word \(j\) between the two groups.

When the ratio is positive, group \(k\) uses the word more often. When it is negative, group \(k'\) uses it more often.

Fightin’ Words

The final step is to standardize the ratio by its variance.

\[ \textrm{Fightin' Words Score}_j = \frac{\text{log-odds-ratio}_{j,k}}{\sqrt{\mathrm{Var}\left(\text{log-odds-ratio}_{j,k}\right)}} \]
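
Before walking through the full script on the next slides, here is a compact (hypothetical) sketch of the same computation wrapped in a function, taking two vectors of word counts aligned to a shared vocabulary:

Python
import numpy as np

def fightin_words_scores(counts_k, counts_k2, alpha_0=0.1):
    """Standardized log-odds-ratio scores for two groups of word counts."""
    total = counts_k + counts_k2
    # Dirichlet prior a_j, scaled to each word's overall frequency
    a = total * (alpha_0 / total.sum())
    ck, ck2 = counts_k + a, counts_k2 + a
    # Estimated word probabilities within each group
    mu_k = ck / ck.sum()
    mu_k2 = ck2 / ck2.sum()
    # Log-odds ratio between the two groups
    lor = np.log(mu_k / (1 - mu_k)) - np.log(mu_k2 / (1 - mu_k2))
    # Approximate variance and standardized score
    var = 1 / ck + 1 / ck2
    return lor / np.sqrt(var)

# Usage (with count vectors like those built in the script below):
# scores = fightin_words_scores(counts_group1, counts_group2)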

Fightin’ Words

Here is how we implement this technique step by step. First, we preprocess the texts and count words for each group:

Python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import re
import nltk

# Download NLTK stopwords if not already available
nltk.download('stopwords')

# Text preprocessing
def preprocess_text(text):
    # Remove punctuation, symbols, and numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation and symbols
    text = re.sub(r'\d+', '', text)      # Remove numbers
    # Convert to lowercase
    text = text.lower()
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    text = " ".join(word for word in text.split() if word not in stop_words)
    return text

# Apply preprocessing to the 'body' column
aggression_texts['body'] = aggression_texts['body'].apply(preprocess_text)

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Subset text data for Johnson and Cameron
text_group1 = aggression_texts.loc[aggression_texts['name'] == 'Boris Johnson']
text_group2 = aggression_texts.loc[aggression_texts['name'] == 'David Cameron']

# Create document-feature matrices (DFMs) for each group
# Note: the vocabulary is fit on group 1, so words appearing only in group 2 are dropped
dfm_input1 = vectorizer.fit_transform(text_group1['body'])
dfm_input2 = vectorizer.transform(text_group2['body'])

# Get feature names
features = vectorizer.get_feature_names_out()

# Sum term frequencies for each group
counts_group1 = np.array(dfm_input1.sum(axis=0)).flatten()
counts_group2 = np.array(dfm_input2.sum(axis=0)).flatten()

Fightin’ Words

We then add the Dirichlet prior and compute the standardized log-odds ratios:

Python
# Combine raw counts into a single DataFrame
dfm_counts = pd.DataFrame({
    "Johnson": counts_group1,     # Token counts from group 1 (e.g. Johnson speeches)
    "Cameron": counts_group2,     # Token counts from group 2 (e.g. Cameron speeches)
    "feature": features           # Corresponding feature (e.g. word/token)
})

# Add Dirichlet priors to avoid zero probabilities and stabilize estimates
alpha_0 = 0.1  # Overall strength of the prior (smaller = weaker prior)
total_counts = counts_group1 + counts_group2
alpha_w = total_counts * (alpha_0 / total_counts.sum())  # Prior scaled to overall frequency
dfm_counts["Johnson"] += alpha_w  # Add prior to Johnson counts
dfm_counts["Cameron"] += alpha_w  # Add prior to Cameron counts

# Convert counts to probabilities (smoothed word distributions)
dfm_counts["mu_johnson"] = dfm_counts["Johnson"] / dfm_counts["Johnson"].sum()
dfm_counts["mu_cameron"] = dfm_counts["Cameron"] / dfm_counts["Cameron"].sum()

# Calculate log-odds for each word in each group
dfm_counts["log_odds_johnson"] = np.log(dfm_counts["mu_johnson"] / (1 - dfm_counts["mu_johnson"]))
dfm_counts["log_odds_cameron"] = np.log(dfm_counts["mu_cameron"] / (1 - dfm_counts["mu_cameron"]))
# Compute log-odds ratio between groups
dfm_counts["log_odds_ratio"] = dfm_counts["log_odds_johnson"] - dfm_counts["log_odds_cameron"]

# Estimate variance of the log-odds ratio (approximate)
dfm_counts["variance"] = 1 / dfm_counts["Johnson"] + 1 / dfm_counts["Cameron"]

# Calculate standardized score (z-score) for significance ranking
dfm_counts["score"] = dfm_counts["log_odds_ratio"] / np.sqrt(dfm_counts["variance"])

# Return results sorted by score, highest absolute difference at the top
result = dfm_counts[["feature", "score", "Johnson", "Cameron"]].sort_values(by="score", ascending=False)

Fightin’ Words

These are the top-scoring features:

feature score Johnson Cameron
referendum 3.544026 6.000279 10.000279
playing 3.420876 5.000209 7.000209
free 3.139933 5.000244 9.000244
global 2.499159 3.000139 5.000139
african 2.472248 2.000052 1.000052
advance 2.472248 2.000052 1.000052
applications 2.472248 2.000052 1.000052
concrete 2.472248 2.000052 1.000052
sport 2.472248 2.000052 1.000052
provide 2.334536 2.00007 2.00007
proposal 2.112993 2.000087 3.000087
region 2.112993 2.000087 3.000087
partners 2.112993 2.000087 3.000087
iran 2.112993 2.000087 3.000087
arrest 2.112993 2.000087 3.000087
stability 2.112993 2.000087 3.000087
campaign 1.894895 2.000105 4.000105
learned 1.718199 3.000227 10.000227
decided 1.649506 1.000035 1.000035
previously 1.649506 1.000035 1.000035

Fightin’ Words

We then prepare the results for plotting:

Python
# Prepare `result` for ggplot
result['n'] = result['Johnson'] + result['Cameron']  # Total count for the word
result['log_n'] = np.log(result['n'])  # Log-transformed total count
result['cex'] = np.abs(result['score'])  # Text size
result['alpha'] = np.abs(result['score'])  # Opacity
result_df = result[['log_n', 'score', 'feature', 'cex', 'alpha', "Johnson", "Cameron"]]

log_n score feature cex alpha Johnson Cameron
2.772624 3.544026 referendum 3.544026 3.544026 6.000279 10.000279
2.484942 3.420876 playing 3.420876 3.420876 5.000209 7.000209
2.639092 3.139933 free 3.139933 3.139933 5.000244 9.000244
2.079476 2.499159 global 2.499159 2.499159 3.000139 5.000139
1.098647 2.472248 african 2.472248 2.472248 2.000052 1.000052
1.098647 2.472248 advance 2.472248 2.472248 2.000052 1.000052
1.098647 2.472248 applications 2.472248 2.472248 2.000052 1.000052
1.098647 2.472248 concrete 2.472248 2.472248 2.000052 1.000052
1.098647 2.472248 sport 2.472248 2.472248 2.000052 1.000052
1.386329 2.334536 provide 2.334536 2.334536 2.00007 2.00007

Fightin’ Words

Here is how we plot the results:

See code
R
library(dplyr)
result_df2 <- reticulate::py$result_df

ggplot(data=result_df2, aes(x = log_n, # x-axis
                         y = score, # y-axis
                         label = feature, # text labels
                         cex = abs(score), # text size
                         alpha = abs(score))) + # opacity
                geom_text() + # plot text
                xlab("log(n)") + # x-axis label 
                ylab("Fightin' words score") + # y-axis label
                theme_bw() + # nice black and white theme
                #theme(panel.grid = element_blank()) + # remove grid lines
                scale_size_continuous(guide = "none") + # remove size legend
                scale_alpha_continuous(guide = "none")

johnson cameron
honourable cherie
would burdensome
referendum peep
minister accreted
free acquire
playing advancement
fields afternoons
right xtaggart
gentleman guarantor
friend alive
house prosperous
know balkans
last priorities
believe rid
uk barricades
people battalion
said presented
government annual
one presence
us owing

Fightin’ Words

Here is how we plot the results:

See code
R
library(dplyr)
result_df2 <- reticulate::py$result_df

ggplot(data=result_df2, aes(x = log_n, # x-axis
                         y = score, # y-axis
                         label = feature, # text labels
                         cex = abs(score), # text size
                         alpha = abs(score))) + # opacity
                geom_text() + # plot text
                xlab("log(n)") + # x-axis label 
                ylab("Fightin' words score") + # y-axis label
                theme_bw() + # nice black and white theme
                #theme(panel.grid = element_blank()) + # remove grid lines
                scale_size_continuous(guide = "none") + # remove size legend
                scale_alpha_continuous(guide = "none")

johnson cameron
cherie honourable
rid right
barricades people
battalion minister
presented prime
presence said
blaby would
boycott gentleman
priorities friend
boycotts think
vindicated one
pence government
consular work
mornings us
continent get
inspiration need
tillerson point
washington take
kinds say
balkans made

Conclusion

Text Representation:

  • The Vector Space Model transforms text into numerical vectors, enabling computation of similarity and other metrics.
  • TF-IDF weighting refines word importance by downweighting terms that appear in many documents.

Similarity Metrics:

  • Measures like cosine similarity quantify relationships between documents or vectors.

Visualization Techniques:

  • Word clouds and “Fightin’ Words” highlight contrasts between corpora, leveraging both frequency and weighting.