L18: Other Ways of Measuring Similarity

Bogdan G. Popescu

John Cabot University

Similarity

Previously, we discussed the notion of similarity among documents.

Similarity relied on the assumption that each document can be represented by a vector of (weighted) feature counts

Some of the ways to measure document similarity included (see the sketch after this list):

  1. Edit distances
  2. Inner product
  3. Euclidean distance
  4. Cosine similarity
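
As a quick refresher, here is a minimal sketch (using NumPy and a hand-rolled edit-distance function; the vectors and strings are made up for illustration) of how each of these measures can be computed:

Python
import numpy as np

# Two toy documents as raw term-count vectors over a shared vocabulary
# vocabulary: ["economy", "tax", "health", "war"]
a = np.array([3, 1, 0, 2])
b = np.array([2, 0, 1, 2])

# 1. Edit (Levenshtein) distance works on the raw strings, not on the vectors
def levenshtein(s, t):
    # Classic dynamic-programming edit distance
    d = np.zeros((len(s) + 1, len(t) + 1), dtype=int)
    d[:, 0] = np.arange(len(s) + 1)
    d[0, :] = np.arange(len(t) + 1)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # substitution
    return d[len(s), len(t)]

print(levenshtein("economy tax", "economy war"))                # 1. edit distance
print(np.dot(a, b))                                             # 2. inner product
print(np.linalg.norm(a - b))                                    # 3. Euclidean distance
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))   # 4. cosine similarity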

Application

Using the same dataset as before, we want to see how similar politicians are to one another.

Python
import pandas as pd
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
# Group the DataFrame by the 'name' column.
# For each group, take the 'body' column and concatenate all text entries into a single string, separated by spaces.
# Reset the index of the resulting DataFrame so that 'name' becomes a standard column instead of an index.
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()
aggression_texts_aggregated.head(8)
             name                                               body
0    Adam Afriyie  I welcomed much of what was said by  the honou...
1   Adam Holloway  What recent discussions he has had on the futu...
2     Adam Ingram  I think I said that it was an additional power...
3      Adam Price  There has been much talk in the Chamber this a...
4   Adrian Bailey  Given the failure of successive well-intention...
5  Adrian Sanders  I do not know whether the honourable Gentleman...
6      Afzal Khan  What recent progress the Government have made ...
7    Aidan Burley  I should start with a declaration of interest ...

Application

Creating a Count-Vectorized Representation of Speeches

We first create the count-vectorized representation of the speeches:

Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

We then select the politician of interest

Python
#Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]
# Calculate the cosine similarity between Boris Johnson and all other politicians
# flatten() converts a 2D array into a 1D array.
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

Application

Calculating the Cosine Similarity

We then create a more friendly dataframe that we can visualize:

Python
# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from aggression_texts_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Show the top similarities
print(similarity_df.head())
      name_politician  cosine_similarity
145     Boris Johnson           1.000000
1590    William Hague           0.964526
283    David Miliband           0.964144
278   David Lidington           0.963183
1356   Philip Hammond           0.963094

Application

Comparing to other Speeches

Python
# Selecting the top 11
similarity_df2 = similarity_df.head(11)
# Selecting everything but Boris Johnson
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
See code
R
library(dplyr)
library(ggplot2)
similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Cosine Similarity Scores with Boris Johnson",
    x = "Politicians",
    y = "Cosine Similarity"
  ) +
    theme_bw()+
  ylim(0, 1)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Application

Comparing it with Other Speeches

As you can see below, most speeches are similar to one another.

See code
R
library(dplyr)
similarity_df2 <- reticulate::py$similarity_df

# Create the bar chart
ggplot(similarity_df2, aes(x = cosine_similarity)) +
  geom_histogram(binwidth = 0.05, color = "white", alpha = 0.7) +
  labs(title = "Histogram of Cosine Similarities",
       x = "Cosine Similarity",
       y = "Frequency")+
  theme_bw()

Application

Misleading Word Counts

Let us compare the most common features:

Python
# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()
# Get the raw term frequencies for 'Boris Johnson'
text_term_frequencies = text_dfm[boris_index].toarray().flatten()
# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by raw frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
print(top_features)
       term  freq
35017   the    85
35375    to    43
24497    of    34
35010  that    33
2684    and    28
19341    is    25
18172    in    22
38445  will    13

Application

Misleading Word Counts

Feature selection matters! Similarities here are being driven by substantively unimportant words.

One solution would be to remove stopwords and try again.

Application

Removing Stopwords Before Re-Vectorizing the Speeches

We can remove stopwords and recompute the similarities:

Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
import nltk
# Download the NLTK stopword list if not already done
nltk.download('stopwords')

# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Vectorize the cleaned text
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Find the index of the selected politician (Boris Johnson)
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from aggression_texts_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_new_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Show the top similarities
print(similarity_new_df.head())
      name_politician  cosine_similarity
145     Boris Johnson           1.000000
256     David Cameron           0.569765
1018    Mr John Major           0.569098
278   David Lidington           0.566099
1564       Tony Blair           0.562866

Application

Selecting the Most Similar Politicians

We then select the politicians most similar to Boris Johnson:

Python
# Sort by cosine similarity in descending order
similarity_new_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Show the top similarities
print(similarity_new_df.head())

# Selecting the first 11
similarity_df2 = similarity_new_df.head(11)

# Excluding Boris Johnson
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]

Application

After Removing Misleading Word Counts

This is what the output looks like if we remove the stopwords.

See code
R
library(dplyr)
similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Cosine Similarity Scores with Boris Johnson",
    x = "Politicians",
    y = "Cosine Similarity"
  ) +
    theme_bw()+
  ylim(0, 0.7)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Application

After Removing Misleading Word Counts

Here is how the new results compare with the original dataframe:

See code
Python
import pandas as pd
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
#aggression_texts.head(3)
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

#Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]
# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from aggression_texts_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

#Select the first 11
similarity_df2 = similarity_df.head(11)

#Select everything but Boris Johnson
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
See code
R
library(dplyr)
library(ggplot2)
similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Cosine Similarity Scores with Boris Johnson",
    x = "Politicians",
    y = "Cosine Similarity"
  ) +
    theme_bw()+
  ylim(0, 1)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

See code
Python
# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])


#Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]
# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from aggression_texts_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

#Select the first 11
similarity_df2 = similarity_df.head(11)

#Select everything but Boris Johnson
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
See code
R
library(dplyr)
library(ggplot2)
similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Cosine Similarity Scores with Boris Johnson",
    x = "Politicians",
    y = "Cosine Similarity"
  ) +
    theme_bw()+
  ylim(0, 1)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Application

After Removing Misleading Word Counts

And here are the leading words.

Python
# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()
# Get the raw term frequencies for Boris Johnson
text_term_frequencies = text_dfm[boris_index].toarray().flatten()
# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by raw frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
top_features
             term  freq
17343  honourable    11
38748       would     7
28951  referendum     6
22764    minister     6
14121      fields     5
14904        free     5
26525     playing     5
17490       house     4

Application

After Removing Misleading Word Counts

And this is how the top words compare:

See code
Python
import pandas as pd
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
#aggression_texts.head(3)
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()


# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()
# Get the raw term frequencies for 'Boris Johnson'
text_term_frequencies = text_dfm[boris_index].toarray().flatten()
# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by raw frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
print(top_features)
       term  freq
35017   the    85
35375    to    43
24497    of    34
35010  that    33
2684    and    28
19341    is    25
18172    in    22
38445  will    13
See code
Python
# Download the NLTK stopword list if not already done
#nltk.download('stopwords')
# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Vectorize the cleaned text
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Get the feature names (words) from the vectorizer
feature_names = vectorizer.get_feature_names_out()
# Get the raw term frequencies for 'Boris Johnson'
text_term_frequencies = text_dfm[boris_index].toarray().flatten()
# Create a DataFrame with terms and their frequencies for easy sorting
term_frequency_df = pd.DataFrame({
    'term': feature_names,
    'freq': text_term_frequencies
})

# Get the top 8 features by raw frequency
top_features = term_frequency_df.sort_values(by='freq', ascending=False).head(8)
print(top_features)
             term  freq
17343  honourable    11
38748       would     7
28951  referendum     6
22764    minister     6
14121      fields     5
14904        free     5
26525     playing     5
17490       house     4

Caveats

When comparing the speeches of Boris Johnson to those of other politicians, we characterized each speech by the raw counts of each word.

We used the raw term frequencies to characterize similarity:

  • each word is considered equally important.

This way of counting words is called the bag-of-words model: a representation of text as an unordered collection of word counts.

Caveats: bag-of-words

The following example models text documents using bag-of-words. Here are two simple text documents:

(1) John likes to watch movies. Mary likes movies too.
(2) Mary also likes to watch football games.

Based on these two text documents, a list of tokens is constructed for each document:

"John","likes","to","watch","movies","Mary","likes","movies","too"

"Mary","also","likes","to","watch","football","games"

Representing each bag-of-words as a JSON object (dictionary), we have the following:

BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
BoW2 = {"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};

Caveats: bag-of-words

BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
BoW2 = {"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};

Each key is the word, and each value is the number of occurrences of that word in the given text document.

The order of elements is free, so, for example:

{"too":1,"Mary":1,"movies":2,"John":1,"watch":1,"likes":2,"to":1}

is equivalent to BoW1. This is also what we expect from a strict JSON object representation, where key order carries no meaning.

Caveats: bag-of-words

So, the bag-of-words representation characterizes documents according to the raw counts of each word.

The critical problem with using raw term frequency is that all terms are considered equally important when it comes to assessing similarity.

One way of avoiding this problem is to weigh the vectors of word counts in ways that make our text representations more informative.

The most common strategy is tf-idf weighting

Tf-idf intuition

Tf-idf stands for “term-frequency-inverse-document-frequency”

Tf-idf assigns higher weights to:

  • words that are common in a given document (“term-frequency”)
  • words that are rare in the corpus (i.e., across the full set of documents) (“inverse-document-frequency”)

Tf-idf intuition

TF (Term Frequency): rewards terms that occur often in the current document, since such terms are likely to be important in that context.

IDF (Inverse Document Frequency): penalizes terms that are common across many documents (like “the” or “and”) and boosts those that appear in fewer documents, because they might be more discriminative or unique.

Thus, words that are downweighted include:

  • stopwords
  • domain-specific terms that are used frequently across documents

Words that are upweighted are the more distinctive ones: those that are more helpful for characterizing a given text.

Tf-idf intuition

The Term-frequency-inverse-document-frequency (tf-idf) weighting scheme assigns to feature j a weight in document i:

\[ \begin{aligned} \text{tf-idf}_{i,j} &= W_{i,j} \times \text{idf}_j \\ &= W_{i,j} \times \log\left(\frac{N}{df_j}\right) \end{aligned} \]

where:

  • \(W_{i,j}\) - the number of times feature \(j\) appears in document \(i\)
  • \(df_{j}\) - the number of documents in the corpus that contain feature \(j\)
  • \(N\) - the total number of documents

We use \(\log(\frac{N}{df_j})\) rather than \(\frac{N}{df_j}\) in order to avoid placing excessively large weights on rare words.
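
As a sketch of this formula on a toy document-feature matrix (the numbers are made up; note that sklearn's TfidfVectorizer, used later, applies a smoothed variant of the idf and l2-normalizes each document vector):

Python
import numpy as np

# Toy document-feature matrix W: 4 documents (rows) x 3 features (columns)
W = np.array([
    [5, 2, 1],
    [3, 0, 1],
    [0, 4, 1],
    [0, 0, 1],
])

N = W.shape[0]              # total number of documents
df = (W > 0).sum(axis=0)    # df_j: number of documents containing feature j
idf = np.log(N / df)        # idf_j = log(N / df_j)
tf_idf = W * idf            # tf-idf_{i,j} = W_{i,j} * idf_j

print(idf)     # the third feature appears in every document, so its idf is 0
print(tf_idf)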

Tf-idf intuition

Characteristics

Tf-idf will be highest when feature \(j\) occurs many times in a small number of documents

Tf-idf will be lower when feature \(j\) occurs few times in a document, or occurs in many documents

Tf-idf will be lowest when feature \(j\) occurs in virtually all documents
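
A quick numerical illustration of these three cases, with made-up numbers for a corpus of \(N = 1000\) documents:

\[ \begin{aligned} W_{i,j} = 20,\ df_j = 10 &: \quad 20 \times \log(1000/10) \approx 92.1 \\ W_{i,j} = 2,\ df_j = 500 &: \quad 2 \times \log(1000/500) \approx 1.4 \\ W_{i,j} = 20,\ df_j = 1000 &: \quad 20 \times \log(1000/1000) = 0 \end{aligned} \]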

Tf-idf Application

What are the most common words in Boris Johnson’s speeches?

See code
Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
import nltk

# Download NLTK stopwords (if not already done)
#nltk.download('stopwords')

# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords function
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Load the dataset
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")

# Aggregate the texts by "name"
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Create a CountVectorizer for raw term frequencies
count_vectorizer = CountVectorizer()
text_dfm = count_vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Create a TfidfVectorizer for TF-IDF scores
tfidf_vectorizer = TfidfVectorizer()
tfidf_dfm = tfidf_vectorizer.fit_transform(aggression_texts_aggregated['body'])

# Get the feature names (words) from the vectorizer
feature_names = tfidf_vectorizer.get_feature_names_out()

# Identify the index for 'Boris Johnson'
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Get the TF-IDF scores for 'Boris Johnson'
tfidf_scores = tfidf_dfm[boris_index].toarray().flatten()

# Create a DataFrame with terms and their TF-IDF scores for easy sorting
tfidf_df = pd.DataFrame({
    'term': feature_names,
    'tfidf': tfidf_scores
})

# Get the top 8 features by TF-IDF score
top_tfidf_features = tfidf_df.sort_values(by='tfidf', ascending=False).head(8)
print(top_tfidf_features)
               term     tfidf
14121        fields  0.262896
28951    referendum  0.196836
26525       playing  0.185771
19541         jcpoa  0.160847
19387        israel  0.140571
9169   criminalised  0.137572
17343    honourable  0.135140
23696          nato  0.133815

Tf-idf Application

We now look at the similarities to other politicians:

See code
Python
# Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]

# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_tfidf = cosine_similarity(tfidf_dfm[boris_index], tfidf_dfm).flatten()
# Add cosine similarity to the DataFrame
aggression_texts_aggregated['cosine_similarity_to_boris'] = cosine_sim_tfidf

# Sort the DataFrame by cosine similarity (highest similarity first)
similarity_results_tfidf = aggression_texts_aggregated.sort_values(by='cosine_similarity_to_boris', ascending=False)

# Rename the columns
similarity_results_tfidf = similarity_results_tfidf.rename(columns={
    'name': 'name_politician',
    'cosine_similarity_to_boris': 'cosine_similarity'
})

# Selecting the first 11 observations
similarity_results_tfidf2 = similarity_results_tfidf.head(11)

# Dropping Boris Johnson out
similarity_results_tfidf2 = similarity_results_tfidf2[similarity_results_tfidf2["name_politician"] != "Boris Johnson"]
See code
R
library(dplyr)
similarity_results_tfidf2 <- reticulate::py$similarity_results_tfidf2

# Create the bar chart
ggplot(similarity_results_tfidf2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Cosine Similarity Scores with Boris Johnson",
    x = "Politicians",
    y = "Cosine Similarity"
  ) +
    theme_bw()+
  ylim(0, 1)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Application

After Removing Misleading Word Counts

Here is how this compares with the original dataframe:

See code
Python
import pandas as pd
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
#aggression_texts.head(3)
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])


#Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]
# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from conserv_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Selecting the first 11 observations
similarity_df2 = similarity_df.head(11)

# Dropping Boris Johnson out
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
See code
R
library(dplyr)
similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Cosine Similarity Scores with Boris Johnson",
    x = "Politicians",
    y = "Cosine Similarity"
  ) +
    theme_bw()+
  ylim(0, 1)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

See code
Python
# Define the stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
aggression_texts_aggregated['body'] = aggression_texts_aggregated['body'].apply(remove_stopwords)

# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])


#Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]
# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from conserv_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Selecting the first 11 observations
similarity_df2 = similarity_df.head(11)

# Dropping Boris Johnson out
similarity_df2 = similarity_df2[similarity_df2["name_politician"] != "Boris Johnson"]
See code
R
library(dplyr)
similarity_df2 <- reticulate::py$similarity_df2

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Cosine Similarity Scores with Boris Johnson",
    x = "Politicians",
    y = "Cosine Similarity"
  ) +
    theme_bw()+
  ylim(0, 1)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Caveats Associated with Tf-idf Cosine similarity

There are, however, some caveats associated with cosine similarity.

Note

Text 1: “Artificial intelligence has revolutionized text processing”
Text 2: “Progress in computational linguistics is dramatic”

The message of these two sentences is pretty much the same.

index artificial intelligence revolutionized text processing progress in computational linguistics is dramatic
text_1 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
text_2 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0

Yet the cosine similarity is 0: the two sentences use non-overlapping sets of words.

\[\cos(\theta) = \frac{a \cdot b}{\|a\|\,\|b\|} = 0\]

Caveats Associated with Tf-idf Cosine similarity

There are, however, some caveats associated with cosine similarity.

Note

Text 1: “Artificial intelligence has revolutionized text processing”
Text 2: “Progress in computational linguistics is dramatic”

The message of these two sentences is pretty much the same.

index artificial intelligence revolutionized text processing progress in computational linguistics is dramatic
text_1 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
text_2 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0

Yet the cosine similarity is 0: the two sentences use non-overlapping sets of words.

Word-embedding approaches are the solution to this. More on this in future lectures.
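
A minimal sketch (using the same hypothetical one-hot vectors as in the table above) confirms this numerically:

Python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Count vectors over the combined vocabulary of the two sentences
text_1 = np.array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]])
text_2 = np.array([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]])

# The dot product is 0 because the sentences share no words,
# so the cosine similarity is 0 regardless of any weighting
print(cosine_similarity(text_1, text_2))  # [[0.]]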

Word Clouds

A common way to visualize differences is by using word clouds

Word Clouds are visual representations of the frequency and importance of words in a given text.

  • the size of a word indicates its frequency or importance within the text

Word Clouds: Johnson vs. Cameron

Python
import pandas as pd
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import re

# Load the dataset
file_path = "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv"
aggression_texts = pd.read_csv(file_path)

# Use .copy() so that we can safely add a cleaned_text column below
text_johnson = aggression_texts.loc[aggression_texts['name'] == 'Boris Johnson'].copy()
text_cameron = aggression_texts.loc[aggression_texts['name'] == 'David Cameron'].copy()

# Load English stopwords
stop_words = set(stopwords.words('english'))

# Function to clean text by removing stopwords
def clean_text(text):
    # Remove punctuation using regex
    text = re.sub(r'[^\w\s]', '', str(text))
    # Remove stopwords
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])

# Apply text cleaning to the datasets
text_johnson['cleaned_text'] = text_johnson['body'].apply(clean_text)
text_cameron['cleaned_text'] = text_cameron['body'].apply(clean_text)

Word Clouds: Johnson vs. Cameron

R
library(quanteda.textplots)
library(quanteda)

text_johnson2 <- reticulate::py$text_johnson
text_johnson2 <- corpus(text_johnson2, text_field = "cleaned_text")
text_johnson2 <- tokens(text_johnson2)
text_johnson2 <- dfm(text_johnson2)

text_cameron2 <- reticulate::py$text_cameron
text_cameron2 <- corpus(text_cameron2, text_field = "cleaned_text")
text_cameron2 <- tokens(text_cameron2)
text_cameron2 <- dfm(text_cameron2)

Word Clouds: Johnson vs. Cameron

We see that there are still many common words present in the documents.

That is why it could be helpful to use tf-idf weighting.

Word Clouds: Using tf-idf weighting

Here is how we can use the tf-idf weighting

R
library(quanteda.textplots)
library(quanteda)

text_johnson2 <- reticulate::py$text_johnson
text_johnson2 <- corpus(text_johnson2, text_field = "cleaned_text")
text_johnson2 <- tokens(text_johnson2)
text_johnson2 <- dfm(text_johnson2)
text_johnson2 <- dfm_tfidf(text_johnson2)

text_cameron2 <- reticulate::py$text_cameron
text_cameron2 <- corpus(text_cameron2, text_field = "cleaned_text")
text_cameron2 <- tokens(text_cameron2)
text_cameron2 <- dfm(text_cameron2)
text_cameron2 <- dfm_tfidf(text_cameron2)

Word Clouds: Using tf-idf weighting

Here is how we can use the tf-idf weighting

See code
R
library(quanteda)
library(gridExtra)
library(png)
library(grid)

# Save the word clouds to temporary image files
png("johnson_wordcloud.png", width = 800, height = 800)
textplot_wordcloud(text_johnson2, max_words = 300)
garbage <- dev.off()

png("cameron_wordcloud.png", width = 800, height = 800)
textplot_wordcloud(text_cameron2, max_words = 300)
garbage <- dev.off()

# Read the images back as grobs
johnson_grob <- rasterGrob(readPNG("johnson_wordcloud.png"))
cameron_grob <- rasterGrob(readPNG("cameron_wordcloud.png"))

# Arrange the grobs side by side
grid.arrange(johnson_grob, cameron_grob, ncol = 2)

Even here, it is difficult to identify distinguishing words.

The primary difficulty is that the positions of the words (the X and Y axes) carry no meaning.

Fightin’ Words

One approach is to visualize the difference in word use across groups by using the Fightin’ Words method (Monroe et al. 2008)

This starts with calculating the probability of observing a given word for a given category of documents:

\[ \hat{\mu_{j,k}} = \frac{W^{*}_{j,k} + a_j}{n_k +\sum_{j=1}^{J} a_{j}} \]

where:

  • \(W^{*}_{j,k}\) - the number of times feature \(j\) appears in documents in category \(k\)
  • \(n_k\) - the total number of tokens in documents in category \(k\)
  • \(a_j\) - a “regularization” parameter which shrinks differences in very common words towards 0

Fightin’ Words

We then take the log-odds ratio between categories \(k\) and \(k'\):

\[ \text{log-odds-ratio}_{j,k} = \log\left(\frac{\hat{\mu_{j,k}}}{1-\hat{\mu_{j,k}}}\right) - \log\left(\frac{\hat{\mu_{j,k'}}}{1-\hat{\mu_{j,k'}}}\right) \]

This ratio estimates the relative probability of the use of word \(j\) between the two groups.

When the ratio is positive, group \(k\) uses the word more often. When it is negative, group \(k'\) uses it more often.

Fightin’ Words

The final step is to standardize the ratio by its variance.

\[ \textrm{Fightin' Words Score}_j = \frac{\text{log-odds-ratio}_{j,k}}{\sqrt{\mathrm{Var}\left(\text{log-odds-ratio}_{j,k}\right)}} \]
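
Before walking through the full script on the next slides, here is a compact (hypothetical) sketch of the same computation wrapped in a function, taking two vectors of word counts aligned to a shared vocabulary:

Python
import numpy as np

def fightin_words_scores(counts_k, counts_k2, alpha_0=0.1):
    """Standardized log-odds-ratio scores for two groups of word counts."""
    total = counts_k + counts_k2
    # Dirichlet prior a_j, scaled to each word's overall frequency
    a = total * (alpha_0 / total.sum())
    ck, ck2 = counts_k + a, counts_k2 + a
    # Estimated word probabilities within each group
    mu_k = ck / ck.sum()
    mu_k2 = ck2 / ck2.sum()
    # Log-odds ratio between the two groups
    lor = np.log(mu_k / (1 - mu_k)) - np.log(mu_k2 / (1 - mu_k2))
    # Approximate variance and standardized score
    var = 1 / ck + 1 / ck2
    return lor / np.sqrt(var)

# Usage (with count vectors like those built in the script below):
# scores = fightin_words_scores(counts_group1, counts_group2)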

Fightin’ Words

Here is how we implement this technique step by step. First, we preprocess the texts and count words for each group:

Python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import re
import nltk

# Download NLTK stopwords if not already available
nltk.download('stopwords')

# Text preprocessing
def preprocess_text(text):
    # Remove punctuation, symbols, and numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation and symbols
    text = re.sub(r'\d+', '', text)      # Remove numbers
    # Convert to lowercase
    text = text.lower()
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    text = " ".join(word for word in text.split() if word not in stop_words)
    return text

# Apply preprocessing to the 'body' column
aggression_texts['body'] = aggression_texts['body'].apply(preprocess_text)

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Subset text data for Johnson and Cameron
text_group1 = aggression_texts.loc[aggression_texts['name'] == 'Boris Johnson']
text_group2 = aggression_texts.loc[aggression_texts['name'] == 'David Cameron']

# Create document-feature matrices (DFMs) for each group
# Note: the vocabulary is fit on group 1, so words appearing only in group 2 are dropped
dfm_input1 = vectorizer.fit_transform(text_group1['body'])
dfm_input2 = vectorizer.transform(text_group2['body'])

# Get feature names
features = vectorizer.get_feature_names_out()

# Sum term frequencies for each group
counts_group1 = np.array(dfm_input1.sum(axis=0)).flatten()
counts_group2 = np.array(dfm_input2.sum(axis=0)).flatten()

Fightin’ Words

We then add the Dirichlet prior and compute the standardized log-odds ratios:

Python
# Combine raw counts into a single DataFrame
dfm_counts = pd.DataFrame({
    "Johnson": counts_group1,     # Token counts from group 1 (e.g. Johnson speeches)
    "Cameron": counts_group2,     # Token counts from group 2 (e.g. Cameron speeches)
    "feature": features           # Corresponding feature (e.g. word/token)
})

# Add Dirichlet priors to avoid zero probabilities and stabilize estimates
alpha_0 = 0.1  # Overall strength of the prior (smaller = weaker prior)
total_counts = counts_group1 + counts_group2
alpha_w = total_counts * (alpha_0 / total_counts.sum())  # Prior scaled to overall frequency
dfm_counts["Johnson"] += alpha_w  # Add prior to Johnson counts
dfm_counts["Cameron"] += alpha_w  # Add prior to Cameron counts

# Convert counts to probabilities (smoothed word distributions)
dfm_counts["mu_johnson"] = dfm_counts["Johnson"] / dfm_counts["Johnson"].sum()
dfm_counts["mu_cameron"] = dfm_counts["Cameron"] / dfm_counts["Cameron"].sum()

# Calculate log-odds for each word in each group
dfm_counts["log_odds_johnson"] = np.log(dfm_counts["mu_johnson"] / (1 - dfm_counts["mu_johnson"]))
dfm_counts["log_odds_cameron"] = np.log(dfm_counts["mu_cameron"] / (1 - dfm_counts["mu_cameron"]))
# Compute log-odds ratio between groups
dfm_counts["log_odds_ratio"] = dfm_counts["log_odds_johnson"] - dfm_counts["log_odds_cameron"]

# Estimate variance of the log-odds ratio (approximate)
dfm_counts["variance"] = 1 / dfm_counts["Johnson"] + 1 / dfm_counts["Cameron"]

# Calculate standardized score (z-score) for significance ranking
dfm_counts["score"] = dfm_counts["log_odds_ratio"] / np.sqrt(dfm_counts["variance"])

# Return results sorted by score, highest absolute difference at the top
result = dfm_counts[["feature", "score", "Johnson", "Cameron"]].sort_values(by="score", ascending=False)

Fightin’ Words

These are the top-scoring features:

feature score Johnson Cameron
referendum 3.544026 6.000279 10.000279
playing 3.420876 5.000209 7.000209
free 3.139933 5.000244 9.000244
global 2.499159 3.000139 5.000139
african 2.472248 2.000052 1.000052
advance 2.472248 2.000052 1.000052
applications 2.472248 2.000052 1.000052
concrete 2.472248 2.000052 1.000052
sport 2.472248 2.000052 1.000052
provide 2.334536 2.00007 2.00007
proposal 2.112993 2.000087 3.000087
region 2.112993 2.000087 3.000087
partners 2.112993 2.000087 3.000087
iran 2.112993 2.000087 3.000087
arrest 2.112993 2.000087 3.000087
stability 2.112993 2.000087 3.000087
campaign 1.894895 2.000105 4.000105
learned 1.718199 3.000227 10.000227
decided 1.649506 1.000035 1.000035
previously 1.649506 1.000035 1.000035

Fightin’ Words

We then prepare the results for plotting:

Python
# Prepare `result` for ggplot
result['n'] = result['Johnson'] + result['Cameron']  # Total count for the word
result['log_n'] = np.log(result['n'])  # Log-transformed total count
result['cex'] = np.abs(result['score'])  # Text size
result['alpha'] = np.abs(result['score'])  # Opacity
result_df = result[['log_n', 'score', 'feature', 'cex', 'alpha', "Johnson", "Cameron"]]

log_n score feature cex alpha Johnson Cameron
2.772624 3.544026 referendum 3.544026 3.544026 6.000279 10.000279
2.484942 3.420876 playing 3.420876 3.420876 5.000209 7.000209
2.639092 3.139933 free 3.139933 3.139933 5.000244 9.000244
2.079476 2.499159 global 2.499159 2.499159 3.000139 5.000139
1.098647 2.472248 african 2.472248 2.472248 2.000052 1.000052
1.098647 2.472248 advance 2.472248 2.472248 2.000052 1.000052
1.098647 2.472248 applications 2.472248 2.472248 2.000052 1.000052
1.098647 2.472248 concrete 2.472248 2.472248 2.000052 1.000052
1.098647 2.472248 sport 2.472248 2.472248 2.000052 1.000052
1.386329 2.334536 provide 2.334536 2.334536 2.00007 2.00007

Fightin’ Words

Here is how we plot the results:

See code
R
library(dplyr)
result_df2 <- reticulate::py$result_df

ggplot(data=result_df2, aes(x = log_n, # x-axis
                         y = score, # y-axis
                         label = feature, # text labels
                         cex = abs(score), # text size
                         alpha = abs(score))) + # opacity
                geom_text() + # plot text
                xlab("log(n)") + # x-axis label 
                ylab("Fightin' words score") + # y-axis label
                theme_bw() + # nice black and white theme
                #theme(panel.grid = element_blank()) + # remove grid lines
                scale_size_continuous(guide = "none") + # remove size legend
                scale_alpha_continuous(guide = "none")

johnson cameron
honourable cherie
would burdensome
referendum peep
minister accreted
free acquire
playing advancement
fields afternoons
right xtaggart
gentleman guarantor
friend alive
house prosperous
know balkans
last priorities
believe rid
uk barricades
people battalion
said presented
government annual
one presence
us owing

Fightin’ Words

Here is how we plot the results:

See code
R
library(dplyr)
result_df2 <- reticulate::py$result_df

ggplot(data=result_df2, aes(x = log_n, # x-axis
                         y = score, # y-axis
                         label = feature, # text labels
                         cex = abs(score), # text size
                         alpha = abs(score))) + # opacity
                geom_text() + # plot text
                xlab("log(n)") + # x-axis label 
                ylab("Fightin' words score") + # y-axis label
                theme_bw() + # nice black and white theme
                #theme(panel.grid = element_blank()) + # remove grid lines
                scale_size_continuous(guide = "none") + # remove size legend
                scale_alpha_continuous(guide = "none")

johnson cameron
cherie honourable
rid right
barricades people
battalion minister
presented prime
presence said
blaby would
boycott gentleman
priorities friend
boycotts think
vindicated one
pence government
consular work
mornings us
continent get
inspiration need
tillerson point
washington take
kinds say
balkans made

Conclusion

Text Representation:

  • The Vector Space Model transforms text into numerical vectors, enabling computation of similarity and other metrics.
  • TF-IDF weighting refines word importance by downweighting terms that appear in many documents.

Similarity Metrics:

  • Measures like cosine similarity quantify relationships between documents or vectors.

Visualization Techniques:

  • Word clouds and “Fightin’ Words” highlight contrasts between corpora, leveraging both frequency and weighting.