Lecture 17: Similarity

Bogdan G. Popescu

John Cabot University

Vector Space Model

Today we discuss similarity.

We previously represented our text data as a document-feature matrix

- Rows: Documents
- Columns: Features

Each document is therefore described by a vector of word counts

Vector Space Model

We can write a vector representation of a document:

\[\textbf{a}=\{a_{1}, a_{2},..., a_{n}\}\]

where:

\(a_1\) - no. times feature 1 appears in the document
\(a_2\) - no. times feature 2 appears in the document
\(a_n\) - no. times feature n appears in the document
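As a minimal sketch (using a small, hypothetical vocabulary), the count vector for a document can be read off directly from its word counts:

Python
from collections import Counter

# Hypothetical feature vocabulary and document (illustration only)
vocabulary = ["football", "friends", "love", "playing"]
document = "I love playing football with my friends"

# Count word occurrences and read off the vector a = (a_1, ..., a_n)
counts = Counter(document.lower().split())
a = [counts[feature] for feature in vocabulary]
print(a)  # [1, 1, 1, 1]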

Similarity

The basic idea of similarity within text analysis is that each document can be represented by a vector of (weighted) feature counts

These vectors can be evaluated using similarity metrics

A document’s vector is simply a row in the document-feature matrix

There are many ways to measure document similarity:

  1. Edit distances
  2. Inner product
  3. Euclidean distance
  4. Cosine similarity

Similarity

Edit distances

A common way is Levenshtein distance

It measures the minimal number of operations (replacing, inserting, or deleting) required to transform one string into another.

For example, kitten and sitting

1. kitten → sitten (substitute ‘k’ with ‘s’)
2. sitten → sittin (substitute ‘e’ with ‘i’)
3. sittin → sitting (insert ‘g’ at the end)

Python
import Levenshtein
x = ["kitten", "sitting"]
distance = Levenshtein.distance(x[0], x[1])
print(distance)
3

Similarity

Edit distances

Levenshtein distance works best for short strings; it is computationally expensive and therefore rarely used in large-scale applications.

Similarity

Inner Product

The inner product, or “dot” product, between two vectors is the sum of the element-wise multiplication of the vectors:

\[ \begin{aligned} a \cdot b &= a^{T}b\\ &=a_{1} b_{1} + a_{2} b_{2} + ... + a_{n} b_{n} \end{aligned} \]

The dot product is a scalar (a single number).

When the document vectors contain only 0s and 1s, the inner product gives the number of features that the two documents share.
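As a minimal sketch with two hypothetical 0/1 vectors that share two features:

Python
import numpy as np

# Hypothetical binary feature vectors: 1 = feature present, 0 = absent
a = np.array([1, 1, 0, 1, 0])
b = np.array([1, 0, 0, 1, 1])

# The dot product counts the features that both documents contain
print(np.dot(a, b))  # 2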

Similarity

Inner Product

Let’s assume that we have the following documents.

1. text_1: “I love playing football with my friends”
2. text_2: “I hate watching and playing basketball”
3. text_3: “When I was a kid I was playing football with my friends every day all the evening”

Similarity

Inner Product

Create Frequency Vectors

Python
from collections import Counter
import pandas as pd
from pretty_html_table import build_table
from IPython.display import display, HTML

# Input texts
texts = {
    "text_1": "I love playing football with my friends",
    "text_2": "I hate watching and playing basketball",
    "text_3": "When I was a kid I was playing football with my friends every day all the evening"
}

# Create a frequency dictionary for each text
word_counts_list = {key: Counter(text.lower().split()) for key, text in texts.items()}
word_counts_list
{'text_1': Counter({'i': 1, 'love': 1, 'playing': 1, 'football': 1, 'with': 1, 'my': 1, 'friends': 1}), 'text_2': Counter({'i': 1, 'hate': 1, 'watching': 1, 'and': 1, 'playing': 1, 'basketball': 1}), 'text_3': Counter({'i': 2, 'was': 2, 'when': 1, 'a': 1, 'kid': 1, 'playing': 1, 'football': 1, 'with': 1, 'my': 1, 'friends': 1, 'every': 1, 'day': 1, 'all': 1, 'the': 1, 'evening': 1})}

Similarity

Inner Product

Create Frequency Vectors

Python
# Combine into a DataFrame where rows are texts and columns are words
frequency_table = pd.DataFrame(word_counts_list).fillna(0)

index text_1 text_2 text_3
i 1.0 1.0 2.0
love 1.0 0.0 0.0
playing 1.0 1.0 1.0
football 1.0 0.0 1.0
with 1.0 0.0 1.0
my 1.0 0.0 1.0
friends 1.0 0.0 1.0
hate 0.0 1.0 0.0
watching 0.0 1.0 0.0
and 0.0 1.0 0.0
basketball 0.0 1.0 0.0
when 0.0 0.0 1.0
was 0.0 0.0 2.0
a 0.0 0.0 1.0
kid 0.0 0.0 1.0
every 0.0 0.0 1.0
day 0.0 0.0 1.0
all 0.0 0.0 1.0
the 0.0 0.0 1.0
evening 0.0 0.0 1.0

Similarity

Inner Product

Create Frequency Vectors

Python
# Transpose so that rows are texts and columns are words
frequency_table = frequency_table.transpose()
frequency_table_reset = frequency_table.reset_index()

index i love playing football with my friends hate watching and basketball when was a kid every day all the evening
text_1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
text_2 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
text_3 2.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

Similarity

Inner Product

The inner product between A and B is:

Python
# Set text identifiers as the index if not already set
frequency_table.index = ['text_1', 'text_2', 'text_3']
row_a = frequency_table.loc['text_1']
row_b = frequency_table.loc['text_2']
# Compute element-wise products
element_wise_products = row_a * row_b
# Compute the dot product (sum of element-wise products)
dot_product1 = element_wise_products.sum()

Python
# Print intermediate steps
print(f"Row A (text_1): {row_a.tolist()}")
Row A (text_1): [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Python
print(f"Row B (text_2): {row_b.tolist()}")
Row B (text_2): [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Python
print(f"Element-wise products: {element_wise_products.tolist()}")
Element-wise products: [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Python
print(f"Dot product (sum): {dot_product1}")
Dot product (sum): 2.0

Similarity

Inner Product

The inner product between A and C is:

Python
# Set text identifiers as the index if not already set
frequency_table.index = ['text_1', 'text_2', 'text_3']
row_a = frequency_table.loc['text_1']
row_c = frequency_table.loc['text_3']
# Compute element-wise products
element_wise_products = row_a * row_c
# Compute the dot product (sum of element-wise products)
dot_product2 = element_wise_products.sum()
Python
# Print intermediate steps
print(f"Row A (text_1): {row_a.tolist()}")
Row A (text_1): [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Python
# Print intermediate steps
print(f"Row C (text_3): {row_c.tolist()}")
Row C (text_3): [2.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Python
print(f"Element-wise products: {element_wise_products.tolist()}")
Element-wise products: [2.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

Python
print(f"Dot product (sum): {dot_product2}")
Dot product (sum): 7.0

Similarity

Inner Product

Create Frequency Vectors

index i love playing football with my friends hate watching and basketball when was a kid every day all the evening
text_1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
text_2 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
text_3 2.0 0.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

So, we now compare the two inner products:

Python
print(f"Dot product (sum): {dot_product1}")
Dot product (sum): 2.0
Python
print(f"Dot product (sum): {dot_product2}")
Dot product (sum): 7.0

Because \(a \cdot c > a \cdot b\), we conclude that A and C are more similar than A and B.

Similarity

Inner Product

Because \(a \cdot c > a \cdot b\), we conclude that A and C are more similar than A and B.

1. text_1: “I love playing football with my friends”
2. text_2: “I hate watching and playing basketball”
3. text_3: “When I was a kid I was playing football with my friends every day all the evening”

Notice that the inner product is sensitive to document length.
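A quick sketch of this sensitivity (hypothetical count vectors): duplicating a document doubles every count, and therefore doubles its inner product with any other document, even though its content has not changed.

Python
import numpy as np

# Hypothetical count vectors
a = np.array([1, 1, 0, 1])
b = np.array([1, 0, 1, 1])

print(np.dot(a, b))      # 2
print(np.dot(2 * a, b))  # 4: "more similar" only because the document is longer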

Similarity

Euclidean Distance

The Euclidean distance between a and b can be defined as:

\[ \begin{aligned} d(a, b) &= \sqrt{\sum^{J}_{j=1} (a_{j} - b_{j})^2}\\ &= ||a - b|| \end{aligned} \]
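As a quick check against the frequency table built earlier: text_1 and text_2 differ by exactly one count in nine features (love, football, with, my, friends, hate, watching, and, basketball) and agree on all the others, so

\[d(\text{text\_1}, \text{text\_2}) = \sqrt{9 \times 1^2} = 3,\]

which matches the scipy result below.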

Similarity

Euclidean Distance

Python
from collections import Counter
import pandas as pd
from pretty_html_table import build_table
from IPython.display import display, HTML

# Input texts
texts = {
    "text_1": "I love playing football with my friends",
    "text_2": "I hate watching and playing basketball",
    "text_3": "When I was a kid I was playing football with my friends every day all the evening"
}

# Create a frequency dictionary for each text
word_counts_list = {key: Counter(text.lower().split()) for key, text in texts.items()}
# Combine into a DataFrame where rows are texts and columns are words
frequency_table = pd.DataFrame(word_counts_list).fillna(0).astype(int)

# Transpose so that rows are texts and columns are words
frequency_table = frequency_table.transpose()

# Reset index
frequency_table_reset = frequency_table.reset_index()

Similarity

Euclidean Distance

index i love playing football with my friends hate watching and basketball when was a kid every day all the evening
text_1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
text_2 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0
text_3 2 0 1 1 1 1 1 0 0 0 0 1 2 1 1 1 1 1 1 1

We now compute the Euclidean distance of these sentences.

Python
from scipy.spatial import distance

# Compute pairwise Euclidean distances
matrix = distance.cdist(frequency_table, frequency_table, metric='euclidean')

df_eucl = pd.DataFrame(matrix, 
                  columns= ["text_1", "text_2", "text_3"],
                  index=['text_1', 'text_2', 'text_3'])
df_eucl
          text_1    text_2    text_3
text_1  0.000000  3.000000  3.741657
text_2  3.000000  0.000000  4.582576
text_3  3.741657  4.582576  0.000000

Similarity

Euclidean Distance

See code
R
# Load required libraries
library(ggplot2)

# points_df is assumed to hold 2D coordinates (x, y) and a label for each text;
# it is built on an earlier, unshown slide from the distance matrix above

# Plot the points
ggplot(points_df, aes(x = x, y = y, label = label)) +
  geom_point(size = 3) +
  geom_text(vjust = -1) +
  xlim(-1, 5) +
  ylim(-1, 5) +
  coord_fixed() + # Ensure equal scaling for x and y axes
  labs(title = "2D Representation of Text Distances")+
  theme_bw()

Similarity

Euclidean Distance

text_1 text_2 text_3
text_1 0.0 3.0 3.741657
text_2 3.0 0.0 4.582576
text_3 3.741657 4.582576 0.0

See code
R
# points_df (2D text coordinates) and dist_12, dist_13, dist_23 (pairwise
# Euclidean distances) are assumed to be defined on earlier slides
# Extract coordinates from points_df for clarity
text_1 <- points_df[points_df$label == "text_1", c("x", "y")]
text_2 <- points_df[points_df$label == "text_2", c("x", "y")]
text_3 <- points_df[points_df$label == "text_3", c("x", "y")]

# Define full lines dataframe (all 3 connections)
lines_df <- data.frame(
  x = c(text_1$x, text_1$x, text_2$x),
  y = c(text_1$y, text_1$y, text_2$y),
  xend = c(text_2$x, text_3$x, text_3$x),
  yend = c(text_2$y, text_3$y, text_3$y),
  distance = c(dist_12, dist_13, dist_23)
)

# Define individual line dataframes
lines_df1 <- data.frame(
  x = text_1$x,
  y = text_1$y,
  xend = text_2$x,
  yend = text_2$y,
  distance = dist_12
)

lines_df2 <- data.frame(
  x = text_1$x,
  y = text_1$y,
  xend = text_3$x,
  yend = text_3$y,
  distance = dist_13
)

lines_df3 <- data.frame(
  x = text_2$x,
  y = text_2$y,
  xend = text_3$x,
  yend = text_3$y,
  distance = dist_23
)

# Plot points, lines, and distances
ggplot() +
  geom_point(data = points_df, aes(x = x, y = y), size = 3) +
  geom_text(data = points_df, aes(x = x, y = y, label = label), vjust = -1) +
  geom_segment(data = lines_df1, aes(x = x, y = y, xend = xend, yend = yend), linetype = "dashed", color="grey50") +
  geom_segment(data = lines_df2, aes(x = x, y = y, xend = xend, yend = yend), linetype = "dashed", color="grey50") +
  geom_segment(data = lines_df3, aes(x = x, y = y, xend = xend, yend = yend), linetype = "dashed", color="grey50") +

  # Annotate distances at midpoint of each line
  annotate("text",
           x = (text_1$x + text_2$x) / 2,
           y = (text_1$y + text_2$y) / 2,
           label = round(dist_12, 3),
           vjust = -0.5) +
  annotate("text",
           x = (text_1$x + text_3$x) / 2,
           y = (text_1$y + text_3$y) / 2,
           label = round(dist_13, 3),
           vjust = -0.5) +
  annotate("text",
           x = (text_2$x + text_3$x) / 2,
           y = (text_2$y + text_3$y) / 2,
           label = round(dist_23, 3),
           vjust = -0.5)+
  coord_fixed() + # Ensures equal scaling for x and y axes
  labs(title = "2D Representation of Text Distances") +
  xlim(-1, 5) +
  ylim(-1, 5)+
  theme_bw()

Similarity

Euclidean Distance


  • The shorter the distance between two texts, the more similar they are.
  • However, text length affects the result.
  • Longer documents tend to have larger Euclidean distances than shorter ones:
    “text_1”: “I love playing football with my friends”
    “text_2”: “I hate watching and playing basketball”
    “text_3”: “When I was a kid I was playing football with my friends every day all the evening”

Similarity

Cosine Similarity

Measures of document similarity should not be sensitive to the number of words in each of the documents

We don’t want long documents to be “more similar” than shorter documents just as a function of length

Cosine similarity is a measure of similarity that is based on the normalized inner product of two vectors.

It can be interpreted as:

  • a length-normalized version of the inner product (equivalently, it is monotonically related to the Euclidean distance between the length-normalized vectors)
  • the cosine of the angle between the two vectors

Similarity

Cosine Similarity

The cosine similarity \(cos(\theta)\) between two vectors a and b can be defined as:

\[cos(\theta) = \frac{a \cdot b}{||a|| ||b||}\]

where:

\(\theta\) - the angle between the two vectors
\(||a||\) and \(||b||\) - the magnitudes of the vectors a and b

Similarity

Vector magnitude

The vector magnitude is in turn given by:

\[ \begin{aligned} ||a||&=\sqrt{a\cdot a}\\ &=\sqrt{a_{1}^2 + a_{2}^2 + ... + a^2_{J}} \end{aligned} \]
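As a minimal sketch (assuming the frequency_table built in the inner product section is still in memory), the magnitudes and the cosine similarity of text_1 and text_2 can be computed directly:

Python
import numpy as np

# Assumes frequency_table (rows = texts, columns = words) from earlier
a = frequency_table.loc["text_1"].to_numpy()
b = frequency_table.loc["text_2"].to_numpy()

norm_a = np.sqrt(np.sum(a ** 2))   # ||a|| = sqrt(7)
norm_b = np.sqrt(np.sum(b ** 2))   # ||b|| = sqrt(6)
cosine = np.dot(a, b) / (norm_a * norm_b)
print(round(cosine, 3))            # about 0.309

Note that scikit-learn's CountVectorizer, used below, drops one-letter tokens such as “I” and “a” by default, so its cosine values differ slightly from those based on this hand-built table.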

Similarity

Cosine Similarity

Python
## We work again with the same three sentences, this time using cosine similarity
text_1 = "I love playing football with my friends"
text_2 = "I hate watching and playing basketball"
text_3 = "When I was a kid I was playing football with my friends every day all the evening"

texts = [text_1, text_2, text_3]

## Construct the bag-of-words table again
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
matrix = count_vectorizer.fit_transform(texts)

## Create a data frame with the word counts for every sentence
table = matrix.todense()
df = pd.DataFrame(table, 
                  columns=count_vectorizer.get_feature_names_out(), 
                  index=['text_1', 'text_2', 'text_3'])

## Apply cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
values = cosine_similarity(df, df)
df = pd.DataFrame(values, columns=["Text 1", "Text 2", "Text 3"], index=["Text 1", "Text 2", "Text 3"])
print(df)
          Text 1    Text 2    Text 3
Text 1  1.000000  0.182574  0.510310
Text 2  0.182574  1.000000  0.111803
Text 3  0.510310  0.111803  1.000000

Similarity

Cosine Similarity

The value of cosine similarity ranges from -1 to 1:

  • A value of 1 indicates that the vectors point in the same direction
  • A value of 0 indicates that the vectors are orthogonal (i.e., not similar at all)
  • A value of -1 indicates that the vectors are diametrically opposed

Since word counts are non-negative, cosine similarities between count vectors fall between 0 and 1.
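A minimal sketch with hypothetical vectors illustrating the three cases:

Python
import numpy as np

def cos_sim(u, v):
    # Cosine similarity between two vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0])
print(round(cos_sim(a, 2 * a), 3))                                    # 1.0: same direction
print(round(cos_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0])), 3))  # 0.0: orthogonal
print(round(cos_sim(a, -a), 3))                                       # -1.0: diametrically opposed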

Application

Using the same dataset as before, we examine how similar politicians' speeches are.

Python
import pandas as pd
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
#aggression_texts.head(3)
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()
aggression_texts_aggregated
                 name                                               body
0        Adam Afriyie  I welcomed much of what was said by  the honou...
1       Adam Holloway  What recent discussions he has had on the futu...
2         Adam Ingram  I think I said that it was an additional power...
3          Adam Price  There has been much talk in the Chamber this a...
4       Adrian Bailey  Given the failure of successive well-intention...
...               ...                                                ...
1593    Willie Rennie  I am greatly concerned about the Government's ...
1594   Yasmin Qureshi  The Gracious Speech represents a missed opport...
1595    Yvette Cooper  I begin by responding to some of the concerns ...
1596  Yvonne Fovargue  What recent assessment her Department has made...
1597    Zac Goldsmith  I absolutely do, and I strongly encourage the ...

[1598 rows x 2 columns]

Application

Creating a Count-Vectorized Representation of Speeches

We first create a count-vectorized representation of the speeches

Python
# Create a CountVectorizer for raw term frequencies
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated['body'])

We then select the politician of interest

Python
# Select the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]
# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()

Application

Calculating the Cosine Similarity

We then create a more friendly dataframe that we can visualize:

Python
# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
    'name_politician': aggression_texts_aggregated['name'],  # Use the names from aggression_texts_aggregated
    'cosine_similarity': cosine_sim_50
})

# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)

# Show the top similarities
print(similarity_df.head())
      name_politician  cosine_similarity
145     Boris Johnson           1.000000
1590    William Hague           0.964526
283    David Miliband           0.964144
278   David Lidington           0.963183
1356   Philip Hammond           0.963094

Application

Comparing to other Speeches

See code
R
library(dplyr)
similarity_df2 <- reticulate::py$similarity_df
similarity_df2<-head(similarity_df2, 11)
similarity_df2<-subset(similarity_df2, name_politician!="Boris Johnson")

# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
  geom_col() +
  geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
  labs(
    title = "Cosine Similarity Scores with Boris Johnson",
    x = "Politicians",
    y = "Cosine Similarity"
  ) +
    theme_bw()+
  ylim(0, 1)+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Application

Comparing to other Speeches

We can also examine the histogram of cosine similarities between Boris Johnson and all the other politicians.

Most politicians' speeches are quite similar to his, as indicated below.

See code
R
library(dplyr)
similarity_df2 <- reticulate::py$similarity_df
#similarity_df2<-head(similarity_df2, 11)
#similarity_df2<-subset(similarity_df2, name_politician!="Boris Johnson")

# Create the histogram
ggplot(similarity_df2, aes(x = cosine_similarity)) +
  geom_histogram(binwidth = 0.05, color = "white", alpha = 0.7) +
  labs(title = "Histogram of Cosine Similarities",
       x = "Cosine Similarity",
       y = "Frequency")+
  theme_bw()

Conclusion

Text as Vectors: Represent text as vectors in a document-feature matrix.

Similarity Metrics: Use edit distance, inner product, Euclidean distance, or cosine similarity.

Edit Distance: Best for short texts; computationally costly for long texts.

Inner Product: Measures feature overlap but is sensitive to text length.

Cosine Similarity: Adjusts for text length; ideal for comparing documents.

Applications: Compare speeches, cluster texts, and analyze document similarity.