Today we discuss document similarity.
We previously represented our text data as a document-feature matrix
- Rows: Documents
- Columns: Features
Each document is therefore described by a vector of word counts
We can write a vector representation of a document:
\[\textbf{a}=(a_{1}, a_{2}, \ldots, a_{n})\]
where:
\(a_1\) - no. times feature 1 appears in the document
\(a_2\) - no. times feature 2 appears in the document
\(a_n\) - no. times feature n appears in the document
The basic idea of similarity within text analysis is that each document can be represented by a vector of (weighted) feature counts
These vectors can be evaluated using similarity metrics
A document’s vector is simply a row in the document-feature matrix
There are many ways to measure document similarity:
One classic string-based measure is the Levenshtein (edit) distance
It measures the minimal number of operations (replacing, inserting, or deleting) required to transform one string into another.
For example, transforming kitten into sitting takes three operations:
1. kitten → sitten (substitute ‘k’ with ‘s’)
2. sitten → sittin (substitute ‘e’ with ‘i’)
3. sittin → sitting (insert ‘g’ at the end)
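To make the operation count concrete, here is a short Python sketch (not from the original slides) using the standard dynamic-programming algorithm:
Python
def levenshtein(s, t):
    """Minimal edit distance allowing substitution, insertion, and deletion."""
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        dp[i][0] = i
    for j in range(len(t) + 1):
        dp[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(s)][len(t)]

print(levenshtein("kitten", "sitting"))  # 3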
Levenshtein distance works best for short strings; it is computationally expensive and therefore rarely used in large-scale text applications.
The inner product, or “dot” product, between two vectors is the sum of the element-wise multiplication of the vectors:
\[ \begin{aligned} a \cdot b &= a^{T}b\\ &=a_{1} b_{1} + a_{2} b_{2} + ... + a_{n} b_{n} \end{aligned} \]
The dot product is a scalar, that is, a single number
When the documents are represented as rows of a document-feature matrix containing only 0s and 1s, the inner product gives the number of features that the two documents share.
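A tiny sketch with hypothetical binary vectors (the values are illustrative only):
Python
import numpy as np
# Two hypothetical binary feature vectors
a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 1, 0, 0])
print(a @ b)  # 2: the two documents share two features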
Let’s assume that we have the following documents.
1. text_1: “I love playing football with my friends”
2. text_2: “I hate watching and playing basketball”
3. text_3: “When I was a kid I was playing football with my friends every day all the evening”
Create Frequency Vectors
Python
from collections import Counter
import pandas as pd
from pretty_html_table import build_table
from IPython.display import display, HTML
# Input texts
texts = {
"text_1": "I love playing football with my friends",
"text_2": "I hate watching and playing basketball",
"text_3": "When I was a kid I was playing football with my friends every day all the evening"
}
# Create a frequency dictionary for each text
word_counts_list = {key: Counter(text.lower().split()) for key, text in texts.items()}
word_counts_list
{'text_1': Counter({'i': 1, 'love': 1, 'playing': 1, 'football': 1, 'with': 1, 'my': 1, 'friends': 1}), 'text_2': Counter({'i': 1, 'hate': 1, 'watching': 1, 'and': 1, 'playing': 1, 'basketball': 1}), 'text_3': Counter({'i': 2, 'was': 2, 'when': 1, 'a': 1, 'kid': 1, 'playing': 1, 'football': 1, 'with': 1, 'my': 1, 'friends': 1, 'every': 1, 'day': 1, 'all': 1, 'the': 1, 'evening': 1})}
Create Frequency Vectors
index | text_1 | text_2 | text_3 |
---|---|---|---|
i | 1.0 | 1.0 | 2.0 |
love | 1.0 | 0.0 | 0.0 |
playing | 1.0 | 1.0 | 1.0 |
football | 1.0 | 0.0 | 1.0 |
with | 1.0 | 0.0 | 1.0 |
my | 1.0 | 0.0 | 1.0 |
friends | 1.0 | 0.0 | 1.0 |
hate | 0.0 | 1.0 | 0.0 |
watching | 0.0 | 1.0 | 0.0 |
and | 0.0 | 1.0 | 0.0 |
basketball | 0.0 | 1.0 | 0.0 |
when | 0.0 | 0.0 | 1.0 |
was | 0.0 | 0.0 | 2.0 |
a | 0.0 | 0.0 | 1.0 |
kid | 0.0 | 0.0 | 1.0 |
every | 0.0 | 0.0 | 1.0 |
day | 0.0 | 0.0 | 1.0 |
all | 0.0 | 0.0 | 1.0 |
the | 0.0 | 0.0 | 1.0 |
evening | 0.0 | 0.0 | 1.0 |
Create Frequency Vectors
index | i | love | playing | football | with | my | friends | hate | watching | and | basketball | when | was | a | kid | every | day | all | the | evening |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
text_1 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
text_2 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
text_3 | 2.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
The inner product between A and B is:
Python
# Build the document-feature matrix from the word counts above
# (rows = texts, columns = words)
frequency_table = pd.DataFrame(word_counts_list).fillna(0).astype(int).T
row_a = frequency_table.loc['text_1']
row_b = frequency_table.loc['text_2']
# Compute element-wise products
element_wise_products = row_a * row_b
# Compute the dot product (sum of element-wise products)
dot_product1 = element_wise_products.sum()
print(dot_product1)  # 2: the shared features are 'i' and 'playing'
The inner product between A and C is:
Python
row_a = frequency_table.loc['text_1']
row_c = frequency_table.loc['text_3']
# Compute element-wise products
element_wise_products = row_a * row_c
# Compute the dot product (sum of element-wise products)
dot_product2 = element_wise_products.sum()
print(dot_product2)  # 7: 'i' contributes 2 (1 x 2), plus playing, football, with, my, friends
So, we now compare the two inner products: \(a \cdot b = 2\) and \(a \cdot c = 7\).
Because \(a \cdot c > a \cdot b\), we conclude that documents A and C are more similar to each other than A and B.
1. text_1: “I love playing football with my friends”
2. text_2: “I hate watching and playing basketball”
3. text_3: “When I was a kid I was playing football with my friends every day all the evening”
Notice, however, that the inner product is sensitive to document length: text_3 scores high with text_1 partly because it simply contains more words.
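A quick sketch of this sensitivity (illustrative only: we artificially double every count in text_3):
Python
# Doubling the counts of text_3 doubles its inner product with text_1,
# even though the content is unchanged
doubled_c = frequency_table.loc['text_3'] * 2
print((frequency_table.loc['text_1'] * doubled_c).sum())  # 14, up from 7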
The Euclidean distance between a and b can be defined as:
\[ \begin{aligned} d(a, b) &= \sqrt{\sum^{J}_{j=1} (a_{j} - b_{j})^2}\\ &= ||a - b|| \end{aligned} \]
Python
from collections import Counter
import pandas as pd
from pretty_html_table import build_table
from IPython.display import display, HTML
# Input texts
texts = {
"text_1": "I love playing football with my friends",
"text_2": "I hate watching and playing basketball",
"text_3": "When I was a kid I was playing football with my friends every day all the evening"
}
# Create a frequency dictionary for each text
word_counts_list = {key: Counter(text.lower().split()) for key, text in texts.items()}
# Combine into a DataFrame (at this point, rows are words and columns are texts)
frequency_table = pd.DataFrame(word_counts_list).fillna(0).astype(int)
# Transpose so that rows are texts and columns are words
frequency_table = frequency_table.transpose()
# Reset the index
frequency_table_reset = frequency_table.reset_index()
index | i | love | playing | football | with | my | friends | hate | watching | and | basketball | when | was | a | kid | every | day | all | the | evening |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
text_1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text_2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
text_3 | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
We now compute the Euclidean distance of these sentences.
Python
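# The distance computation itself was not shown on the slide; a minimal sketch
# using scikit-learn's pairwise euclidean_distances (assumed approach):
from sklearn.metrics.pairwise import euclidean_distances
distances = euclidean_distances(frequency_table)
distance_df = pd.DataFrame(distances,
                           index=frequency_table.index,
                           columns=frequency_table.index)
print(distance_df)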
text_1 text_2 text_3
text_1 0.000000 3.000000 3.741657
text_2 3.000000 0.000000 4.582576
text_3 3.741657 4.582576 0.000000
R
# Load required libraries
library(ggplot2)
# points_df (one 2D point per text) is built in code not shown on the slide.
# A sketch: place text_1 at the origin, text_2 on the x-axis, and text_3 by
# triangulation, so the pairwise distances match the matrix computed above.
dist_12 <- 3.000000; dist_13 <- 3.741657; dist_23 <- 4.582576
x3 <- (dist_13^2 + dist_12^2 - dist_23^2) / (2 * dist_12)
points_df <- data.frame(label = c("text_1", "text_2", "text_3"),
                        x = c(0, dist_12, x3),
                        y = c(0, 0, sqrt(dist_13^2 - x3^2)))
# Plot the points
ggplot(points_df, aes(x = x, y = y, label = label)) +
geom_point(size = 3) +
geom_text(vjust = -1) +
xlim(-1, 5) +
ylim(-1, 5) +
coord_fixed() + # Ensure equal scaling for x and y axes
labs(title = "2D Representation of Text Distances")+
theme_bw()
index | text_1 | text_2 | text_3 |
---|---|---|---|
text_1 | 0.0 | 3.0 | 3.741657 |
text_2 | 3.0 | 0.0 | 4.582576 |
text_3 | 3.741657 | 4.582576 | 0.0 |
R
# Extract coordinates from points_df for clarity
text_1 <- points_df[points_df$label == "text_1", c("x", "y")]
text_2 <- points_df[points_df$label == "text_2", c("x", "y")]
text_3 <- points_df[points_df$label == "text_3", c("x", "y")]
# Define full lines dataframe (all 3 connections)
lines_df <- data.frame(
x = c(text_1$x, text_1$x, text_2$x),
y = c(text_1$y, text_1$y, text_2$y),
xend = c(text_2$x, text_3$x, text_3$x),
yend = c(text_2$y, text_3$y, text_3$y),
distance = c(dist_12, dist_13, dist_23)
)
# Define individual line dataframes
lines_df1 <- data.frame(
x = text_1$x,
y = text_1$y,
xend = text_2$x,
yend = text_2$y,
distance = dist_12
)
lines_df2 <- data.frame(
x = text_1$x,
y = text_1$y,
xend = text_3$x,
yend = text_3$y,
distance = dist_13
)
lines_df3 <- data.frame(
x = text_2$x,
y = text_2$y,
xend = text_3$x,
yend = text_3$y,
distance = dist_23
)
# Plot points, lines, and distances
ggplot() +
geom_point(data = points_df, aes(x = x, y = y), size = 3) +
geom_text(data = points_df, aes(x = x, y = y, label = label), vjust = -1) +
geom_segment(data = lines_df1, aes(x = x, y = y, xend = xend, yend = yend), linetype = "dashed", color="grey50") +
geom_segment(data = lines_df2, aes(x = x, y = y, xend = xend, yend = yend), linetype = "dashed", color="grey50") +
geom_segment(data = lines_df3, aes(x = x, y = y, xend = xend, yend = yend), linetype = "dashed", color="grey50") +
# Annotate distances at midpoint of each line
annotate("text",
x = (text_1$x + text_2$x) / 2,
y = (text_1$y + text_2$y) / 2,
label = round(dist_12, 3),
vjust = -0.5) +
annotate("text",
x = (text_1$x + text_3$x) / 2,
y = (text_1$y + text_3$y) / 2,
label = round(dist_13, 3),
vjust = -0.5) +
annotate("text",
x = (text_2$x + text_3$x) / 2,
y = (text_2$y + text_3$y) / 2,
label = round(dist_23, 3),
vjust = -0.5)+
coord_fixed() + # Ensures equal scaling for x and y axes
labs(title = "2D Representation of Text Distances") +
xlim(-1, 5) +
ylim(-1, 5)+
theme_bw()
Measures of document similarity should not be sensitive to the number of words in each of the documents
We don’t want long documents to be “more similar” than shorter documents just as a function of length
Cosine similarity is a measure of similarity that is based on the normalized inner product of two vectors.
It can be interpreted as the cosine of the angle between the two document vectors: vectors pointing in the same direction have similarity 1, regardless of their length.
The cosine similarity \(cos(\theta)\) between two vectors a and b can be defined as:
\[cos(\theta) = \frac{a \cdot b}{||a|| ||b||}\]
where:
\(\theta\) is the angle between the two vectors, and \(||a||\) and \(||b||\) are the magnitudes of the vectors a and b
The vector magnitude is in turn given by:
\[ \begin{aligned} ||a||&=\sqrt{a\cdot a}\\ &=\sqrt{a_{1}^2 + a_{2}^2 + ... + a^2_{J}} \end{aligned} \]
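As an illustration (a sketch applying the formula directly to the frequency_table built earlier; this code is not from the slides):
Python
import numpy as np
a = frequency_table.loc['text_1'].to_numpy()
c = frequency_table.loc['text_3'].to_numpy()
cos_ac = (a @ c) / (np.linalg.norm(a) * np.linalg.norm(c))
# ~0.577 with this vocabulary; the sklearn result below (0.510) differs slightly
# because CountVectorizer's default tokenizer drops one-character tokens ('i', 'a')
print(cos_ac)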
Python
## We work again on the same three sentences, this time using scikit-learn's cosine_similarity
text_1 = "I love playing football with my friends"
text_2 = "I hate watching and playing basketball"
text_3 = "When I was a kid I was playing football with my friends every day all the evening"
texts = [text_1, text_2, text_3]
## Construct the bag-of-words table again
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
matrix = count_vectorizer.fit_transform(texts)
## Create a data frame with the word counts for each sentence
table = matrix.todense()
df = pd.DataFrame(table,
columns=count_vectorizer.get_feature_names_out(),
index=['text_1', 'text_2', 'text_3'])
## Apply cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
values = cosine_similarity(df, df)
df = pd.DataFrame(values, columns=["Text 1", "Text 2", "Text 3"], index = ["Text 1", "Text 2", "Text 3"])
print(df)
Text 1 Text 2 Text 3
Text 1 1.000000 0.182574 0.510310
Text 2 0.182574 1.000000 0.111803
Text 3 0.510310 0.111803 1.000000
The value of cosine similarity ranges from -1 to 1; for word-count vectors, which are non-negative, it ranges from 0 to 1
Using the same dataset as before, we can now ask how similar politicians are to one another.
Python
import pandas as pd
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
#aggression_texts.head(3)
aggression_texts_aggregated = aggression_texts.groupby("name")["body"].apply(" ".join).reset_index()
aggression_texts_aggregated
name body
0 Adam Afriyie I welcomed much of what was said by the honou...
1 Adam Holloway What recent discussions he has had on the futu...
2 Adam Ingram I think I said that it was an additional power...
3 Adam Price There has been much talk in the Chamber this a...
4 Adrian Bailey Given the failure of successive well-intention...
... ... ...
1593 Willie Rennie I am greatly concerned about the Government's ...
1594 Yasmin Qureshi The Gracious Speech represents a missed opport...
1595 Yvette Cooper I begin by responding to some of the concerns ...
1596 Yvonne Fovargue What recent assessment her Department has made...
1597 Zac Goldsmith I absolutely do, and I strongly encourage the ...
[1598 rows x 2 columns]
We first create the Count vectorized representation of speeches
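The vectorization step itself is not shown on the slide; a minimal sketch, assuming the resulting document-feature matrix is stored in the variable text_dfm used below:
Python
from sklearn.feature_extraction.text import CountVectorizer
# Build a document-feature matrix: one row per politician, one column per word
vectorizer = CountVectorizer()
text_dfm = vectorizer.fit_transform(aggression_texts_aggregated["body"])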
We then select the politician of interest
Python
#Selecting the politician of interest
boris_index = aggression_texts_aggregated[aggression_texts_aggregated['name'] == 'Boris Johnson'].index[0]
# Calculate the cosine similarity between Boris Johnson and all other politicians
cosine_sim_50 = cosine_similarity(text_dfm[boris_index], text_dfm).flatten()
We then create a tidier dataframe that we can visualize:
Python
# Create a DataFrame to display the similarities
similarity_df = pd.DataFrame({
'name_politician': aggression_texts_aggregated['name'], # Use the names from conserv_aggregated
'cosine_similarity': cosine_sim_50
})
# Sort by cosine similarity in descending order
similarity_df = similarity_df.sort_values(by='cosine_similarity', ascending=False)
# Show the top similarities
print(similarity_df.head())
name_politician cosine_similarity
145 Boris Johnson 1.000000
1590 William Hague 0.964526
283 David Miliband 0.964144
278 David Lidington 0.963183
1356 Philip Hammond 0.963094
R
library(dplyr)
similarity_df2 <- reticulate::py$similarity_df
similarity_df2<-head(similarity_df2, 11)
similarity_df2<-subset(similarity_df2, name_politician!="Boris Johnson")
# Create the bar chart
ggplot(similarity_df2, aes(x = reorder(name_politician, -cosine_similarity), y = cosine_similarity)) +
geom_col() +
geom_text(aes(label = round(cosine_similarity, 3)), vjust = -0.5, size = 3.5) +
labs(
title = "Cosine Similarity Scores with Boris Johnson",
x = "Politicians",
y = "Cosine Similarity"
) +
theme_bw()+
ylim(0, 1)+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We can also examine the histogram of cosine similarities between Boris Johnson and all the other politicians.
Most politicians' speeches are highly similar to his, as indicated below.
R
library(dplyr)
similarity_df2 <- reticulate::py$similarity_df
#similarity_df2<-head(similarity_df2, 11)
#similarity_df2<-subset(similarity_df2, name_politician!="Boris Johnson")
# Create the bar chart
ggplot(similarity_df2, aes(x = cosine_similarity)) +
geom_histogram(binwidth = 0.05, color = "white", alpha = 0.7) +
labs(title = "Histogram of Cosine Similarities",
x = "Cosine Similarity",
y = "Frequency")+
theme_bw()
Text as Vectors: Represent text as vectors in a document-feature matrix.
Similarity Metrics: Use edit distance, inner product, Euclidean distance, or cosine similarity.
Edit Distance: Best for short texts; computationally costly for long texts.
Inner Product: Measures feature overlap but is sensitive to text length.
Cosine Similarity: Adjusts for text length; ideal for comparing documents.
Applications: Compare speeches, cluster texts, and analyze document similarity.
Popescu (JCU): Lecture 17