L21: Unsupervised Learning: Topic Models

Bogdan G. Popescu

John Cabot University

Introduction

Topic Models are unsupervised learning models.

They allow us to cluster similar documents in a corpus together.

Code
```{mermaid}
%%| code-fold: true
%%| code-summary: "See code"
flowchart TD
  A[Do we know the categories of the document?   ] --> B(Yes)
  A[Do we know the categories of the document?   ] --> C(No)
  C(No) --> D[Topic Models]
  B(Yes) --> E[Do you know the rule for placing documents in categories?   ]
  E[Do you know the rule for placing documents in categories?   ] --> F[Yes] --> I[Dictionaries]
  E[Do you know the rule for placing documents in categories?   ] --> G[No] --> J[Supervised Learning]
```

Topic Models

Topic Models are an automatic procedure for us to discover the main themes in an unstructured corpus.

There is no requirement for a training set or for labelling of text before estimation

Overall, topic models allow us to organize, understand, and summarize large corpora of data.

Latent Dirichlet Allocation (LDA) has been a common approach

  • it is easy to implement
  • it performs well on small datasets

However:

  • on large datasets with nuanced topics, LDA might struggle to capture the complexity of the text
  • when dealing with sparse data, LDA might struggle
  • alternatives include embedding-based approaches such as BERT

Topic Models as Language Models

A language model is a probability distribution over words.

For example, in the Naive Bayes model, we tried to estimate a probability distribution for each category of interest.

Each document was assigned to a single category.

For topic models, we estimate probability distributions for each topic.

  • each document can belong to multiple topics

What is a topic?

A “topic” is a probability distribution over a fixed word vocabulary.

Consider a vocabulary: democracy, elections, government, economy, growth, market.

When discussing democracy:

  • Frequently use the words democracy, elections, and government.
  • Infrequently use the words economy, growth, and market.

When discussing the economy:

  • Frequently use the words economy, growth, and market.
  • Infrequently use the words democracy, elections, and government.

We now compute the topic word probabilities to: 1) understand the document content; 2) identify patterns across a corpus; 3) facilitate human interpretation

Topic Word Probabilities Table

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

What is a document?

In a topic model, each document is described as being composed of a mixture of corpus-wide topics

For each document, we find the topic proportions that maximize the probability that we would observe the words in that particular document.

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

| Doc | democracy | elections | government | economy | growth | market |
|-----|-----------|-----------|------------|---------|--------|--------|
| A   | 4         | 3         | 2          | 1       | 1      | 0      |
| B   | 1         | 1         | 1          | 3       | 4      | 3      |

What is a document?

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

| Doc | democracy | elections | government | economy | growth | market |
|-----|-----------|-----------|------------|---------|--------|--------|
| A   | 4         | 3         | 2          | 1       | 1      | 0      |
| B   | 1         | 1         | 1          | 3       | 4      | 3      |

What is the probability of observing Document A’s word counts under the “Democracy” topic?

The following equation shows how likely a document’s word counts are under a specific topic, given that topic’s word probabilities (a multinomial likelihood). \[ \begin{equation} \begin{aligned} P(W_A|\mu_{\text{Democracy}}) &= \frac{M_{A}!}{\prod_{j=1}^{J} W_{Aj}!}\prod^{J}_{j=1}\mu^{W_{Aj}}_{\text{Democracy},j}\\ &= \frac{(4+3+2+1+1+0)!}{4! \cdot 3! \cdot 2! \cdot 1! \cdot 1! \cdot 0!} \cdot 0.4^4 \cdot 0.3^3 \cdot 0.25^2 \cdot 0.02^1 \cdot 0.02^1 \cdot 0.01^0\\ &= \frac{11!}{288} \cdot 0.00000001728\\ &= \frac{39916800}{288} \cdot 0.00000001728\\ &= 138600 \cdot 0.00000001728\\ &\approx 0.0024 \end{aligned} \end{equation} \]
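
As a quick sanity check, the same multinomial likelihood can be computed with SciPy (a minimal sketch; SciPy is not otherwise used in these slides):

```python
from scipy.stats import multinomial

# Topic-word probabilities for the "Democracy" topic (from the table above)
mu_democracy = [0.4, 0.3, 0.25, 0.02, 0.02, 0.01]

# Document A's counts for: democracy, elections, government, economy, growth, market
doc_a = [4, 3, 2, 1, 1, 0]

# P(W_A | mu_Democracy) -- prints roughly 0.0024
print(multinomial.pmf(doc_a, n=sum(doc_a), p=mu_democracy))
```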

What is a document?

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

| Doc | democracy | elections | government | economy | growth | market |
|-----|-----------|-----------|------------|---------|--------|--------|
| A   | 4         | 3         | 2          | 1       | 1      | 0      |
| B   | 1         | 1         | 1          | 3       | 4      | 3      |

What is the probability of observing Document A’s word counts under the “Economy” topic?

\[ \begin{equation} \begin{aligned} P(W_A|\mu_{\text{Economy}}) &= \frac{M_{A}!}{\prod_{j=1}^{J} W_{Aj}!}\prod^{J}_{j=1}\mu^{W_{Aj}}_{\text{Economy},j}\\ &= \frac{(4+3+2+1+1+0)!}{4! \cdot 3! \cdot 2! \cdot 1! \cdot 1! \cdot 0!} \cdot 0.02^4 \cdot 0.01^3 \cdot 0.02^2 \cdot 0.3^1 \cdot 0.35^1 \cdot 0.3^0\\ &= \frac{11!}{288} \cdot 6.72 \times 10^{-18}\\ &= 138600 \cdot 6.72 \times 10^{-18}\\ &\approx 9.31 \times 10^{-13} \end{aligned} \end{equation} \]

What is a document?

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

| Doc | democracy | elections | government | economy | growth | market |
|-----|-----------|-----------|------------|---------|--------|--------|
| A   | 4         | 3         | 2          | 1       | 1      | 0      |
| B   | 1         | 1         | 1          | 3       | 4      | 3      |

What is the probability of observing Document B’s word counts under the “Democracy” topic?

\[ \begin{equation} \begin{aligned} P(W_B|\mu_{\text{Democracy}}) &= \frac{M_{B}!}{\prod_{j=1}^{J} W_{Bj}!}\prod^{J}_{j=1}\mu^{W_{Bj}}_{\text{Democracy},j}\\ &= \frac{(1+1+1+3+4+3)!}{1! \cdot 1! \cdot 1! \cdot 3! \cdot 4! \cdot 3!} \cdot 0.4^1 \cdot 0.3^1 \cdot 0.25^1 \cdot 0.02^3 \cdot 0.02^4 \cdot 0.01^3\\ &= \frac{13!}{864} \cdot 3.84 \times 10^{-20}\\ &= \frac{6227020800}{864} \cdot 3.84 \times 10^{-20}\\ &= 7207200 \cdot 3.84 \times 10^{-20}\\ &\approx 2.77 \times 10^{-13} \end{aligned} \end{equation} \]

What is a document?

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

| Doc | democracy | elections | government | economy | growth | market |
|-----|-----------|-----------|------------|---------|--------|--------|
| A   | 4         | 3         | 2          | 1       | 1      | 0      |
| B   | 1         | 1         | 1          | 3       | 4      | 3      |

What is the probability of observing Document B’s word counts under the “Economy” topic?

\[ \begin{equation} \begin{aligned} P(W_B|\mu_{\text{Economy}}) &= \frac{M_{B}!}{\prod_{j=1}^{J} W_{Bj}!}\prod^{J}_{j=1}\mu^{W_{Bj}}_{\text{Economy},j}\\ &= \frac{(1+1+1+3+4+3)!}{1! \cdot 1! \cdot 1! \cdot 3! \cdot 4! \cdot 3!} \cdot 0.02^1 \cdot 0.01^1 \cdot 0.02^1 \cdot 0.3^3 \cdot 0.35^4 \cdot 0.3^3\\ &= \frac{13!}{864} \cdot 4.3758 \times 10^{-11}\\ &= 7207200 \cdot 4.3758 \times 10^{-11}\\ &\approx 0.00032 \end{aligned} \end{equation} \]

What is a document?

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

| Doc | democracy | elections | government | economy | growth | market |
|-----|-----------|-----------|------------|---------|--------|--------|
| A   | 4         | 3         | 2          | 1       | 1      | 0      |
| B   | 1         | 1         | 1          | 3       | 4      | 3      |

What is the probability of observing Document A’s word counts under an equal (50/50) mixture of the Democracy and Economy topics?

\[ \begin{equation} \begin{aligned} P(W_A|\mu_{\text{Democracy+Economy}}) &= \frac{M_{A}!}{\prod_{j=1}^{J} W_{Aj}!}\prod^{J}_{j=1}\mu^{W_{Aj}}_{\text{Democracy+Economy},j}\\ &= \frac{(4+3+2+1+1+0)!}{4! \cdot 3! \cdot 2! \cdot 1! \cdot 1! \cdot 0!} \cdot \left[\frac{0.4+0.02}{2}\right]^4 \cdot 0.155^3 \cdot 0.135^2 \cdot 0.16^1 \cdot 0.185^1 \cdot 0.155^0\\ &= \frac{11!}{288} \cdot 0.21^4 \cdot 0.155^3 \cdot 0.135^2 \cdot 0.16 \cdot 0.185\\ &= 138600 \cdot 3.91 \times 10^{-9}\\ &\approx 0.00054 \end{aligned} \end{equation} \]

What is a document?

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

| Doc | democracy | elections | government | economy | growth | market |
|-----|-----------|-----------|------------|---------|--------|--------|
| A   | 4         | 3         | 2          | 1       | 1      | 0      |
| B   | 1         | 1         | 1          | 3       | 4      | 3      |

What is the probability of observing Document B’s word counts under an equal (50/50) mixture of the Democracy and Economy topics?

\[ \begin{equation} \begin{aligned} P(W_B|\mu_{\text{Democracy+Economy}}) &= \frac{M_{B}!}{\prod_{j=1}^{J} W_{Bj}!}\prod^{J}_{j=1}\mu^{W_{Bj}}_{\text{Democracy+Economy},j}\\ &= \frac{(1+1+1+3+4+3)!}{1! \cdot 1! \cdot 1! \cdot 3! \cdot 4! \cdot 3!} \cdot 0.21^1 \cdot 0.155^1 \cdot 0.135^1 \cdot 0.16^3 \cdot 0.185^4 \cdot 0.155^3\\ &= \frac{13!}{864} \cdot 7.85 \times 10^{-11}\\ &= 7207200 \cdot 7.85 \times 10^{-11}\\ &\approx 0.00057 \end{aligned} \end{equation} \]
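
To make the mixture idea concrete, the sketch below evaluates each document under a grid of mixture weights and reports the Democracy share that maximizes the likelihood (a minimal sketch using SciPy; the grid search stands in for the proper estimation a topic model performs):

```python
import numpy as np
from scipy.stats import multinomial

mu_democracy = np.array([0.4, 0.3, 0.25, 0.02, 0.02, 0.01])
mu_economy   = np.array([0.02, 0.01, 0.02, 0.3, 0.35, 0.3])
docs = {"A": np.array([4, 3, 2, 1, 1, 0]),
        "B": np.array([1, 1, 1, 3, 4, 3])}

for name, counts in docs.items():
    best_theta, best_lik = None, -1.0
    # theta = share of the Democracy topic; 1 - theta = share of the Economy topic
    for theta in np.linspace(0, 1, 101):
        mu_mix = theta * mu_democracy + (1 - theta) * mu_economy
        lik = multinomial.pmf(counts, n=counts.sum(), p=mu_mix)
        if lik > best_lik:
            best_theta, best_lik = theta, lik
    print(f"Doc {name}: best Democracy share = {best_theta:.2f}, likelihood = {best_lik:.6f}")
```

Document A’s likelihood is maximized by a mixture dominated by the Democracy topic, while Document B’s is maximized by a mixture dominated by the Economy topic.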

What is a document?

Overall, documents may be better described as mixtures of different topics than by one topic alone. For example, Document B is more likely under the 50/50 mixture (≈ 0.00057) than under either the Democracy or the Economy topic on its own.

A topic model estimates two sets of probabilities:

  • probability of observing each word for each topic
  • probability of observing each topic in each document

These quantities can then be used to organise documents by topic, assess how topics vary across documents, etc.

Latent Dirichlet Allocation (LDA)

LDA is a probabilistic language model.

Each document \(d\) in the corpus is generated as follows:

  • A set of \(K\) topics exists before the data
    • Each topic \(k\) is a probability distribution over words (\(\beta\))
    • \(\beta\) tells us how likely each word in the vocabulary is to belong to a given topic.
  • A specific mix of those topics is randomly extracted to generate a document
    • this mix is a specific probability distribution over topics (\(\theta\))
    • \(\theta\) tells us the proportion of each topic present in a given document. Unlike \(\beta\), which focuses on words, \(\theta\) focuses on how much of each topic is represented in a document.
  • Each word in a document is generated by:
    • First, choosing a topic \(k\) at random from the probability distribution over topics \(\theta\)
    • Then, choosing a word \(w\) at random from that topic’s probability distribution over words (\(\beta_k\))

The goal of LDA is to estimate the hidden parameters (\(\beta\) and \(\theta\)) starting from the observed words \(w\).
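
The generative story above can be written out directly in a few lines of NumPy (a sketch with made-up \(\alpha\), \(\beta\), and vocabulary; real LDA estimates \(\beta\) and \(\theta\) rather than assuming them):

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["democracy", "elections", "government", "economy", "growth", "market"]

# Assumed topic-word distributions beta (K x J), one row per topic
beta = np.array([[0.4, 0.3, 0.25, 0.02, 0.02, 0.01],   # "Democracy" topic
                 [0.02, 0.01, 0.02, 0.3, 0.35, 0.3]])  # "Economy" topic
alpha = np.array([0.5, 0.5])  # Dirichlet prior over topic proportions

def generate_document(n_words=15):
    theta = rng.dirichlet(alpha)            # document-specific topic mixture
    words = []
    for _ in range(n_words):
        k = rng.choice(len(beta), p=theta)  # 1) pick a topic from theta
        words.append(vocab[rng.choice(len(vocab), p=beta[k])])  # 2) pick a word from beta_k
    return theta, words

theta, words = generate_document()
print(np.round(theta, 2), words)
```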

Latent Dirichlet Allocation (LDA)

  • The researcher picks a number of topics, \(K\)
  • Each topic (\(k\)) is a distribution over words
  • Each document (\(d\)) is a mixture of corpus-wide topics
  • Each word (\(w\)) is drawn from one of the topics

Latent Dirichlet Allocation (LDA)

Estimation of the LDA model is done in a Bayesian framework

We use Bayes’ rule to update these prior distributions to obtain a posterior distribution for each \(\theta_d\) and \(\beta_k\)

The means of these posterior distributions are what statistical packages report, and they are what we use to investigate \(\theta_d\) and \(\beta_k\)

LDA has two goals:

  • for each document, allocate its words to as few topics as possible (\(\alpha\))
  • for each topic, assign high probability to as few terms as possible (\(\eta\))
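
In gensim’s LdaModel, these two priors correspond to the alpha (document-topic) and eta (topic-word) arguments. A sketch of how they could be set, assuming a corpus and dictionary built as in the worked example later in these slides (passing "auto" asks gensim to learn the priors from the data):

```python
from gensim.models.ldamodel import LdaModel

# Lower alpha -> each document concentrates on fewer topics;
# lower eta   -> each topic concentrates on fewer terms.
lda_sparse = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20,
                      alpha=0.1, eta=0.01, random_state=42)

# Alternatively, let gensim estimate the priors from the data
lda_auto = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20,
                    alpha="auto", eta="auto", random_state=42)
```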

Latent Dirichlet Allocation (LDA)

Imagine that we have the following parameters: \(D = 1,000\) documents, \(J = 10,000\) words, and \(K = 3\) topics

\[\begin{equation} \theta = \underbrace{\begin{pmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3}\\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3}\\ ... & ... & ...\\ \theta_{D,1} & \theta_{D,2} & \theta_{D,3}\\ \end{pmatrix}}_{D\times K} = \underbrace{\begin{pmatrix} 0.7 & 0.2 & 0.1\\ 0.1 & 0.8 & 0.1\\ ... & ... & ...\\ 0.3 & 0.3 & 0.4\\ \end{pmatrix}}_{1000 \times 3} \end{equation}\]

\[\begin{equation} \beta = \underbrace{\begin{pmatrix} \beta_{1,1} & \beta_{1,2} & ... & \beta_{1,J}\\ \beta_{2,1} & \beta_{2,2} & ... & \beta_{2,J}\\ \beta_{3,1} & \beta_{3,2} & ... & \beta_{3,J}\\ \end{pmatrix}}_{K\times J} = \underbrace{\begin{pmatrix} 0.04 & 0.0001 & ... & 0.003\\ 0.0004 & 0.001 & ... & 0.00005\\ 0.002 & 0.0003 & ... & 0.0008\\ \end{pmatrix}}_{3 \times 10,000} \end{equation}\]
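
A quick NumPy illustration of these dimensions (random values drawn from a Dirichlet just to show the shapes; every row of \(\theta\) and of \(\beta\) sums to one):

```python
import numpy as np

rng = np.random.default_rng(0)
D, J, K = 1000, 10000, 3

theta = rng.dirichlet(np.ones(K), size=D)   # D x K: topic mixture per document
beta  = rng.dirichlet(np.ones(J), size=K)   # K x J: word distribution per topic

print(theta.shape, beta.shape)                   # (1000, 3) (3, 10000)
print(theta.sum(axis=1)[:3], beta.sum(axis=1))   # each row sums to 1
```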

LDA Example

We have the following parameters:

Python
import pandas as pd
import numpy as np
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
# Count the non-null values in the "body" column
speech_count = aggression_texts["body"].notnull().sum()
# Print the result
print(f"Number of speeches: {speech_count}")
Number of speeches: 28645
Python
# Combine all text in the "body" column into a single string
all_text = " ".join(aggression_texts["body"].dropna())
# Tokenize the text into words (splitting by whitespace is a simple tokenizer)
all_words = all_text.split()
# Count the total number of words
total_word_count = len(all_words)
print(f"Total number of words: {total_word_count}")
Total number of words: 4576262

LDA Example

We have the following parameters:

Python
# Combine all text in the "body" column into a single string
all_text = " ".join(aggression_texts["body"].dropna())
# Tokenize the text into words (splitting by whitespace is a simple tokenizer)
all_words = all_text.split()
# Convert the list of words to a set to find unique words
unique_words = set(all_words)
# Count the number of unique words
unique_word_count = len(unique_words)
print(f"Number of unique words: {unique_word_count}")
Number of unique words: 99222

LDA Example Application

Step 1: Load the relevant libraries

Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models.ldamodel import LdaModel
from gensim import corpora
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import numpy as np

# Load your data (assuming 'pmq' DataFrame has a column "body")
pmq = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")

LDA Example Application

Step 2: Pre-processing text

We need to preprocess the text to remove noise like punctuation and stopwords, which can skew our topic analysis.

Python
# Preprocess the text
def preprocess(text):
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))
    tokens = text.lower().split()  # Simple tokenization
    tokens = [word for word in tokens if word.isalnum()]  # Keep only purely alphanumeric tokens (tokens with attached punctuation are dropped)
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    tokens = [stemmer.stem(word) for word in tokens]  # Stem words
    return tokens

pmq["processed"] = pmq["body"].apply(preprocess)
pmq["speech_id"] = range(1, len(pmq) + 1)  # Add row numbers starting from 1

LDA Example Application

Step 3: Creating a dictionary and corpus for LDA

Python
# Create a dictionary and corpus for LDA
dictionary = corpora.Dictionary(pmq["processed"])
dictionary.filter_extremes(no_below=10, no_above=0.5)  # Keep terms that appear in at least 10 documents and in at most 50% of documents
corpus = [dictionary.doc2bow(text) for text in pmq["processed"]]

# Train LDA model
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, random_state=42)

LDA Example Application

Step 4: Extract topics and terms with top probabilities (equivalent to beta)

Python
top_terms = []
for topic_id in range(lda.num_topics):
    top_terms += [(topic_id, term, prob) for term, prob in lda.show_topic(topic_id, topn=10)]

# Convert top terms to a DataFrame
top_terms_df = pd.DataFrame(top_terms, columns=["topic", "term", "beta"])

LDA Example Application

We can then re-rank the terms within each topic using the following term score:

\[ \text{term-score}_{k,v} = \hat{\beta}_{k,v}\log\left(\frac{\hat{\beta}_{k,v}}{(\prod_{j=1}^{K}\hat{\beta}_{j,v})^{\frac{1}{K}}}\right) \]

The first \(\hat{\beta}_{k,v}\) is the probability of term \(v\) in topic \(k\) and is akin to the term frequency.

The second term down-weights terms that have high probability under all topics.

This formulation is similar to the TF-IDF (Term Frequency-Inverse Document Frequency) term score.
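
A sketch of this term score computed from the fitted model (it relies on gensim’s get_topics(), which returns the full topic-term matrix, and on the lda and dictionary objects from the earlier steps):

```python
import numpy as np

beta_matrix = lda.get_topics()          # shape: (num_topics, vocabulary_size)

# Geometric mean of each term's probability across topics: (prod_j beta_{j,v})^(1/K)
log_beta = np.log(beta_matrix + 1e-12)  # small constant to avoid log(0)
geo_mean = np.exp(log_beta.mean(axis=0))

# term-score_{k,v} = beta_{k,v} * log(beta_{k,v} / geometric mean over topics)
term_score = beta_matrix * np.log((beta_matrix + 1e-12) / geo_mean)

# Top 10 terms for topic 0 ranked by term score instead of raw probability
top_idx = np.argsort(-term_score[0])[:10]
print([dictionary[int(i)] for i in top_idx])
```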

Showing the LDA

See code
Python
# Step 1: Rank beta within each topic group (descending order)
top_terms_df['rank'] = top_terms_df.groupby('topic')['beta'] \
                             .rank(method='min', ascending=False)

# Step 2: Keep topics 0-10 (gensim numbers topics from 0)
result_df = top_terms_df[top_terms_df['topic'] <= 10]
See code
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
result_df <- reticulate::py$result_df

#Creating a graph with the topics
topics_lda_keys<-ggplot(result_df, aes(y = reorder(topic, -topic), x = as.numeric(rank))) +   
  geom_tile(aes(fill = beta))+
  scale_fill_viridis_c()+
  geom_label(aes(y=reorder(topic, -topic), x=rank, label=term), fill="white", size=3)+
  scale_x_continuous(breaks = seq(min(result_df$rank), max(result_df$rank))) +  # Ensure all ranks are shown
  ylab("Topic")+ xlab("Top 10 words")+
  theme(legend.position = "bottom")
topics_lda_keys

Top Document by Topic

We can also identify the top documents by topic

Python
# Get topic distributions for each document
document_topics = []

for doc_id, doc_bow in enumerate(corpus):
    doc_topics = lda.get_document_topics(doc_bow, minimum_probability=0)  # Get probabilities for all topics
    sorted_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)  # Sort topics by probability
    top_5_topics = sorted_topics[:5]  # Take the top 5 topics
    for topic_id, prob in top_5_topics:
        # Extract top terms for the topic
        terms = ", ".join([term for term, _ in lda.show_topic(topic_id, topn=5)])  # Top 5 terms as a string
        document_topics.append((doc_id + 1, topic_id, prob, terms))  # Add doc_id, topic_id, probability, and terms

# Convert to a DataFrame
document_topics_df = pd.DataFrame(document_topics, columns=["speech_id", "topic", "percentage", "terms"])

# Merge top 5 topics with the original DataFrame
pmq_with_topics = pmq.merge(document_topics_df, on="speech_id", how="left")

Top Topics in Boris Johnson

Python
pmq_boris = pmq_with_topics[pmq_with_topics["name"] == "Boris Johnson"]
See code
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
result_df <- reticulate::py$pmq_boris

# Add small increments only to duplicates in `percentage`
reshaped_df2 <- result_df %>%
  group_by(speech_id) %>%
  mutate(
    percentage = if_else(
      duplicated(percentage) | duplicated(percentage, fromLast = TRUE),  # Identify duplicates
      percentage + row_number() * 1e-6,  # Add small increments to duplicates
      percentage  # Keep original value for non-duplicates
    )
  )

reshaped_df2<-reshaped_df2%>%
  group_by(speech_id)%>%
  mutate(rank = rank(-percentage, ties.method = "min"))

#Creating a graph with the topics
topics_lda_keys<-ggplot(reshaped_df2, aes(y = reorder(speech_id, -speech_id), x = as.numeric(rank))) +   
  geom_tile(aes(fill = percentage))+
  scale_fill_viridis_c()+
  geom_label(aes(y = reorder(speech_id, -speech_id), x=as.numeric(rank), label=topic), fill="white", size=3)+
  scale_x_continuous(breaks = seq(min(reshaped_df2$rank), max(reshaped_df2$rank))) +  # Ensure all ranks are shown
  ylab("Speech")+ xlab("Top 5 Topics")+
  theme(legend.position = "bottom")
topics_lda_keys

Top Topics in Boris Johnson

Python
pmq_boris = pmq_with_topics[pmq_with_topics["name"] == "Boris Johnson"]
See code
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
result_df <- reticulate::py$pmq_boris

# Add small increments only to duplicates in `percentage`
reshaped_df2 <- result_df %>%
  group_by(speech_id) %>%
  mutate(
    percentage = if_else(
      duplicated(percentage) | duplicated(percentage, fromLast = TRUE),  # Identify duplicates
      percentage + row_number() * 1e-6,  # Add small increments to duplicates
      percentage  # Keep original value for non-duplicates
    )
  )

reshaped_df2<-reshaped_df2%>%
  group_by(speech_id)%>%
  mutate(rank = rank(-percentage, ties.method = "min"))

#reshaped_df3 <- reshaped_df2 %>%
#  group_by(speech_id) %>%
#  mutate(percentage = if_else(duplicated(percentage), NA_real_, percentage),
#         terms = if_else(duplicated(terms), NA_character_, terms))

reshaped_df2$terms <- gsub(
  pattern = "^(([^,]*,){2}[^,]*),",  # Match up to the third comma
  replacement = "\\1\n",            # Replace the third comma with a newline
  x = reshaped_df2$terms
)

#Creating a graph with the topics
topics_lda_keys<-ggplot(reshaped_df2, aes(y = reorder(speech_id, -speech_id), x = as.numeric(rank))) +   
  geom_tile(aes(fill = percentage))+
  scale_fill_viridis_c()+
  geom_label(aes(y = reorder(speech_id, -speech_id), x=as.numeric(rank), label=terms), fill="white", size=2)+
  scale_x_continuous(breaks = seq(min(reshaped_df2$rank), max(reshaped_df2$rank))) +  # Ensure all ranks are shown
  ylab("Speech")+ xlab("Top 5 Topics")+
  theme(legend.position = "bottom")
topics_lda_keys

Top Topics in Boris Johnson

Let us look at speech 3900.

Python
pmq_boris_chosen = pmq_with_topics[(pmq_with_topics["name"] == "Boris Johnson") & (pmq_with_topics["speech_id"] == 3900)]

Johnson’s speech 3900 features the following top 5 topics: 15, 0, 17, 13, and 6.

The topic with the highest score (0.3026) is topic 15: industri, unit, british, trade, uk.

This speech reads as follows:

Johnson’s speech - 3900

I hesitate for an age before correcting you, Mr Speaker, but it was a serious discussion of the advancement of free trade. The subject of free trade in the African Union, which my honourable Friend raises, is a very good one. The only advice I would give to the African Union is not to acquire a parliament, a court or a single currency.

Some Considerations about using LDA

Advantages

  • Unsupervised learning: No labeled data required; useful for exploring large corpora.
  • Interpretable topics: Topics are distributions over words, making them human-readable.
  • Scalable: Efficient algorithms for large datasets.
  • Flexibility: Can model diverse datasets and adapt to various domains.

Disadvantages

  • Bag-of-words assumption: Ignores word order and context, losing semantic nuance.
  • Sensitive to preprocessing: Requires careful tokenization, stopword removal, and lemmatization.
  • Fixed number of topics: Must predefine the number of topics, which may not align with the data.
  • Performance limitations: Struggles with short texts and highly overlapping topics.
  • Hyperparameter tuning: Requires tuning (e.g., α, η) to improve coherence and quality.

Validating LDA

  • LDA requires us to make important decisions
  • K, the number of topics, is something the researcher has to choose
  • How do we select K?

Held-out likelihood

  • We can ask the model which words are likely to appear in a document
  • We can split texts in half, train a topic model on one half, and calculate the held-out likelihood for the other half

Semantic Coherence

  • Do the most common words from a topic also co-occur together frequently in the same documents?

Semantic Coherence

Coherence: Evaluates the interpretability of topics, with higher values indicating more semantically meaningful topics.

  • Higher coherence scores indicate better topics in terms of human interpretability.
  • Lower scores indicate worse topics in terms of human interpretability.

Semantic Coherence

This is how it works:

Calculating Coherence Score
Python
#DON'T RUN: it takes a few mins to run
from gensim.models.coherencemodel import CoherenceModel

# Initialize an empty list to store coherence scores
coherence_scores = []

# Iterate over the range of topics in increments of 5
for num_topics in range(2, 21, 3):  # From 2 to 20 topics, step = 3
    print(f"Training LDA model for {num_topics} topics...")
    
    # Train LDA model
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=42)
    
    # Calculate coherence
    coherence_model = CoherenceModel(model=lda, texts=pmq["processed"], dictionary=dictionary, coherence="c_v")
    coherence_score = coherence_model.get_coherence()
    
    # Append results as a tuple
    coherence_scores.append((num_topics, coherence_score))
    print(f"Number of topics: {num_topics}, Coherence Score: {coherence_score}")

# Convert coherence scores into a DataFrame
coherence_df = pd.DataFrame(coherence_scores, columns=["num_topics", "coherence"])
Manually Creating the DF
Python
#Because the previous code takes a few mins to run, I am recreating its output
data = [
     {"num_topics": 2, "coherence": 0.299716},
     {"num_topics": 5, "coherence": 0.312965},
     {"num_topics": 8, "coherence": 0.342897},
     {"num_topics": 11, "coherence": 0.362278},
     {"num_topics": 14, "coherence": 0.374748},
     {"num_topics": 17, "coherence": 0.364560},
     {"num_topics": 20, "coherence": 0.359184},
 ]

# Create a DataFrame
coherence_df = pd.DataFrame(data)
See code
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
coherence_df <- reticulate::py$coherence_df

ggplot(data=coherence_df)+
  geom_line(aes(x=num_topics, y=coherence))+
  scale_x_continuous(breaks = seq(min(coherence_df$num_topics), max(coherence_df$num_topics), by= 5)) +  # Show a break every 5 topics
  theme_bw()+
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

Held-out Likelihood

Held-Out Likelihood: Measures how well the model fits unseen data. Lower values (more negative) indicate worse performance.

  • This metric evaluates the model’s ability to generalize to unseen data.
  • The goal is to maximize the held-out likelihood (i.e., make it less negative).
  • Optimizing only this metric can lead to overfitting, where the model captures noise in the data instead of meaningful patterns.

Held-out Likelihood

This is how it works:

Calculating Held-out Likelihood
Python
#DON'T RUN: it takes a few mins to run
# Held-out likelihood
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
train_texts, test_texts = train_test_split(pmq["processed"], test_size=0.2, random_state=42)

# Create a dictionary and corpus for the training set
train_dictionary = corpora.Dictionary(train_texts)
train_dictionary.filter_extremes(no_below=5)
train_corpus = [train_dictionary.doc2bow(text) for text in train_texts]

# Create a corpus for the validation set using the same dictionary
test_corpus = [train_dictionary.doc2bow(text) for text in test_texts]


held_out_scores = []

for num_topics in range(2, 21, 3):  # From 2 to 20 topics, step = 3
    print(f"Training LDA model for {num_topics} topics...")
    
    # Train LDA model on the training corpus
    lda = LdaModel(corpus=train_corpus, id2word=train_dictionary, num_topics=num_topics, random_state=42)
    
    # Calculate held-out likelihood
    held_out_likelihood = lda.log_perplexity(test_corpus)
    held_out_scores.append((num_topics, held_out_likelihood))
    
    print(f"Number of topics: {num_topics}, Held-Out Likelihood: {held_out_likelihood}")

# Convert the scores into a DataFrame
held_out_df = pd.DataFrame(held_out_scores, columns=["num_topics", "held_out_likelihood"])
Manually Creating the DF
Python
#Because the previous code takes a few mins to run, I am recreating its output
held_out_df = [
     {"num_topics": 2, "held_out_likelihood": -7.489501},
     {"num_topics": 5, "held_out_likelihood": -7.566638},
     {"num_topics": 8, "held_out_likelihood": -7.646296},
     {"num_topics": 11, "held_out_likelihood": -7.844623},
     {"num_topics": 14, "held_out_likelihood": -7.976968},
     {"num_topics": 17, "held_out_likelihood": -8.071734},
     {"num_topics": 20, "held_out_likelihood": -8.171345}
]

# Create a DataFrame
df_likelihood = pd.DataFrame(held_out_df)
See code
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
held_out_df <- reticulate::py$df_likelihood

ggplot(data=held_out_df)+
  geom_line(aes(x=num_topics, y=held_out_likelihood))+
  scale_x_continuous(breaks = seq(min(held_out_df$num_topics), max(held_out_df$num_topics), by= 5)) +  # Show a break every 5 topics
  theme_bw()+
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

Reconciling Coherence and Held-out Likelihood

  • Held-out Likelihood measures generalizability to unseen documents (predictive power) - best at 2 topics
  • Coherence measures semantic interpretability of topics - best at 14 topics

So:

Fewer topics (e.g., 2) generalize well to new data but are often too coarse to be interpretable.

More topics (e.g., 14) capture more meaningful distinctions between themes in the corpus but may overfit slightly, harming generalization.

We could opt for the number of topics that scores best on an average of the two metrics, balancing generalization and interpretability; a sketch of this follows.
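
One simple way to operationalize this is to min-max normalize each metric across the candidate values of K and average them (a minimal sketch assuming the coherence_df and df_likelihood data frames created above):

```python
import pandas as pd

# Merge the two diagnostics on the number of topics
combined = coherence_df.merge(df_likelihood, on="num_topics")

# Min-max normalize each metric so that higher = better on a 0-1 scale
def minmax(s):
    return (s - s.min()) / (s.max() - s.min())

combined["coherence_norm"] = minmax(combined["coherence"])
combined["held_out_norm"] = minmax(combined["held_out_likelihood"])  # less negative = better

# Average of the two normalized metrics; the best K is the row with the highest score
combined["avg_score"] = (combined["coherence_norm"] + combined["held_out_norm"]) / 2
print(combined.sort_values("avg_score", ascending=False).head())
```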

Exclusivity

Average Exclusivity: Reflects how distinct the topics are from one another. Higher values indicate less overlap in the terms associated with topics.

  • There is typically a trade-off between coherence and exclusivity
  • As coherence increases (topics are more interpretable)
  • Exclusivity may decrease (topics overlap more).

Exclusivity

This is how it works:

Calculating Exclusivity
Python
##DON'T RUN: it takes a few mins to run
from collections import defaultdict

exclusivity_scores = []

for num_topics in range(2, 21, 3):  # From 2 to 20 topics, step = 3
    print(f"Training LDA model for {num_topics} topics...")
    
    # Train LDA model
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=42)
    
    # Get the term-topic matrix
    term_topic_matrix = defaultdict(list)
    for topic_id in range(lda.num_topics):
        terms = lda.show_topic(topic_id, topn=20)  # Top 20 terms
        for term, prob in terms:
            term_topic_matrix[term].append(prob)
    
    # Compute exclusivity
    topic_exclusivity = []
    for topic_id in range(lda.num_topics):
        topic_terms = lda.show_topic(topic_id, topn=20)
        exclusivity = 0
        for term, prob in topic_terms:
            # Exclusivity = term probability in current topic divided by sum of its probabilities across all topics
            term_sum_prob = sum(term_topic_matrix[term])
            exclusivity += prob / term_sum_prob if term_sum_prob > 0 else 0
        topic_exclusivity.append(exclusivity / len(topic_terms))  # Normalize by number of terms in topic
    
    # Average exclusivity across topics for the model
    avg_exclusivity = sum(topic_exclusivity) / len(topic_exclusivity)
    exclusivity_scores.append((num_topics, avg_exclusivity))
    print(f"Number of topics: {num_topics}, Avg Exclusivity: {avg_exclusivity}")

# Convert to DataFrame
exclusivity_df = pd.DataFrame(exclusivity_scores, columns=["num_topics", "exclusivity"])

# Save to CSV (optional)
#exclusivity_df.to_csv("exclusivity_scores.csv", index=False)

# Display the DataFrame
#exclusivity_df
Manually Creating the DF
Python
#Because the previous code takes a few mins to run, I am recreating its output
exclusivity_df = [
     {"num_topics": 2, "avg_exclusivity": 0.725},
     {"num_topics": 5, "avg_exclusivity": 0.6},
     {"num_topics": 8, "avg_exclusivity": 0.5625},
     {"num_topics": 11, "avg_exclusivity": 0.5636363636363636},
     {"num_topics": 14, "avg_exclusivity": 0.5785714285714286},
     {"num_topics": 17, "avg_exclusivity": 0.5470588235294117},
 ]
exclusivity_df = pd.DataFrame(exclusivity_df)
See code
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
exclusivity_df <- reticulate::py$exclusivity_df

ggplot(data=exclusivity_df)+
  geom_line(aes(x=num_topics, y=avg_exclusivity))+
  scale_x_continuous(breaks = seq(min(exclusivity_df$num_topics), max(exclusivity_df$num_topics), by= 5)) +  # Show a break every 5 topics
  theme_bw()+
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

Reflection

In our case, 14 appears to be the number of topics with the highest coherence while still maintaining reasonably good exclusivity.

By combining these metrics with domain-specific insights, we can make an informed choice about the best number of topics for our analysis.

The Trade-Off: More Coherence → Less Exclusivity

  • Topics overlap as similar words appear in multiple themes.
  • Example: “Growth” might appear in both “Economy” and “Healthcare” topics.

The Trade-Off: More Exclusivity → Less Coherence

  • Topics become overly specific and less interpretable.
  • Example: A topic with rare words like “NASDAQ” and “dividends” may not clearly reflect “Economy.”

Conclusion

  • Topic models help uncover key themes across large text corpora by analyzing word distributions.

  • Documents are represented as mixtures of overarching topics, with topics defined as probabilistic distributions of words.

  • Strengths: Effective for initial exploration of textual data, requiring little upfront work.

  • Limitations: Results demand thorough interpretation and rigorous validation for meaningful use.

  • We still need to pay attention to interpreting and validating our results.