L21: Unsupervised Learning: Topic Models

Bogdan G. Popescu

John Cabot University

Introduction

Topic Models are unsupervised learning models.

They allow us to cluster similar documents in a corpus together.

Code
```{mermaid}
%%| code-fold: true
%%| code-summary: "See code"
flowchart TD
  A[Do we know the categories of the document?   ] --> B(Yes)
  A[Do we know the categories of the document?   ] --> C(No)
  C(No) --> D[Topic Models]
  B(Yes) --> E[Do you know the rule for placing documents in categories?   ]
  E[Do you know the rule for placing documents in categories?   ] --> F[Yes] --> I[Dictionaries]
  E[Do you know the rule for placing documents in categories?   ] --> G[No] --> J[Supervised Learning]
```

Topic Models

Topic Models are an automatic procedure for us to discover the main themes in an unstructured corpus.

There is no requirement for a training set or for labelling of text before estimation

Overall, topic models allow us to organize, understand, and summarize large corpora of data.

Latent Dirichlet Allocation (LDA) has been a common approach

  • it is easy to implement
  • it performs well on small datasets

However:

  • on large datasets with nuanced topics, LDA might struggle to capture the complexity of the text
  • when dealing with sparse data, LDA might struggle
  • alternatives include embedding-based approaches such as BERT

Topic Models as Language Models

A language model is a probability distribution over words.

For example, in the Naive Bayes model, we tried to estimate a probability distribution for each category of interest.

Each document was assigned to a single category.

For topic models, we estimate probability distributions for each topic.

  • each document can belong to multiple topics

What is a topic?

A “topic” is a probability distribution over a fixed word vocabulary.

Consider a vocabulary: democracy, elections, government, economy, growth, market.

When discussing democracy:

  • Frequently use the words democracy, elections, and government.
  • Infrequently use the words economy, growth, and market.

When discussing the economy:

  • Frequently use the words economy, growth, and market.
  • Infrequently use the words democracy, elections, and government.

We now compute the topic word probabilities to: 1) understand the document content; 2) identify patterns across a corpus; 3) facilitate human interpretation

Topic Word Probabilities Table

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

What is a document?

In a topic model, each document is described as being composed of a mixture of corpus-wide topics

For each document, we find the topic proportions that maximize the probability that we would observe the words in that particular document.

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

| Doc | democracy | elections | government | economy | growth | market |
|-----|-----------|-----------|------------|---------|--------|--------|
| A   | 4         | 3         | 2          | 1       | 1      | 0      |
| B   | 1         | 1         | 1          | 3       | 4      | 3      |

What is a document?

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

| Doc | democracy | elections | government | economy | growth | market |
|-----|-----------|-----------|------------|---------|--------|--------|
| A   | 4         | 3         | 2          | 1       | 1      | 0      |
| B   | 1         | 1         | 1          | 3       | 4      | 3      |

What is the probability of observing Document A’s word counts under the “Democracy” topic?

The following equation shows how likely a document’s word counts are under a specific topic, given that topic’s word probabilities (a multinomial likelihood). \[ \begin{equation} \begin{aligned} P(W_A|\mu_{\text{Democracy}}) &= \frac{M_{A}!}{\prod_{j=1}^{J} W_{Aj}!}\prod^{J}_{j=1}\mu^{W_{Aj}}_{\text{Democracy},j}\\ &= \frac{(4+3+2+1+1+0)!}{4! \cdot 3! \cdot 2! \cdot 1! \cdot 1! \cdot 0!} \cdot 0.4^4 \cdot 0.3^3 \cdot 0.25^2 \cdot 0.02^1 \cdot 0.02^1 \cdot 0.01^0\\ &= \frac{11!}{288} \cdot 0.00000001728\\ &= \frac{39916800}{288} \cdot 0.00000001728\\ &= 138600 \cdot 0.00000001728\\ &\approx 0.0024 \end{aligned} \end{equation} \]
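
As a quick sanity check, the same multinomial likelihood can be computed with SciPy (a minimal sketch; SciPy is not otherwise used in these slides):

```python
from scipy.stats import multinomial

# Topic-word probabilities for the "Democracy" topic (from the table above)
mu_democracy = [0.4, 0.3, 0.25, 0.02, 0.02, 0.01]

# Document A's counts for: democracy, elections, government, economy, growth, market
doc_a = [4, 3, 2, 1, 1, 0]

# P(W_A | mu_Democracy) -- prints roughly 0.0024
print(multinomial.pmf(doc_a, n=sum(doc_a), p=mu_democracy))
```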

What is a document?

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

| Doc | democracy | elections | government | economy | growth | market |
|-----|-----------|-----------|------------|---------|--------|--------|
| A   | 4         | 3         | 2          | 1       | 1      | 0      |
| B   | 1         | 1         | 1          | 3       | 4      | 3      |

What is the probability of observing Document A’s word counts under the “Economy” topic?

\[ \begin{equation} \begin{aligned} P(W_A|\mu_{\text{Economy}}) &= \frac{M_{A}!}{\prod_{j=1}^{J} W_{Aj}!}\prod^{J}_{j=1}\mu^{W_{Aj}}_{\text{Economy},j}\\ &= \frac{(4+3+2+1+1+0)!}{4! \cdot 3! \cdot 2! \cdot 1! \cdot 1! \cdot 0!} \cdot 0.02^4 \cdot 0.01^3 \cdot 0.02^2 \cdot 0.3^1 \cdot 0.35^1 \cdot 0.3^0\\ &= \frac{11!}{288} \cdot 6.72 \times 10^{-18}\\ &= 138600 \cdot 6.72 \times 10^{-18}\\ &\approx 9.31 \times 10^{-13} \end{aligned} \end{equation} \]

What is a document?

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

| Doc | democracy | elections | government | economy | growth | market |
|-----|-----------|-----------|------------|---------|--------|--------|
| A   | 4         | 3         | 2          | 1       | 1      | 0      |
| B   | 1         | 1         | 1          | 3       | 4      | 3      |

What is the probability of observing Document B’s word counts under the “Democracy” topic?

\[ \begin{equation} \begin{aligned} P(W_B|\mu_{\text{Democracy}}) &= \frac{M_{B}!}{\prod_{j=1}^{J} W_{Bj}!}\prod^{J}_{j=1}\mu^{W_{Bj}}_{\text{Democracy},j}\\ &= \frac{(1+1+1+3+4+3)!}{1! \cdot 1! \cdot 1! \cdot 3! \cdot 4! \cdot 3!} \cdot 0.4^1 \cdot 0.3^1 \cdot 0.25^1 \cdot 0.02^3 \cdot 0.02^4 \cdot 0.01^3\\ &= \frac{13!}{864} \cdot 3.84 \times 10^{-20}\\ &= \frac{6227020800}{864} \cdot 3.84 \times 10^{-20}\\ &= 7207200 \cdot 3.84 \times 10^{-20}\\ &\approx 2.77 \times 10^{-13} \end{aligned} \end{equation} \]

What is a document?

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

| Doc | democracy | elections | government | economy | growth | market |
|-----|-----------|-----------|------------|---------|--------|--------|
| A   | 4         | 3         | 2          | 1       | 1      | 0      |
| B   | 1         | 1         | 1          | 3       | 4      | 3      |

What is the probability of observing Document B’s word counts under the “Economy” topic?

\[ \begin{equation} \begin{aligned} P(W_B|\mu_{\text{Economy}}) &= \frac{M_{B}!}{\prod_{j=1}^{J} W_{Bj}!}\prod^{J}_{j=1}\mu^{W_{Bj}}_{\text{Economy},j}\\ &= \frac{(1+1+1+3+4+3)!}{1! \cdot 1! \cdot 1! \cdot 3! \cdot 4! \cdot 3!} \cdot 0.02^1 \cdot 0.01^1 \cdot 0.02^1 \cdot 0.3^3 \cdot 0.35^4 \cdot 0.3^3\\ &= \frac{13!}{864} \cdot 4.3758 \times 10^{-11}\\ &= 7207200 \cdot 4.3758 \times 10^{-11}\\ &\approx 0.00032 \end{aligned} \end{equation} \]

What is a document?

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

| Doc | democracy | elections | government | economy | growth | market |
|-----|-----------|-----------|------------|---------|--------|--------|
| A   | 4         | 3         | 2          | 1       | 1      | 0      |
| B   | 1         | 1         | 1          | 3       | 4      | 3      |

What is the probability of observing Document A’s word counts under an equal (50/50) mixture of the Democracy and Economy topics?

\[ \begin{equation} \begin{aligned} P(W_A|\mu_{\text{Democracy+Economy}}) &= \frac{M_{A}!}{\prod_{j=1}^{J} W_{Aj}!}\prod^{J}_{j=1}\mu^{W_{Aj}}_{\text{Democracy+Economy},j}\\ &= \frac{(4+3+2+1+1+0)!}{4! \cdot 3! \cdot 2! \cdot 1! \cdot 1! \cdot 0!} \cdot \left[\frac{0.4+0.02}{2}\right]^4 \cdot 0.155^3 \cdot 0.135^2 \cdot 0.16^1 \cdot 0.185^1 \cdot 0.155^0\\ &= \frac{11!}{288} \cdot 0.21^4 \cdot 0.155^3 \cdot 0.135^2 \cdot 0.16 \cdot 0.185\\ &= 138600 \cdot 3.91 \times 10^{-9}\\ &\approx 0.00054 \end{aligned} \end{equation} \]

What is a document?

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

| Doc | democracy | elections | government | economy | growth | market |
|-----|-----------|-----------|------------|---------|--------|--------|
| A   | 4         | 3         | 2          | 1       | 1      | 0      |
| B   | 1         | 1         | 1          | 3       | 4      | 3      |

What is the probability of observing Document B’s word counts under an equal (50/50) mixture of the Democracy and Economy topics?

\[ \begin{equation} \begin{aligned} P(W_B|\mu_{\text{Democracy+Economy}}) &= \frac{M_{B}!}{\prod_{j=1}^{J} W_{Bj}!}\prod^{J}_{j=1}\mu^{W_{Bj}}_{\text{Democracy+Economy},j}\\ &= \frac{(1+1+1+3+4+3)!}{1! \cdot 1! \cdot 1! \cdot 3! \cdot 4! \cdot 3!} \cdot 0.21^1 \cdot 0.155^1 \cdot 0.135^1 \cdot 0.16^3 \cdot 0.185^4 \cdot 0.155^3\\ &= \frac{13!}{864} \cdot 7.85 \times 10^{-11}\\ &= 7207200 \cdot 7.85 \times 10^{-11}\\ &\approx 0.00057 \end{aligned} \end{equation} \]
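
To make the mixture idea concrete, the sketch below evaluates each document under a grid of mixture weights and reports the Democracy share that maximizes the likelihood (a minimal sketch using SciPy; the grid search stands in for the proper estimation a topic model performs):

```python
import numpy as np
from scipy.stats import multinomial

mu_democracy = np.array([0.4, 0.3, 0.25, 0.02, 0.02, 0.01])
mu_economy   = np.array([0.02, 0.01, 0.02, 0.3, 0.35, 0.3])
docs = {"A": np.array([4, 3, 2, 1, 1, 0]),
        "B": np.array([1, 1, 1, 3, 4, 3])}

for name, counts in docs.items():
    best_theta, best_lik = None, -1.0
    # theta = share of the Democracy topic; 1 - theta = share of the Economy topic
    for theta in np.linspace(0, 1, 101):
        mu_mix = theta * mu_democracy + (1 - theta) * mu_economy
        lik = multinomial.pmf(counts, n=counts.sum(), p=mu_mix)
        if lik > best_lik:
            best_theta, best_lik = theta, lik
    print(f"Doc {name}: best Democracy share = {best_theta:.2f}, likelihood = {best_lik:.6f}")
```

Document A’s likelihood is maximized by a mixture dominated by the Democracy topic, while Document B’s is maximized by a mixture dominated by the Economy topic.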

What is a document?

Overall, documents may be better described as mixtures of different topics than by one topic alone. For example, Document B is more likely under the 50/50 mixture (≈ 0.00057) than under either the Democracy or the Economy topic on its own.

A topic model estimates two sets of probabilities:

  • probability of observing each word for each topic
  • probability of observing each topic in each document

These quantities can then be used to organise documents by topic, assess how topics vary across documents, etc.

Latent Dirichlet Allocation (LDA)

LDA is a probabilistic language model.

Each document \(d\) in the corpus is generated as follows:

  • A set of \(K\) topics exists before the data
    • Each topic \(k\) is a probability distribution over words (\(\beta\))
    • \(\beta\) tells us how likely each word in the vocabulary is to belong to a given topic.
  • A specific mix of those topics is randomly extracted to generate a document
    • this mix is a specific probability distribution over topics (\(\theta\))
    • \(\theta\) tells us the proportion of each topic present in a given document. Unlike \(\beta\), which focuses on words, \(\theta\) focuses on how much of each topic is represented in a document.
  • Each word in a document is generated by:
    • First, choosing a topic \(k\) at random from the probability distribution over topics \(\theta\)
    • Then, choosing a word \(w\) at random from that topic’s probability distribution over words (\(\beta_k\))

The goal of LDA is to estimate the hidden parameters (\(\beta\) and \(\theta\)) starting from the observed words \(w\).
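
The generative story above can be written out directly in a few lines of NumPy (a sketch with made-up \(\alpha\), \(\beta\), and vocabulary; real LDA estimates \(\beta\) and \(\theta\) rather than assuming them):

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["democracy", "elections", "government", "economy", "growth", "market"]

# Assumed topic-word distributions beta (K x J), one row per topic
beta = np.array([[0.4, 0.3, 0.25, 0.02, 0.02, 0.01],   # "Democracy" topic
                 [0.02, 0.01, 0.02, 0.3, 0.35, 0.3]])  # "Economy" topic
alpha = np.array([0.5, 0.5])  # Dirichlet prior over topic proportions

def generate_document(n_words=15):
    theta = rng.dirichlet(alpha)            # document-specific topic mixture
    words = []
    for _ in range(n_words):
        k = rng.choice(len(beta), p=theta)  # 1) pick a topic from theta
        words.append(vocab[rng.choice(len(vocab), p=beta[k])])  # 2) pick a word from beta_k
    return theta, words

theta, words = generate_document()
print(np.round(theta, 2), words)
```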

Latent Dirichlet Allocation (LDA)

  • The researcher picks a number of topics, \(K\)
  • Each topic (\(k\)) is a distribution over words
  • Each document (\(d\)) is a mixture of corpus-wide topics
  • Each word (\(w\)) is drawn from one of the topics

Latent Dirichlet Allocation (LDA)

Estimation of the LDA model is done in a Bayesian framework

We use Bayes’ rule to update these prior distributions to obtain a posterior distribution for each \(\theta_d\) and \(\beta_k\)

The means of these posterior distributions are what statistical packages report, and they are what we use to investigate \(\theta_d\) and \(\beta_k\)

LDA has two goals:

  • for each document, allocate its words to as few topics as possible (\(\alpha\))
  • for each topic, assign high probability to as few terms as possible (\(\eta\))
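
In gensim’s LdaModel, these two priors correspond to the alpha (document-topic) and eta (topic-word) arguments. A sketch of how they could be set, assuming a corpus and dictionary built as in the worked example later in these slides (passing "auto" asks gensim to learn the priors from the data):

```python
from gensim.models.ldamodel import LdaModel

# Lower alpha -> each document concentrates on fewer topics;
# lower eta   -> each topic concentrates on fewer terms.
lda_sparse = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20,
                      alpha=0.1, eta=0.01, random_state=42)

# Alternatively, let gensim estimate the priors from the data
lda_auto = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20,
                    alpha="auto", eta="auto", random_state=42)
```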

Latent Dirichlet Allocation (LDA)

Imagine that we have the following parameters: \(D = 1,000\) documents, \(J = 10,000\) words, and \(K = 3\) topics

\[\begin{equation} \theta = \underbrace{\begin{pmatrix} \theta_{1,1} & \theta_{1,2} & \theta_{1,3}\\ \theta_{2,1} & \theta_{2,2} & \theta_{2,3}\\ ... & ... & ...\\ \theta_{D,1} & \theta_{D,2} & \theta_{D,3}\\ \end{pmatrix}}_{D\times K} = \underbrace{\begin{pmatrix} 0.7 & 0.2 & 0.1\\ 0.1 & 0.8 & 0.1\\ ... & ... & ...\\ 0.3 & 0.3 & 0.4\\ \end{pmatrix}}_{1000 \times 3} \end{equation}\]

\[\begin{equation} \beta = \underbrace{\begin{pmatrix} \beta_{1,1} & \beta_{1,2} & ... & \beta_{1,J}\\ \beta_{2,1} & \beta_{2,2} & ... & \beta_{2,J}\\ \beta_{3,1} & \beta_{3,2} & ... & \beta_{3,J}\\ \end{pmatrix}}_{K\times J} = \underbrace{\begin{pmatrix} 0.04 & 0.0001 & ... & 0.003\\ 0.0004 & 0.001 & ... & 0.00005\\ 0.002 & 0.0003 & ... & 0.0008\\ \end{pmatrix}}_{3 \times 10,000} \end{equation}\]
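
A quick NumPy illustration of these dimensions (random values drawn from a Dirichlet just to show the shapes; every row of \(\theta\) and of \(\beta\) sums to one):

```python
import numpy as np

rng = np.random.default_rng(0)
D, J, K = 1000, 10000, 3

theta = rng.dirichlet(np.ones(K), size=D)   # D x K: topic mixture per document
beta  = rng.dirichlet(np.ones(J), size=K)   # K x J: word distribution per topic

print(theta.shape, beta.shape)                   # (1000, 3) (3, 10000)
print(theta.sum(axis=1)[:3], beta.sum(axis=1))   # each row sums to 1
```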

LDA Example

We have the following parameters:

Python
import pandas as pd
import numpy as np
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
# Count the non-null values in the "body" column
speech_count = aggression_texts["body"].notnull().sum()
# Print the result
print(f"Number of speeches: {speech_count}")
Number of speeches: 28645
Python
# Combine all text in the "body" column into a single string
all_text = " ".join(aggression_texts["body"].dropna())
# Tokenize the text into words (splitting by whitespace is a simple tokenizer)
all_words = all_text.split()
# Count the total number of words
total_word_count = len(all_words)
print(f"Total number of words: {total_word_count}")
Total number of words: 4576262

LDA Example

We have the following parameters:

Python
# Combine all text in the "body" column into a single string
all_text = " ".join(aggression_texts["body"].dropna())
# Tokenize the text into words (splitting by whitespace is a simple tokenizer)
all_words = all_text.split()
# Convert the list of words to a set to find unique words
unique_words = set(all_words)
# Count the number of unique words
unique_word_count = len(unique_words)
print(f"Number of unique words: {unique_word_count}")
Number of unique words: 99222

LDA Example Application

Step 1: Load the relevant libraries

Python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models.ldamodel import LdaModel
from gensim import corpora
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import numpy as np

# Load your data (assuming 'pmq' DataFrame has a column "body")
pmq = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")

LDA Example Application

Step 2: Pre-processing text

We need to preprocess the text to remove noise like punctuation and stopwords, which can skew our topic analysis.

Python
# Preprocess the text
def preprocess(text):
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))
    tokens = text.lower().split()  # Simple tokenization
    tokens = [word for word in tokens if word.isalnum()]  # Keep only purely alphanumeric tokens (tokens with attached punctuation are dropped)
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    tokens = [stemmer.stem(word) for word in tokens]  # Stem words
    return tokens

pmq["processed"] = pmq["body"].apply(preprocess)
pmq["speech_id"] = range(1, len(pmq) + 1)  # Add row numbers starting from 1

LDA Example Application

Step 3: Creating a dictionary and corpus for LDA

Python
# Create a dictionary and corpus for LDA
dictionary = corpora.Dictionary(pmq["processed"])
dictionary.filter_extremes(no_below=10, no_above=0.5)  # Keep terms that appear in at least 10 documents and in at most 50% of documents
corpus = [dictionary.doc2bow(text) for text in pmq["processed"]]

# Train LDA model
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, random_state=42)

LDA Example Application

Step 4: Extract topics and terms with top probabilities (equivalent to beta)

Python
top_terms = []
for topic_id in range(lda.num_topics):
    top_terms += [(topic_id, term, prob) for term, prob in lda.show_topic(topic_id, topn=10)]

# Convert top terms to a DataFrame
top_terms_df = pd.DataFrame(top_terms, columns=["topic", "term", "beta"])

LDA Example Application

We can then re-rank the terms within each topic using the following term score:

\[ \text{term-score}_{k,v} = \hat{\beta}_{k,v}\log\left(\frac{\hat{\beta}_{k,v}}{(\prod_{j=1}^{K}\hat{\beta}_{j,v})^{\frac{1}{K}}}\right) \]

The first \(\hat{\beta}_{k,v}\) is the probability of term \(v\) in topic \(k\) and is akin to the term frequency.

The second term down-weights terms that have high probability under all topics.

This formulation is similar to the TF-IDF (Term Frequency-Inverse Document Frequency) term score.
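
A sketch of this term score computed from the fitted model (it relies on gensim’s get_topics(), which returns the full topic-term matrix, and on the lda and dictionary objects from the earlier steps):

```python
import numpy as np

beta_matrix = lda.get_topics()          # shape: (num_topics, vocabulary_size)

# Geometric mean of each term's probability across topics: (prod_j beta_{j,v})^(1/K)
log_beta = np.log(beta_matrix + 1e-12)  # small constant to avoid log(0)
geo_mean = np.exp(log_beta.mean(axis=0))

# term-score_{k,v} = beta_{k,v} * log(beta_{k,v} / geometric mean over topics)
term_score = beta_matrix * np.log((beta_matrix + 1e-12) / geo_mean)

# Top 10 terms for topic 0 ranked by term score instead of raw probability
top_idx = np.argsort(-term_score[0])[:10]
print([dictionary[int(i)] for i in top_idx])
```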

Showing the LDA

See code
Python
# Step 1: Rank beta within each topic group (descending order)
top_terms_df['rank'] = top_terms_df.groupby('topic')['beta'] \
                             .rank(method='min', ascending=False)

# Step 2: Keep topics 0-10 (gensim numbers topics from 0)
result_df = top_terms_df[top_terms_df['topic'] <= 10]
See code
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
result_df <- reticulate::py$result_df

#Creating a graph with the topics
topics_lda_keys<-ggplot(result_df, aes(y = reorder(topic, -topic), x = as.numeric(rank))) +   
  geom_tile(aes(fill = beta))+
  scale_fill_viridis_c()+
  geom_label(aes(y=reorder(topic, -topic), x=rank, label=term), fill="white", size=3)+
  scale_x_continuous(breaks = seq(min(result_df$rank), max(result_df$rank))) +  # Ensure all ranks are shown
  ylab("Topic")+ xlab("Top 10 words")+
  theme(legend.position = "bottom")
topics_lda_keys

Top Document by Topic

We can also identify the top documents by topic

Python
# Get topic distributions for each document
document_topics = []

for doc_id, doc_bow in enumerate(corpus):
    doc_topics = lda.get_document_topics(doc_bow, minimum_probability=0)  # Get probabilities for all topics
    sorted_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)  # Sort topics by probability
    top_5_topics = sorted_topics[:5]  # Take the top 5 topics
    for topic_id, prob in top_5_topics:
        # Extract top terms for the topic
        terms = ", ".join([term for term, _ in lda.show_topic(topic_id, topn=5)])  # Top 5 terms as a string
        document_topics.append((doc_id + 1, topic_id, prob, terms))  # Add doc_id, topic_id, probability, and terms

# Convert to a DataFrame
document_topics_df = pd.DataFrame(document_topics, columns=["speech_id", "topic", "percentage", "terms"])

# Merge top 5 topics with the original DataFrame
pmq_with_topics = pmq.merge(document_topics_df, on="speech_id", how="left")

Top Topics in Boris Johnson

Python
pmq_boris = pmq_with_topics[pmq_with_topics["name"] == "Boris Johnson"]
See code
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
result_df <- reticulate::py$pmq_boris

# Add small increments only to duplicates in `percentage`
reshaped_df2 <- result_df %>%
  group_by(speech_id) %>%
  mutate(
    percentage = if_else(
      duplicated(percentage) | duplicated(percentage, fromLast = TRUE),  # Identify duplicates
      percentage + row_number() * 1e-6,  # Add small increments to duplicates
      percentage  # Keep original value for non-duplicates
    )
  )

reshaped_df2<-reshaped_df2%>%
  group_by(speech_id)%>%
  mutate(rank = rank(-percentage, ties.method = "min"))

#Creating a graph with the topics
topics_lda_keys<-ggplot(reshaped_df2, aes(y = reorder(speech_id, -speech_id), x = as.numeric(rank))) +   
  geom_tile(aes(fill = percentage))+
  scale_fill_viridis_c()+
  geom_label(aes(y = reorder(speech_id, -speech_id), x=as.numeric(rank), label=topic), fill="white", size=3)+
  scale_x_continuous(breaks = seq(min(reshaped_df2$rank), max(reshaped_df2$rank))) +  # Ensure all ranks are shown
  ylab("Speech")+ xlab("Top 5 Topics")+
  theme(legend.position = "bottom")
topics_lda_keys

Top Topics in Boris Johnson

Python
pmq_boris = pmq_with_topics[pmq_with_topics["name"] == "Boris Johnson"]
See code
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
result_df <- reticulate::py$pmq_boris

# Add small increments only to duplicates in `percentage`
reshaped_df2 <- result_df %>%
  group_by(speech_id) %>%
  mutate(
    percentage = if_else(
      duplicated(percentage) | duplicated(percentage, fromLast = TRUE),  # Identify duplicates
      percentage + row_number() * 1e-6,  # Add small increments to duplicates
      percentage  # Keep original value for non-duplicates
    )
  )

reshaped_df2<-reshaped_df2%>%
  group_by(speech_id)%>%
  mutate(rank = rank(-percentage, ties.method = "min"))

#reshaped_df3 <- reshaped_df2 %>%
#  group_by(speech_id) %>%
#  mutate(percentage = if_else(duplicated(percentage), NA_real_, percentage),
#         terms = if_else(duplicated(terms), NA_character_, terms))

reshaped_df2$terms <- gsub(
  pattern = "^(([^,]*,){2}[^,]*),",  # Match up to the third comma
  replacement = "\\1\n",            # Replace the third comma with a newline
  x = reshaped_df2$terms
)

#Creating a graph with the topics
topics_lda_keys<-ggplot(reshaped_df2, aes(y = reorder(speech_id, -speech_id), x = as.numeric(rank))) +   
  geom_tile(aes(fill = percentage))+
  scale_fill_viridis_c()+
  geom_label(aes(y = reorder(speech_id, -speech_id), x=as.numeric(rank), label=terms), fill="white", size=2)+
  scale_x_continuous(breaks = seq(min(reshaped_df2$rank), max(reshaped_df2$rank))) +  # Ensure all ranks are shown
  ylab("Speech")+ xlab("Top 5 Topics")+
  theme(legend.position = "bottom")
topics_lda_keys

Top Topics in Boris Johnson

Let us look at speech 3900.

Python
pmq_boris_chosen = pmq_with_topics[(pmq_with_topics["name"] == "Boris Johnson") & (pmq_with_topics["speech_id"] == 3900)]

Johnson’s speech 3900 features the following top 5 topics: 15, 0, 17, 13, and 6.

The topic with the highest score (0.3026) is topic 15: industri, unit, british, trade, uk.

This speech reads as follows:

Johnson’s speech - 3900

I hesitate for an age before correcting you, Mr Speaker, but it was a serious discussion of the advancement of free trade. The subject of free trade in the African Union, which my honourable Friend raises, is a very good one. The only advice I would give to the African Union is not to acquire a parliament, a court or a single currency.

Some Considerations about using LDA

Advantages

  • Unsupervised learning: No labeled data required; useful for exploring large corpora.
  • Interpretable topics: Topics are distributions over words, making them human-readable.
  • Scalable: Efficient algorithms for large datasets.
  • Flexibility: Can model diverse datasets and adapt to various domains.

Disadvantages

  • Bag-of-words assumption: Ignores word order and context, losing semantic nuance.
  • Sensitive to preprocessing: Requires careful tokenization, stopword removal, and lemmatization.
  • Fixed number of topics: Must predefine the number of topics, which may not align with the data.
  • Performance limitations: Struggles with short texts and highly overlapping topics.
  • Hyperparameter tuning: Requires tuning (e.g., α, η) to improve coherence and quality.

Validating LDA

  • LDA requires us to make important decisions
  • K, the number of topics, is something the researcher has to choose
  • How do we select K?

Held-out likelihood

  • We can ask the model which words are likely to appear in a document
  • We can split texts in half, train a topic model on one half, and calculate the held-out likelihood for the other half

Semantic Coherence

  • Do the most common words from a topic also co-occur together frequently in the same documents?

Semantic Coherence

Coherence: Evaluates the interpretability of topics, with higher values indicating more semantically meaningful topics.

  • Higher coherence scores indicate better topics in terms of human interpretability.
  • Lower scores indicate worse topics in terms of human interpretability.

Semantic Coherence

This is how it works:

Calculating Coherence Score
Python
#DON'T RUN: it takes a few mins to run
from gensim.models.coherencemodel import CoherenceModel

# Initialize an empty list to store coherence scores
coherence_scores = []

# Iterate over the range of topics in increments of 5
for num_topics in range(2, 21, 3):  # From 2 to 20 topics, step = 3
    print(f"Training LDA model for {num_topics} topics...")
    
    # Train LDA model
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=42)
    
    # Calculate coherence
    coherence_model = CoherenceModel(model=lda, texts=pmq["processed"], dictionary=dictionary, coherence="c_v")
    coherence_score = coherence_model.get_coherence()
    
    # Append results as a tuple
    coherence_scores.append((num_topics, coherence_score))
    print(f"Number of topics: {num_topics}, Coherence Score: {coherence_score}")

# Convert coherence scores into a DataFrame
coherence_df = pd.DataFrame(coherence_scores, columns=["num_topics", "coherence"])
Manually Creating the DF
Python
#Because the previous code takes a few mins to run, I am recreating its output
data = [
     {"num_topics": 2, "coherence": 0.299716},
     {"num_topics": 5, "coherence": 0.312965},
     {"num_topics": 8, "coherence": 0.342897},
     {"num_topics": 11, "coherence": 0.362278},
     {"num_topics": 14, "coherence": 0.374748},
     {"num_topics": 17, "coherence": 0.364560},
     {"num_topics": 20, "coherence": 0.359184},
 ]

# Create a DataFrame
coherence_df = pd.DataFrame(data)
See code
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
coherence_df <- reticulate::py$coherence_df

ggplot(data=coherence_df)+
  geom_line(aes(x=num_topics, y=coherence))+
  scale_x_continuous(breaks = seq(min(coherence_df$num_topics), max(coherence_df$num_topics), by= 5)) +  # Show a break every 5 topics
  theme_bw()+
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

Held-out Likelihood

Held-Out Likelihood: Measures how well the model fits unseen data. Lower values (more negative) indicate worse performance.

  • This metric evaluates the model’s ability to generalize to unseen data.
  • The goal is to maximize the held-out likelihood (i.e., make it less negative).
  • Optimizing only this metric can lead to overfitting, where the model captures noise in the data instead of meaningful patterns.

Held-out Likelihood

This is how it works:

Calculating Held-out Likelihood
Python
#DON'T RUN: it takes a few mins to run
# Held-out likelihood
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
train_texts, test_texts = train_test_split(pmq["processed"], test_size=0.2, random_state=42)

# Create a dictionary and corpus for the training set
train_dictionary = corpora.Dictionary(train_texts)
train_dictionary.filter_extremes(no_below=5)
train_corpus = [train_dictionary.doc2bow(text) for text in train_texts]

# Create a corpus for the validation set using the same dictionary
test_corpus = [train_dictionary.doc2bow(text) for text in test_texts]


held_out_scores = []

for num_topics in range(2, 21, 3):  # From 2 to 20 topics, step = 3
    print(f"Training LDA model for {num_topics} topics...")
    
    # Train LDA model on the training corpus
    lda = LdaModel(corpus=train_corpus, id2word=train_dictionary, num_topics=num_topics, random_state=42)
    
    # Calculate held-out likelihood
    held_out_likelihood = lda.log_perplexity(test_corpus)
    held_out_scores.append((num_topics, held_out_likelihood))
    
    print(f"Number of topics: {num_topics}, Held-Out Likelihood: {held_out_likelihood}")

# Convert the scores into a DataFrame
held_out_df = pd.DataFrame(held_out_scores, columns=["num_topics", "held_out_likelihood"])
Manually Creating the DF
Python
#Because the previous code takes a few mins to run, I am recreating its output
held_out_df = [
     {"num_topics": 2, "held_out_likelihood": -7.489501},
     {"num_topics": 5, "held_out_likelihood": -7.566638},
     {"num_topics": 8, "held_out_likelihood": -7.646296},
     {"num_topics": 11, "held_out_likelihood": -7.844623},
     {"num_topics": 14, "held_out_likelihood": -7.976968},
     {"num_topics": 17, "held_out_likelihood": -8.071734},
     {"num_topics": 20, "held_out_likelihood": -8.171345}
]

# Create a DataFrame
df_likelihood = pd.DataFrame(held_out_df)
See code
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
held_out_df <- reticulate::py$df_likelihood

ggplot(data=held_out_df)+
  geom_line(aes(x=num_topics, y=held_out_likelihood))+
  scale_x_continuous(breaks = seq(min(held_out_df$num_topics), max(held_out_df$num_topics), by= 5)) +  # Show a break every 5 topics
  theme_bw()+
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

Reconciling Coherence and Held-out Likelihood

  • Held-out Likelihood measures generalizability to unseen documents (predictive power) - best at 2 topics
  • Coherence measures semantic interpretability of topics - best at 14 topics

So:

Fewer topics (e.g., 2) generalize well to new data but are often too coarse to be interpretable.

More topics (e.g., 14) capture more meaningful distinctions between themes in the corpus but may overfit slightly, harming generalization.

We could opt for the number of topics that scores best on an average of the two metrics, balancing generalization and interpretability; a sketch of this follows.
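
One simple way to operationalize this is to min-max normalize each metric across the candidate values of K and average them (a minimal sketch assuming the coherence_df and df_likelihood data frames created above):

```python
import pandas as pd

# Merge the two diagnostics on the number of topics
combined = coherence_df.merge(df_likelihood, on="num_topics")

# Min-max normalize each metric so that higher = better on a 0-1 scale
def minmax(s):
    return (s - s.min()) / (s.max() - s.min())

combined["coherence_norm"] = minmax(combined["coherence"])
combined["held_out_norm"] = minmax(combined["held_out_likelihood"])  # less negative = better

# Average of the two normalized metrics; the best K is the row with the highest score
combined["avg_score"] = (combined["coherence_norm"] + combined["held_out_norm"]) / 2
print(combined.sort_values("avg_score", ascending=False).head())
```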

Exclusivity

Average Exclusivity: Reflects how distinct the topics are from one another. Higher values indicate less overlap in the terms associated with topics.

  • There is typically a trade-off between coherence and exclusivity
  • As coherence increases (topics are more interpretable)
  • Exclusivity may decrease (topics overlap more).

Exclusivity

This is how it works:

Calculating Exclusivity
Python
##DON'T RUN: it takes a few mins to run
from collections import defaultdict

exclusivity_scores = []

for num_topics in range(2, 21, 3):  # From 2 to 20 topics, step = 3
    print(f"Training LDA model for {num_topics} topics...")
    
    # Train LDA model
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=42)
    
    # Get the term-topic matrix
    term_topic_matrix = defaultdict(list)
    for topic_id in range(lda.num_topics):
        terms = lda.show_topic(topic_id, topn=20)  # Top 20 terms
        for term, prob in terms:
            term_topic_matrix[term].append(prob)
    
    # Compute exclusivity
    topic_exclusivity = []
    for topic_id in range(lda.num_topics):
        topic_terms = lda.show_topic(topic_id, topn=20)
        exclusivity = 0
        for term, prob in topic_terms:
            # Exclusivity = term probability in current topic divided by sum of its probabilities across all topics
            term_sum_prob = sum(term_topic_matrix[term])
            exclusivity += prob / term_sum_prob if term_sum_prob > 0 else 0
        topic_exclusivity.append(exclusivity / len(topic_terms))  # Normalize by number of terms in topic
    
    # Average exclusivity across topics for the model
    avg_exclusivity = sum(topic_exclusivity) / len(topic_exclusivity)
    exclusivity_scores.append((num_topics, avg_exclusivity))
    print(f"Number of topics: {num_topics}, Avg Exclusivity: {avg_exclusivity}")

# Convert to DataFrame
exclusivity_df = pd.DataFrame(exclusivity_scores, columns=["num_topics", "exclusivity"])

# Save to CSV (optional)
#exclusivity_df.to_csv("exclusivity_scores.csv", index=False)

# Display the DataFrame
#exclusivity_df
Manually Creating the DF
Python
#Because the previous code takes a few mins to run, I am recreating its output
exclusivity_df = [
     {"num_topics": 2, "avg_exclusivity": 0.725},
     {"num_topics": 5, "avg_exclusivity": 0.6},
     {"num_topics": 8, "avg_exclusivity": 0.5625},
     {"num_topics": 11, "avg_exclusivity": 0.5636363636363636},
     {"num_topics": 14, "avg_exclusivity": 0.5785714285714286},
     {"num_topics": 17, "avg_exclusivity": 0.5470588235294117},
 ]
exclusivity_df = pd.DataFrame(exclusivity_df)
See code
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
exclusivity_df <- reticulate::py$exclusivity_df

ggplot(data=exclusivity_df)+
  geom_line(aes(x=num_topics, y=avg_exclusivity))+
  scale_x_continuous(breaks = seq(min(exclusivity_df$num_topics), max(exclusivity_df$num_topics), by= 5)) +  # Show a break every 5 topics
  theme_bw()+
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, hjust = 1))

Reflection

In our case, 14 appears to be the number of topics with the highest coherence while still maintaining reasonably good exclusivity.

By combining these metrics with domain-specific insights, we can make an informed choice about the best number of topics for our analysis.

The Trade-Off: More Coherence → Less Exclusivity

  • Topics overlap as similar words appear in multiple themes.
  • Example: “Growth” might appear in both “Economy” and “Healthcare” topics.

The Trade-Off: More Exclusivity → Less Coherence

  • Topics become overly specific and less interpretable.
  • Example: A topic with rare words like “NASDAQ” and “dividends” may not clearly reflect “Economy.”

Conclusion

  • Topic models help uncover key themes across large text corpora by analyzing word distributions.

  • Documents are represented as mixtures of overarching topics, with topics defined as probabilistic distributions of words.

  • Strengths: Effective for initial exploration of textual data, requiring little upfront work.

  • Limitations: Results demand thorough interpretation and rigorous validation for meaningful use.

  • We still need to pay attention to interpreting and validating our results.