Topic models allow us to cluster similar documents in a corpus together.
```{mermaid}
%%| code-fold: true
%%| code-summary: "See code"
flowchart TD
    A[Do we know the categories of the document?] --> B(Yes)
    A --> C(No)
    C --> D[Topic Models]
    B --> E[Do you know the rule for placing documents in categories?]
    E --> F[Yes] --> I[Dictionaries]
    E --> G[No] --> J[Supervised Learning]
```
Topic Models
Topic Models are an automatic procedure for us to discover the main themes in an unstructured corpus.
There is no requirement for a training set or for labelling of text before estimation
Overall, topic models allow us to organize, understand, and summarize large corpora of data.
Latent Dirichlet Allocation (LDA) has been a common approach
it is easy to implement
it performs well on small data
However:
on large datasets with nuanced topics, LDA might struggle to capture the complexity of the text
when dealing with sparse data, LDA might struggle
an alternative would be word-embedding techniques such as BERT
Topic Models as Language Models
A language model is a probability distribution over words.
For example, in the Naive Bayes model, we tried to estimate a probability distribution for each category of interest.
Each document was assigned to a single category.
For topic models, we estimate probability distributions for each topic.
each document can belong to multiple topics
What is a topic?
A “topic” is a probability distribution over a fixed word vocabulary.
Consider a vocabulary: democracy, elections, government, economy, growth, market.
When discussing democracy:
Frequently use the words democracy, elections, and government.
Infrequently use the words economy, growth, and market.
When discussing the economy:
Frequently use the words economy, growth, and market.
Infrequently use the words democracy, elections, and government.
We now compute the topic word probabilities to: 1) understand the document content; 2) identify patterns across a corpus; 3) facilitate human interpretation
Topic Word Probabilities Table
| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |
What is a document?
In a topic model, each document is described as being composed of a mixture of corpus-wide topics
For each document, we find the topic proportions that maximize the probability that we would observe the words in that particular document.
Topic-word probabilities:

| Topic     | democracy | elections | government | economy | growth | market |
|-----------|-----------|-----------|------------|---------|--------|--------|
| Democracy | 0.4       | 0.3       | 0.25       | 0.02    | 0.02   | 0.01   |
| Economy   | 0.02      | 0.01      | 0.02       | 0.3     | 0.35   | 0.3    |

Document word counts:

| Doc | democracy | elections | government | economy | growth | market |
|-----|-----------|-----------|------------|---------|--------|--------|
| A   | 4         | 3         | 2          | 1       | 1      | 0      |
| B   | 1         | 1         | 1          | 3       | 4      | 3      |
What is the probability of observing Document A’s word counts under the “Democracy” topic?
The following equation gives the probability of a document's word counts under a specific topic: it is the multinomial probability of the counts given the topic's word probabilities. Here \(M_A\) is the total number of words in document A and \(W_{Aj}\) is the count of word \(j\) in document A. \[
\begin{aligned}
P(W_A|\mu_{\text{Democracy}}) &= \frac{M_{A}!}{\prod_{j=1}^{J} W_{Aj}!}\prod^{J}_{j=1}\mu^{W_{Aj}}_{\text{Democracy},j}\\
&= \frac{(4+3+2+1+1+0)!}{4! \cdot 3! \cdot 2! \cdot 1! \cdot 1! \cdot 0!} \cdot 0.4^4 \cdot 0.3^3 \cdot 0.25^2 \cdot 0.02^1 \cdot 0.02^1 \cdot 0.01^0\\
&= \frac{11!}{288} \cdot 0.00000001728\\
&= \frac{39916800}{288} \cdot 0.00000001728\\
&= 138600 \cdot 0.00000001728\\
&\approx 0.0024
\end{aligned}
\]
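As a quick sanity check (a minimal sketch; SciPy is not otherwise used in this lecture), the same multinomial probability can be computed directly:

```{python}
from scipy.stats import multinomial

# Topic-word probabilities for the "Democracy" topic (they sum to 1)
mu_democracy = [0.40, 0.30, 0.25, 0.02, 0.02, 0.01]

# Word counts for Document A: democracy, elections, government, economy, growth, market
counts_A = [4, 3, 2, 1, 1, 0]

# Multinomial probability of observing Document A's counts under the Democracy topic
print(multinomial.pmf(counts_A, n=sum(counts_A), p=mu_democracy))  # roughly 0.0024
```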
What is the probability of observing Document A’s word counts under the “Economy” topic?
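Repeating the same multinomial calculation with the Economy topic's word probabilities (a worked example based on the tables above):

\[
\begin{aligned}
P(W_A|\mu_{\text{Economy}}) &= \frac{11!}{4! \cdot 3! \cdot 2! \cdot 1! \cdot 1! \cdot 0!} \cdot 0.02^4 \cdot 0.01^3 \cdot 0.02^2 \cdot 0.3^1 \cdot 0.35^1 \cdot 0.3^0\\
&= 138600 \cdot 6.72 \times 10^{-18}\\
&\approx 9.3 \times 10^{-13}
\end{aligned}
\]

Document A is therefore vastly more likely under the "Democracy" topic than under the "Economy" topic.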
Overall, documents may be better described in terms of mixtures of different topics than by one topic alone.
A topic model estimates two sets of probabilities:
probability of observing each word for each topic
probability of observing each topic in each document
These quantities can then be used to organise documents by topic, assess how topics vary across documents, etc.
Latent Dirichlet Allocation (LDA)
LDA is a probabilistic language model.
Each document \(d\) in the corpus is generated as follows:
A set of \(K\) topics exists before the data
Each topic \(k\) is a probability distribution over words (\(\beta\))
\(\beta\) tells us how likely each word in the vocabulary is to belong to a given topic.
A specific mix of those topics is randomly extracted to generate a document
this mix is a specific probability distribution over topics (\(\theta\))
\(\theta\) tells us the proportion of each topic present in a given document. Unlike \(\beta\), which focuses on words, \(\theta\) focuses on how much of each topic is represented in a document.
Each word in a document is generated by:
First, choosing a topic \(k\) at random from the probability distribution over topics \(\theta\)
Then, choosing a word \(w\) at random from the topic-specific probability distribution over words (\(\beta_k\))
The goal of LDA is to estimate the hidden parameters (\(\beta\) and \(\theta\)) starting from the observed words \(w\).
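To make this generative story concrete, here is a minimal simulation of it (illustrative only: the vocabulary, number of topics, and Dirichlet parameters below are made-up assumptions):

```{python}
import numpy as np

rng = np.random.default_rng(42)

# Illustrative assumptions: a 6-word vocabulary, K = 2 topics, symmetric Dirichlet priors
vocab = ["democracy", "elections", "government", "economy", "growth", "market"]
K, alpha, eta = 2, 0.5, 0.5

# beta: each topic is a probability distribution over the vocabulary
beta = rng.dirichlet([eta] * len(vocab), size=K)

# theta: each document is a probability distribution over topics
theta = rng.dirichlet([alpha] * K)

# Generate one document of 10 words
words = []
for _ in range(10):
    k = rng.choice(K, p=theta)             # choose a topic from theta
    w = rng.choice(len(vocab), p=beta[k])  # choose a word from that topic's distribution
    words.append(vocab[w])

print(theta)
print(words)
```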
Latent Dirichlet Allocation (LDA)
The researcher picks a number of topics, \(K\)
Each topic (\(k\)) is a distribution over words
Each document (\(d\)) is a mixture of corpus-wide topics
Each word (\(w\)) is drawn from one of the topics
Latent Dirichlet Allocation (LDA)
Estimation of the LDA model is done in a Bayesian framework
We use Bayes’ rule to update these prior distributions to obtain a posterior distribution for each \(\theta_d\) and \(\beta_k\)
The means of these posterior distributions are the outputs of statistical packages and which we use to investigate the \(\theta_d\) and \(\beta_k\)
LDA has two goals:
for each document, allocate its words to as few topics as possible (\(\alpha\))
for each topic, assign high probability to as few terms as possible (\(\eta\))
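In gensim, these two priors correspond to the `alpha` and `eta` arguments of `LdaModel`. A hedged illustration (it assumes a `corpus` and `dictionary` have already been built, as in the example application below):

```{python}
from gensim.models.ldamodel import LdaModel

# alpha controls document-topic sparsity; eta controls topic-word sparsity.
# "auto" asks gensim to learn asymmetric priors from the data rather than fixing them.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20,
               alpha="auto", eta="auto", random_state=42)
```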
Latent Dirichlet Allocation (LDA)
Imagine that we have the following parameters: \(D = 1,000\) documents, \(J = 10,000\) words, and \(K = 3\) topics
```{python}
import pandas as pd
import numpy as np

aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")

# Count the non-null values in the "body" column
speech_count = aggression_texts["body"].notnull().sum()

# Print the result
print(f"Number of speeches: {speech_count}")
```
Number of speeches: 28645
```{python}
# Combine all text in the "body" column into a single string
all_text = " ".join(aggression_texts["body"].dropna())

# Tokenize the text into words (splitting by whitespace is a simple tokenizer)
all_words = all_text.split()

# Count the total number of words
total_word_count = len(all_words)
print(f"Total number of words: {total_word_count}")
```
Total number of words: 4576262
LDA Example
We have the following parameters:
```{python}
# Combine all text in the "body" column into a single string
all_text = " ".join(aggression_texts["body"].dropna())

# Tokenize the text into words (splitting by whitespace is a simple tokenizer)
all_words = all_text.split()

# Convert the list of words to a set to find unique words
unique_words = set(all_words)

# Count the number of unique words
unique_word_count = len(unique_words)
print(f"Number of unique words: {unique_word_count}")
```
Number of unique words: 99222
LDA Example Application
Step 1: Load the relevant libraries
```{python}
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models.ldamodel import LdaModel
from gensim import corpora
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import numpy as np

# Load your data (assuming the 'pmq' DataFrame has a column "body")
pmq = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
```
LDA Example Application
Step 2: Pre-processing text
We need to preprocess the text to remove noise like punctuation and stopwords, which can skew our topic analysis.
```{python}
# Preprocess the text
def preprocess(text):
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))
    tokens = text.lower().split()  # Simple tokenization
    tokens = [word for word in tokens if word.isalnum()]  # Remove punctuation
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    tokens = [stemmer.stem(word) for word in tokens]  # Stem words
    return tokens

pmq["processed"] = pmq["body"].apply(preprocess)
pmq["speech_id"] = range(1, len(pmq) + 1)  # Add row numbers starting from 1
```
LDA Example Application
Step 3: Creating a dictionary and corpus for LDA
```{python}
# Create a dictionary and corpus for LDA
dictionary = corpora.Dictionary(pmq["processed"])
dictionary.filter_extremes(no_below=10, no_above=0.5)  # Keep terms in at least 10 documents and at most 50% of documents
corpus = [dictionary.doc2bow(text) for text in pmq["processed"]]

# Train LDA model
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, random_state=42)
```
LDA Example Application
Step 4: Extract topics and terms with top probabilities (equivalent to beta)
```{python}
# Extract the top 10 terms and their probabilities (beta) for each topic
top_terms = []
for topic_id in range(lda.num_topics):
    top_terms += [(topic_id, term, prob) for term, prob in lda.show_topic(topic_id, topn=10)]

# Convert top terms to a DataFrame
top_terms_df = pd.DataFrame(top_terms, columns=["topic", "term", "beta"])
```
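For reference, the term score that the next lines describe (in the form given by Blei and Lafferty, 2009) is:

\[
\text{term-score}_{k,v} = \hat{\beta}_{k,v}\,\log\!\left(\frac{\hat{\beta}_{k,v}}{\left(\prod_{j=1}^{K}\hat{\beta}_{j,v}\right)^{1/K}}\right)
\]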
The first term, \(\hat{\beta}_{k,v}\), is the probability of term \(v\) in topic \(k\) and is akin to the term frequency.
The second term down-weighs terms that have high probability under all topics.
This formulation is similar to the TF-IDF (Term Frequency-Inverse Document Frequency) term score.
Visualizing the LDA Topics
```{python}
# Step 1: Rank beta within each topic group (descending order)
top_terms_df['rank'] = top_terms_df.groupby('topic')['beta'] \
    .rank(method='min', ascending=False)

# Step 2: Keep the first topics (IDs 0-10) for plotting
result_df = top_terms_df[top_terms_df['topic'] <= 10]
```
```{r}
library(dplyr)
library(broom)
library(ggplot2)

# Turning the Pandas dataframe to R
result_df <- reticulate::py$result_df

# Creating a graph with the topics
topics_lda_keys <- ggplot(result_df, aes(y = reorder(topic, -topic), x = as.numeric(rank))) +
  geom_tile(aes(fill = beta)) +
  scale_fill_viridis_c() +
  geom_label(aes(y = reorder(topic, -topic), x = rank, label = term), fill = "white", size = 3) +
  scale_x_continuous(breaks = seq(min(result_df$rank), max(result_df$rank))) +  # Ensure all ranks are shown
  ylab("Topic") +
  xlab("Top 10 words") +
  theme(legend.position = "bottom")

topics_lda_keys
```
Top Document by Topic
We can also identify the top documents by topic
```{python}
# Get topic distributions for each document
document_topics = []
for doc_id, doc_bow in enumerate(corpus):
    doc_topics = lda.get_document_topics(doc_bow, minimum_probability=0)  # Get probabilities for all topics
    sorted_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)  # Sort topics by probability
    top_5_topics = sorted_topics[:5]  # Take the top 5 topics
    for topic_id, prob in top_5_topics:
        # Extract top terms for the topic
        terms = ", ".join([term for term, _ in lda.show_topic(topic_id, topn=5)])  # Top 5 terms as a string
        document_topics.append((doc_id + 1, topic_id, prob, terms))  # Add doc_id, topic_id, probability, and terms

# Convert to a DataFrame
document_topics_df = pd.DataFrame(document_topics, columns=["speech_id", "topic", "percentage", "terms"])

# Merge top 5 topics with the original DataFrame
pmq_with_topics = pmq.merge(document_topics_df, on="speech_id", how="left")
```
```{r}
library(dplyr)
library(broom)
library(ggplot2)

# Turning the Pandas dataframe to R
result_df <- reticulate::py$pmq_boris

# Add small increments only to duplicates in `percentage`
reshaped_df2 <- result_df %>%
  group_by(speech_id) %>%
  mutate(
    percentage = if_else(
      duplicated(percentage) | duplicated(percentage, fromLast = TRUE),  # Identify duplicates
      percentage + row_number() * 1e-6,  # Add small increments to duplicates
      percentage  # Keep original value for non-duplicates
    )
  )

reshaped_df2 <- reshaped_df2 %>%
  group_by(speech_id) %>%
  mutate(rank = rank(-percentage, ties.method = "min"))

# Creating a graph with the topics
topics_lda_keys <- ggplot(reshaped_df2, aes(y = reorder(speech_id, -speech_id), x = as.numeric(rank))) +
  geom_tile(aes(fill = percentage)) +
  scale_fill_viridis_c() +
  geom_label(aes(y = reorder(speech_id, -speech_id), x = as.numeric(rank), label = topic), fill = "white", size = 3) +
  scale_x_continuous(breaks = seq(min(reshaped_df2$rank), max(reshaped_df2$rank))) +  # Ensure all ranks are shown
  ylab("Speech") +
  xlab("Top 5 Topics") +
  theme(legend.position = "bottom")

topics_lda_keys
```
```{r}
library(dplyr)
library(broom)
library(ggplot2)

# Turning the Pandas dataframe to R
result_df <- reticulate::py$pmq_boris

# Add small increments only to duplicates in `percentage`
reshaped_df2 <- result_df %>%
  group_by(speech_id) %>%
  mutate(
    percentage = if_else(
      duplicated(percentage) | duplicated(percentage, fromLast = TRUE),  # Identify duplicates
      percentage + row_number() * 1e-6,  # Add small increments to duplicates
      percentage  # Keep original value for non-duplicates
    )
  )

reshaped_df2 <- reshaped_df2 %>%
  group_by(speech_id) %>%
  mutate(rank = rank(-percentage, ties.method = "min"))

# Insert a line break after the third comma so the term labels fit inside the tiles
reshaped_df2$terms <- gsub(
  pattern = "^(([^,]*,){2}[^,]*),",  # Match up to the third comma
  replacement = "\\1\n",             # Replace the third comma with a newline
  x = reshaped_df2$terms
)

# Creating a graph with the topics
topics_lda_keys <- ggplot(reshaped_df2, aes(y = reorder(speech_id, -speech_id), x = as.numeric(rank))) +
  geom_tile(aes(fill = percentage)) +
  scale_fill_viridis_c() +
  geom_label(aes(y = reorder(speech_id, -speech_id), x = as.numeric(rank), label = terms), fill = "white", size = 2) +
  scale_x_continuous(breaks = seq(min(reshaped_df2$rank), max(reshaped_df2$rank))) +  # Ensure all ranks are shown
  ylab("Speech") +
  xlab("Top 5 Topics") +
  theme(legend.position = "bottom")

topics_lda_keys
```
Johnson’s speech (speech_id 3900) falls under the following top 5 topics: 15, 0, 17, 13, 6.
The topic with the highest score (0.3026) is topic 15: industri, unit, british, trade, uk.
This speech reads as follows:
Johnson’s speech - 3900
I hesitate for an age before correcting you, Mr Speaker, but it was a serious discussion of the advancement of free trade. The subject of free trade in the African Union, which my honourable Friend raises, is a very good one. The only advice I would give to the African Union is not to acquire a parliament, a court or a single currency.
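As a rough sketch of how such a speech can be retrieved (assuming the `pmq_with_topics` DataFrame built earlier; the speech id 3900 comes from the example above):

```{python}
# Top topics assigned to speech 3900
speech_3900 = pmq_with_topics[pmq_with_topics["speech_id"] == 3900]
print(speech_3900[["topic", "percentage", "terms"]].sort_values("percentage", ascending=False))

# The text of the speech itself
print(speech_3900["body"].iloc[0])
```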
Some Considerations about using LDA
Advantages
Unsupervised learning: No labeled data required; useful for exploring large corpora.
Interpretable topics: Topics are distributions over words, making them human-readable.
Scalable: Efficient algorithms for large datasets.
Flexibility: Can model diverse datasets and adapt to various domains.
Disadvantages
Bag-of-words assumption: Ignores word order and context, losing semantic nuance.
Sensitive to preprocessing: Requires careful tokenization, stopword removal, and lemmatization.
Fixed number of topics: Must predefine the number of topics, which may not align with the data.
Performance limitations: Struggles with short texts and highly overlapping topics.
Hyperparameter tuning: Requires tuning (e.g., α, β) to improve coherence and quality.
Validating LDA
LDA requires us to make important decisions:
K, the number of topics, is something that researchers need to choose.
How do we select K?
Held-out likelihood
We can ask the model which words are likely to be in a document
We can split texts in half, train a topic model on one half, and calculate the held-out likelihood for the other half
Semantic Coherence
Do the most common words from a topic also co-occur together frequently in the same documents?
Semantic Coherence
Coherence: Evaluates the interpretability of topics, with higher values indicating more semantically meaningful topics.
Higher coherence scores indicate better topics in terms of human interpretability.
Lower scores indicate worse topics in terms of human interpretability.
Semantic Coherence
This is how it works:
Calculating Coherence Score
```{python}
# DON'T RUN: it takes a few mins to run
from gensim.models.coherencemodel import CoherenceModel

# Initialize an empty list to store coherence scores
coherence_scores = []

# Iterate over the number of topics from 2 to 20 in steps of 3
for num_topics in range(2, 21, 3):
    print(f"Training LDA model for {num_topics} topics...")

    # Train LDA model
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=42)

    # Calculate coherence
    coherence_model = CoherenceModel(model=lda, texts=pmq["processed"], dictionary=dictionary, coherence="c_v")
    coherence_score = coherence_model.get_coherence()

    # Append results as a tuple
    coherence_scores.append((num_topics, coherence_score))
    print(f"Number of topics: {num_topics}, Coherence Score: {coherence_score}")

# Convert coherence scores into a DataFrame
coherence_df = pd.DataFrame(coherence_scores, columns=["num_topics", "coherence"])
```
Manually Creating the DF
```{python}
# Because the previous code takes a few mins to run, I am recreating its output
data = [
    {"num_topics": 2, "coherence": 0.299716},
    {"num_topics": 5, "coherence": 0.312965},
    {"num_topics": 8, "coherence": 0.342897},
    {"num_topics": 11, "coherence": 0.362278},
    {"num_topics": 14, "coherence": 0.374748},
    {"num_topics": 17, "coherence": 0.364560},
    {"num_topics": 20, "coherence": 0.359184},
]

# Create a DataFrame
coherence_df = pd.DataFrame(data)
```
```{r}
library(dplyr)
library(broom)
library(ggplot2)

# Turning the Pandas dataframe to R
coherence_df <- reticulate::py$coherence_df

ggplot(data = coherence_df) +
  geom_line(aes(x = num_topics, y = coherence)) +
  scale_x_continuous(breaks = seq(min(coherence_df$num_topics), max(coherence_df$num_topics), by = 5)) +  # Ensure all breaks are shown
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_bw()
```
Held-out Likelihood
Held-Out Likelihood: Measures how well the model fits unseen data. Lower values (more negative) indicate worse performance.
This metric evaluates the model’s ability to generalize to unseen data.
The goal is to maximize the held-out likelihood (i.e., make it less negative).
Optimizing only this metric can lead to overfitting, where the model captures noise in the data instead of meaningful patterns.
Held-out Likelihood
This is how it works:
Calculating Held-out Likelihood
```{python}
# DON'T RUN: it takes a few mins to run
# Held-out likelihood
from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
train_texts, test_texts = train_test_split(pmq["processed"], test_size=0.2, random_state=42)

# Create a dictionary and corpus for the training set
train_dictionary = corpora.Dictionary(train_texts)
train_dictionary.filter_extremes(no_below=5)
train_corpus = [train_dictionary.doc2bow(text) for text in train_texts]

# Create a corpus for the validation set using the same dictionary
test_corpus = [train_dictionary.doc2bow(text) for text in test_texts]

held_out_scores = []
for num_topics in range(2, 21, 3):  # From 2 to 20 topics, step = 3
    print(f"Training LDA model for {num_topics} topics...")

    # Train LDA model on the training corpus
    lda = LdaModel(corpus=train_corpus, id2word=train_dictionary, num_topics=num_topics, random_state=42)

    # Calculate held-out likelihood
    held_out_likelihood = lda.log_perplexity(test_corpus)
    held_out_scores.append((num_topics, held_out_likelihood))
    print(f"Number of topics: {num_topics}, Held-Out Likelihood: {held_out_likelihood}")

# Convert the scores into a DataFrame
held_out_df = pd.DataFrame(held_out_scores, columns=["num_topics", "held_out_likelihood"])
```
Manually Creating the DF
```{python}
# Because the previous code takes a few mins to run, I am recreating its output
held_out_df = [
    {"num_topics": 2, "held_out_likelihood": -7.489501},
    {"num_topics": 5, "held_out_likelihood": -7.566638},
    {"num_topics": 8, "held_out_likelihood": -7.646296},
    {"num_topics": 11, "held_out_likelihood": -7.844623},
    {"num_topics": 14, "held_out_likelihood": -7.976968},
    {"num_topics": 17, "held_out_likelihood": -8.071734},
    {"num_topics": 20, "held_out_likelihood": -8.171345}
]

# Create a DataFrame
df_likelihood = pd.DataFrame(held_out_df)
```
```{r}
library(dplyr)
library(broom)
library(ggplot2)

# Turning the Pandas dataframe to R
held_out_df <- reticulate::py$df_likelihood

ggplot(data = held_out_df) +
  geom_line(aes(x = num_topics, y = held_out_likelihood)) +
  scale_x_continuous(breaks = seq(min(held_out_df$num_topics), max(held_out_df$num_topics), by = 5)) +  # Ensure all breaks are shown
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_bw()
```
Reconciling Coherence and Heldout Likelihood
Held-out likelihood measures generalizability to unseen documents (predictive power); in our example it is best at 2 topics.
Coherence measures the semantic interpretability of topics; in our example it is best at 14 topics.
So:
Fewer topics (e.g., 2) generalize well to new data but are often too coarse to be interpretable.
More topics (e.g., 14) capture more meaningful distinctions between themes in the corpus but may overfit slightly, harming generalization.
We could therefore opt for a number of topics that performs reasonably well on both metrics, for example by normalizing the two scores and averaging them, as sketched below.
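A minimal sketch of that idea (assuming the `coherence_df` and `df_likelihood` DataFrames created above; min-max normalization is one reasonable choice among several):

```{python}
# Merge the two evaluation tables on the number of topics
eval_df = coherence_df.merge(df_likelihood, on="num_topics")

# Min-max normalize each metric to [0, 1]; for both, higher (less negative) is better
for col in ["coherence", "held_out_likelihood"]:
    eval_df[col + "_norm"] = (eval_df[col] - eval_df[col].min()) / (eval_df[col].max() - eval_df[col].min())

# Average of the two normalized scores; the largest value balances both criteria
eval_df["combined"] = (eval_df["coherence_norm"] + eval_df["held_out_likelihood_norm"]) / 2
print(eval_df.sort_values("combined", ascending=False))
```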
Exclusivity
Average Exclusivity: Reflects how distinct the topics are from one another. Higher values indicate less overlap in the terms associated with topics.
There is typically a trade-off between coherence and exclusivity
As coherence increases (topics are more interpretable)
Exclusivity may decrease (topics overlap more).
Exclusivity
This is how it works:
Calculating Exclusivity
```{python}
# DON'T RUN: it takes a few mins to run
from collections import defaultdict

exclusivity_scores = []

for num_topics in range(2, 21, 3):  # From 2 to 20 topics, step = 3
    print(f"Training LDA model for {num_topics} topics...")

    # Train LDA model
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=42)

    # Get the term-topic matrix
    term_topic_matrix = defaultdict(list)
    for topic_id in range(lda.num_topics):
        terms = lda.show_topic(topic_id, topn=20)  # Top 20 terms
        for term, prob in terms:
            term_topic_matrix[term].append(prob)

    # Compute exclusivity
    topic_exclusivity = []
    for topic_id in range(lda.num_topics):
        topic_terms = lda.show_topic(topic_id, topn=20)
        exclusivity = 0
        for term, prob in topic_terms:
            # Exclusivity = term probability in current topic divided by sum of its probabilities across all topics
            term_sum_prob = sum(term_topic_matrix[term])
            exclusivity += prob / term_sum_prob if term_sum_prob > 0 else 0
        topic_exclusivity.append(exclusivity / len(topic_terms))  # Normalize by number of terms in topic

    # Average exclusivity across topics for the model
    avg_exclusivity = sum(topic_exclusivity) / len(topic_exclusivity)
    exclusivity_scores.append((num_topics, avg_exclusivity))
    print(f"Number of topics: {num_topics}, Avg Exclusivity: {avg_exclusivity}")

# Convert to DataFrame
exclusivity_df = pd.DataFrame(exclusivity_scores, columns=["num_topics", "exclusivity"])

# Save to CSV (optional)
# exclusivity_df.to_csv("exclusivity_scores.csv", index=False)
```
Manually Creating the DF
```{python}
# Because the previous code takes a few mins to run, I am recreating its output
exclusivity_df = [
    {"num_topics": 2, "avg_exclusivity": 0.725},
    {"num_topics": 5, "avg_exclusivity": 0.6},
    {"num_topics": 8, "avg_exclusivity": 0.5625},
    {"num_topics": 11, "avg_exclusivity": 0.5636363636363636},
    {"num_topics": 14, "avg_exclusivity": 0.5785714285714286},
    {"num_topics": 17, "avg_exclusivity": 0.5470588235294117},
]
exclusivity_df = pd.DataFrame(exclusivity_df)
```
```{r}
library(dplyr)
library(broom)
library(ggplot2)

# Turning the Pandas dataframe to R
exclusivity_df <- reticulate::py$exclusivity_df

ggplot(data = exclusivity_df) +
  geom_line(aes(x = num_topics, y = avg_exclusivity)) +
  scale_x_continuous(breaks = seq(min(exclusivity_df$num_topics), max(exclusivity_df$num_topics), by = 5)) +  # Ensure all breaks are shown
  theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_bw()
```
Reflection
In our case, 14 topics gives the highest coherence while keeping exclusivity relatively high compared with the neighbouring values.
By combining these metrics with domain-specific insights, we can make an informed choice about the best number of topics for our analysis.
The Trade-Off: More Coherence → Less Exclusivity
Topics overlap as similar words appear in multiple themes.
Example: “Growth” might appear in both “Economy” and “Healthcare” topics.
The Trade-Off: More Exclusivity → Less Coherence
Topics become overly specific and less interpretable.
Example: A topic with rare words like “NASDAQ” and “dividends” may not clearly reflect “Economy.”
Conclusion
Topic models help uncover key themes across large text corpora by analyzing word distributions.
Documents are represented as mixtures of overarching topics, with topics defined as probabilistic distributions of words.
Strengths: Effective for initial exploration of textual data, requiring little upfront work.
Limitations: Results demand thorough interpretation and rigorous validation for meaningful use.
We still need to pay attention to interpreting and validating our results.