About LDA:
The GAP:
Importance of Word Embeddings
How words have been represented so far:
Problems with Sparse Representations:
\[ \cos(\theta) = \frac{\mathbf{w_{\text{cat}}} \cdot \mathbf{w_{\text{dog}}}}{\left|\left| \mathbf{w_{\text{cat}}} \right|\right| \left|\left| \mathbf{w_{\text{dog}}} \right|\right|} = 0 \]
This means that models cannot learn or infer word relationships like synonyms, analogies, or context.
1. Similarity
2. Classification
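To make the preceding point concrete, here is a minimal sketch with a toy five-word vocabulary: any two distinct one-hot vectors have cosine similarity of exactly zero.

Python
import numpy as np

# Toy vocabulary of size |V| = 5: each word is a one-hot vector
w_cat = np.array([1, 0, 0, 0, 0])  # "cat" occupies dimension 0
w_dog = np.array([0, 1, 0, 0, 0])  # "dog" occupies dimension 1

# Cosine similarity: dot product divided by the product of the norms
cos = w_cat @ w_dog / (np.linalg.norm(w_cat) * np.linalg.norm(w_dog))
print(cos)  # 0.0 -- sparse one-hot vectors carry no notion of similarity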
What are word embeddings?
How embeddings solve the limitations:
Distributional Semantics means that the meaning of a word can be derived from the distribution of contexts in which it appears.
The hypothesis implies that words that appear in similar “contexts” will share similar meanings.
When a word \(w\) appears in a text, its “context” is the set of words that appear nearby (within a fixed-size window).
We use the many contexts of \(w\) to build up a representation of \(w\).
Pre | Keyword | Post |
---|---|---|
the sacrifices made to secure our | freedom | will never be forgotten. |
ensure that every citizen enjoys the | freedom | to speak their mind without fear. |
we must defend the values of democracy and | freedom | against all forms of tyranny. |
the right to pursue life, liberty, and | freedom | is fundamental to our society. |
not everyone in the world experiences the | freedom | we often take for granted. |
policies are designed to protect the | freedom | of individuals while ensuring equality. |
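A minimal sketch of how such fixed-size windows can be harvested from tokenized text (get_contexts is a hypothetical helper; here the window is 4 words on each side):

Python
def get_contexts(tokens, keyword, window=4):
    # Collect the words within `window` positions of each occurrence of `keyword`
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            contexts.append((left, right))
    return contexts

tokens = "we must defend the values of democracy and freedom against all forms of tyranny".split()
print(get_contexts(tokens, "freedom"))
# [(['values', 'of', 'democracy', 'and'], ['against', 'all', 'forms', 'of'])]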
The meaning of each word is based on the distribution of terms with which it co-occurs
We represent this meaning using a vector for each word
Vectors are constructed such that similar words are close to each other in “semantic” space
We build this space automatically by seeing which words are close to one another in texts
Our goal is to build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts (measuring similarity as the dot product)
\[ \begin{align} w_{\text{cat}} &= \begin{bmatrix} 0.73 \\ 0.04 \\ 0.07 \\ -0.18 \\ 0.81 \\ -0.97 \end{bmatrix} \end{align} \]
\[ \begin{align} w_{\text{dog}} &= \begin{bmatrix} 0.63 \\ 0.14 \\ 0.02 \\ -0.58 \\ 0.43 \\ -0.66 \end{bmatrix} \end{align} \]
These representations are known as word embeddings because we “embed” words into a low-dimensional space (low compared to the vocabulary size).
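Unlike the sparse case, the dot product (or cosine) of these dense vectors is now informative. A quick check with the two example vectors above (the high similarity is, of course, by construction):

Python
import numpy as np

w_cat = np.array([0.73, 0.04, 0.07, -0.18, 0.81, -0.97])
w_dog = np.array([0.63, 0.14, 0.02, -0.58, 0.43, -0.66])

# Cosine similarity between the two dense embeddings
cos = w_cat @ w_dog / (np.linalg.norm(w_cat) * np.linalg.norm(w_dog))
print(round(cos, 2))  # ~0.9: "cat" and "dog" sit close together in the embedding space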
Low-dimensional word embeddings offer some advantages:
Training Data Requirements
Context Window Size
Dimension of the Embedding
A word embedding is a dense vector representation of a word in a continuous vector space, where similar words are placed closer together.
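Pretrained embeddings of this kind can be queried directly; a sketch using gensim's downloader ("glove-wiki-gigaword-50" is one of its bundled models; the first call downloads it, so an internet connection is assumed):

Python
import gensim.downloader as api

# Load 50-dimensional GloVe vectors trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-50")

# Nearest neighbours of "freedom" in the embedding space
print(glove.most_similar("freedom", topn=3))
# words such as "liberty" and "democracy" typically score highly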
However, there are some limitations:
Static Representations fall short:
BERTopic is a topic modelling technique that leverages transformers and c-TF-IDF to create clusters for easily interpretable topics.
What are Transformers?
What is TF-IDF?
What is c-TF-IDF?
How c-TF-IDF Works in BERTopic
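In c-TF-IDF, term frequencies are computed per cluster (class) rather than per document. As a sketch of the weighting, following the formulation in the BERTopic documentation (notation ours):

\[ W_{t,c} = \mathrm{tf}_{t,c} \cdot \log\left(1 + \frac{A}{\mathrm{tf}_{t}}\right) \]

where \(\mathrm{tf}_{t,c}\) is the frequency of term \(t\) in class \(c\), \(\mathrm{tf}_{t}\) is its frequency across all classes, and \(A\) is the average number of words per class. Terms that are frequent within one cluster but rare elsewhere receive high weights and become that topic's representative words.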
Why This Combination Works
The next section reads the relevant documents.
Python
import pandas as pd
# Load the aggression_texts CSV and extract the 'body' column
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
aggression_texts['document_id'] = range(1, len(aggression_texts) + 1)
We now initialize the BERTopic model:
Python
# The following two lines suppress a noisy Hugging Face tokenizers warning: "huggingface/tokenizers: The current process just got forked..."
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Importing Bert
from bertopic import BERTopic
# Create a UMAP instance with a fixed random state for reproducibility
from umap import UMAP
umap_model = UMAP(random_state=42)
# Initialize the BERTopic model with probability calculation enabled and the UMAP model specified
topic_model = BERTopic(calculate_probabilities=True, umap_model=umap_model, verbose=True)
topics, probs = topic_model.fit_transform(aggression_texts["body"])
Batches: 100%|##########| 896/896 [01:32<00:00, 9.72it/s]
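fit_transform returns one topic label per document plus, since we set calculate_probabilities=True, a document-by-topic probability matrix. A quick sanity check of the outputs:

Python
# topics: one topic id per document; probs: one row per document, one column per (non-outlier) topic
print(len(topics), probs.shape)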
Let us now look at the main topics and their representation:
Python
# Retrieve topic information and select specific columns: 'Topic', 'Count', 'Name', and 'Representation'
topic_info = topic_model.get_topic_info()[['Topic', 'Count', 'Name', 'Representation']]
# Create a simplified version of the topic information, retaining only 'Topic', 'Count', and 'Representation'
# (.copy() avoids pandas' SettingWithCopyWarning when we modify the column below)
topic_info_simple = topic_info[['Topic', 'Count', 'Representation']].copy()
# Modify the 'Representation' column to remove duplicate words while preserving their order
# This uses a dictionary to enforce uniqueness and joins the keys (unique words) back into a string
topic_info_simple['Representation'] = topic_info_simple['Representation'].apply(lambda x: ' '.join(dict.fromkeys(x).keys()))
# Keep the simplified topic information as `frequency_table` for display
frequency_table = pd.DataFrame(topic_info_simple)
Topic | Count | Representation |
---|---|---|
-1 | 11169 | the to of that and in is for it not |
0 | 2096 | tax chancellor the government that we is businesses in business |
1 | 1330 | health nhs care patients hospital service hospitals services patient that |
2 | 551 | rail transport network road london trains line railway railways roads |
3 | 520 | ireland northern agreement irish ira sinn parties fein belfast decommissioning |
4 | 456 | energy climate carbon fuel gas emissions wind change electricity efficiency |
5 | 398 | police officers crime policing home chief metropolitan force constable constables |
Python
# Initialize an empty list to store rows
topic_words_scores = []
# Loop through each topic and extract its word scores
for topic_id in topic_info_simple["Topic"]:
words_scores = topic_model.get_topic(topic_id) # Get words and scores for the topic
if words_scores: # Ensure there are words for the topic
for word, score in words_scores:
topic_words_scores.append({"Topic": topic_id, "Word": word, "Score": score})
# Convert the list into a DataFrame
topic_words_scores_df = pd.DataFrame(topic_words_scores)
Python
# Step 1: Group by 'Topic' and rank terms by descending 'Score'
topic_words_scores_df['rank'] = topic_words_scores_df.groupby('Topic')['Score'] \
.rank(method='min', ascending=False)
# Step 2: Keep only the first topics, "-1" through "10" (12 topics in total)
# First, make sure Topic is treated as a string (if needed)
topic_words_scores_df['Topic'] = topic_words_scores_df['Topic'].astype(str)
selected_topics = [str(i) for i in range(-1, 11)]
reshaped_df2 = topic_words_scores_df[topic_words_scores_df['Topic'].isin(selected_topics)]
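As an aside, BERTopic can produce a similar chart directly; a one-liner using its built-in Plotly visualization (visualize_barchart is part of the BERTopic API; top_n_topics controls how many topics are shown):

Python
# Interactive Plotly bar chart of the top words per topic
topic_model.visualize_barchart(top_n_topics=8)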
R
# Load necessary libraries
library(dplyr) # For data manipulation
library(broom) # For tidying data (not used explicitly in this code but useful in many workflows)
library(ggplot2) # For creating plots
library(ggpubr) # For arranging multiple plots into a single figure
library(forcats) # for fct_rev
reshaped_df2 <- reticulate::py$reshaped_df2
#Creating a graph with the topics
topics_df<-ggplot(reshaped_df2, aes(y = fct_rev(as.factor(Topic)), x = as.numeric(rank))) +
geom_tile(aes(fill = Score))+
scale_fill_viridis_c()+
geom_label(aes(y = fct_rev(as.factor(Topic)), x=rank, label=Word), fill="white", size=3)+
scale_x_continuous(breaks = seq(min(reshaped_df2$rank), max(reshaped_df2$rank))) + # Ensure all ranks are shown
ylab("Topic")+ xlab("Top 10 words")+
theme(legend.position = "bottom")
topics_df
R
# Load necessary libraries
library(dplyr) # For data manipulation
library(broom) # For tidying data (not used explicitly in this code but useful in many workflows)
library(ggplot2) # For creating plots
library(ggpubr) # For arranging multiple plots into a single figure
# Accessing the Python dataframe (using reticulate) and converting it to an R dataframe
# `reticulate` bridges R and Python environments
topics_df <- reticulate::py$topic_words_scores_df
# Filter the dataframe to include only topics below 3 (i.e. -1, 0, 1, 2)
# Topic arrives from Python as a string, so convert it to numeric before comparing
topics_df$Topic <- as.numeric(topics_df$Topic)
topics_df8 <- topics_df[topics_df$Topic < 3, ]
# Create an empty list to store plots
# This list will hold the individual plots for each topic
plot_list <- list()
# Loop through each unique topic in the filtered dataframe
# Each iteration generates a plot for one topic
for (topic_id in unique(topics_df8$Topic)) {
# Filter the dataframe for the current topic
# This ensures that each plot uses only the relevant data
topic_data <- topics_df8[topics_df8$Topic == topic_id, ]
# Create the bar plot for the current topic
# The x-axis represents scores, and the y-axis represents words sorted by score
p <- ggplot(topic_data, aes(x = Score, y = reorder(Word, Score))) +
geom_bar(stat = "identity") + # `geom_bar` creates the bar chart
theme_bw() + # `theme_bw` applies a clean, minimal theme
theme(axis.title.x = element_blank(), # Remove x-axis title
axis.title.y = element_blank()) + # Remove y-axis title
ggtitle(paste("Topic", topic_id, ":", topic_title)) # Add a descriptive title to the plot
# Store the plot in the list, with the topic ID as the key
plot_list[[as.character(topic_id)]] <- p
}
# Arrange the stored plots in a grid layout
# `ncol` specifies the number of columns, and `nrow` specifies the number of rows
# `ggarrange` makes it easy to combine multiple plots into a single figure
ggarrange(plotlist = plot_list, ncol = 2, nrow = 2)
This is how we get topics per document:
Python
# Initialize a list to hold probabilities with document IDs
probs_with_outliers = []
# Use document_id from aggression_texts
document_ids = aggression_texts['document_id'].tolist()
# Iterate over topics, probabilities, and document IDs
for doc_id, topic, prob_row in zip(document_ids, topics, probs):
if topic == -1:
# Assign all probability to topic -1
prob_row_with_outlier = [doc_id] + list(prob_row) + [1] # Add document ID and probabilities
else:
# Calculate residual probability for topic -1
outlier_prob = max(0, 1 - sum(prob_row)) # Ensure non-negative
prob_row_with_outlier = [doc_id] + list(prob_row) + [outlier_prob]
probs_with_outliers.append(prob_row_with_outlier)
# Define columns, including document_id
topic_columns = [f"topic_{i}" for i in range(len(probs[0]))] + ["topic_-1"]
columns = ["document_id"] + topic_columns
# Convert to DataFrame
probs_df_with_outliers = pd.DataFrame(probs_with_outliers, columns=columns)
# Add winning topic
probs_df_with_outliers["winning_topic"] = probs_df_with_outliers[topic_columns].idxmax(axis=1)
# Get keywords for all topics
topic_keywords = {topic: topic_model.get_topic(topic) for topic in topic_model.get_topics()}
# Function to extract keywords for the winning topic
def get_keywords(topic):
topic_id = int(topic.split("_")[1]) # Extract numeric topic ID
# Format keywords as a comma-separated string
return ", ".join([word for word, _ in topic_keywords[topic_id]])
# Add keywords for the winning topic
probs_df_with_outliers["winning_topic_keywords"] = probs_df_with_outliers["winning_topic"].apply(get_keywords)
# Add winning topic probability
probs_df_with_outliers["winning_topics_prob"] = probs_df_with_outliers[topic_columns].max(axis=1)
# Merge with original text data
merged_df = pd.merge(probs_df_with_outliers, aggression_texts[["document_id","body", "name"]],how='left',on=['document_id'])
# Rearrange columns to have: document_id, winning_topic, winning_topic_keywords, winning_topics_prob, body, and then topic columns
cols = ["document_id", "winning_topic", "winning_topic_keywords", "winning_topics_prob", "name", "body", "topic_-1"] + [f"topic_{i}" for i in range(len(probs[0]))]
merged_df = merged_df[cols]
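Before subsetting, a quick sanity check on merged_df: counting how often each topic wins should match the frequency table shown above.

Python
# Number of documents won by each topic (topic_-1 collects the outliers)
print(merged_df["winning_topic"].value_counts().head(10))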
Python
pmq_boris = merged_df[merged_df["name"] == "Boris Johnson"]
cols = ["document_id", "winning_topic", "winning_topic_keywords", "winning_topics_prob", "name", "body", "topic_-1"] + [f"topic_{i}" for i in range(0,10)]
pmq_boris = pmq_boris[cols]
# Convert to long format
long_format = pmq_boris.melt(
id_vars=["document_id", "winning_topic"], # Columns to keep as identifiers
value_vars=[col for col in pmq_boris.columns if col.startswith("topic_")], # Columns to unpivot
var_name="topic", # Name of the resulting topic column
value_name="importance" # Name of the resulting value column
)
# Sort for better readability (optional)
long_format = long_format.sort_values(by=["document_id", "topic"]).reset_index(drop=True)
# Function to get the top three words for a given topic
def get_top_words(topic):
# Extract the topic number
topic_number = int(topic.replace('topic_', '').replace('_', '-')) # Handles cases like 'topic_-1'
# Get the top three words for the topic
if topic_number in topic_keywords:
return ', '.join(word for word, _ in topic_keywords[topic_number][:3])
return None # Return None if the topic is not in the dictionary
# Apply the function to the 'topic' column
long_format['top_words'] = long_format['topic'].apply(get_top_words)
Python
# Step 1: Sort by document_id to prepare for group-wise operation
result_df = long_format.sort_values(['document_id', 'importance'], ascending=[True, False])
# Step 2: Within each document_id, identify duplicates and add small increments
result_df['is_dup_forward'] = result_df.duplicated(['document_id', 'importance'])
result_df['is_dup_backward'] = result_df.duplicated(['document_id', 'importance'], keep='last')
# Combine both to detect any kind of duplicate
result_df['is_dup'] = result_df['is_dup_forward'] | result_df['is_dup_backward']
# Step 3: Generate row numbers within each group to apply tiny increments
result_df['row_number'] = result_df.groupby('document_id').cumcount() + 1
# Step 4: Apply increment only to duplicates
result_df['importance_adjusted'] = result_df['importance']
result_df.loc[result_df['is_dup'], 'importance_adjusted'] += result_df['row_number'] * 1e-6
# Step 5: Rank by adjusted importance (descending), ties get minimum rank
result_df['rank'] = result_df.groupby('document_id')['importance_adjusted'] \
.rank(ascending=False, method='min')
# Optional: drop the intermediate columns if not needed
reshaped_df2 = result_df.drop(columns=['is_dup_forward', 'is_dup_backward', 'is_dup', 'row_number'])
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
reshaped_df2 <- reticulate::py$reshaped_df2
#Creating a graph with the topics
topics_lda_keys<-ggplot(reshaped_df2, aes(y = reorder(document_id, -document_id), x = as.numeric(rank))) +
geom_tile(aes(fill = importance))+
scale_fill_viridis_c()+
geom_label(aes(y = reorder(document_id, -document_id), x=as.numeric(rank), label=topic), fill="white", size=3)+
scale_x_continuous(breaks = seq(min(reshaped_df2$rank), max(reshaped_df2$rank))) + # Ensure all ranks are shown
ylab("Speech")+ xlab("Top Topics")+
theme(legend.position = "bottom")
topics_lda_keys
Python
# Step 1: Rank within each document_id, descending order, ties.method = "min"
result_df['rank'] = result_df.groupby('document_id')['importance'].rank(ascending=False, method='min')
# Step 2: Apply regex to format `top_words` as `terms`
# Add a newline after the second comma-separated word
result_df['terms'] = result_df['top_words'].str.replace(r'^([^,]+,[^,]+),', r'\1,\n', regex=True)
# Final result
reshaped_df2 = result_df
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
reshaped_df2 <- reticulate::py$reshaped_df2
#Creating a graph with the topics
topics_lda_keys<-ggplot(reshaped_df2, aes(y = reorder(document_id, -document_id), x = as.numeric(rank))) +
geom_tile(aes(fill = importance))+
scale_fill_viridis_c()+
geom_label(aes(y = reorder(document_id, -document_id), x=as.numeric(rank), label=terms), fill="white", size=2)+
scale_x_continuous(breaks = seq(min(reshaped_df2$rank), max(reshaped_df2$rank))) + # Ensure all ranks are shown
ylab("Speech")+ xlab("Top Topics")+
theme(legend.position = "bottom")
topics_lda_keys
Let us look at a text and its associated topic (where the winning topic is not -1):
For example, speech 6113 has the following winning keywords: parliament, referendum, treaty, viii, henry, clauses, osce, passerelle, stunt, vote.
This speech reads as follows:
Johnson’s speech - 6113
We have been clear that the Government do not agree, as I have said previously to the House, with the recent changes to US immigration policy, and that that is not the approach the UK would take.
The resulting dataframe of per-document topic probabilities looks like this:
document_id | winning_topic | winning_topic_keywords | winning_topics_prob | name | body | topic_-1 | … |
---|---|---|---|---|---|---|---|
1 | topic_87 | legal, aid, cases, advice, court, system, law, lawyer, solicitors, litigants | 0.16797 | Mr Gerry Bermingham | Does the Minister agree that if one does not provi... | 0.043965 | … |
2 | topic_22 | fishing, fishermen, fish, fisheries, industry, sea, marine, conservation, vessels, fishermens | 1.0 | Richard Benyon | My right honourable Friend will know that there is... | 0.0 | … |
3 | topic_178 | parking, stevenage, town, hire, taxi, charges, centre, borough, regeneration, local | 1.0 | Penny Mordaunt | I congratulate my honourable Friend on his campaig... | 0.0 | … |
4 | topic_-1 | the, to, of, that, and, in, is, for, it, not | 0.260586 | Gerry Sutcliffe | We have had an interesting debate on an important ... | 0.260586 | … |

(The individual probability columns topic_0 through topic_239 are omitted here for readability.)
This is how we extract the top 3 topics per document:
Python
# Extract only topic probability columns (columns starting with "topic_")
topic_columns = [col for col in merged_df.columns if col.startswith("topic_")]
# Find the top 3 topics and their probabilities for each document
top_topics = merged_df[topic_columns].apply(
    lambda row: list(row.nlargest(3).items()), axis=1
)
# Add the top 3 topics and probabilities to `merged_df`
merged_df["top_1"] = [t[0][0] for t in top_topics] # First topic
merged_df["prob_1"] = [t[0][1] for t in top_topics] # Probability of first topic
merged_df["top_2"] = [t[1][0] for t in top_topics] # Second topic
merged_df["prob_2"] = [t[1][1] for t in top_topics] # Probability of second topic
merged_df["top_3"] = [t[2][0] for t in top_topics] # Third topic
merged_df["prob_3"] = [t[2][1] for t in top_topics] # Probability of third topic
# Reorganize the DataFrame if needed
cols = ["document_id", "winning_topic", "winning_topic_keywords", "winning_topics_prob", "body",
"top_1", "prob_1", "top_2", "prob_2", "top_3", "prob_3"]
merged_df2 = merged_df[cols]
The resulting top three topics per document look like this:
document_id | winning_topic | winning_topic_keywords | winning_topics_prob | body | top_1 | prob_1 | top_2 | prob_2 | top_3 | prob_3 |
---|---|---|---|---|---|---|---|---|---|---|
1 | topic_87 | legal, aid, cases, advice, court, system, law, lawyer, solicitors, litigants | 0.16797 | Does the Minister agree that if one does not provi... | topic_87 | 0.16797 | topic_-1 | 0.043965 | topic_20 | 0.034134 |
2 | topic_22 | fishing, fishermen, fish, fisheries, industry, sea, marine, conservation, vessels, fishermens | 1.0 | My right honourable Friend will know that there is... | topic_22 | 1.0 | topic_72 | 1.187003e-306 | topic_61 | 1.158053e-306 |
3 | topic_178 | parking, stevenage, town, hire, taxi, charges, centre, borough, regeneration, local | 1.0 | I congratulate my honourable Friend on his campaig... | topic_178 | 1.0 | topic_85 | 4.395413e-306 | topic_116 | 3.750731e-306 |
4 | topic_-1 | the, to, of, that, and, in, is, for, it, not | 0.260586 | We have had an interesting debate on an important ... | topic_-1 | 0.260586 | topic_121 | 0.141773 | topic_144 | 0.014484 |
Let us look at a text and its associated topic:
For example, the keywords associated with topic 2 are: rail, transport, network, road, london, trains, line, railway, railways, roads.
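For reference, a topic's keywords can be retrieved directly from the fitted model. Below is a minimal sketch, assuming the fitted BERTopic model is named `topic_model` (the name is an assumption):
Python
# Retrieve the top keywords and their c-TF-IDF scores for topic 2
keywords = [word for word, score in topic_model.get_topic(2)]
print(keywords)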
The speech that is associated with these keywords is the following:
Speech Keywords - rail, transport, network, road, london, trains, line, railway, railways, roads
I am encouraged by the honourable Gentleman’s anger. His intervention bears no relationship to the reality in Britain or to reality for our European Union partners. His comment was important. I hope that it will be well circulated to his constituents. It is clear that this whole process is in a state of disarray. I welcomed the new Secretary of State for Transport. His main challenge is to persuade a number of his colleagues in the Cabinet that the whole process should be dumped. I want to finish on two points that sum up this fiasco. I am sure that the honourable Member for Ross, Cromarty and Skye will deal with the issue of sleeper services to Scotland. The Government and the British Railways Board were dragged into the courts in Scotland to be exposed for what they had shown to Scots - utter contempt for their rail network. They thought that, through stealth and the incompetence of the director of franchising, Mr. Roger Salmon, they could close services under the guise of consultation on a draft passenger service requirement. They were found out. Is it not a disgrace that railway policy is now going to be partly dictated in the courts of England and Scotland? That is a fiasco and the Government know it. The second issue is trans-European networks. It is an important issue for Britain because it involves integrating the major routes in Britain with those in Europe. It is absolutely right that that should be done. What did Ministers do when they went to Brussels? They sabotaged the proposals. We find that the European Union is giving no priority to schemes. It has thrown out environmental impact assessments and of course no cash is being provided by the European Union for such developments.
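One way a representative speech like this can be located is by taking the document with the highest probability for the topic. Below is a minimal sketch; selecting by the `topic_2` probability column of the merged dataframe is an assumption:
Python
# Pick the document with the highest probability for topic_2
best_idx = merged_df["topic_2"].idxmax()
print(merged_df.loc[best_idx, "body"])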
Let us try to plot this topic over time:
We first merge the original texts into the dataframe with our calculations:
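The merge itself might look like the following minimal sketch; the `date` column on the original texts is an assumption, since the relevant metadata is not shown here:
Python
import pandas as pd

# Join the top-topic summary back onto the original texts by document_id
final = merged_df2.merge(
    aggression_texts[["document_id", "date"]],
    on="document_id",
    how="left"
)
# Derive the year used for the grouping below (assumes a parseable date column)
final["year"] = pd.to_datetime(final["date"]).dt.year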
We can then compute the yearly prevalence of the transport topic (topic_2) and plot it over time:
Python
# Loading Relevant libraries
import numpy as np
import scipy.stats as stats
# Add a binary column indicating if the winning_topic is "topic_2"
final["winning_topic_2"] = (final["winning_topic"] == "topic_2").astype(int)
# Group by year and calculate total speeches, total topic occurrences, and proportion for topic 2
trends = (
    final
    .groupby("year", as_index=False)
    .agg(
        total_speeches=('document_id', 'count'),      # Total number of speeches
        total_topic=('winning_topic_2', 'sum'),       # Total speeches for "topic_2"
        proportion_topic=('winning_topic_2', 'mean')  # Mean proportion of "topic_2"
    )
)
# Calculate standard deviation (SD) of the proportion for topic 2
trends["sd_topic"] = final.groupby("year")["winning_topic_2"].std().values
# Use total speeches as the sample size (n)
trends["n"] = trends["total_speeches"]
# Calculate standard error (SE)
trends["se"] = trends["sd_topic"] / np.sqrt(trends["n"])
# Calculate confidence intervals for the proportion of topic 2
confidence_level = 0.95
t_value = stats.t.ppf(1 - (1 - confidence_level) / 2, df=trends["n"] - 1)
trends["ci_lower"] = trends["proportion_topic"] - t_value * trends["se"]
trends["ci_upper"] = trends["proportion_topic"] + t_value * trends["se"]
We now use reticulate to plot it over time.
R
library(reticulate)
library(ggplot2)

# Access the Python `trends` data frame from R
trends2 <- reticulate::py$trends
ggplot(trends2, aes(x = year, y = proportion_topic)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = 0.2) +
  labs(
    title = "Topic over Time",
    x = "Year",
    y = "Topic Prevalence"
  ) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(trends2$year), max(trends2$year), by = 1)) +
  geom_hline(yintercept = 0)
As we saw earlier, almost 50% of the documents are assigned to the outlier topic (topic_-1).
The BERTopic documentation provides four different strategies for dealing with these outliers, implemented in `reduce_outliers`: reassignment based on topic-document probabilities, on topic distributions, on c-TF-IDF representations, or on document and topic embeddings.
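A minimal sketch of applying one of these strategies, assuming a fitted model named `topic_model` and the `docs` and `topics` returned by its `fit_transform` call (the variable names are assumptions):
Python
# Reassign outlier documents (topic_-1) using the c-TF-IDF strategy
new_topics = topic_model.reduce_outliers(docs, topics, strategy="c-tf-idf")
# Update the topic representations to reflect the new assignments
topic_model.update_topics(docs, topics=new_topics)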
BERTopic uses clustering to define topics: each document is assigned to at most one topic.
Documents, however, often contain a mixture of topics. One way to account for this is to split documents into sentences and feed those to BERTopic, as sketched below.
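A minimal sketch of that sentence-level workaround; the use of NLTK's `sent_tokenize` is an assumption, as any sentence splitter would do:
Python
from bertopic import BERTopic
from nltk.tokenize import sent_tokenize  # requires nltk's 'punkt' data

# Split every document into sentences and fit the model on those instead
sentences = [s for doc in docs for s in sent_tokenize(doc)]
sentence_model = BERTopic()
sentence_topics, sentence_probs = sentence_model.fit_transform(sentences)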
Alternatively, BERTopic can approximate a full topic distribution for each document. Each document is split into tokens according to the tokenizer provided in the CountVectorizer. Then a sliding window is applied to each document, creating overlapping subsets of the document from which a topic distribution is approximated.
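This procedure is exposed through the model's `approximate_distribution` method; a minimal sketch, with the `window` and `stride` values purely illustrative:
Python
# Approximate a per-document topic distribution with a sliding token window
topic_distr, _ = topic_model.approximate_distribution(docs, window=8, stride=4)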
Traditional word representations, like one-hot encoding, are unable to capture relationships between words.
Word embeddings address these limitations by mapping each word to a dense, low-dimensional vector built from the contexts in which the word appears.
Dense representations enable tasks such as measuring word similarity and classifying texts.
Modern techniques, such as dynamic embeddings (e.g., BERT), overcome the limitations of static embeddings by adapting to context.
Popescu (JCU): Lecture 22