Unsupervised Learning: Word Embeddings

Bogdan G. Popescu

John Cabot University

Transitioning from LDA to Word Representations

About LDA:

  • Captures topics by identifying co-occurrence patterns of words in documents.
  • Words are grouped into topics based on shared context, implying a notion of semantic similarity.
  • LDA treats words as discrete entities (bags of words), without considering intrinsic relationships between them.

The GAP:

  • While LDA identifies topics, it does not encode similarity between individual words.
  • For example, words like king and queen may co-occur in the same topics but lack a direct quantitative measure of similarity.

Importance of Word Embeddings

  • We need dense word representations that embed similarity directly into the vectors.

Limitations of One-Hot Encoding

How words have been represented so far:

  • One-hot encoding: Words as sparse, high-dimensional vectors.
  • Example:
    • “cat” = \([1,0,0,...0]\), “dog” = \([0,1,0,...0]\)

Problems with Sparse Representations:

  • No Sense of Similarity
    • The dot product or cosine similarity between any two distinct word vectors is always zero.
    • Fails to capture relationships between words (e.g., “cat” and “dog” appear completely unrelated).

\[ \cos(\theta) = \frac{\mathbf{w}_{\text{cat}} \cdot \mathbf{w}_{\text{dog}}}{\left\lVert \mathbf{w}_{\text{cat}} \right\rVert \, \left\lVert \mathbf{w}_{\text{dog}} \right\rVert} = 0 \]

  • High Dimensionality
    • Vector length equals the vocabulary size, leading to very large, mostly empty vectors.

This means that models cannot learn or infer word relationships like synonyms, analogies, or context.
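A quick check confirms this (a minimal sketch, assuming a toy three-word vocabulary):

Python
import numpy as np

# Toy vocabulary: "cat", "dog", "fish" -> one-hot vectors
w_cat = np.array([1, 0, 0])
w_dog = np.array([0, 1, 0])

# Cosine similarity: dot product divided by the product of the vector norms
cos_sim = w_cat @ w_dog / (np.linalg.norm(w_cat) * np.linalg.norm(w_dog))
print(cos_sim)  # 0.0 -- distinct one-hot vectors are always orthogonal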

Sparse Word Representations Problems

1. Similarity

  • Documents may have zero term overlap yet have almost identical meanings.
  • E.g.: “The royal heir ascended the throne.” vs. “The monarch’s child became the ruler.”

2. Classification

  • We may know that one term is connected to a concept,
  • but sparse representations fail to reveal connections to related terms.
    • E.g.: If we learn that ‘climate change’ is predictive of the concept of environmental policy, shouldn’t we also infer something about terms like ‘carbon emissions’, ‘renewable energy’, and ‘Paris Agreement’?
  • Sparse Representations:
    • Words like ‘climate change’ and ‘renewable energy’ are treated as unrelated, despite their clear conceptual overlap.
  • Dense Representations:
    • Embeddings place related terms closer in vector space, allowing us to see the connections between ‘climate change’ and ‘renewable energy’ or ‘Paris Agreement’.

Word Embeddings: A New Paradigm

What are word embeddings?

  • Represent words as dense vectors in a lower-dimensional space.
  • Similar words have similar vector representations.

How embeddings solve the limitations:

  • Capture similarity
    • Words like king and queen will have vectors that are close in this new space.
  • Efficiency
    • Dense representations significantly reduce dimensionality.

Distributional Semantics

Distributional Semantics means that the meaning of a word can be derived from the distribution of contexts in which it appears.

The hypothesis implies that words that appear in similar “contexts” will share similar meanings.

When a word \(j\) appears in a text, its “context” is the set of words that appear nearby (within a fixed-size window).

We use the many contexts of \(w\) to build up a representation of \(w\).

Pre | Keyword | Post
the sacrifices made to secure our | freedom | will never be forgotten.
ensure that every citizen enjoys the | freedom | to speak their mind without fear.
we must defend the values of democracy and | freedom | against all forms of tyranny.
the right to pursue life, liberty, and | freedom | is fundamental to our society.
not everyone in the world experiences the | freedom | we often take for granted.
policies are designed to protect the | freedom | of individuals while ensuring equality.
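To make the idea concrete, here is a minimal sketch of collecting such fixed-size context windows (the window size of 2 and the example sentence are arbitrary choices for illustration):

Python
def context_window(tokens, target, size=2):
    """Collect the words within `size` positions of each occurrence of `target`."""
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - size):i]
            right = tokens[i + 1:i + 1 + size]
            contexts.append(left + right)
    return contexts

sentence = "we must defend the values of democracy and freedom against all forms of tyranny".split()
print(context_window(sentence, "freedom", size=2))
# [['democracy', 'and', 'against', 'all']]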

Word Embedding Overview

The meaning of each word is based on the distribution of terms with which it co-occurs

We represent this meaning using a vector for each word

Vectors are constructed such that similar words are close to each other in “semantic” space

We build this space automatically by seeing which words are close to one another in texts

Dense Representations of Words

Our goal is to build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts (measuring similarity as the dot product)

\[ \begin{align} w_{\text{cat}} &= \begin{bmatrix} 0.73 \\ 0.04 \\ 0.07 \\ -0.18 \\ 0.81 \\ -0.97 \end{bmatrix} \end{align} \]

\[ \begin{align} w_{\text{dog}} &= \begin{bmatrix} 0.63 \\ 0.14 \\ 0.02 \\ -0.58 \\ 0.43 \\ -0.66 \end{bmatrix} \end{align} \]

These representations are known as word embeddings because we “embed” words into a low-dimensional space (low compared to the vocabulary size).
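As a quick check (using the toy numbers above, which are purely illustrative), the cosine similarity between these two dense vectors is high:

Python
import numpy as np

w_cat = np.array([0.73, 0.04, 0.07, -0.18, 0.81, -0.97])
w_dog = np.array([0.63, 0.14, 0.02, -0.58, 0.43, -0.66])

# Unlike one-hot vectors, dense vectors give a non-zero, graded similarity
cos_sim = w_cat @ w_dog / (np.linalg.norm(w_cat) * np.linalg.norm(w_dog))
print(round(cos_sim, 2))  # about 0.9: "cat" and "dog" are close in this space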

Word Embeddings Advantages

Low-dimensional word embeddings offer some advantages:

  1. Encode Similarity between Words
  • Each word is a vector, with the vectors of similar words closer together than vectors of very different words
  • We no longer have word similarities of zero.
  2. Facilitating Generalization
  • Embeddings enable models to generalize across words automatically.
    • For instance, if the word “amazing” is identified as a strong predictor of positive sentiment, but the word “outstanding” doesn’t appear in the training data, their similar embeddings allow the model to infer that “outstanding” likely has a similar sentiment.
  3. Quantifying Word Meaning
  • Embeddings provide a numerical way to assess the meaning of words.
  • Two words are considered to share a meaning if they appear in similar contexts, capturing this shared meaning through their proximity in vector space.

Characteristics of Word Embeddings

Training Data Requirements

  • High-quality embeddings require large amounts of diverse training data.
  • Typically trained on extensive external corpora (e.g., Wikipedia, news articles, web pages).
  • Embeddings reflect the language and biases present in the training data.

Context Window Size

  • Since a word’s meaning is influenced by its context, we must define what “context” means.
  • Context is typically implemented as a symmetric window of a specified size around each word.
  • The window size determines the type of relationships captured:
    • Small window: Focuses on immediate syntactic relationships
    • Large window: Captures broader semantic and topical associations.

Dimension of the Embedding

  • The embedding for each word will typically be between 50 and 500 elements long
  • The embedding encodes information about the contexts a word appears in, so in theory larger embeddings are able to encode more information

Traditional Word Embeddings

A word embedding is a dense vector representation of a word in a continuous vector space, where similar words are placed closer together.

  • Words that frequently appear in similar contexts have similar embeddings.
  • Example: Words like king, queen, and monarch would be close in the embedding space.

However, there are some limitations:

  • Static: Each word has one fixed embedding, regardless of context (bank in riverbank vs. financial bank).
  • Context-agnostic: Struggles with polysemy (multiple meanings for a single word).

Moving Beyond Traditional Embeddings?

Static Representations fall short:

  • Traditional embeddings ignore sentence-level context.
  • For example:
    • “The bank of the river is beautiful.”
    • “I deposited money at the bank.”
    • Bank has the same vector in both cases, even though the meanings are different.
  • Models like BERT address this limitation by generating dynamic word representations based on their context.
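A minimal sketch of this idea with the Hugging Face transformers library (assuming the bert-base-uncased checkpoint is available; the exact similarity value depends on the model version):

Python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the contextual embedding of the token "bank" in the sentence
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_river = bank_vector("The bank of the river is beautiful.")
v_money = bank_vector("I deposited money at the bank.")
sim = torch.cosine_similarity(v_river, v_money, dim=0).item()
print(round(sim, 2))  # noticeably below 1: the two "bank" vectors differ by context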

BERTopic

BERTopic is a topic modelling technique that leverages transformers and c-TF-IDF to create clusters for easily interpretable topics.

What are Transformers?

  • A neural network architecture that excels at understanding relationships between words in context
  • They generate dynamic embeddings for each document based on the context of words within it.
  • Embeddings capture semantic nuances, allowing words with multiple meanings (e.g., “bank”) to be represented accurately.

What is TF-IDF?

  • A method that weights terms by how frequently they appear in a document and how rare they are across the corpus.

What is c-TF-IDF?

  • Instead of treating each document independently, it treats clusters of documents as “pseudo-documents.”
  • Calculates term importance for the entire cluster, emphasizing words that uniquely define a topic.
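A simplified sketch of the idea (not BERTopic’s exact weighting scheme; the two clusters and their documents below are invented for illustration):

Python
from collections import Counter
import math

# Invented clusters of documents, purely for illustration
clusters = {
    0: ["tax cuts for business", "the chancellor raised tax"],
    1: ["nhs hospital care", "patients wait for nhs care"],
}

# Aggregate each cluster into a single pseudo-document and count its terms
tf = {c: Counter(" ".join(docs).split()) for c, docs in clusters.items()}

# Weight each term by an inverse document frequency computed across clusters,
# so words shared by many clusters (like "for") are down-weighted
n_clusters = len(clusters)
df = Counter(term for counts in tf.values() for term in counts)
ctfidf = {
    c: {t: f * math.log(1 + n_clusters / df[t]) for t, f in counts.items()}
    for c, counts in tf.items()
}

print(sorted(ctfidf[1].items(), key=lambda kv: -kv[1])[:3])  # "nhs" and "care" dominate cluster 1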

BERTopic

How c-TF-IDF Works in BERTopic

  • Create Clusters: Documents are clustered based on their transformer-generated embeddings.
  • Aggregate Documents: Documents within each cluster are combined into a single “pseudo-document.”
  • Calculate c-TF-IDF: Determines the most representative keywords for each cluster, making topics interpretable.

Why This Combination Works

  • Transformers: Capture deep, contextual relationships between words in documents.
  • c-TF-IDF: Extracts interpretable topics by highlighting the most relevant terms in a cluster.
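Because BERTopic is modular, each of these pieces can also be specified explicitly. Here is a sketch of how the components plug together (the particular models chosen are illustrative; the analysis below simply uses the defaults plus a fixed-seed UMAP):

Python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # transformer document embeddings
umap_model = UMAP(n_components=5, random_state=42)         # dimensionality reduction
hdbscan_model = HDBSCAN(min_cluster_size=10)               # clustering into topics
vectorizer_model = CountVectorizer(stop_words="english")   # term counts used by c-TF-IDF

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
)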
Python
import pandas as pd
# Note: bertopic can run into dependency conflicts (e.g. with numpy/scipy/thinc versions).
# If the import fails, pinning numpy and removing conflicting packages can help, e.g.:
#   pip install --upgrade numpy==1.26
#   pip uninstall numpy scipy blis thinc
#   pip uninstall spacy gensim scipy thinc -y
from bertopic import BERTopic

Reading the Speeches

The next section reads the relevant documents.

Python
import pandas as pd
# Load the aggression_texts CSV and extract the 'body' column
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
aggression_texts['document_id'] = range(1, len(aggression_texts) + 1)

We now initialize the Bert model

Python
# The following two lines will remove some annoying message in Bert: "huggingface/tokenizers: The current process just got forked..."
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Importing Bert
from bertopic import BERTopic
# Create a UMAP instance with a fixed random state for reproducibility
from umap import UMAP
umap_model = UMAP(random_state=42)
# Initialize the BERTopic model with probability calculation enabled and the UMAP model specified
topic_model = BERTopic(calculate_probabilities=True,umap_model=umap_model, verbose=True)
topics, probs = topic_model.fit_transform(aggression_texts["body"])

Batches: 100%|##########| 896/896 [01:32<00:00,  9.72it/s]

Examining the Main Topics

Let us now look at the main topics and their representation:

Python
# Retrieve topic information and select specific columns: 'Topic', 'Count', 'Name', and 'Representation'
topic_info = topic_model.get_topic_info()[['Topic', 'Count', 'Name', 'Representation']]

# Create a simplified version of the topic information, retaining only 'Topic', 'Count', and 'Representation' columns
topic_info_simple = topic_info[["Topic", 'Count', "Representation"]]

# Modify the 'Representation' column to remove duplicate words while preserving their order
# This uses a dictionary to enforce uniqueness and joins the keys (unique words) back into a string
topic_info_simple['Representation'] = topic_info_simple['Representation'].apply(lambda x: ' '.join(dict.fromkeys(x).keys()))

# Convert the simplified topic information into a DataFrame for easier analysis or visualization
frequency_table = pd.DataFrame(topic_info_simple)
frequency_table.head(7)

Topic | Count | Representation
-1 | 11169 | the to of that and in is for it not
0 | 2096 | tax chancellor the government that we is businesses in business
1 | 1330 | health nhs care patients hospital service hospitals services patient that
2 | 551 | rail transport network road london trains line railway railways roads
3 | 520 | ireland northern agreement irish ira sinn parties fein belfast decommissioning
4 | 456 | energy climate carbon fuel gas emissions wind change electricity efficiency
5 | 398 | police officers crime policing home chief metropolitan force constable constables

Recording the Word Scores

Python
# Initialize an empty list to store rows
topic_words_scores = []

# Loop through each topic and extract its word scores
for topic_id in topic_info_simple["Topic"]:
    words_scores = topic_model.get_topic(topic_id)  # Get words and scores for the topic
    if words_scores:  # Ensure there are words for the topic
        for word, score in words_scores:
            topic_words_scores.append({"Topic": topic_id, "Word": word, "Score": score})

# Convert the list into a DataFrame
topic_words_scores_df = pd.DataFrame(topic_words_scores)

Recording the Word Scores

Python
# Step 1: Group by 'Topic' and rank terms by descending 'Score'
topic_words_scores_df['rank'] = topic_words_scores_df.groupby('Topic')['Score'] \
    .rank(method='min', ascending=False)

# Step 2: Filter only the first 11 topics: "-1" to "10"
# First, make sure Topic is treated as a string (if needed)
topic_words_scores_df['Topic'] = topic_words_scores_df['Topic'].astype(str)

selected_topics = [str(i) for i in range(-1, 11)]
reshaped_df2 = topic_words_scores_df[topic_words_scores_df['Topic'].isin(selected_topics)]
R
# Load necessary libraries
library(dplyr)      # For data manipulation
library(broom)      # For tidying data (not used explicitly in this code but useful in many workflows)
library(ggplot2)    # For creating plots
library(ggpubr)     # For arranging multiple plots into a single figure
library(forcats)  # for fct_rev


reshaped_df2 <- reticulate::py$reshaped_df2

#Creating a graph with the topics
topics_df<-ggplot(reshaped_df2, aes(y = fct_rev(as.factor(Topic)), x = as.numeric(rank))) +   
  geom_tile(aes(fill = Score))+
  scale_fill_viridis_c()+
  geom_label(aes(y = fct_rev(as.factor(Topic)), x=rank, label=Word), fill="white", size=3)+
  scale_x_continuous(breaks = seq(min(reshaped_df2$rank), max(reshaped_df2$rank))) +  # Ensure all ranks are shown
  ylab("Topic")+ xlab("Top 10 words")+
  theme(legend.position = "bottom")
topics_df                     

Recording the Word Scores

R
# Load necessary libraries
library(dplyr)      # For data manipulation
library(broom)      # For tidying data (not used explicitly in this code but useful in many workflows)
library(ggplot2)    # For creating plots
library(ggpubr)     # For arranging multiple plots into a single figure

# Accessing the Python dataframe (using reticulate) and converting it to an R dataframe
# `reticulate` bridges R and Python environments
topics_df <- reticulate::py$topic_words_scores_df

# Filter the dataframe to include only topics less than 3
# This creates a subset of data for visualization
topics_df8 <- topics_df[as.numeric(topics_df$Topic) < 3, ]

# Create an empty list to store plots
# This list will hold the individual plots for each topic
plot_list <- list()

# Loop through each unique topic in the filtered dataframe
# Each iteration generates a plot for one topic
for (topic_id in unique(topics_df8$Topic)) {
  
  # Filter the dataframe for the current topic
  # This ensures that each plot uses only the relevant data
  topic_data <- topics_df8[topics_df8$Topic == topic_id, ]
  
  # Create the bar plot for the current topic
  # The x-axis represents scores, and the y-axis represents words sorted by score
  p <- ggplot(topic_data, aes(x = Score, y = reorder(Word, Score))) +
    geom_bar(stat = "identity") +  # `geom_bar` creates the bar chart
    theme_bw() +                   # `theme_bw` applies a clean, minimal theme
    theme(axis.title.x = element_blank(),   # Remove x-axis title
          axis.title.y = element_blank()) + # Remove y-axis title
    ggtitle(paste("Topic", topic_id))       # Add a descriptive title to the plot
  
  # Store the plot in the list, with the topic ID as the key
  plot_list[[as.character(topic_id)]] <- p
}

# Arrange the stored plots in a grid layout
# `ncol` specifies the number of columns, and `nrow` specifies the number of rows
# `ggarrange` makes it easy to combine multiple plots into a single figure
ggarrange(plotlist = plot_list, ncol = 2, nrow = 2)
(Output: a grid of bar charts showing the top-scoring words for each selected topic.)

Topics per Class

This is how we get topics per document:

Python
# Initialize a list to hold probabilities with document IDs
probs_with_outliers = []

# Use document_id from aggression_texts
document_ids = aggression_texts['document_id'].tolist()

# Iterate over topics, probabilities, and document IDs
for doc_id, topic, prob_row in zip(document_ids, topics, probs):
    if topic == -1:
        # Assign all probability to topic -1
        prob_row_with_outlier = [doc_id] + list(prob_row) + [1]  # Add document ID and probabilities
    else:
        # Calculate residual probability for topic -1
        outlier_prob = max(0, 1 - sum(prob_row))  # Ensure non-negative
        prob_row_with_outlier = [doc_id] + list(prob_row) + [outlier_prob]
    probs_with_outliers.append(prob_row_with_outlier)

# Define columns, including document_id
topic_columns = [f"topic_{i}" for i in range(len(probs[0]))] + ["topic_-1"]
columns = ["document_id"] + topic_columns

# Convert to DataFrame
probs_df_with_outliers = pd.DataFrame(probs_with_outliers, columns=columns)

# Add winning topic
probs_df_with_outliers["winning_topic"] = probs_df_with_outliers[topic_columns].idxmax(axis=1)

# Get keywords for all topics
topic_keywords = {topic: topic_model.get_topic(topic) for topic in topic_model.get_topics()}

# Function to extract keywords for the winning topic
def get_keywords(topic):
    topic_id = int(topic.split("_")[1])  # Extract numeric topic ID
    # Format the keywords as a comma-separated string
    return ", ".join([word for word, _ in topic_keywords[topic_id]])

# Add keywords for the winning topic
probs_df_with_outliers["winning_topic_keywords"] = probs_df_with_outliers["winning_topic"].apply(get_keywords)

# Add winning topic probability
probs_df_with_outliers["winning_topics_prob"] = probs_df_with_outliers[topic_columns].max(axis=1)

# Merge with original text data
merged_df = pd.merge(probs_df_with_outliers, aggression_texts[["document_id","body", "name"]],how='left',on=['document_id'])

# Rearrange columns to have: document_id, winning_topic, winning_topic_keywords, winning_topics_prob, body, and then topic columns
cols = ["document_id", "winning_topic", "winning_topic_keywords", "winning_topics_prob", "name", "body", "topic_-1"] + [f"topic_{i}" for i in range(len(probs[0]))]
merged_df = merged_df[cols]
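As a quick check (a minimal sketch reusing the objects defined above), we can preview the merged dataframe and see how much probability mass each document carries across the topic columns:

Python
# Preview: one row per document, with its winning topic, probability, and speaker
print(merged_df[["document_id", "winning_topic", "winning_topics_prob", "name"]].head())

# Total probability mass per document across all topic columns (including topic_-1)
print(merged_df[topic_columns].sum(axis=1).head())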

Top Topics in Boris Johnson's Speeches

Python
pmq_boris = merged_df[merged_df["name"] == "Boris Johnson"]
cols = ["document_id", "winning_topic", "winning_topic_keywords", "winning_topics_prob", "name", "body", "topic_-1"] + [f"topic_{i}" for i in range(0,10)]
pmq_boris = pmq_boris[cols]
# Convert to long format
long_format = pmq_boris.melt(
    id_vars=["document_id", "winning_topic"],  # Columns to keep as identifiers
    value_vars=[col for col in pmq_boris.columns if col.startswith("topic_")],  # Columns to unpivot
    var_name="topic",  # Name of the resulting topic column
    value_name="importance"  # Name of the resulting value column
)

# Sort for better readability (optional)
long_format = long_format.sort_values(by=["document_id", "topic"]).reset_index(drop=True)

# Function to get the first three words for a given topic
def get_top_words(topic):
    # Extract the topic number
    topic_number = int(topic.replace('topic_', '').replace('_', '-'))  # Handles cases like 'topic_-1'
    # Get the first three words for the topic
    if topic_number in topic_keywords:
        return ', '.join(word for word, _ in topic_keywords[topic_number][:3])
    return None  # Return None if the topic is not in the dictionary

# Apply the function to the 'topic' column
long_format['top_words'] = long_format['topic'].apply(get_top_words)
Python
# Step 1: Sort by document_id to prepare for group-wise operation
result_df = long_format.sort_values(['document_id', 'importance'], ascending=[True, False])

# Step 2: Within each document_id, identify duplicates and add small increments
result_df['is_dup_forward'] = result_df.duplicated(['document_id', 'importance'])
result_df['is_dup_backward'] = result_df.duplicated(['document_id', 'importance'], keep='last')

# Combine both to detect any kind of duplicate
result_df['is_dup'] = result_df['is_dup_forward'] | result_df['is_dup_backward']

# Step 3: Generate row numbers within each group to apply tiny increments
result_df['row_number'] = result_df.groupby('document_id').cumcount() + 1

# Step 4: Apply increment only to duplicates
result_df['importance_adjusted'] = result_df['importance']
result_df.loc[result_df['is_dup'], 'importance_adjusted'] += result_df['row_number'] * 1e-6

# Step 5: Rank by adjusted importance (descending), ties get minimum rank
result_df['rank'] = result_df.groupby('document_id')['importance_adjusted'] \
                             .rank(ascending=False, method='min')

# Optional: drop the intermediate columns if not needed
reshaped_df2 = result_df.drop(columns=['is_dup_forward', 'is_dup_backward', 'is_dup', 'row_number'])
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
reshaped_df2 <- reticulate::py$reshaped_df2

#Creating a graph with the topics
topics_lda_keys<-ggplot(reshaped_df2, aes(y = reorder(document_id, -document_id), x = as.numeric(rank))) +   
  geom_tile(aes(fill = importance))+
  scale_fill_viridis_c()+
  geom_label(aes(y = reorder(document_id, -document_id), x=as.numeric(rank), label=topic), fill="white", size=3)+
  scale_x_continuous(breaks = seq(min(reshaped_df2$rank), max(reshaped_df2$rank))) +  # Ensure all ranks are shown
  ylab("Speech")+ xlab("Top Topics")+
  theme(legend.position = "bottom")
topics_lda_keys

Top Topics in Boris Johnson's Speeches

Python
# Step 1: Rank within each document_id, descending order, ties.method = "min"
result_df['rank'] = result_df.groupby('document_id')['importance'].rank(ascending=False, method='min')

# Step 2: Apply regex to format `top_words` as `terms`
# Add a newline after the second comma-separated word
result_df['terms'] = result_df['top_words'].str.replace(r'^([^,]+,[^,]+),', r'\1,\n', regex=True)
# Final result
reshaped_df2 = result_df
R
#
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
reshaped_df2 <- reticulate::py$reshaped_df2

#Creating a graph with the topics
topics_lda_keys<-ggplot(reshaped_df2, aes(y = reorder(document_id, -document_id), x = as.numeric(rank))) +   
  geom_tile(aes(fill = importance))+
  scale_fill_viridis_c()+
  geom_label(aes(y = reorder(document_id, -document_id), x=as.numeric(rank), label=terms), fill="white", size=2)+
  scale_x_continuous(breaks = seq(min(reshaped_df2$rank), max(reshaped_df2$rank))) +  # Ensure all ranks are shown
  ylab("Speech")+ xlab("Top Topics")+
  theme(legend.position = "bottom")
topics_lda_keys

Top Topics in Boris Johnson's Speeches

Let us look at a text and its associated topic (where the winning topic is not -1):

For example, speech 6113 has the following winning keywords: parliament, referendum, treaty, viii, henry, clauses, osce, passerelle, stunt, vote.

This speech reads as follows:

Johnson’s speech - 6113

We have been clear that the Government do not agree, as I have said previously to the House, with the recent changes to US immigration policy, and that that is not the approach the UK would take.

Topics per Class

The resulting dataframe contains one row per document:

document_id | winning_topic | winning_topic_keywords | winning_topics_prob | name | body
1 | topic_87 | legal, aid, cases, advice, court, system, law, lawyer, solicitors, litigants | 0.16797 | Mr Gerry Bermingham | Does the Minister agree that if one does not provi...
2 | topic_22 | fishing, fishermen, fish, fisheries, industry, sea, marine, conservation, vessels, fishermens | 1.0 | Richard Benyon | My right honourable Friend will know that there is...
3 | topic_178 | parking, stevenage, town, hire, taxi, charges, centre, borough, regeneration, local | 1.0 | Penny Mordaunt | I congratulate my honourable Friend on his campaig...

(The remaining columns, topic_-1 through topic_239, hold each document's per-topic probabilities and are not shown here.)
4 topic_-1 the, to, of, that, and, in, is, for, it, not 0.260586 Gerry Sutcliffe We have had an interesting debate on an important ... 0.260586 0.009011 0.001424 0.002068 0.001486 0.001222 0.001665 0.001296 0.001648 0.001023 0.00334 0.002041 0.001648 0.002552 0.00535 0.001813 0.001618 0.001337 0.001169 0.000923 0.001728 0.003805 0.00142 0.001006 0.004139 0.001417 0.000936 0.001513 0.000723 0.001103 0.009388 0.001852 0.003169 0.003982 0.001416 0.002206 0.001782 0.002033 0.001134 0.001011 0.001998 0.00185 0.002768 0.000948 0.001269 0.002264 0.00382 0.001832 0.000839 0.001211 0.003573 0.001525 0.003928 0.002326 0.001181 0.001053 0.006299 0.001612 0.001623 0.001354 0.001225 0.001864 0.001197 0.003159 0.002574 0.001 0.001818 0.001388 0.001092 0.005288 0.011239 0.005524 0.001184 0.00108 0.001026 0.002145 0.002265 0.001031 0.00566 0.001107 0.002257 0.000487 0.002218 0.001486 0.002557 0.001909 0.002271 0.001181 0.004721 0.003312 0.001345 0.002698 0.002689 0.000902 0.004267 0.007021 0.003671 0.006986 0.001231 0.002292 0.001846 0.001248 0.001553 0.000886 0.001587 0.011162 0.001106 0.00096 0.001294 0.002426 0.002079 0.001452 0.007451 0.00125 0.001443 0.002374 0.003035 0.00184 0.001848 0.001195 0.001191 0.003068 0.141773 0.004915 0.002372 0.001576 0.009627 0.00112 0.001409 0.001718 0.001209 0.002271 0.002674 0.002517 0.001489 0.002133 0.002578 0.001704 0.002597 0.003383 0.00282 0.001662 0.0021 0.002527 0.002659 0.014484 0.009743 0.000721 0.000953 0.005739 0.001371 0.00221 0.001026 0.002638 0.002165 0.002815 0.003589 0.001064 0.002034 0.003665 0.001292 0.004045 0.001478 0.003878 0.000947 0.002691 0.001233 0.001003 0.00283 0.001946 0.00109 0.001701 0.001378 0.001484 0.003011 0.002219 0.002931 0.001661 0.002196 0.002092 0.001978 0.000982 0.001598 0.000958 0.002481 0.001068 0.001048 0.001672 0.00102 0.001752 0.001098 0.005148 0.00857 0.002291 0.004292 0.004791 0.001146 0.002773 0.001698 0.001769 0.00324 0.003167 0.002539 0.006559 0.001702 0.002475 0.001013 0.002937 0.002398 0.000917 0.00425 0.00187 0.002453 0.004276 0.001862 0.001045 0.001078 0.001858 0.00209 0.001085 0.001187 0.002418 0.000919 0.00189 0.001787 0.001013 0.002833 0.001186 0.000912 0.00116 0.012559 0.001277 0.000977 0.003067 0.002337 0.003616 0.001039 0.007609 0.004694 0.002224 0.001271

Topics per Document

This is how we extract the top 3 topics per document:

Python
# Extract only topic probability columns (columns starting with "topic_")
topic_columns = [col for col in merged_df.columns if col.startswith("topic_")]

# Find the top 3 topics and their probabilities for each document
top_topics = merged_df[topic_columns].apply(
    lambda row: list(row.nlargest(3).items()), axis=1
)

# Add the top 3 topics and probabilities to `merged_df`
merged_df["top_1"] = [t[0][0] for t in top_topics]  # First topic
merged_df["prob_1"] = [t[0][1] for t in top_topics]  # Probability of first topic
merged_df["top_2"] = [t[1][0] for t in top_topics]  # Second topic
merged_df["prob_2"] = [t[1][1] for t in top_topics]  # Probability of second topic
merged_df["top_3"] = [t[2][0] for t in top_topics]  # Third topic
merged_df["prob_3"] = [t[2][1] for t in top_topics]  # Probability of third topic

# Keep a readable subset of columns, in a convenient order
cols = ["document_id", "winning_topic", "winning_topic_keywords", "winning_topics_prob", "body",
        "top_1", "prob_1", "top_2", "prob_2", "top_3", "prob_3"]
merged_df2 = merged_df[cols]

Topics per Document

This is the resulting dataframe, with the top 3 topics and their probabilities for each document:

document_id winning_topic winning_topic_keywords winning_topics_prob body top_1 prob_1 top_2 prob_2 top_3 prob_3
1 topic_87 legal, aid, cases, advice, court, system, law, lawyer, solicitors, litigants 0.16797 Does the Minister agree that if one does not provi... topic_87 0.16797 topic_-1 0.043965 topic_20 0.034134
2 topic_22 fishing, fishermen, fish, fisheries, industry, sea, marine, conservation, vessels, fishermens 1.0 My right honourable Friend will know that there is... topic_22 1.0 topic_72 1.187003e-306 topic_61 1.158053e-306
3 topic_178 parking, stevenage, town, hire, taxi, charges, centre, borough, regeneration, local 1.0 I congratulate my honourable Friend on his campaig... topic_178 1.0 topic_85 4.395413e-306 topic_116 3.750731e-306
4 topic_-1 the, to, of, that, and, in, is, for, it, not 0.260586 We have had an interesting debate on an important ... topic_-1 0.260586 topic_121 0.141773 topic_144 0.014484

Topics per Document

Let us look at a text and its associated topic:

For example, the keywords associated with topic 2 are: rail, transport, network, road, london, trains, line, railway, railways, roads.

The speech that is associated with these keywords is the following:

Speech Keywords - rail, transport, network, road, london, trains, line, railway, railways, roads

I am encouraged by the honourable Gentleman’s anger. His intervention bears no relationship to the reality in Britain or to reality for our European Union partners. His comment was important. I hope that it will be well circulated to his constituents. It is clear that this whole process is in a state of disarray. I welcomed the new Secretary of State for Transport. His main challenge is to persuade a number of his colleagues in the Cabinet that the whole process should be dumped. I want to finish on two points that sum up this fiasco. I am sure that the honourable Member for Ross, Cromarty and Skye will deal with the issue of sleeper services to Scotland. The Government and the British Railways Board were dragged into the courts in Scotland to be exposed for what they had shown to Scots - utter contempt for their rail network. They thought that, through stealth and the incompetence of the director of franchising, Mr. Roger Salmon, they could close services under the guise of consultation on a draft passenger service requirement. They were found out. Is it not a disgrace that railway policy is now going to be partly dictated in the courts of England and Scotland? That is a fiasco and the Government know it. The second issue is trans-European networks. It is an important issue for Britain because it involves integrating the major routes in Britain with those in Europe. It is absolutely right that that should be done. What did Ministers do when they went to Brussels? They sabotaged the proposals. We find that the European Union is giving no priority to schemes. It has thrown out environmental impact assessments and of course no cash is being provided by the European Union for such developments.
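
Such a speech can be pulled straight from the dataframe. Below is a minimal sketch, assuming merged_df2 from the earlier step (the topic chosen here is illustrative):

Python
# Illustrative sketch: retrieve the first speech whose winning topic is topic_2
rail_speeches = merged_df2[merged_df2["winning_topic"] == "topic_2"]
print(rail_speeches["body"].iloc[0])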

Mapping One Topic over Time

Let us try to plot this topic over time:

We first merge the original texts with the topic assignments computed above:

Python
# Merge with original text data
final = pd.merge(
    aggression_texts,
    merged_df2[["document_id", "winning_topic", "winning_topic_keywords"]],
    how="left",
    on=["document_id"],
)

Mapping One Topic over Time

Let us try to plot this rail and transport topic (topic_2) over time:

Python
# Loading Relevant libraries
import numpy as np
import scipy.stats as stats

# Add a binary column indicating if the winning_topic is "topic_2"
final["winning_topic_2"] = (final["winning_topic"] == "topic_2").astype(int)

# Group by year and calculate total speeches, total topic occurrences, and proportion for topic 2
trends = (
    final
    .groupby("year", as_index=False)
    .agg(
        total_speeches=('document_id', 'count'),          # Total number of speeches
        total_topic=('winning_topic_2', 'sum'),          # Total speeches for "topic_2"
        proportion_topic=('winning_topic_2', 'mean')     # Mean proportion of "topic_2"
    )
)

# Calculate standard deviation (SD) of the proportion for topic 2
trends["sd_topic"] = final.groupby("year")["winning_topic_2"].std().values

# Use total speeches as the sample size (n)
trends["n"] = trends["total_speeches"]

# Calculate standard error (SE)
trends["se"] = trends["sd_topic"] / np.sqrt(trends["n"])

# Calculate confidence intervals for the proportion of topic 2
confidence_level = 0.95
t_value = stats.t.ppf(1 - (1 - confidence_level) / 2, df=trends["n"] - 1)

trends["ci_lower"] = trends["proportion_topic"] - t_value * trends["se"]
trends["ci_upper"] = trends["proportion_topic"] + t_value * trends["se"]

Plotting

We now use reticulate to pass the Python trends dataframe to R and plot it over time.

See code
R
library(reticulate)
library(ggplot2)

# Bring the Python `trends` dataframe into R
trends2 <- reticulate::py$trends

ggplot(trends2, aes(x = year, y = proportion_topic)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = 0.2) +
  geom_hline(yintercept = 0) +
  labs(
    title = "Topic over Time",
    x = "Year",
    y = "Topic Prevalence"
  ) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(trends2$year), max(trends2$year), by = 1))

Dealing with Outliers

As we saw earlier, almost 50% of data points are considered outliers.

The documentation provides four different strategies to deal with the outliers (a sketch of how to apply them follows the list):

  • based on topic-document probabilities,
  • based on topic distributions,
  • based on c-TF-IDF representations,
  • based on document and topic embeddings.
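
All four strategies are exposed through BERTopic's reduce_outliers function. Below is a minimal sketch, assuming topic_model, docs, and topics are the model, documents, and topic assignments from the earlier fitting step:

Python
# Reassign outlier documents (topic -1) using the topic-distributions strategy;
# other options are "probabilities", "c-tf-idf", and "embeddings"
new_topics = topic_model.reduce_outliers(docs, topics, strategy="distributions")

# Optionally refresh the topic representations with the new assignments
topic_model.update_topics(docs, topics=new_topics)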

Using Topic Distributions

BERTopic uses clustering to define topics: each document is assigned to at most one topic.

Documents, however, often contain a mixture of topics. This can be accounted for by splitting documents into sentences and feeding those to BERTopic, or by approximating a topic distribution for each document, as described below.

Each document is split into tokens according to the tokenizer provided in the CountVectorizer.

Then, a sliding window is applied to each document, creating overlapping subsets of the document that can each be scored against the topics.
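
This sliding-window procedure is implemented in BERTopic's approximate_distribution function. A minimal sketch, assuming topic_model and docs come from the earlier fitting step (the window and stride values are illustrative defaults):

Python
# Approximate a topic distribution for every document using a sliding window
# of 4 tokens that moves forward 1 token at a time
topic_distr, _ = topic_model.approximate_distribution(docs, window=4, stride=1)

# topic_distr has one row per document and one column per topic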

Conclusion

Traditional word representations, like one-hot encoding, are unable to capture relationships between words.

Word embeddings address these limitations by:

  • Representing words as dense vectors in lower-dimensional spaces.
  • Capturing semantic similarity and contextual meaning.

Dense representations enable:

  • Generalization across similar words.
  • Improved interpretability of relationships in text.
  • More efficient processing for downstream NLP tasks.

Modern techniques, such as contextual embeddings (e.g., BERT), overcome the limitations of static embeddings by adapting word representations to the context in which each word appears.