Unsupervised Learning: Word Embeddings

Bogdan G. Popescu

John Cabot University

Transitioning from LDA to Word Representations

About LDA:

  • Captures topics by identifying co-occurrence patterns of words in documents.
  • Words are grouped into topics based on shared context, implying a notion of semantic similarity.
  • LDA treats words as discrete entities (bags of words), without considering intrinsic relationships between them.

The GAP:

  • While LDA identifies topics, it does not encode similarity between individual words.
  • For example, words like king and queen may co-occur in the same topics but lack a direct quantitative measure of similarity.

Importance of Word Embeddings

  • We need dense word representations that embed similarity directly into the vectors.

Limitations of One-Hot Encoding

How words have been represented so far:

  • One-hot encoding: Words as sparse, high-dimensional vectors.
  • Example:
    • “cat” = \([1,0,0,...0]\), “dog” = \([0,1,0,...0]\)

Problems with Sparse Representations:

  • No Sense of Similarity
    • The dot product or cosine similarity between any two distinct word vectors is always zero.
    • Fails to capture relationships between words (e.g., “cat” and “dog” appear completely unrelated).

\[ \cos(\theta) = \frac{\mathbf{w}_{\text{cat}} \cdot \mathbf{w}_{\text{dog}}}{\left\lVert \mathbf{w}_{\text{cat}} \right\rVert \, \left\lVert \mathbf{w}_{\text{dog}} \right\rVert} = 0 \]

  • High Dimensionality
    • Vector length equals the vocabulary size, leading to very large, mostly empty vectors.

This means that models cannot learn or infer word relationships like synonyms, analogies, or context.
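A quick check confirms this (a minimal sketch, assuming a toy three-word vocabulary):

Python
import numpy as np

# Toy vocabulary: "cat", "dog", "fish" -> one-hot vectors
w_cat = np.array([1, 0, 0])
w_dog = np.array([0, 1, 0])

# Cosine similarity: dot product divided by the product of the vector norms
cos_sim = w_cat @ w_dog / (np.linalg.norm(w_cat) * np.linalg.norm(w_dog))
print(cos_sim)  # 0.0 -- distinct one-hot vectors are always orthogonal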

Sparse Word Representations Problems

1. Similarity

  • Documents may have zero term overlap yet have almost identical meanings.
  • E.g.: “The royal heir ascended the throne.” vs. “The monarch’s child became the ruler.”

2. Classification

  • We may know that one term is connected to a concept,
  • but sparse representations fail to reveal connections to related terms.
    • E.g.: If we learn that ‘climate change’ is predictive of the concept of environmental policy, shouldn’t we also infer something about terms like ‘carbon emissions’, ‘renewable energy’, and ‘Paris Agreement’?
  • Sparse Representations:
    • Words like ‘climate change’ and ‘renewable energy’ are treated as unrelated, despite their clear conceptual overlap.
  • Dense Representations:
    • Embeddings place related terms closer in vector space, allowing us to see the connections between ‘climate change’ and ‘renewable energy’ or ‘Paris Agreement’.

Word Embeddings: A New Paradigm

What are word embeddings?

  • Represent words as dense vectors in a lower-dimensional space.
  • Similar words have similar vector representations.

How embeddings solve the limitations:

  • Capture similarity
    • Words like king and queen will have vectors that are close in this new space.
  • Efficiency
    • Dense representations significantly reduce dimensionality.

Distributional Semantics

Distributional Semantics means that the meaning of a word can be derived from the distribution of contexts in which it appears.

The hypothesis implies that words that appear in similar “contexts” will share similar meanings.

When a word \(j\) appears in a text, its “context” is the set of words that appear nearby (within a fixed-size window).

We use the many contexts of \(w\) to build up a representation of \(w\).

Pre | Keyword | Post
the sacrifices made to secure our | freedom | will never be forgotten.
ensure that every citizen enjoys the | freedom | to speak their mind without fear.
we must defend the values of democracy and | freedom | against all forms of tyranny.
the right to pursue life, liberty, and | freedom | is fundamental to our society.
not everyone in the world experiences the | freedom | we often take for granted.
policies are designed to protect the | freedom | of individuals while ensuring equality.
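To make the idea concrete, here is a minimal sketch of collecting such fixed-size context windows (the window size of 2 and the example sentence are arbitrary choices for illustration):

Python
def context_window(tokens, target, size=2):
    """Collect the words within `size` positions of each occurrence of `target`."""
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - size):i]
            right = tokens[i + 1:i + 1 + size]
            contexts.append(left + right)
    return contexts

sentence = "we must defend the values of democracy and freedom against all forms of tyranny".split()
print(context_window(sentence, "freedom", size=2))
# [['democracy', 'and', 'against', 'all']]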

Word Embedding Overview

The meaning of each word is based on the distribution of terms with which it co-occurs

We represent this meaning using a vector for each word

Vectors are constructed such that similar words are close to each other in “semantic” space

We build this space automatically by seeing which words are close to one another in texts

Dense Representations of Words

Our goal is to build a dense vector for each word, chosen so that it is similar to vectors of words that appear in similar contexts (measuring similarity as the dot product)

\[ \begin{align} w_{\text{cat}} &= \begin{bmatrix} 0.73 \\ 0.04 \\ 0.07 \\ -0.18 \\ 0.81 \\ -0.97 \end{bmatrix} \end{align} \]

\[ \begin{align} w_{\text{dog}} &= \begin{bmatrix} 0.63 \\ 0.14 \\ 0.02 \\ -0.58 \\ 0.43 \\ -0.66 \end{bmatrix} \end{align} \]

These representations are known as word embeddings because we “embed” words into a low-dimensional space (low compared to the vocabulary size).
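As a quick check (using the toy numbers above, which are purely illustrative), the cosine similarity between these two dense vectors is high:

Python
import numpy as np

w_cat = np.array([0.73, 0.04, 0.07, -0.18, 0.81, -0.97])
w_dog = np.array([0.63, 0.14, 0.02, -0.58, 0.43, -0.66])

# Unlike one-hot vectors, dense vectors give a non-zero, graded similarity
cos_sim = w_cat @ w_dog / (np.linalg.norm(w_cat) * np.linalg.norm(w_dog))
print(round(cos_sim, 2))  # about 0.9: "cat" and "dog" are close in this space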

Word Embeddings Advantages

Low-dimensional word embeddings offer some advantages:

  1. Encode Similarity between Words
  • Each word is a vector, with the vectors of similar words closer together than vectors of very different words
  • We no longer have word similarities of zero.
  2. Facilitating Generalization
  • Embeddings enable models to generalize across words automatically.
    • For instance, if the word “amazing” is identified as a strong predictor of positive sentiment, but the word “outstanding” doesn’t appear in the training data, their similar embeddings allow the model to infer that “outstanding” likely has a similar sentiment.
  3. Quantifying Word Meaning
  • Embeddings provide a numerical way to assess the meaning of words.
  • Two words are considered to share a meaning if they appear in similar contexts, capturing this shared meaning through their proximity in vector space.

Characteristics of Word Embeddings

Training Data Requirements

  • High-quality embeddings require large amounts of diverse training data.
  • Typically trained on extensive external corpora (e.g., Wikipedia, news articles, web pages).
  • Embeddings reflect the language and biases present in the training data.

Context Window Size

  • Since a word’s meaning is influenced by its context, we must define what “context” means.
  • Context is typically implemented as a symmetric window of a specified size around each word.
  • The window size determines the type of relationships captured:
    • Small window: Focuses on immediate syntactic relationships
    • Large window: Captures broader semantic and topical associations.

Dimension of the Embedding

  • The embedding for each word will typically be between 50 and 500 elements long
  • The embedding encodes information about the contexts a word appears in, so in theory larger embeddings are able to encode more information

Traditional Word Embeddings

A word embedding is a dense vector representation of a word in a continuous vector space, where similar words are placed closer together.

  • Words that frequently appear in similar contexts have similar embeddings.
  • Example: Words like king, queen, and monarch would be close in the embedding space.

However, there are some limitations:

  • Static: Each word has one fixed embedding, regardless of context (bank in riverbank vs. financial bank).
  • Context-agnostic: Struggles with polysemy (multiple meanings for a single word).

Moving Beyond Traditional Embeddings?

Static Representations fall short:

  • Traditional embeddings ignore sentence-level context.
  • For example:
    • “The bank of the river is beautiful.”
    • “I deposited money at the bank.”
    • Bank has the same vector in both cases, even though the meanings are different.
  • Models like BERT address this limitation by generating dynamic word representations based on their context.
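A minimal sketch of this idea with the Hugging Face transformers library (assuming the bert-base-uncased checkpoint is available; the exact similarity value depends on the model version):

Python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the contextual embedding of the token "bank" in the sentence
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_river = bank_vector("The bank of the river is beautiful.")
v_money = bank_vector("I deposited money at the bank.")
sim = torch.cosine_similarity(v_river, v_money, dim=0).item()
print(round(sim, 2))  # noticeably below 1: the two "bank" vectors differ by context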

BERTopic

BERTopic is a topic modelling technique that leverages transformers and c-TF-IDF to create clusters for easily interpretable topics.

What are Transformers?

  • A neural network architecture that excels at understanding relationships between words in context
  • They generate dynamic embeddings for each document based on the context of words within it.
  • Embeddings capture semantic nuances, allowing words with multiple meanings (e.g., “bank”) to be represented accurately.

What is TF-IDF?

  • A method that weights terms by how frequently they appear in a document and how rare they are across the corpus.

What is c-TF-IDF?

  • Instead of treating each document independently, it treats clusters of documents as “pseudo-documents.”
  • Calculates term importance for the entire cluster, emphasizing words that uniquely define a topic.
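A simplified sketch of the idea (not BERTopic’s exact weighting scheme; the two clusters and their documents below are invented for illustration):

Python
from collections import Counter
import math

# Invented clusters of documents, purely for illustration
clusters = {
    0: ["tax cuts for business", "the chancellor raised tax"],
    1: ["nhs hospital care", "patients wait for nhs care"],
}

# Aggregate each cluster into a single pseudo-document and count its terms
tf = {c: Counter(" ".join(docs).split()) for c, docs in clusters.items()}

# Weight each term by an inverse document frequency computed across clusters,
# so words shared by many clusters (like "for") are down-weighted
n_clusters = len(clusters)
df = Counter(term for counts in tf.values() for term in counts)
ctfidf = {
    c: {t: f * math.log(1 + n_clusters / df[t]) for t, f in counts.items()}
    for c, counts in tf.items()
}

print(sorted(ctfidf[1].items(), key=lambda kv: -kv[1])[:3])  # "nhs" and "care" dominate cluster 1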

BERTopic

How c-TF-IDF Works in BERTopic

  • Create Clusters: Documents are clustered based on their transformer-generated embeddings.
  • Aggregate Documents: Documents within each cluster are combined into a single “pseudo-document.”
  • Calculate c-TF-IDF: Determines the most representative keywords for each cluster, making topics interpretable.

Why This Combination Works

  • Transformers: Capture deep, contextual relationships between words in documents.
  • c-TF-IDF: Extracts interpretable topics by highlighting the most relevant terms in a cluster.
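Because BERTopic is modular, each of these pieces can also be specified explicitly. Here is a sketch of how the components plug together (the particular models chosen are illustrative; the analysis below simply uses the defaults plus a fixed-seed UMAP):

Python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")  # transformer document embeddings
umap_model = UMAP(n_components=5, random_state=42)         # dimensionality reduction
hdbscan_model = HDBSCAN(min_cluster_size=10)               # clustering into topics
vectorizer_model = CountVectorizer(stop_words="english")   # term counts used by c-TF-IDF

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,
)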
Python
import pandas as pd
# Note: bertopic can run into dependency conflicts (e.g. with numpy/scipy/thinc versions).
# If the import fails, pinning numpy and removing conflicting packages can help, e.g.:
#   pip install --upgrade numpy==1.26
#   pip uninstall numpy scipy blis thinc
#   pip uninstall spacy gensim scipy thinc -y
from bertopic import BERTopic

Reading the Speeches

The next section reads the relevant documents.

Python
import pandas as pd
# Load the aggression_texts CSV and extract the 'body' column
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
aggression_texts['document_id'] = range(1, len(aggression_texts) + 1)

We now initialize the Bert model

Python
# The following two lines will remove some annoying message in Bert: "huggingface/tokenizers: The current process just got forked..."
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
# Importing Bert
from bertopic import BERTopic
# Create a UMAP instance with a fixed random state for reproducibility
from umap import UMAP
umap_model = UMAP(random_state=42)
# Initialize the BERTopic model with probability calculation enabled and the UMAP model specified
topic_model = BERTopic(calculate_probabilities=True,umap_model=umap_model, verbose=True)
topics, probs = topic_model.fit_transform(aggression_texts["body"])

Batches: 100%|##########| 896/896 [01:32<00:00,  9.72it/s]

Examining the Main Topics

Let us now look at the main topics and their representation:

Python
# Retrieve topic information and select specific columns: 'Topic', 'Count', 'Name', and 'Representation'
topic_info = topic_model.get_topic_info()[['Topic', 'Count', 'Name', 'Representation']]

# Create a simplified version of the topic information, retaining only 'Topic', 'Count', and 'Representation' columns
topic_info_simple = topic_info[["Topic", 'Count', "Representation"]]

# Modify the 'Representation' column to remove duplicate words while preserving their order
# This uses a dictionary to enforce uniqueness and joins the keys (unique words) back into a string
topic_info_simple['Representation'] = topic_info_simple['Representation'].apply(lambda x: ' '.join(dict.fromkeys(x).keys()))

# Convert the simplified topic information into a DataFrame for easier analysis or visualization
frequency_table = pd.DataFrame(topic_info_simple)
frequency_table.head(7)

Topic | Count | Representation
-1 | 11169 | the to of that and in is for it not
0 | 2096 | tax chancellor the government that we is businesses in business
1 | 1330 | health nhs care patients hospital service hospitals services patient that
2 | 551 | rail transport network road london trains line railway railways roads
3 | 520 | ireland northern agreement irish ira sinn parties fein belfast decommissioning
4 | 456 | energy climate carbon fuel gas emissions wind change electricity efficiency
5 | 398 | police officers crime policing home chief metropolitan force constable constables

Recording the Word Scores

Python
# Initialize an empty list to store rows
topic_words_scores = []

# Loop through each topic and extract its word scores
for topic_id in topic_info_simple["Topic"]:
    words_scores = topic_model.get_topic(topic_id)  # Get words and scores for the topic
    if words_scores:  # Ensure there are words for the topic
        for word, score in words_scores:
            topic_words_scores.append({"Topic": topic_id, "Word": word, "Score": score})

# Convert the list into a DataFrame
topic_words_scores_df = pd.DataFrame(topic_words_scores)

Recording the Word Scores

Python
# Step 1: Group by 'Topic' and rank terms by descending 'Score'
topic_words_scores_df['rank'] = topic_words_scores_df.groupby('Topic')['Score'] \
    .rank(method='min', ascending=False)

# Step 2: Filter only the first 11 topics: "-1" to "10"
# First, make sure Topic is treated as a string (if needed)
topic_words_scores_df['Topic'] = topic_words_scores_df['Topic'].astype(str)

selected_topics = [str(i) for i in range(-1, 11)]
reshaped_df2 = topic_words_scores_df[topic_words_scores_df['Topic'].isin(selected_topics)]
R
# Load necessary libraries
library(dplyr)      # For data manipulation
library(broom)      # For tidying data (not used explicitly in this code but useful in many workflows)
library(ggplot2)    # For creating plots
library(ggpubr)     # For arranging multiple plots into a single figure
library(forcats)  # for fct_rev


reshaped_df2 <- reticulate::py$reshaped_df2

#Creating a graph with the topics
topics_df<-ggplot(reshaped_df2, aes(y = fct_rev(as.factor(Topic)), x = as.numeric(rank))) +   
  geom_tile(aes(fill = Score))+
  scale_fill_viridis_c()+
  geom_label(aes(y = fct_rev(as.factor(Topic)), x=rank, label=Word), fill="white", size=3)+
  scale_x_continuous(breaks = seq(min(reshaped_df2$rank), max(reshaped_df2$rank))) +  # Ensure all ranks are shown
  ylab("Topic")+ xlab("Top 10 words")+
  theme(legend.position = "bottom")
topics_df                     

Recording the Word Scores

R
# Load necessary libraries
library(dplyr)      # For data manipulation
library(broom)      # For tidying data (not used explicitly in this code but useful in many workflows)
library(ggplot2)    # For creating plots
library(ggpubr)     # For arranging multiple plots into a single figure

# Accessing the Python dataframe (using reticulate) and converting it to an R dataframe
# `reticulate` bridges R and Python environments
topics_df <- reticulate::py$topic_words_scores_df

# Filter the dataframe to include only topics less than 3
# This creates a subset of data for visualization
topics_df8 <- topics_df[as.numeric(topics_df$Topic) < 3, ]

# Create an empty list to store plots
# This list will hold the individual plots for each topic
plot_list <- list()

# Loop through each unique topic in the filtered dataframe
# Each iteration generates a plot for one topic
for (topic_id in unique(topics_df8$Topic)) {
  
  # Filter the dataframe for the current topic
  # This ensures that each plot uses only the relevant data
  topic_data <- topics_df8[topics_df8$Topic == topic_id, ]
  
  # Create the bar plot for the current topic
  # The x-axis represents scores, and the y-axis represents words sorted by score
  p <- ggplot(topic_data, aes(x = Score, y = reorder(Word, Score))) +
    geom_bar(stat = "identity") +  # `geom_bar` creates the bar chart
    theme_bw() +                   # `theme_bw` applies a clean, minimal theme
    theme(axis.title.x = element_blank(),   # Remove x-axis title
          axis.title.y = element_blank()) + # Remove y-axis title
    ggtitle(paste("Topic", topic_id))       # Add a descriptive title to the plot
  
  # Store the plot in the list, with the topic ID as the key
  plot_list[[as.character(topic_id)]] <- p
}

# Arrange the stored plots in a grid layout
# `ncol` specifies the number of columns, and `nrow` specifies the number of rows
# `ggarrange` makes it easy to combine multiple plots into a single figure
ggarrange(plotlist = plot_list, ncol = 2, nrow = 2)
(Output: a grid of bar charts showing the top-scoring words for each selected topic.)

Topics per Class

This is how we get topics per document:

Python
# Initialize a list to hold probabilities with document IDs
probs_with_outliers = []

# Use document_id from aggression_texts
document_ids = aggression_texts['document_id'].tolist()

# Iterate over topics, probabilities, and document IDs
for doc_id, topic, prob_row in zip(document_ids, topics, probs):
    if topic == -1:
        # Assign all probability to topic -1
        prob_row_with_outlier = [doc_id] + list(prob_row) + [1]  # Add document ID and probabilities
    else:
        # Calculate residual probability for topic -1
        outlier_prob = max(0, 1 - sum(prob_row))  # Ensure non-negative
        prob_row_with_outlier = [doc_id] + list(prob_row) + [outlier_prob]
    probs_with_outliers.append(prob_row_with_outlier)

# Define columns, including document_id
topic_columns = [f"topic_{i}" for i in range(len(probs[0]))] + ["topic_-1"]
columns = ["document_id"] + topic_columns

# Convert to DataFrame
probs_df_with_outliers = pd.DataFrame(probs_with_outliers, columns=columns)

# Add winning topic
probs_df_with_outliers["winning_topic"] = probs_df_with_outliers[topic_columns].idxmax(axis=1)

# Get keywords for all topics
topic_keywords = {topic: topic_model.get_topic(topic) for topic in topic_model.get_topics()}

# Function to extract keywords for the winning topic
def get_keywords(topic):
    topic_id = int(topic.split("_")[1])  # Extract numeric topic ID
    # Format the keywords as a comma-separated string
    return ", ".join([word for word, _ in topic_keywords[topic_id]])

# Add keywords for the winning topic
probs_df_with_outliers["winning_topic_keywords"] = probs_df_with_outliers["winning_topic"].apply(get_keywords)

# Add winning topic probability
probs_df_with_outliers["winning_topics_prob"] = probs_df_with_outliers[topic_columns].max(axis=1)

# Merge with original text data
merged_df = pd.merge(probs_df_with_outliers, aggression_texts[["document_id","body", "name"]],how='left',on=['document_id'])

# Rearrange columns to have: document_id, winning_topic, winning_topic_keywords, winning_topics_prob, body, and then topic columns
cols = ["document_id", "winning_topic", "winning_topic_keywords", "winning_topics_prob", "name", "body", "topic_-1"] + [f"topic_{i}" for i in range(len(probs[0]))]
merged_df = merged_df[cols]
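As a quick check (a minimal sketch reusing the objects defined above), we can preview the merged dataframe and see how much probability mass each document carries across the topic columns:

Python
# Preview: one row per document, with its winning topic, probability, and speaker
print(merged_df[["document_id", "winning_topic", "winning_topics_prob", "name"]].head())

# Total probability mass per document across all topic columns (including topic_-1)
print(merged_df[topic_columns].sum(axis=1).head())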

Top Topics in Boris Johnson's Speeches

Python
pmq_boris = merged_df[merged_df["name"] == "Boris Johnson"]
cols = ["document_id", "winning_topic", "winning_topic_keywords", "winning_topics_prob", "name", "body", "topic_-1"] + [f"topic_{i}" for i in range(0,10)]
pmq_boris = pmq_boris[cols]
# Convert to long format
long_format = pmq_boris.melt(
    id_vars=["document_id", "winning_topic"],  # Columns to keep as identifiers
    value_vars=[col for col in pmq_boris.columns if col.startswith("topic_")],  # Columns to unpivot
    var_name="topic",  # Name of the resulting topic column
    value_name="importance"  # Name of the resulting value column
)

# Sort for better readability (optional)
long_format = long_format.sort_values(by=["document_id", "topic"]).reset_index(drop=True)

# Function to get the first three words for a given topic
def get_top_words(topic):
    # Extract the topic number
    topic_number = int(topic.replace('topic_', '').replace('_', '-'))  # Handles cases like 'topic_-1'
    # Get the first three words for the topic
    if topic_number in topic_keywords:
        return ', '.join(word for word, _ in topic_keywords[topic_number][:3])
    return None  # Return None if the topic is not in the dictionary

# Apply the function to the 'topic' column
long_format['top_words'] = long_format['topic'].apply(get_top_words)
Python
# Step 1: Sort by document_id to prepare for group-wise operation
result_df = long_format.sort_values(['document_id', 'importance'], ascending=[True, False])

# Step 2: Within each document_id, identify duplicates and add small increments
result_df['is_dup_forward'] = result_df.duplicated(['document_id', 'importance'])
result_df['is_dup_backward'] = result_df.duplicated(['document_id', 'importance'], keep='last')

# Combine both to detect any kind of duplicate
result_df['is_dup'] = result_df['is_dup_forward'] | result_df['is_dup_backward']

# Step 3: Generate row numbers within each group to apply tiny increments
result_df['row_number'] = result_df.groupby('document_id').cumcount() + 1

# Step 4: Apply increment only to duplicates
result_df['importance_adjusted'] = result_df['importance']
result_df.loc[result_df['is_dup'], 'importance_adjusted'] += result_df['row_number'] * 1e-6

# Step 5: Rank by adjusted importance (descending), ties get minimum rank
result_df['rank'] = result_df.groupby('document_id')['importance_adjusted'] \
                             .rank(ascending=False, method='min')

# Optional: drop the intermediate columns if not needed
reshaped_df2 = result_df.drop(columns=['is_dup_forward', 'is_dup_backward', 'is_dup', 'row_number'])
R
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
reshaped_df2 <- reticulate::py$reshaped_df2

#Creating a graph with the topics
topics_lda_keys<-ggplot(reshaped_df2, aes(y = reorder(document_id, -document_id), x = as.numeric(rank))) +   
  geom_tile(aes(fill = importance))+
  scale_fill_viridis_c()+
  geom_label(aes(y = reorder(document_id, -document_id), x=as.numeric(rank), label=topic), fill="white", size=3)+
  scale_x_continuous(breaks = seq(min(reshaped_df2$rank), max(reshaped_df2$rank))) +  # Ensure all ranks are shown
  ylab("Speech")+ xlab("Top Topics")+
  theme(legend.position = "bottom")
topics_lda_keys

Top Topics in Boris Johnson's Speeches

Python
# Step 1: Rank within each document_id, descending order, ties.method = "min"
result_df['rank'] = result_df.groupby('document_id')['importance'].rank(ascending=False, method='min')

# Step 2: Apply regex to format `top_words` as `terms`
# Add a newline after the second comma-separated word
result_df['terms'] = result_df['top_words'].str.replace(r'^([^,]+,[^,]+),', r'\1,\n', regex=True)
# Final result
reshaped_df2 = result_df
R
#
library(dplyr)
library(broom)
library(ggplot2)
#Turning the Pandas dataframe to R
reshaped_df2 <- reticulate::py$reshaped_df2

#Creating a graph with the topics
topics_lda_keys<-ggplot(reshaped_df2, aes(y = reorder(document_id, -document_id), x = as.numeric(rank))) +   
  geom_tile(aes(fill = importance))+
  scale_fill_viridis_c()+
  geom_label(aes(y = reorder(document_id, -document_id), x=as.numeric(rank), label=terms), fill="white", size=2)+
  scale_x_continuous(breaks = seq(min(reshaped_df2$rank), max(reshaped_df2$rank))) +  # Ensure all ranks are shown
  ylab("Speech")+ xlab("Top Topics")+
  theme(legend.position = "bottom")
topics_lda_keys

Top Topics in Boris Johnson's Speeches

Let us look at a text and its associated topic (where the winning topic is not -1):

For example, speech 6113 has the following winning keywords: parliament, referendum, treaty, viii, henry, clauses, osce, passerelle, stunt, vote.

This speech reads as follows:

Johnson’s speech - 6113

We have been clear that the Government do not agree, as I have said previously to the House, with the recent changes to US immigration policy, and that that is not the approach the UK would take.

Topics per Class

The resulting dataframe contains one row per document:

document_id | winning_topic | winning_topic_keywords | winning_topics_prob | name | body
1 | topic_87 | legal, aid, cases, advice, court, system, law, lawyer, solicitors, litigants | 0.16797 | Mr Gerry Bermingham | Does the Minister agree that if one does not provi...
2 | topic_22 | fishing, fishermen, fish, fisheries, industry, sea, marine, conservation, vessels, fishermens | 1.0 | Richard Benyon | My right honourable Friend will know that there is...
3 | topic_178 | parking, stevenage, town, hire, taxi, charges, centre, borough, regeneration, local | 1.0 | Penny Mordaunt | I congratulate my honourable Friend on his campaig...

(The remaining columns, topic_-1 through topic_239, hold each document's per-topic probabilities and are not shown here.)
4 topic_-1 the, to, of, that, and, in, is, for, it, not 0.260586 Gerry Sutcliffe We have had an interesting debate on an important ... 0.260586 0.009011 0.001424 0.002068 0.001486 0.001222 0.001665 0.001296 0.001648 0.001023 0.00334 0.002041 0.001648 0.002552 0.00535 0.001813 0.001618 0.001337 0.001169 0.000923 0.001728 0.003805 0.00142 0.001006 0.004139 0.001417 0.000936 0.001513 0.000723 0.001103 0.009388 0.001852 0.003169 0.003982 0.001416 0.002206 0.001782 0.002033 0.001134 0.001011 0.001998 0.00185 0.002768 0.000948 0.001269 0.002264 0.00382 0.001832 0.000839 0.001211 0.003573 0.001525 0.003928 0.002326 0.001181 0.001053 0.006299 0.001612 0.001623 0.001354 0.001225 0.001864 0.001197 0.003159 0.002574 0.001 0.001818 0.001388 0.001092 0.005288 0.011239 0.005524 0.001184 0.00108 0.001026 0.002145 0.002265 0.001031 0.00566 0.001107 0.002257 0.000487 0.002218 0.001486 0.002557 0.001909 0.002271 0.001181 0.004721 0.003312 0.001345 0.002698 0.002689 0.000902 0.004267 0.007021 0.003671 0.006986 0.001231 0.002292 0.001846 0.001248 0.001553 0.000886 0.001587 0.011162 0.001106 0.00096 0.001294 0.002426 0.002079 0.001452 0.007451 0.00125 0.001443 0.002374 0.003035 0.00184 0.001848 0.001195 0.001191 0.003068 0.141773 0.004915 0.002372 0.001576 0.009627 0.00112 0.001409 0.001718 0.001209 0.002271 0.002674 0.002517 0.001489 0.002133 0.002578 0.001704 0.002597 0.003383 0.00282 0.001662 0.0021 0.002527 0.002659 0.014484 0.009743 0.000721 0.000953 0.005739 0.001371 0.00221 0.001026 0.002638 0.002165 0.002815 0.003589 0.001064 0.002034 0.003665 0.001292 0.004045 0.001478 0.003878 0.000947 0.002691 0.001233 0.001003 0.00283 0.001946 0.00109 0.001701 0.001378 0.001484 0.003011 0.002219 0.002931 0.001661 0.002196 0.002092 0.001978 0.000982 0.001598 0.000958 0.002481 0.001068 0.001048 0.001672 0.00102 0.001752 0.001098 0.005148 0.00857 0.002291 0.004292 0.004791 0.001146 0.002773 0.001698 0.001769 0.00324 0.003167 0.002539 0.006559 0.001702 0.002475 0.001013 0.002937 0.002398 0.000917 0.00425 0.00187 0.002453 0.004276 0.001862 0.001045 0.001078 0.001858 0.00209 0.001085 0.001187 0.002418 0.000919 0.00189 0.001787 0.001013 0.002833 0.001186 0.000912 0.00116 0.012559 0.001277 0.000977 0.003067 0.002337 0.003616 0.001039 0.007609 0.004694 0.002224 0.001271

Topics per Document

This is how we extract the top 3 topics per document:

Python
# Extract only topic probability columns (columns starting with "topic_")
topic_columns = [col for col in merged_df.columns if col.startswith("topic_")]

# Find the top 3 topics and their probabilities for each document
top_topics = merged_df[topic_columns].apply(
    lambda row: list(row.nlargest(3).items()), axis=1
)

# Add the top 3 topics and probabilities to `merged_df`
merged_df["top_1"] = [t[0][0] for t in top_topics]  # First topic
merged_df["prob_1"] = [t[0][1] for t in top_topics]  # Probability of first topic
merged_df["top_2"] = [t[1][0] for t in top_topics]  # Second topic
merged_df["prob_2"] = [t[1][1] for t in top_topics]  # Probability of second topic
merged_df["top_3"] = [t[2][0] for t in top_topics]  # Third topic
merged_df["prob_3"] = [t[2][1] for t in top_topics]  # Probability of third topic

# Keep a readable subset of columns, in a convenient order
cols = ["document_id", "winning_topic", "winning_topic_keywords", "winning_topics_prob", "body",
        "top_1", "prob_1", "top_2", "prob_2", "top_3", "prob_3"]
merged_df2 = merged_df[cols]

Topics per Document

This is the resulting dataframe, with the top 3 topics and their probabilities for each document:

document_id winning_topic winning_topic_keywords winning_topics_prob body top_1 prob_1 top_2 prob_2 top_3 prob_3
1 topic_87 legal, aid, cases, advice, court, system, law, lawyer, solicitors, litigants 0.16797 Does the Minister agree that if one does not provi... topic_87 0.16797 topic_-1 0.043965 topic_20 0.034134
2 topic_22 fishing, fishermen, fish, fisheries, industry, sea, marine, conservation, vessels, fishermens 1.0 My right honourable Friend will know that there is... topic_22 1.0 topic_72 1.187003e-306 topic_61 1.158053e-306
3 topic_178 parking, stevenage, town, hire, taxi, charges, centre, borough, regeneration, local 1.0 I congratulate my honourable Friend on his campaig... topic_178 1.0 topic_85 4.395413e-306 topic_116 3.750731e-306
4 topic_-1 the, to, of, that, and, in, is, for, it, not 0.260586 We have had an interesting debate on an important ... topic_-1 0.260586 topic_121 0.141773 topic_144 0.014484

Topics per Document

Let us look at a text and its associated topic:

For example, the keywords associated with topic 2 are: rail, transport, network, road, london, trains, line, railway, railways, roads.

The speech that is associated with these keywords is the following:

Speech Keywords - rail, transport, network, road, london, trains, line, railway, railways, roads

I am encouraged by the honourable Gentleman’s anger. His intervention bears no relationship to the reality in Britain or to reality for our European Union partners. His comment was important. I hope that it will be well circulated to his constituents. It is clear that this whole process is in a state of disarray. I welcomed the new Secretary of State for Transport. His main challenge is to persuade a number of his colleagues in the Cabinet that the whole process should be dumped. I want to finish on two points that sum up this fiasco. I am sure that the honourable Member for Ross, Cromarty and Skye will deal with the issue of sleeper services to Scotland. The Government and the British Railways Board were dragged into the courts in Scotland to be exposed for what they had shown to Scots - utter contempt for their rail network. They thought that, through stealth and the incompetence of the director of franchising, Mr. Roger Salmon, they could close services under the guise of consultation on a draft passenger service requirement. They were found out. Is it not a disgrace that railway policy is now going to be partly dictated in the courts of England and Scotland? That is a fiasco and the Government know it. The second issue is trans-European networks. It is an important issue for Britain because it involves integrating the major routes in Britain with those in Europe. It is absolutely right that that should be done. What did Ministers do when they went to Brussels? They sabotaged the proposals. We find that the European Union is giving no priority to schemes. It has thrown out environmental impact assessments and of course no cash is being provided by the European Union for such developments.
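
Such a speech can be pulled straight from the dataframe. Below is a minimal sketch, assuming merged_df2 from the earlier step (the topic chosen here is illustrative):

Python
# Illustrative sketch: retrieve the first speech whose winning topic is topic_2
rail_speeches = merged_df2[merged_df2["winning_topic"] == "topic_2"]
print(rail_speeches["body"].iloc[0])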

Mapping One Topic over Time

Let us try to plot this topic over time:

We first merge the original texts with the topic assignments computed above:

Python
# Merge with original text data
final = pd.merge(
    aggression_texts,
    merged_df2[["document_id", "winning_topic", "winning_topic_keywords"]],
    how="left",
    on=["document_id"],
)

Mapping One Topic over Time

Let us try to plot this rail and transport topic (topic_2) over time:

Python
# Loading Relevant libraries
import numpy as np
import scipy.stats as stats

# Add a binary column indicating if the winning_topic is "topic_2"
final["winning_topic_2"] = (final["winning_topic"] == "topic_2").astype(int)

# Group by year and calculate total speeches, total topic occurrences, and proportion for topic 2
trends = (
    final
    .groupby("year", as_index=False)
    .agg(
        total_speeches=('document_id', 'count'),          # Total number of speeches
        total_topic=('winning_topic_2', 'sum'),          # Total speeches for "topic_2"
        proportion_topic=('winning_topic_2', 'mean')     # Mean proportion of "topic_2"
    )
)

# Calculate standard deviation (SD) of the proportion for topic 2
trends["sd_topic"] = final.groupby("year")["winning_topic_2"].std().values

# Use total speeches as the sample size (n)
trends["n"] = trends["total_speeches"]

# Calculate standard error (SE)
trends["se"] = trends["sd_topic"] / np.sqrt(trends["n"])

# Calculate confidence intervals for the proportion of topic 2
confidence_level = 0.95
t_value = stats.t.ppf(1 - (1 - confidence_level) / 2, df=trends["n"] - 1)

trends["ci_lower"] = trends["proportion_topic"] - t_value * trends["se"]
trends["ci_upper"] = trends["proportion_topic"] + t_value * trends["se"]

Plotting

We now use reticulate to pass the Python trends dataframe to R and plot it over time.

See code
R
library(reticulate)
library(ggplot2)

# Bring the Python `trends` dataframe into R
trends2 <- reticulate::py$trends

ggplot(trends2, aes(x = year, y = proportion_topic)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), alpha = 0.2) +
  geom_hline(yintercept = 0) +
  labs(
    title = "Topic over Time",
    x = "Year",
    y = "Topic Prevalence"
  ) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(trends2$year), max(trends2$year), by = 1))

Dealing with Outliers

As we saw earlier, almost 50% of data points are considered outliers.

The documentation provides four different strategies to deal with the outliers (a sketch of how to apply them follows the list):

  • based on topic-document probabilities,
  • based on topic distributions,
  • based on c-TF-IDF representations,
  • based on document and topic embeddings.
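
All four strategies are exposed through BERTopic's reduce_outliers function. Below is a minimal sketch, assuming topic_model, docs, and topics are the model, documents, and topic assignments from the earlier fitting step:

Python
# Reassign outlier documents (topic -1) using the topic-distributions strategy;
# other options are "probabilities", "c-tf-idf", and "embeddings"
new_topics = topic_model.reduce_outliers(docs, topics, strategy="distributions")

# Optionally refresh the topic representations with the new assignments
topic_model.update_topics(docs, topics=new_topics)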

Using Topic Distributions

BERTopic uses clustering to define topics: each document is assigned to at most one topic.

Documents, however, often contain a mixture of topics. This can be accounted for by splitting documents into sentences and feeding those to BERTopic, or by approximating a topic distribution for each document, as described below.

Each document is split into tokens according to the tokenizer provided in the CountVectorizer.

Then, a sliding window is applied to each document, creating overlapping subsets of the document that can each be scored against the topics.
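
This sliding-window procedure is implemented in BERTopic's approximate_distribution function. A minimal sketch, assuming topic_model and docs come from the earlier fitting step (the window and stride values are illustrative defaults):

Python
# Approximate a topic distribution for every document using a sliding window
# of 4 tokens that moves forward 1 token at a time
topic_distr, _ = topic_model.approximate_distribution(docs, window=4, stride=1)

# topic_distr has one row per document and one column per topic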

Conclusion

Traditional word representations, like one-hot encoding, are unable to capture relationships between words.

Word embeddings address these limitations by:

  • Representing words as dense vectors in lower-dimensional spaces.
  • Capturing semantic similarity and contextual meaning.

Dense representations enable:

  • Generalization across similar words.
  • Improved interpretability of relationships in text.
  • More efficient processing for downstream NLP tasks.

Modern techniques, such as contextual embeddings (e.g., BERT), overcome the limitations of static embeddings by adapting word representations to the context in which each word appears.