Topic Modeling Using LDA vs BERTopic

Application to the Berean Standard Bible

Author

Bogdan G. Popescu

1 Introduction

In this report, I perform topic modeling on the Bible, employing two methods: Latent Dirichlet Allocation (LDA) and embedding-based topic modeling with BERTopic. Both are unsupervised machine learning approaches for extracting latent topics from text. Comprising 66 books written by multiple authors over numerous centuries, the Bible encompasses a diverse array of genres, including historical and religious narratives, poetry, epistles, and prophecies. By applying topic modeling, we can unveil fresh insights into this important text.

Thus, the main questions asked in this report are:

  • What are the general topics in the Bible?

  • How do the two topic modeling methods, LDA and BERTopic, compare when applied to the Bible?

Before applying LDA and BERTopic, I perform some descriptive analysis.

2 Descriptive Approach

Data cleaning and preprocessing are important steps when analyzing textual data before vectorizing it. The preprocessing and cleaning techniques include tokenization, lemmatization, stopword removal, lowercasing, and non-alphabetical character removal. These techniques prepare the data for NLP computations and remove noise and uninformative content from the text.

We first load the relevant libraries.

Show the code
#Importing the relevant libraries
import requests
import numpy as np
import string
import pandas as pd
import re
import matplotlib.pyplot as plt

#NLTK
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize, bigrams
from collections import Counter
from wordcloud import WordCloud

#Bertopic
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer
from hdbscan import HDBSCAN

#HTML display
from IPython.display import HTML

We then download the Bible from the following URL and turn the response into text.

bible_url = 'https://bereanbible.com/bsb.txt'
path = "./data/bereanbible.txt"
#with open(path, 'rb') as f:
#  text = f.read()
#Requesting the URL
response = requests.get(bible_url)
raw_bible = response.text
#Printing the first 400 characters
raw_bible[:400]
"The Holy Bible, Berean Standard Bible, BSB is produced in cooperation with Bible Hub, Discovery Bible, OpenBible.com, and the Berean Bible Translation Committee. \t\nThis text of God's Word has been dedicated to the public domain. Free resources and databases are available at BereanBible.com.\t\nVerse\tBerean Standard Bible\nGenesis 1:1\tIn the beginning God created the heavens and the earth.\nGenesis 1:2"

The following command splits the text into lines.

bible_list = raw_bible.splitlines()
#This prints the first 2 entries.
bible_list[:2]
['The Holy Bible, Berean Standard Bible, BSB is produced in cooperation with Bible Hub, Discovery Bible, OpenBible.com, and the Berean Bible Translation Committee. \t', "This text of God's Word has been dedicated to the public domain. Free resources and databases are available at BereanBible.com.\t"]

The next line takes a list of strings (bible_list), splits each string by the tab character, and stores the result in a NumPy array (bible_array).

bible_array = np.array([item.split('\t') for item in bible_list])
bible_array[:2]
array([['The Holy Bible, Berean Standard Bible, BSB is produced in cooperation with Bible Hub, Discovery Bible, OpenBible.com, and the Berean Bible Translation Committee. ',
        ''],
       ["This text of God's Word has been dedicated to the public domain. Free resources and databases are available at BereanBible.com.",
        '']], dtype='<U400')

In the next step, we want to examine more closely what our data looks like:

df = pd.DataFrame(bible_array[3:, :], columns=['reference','text'])
# The following works
df_3 = df.head(3)
HTML(df_3.to_html(index=False))
reference text
Genesis 1:1 In the beginning God created the heavens and the earth.
Genesis 1:2 Now the earth was formless and void, and darkness was over the surface of the deep. And the Spirit of God was hovering over the surface of the waters.
Genesis 1:3 And God said, “Let there be light,” and there was light.

2.1 Extracting Chapter Names

In the next lines, I create a function that extracts the chapter names. For example, any verse reference such as “Genesis 1:1” or “Genesis 1:2” will turn into “Genesis 1”. The logic is that chapters like “Genesis 1” will become documents in our topic modeling analysis. In NLP, a document can be any unit of text: a journal paper, a book, an article, a chapter, or even a sentence. In our case, each document is a chapter; “Genesis 1” and “Genesis 2”, for example, are treated as separate documents.

#The next function extracts the book and chapter: e.g. for Genesis 1:1, it extracts Genesis 1
def extract_book_name(verse_reference):
    # Use regular expression to extract the book name and chapter number
    match = re.match(r'(.+?)\s+(\d+):\d+', verse_reference)
    if match:
        book_name = match.group(1)
        chapter_number = match.group(2)
        return f"{book_name} {chapter_number}"
    else:
        return None
      
#The next function extracts only the book name: e.g. for Genesis 1:1, it extracts Genesis
def extract_book_name2(verse_reference):
    # Use regular expression to extract the book name
    match = re.match(r'(.+?)\s+\d+:\d+', verse_reference)
    if match:
        book_name = match.group(1)
        return book_name
    else:
        return None

Now I apply the function.

df["book_names"] = df["reference"].apply(extract_book_name)
df_5 = df.head(5)
#len(df["book_names"].unique())
HTML(df_5.to_html(index=False))
reference text book_names
Genesis 1:1 In the beginning God created the heavens and the earth. Genesis 1
Genesis 1:2 Now the earth was formless and void, and darkness was over the surface of the deep. And the Spirit of God was hovering over the surface of the waters. Genesis 1
Genesis 1:3 And God said, “Let there be light,” and there was light. Genesis 1
Genesis 1:4 And God saw that the light was good, and He separated the light from the darkness. Genesis 1
Genesis 1:5 God called the light “day,” and the darkness He called “night.” And there was evening, and there was morning—the first day. Genesis 1

In the next step, I examine the unique chapter names.

unique_entries = df["book_names"].unique()
unique_entries
array(['Genesis 1', 'Genesis 2', 'Genesis 3', ..., 'Revelation 20',
       'Revelation 21', 'Revelation 22'], dtype=object)

Next, I group the text by chapter, while keeping the original order of the verses.

# Add a column to store the original order
df['original_order'] = range(len(df))
merged_df = df.groupby('book_names')['text'].apply(' '.join).reset_index()
# Merge back the original order
merged_df = pd.merge(merged_df, df[['book_names', 'original_order']], on='book_names', how='left')
# Sort the DataFrame based on the original order
merged_df.sort_values('original_order', inplace=True)
# Drop the original_order column if it's no longer needed
merged_df.drop(columns='original_order', inplace=True)
merged_df = merged_df[['book_names', 'text']].drop_duplicates()
#Examining the first two entries
df_2 = merged_df.head(2)
HTML(df_2.to_html(index=False))
book_names text
Genesis 1 In the beginning God created the heavens and the earth. Now the earth was formless and void, and darkness was over the surface of the deep. And the Spirit of God was hovering over the surface of the waters. And God said, “Let there be light,” and there was light. And God saw that the light was good, and He separated the light from the darkness. God called the light “day,” and the darkness He called “night.” And there was evening, and there was morning—the first day. And God said, “Let there be an expanse between the waters, to separate the waters from the waters.” So God made the expanse and separated the waters beneath it from the waters above. And it was so. God called the expanse “sky.” And there was evening, and there was morning—the second day. And God said, “Let the waters under the sky be gathered into one place, so that the dry land may appear.” And it was so. God called the dry land “earth,” and the gathering of waters He called “seas.” And God saw that it was good. Then God said, “Let the earth bring forth vegetation: seed-bearing plants and fruit trees, each bearing fruit with seed according to its kind.” And it was so. The earth produced vegetation: seed-bearing plants according to their kinds and trees bearing fruit with seed according to their kinds. And God saw that it was good. And there was evening, and there was morning—the third day. And God said, “Let there be lights in the expanse of the sky to distinguish between the day and the night, and let them be signs to mark the seasons and days and years. And let them serve as lights in the expanse of the sky to shine upon the earth.” And it was so. God made two great lights: the greater light to rule the day and the lesser light to rule the night. And He made the stars as well. God set these lights in the expanse of the sky to shine upon the earth, to preside over the day and the night, and to separate the light from the darkness. And God saw that it was good. And there was evening, and there was morning—the fourth day. And God said, “Let the waters teem with living creatures, and let birds fly above the earth in the open expanse of the sky.” So God created the great sea creatures and every living thing that moves, with which the waters teemed according to their kinds, and every bird of flight after its kind. And God saw that it was good. Then God blessed them and said, “Be fruitful and multiply and fill the waters of the seas, and let birds multiply on the earth.” And there was evening, and there was morning—the fifth day. And God said, “Let the earth bring forth living creatures according to their kinds: livestock, land crawlers, and beasts of the earth according to their kinds.” And it was so. God made the beasts of the earth according to their kinds, the livestock according to their kinds, and everything that crawls upon the earth according to its kind. And God saw that it was good. Then God said, “Let Us make man in Our image, after Our likeness, to rule over the fish of the sea and the birds of the air, over the livestock, and over all the earth itself and every creature that crawls upon it.” So God created man in His own image; in the image of God He created him; male and female He created them. 
God blessed them and said to them, “Be fruitful and multiply, and fill the earth and subdue it; rule over the fish of the sea and the birds of the air and every creature that crawls upon the earth.” Then God said, “Behold, I have given you every seed-bearing plant on the face of all the earth, and every tree whose fruit contains seed. They will be yours for food. And to every beast of the earth and every bird of the air and every creature that crawls upon the earth—everything that has the breath of life in it—I have given every green plant for food.” And it was so. And God looked upon all that He had made, and indeed, it was very good. And there was evening, and there was morning—the sixth day.
Genesis 2 Thus the heavens and the earth were completed in all their vast array. And by the seventh day God had finished the work He had been doing; so on that day He rested from all His work. Then God blessed the seventh day and sanctified it, because on that day He rested from all the work of creation that He had accomplished. This is the account of the heavens and the earth when they were created, in the day that the LORD God made them. Now no shrub of the field had yet appeared on the earth, nor had any plant of the field sprouted; for the LORD God had not yet sent rain upon the earth, and there was no man to cultivate the ground. But springs welled up from the earth and watered the whole surface of the ground. Then the LORD God formed man from the dust of the ground and breathed the breath of life into his nostrils, and the man became a living being. And the LORD God planted a garden in Eden, in the east, where He placed the man He had formed. Out of the ground the LORD God gave growth to every tree that is pleasing to the eye and good for food. And in the middle of the garden were the tree of life and the tree of the knowledge of good and evil. Now a river flowed out of Eden to water the garden, and from there it branched into four headwaters: The name of the first river is Pishon; it winds through the whole land of Havilah, where there is gold. And the gold of that land is pure, and bdellium and onyx are found there. The name of the second river is Gihon; it winds through the whole land of Cush. The name of the third river is Hiddekel; it runs along the east side of Assyria. And the fourth river is the Euphrates. Then the LORD God took the man and placed him in the Garden of Eden to cultivate and keep it. And the LORD God commanded him, “You may eat freely from every tree of the garden, but you must not eat from the tree of the knowledge of good and evil; for in the day that you eat of it, you will surely die.” The LORD God also said, “It is not good for the man to be alone. I will make for him a suitable helper.” And out of the ground the LORD God formed every beast of the field and every bird of the air, and He brought them to the man to see what he would name each one. And whatever the man called each living creature, that was its name. The man gave names to all the livestock, to the birds of the air, and to every beast of the field. But for Adam no suitable helper was found. So the LORD God caused the man to fall into a deep sleep, and while he slept, He took one of the man’s ribs and closed up the area with flesh. And from the rib that the LORD God had taken from the man, He made a woman and brought her to him. And the man said: “This is now bone of my bones and flesh of my flesh; she shall be called ‘woman,’ for out of man she was taken.” For this reason a man will leave his father and mother and be united to his wife, and they will become one flesh. And the man and his wife were both naked, and they were not ashamed.

2.2 Pre-processing the Text

We will first pre-process the text. This involves a variety of steps: eliminating double spaces, lowercasing, removing punctuation, removing stop words, removing special characters, stemming, lemmatizing, and POS tagging. We will then use the processed text to create a wordcloud. We use the NLTK package to pre-process the text for the LDA analysis.

Step 1: Eliminating Stray Characters and Double Spaces

We first eliminate stray encoding characters such as smart quotes, apostrophes, and dashes, and then collapse multiple spaces into one.

#Step1: Eliminating apostrophes, dashes, etc.
merged_df['text'] = merged_df['text'].str.replace(r'[\x93\x90\x92\x94\x97]+', ' ', regex=True)
# \x93 - left double quotation mark (Windows-1252)
# \x90 - a control character with no standard visual representation
# \x92 - right single quotation mark (apostrophe)
# \x94 - right double quotation mark
# \x97 - em dash

#Step2: Eliminating double spaces
merged_df['text'] = merged_df['text'].str.replace(r'\s+', ' ', regex=True)

Step 2: Using Lower Case

In this step, I make everything lower case.

merged_df = merged_df.copy()
merged_df['text_lc'] = merged_df['text'].str.lower()

Step 3: Removing Punctuation

I then remove punctuation.

all_punctuation = string.punctuation

#Creating a function that removes punctuation
def remove_punct(text):
    return text.translate(str.maketrans('', '', all_punctuation))

#Applying the function
merged_df = merged_df.copy()
merged_df['text_np'] = merged_df['text_lc'].apply(lambda x: remove_punct(x))
merged_df = merged_df[["book_names", "text", "text_np"]]

#df_2 = merged_df.head(2)
#HTML(df_2.to_html(index=False))

Step 4: Removing Stop Words

In the next step, I remove all the stop words, such as “the”, “and”, and “so”. These are words that appear frequently but add limited meaning.

nltk.download('stopwords')
True

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bgpopescu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
stops = set(stopwords.words('english'))

# Define the function to remove stopwords, reusing the set defined above
def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in stops])

#Using the function for the stop words
merged_df = merged_df.copy()
merged_df['text_nostop'] = merged_df['text_np'].apply(remove_stopwords)
#df_2 = merged_df.head(2)
#HTML(df_2.to_html(index=False))

Step 5: Removing Special Characters

I now remove any remaining special (non-alphanumeric) characters.

#Function to replace special characters: that are not alphanumeric or whitespace characters
def remove_special_ch(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Replace characters that are not alphanumeric or whitespace with nothing
    return text

#Applying it to the text
merged_df = merged_df.copy()
merged_df['clean_text'] = merged_df['text_nostop'].apply(remove_special_ch)
merged_df= merged_df[["book_names", "text", "clean_text"]]

Let us now examine the clean text, which has no special characters, no stop words, no punctuation, is in lower case, and has no double spaces.

df_2 = merged_df.head(2)
HTML(df_2.to_html(index=False))
book_names text clean_text
Genesis 1 In the beginning God created the heavens and the earth. Now the earth was formless and void, and darkness was over the surface of the deep. And the Spirit of God was hovering over the surface of the waters. And God said, Let there be light, and there was light. And God saw that the light was good, and He separated the light from the darkness. God called the light day, and the darkness He called night. And there was evening, and there was morning the first day. And God said, Let there be an expanse between the waters, to separate the waters from the waters. So God made the expanse and separated the waters beneath it from the waters above. And it was so. God called the expanse sky. And there was evening, and there was morning the second day. And God said, Let the waters under the sky be gathered into one place, so that the dry land may appear. And it was so. God called the dry land earth, and the gathering of waters He called seas. And God saw that it was good. Then God said, Let the earth bring forth vegetation: seed-bearing plants and fruit trees, each bearing fruit with seed according to its kind. And it was so. The earth produced vegetation: seed-bearing plants according to their kinds and trees bearing fruit with seed according to their kinds. And God saw that it was good. And there was evening, and there was morning the third day. And God said, Let there be lights in the expanse of the sky to distinguish between the day and the night, and let them be signs to mark the seasons and days and years. And let them serve as lights in the expanse of the sky to shine upon the earth. And it was so. God made two great lights: the greater light to rule the day and the lesser light to rule the night. And He made the stars as well. God set these lights in the expanse of the sky to shine upon the earth, to preside over the day and the night, and to separate the light from the darkness. And God saw that it was good. And there was evening, and there was morning the fourth day. And God said, Let the waters teem with living creatures, and let birds fly above the earth in the open expanse of the sky. So God created the great sea creatures and every living thing that moves, with which the waters teemed according to their kinds, and every bird of flight after its kind. And God saw that it was good. Then God blessed them and said, Be fruitful and multiply and fill the waters of the seas, and let birds multiply on the earth. And there was evening, and there was morning the fifth day. And God said, Let the earth bring forth living creatures according to their kinds: livestock, land crawlers, and beasts of the earth according to their kinds. And it was so. God made the beasts of the earth according to their kinds, the livestock according to their kinds, and everything that crawls upon the earth according to its kind. And God saw that it was good. Then God said, Let Us make man in Our image, after Our likeness, to rule over the fish of the sea and the birds of the air, over the livestock, and over all the earth itself and every creature that crawls upon it. So God created man in His own image; in the image of God He created him; male and female He created them. God blessed them and said to them, Be fruitful and multiply, and fill the earth and subdue it; rule over the fish of the sea and the birds of the air and every creature that crawls upon the earth. Then God said, Behold, I have given you every seed-bearing plant on the face of all the earth, and every tree whose fruit contains seed. 
They will be yours for food. And to every beast of the earth and every bird of the air and every creature that crawls upon the earth everything that has the breath of life in it I have given every green plant for food. And it was so. And God looked upon all that He had made, and indeed, it was very good. And there was evening, and there was morning the sixth day. beginning god created heavens earth earth formless void darkness surface deep spirit god hovering surface waters god said let light light god saw light good separated light darkness god called light day darkness called night evening morning first day god said let expanse waters separate waters waters god made expanse separated waters beneath waters god called expanse sky evening morning second day god said let waters sky gathered one place dry land may appear god called dry land earth gathering waters called seas god saw good god said let earth bring forth vegetation seedbearing plants fruit trees bearing fruit seed according kind earth produced vegetation seedbearing plants according kinds trees bearing fruit seed according kinds god saw good evening morning third day god said let lights expanse sky distinguish day night let signs mark seasons days years let serve lights expanse sky shine upon earth god made two great lights greater light rule day lesser light rule night made stars well god set lights expanse sky shine upon earth preside day night separate light darkness god saw good evening morning fourth day god said let waters teem living creatures let birds fly earth open expanse sky god created great sea creatures every living thing moves waters teemed according kinds every bird flight kind god saw good god blessed said fruitful multiply fill waters seas let birds multiply earth evening morning fifth day god said let earth bring forth living creatures according kinds livestock land crawlers beasts earth according kinds god made beasts earth according kinds livestock according kinds everything crawls upon earth according kind god saw good god said let us make man image likeness rule fish sea birds air livestock earth every creature crawls upon god created man image image god created male female created god blessed said fruitful multiply fill earth subdue rule fish sea birds air every creature crawls upon earth god said behold given every seedbearing plant face earth every tree whose fruit contains seed food every beast earth every bird air every creature crawls upon earth everything breath life given every green plant food god looked upon made indeed good evening morning sixth day
Genesis 2 Thus the heavens and the earth were completed in all their vast array. And by the seventh day God had finished the work He had been doing; so on that day He rested from all His work. Then God blessed the seventh day and sanctified it, because on that day He rested from all the work of creation that He had accomplished. This is the account of the heavens and the earth when they were created, in the day that the LORD God made them. Now no shrub of the field had yet appeared on the earth, nor had any plant of the field sprouted; for the LORD God had not yet sent rain upon the earth, and there was no man to cultivate the ground. But springs welled up from the earth and watered the whole surface of the ground. Then the LORD God formed man from the dust of the ground and breathed the breath of life into his nostrils, and the man became a living being. And the LORD God planted a garden in Eden, in the east, where He placed the man He had formed. Out of the ground the LORD God gave growth to every tree that is pleasing to the eye and good for food. And in the middle of the garden were the tree of life and the tree of the knowledge of good and evil. Now a river flowed out of Eden to water the garden, and from there it branched into four headwaters: The name of the first river is Pishon; it winds through the whole land of Havilah, where there is gold. And the gold of that land is pure, and bdellium and onyx are found there. The name of the second river is Gihon; it winds through the whole land of Cush. The name of the third river is Hiddekel; it runs along the east side of Assyria. And the fourth river is the Euphrates. Then the LORD God took the man and placed him in the Garden of Eden to cultivate and keep it. And the LORD God commanded him, You may eat freely from every tree of the garden, but you must not eat from the tree of the knowledge of good and evil; for in the day that you eat of it, you will surely die. The LORD God also said, It is not good for the man to be alone. I will make for him a suitable helper. And out of the ground the LORD God formed every beast of the field and every bird of the air, and He brought them to the man to see what he would name each one. And whatever the man called each living creature, that was its name. The man gave names to all the livestock, to the birds of the air, and to every beast of the field. But for Adam no suitable helper was found. So the LORD God caused the man to fall into a deep sleep, and while he slept, He took one of the man s ribs and closed up the area with flesh. And from the rib that the LORD God had taken from the man, He made a woman and brought her to him. And the man said: This is now bone of my bones and flesh of my flesh; she shall be called ‘woman, for out of man she was taken. For this reason a man will leave his father and mother and be united to his wife, and they will become one flesh. And the man and his wife were both naked, and they were not ashamed. 
thus heavens earth completed vast array seventh day god finished work day rested work god blessed seventh day sanctified day rested work creation accomplished account heavens earth created day lord god made shrub field yet appeared earth plant field sprouted lord god yet sent rain upon earth man cultivate ground springs welled earth watered whole surface ground lord god formed man dust ground breathed breath life nostrils man became living lord god planted garden eden east placed man formed ground lord god gave growth every tree pleasing eye good food middle garden tree life tree knowledge good evil river flowed eden water garden branched four headwaters name first river pishon winds whole land havilah gold gold land pure bdellium onyx found name second river gihon winds whole land cush name third river hiddekel runs along east side assyria fourth river euphrates lord god took man placed garden eden cultivate keep lord god commanded may eat freely every tree garden must eat tree knowledge good evil day eat surely die lord god also said good man alone make suitable helper ground lord god formed every beast field every bird air brought man see would name one whatever man called living creature name man gave names livestock birds air every beast field adam suitable helper found lord god caused man fall deep sleep slept took one man ribs closed area flesh rib lord god taken man made woman brought man said bone bones flesh flesh shall called woman man taken reason man leave father mother united wife become one flesh man wife naked ashamed

Step 6: Stemming

Stemming is a crude method: it simply chops off the end of a word, and the result is not necessarily a real word. Both stemming and lemmatization are useful for reducing vocabulary size and vector dimensionality, so both speed up processing. For example, “bosses” becomes “boss”, while “replacement” becomes “replac” (not a real word).

There are multiple stemming algorithms. The most popular is Porter Stemmer in NLTK. For example:

ps = PorterStemmer()
#Stemming the word "walking"
ps.stem("walking")
'walk'
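We can also check the examples mentioned earlier; the expected outputs are shown as comments.

#Stemming "bosses" yields a real word
ps.stem("bosses")       # 'boss'
#Stemming "replacement" yields a truncated, non-word stem
ps.stem("replacement")  # 'replac'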

I first define a function that stems the words.

#Function to stem each word in a text
def stem_words(text):
    ps = PorterStemmer()
    return " ".join([ps.stem(word) for word in text.split()])

#We now apply this to our text
merged_df=merged_df.copy()
merged_df['stemmed_text'] = merged_df["clean_text"].apply(lambda x: stem_words(x))
merged_df= merged_df[["book_names", "text", "clean_text", "stemmed_text"]]

Step 7: Lemmatization & POS Tagging

Unlike stemming, lemmatization is more sophisticated: it uses actual rules of the language and returns the true root word (the lemma). Both stemming and lemmatization are useful for reducing vocabulary size and vector dimensionality, and both speed up processing.

For example, if we lemmatize, we obtain “good” from “better”. For “was”, we obtain “be”.

True

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/bgpopescu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
True

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/bgpopescu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
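To make the lemmatization examples above concrete, here is a minimal check (it assumes the WordNet data downloaded above is available); the expected outputs are shown as comments.

wnl = WordNetLemmatizer()
#Lemmatizing the adjective "better" returns "good"
wnl.lemmatize("better", pos=wordnet.ADJ)  # 'good'
#Lemmatizing the verb "was" returns "be"
wnl.lemmatize("was", pos=wordnet.VERB)    # 'be'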

I next define a mapping from part-of-speech (POS) tag prefixes to WordNet POS constants.

wordnet_map = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}

I then define a function that lemmatizes the text.

#Defining the function
lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
  # find POS tags
  pos_text = pos_tag(text.split())
  # map the first letter of each POS tag to a WordNet POS, defaulting to noun
  return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_text])

#Applying that function to the text.
merged_df = merged_df.copy()
merged_df['lemmatized'] = merged_df["clean_text"].apply(lambda x: lemmatize_words(x))
merged_df= merged_df[["book_names", "text", "clean_text", "stemmed_text", "lemmatized"]]
merged_df2 =  merged_df[["book_names", "lemmatized"]]

#Investigating the result
df_2 = merged_df2.head(2)
HTML(df_2.to_html(index=False))
book_names lemmatized
Genesis 1 begin god create heaven earth earth formless void darkness surface deep spirit god hover surface water god say let light light god saw light good separate light darkness god call light day darkness call night even morning first day god say let expanse water separate water water god make expanse separated water beneath water god call expanse sky even morning second day god say let water sky gather one place dry land may appear god call dry land earth gather water call sea god saw good god say let earth bring forth vegetation seedbearing plant fruit tree bear fruit seed accord kind earth produce vegetation seedbearing plant accord kind tree bear fruit seed accord kind god saw good evening morning third day god say let light expanse sky distinguish day night let sign mark season day year let serve light expanse sky shine upon earth god make two great light great light rule day lesser light rule night make star well god set light expanse sky shine upon earth preside day night separate light darkness god saw good evening morning fourth day god say let water teem live creature let bird fly earth open expanse sky god create great sea creature every live thing move water teem accord kind every bird flight kind god saw good god bless say fruitful multiply fill water seas let bird multiply earth even morning fifth day god say let earth bring forth living creature accord kind livestock land crawler beasts earth accord kind god make beast earth accord kind livestock accord kind everything crawl upon earth accord kind god saw good god say let u make man image likeness rule fish sea bird air livestock earth every creature crawl upon god create man image image god create male female create god bless say fruitful multiply fill earth subdue rule fish sea bird air every creature crawl upon earth god say behold give every seedbearing plant face earth every tree whose fruit contain seed food every beast earth every bird air every creature crawl upon earth everything breath life give every green plant food god look upon make indeed good even morning sixth day
Genesis 2 thus heaven earth complete vast array seventh day god finished work day rest work god bless seventh day sanctify day rest work creation accomplish account heaven earth create day lord god make shrub field yet appear earth plant field sprout lord god yet send rain upon earth man cultivate ground spring well earth watered whole surface ground lord god form man dust ground breathe breath life nostril man become living lord god plant garden eden east place man form ground lord god give growth every tree please eye good food middle garden tree life tree knowledge good evil river flow eden water garden branch four headwater name first river pishon wind whole land havilah gold gold land pure bdellium onyx find name second river gihon wind whole land cush name third river hiddekel run along east side assyria fourth river euphrates lord god take man place garden eden cultivate keep lord god command may eat freely every tree garden must eat tree knowledge good evil day eat surely die lord god also say good man alone make suitable helper ground lord god form every beast field every bird air bring man see would name one whatever man call living creature name man give name livestock bird air every beast field adam suitable helper find lord god cause man fall deep sleep sleep take one man rib close area flesh rib lord god take man make woman bring man say bone bone flesh flesh shall call woman man take reason man leave father mother united wife become one flesh man wife naked ashamed

As explained, the stemmed text differs from the lemmatized text.

2.3 Original Text vs. Clean Text

Text cleaning may have a significant effect on the length of the text, particularly if the cleaning process involves the removal of a large number of stop words or the stemming of words to their root form. This could result in a shorter text that is more focused on the core content of the text and less cluttered with common words or repetitive phrases.

We can now count the number of tokens within each type of text:

# Tokenize the text into words
merged_df['word_count_full'] = merged_df['text'].apply(lambda x: len(x.split()))
merged_df['word_count_lem'] = merged_df['lemmatized'].apply(lambda x: len(x.split()))
merged_df2 = merged_df[["book_names", "word_count_full", "word_count_lem"]]
#Investigating the result
df_10 = merged_df2.head(10)
HTML(df_10.to_html(index=False))
book_names word_count_full word_count_lem
Genesis 1 747 359
Genesis 2 608 262
Genesis 3 664 264
Genesis 4 622 275
Genesis 5 497 244
Genesis 6 507 228
Genesis 7 496 251
Genesis 8 515 246
Genesis 9 609 272
Genesis 10 432 227

I now switch to R to make use of ggplot2’s graphing capabilities and compare the number of words in the original text with the number in the lemmatized text.

Show the code
library(reticulate)
library(ggplot2)
library(ggpubr)

df <-  reticulate::py$merged_df
unclean<-ggplot(df, aes(x = word_count_full)) +
  geom_histogram(color = "white",
                 alpha = 0.5)+
  geom_density(aes(y = after_stat(density)* 80000), 
               #alpha = 0.5,
               fill=NA)+
  #Adding a secondary axis
  scale_y_continuous(name = "No. Book Chapters (e.g. Genesis 1, Genesis 2, etc)",
                     sec.axis = 
                       sec_axis(~.x/80000, 
                                name = "density"))+
  labs(fill = NULL)+
  xlab("No. Words")+
  theme_bw()+
  ggtitle("Original")
  


clean<-ggplot(df, aes(x = word_count_lem)) +
  geom_histogram(color = "white",
                 alpha = 0.5)+
    #Adding density and multiplying by a manually defined value
  #for visibility
  geom_density(aes(y = after_stat(density)* 40000), 
               #alpha = 0.5,
               fill=NA)+
  #Adding a secondary axis
  scale_y_continuous(name = "No. Book Chapters (e.g. Genesis 1, Genesis 2, etc)",
                     sec.axis = 
                       sec_axis(~.x/40000, 
                                name = "density"))+
  labs(fill = NULL)+
  xlab("No. Words")+
  theme_bw()+
  ggtitle("Lemmatized")


ggarrange(unclean, clean, ncol=2)

Thus, by performing basic text processing, we reduced the size of the text substantially.

2.4 Most Frequent Words

It could also be important to compare the most frequent words before and after cleaning. The most frequent words after preprocessing the text may reveal important themes and trends present in the text. These words can provide valuable insights into its content and structure, and can inform the topic modeling process.

To do that, I need to run:

# Function to tokenize text and count words
def count_words(text):
    words = re.findall(r'\w+', text.lower())
    return Counter(words)

# Combine all texts into a single string
all_text = ' '.join(merged_df['text'])
# Count the words
word_counts_unclean = count_words(all_text)

# Create a new DataFrame from word_counts
word_counts_unclean_df = pd.DataFrame(word_counts_unclean.items(), columns=['word', 'count'])

# Combine all texts into a single string
all_text = ' '.join(merged_df['lemmatized'])

# Count the words
word_counts_clean = count_words(all_text)

# Create a new DataFrame from word_counts
word_counts_clean_df = pd.DataFrame(word_counts_clean.items(), columns=['word', 'count'])

We can now plot the most frequent words in the original, unaltered text, versus the one that was cleaned. We notice some substantial differences.

Show the code
df_unclean <-  reticulate::py$word_counts_unclean_df
df_clean <-  reticulate::py$word_counts_clean_df

# Order the dataframe by count
df_unclean <- df_unclean[order(-df_unclean$count), ][1:10, ]
df_clean <- df_clean[order(-df_clean$count), ][1:10, ]


# Plot the dataframe using ggplot2
unclean<-ggplot(df_unclean, aes(x = reorder(word, -count), y = count)) +
  geom_bar(stat = "identity") +
  labs(title = "Word Count", x = "Word", y = "Count") +
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


clean<-ggplot(df_clean, aes(x = reorder(word, -count), y = count)) +
  geom_bar(stat = "identity") +
  labs(title = "Word Count", x = "Word", y = "Count") +
  theme_bw()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggarrange(unclean, clean, ncol=2)

2.5 Bigrams

Bigrams are pairs of consecutive words in a text. They can provide valuable insights into its content and structure.

True

[nltk_data] Downloading package punkt to /Users/bgpopescu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
#Joining together the entire text
text = ' '.join(merged_df['lemmatized'])
#Tokenizing the text
words = word_tokenize(text)
#Creating a list of bigrams
text_bigrams = list(bigrams(words))

# Count the occurrences of each bigram
bigram_counts = Counter(text_bigrams)

# Convert the bigram counts to a DataFrame
df_bigram_counts = pd.DataFrame(bigram_counts.items(), columns=['bigram', 'count'])

# Concatenate the individual words in the Bigram column
df_bigram_counts['bigram'] = df_bigram_counts['bigram'].apply(lambda x: ', '.join(x))

# Sort the DataFrame by count in descending order
df_bigram_counts = df_bigram_counts.sort_values(by='count', ascending=False)

We now visualize the most frequent bigrams.

Show the code
df_bigram_counts <-  reticulate::py$df_bigram_counts

# Order the dataframe by count
df_bigram_counts <- df_bigram_counts[order(-df_bigram_counts$count), ][1:10, ]
#The second most common bigram is "let, u", which is in fact "let, us".
#This is because wordnet._morphy uses a rule for nouns which replaces ending "s" with "".
#See also https://stackoverflow.com/questions/54784287/nltk-wordnetlemmatizer-processes-us-as-u
#So all the "us" became "u". I am thus transforming it back
df_bigram_counts$bigram[df_bigram_counts$bigram=="let, u"]<-"let, us"

# Plot the dataframe using ggplot2
bigram <- ggplot(df_bigram_counts, aes(x = reorder(bigram, -count), y = count)) +
  theme_bw() + 
  geom_bar(stat = "identity") +
  labs(title = "Bigrams", x = "Pair", y = "Count") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
bigram

2.6 Wordcloud

We can now create a wordcloud to examine the most commonly used words.

dict_from_df = merged_df.to_dict(orient='index')
# Extracting lemmatized text
text = ""
for chapter in dict_from_df.values():
    text += chapter['lemmatized'] + " "

# Generate word cloud
wordcloud = WordCloud(collocations=False,
  width=1200, height=600, background_color='white', max_font_size=100, max_words=200, scale=3).generate(text)

# Display the generated word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

3 Using LDA

Latent Dirichlet Allocation (LDA) is a generative statistical model used for topic modeling. It assumes that documents are probability distributions over topics, and topics are probability distributions over words. The goal of LDA is to uncover the latent (hidden) topics in a collection of documents and the distribution of words in those topics. LDA is unsupervised, meaning it doesn’t require labeled data. It is often used for tasks like document clustering, summarization, and content recommendation.
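To make this generative story concrete, below is a minimal, purely illustrative sketch of how LDA assumes a single document is produced. The toy vocabulary, the hyperparameters, and the document length are my own assumptions and are unrelated to the Bible corpus analyzed below.

#A toy sketch of LDA's generative process
rng = np.random.default_rng(0)
vocab = ["god", "light", "water", "king", "war", "love"]  #toy vocabulary
n_topics, alpha, eta = 2, 0.1, 0.1                        #assumed hyperparameters

#Each topic is a probability distribution over words
phi = rng.dirichlet([eta] * len(vocab), size=n_topics)
#Each document is a probability distribution over topics
theta = rng.dirichlet([alpha] * n_topics)

#Generate a short pseudo-document: pick a topic for each word, then a word from that topic
doc = []
for _ in range(8):
    z = rng.choice(n_topics, p=theta)     #topic assignment for this word
    w = rng.choice(len(vocab), p=phi[z])  #word drawn from that topic's distribution
    doc.append(vocab[w])
doc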

LDA is considered a state-of-the-art method for topic modeling, and there are many reasons why it is still widely used today, 20 years after its introduction in 2003. Its practical advantages include that it is effective and computationally inexpensive, that it handles both shorter and longer input documents, and that it produces topics that are easily interpretable.

3.1 Limitations of LDA

One limitation of LDA lies in the difficulty of ascertaining its efficacy, because it produces soft clusters of topics. Moreover, the reliability of the coherence and perplexity scores commonly used to evaluate LDA can sometimes be questioned. However, perhaps the most significant constraint of LDA, and the one that BERTopic seeks to address, is rooted in its fundamental premise: the bag-of-words representation. In LDA, documents are perceived as probabilistic blends of latent topics, with each topic characterized by a probability distribution over words, and documents represented through a bag-of-words model. While this representation suffices for uncovering latent themes, it lacks depth in capturing a document’s semantic structure. Neglecting the semantic relationships between words can critically impact the accuracy of results from a topic model. For instance, consider the sentence “The girl became the queen of England”; the bag-of-words approach fails to recognize the semantic connection between “girl” and “queen.” Consequently, LDA may overlook the nuanced meaning of a sentence, especially when semantic structure significantly influences word interpretation. Despite these limitations, it is still worth running LDA.
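As a toy illustration of this point (the two sentences and vectorizer settings are illustrative assumptions, not part of the analysis that follows), a bag-of-words representation puts “girl” and “queen” in unrelated columns of the document-term matrix, so nothing in the vectors encodes their semantic closeness.

#Two sentences with similar meaning but different content words
docs = ["the girl became the queen of england",
        "the woman became the monarch of england"]
vec = CountVectorizer()
X = vec.fit_transform(docs)
vec.get_feature_names_out()
X.toarray()
#The two vectors overlap only on generic words ("the", "became", "of", "england");
#the columns for girl/woman and queen/monarch are independent dimensions,
#so the model sees no relationship between them.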

3.2 Creating a Dictionary and Corpus

Before we run the algorithm, we need to create a dictionary. A dictionary is a mapping of words to unique integer ids: each unique word in the entire corpus is assigned a unique id. The dictionary is used to convert text documents into a bag-of-words representation, in which each document is treated as an unordered collection (or “bag”) of words and represented as a sparse vector of word counts (most elements of the vector are zero).

import gensim
from gensim import corpora
#Tokenize text
tokenized_texts = [text.split() for text in merged_df['lemmatized']]

# Create dictionary
dictionary = corpora.Dictionary(tokenized_texts)

# Tokenize the text and count word frequencies
word_freq = Counter()

for text in merged_df['lemmatized']:
    words = word_tokenize(text.lower())  # Tokenize text and convert to lowercase
    word_freq.update(words)

# Create human-readable dictionary with frequency
dictionary_with_frequency = dict(word_freq)
#len(dictionary_with_frequency)

A corpus is a collection of documents, in this case, the preprocessed text of the Bible. It is typically represented as a matrix, with each row representing a document and each column representing a word in the dictionary. The corpus allows the topic modeling algorithm to analyze the relationships between words and documents, and to identify the underlying themes and topics present in the text.

corpus = [dictionary.doc2bow(text) for text in tokenized_texts]
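To see what this encoding looks like, we can peek at the first few (token id, count) pairs of the first document; the exact ids and counts depend on the dictionary, so the values are only indicative.

#Inspecting the bag-of-words encoding of the first document (Genesis 1)
corpus[0][:5]
#Each pair is (token id, frequency); the ids can be mapped back to words
[(dictionary[word_id], count) for word_id, count in corpus[0][:5]]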

Once the dictionary and corpus have been created from the preprocessed text, the next step is to build the model and view the topics. To build the LDA model, it is important to specify the number of topics for the model to identify, as well as any other parameters that may be relevant for the analysis (e.g., the number of iterations, the learning rate). The model is then trained on the corpus, and the resulting topics are generated.

Choosing the right number of topics is crucial to ensure that the model is effective at capturing the underlying structure and themes of the text, and is interpretable and informative. One common approach is to use a measure such as the perplexity score or the coherence score to evaluate the model’s performance for a range of different numbers of topics.

Coherence measures are used in NLP to evaluate topics constructed by some topic model. Coherence measures are used to evaluate how well a topic model captures the underlying themes in a corpus of text. Topic coherence has been proposed as an intrinsic evaluation method for topic models and is defined as average or median of pairwise word similarities formed by top words of a given topic.
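Schematically, a pairwise topic coherence can be written as

$$\text{Coherence}(t) = \frac{2}{N(N-1)} \sum_{i<j} \text{sim}(w_i, w_j),$$

where $w_1, \dots, w_N$ are the top-$N$ words of topic $t$. This is a generic formulation for intuition; the c_v measure used below builds on it with a sliding window and NPMI-based word similarities. Perplexity, in turn, measures how well the model predicts held-out text, $\text{perplexity}(D) = \exp\left(-\sum_d \log p(\mathbf{w}_d) / \sum_d N_d\right)$, so lower values indicate a better fit. Note that gensim’s log_perplexity() returns the per-word log-likelihood bound rather than the perplexity itself, which is why the values printed below are negative.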

The following code allows us to do exactly that: to calculate coherence and perplexity score by the number of topics.

from gensim.models import CoherenceModel

# Initialize lists to store results
topic_nums = list(range(3, 17))
coherence_scores = []
perplexity_scores = []

#random_state ensures the results are reproducible
# Iterate over different numbers of topics
for num_topics in topic_nums:
    # Build LDA model
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=num_topics, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha=[0.01]*num_topics,
                                           per_word_topics=True,
                                           eta=[0.01]*len(dictionary.keys()))
    
    # Compute perplexity
    perplexity = lda_model.log_perplexity(corpus)
    
    # Compute coherence score
    coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_texts, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model_lda.get_coherence()
    
    # Append scores to lists
    coherence_scores.append(coherence_score)
    perplexity_scores.append(perplexity)
    
    print(f"Iteration {num_topics - 1}: Coherence Score = {coherence_score}, Perplexity = {perplexity}")
Iteration 2: Coherence Score = 0.3896506021447372, Perplexity = -24.7951825396147
Iteration 3: Coherence Score = 0.46360847879158296, Perplexity = -24.76838281751851
Iteration 4: Coherence Score = 0.44399766203089186, Perplexity = -24.67679524046883
Iteration 5: Coherence Score = 0.46328402729886226, Perplexity = -24.679357657409536
Iteration 6: Coherence Score = 0.4485418787121641, Perplexity = -24.66909080650075
Iteration 7: Coherence Score = 0.45709976027725435, Perplexity = -24.666750615458774
Iteration 8: Coherence Score = 0.4622491003942612, Perplexity = -24.623392780107455
Iteration 9: Coherence Score = 0.46631117354509294, Perplexity = -24.633483829014956
Iteration 10: Coherence Score = 0.4549033629236759, Perplexity = -24.621043961059595
Iteration 11: Coherence Score = 0.4643261505480911, Perplexity = -24.639711722716093
Iteration 12: Coherence Score = 0.4612562640173955, Perplexity = -24.63333706910718
Iteration 13: Coherence Score = 0.4335570698289995, Perplexity = -24.637688250079357
Iteration 14: Coherence Score = 0.4692925394327871, Perplexity = -24.591849637756678
Iteration 15: Coherence Score = 0.445953854376223, Perplexity = -24.618393516147275

# Create DataFrame to store results
results_df = pd.DataFrame({
    'Num Topics': topic_nums,
    'Coherence Score': coherence_scores,
    'Perplexity': perplexity_scores
})

With the help of ggplot in R, we can now plot the coherence and perplexity scores.

Show the code
library(ggplot2)
library(ggpubr)
results_df_r <-  reticulate::py$results_df


# Plot the dataframe using ggplot2
coh_graph <- ggplot(data=results_df_r, aes(x=`Num Topics`, y=`Coherence Score`)) +
  geom_line()+
  scale_x_continuous(breaks = unique(results_df_r$`Num Topics`)) +  # Specify x-axis breaks
  theme_bw() 

per_graph <- ggplot(data=results_df_r, aes(x=`Num Topics`, y=Perplexity)) +
  geom_line()+
  scale_x_continuous(breaks = unique(results_df_r$`Num Topics`)) +  # Specify x-axis breaks
  theme_bw()

ggarrange(coh_graph, per_graph, ncol=2)

Based on this graph, we can identify the number of topics that maximizes coherence.

# Find the number of topics that maximizes coherence
max_coherence_index = results_df['Coherence Score'].idxmax()
optimal_topics_coherence = results_df.loc[max_coherence_index, 'Num Topics']

Thus, it seems that 15 is the number of topics that yields the highest coherence, so I now re-run the model with 15 topics.

# Parameters
num_topics = optimal_topics_coherence
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=num_topics, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha=[0.01]*num_topics,
                                           per_word_topics=True,
                                           minimum_probability=0.0,
                                           eta=[0.01]*len(dictionary.keys()))
                                           
                                           
#lda_model.print_topics(num_topics=num_topics, num_words=20)

from gensim.models import CoherenceModel
# Compute coherence score
coherence_model_lda = CoherenceModel(model=lda_model, texts=tokenized_texts, dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model_lda.get_coherence()
print(f"Coherence Score: {coherence_score} for {num_topics} topics.")
Coherence Score: 0.4692925394327871 for 15 topics.

The output from LDA can be used to identify the topics that are present in the corpus and the words that are associated with each topic. Thus, I create a function that lets us see the keywords associated with each topic for every document. Note that only the first few keywords are shown for each topic; together they account for about 90% of the topic’s weight within a document, and the remaining words add very little.

def format_topics_sent(ldamodel, corpus, texts):
    formatted_topics = []
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row[0], key=lambda x: x[1], reverse=True)
        topics_info = []
        
        for j, (topic_num, prop_topic) in enumerate(row):  # Iterate over every topic for this document
            wp = ldamodel.show_topic(topic_num)
            topic_keywords = ", ".join([word for word, prop in wp])
            topic_info = {
                'topic': int(topic_num),
                'contrib': round(prop_topic, 4),
                'keywords': topic_keywords,
                'doc': i+1
            }
            topics_info.append(topic_info)

        formatted_topics.extend(topics_info)
    
    return formatted_topics

#Using the function to create a dataframe.
sent_topics_df = format_topics_sent(lda_model, corpus, tokenized_texts)
sent_topics_df = pd.DataFrame(sent_topics_df)
sent_topics_df
       topic  contrib                                           keywords   doc
0         12   0.4155  like, come, one, day, say, earth, hand, man, w...     1
1          3   0.4120  god, know, love, u, one, may, lord, thing, spi...     1
2          2   0.1154  lord, god, people, land, israel, say, nation, ...     1
3          9   0.0412  eat, day, house, drink, fruit, sabbath, wine, ...     1
4         14   0.0124  cubit, four, side, one, five, three, long, two...     1
...      ...      ...                                                ...   ...
17830      1   0.0000  tent, ark, tabernacle, set, curtain, meeting, ...  1189
17831      4   0.0000  son, father, abraham, jacob, joseph, brother, ...  1189
17832      8   0.0000  offering, offer, priest, lord, sacrifice, sin,...  1189
17833     11   0.0000  son, descendant, tribe, clan, levites, manasse...  1189
17834     14   0.0000  cubit, four, side, one, five, three, long, two...  1189

[17835 rows x 4 columns]

I now save the names of the chapters in the Bible associated with each document number.

#merged_df["book_names"].unique()
merged_df_names = merged_df[["book_names"]]
merged_df_names = merged_df_names[["book_names"]].reset_index(drop=True)
merged_df_names.index += 1
merged_df_names.rename_axis("index", inplace=True)
merged_df_names['doc'] = merged_df_names.index
#Merging the names
merged_df2 = pd.merge(sent_topics_df, merged_df_names, left_on='doc', right_on="doc", how='left')
#Seeing the result
df_10 = merged_df2.head(10)
HTML(df_10.to_html(index=False))
topic contrib keywords doc book_names
12 0.4155 like, come, one, day, say, earth, hand, man, water, angel 1 Genesis 1
3 0.4120 god, know, love, u, one, may, lord, thing, spirit, give 1 Genesis 1
2 0.1154 lord, god, people, land, israel, say, nation, make, house, day 1 Genesis 1
9 0.0412 eat, day, house, drink, fruit, sabbath, wine, food, bread, feast 1 Genesis 1
14 0.0124 cubit, four, side, one, five, three, long, two, face, hundred 1 Genesis 1
8 0.0032 offering, offer, priest, lord, sacrifice, sin, altar, shall, burnt, blood 1 Genesis 1
0 0.0000 woman, wife, must, husband, man, commit, lord, give, sister, mother 1 Genesis 1
1 0.0000 tent, ark, tabernacle, set, curtain, meeting, place, command, leather, fifty 1 Genesis 1
4 0.0000 son, father, abraham, jacob, joseph, brother, isaac, year, become, esau 1 Genesis 1
5 0.0000 say, go, come, tell, son, ask, father, man, one, send 1 Genesis 1

We can now finally create a plot that shows the contribution of every topic to each document.

Show the code
library(tidyr)
library(tidyverse)
library(reticulate)

merged_topics_df <-  reticulate::py$merged_df2

reshaped_df2<-merged_topics_df%>%
  group_by(doc)%>%
  mutate(rank = rank(-contrib, ties.method = "min"))


reshaped_df2$rowno<-as.numeric(rownames(reshaped_df2))
reshaped_df2_756<-subset(reshaped_df2, doc<=50)

topics_genesis_lda<-ggplot(reshaped_df2_756, aes(topic, reorder(book_names, -doc))) +   
  geom_tile(aes(fill = contrib), color=ifelse(reshaped_df2_756$rank == 1, "red", NA), linewidth=1) +
  scale_fill_viridis_c()+
  scale_x_continuous(limits = c(min(reshaped_df2_756$topic)-1, 
                                max(reshaped_df2_756$topic)+1), breaks = min(reshaped_df2_756$topic):max(reshaped_df2_756$topic))+
    ylab("Document")+
  ggtitle("Contribution of topics to every document")+
    labs(fill = "Topic\nContribution\nScores")
topics_genesis_lda

We can now identify the top 10 keywords associated with every topic.

topic_keys = {'Topic_' + str(i): [(token, score) for token, score in lda_model.show_topic(i, topn=10)] for i in range(0, lda_model.num_topics)}

# Initialize an empty list to store data
data = []

# Iterate through the topic_keys dictionary and flatten the data
for topic, keywords in topic_keys.items():
    for token, score in keywords:
        data.append([topic, token, score])

# Create a DataFrame
df = pd.DataFrame(data, columns=['Topic', 'Token', 'Score'])
#Seeing the result
df_15 = df.head(15)
HTML(df_15.to_html(index=False))
Topic Token Score
Topic_0 woman 0.082425
Topic_0 wife 0.059742
Topic_0 must 0.053713
Topic_0 husband 0.045201
Topic_0 man 0.038121
Topic_0 commit 0.031362
Topic_0 lord 0.027722
Topic_0 give 0.025856
Topic_0 sister 0.023506
Topic_0 mother 0.021462
Topic_1 tent 0.193075
Topic_1 ark 0.135989
Topic_1 tabernacle 0.106218
Topic_1 set 0.048365
Topic_1 curtain 0.046702

We will now graph the keywords associated with every topic.

Show the code part 2
#Creating a list of topics
original_topics <- unique(df2c$topic_name)

#Creating a list of labels
labels <- c(
  "Family Roles",
  "Sacred Spaces",
  "Divine Covenant",
  "God\'s Love",
  "Ancestral Lineage",
  "Communication Verbs",
  "Holy Measurements",
  "City Declarations",
  "Sacred Offerings",
  "Daily Life",
  "Monarchy and Nations",
  "Lineage of Levites",
  "Future Prophecies",
  "Universal Identity",
  "Measurement Units"
)

#Numbering the topics
topic_no_num <- as.numeric(gsub("\\D", "", original_topics))
Show the code part 3
topic_no_num_reorg <- topic_no_num+2

# Create a list combining original topics and their corresponding labels
topics_with_labels <- list(original_topics = original_topics, 
                           labels = labels,
                           topic_no_num = topic_no_num,
                           topic_no_num_reorg = topic_no_num_reorg)

#Assigning the labels to the dataframe
df2d<-df2c
df2d$topic_label<-NA
for (key in 1:length(topics_with_labels$original_topics)) {
  # Extracting current label pair
  current_label_pair <- topics_with_labels$original_topics[[key]]
  assignment <- topics_with_labels$labels[[key]]
  # Finding rows where Topic matches the first element of the label pair
  rows_to_update <- df2d$topic_name == current_label_pair[1]
  # Updating topic_label where the condition is met
  df2d$topic_label[rows_to_update] <- assignment
}

#Including the names
df2d$up_label<-paste(df2d$topic_no, df2d$topic_label, sep="\n ")  

#Arranging the topics
library(ggh4x)
df2e<-subset(df2d, !duplicated(Topic))
df2e <- df2e %>%
  arrange(topic)

#Creating a graph with the topics
topics_lda_keys<-ggplot(df2c, aes(Rank, reorder(topic_no, -topic))) +   
  geom_tile(aes(fill = Score))+
  scale_fill_viridis_c()+
  geom_label(aes(y=topic_no, x=Rank, label=Token), size=1.9)+
  scale_x_continuous(limits = c(min(df2b$Rank)-1, 
                                max(df2b$Rank)+1), 
                     breaks = min(df2b$Rank):max(df2b$Rank))+
  guides(y.sec = guide_axis_manual(labels = rev(df2e$topic_label)))+
  labs(fill = "Keyword\nContribution\nScores")+
  ylab("Topic")+
  theme(legend.position = "bottom")
topics_lda_keys

Finally, we can count the number of documents associated with every topic.

Show the code part 1
reshaped_count<-subset(reshaped_df2, rank==1)
topic_counts <- data.frame(table(reshaped_count$topic))
names(topic_counts)[1]<-"topic"
names(topic_counts)[2]<-"no_documents"

#Assigning labels
topic_counts$topic_label<-NA
for (key in 1:length(topics_with_labels$topic_no_num)) {
  # Extracting current label pair
  current_label_pair <- topics_with_labels$topic_no_num[[key]]
  assignment <- topics_with_labels$labels[[key]]
  # Finding rows where Topic matches the first element of the label pair
  rows_to_update <- topic_counts$topic == current_label_pair[1]
  # Updating topic_label where the condition is met
  topic_counts$topic_label[rows_to_update] <- assignment
}

#Graphing the dataframe
topics_lda<-ggplot(topic_counts, aes(y =topic, x = no_documents)) +
  geom_bar(stat = "identity")+
  guides(y.sec = guide_axis_manual(labels = topic_counts$topic_label))+
  labs(fill = "Keyword\nContribution\nScores")+
  ylab("Topic")+
  theme(legend.position = "bottom")+
  ggtitle("Topics LDA")
topics_lda

Show the code part 1
#sum(topic_counts$no_documents)

4 Using BERTopic

BERTopic builds on BERT (Bidirectional Encoder Representations from Transformers) and is a text analysis technique based on word embeddings. Word embeddings are contextualized word representations learned from a large corpus of text using a deep neural network. They capture rich semantic information about each word, reflecting both its meaning and its contextual relationships with other words. Specifically, the method assigns a real-valued vector to each word that encodes its meaning in such a way that similar words have vectors that are close together in the vector space.
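As a brief illustration of this idea, the sketch below (not part of the original analysis) encodes three invented sentences with the pre-trained sentence encoder that BERTopic uses by default, sentence-transformers/all-MiniLM-L6-v2 (described in more detail below), and checks that semantically similar sentences end up close together; it assumes the sentence-transformers package is installed.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
sentences = [
    "In the beginning God created the heavens and the earth.",
    "God made the heavens and the earth at the start.",
    "The altar was five cubits long and five cubits wide.",
]
embeddings = encoder.encode(sentences)
print(embeddings.shape)                        # (3, 384): one 384-dimensional vector per sentence
print(cosine_similarity(embeddings).round(2))  # the first two sentences should be the most similar pair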

BERTopic leverages transformer embeddings and c-TF-IDF to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions. BERTopic clusters documents and adds meaning to the clusters by linking them with terms ranked by their c-TF-IDF scores, which gives an impression of which words are relatively prevalent in each topic. BERTopic has been shown to outperform other topic modeling techniques, such as LDA, in terms of topic coherence and interpretability. The suggested minimum amount of data for BERTopic is roughly 1,000 documents.

There are a variety of parameters in BERTopic. The first is the choice of model for encoding text into dense vector embeddings. sentence-transformers/all-MiniLM-L6-v2 is a pre-trained transformer-based language model designed to be fine-tuned for a wide range of NLP tasks, such as text classification, and is available through Hugging Face’s Sentence Transformers library. The model was trained on a large corpus of text data using a masked language modeling objective, which allows it to learn contextual representations of words and sentences. It creates 384-dimensional sentence embeddings. Other important parameters that can be changed in BERTopic include:

UMAP

Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality reduction technique that aims to preserve both local and global structure of high-dimensional data. It is based on the idea of constructing a topological representation of the data manifold using a graph-based approach, where points close to each other in the high-dimensional space are connected by edges in the graph.
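As a minimal sketch of how this dimensionality reduction step can be run on its own, with hypothetical random data standing in for real sentence embeddings, using the same umap-learn package imported elsewhere in this report:

import numpy as np
from umap import UMAP

# Hypothetical 384-dimensional embeddings for 200 documents
embeddings = np.random.rand(200, 384)

# Reduce to 5 dimensions; neighboring points in the original space
# should remain close in the reduced space
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                  metric='cosine', random_state=42)
reduced = umap_model.fit_transform(embeddings)
print(reduced.shape)  # (200, 5)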

HDBSCAN

A variety of clustering models are available, such as HDBSCAN, k-Means, and BIRCH. The default, HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), is a density-based clustering algorithm. HDBSCAN uses a hierarchical density-based approach to cluster data points and is designed to remain effective even for large and complex datasets. It applies a cluster stability analysis to select the optimal clustering solution. When HDBSCAN is used in BERTopic, a number of outlier documents may be created if they do not fall within any of the created topics. They are assigned to a topic labeled “-1”, which contains the outlier documents that cannot be assigned to a topic given an automatically calculated probability threshold.
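A minimal, self-contained sketch of this clustering step, using random data in place of real reduced embeddings; documents that HDBSCAN cannot place in any dense cluster receive the label -1:

import numpy as np
from hdbscan import HDBSCAN

# Hypothetical reduced embeddings (e.g. the output of the UMAP step)
reduced = np.random.rand(200, 5)

clusterer = HDBSCAN(min_cluster_size=15, metric='euclidean', prediction_data=True)
labels = clusterer.fit_predict(reduced)

# Cluster labels per document; -1 marks outliers that belong to no cluster
print(sorted(set(labels)))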

CountVectorizer

The CountVectorizer is a method in scikit-learn that converts a collection of text documents into a matrix of token counts in order to extract features from text. Specifically, it converts the text documents into a bag-of-words representation, where each document is represented by a vector that counts the frequency of each word in the document. In addition, it performs several text pre-processing steps, such as tokenizing the text, lowercasing, and removing stopwords. Thus, within BERTopic, there is no need to pre-process the text separately.
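A small sketch of this bag-of-words step on two invented verses; the vocabulary and count matrix printed below are what the CountVectorizer hands on to the weighting step:

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "In the beginning God created the heavens and the earth",
    "And God said let there be light and there was light",
]
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
bow = vectorizer.fit_transform(docs)       # sparse document-term count matrix
print(vectorizer.get_feature_names_out())  # vocabulary after lowercasing and stopword removal
print(bow.toarray())                       # word counts per document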

Weighting Scheme: c-TF-IDF

The goal is to discern what sets apart one cluster from another based on the generated bag-of-words representation. We seek distinctive words that characterize each cluster, setting it apart from the rest. To achieve this, BERTopic employs a modified version of TF-IDF, known as c-TF-IDF. This approach assigns each cluster a specific topic, facilitating the identification of distinguishing features within each cluster’s word representation. Unlike traditional TF-IDF, which focuses on individual documents, c-TF-IDF evaluates the significance of words within clusters of documents. Consequently, this method enables the creation of topic-word distributions for each cluster, shedding light on the unique characteristics of each topic.
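In BERTopic, the class-based weight of a term t in cluster c is roughly its frequency within the cluster multiplied by log(1 + A / f(t)), where A is the average number of words per cluster and f(t) is the term's frequency across all clusters. The sketch below applies ClassTfidfTransformer to two invented clusters (each represented by the concatenation of its documents); it assumes the transformer follows the usual scikit-learn fit/transform interface, as it does in recent BERTopic versions.

from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer

# One concatenated "document" per hypothetical cluster
docs_per_cluster = [
    "offering priest altar sacrifice blood offering",
    "son father jacob joseph brother son lineage",
]
counts = CountVectorizer().fit_transform(docs_per_cluster)

# c-TF-IDF up-weights terms that are frequent within a cluster and
# down-weights terms that are frequent across all clusters
ctfidf = ClassTfidfTransformer(reduce_frequent_words=True).fit_transform(counts)
print(ctfidf.toarray().round(3))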

To recap, in order to run BERTopic, I ran a few steps:

  1. The text was embedded with BERT embeddings: SentenceTransformers was used to create context-based representations with the pre-trained model all-MiniLM-L6-v2.

  2. After a numerical representation of the documents was created, dimensionality reduction was applied with UMAP.

  3. The reduced embeddings were then clustered with HDBSCAN.

  4. Within each cluster, a bag-of-words representation was generated through CountVectorizer(). This function also performs several text pre-processing steps, such as tokenizing the text, lowercasing, and removing stopwords.

  5. The bag-of-words representation of each cluster was weighted with c-TF-IDF, generated through ClassTfidfTransformer(), to distinguish the clusters from one another.

4.1 Limitations of BERTopic

BERTopic is an improvement over LDA in that it addresses the problem of semantic understanding through its embedding component. Thanks to its support for hierarchical topic reduction, BERTopic also allows for a more nuanced understanding of the relationships between topics.

Its limitations include the time needed to run and fine-tune the model, which is significantly more resource-demanding and computationally expensive than LDA. Because it relies on deep learning models, it scales well with larger corpora but can struggle to produce accurate topic representations on smaller datasets.

import pandas as pd
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer
from hdbscan import HDBSCAN
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

# Define your dataframe merged_df here

coherence_scores = []

for nr_topics in range(3, 17):  # Iterate over nr_topics from 3 to 16
    #print(f"Calculating coherence for {nr_topics} topics...")
    # Define and fit BERTopic model
    umap_model = UMAP(n_neighbors=25, n_components=20, min_dist=0.0, metric='cosine', low_memory=False, random_state=42)
    ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
    hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', prediction_data=True)
    vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
    
    topic_model = BERTopic(min_topic_size=10, top_n_words=20, umap_model=umap_model, 
                           ctfidf_model=ctfidf_model, vectorizer_model=vectorizer_model,
                           calculate_probabilities=True, hdbscan_model=hdbscan_model,
                           nr_topics=nr_topics)
    
    topics, probs = topic_model.fit_transform(merged_df['text'])

    # Preprocess Documents
    documents = pd.DataFrame({"Document": merged_df['text'], "ID": range(len(merged_df['text'])), "Topic": topics})
    documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
    cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

    # Extract vectorizer and analyzer from BERTopic
    vectorizer = topic_model.vectorizer_model
    analyzer = vectorizer.build_analyzer()

    # Extract features for Topic Coherence evaluation
    words = vectorizer.get_feature_names_out()
    tokens = [analyzer(doc) for doc in cleaned_docs]
    dictionary = corpora.Dictionary(tokens)
    corpus = [dictionary.doc2bow(token) for token in tokens]
    topic_words = [[word for word, _ in topic_model.get_topic(topic)] for topic in range(len(set(topics)) - 1)]

    # Evaluate Coherence
    coherence_model = CoherenceModel(topics=topic_words, texts=tokens, corpus=corpus, dictionary=dictionary, coherence='c_v')
    coherence = coherence_model.get_coherence()

    coherence_scores.append({'nr_topics': nr_topics, 'coherence_score': coherence})
    print(f"Calculated coherence for {nr_topics} topics.")
Calculated coherence for 3 topics.
Calculated coherence for 4 topics.
Calculated coherence for 5 topics.
Calculated coherence for 6 topics.
Calculated coherence for 7 topics.
Calculated coherence for 8 topics.
Calculated coherence for 9 topics.
Calculated coherence for 10 topics.
Calculated coherence for 11 topics.
Calculated coherence for 12 topics.
Calculated coherence for 13 topics.
Calculated coherence for 14 topics.
Calculated coherence for 15 topics.
Calculated coherence for 16 topics.

# Create DataFrame to store coherence scores
coherence_df = pd.DataFrame(coherence_scores)

We can now visualize the coherence scores.

Show the code
library(ggplot2)
library(ggpubr)
results_df_r <-  reticulate::py$coherence_df


# Plot the dataframe using ggplot2
coh_graph <- ggplot(data=results_df_r, aes(x=nr_topics, y=coherence_score)) +
  geom_line()+
  scale_x_continuous(breaks = unique(results_df_r$nr_topics)) +  # Specify x-axis breaks
  theme_bw() 
coh_graph

4.2 Rerunning BERTopic with the chosen number of topics

In order to make the topic models somewhat comparable to the LDA, and to make the topics more granular, I choose 11 topics, although 4 seems to have the highest coherence. Nevertheless, 11 is still a reasonable number with a reasonable coherence.
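For reference, the setting with the highest coherence can be read off directly from the coherence_df computed above; this is only a convenience snippet, not part of the original workflow.

# Identify the number of topics with the maximum coherence score
best = coherence_df.loc[coherence_df['coherence_score'].idxmax()]
print(best['nr_topics'], best['coherence_score'])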

#Model
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer
from hdbscan import HDBSCAN

chosen_no_topics = 11
min_topic_size = 10   # default: 10
top_n_words = 20      # default: 10

umap_model = UMAP(n_neighbors=25, n_components=20, min_dist=0.0, metric='cosine', low_memory=False, random_state=42)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', prediction_data=True)

# we add this to remove stopwords
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
#Running the model
topic_model = BERTopic(min_topic_size=min_topic_size,
                           top_n_words=top_n_words,
                           umap_model=umap_model, 
                           ctfidf_model=ctfidf_model,
                           vectorizer_model=vectorizer_model,
                           calculate_probabilities=True,
                           hdbscan_model=hdbscan_model,
                           nr_topics = chosen_no_topics)

The following line fits the model and saves the topics and the probabilities. It is important to note that BERTopic does not need lemmatized text: its transformers are trained on “real and clean” text, not on text stripped of stopwords or reduced to lemmas or tokens. In addition, BERTopic uses scikit-learn’s CountVectorizer by default, which performs tokenization, lowercasing, and stopword removal automatically. By the end of the computation, stopwords act as non-informative noise and largely end up in the outlier topic (topic_id = -1).

topics, probs= topic_model.fit_transform(merged_df['text'])
#The following code describes how to calculate coherence scores.
#https://github.com/MaartenGr/BERTopic/issues/90
from bertopic import BERTopic
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

# Preprocess Documents
docs = merged_df['text']
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics})

documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

# Extract vectorizer and analyzer from BERTopic
vectorizer = topic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()

# Extract features for Topic Coherence evaluation
words = vectorizer.get_feature_names_out()

tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[word for word, _ in topic_model.get_topic(topic)]
               for topic in range(len(set(topics))-1)]

# Evaluate
coherence_model = CoherenceModel(topics=topic_words, 
                                 texts=tokens, 
                                 corpus=corpus,
                                 dictionary=dictionary, 
                                 coherence='c_v')
coherence_score = coherence_model.get_coherence()
coherence_score
0.6514031139119772

BERTopic created 11 topics. The first one is topic -1. In BERTopic, a topic with the label “-1” typically indicates that the document does not significantly align with any of the defined topics in the model. It is often considered noise or an outlier in the context of the topics learned by the model.
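As an aside, the size and auto-generated name of each topic, including the outlier topic, can be inspected with BERTopic's get_topic_info() method; the snippet below is illustrative and its output is not reproduced here.

# Inspect topic sizes and auto-generated names; Topic -1 collects the outliers
topic_info = topic_model.get_topic_info()
print(topic_info[["Topic", "Count", "Name"]])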

The 11 topics created are listed below.

#Creating a list with unique topics
topics_relabeled = list(set(topics))
#Ordering the topic numbers
ordered_topics = sorted(topics_relabeled)
#These are the topics
ordered_topics
[-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

BERTopic also created an array with probabilities that has the following dimensions:

probs.shape
(1189, 10)

The first number, 1189, corresponds to the number of documents. The second number, 10, corresponds to the probabilities for the 10 non-outlier topics. Note that no probabilities were computed for the residual topic (-1). We can now finally assign the probabilities that we calculated to a dataframe.

Therefore, I now create a dataframe that associates the topic probabilities (plus an empty column for the outlier topic, for a total of 11 topics) with the 1,189 documents in the analysis.

probs_df = pd.DataFrame(probs)
probs_df.rename(columns={col: f'topic_{col}' for col in probs_df.columns}, inplace=True)
#Adding column "topic_-1"
probs_df.insert(0, "topic_-1", None)
df_5 = probs_df.head(5)
HTML(df_5.to_html(index=False))
topic_-1 topic_0 topic_1 topic_2 topic_3 topic_4 topic_5 topic_6 topic_7 topic_8 topic_9
None 6.398443e-02 1.220270e-01 3.308001e-02 4.360875e-02 3.036763e-02 2.980985e-02 1.379953e-01 2.669987e-02 0.021563 4.201700e-02
None 6.674281e-02 1.216345e-01 3.250206e-02 4.572841e-02 3.035448e-02 3.098863e-02 1.593987e-01 2.652198e-02 0.022427 4.094005e-02
None 6.387558e-02 8.711585e-02 2.637531e-02 6.316551e-02 2.774398e-02 2.930832e-02 1.437205e-01 2.244477e-02 0.022091 3.242878e-02
None 2.494071e-01 6.160130e-02 1.606678e-02 2.990693e-02 1.672392e-02 4.181984e-02 3.949162e-02 1.483390e-02 0.049668 1.809949e-02
None 1.202258e-307 3.806346e-308 8.786780e-309 1.248075e-308 8.377273e-309 2.034558e-308 1.721047e-308 8.052732e-309 1.000000 9.428069e-309

I now add the book names to the dataframe.

probs_df.insert(0, "book_names", merged_df["book_names"].unique())
merged_df_with_topics=probs_df

We can now inspect what we just created:

df_5 = merged_df_with_topics.head(5)
HTML(df_5.to_html(index=False))
book_names topic_-1 topic_0 topic_1 topic_2 topic_3 topic_4 topic_5 topic_6 topic_7 topic_8 topic_9
Genesis 1 None 6.398443e-02 1.220270e-01 3.308001e-02 4.360875e-02 3.036763e-02 2.980985e-02 1.379953e-01 2.669987e-02 0.021563 4.201700e-02
Genesis 2 None 6.674281e-02 1.216345e-01 3.250206e-02 4.572841e-02 3.035448e-02 3.098863e-02 1.593987e-01 2.652198e-02 0.022427 4.094005e-02
Genesis 3 None 6.387558e-02 8.711585e-02 2.637531e-02 6.316551e-02 2.774398e-02 2.930832e-02 1.437205e-01 2.244477e-02 0.022091 3.242878e-02
Genesis 4 None 2.494071e-01 6.160130e-02 1.606678e-02 2.990693e-02 1.672392e-02 4.181984e-02 3.949162e-02 1.483390e-02 0.049668 1.809949e-02
Genesis 5 None 1.202258e-307 3.806346e-308 8.786780e-309 1.248075e-308 8.377273e-309 2.034558e-308 1.721047e-308 8.052732e-309 1.000000 9.428069e-309
Show the code
library(tidyr)
library(tidyverse)

merged_topics_df <-  reticulate::py$merged_df_with_topics
merged_topics_df<-subset(merged_topics_df, select = -c(`topic_-1`))

# Reshape the dataframe
reshaped_df <- merged_topics_df %>%
  pivot_longer(cols = starts_with("topic"), 
               names_to = "topic", 
               values_to = "probs")


reshaped_df<-reshaped_df%>%
  mutate(topic_no = as.numeric(gsub("\\D", "", topic)),
         chapter_no = as.numeric(gsub("\\D", "", book_names)))


# Adding the incremental index
reshaped_df$order <- match(reshaped_df$book_names, unique(reshaped_df$book_names))



reshaped_df2<-reshaped_df%>%
  group_by(book_names)%>%
  mutate(rank = rank(-probs, ties.method = "min"))


reshaped_df2$rowno<-as.numeric(rownames(reshaped_df2))
reshaped_df2_301<-subset(reshaped_df2, rowno<=300)
reshaped_df2_301<-reshaped_df2_301%>%
  mutate(book_names = fct_reorder(book_names, order)) 

#reshaped_df2_301$topic_no
#reshaped_df2_301$probs

topics_genesis_bert<-ggplot(reshaped_df2_301, aes(topic_no, reorder(book_names, -order))) +   
  geom_tile(aes(fill = probs), color=ifelse(reshaped_df2_301$rank == 1, "red", NA), linewidth=1) +
  scale_fill_viridis_c()+
  #scale_x_continuous(limits = c(0, 15), breaks = 1:14)
  scale_x_continuous(limits = c(min(reshaped_df2_301$topic_no)-1, 
                                max(reshaped_df2_301$topic_no)+1), breaks = min(reshaped_df2_301$topic_no):max(reshaped_df2_301$topic_no))+
    ylab("Document")+
  ggtitle("Contribution of the topics to every document")+
    labs(fill = "Topic\nContribution\nProbabilities")
topics_genesis_bert

4.3 Connection among Topics

One interesting feature of BERTopic is the ability to visualize the connections among topics.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

distance_matrix = cosine_similarity(np.array(topic_model.topic_embeddings_)[1:, :])
labels = ["_".join(label.split("_")[1:]) for label in topic_model.get_topic_info().Name[1:]]
# Creating DataFrame
df = pd.DataFrame(distance_matrix, index=labels, columns=labels)
Show the code
library(reshape2)
library(tibble)
df2 <-  reticulate::py$df
df2 <- cbind(name = rownames(df2), df2)
rownames(df2) <- 1:nrow(df2)

df_long <- melt(df2, id.vars = "name")
df_long$name <- factor(df_long$name, levels = unique(df_long$name))
df_long$variable <- factor(df_long$variable, levels = unique(df_long$variable))


# Plot the heatmap
ggplot(df_long, aes(x = rev(name), y = rev(variable))) +
  geom_tile(aes(fill = value), color=ifelse(df_long$value >= 0.9, "red", NA), linewidth=1) +
  theme_minimal() +
  scale_fill_viridis_b()+
  labs(x = "Name", y = "Variable", fill = "Value") +
  geom_text(aes(y=rev(variable), x=rev(name), label=round(value, 2)), size=1.9)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

4.4 Term Score Decline

We can use the Term Score Matrix to examine the term score decline for each topic and to compare scores for terms of the same rank across topics. The term score decline diagram helps us decide where to cut off the number of terms to consider. For example, we might conclude that only the top 5 terms are important enough to consider when assigning meaningful topic names.

all_topics_data = []

# Loop through each topic id in the 'topics' list
for topic_id in list(set(topics)):
    # Get the topic
    t = topic_model.get_topic(topic_id)
    
    # Select top 9 terms and scores
    top_terms = [x[0] for x in t[:9]]
    top_scores = [x[1] for x in t[:9]]
    
    # Append data for the current topic to the list
    for term, score in zip(top_terms, top_scores):
        all_topics_data.append({'Topic': topic_model.topic_labels_[topic_id], 'Term': term, 'Score': score})

# Create DataFrame from the collected data
all_topics_df = pd.DataFrame(all_topics_data)
Show the code
all_topics_df <-  reticulate::py$all_topics_df
all_topics_dfr3<-all_topics_df%>%
  dplyr::group_by(Topic)%>%
  dplyr::mutate(Rank = rank(-Score))

all_topics_dfr3$topic <- as.numeric(gsub("[^0-9-]", "", all_topics_dfr3$Topic))
all_topics_dfr3$topic_no <- paste0("Topic: ", as.numeric(gsub("[^0-9-]", "", all_topics_dfr3$Topic)), sep="")

topic_minus1<-subset(all_topics_dfr3, topic_no=="Topic: -1")
ggplot(all_topics_dfr3, aes(Rank, Score))+   
  geom_line(aes(group = topic, color = topic_no)) +
  scale_color_viridis_d() +
  geom_line(data= topic_minus1, color="red")+
  labs(x = "Rank", y = "Score")+
    scale_x_continuous(limits = c(min(all_topics_dfr3$Rank)-1, 
                                max(all_topics_dfr3$Rank)+1), breaks = min(all_topics_dfr3$Rank):max(all_topics_dfr3$Rank))+
theme(legend.position="bottom")

Show the code part 1
original_topics <- unique(all_topics_dfr3$Topic)


labels <- c('David\'s Reign',
            'Prophetic Declarations',
            'Devotion and Selah',
            'Teachings of Jesus',
            'Christ\'s Love',
            'Priestly Offerings',
            'Apocalyptic Visions',
            'Spectrum of Wisdom',
            'Generational Legacy',
            'Job\'s Trials',
            'Divine Kingship')


#topic_no_num<- as.numeric(gsub("[^0-9-]", "", original_topics))
topic_no_num <- as.numeric(gsub("\\D", "", original_topics))
library(stringr)

# Regular expression pattern to match numbers
pattern <- "-?\\d+"
# Extracting numbers from each string
numbers <- str_extract_all(original_topics, pattern)
topic_no_num <- as.numeric(unlist(numbers))
Show the code part 1
topic_no_num_reorg <- topic_no_num+2

# Create a list combining original topics and their corresponding labels
topics_with_labels <- list(original_topics = original_topics, 
                           labels = labels,
                           topic_no_num = topic_no_num,
                           topic_no_num_reorg = topic_no_num_reorg)


all_topics_dfr4<-all_topics_dfr3
all_topics_dfr4$topic_label<-NA
for (key in 1:length(topics_with_labels$original_topics)) {
  # Extracting current label pair
  current_label_pair <- topics_with_labels$original_topics[[key]]
  assignment <- topics_with_labels$labels[[key]]
  # Finding rows where Topic matches the first element of the label pair
  rows_to_update <- all_topics_dfr4$Topic == current_label_pair[1]
  # Updating topic_label where the condition is met
  all_topics_dfr4$topic_label[rows_to_update] <- assignment
}

all_topics_dfr4$up_label<-paste(all_topics_dfr4$topic_no, all_topics_dfr4$topic_label, sep="\n ")  

library(ggh4x)
all_topics_dfr5<-subset(all_topics_dfr4, !duplicated(Topic))
all_topics_dfr5 <- all_topics_dfr5 %>%
  arrange(topic)

topics_bert_keys<-ggplot(all_topics_dfr4, aes(Rank, reorder(up_label, -topic))) +   
  geom_tile(aes(fill = Score))+
  scale_fill_viridis_b()+
  geom_label(aes(y=up_label, x=Rank, label=Term), size=1.9)+
  scale_x_continuous(limits = c(min(all_topics_dfr4$Rank)-1, 
                                max(all_topics_dfr4$Rank)+1), 
                     breaks = min(all_topics_dfr4$Rank):max(all_topics_dfr4$Rank))+
  guides(y.sec = guide_axis_manual(labels = all_topics_dfr5$Topic, breaks = rev(all_topics_dfr5$topic+2)))+
  labs(fill = "Keyword\nContribution\nScores")+
  ylab("Topic")+
  theme(legend.position = "bottom")
topics_bert_keys

I can now examine which topics are present in Genesis.

Show the code part 1
# Calculate ranks
merged_topics_df <-  reticulate::py$merged_df_with_topics
merged_topics_df<-subset(merged_topics_df, select = -c(`topic_-1`))

# Reshape the dataframe
reshaped_df <- merged_topics_df %>%
  pivot_longer(cols = starts_with("topic"), 
               names_to = "topic", 
               values_to = "probs")

reshaped_df<-reshaped_df%>%
  mutate(topic_no = as.numeric(gsub("\\D", "", topic)),
         chapter_no = as.numeric(gsub("\\D", "", book_names)))

# Adding the incremental index
reshaped_df$order <- match(reshaped_df$book_names, unique(reshaped_df$book_names))

#Creating the topic label column
reshaped_df2<-reshaped_df%>%
  dplyr::group_by(book_names)%>%
  dplyr::mutate(rank = rank(-probs, ties.method = "min"))

reshaped_df3<-reshaped_df2
reshaped_df3$topic_label<-NA
reshaped_df3$topic_label_simple<-NA
Show the code part 2
#Assigning labels
for (key in topics_with_labels$topic_no_num) {
  current_label_pair <- topics_with_labels$original_topics[topics_with_labels$topic_no_num == key]
  assignment1 <- topics_with_labels$labels[topics_with_labels$topic_no_num == key]
  assignment2 <- topics_with_labels$original_topics[topics_with_labels$topic_no_num == key]
  rows_to_update <- which(reshaped_df3$topic_no == key)
  reshaped_df3$topic_label[rows_to_update] <- assignment1
  reshaped_df3$topic_label_simple[rows_to_update] <- assignment2
}

reshaped_df3$rowno<-as.numeric(rownames(reshaped_df3))
genesis_rows <- reshaped_df3[grepl("^Genesis", reshaped_df3$book_names), ]


genesis_rows$book_names2 <- factor(
  genesis_rows$book_names,
  levels = unique(genesis_rows$book_names)[order(genesis_rows$order)]
)

#Identifying topics that ranked first for the second y axis
library(ggh4x)
genesis_rows$book_names2_num<-as.numeric(genesis_rows$book_names2)
#reshaped_df3_750$topic_label
genesis_rows2<-genesis_rows%>%
  dplyr::group_by(book_names)%>%
  dplyr::mutate(book_names_win = paste(book_names, topic_label_simple[rank==1], sep=": "),
                topic_win = topic_label_simple[rank==1],
                topic_win_label = topic_label[rank==1])
Show the code part 3
unique_labels<-subset(genesis_rows2, rank==1)
unique_labels$topic_win_label2<-paste("Topic", unique_labels$topic_no, unique_labels$topic_win_label)

ggplot(genesis_rows2, aes(topic_no, reorder(book_names2, -order))) +   
   geom_tile(aes(fill = probs), color=ifelse(genesis_rows2$rank == 1, "red", NA), linewidth=0.6) +
   scale_fill_viridis_c()+
  #scale_x_continuous(limits = c(0, 15), breaks = 1:14)
  scale_x_continuous(limits = c(min(genesis_rows2$topic_no)-1, 
                                max(genesis_rows2$topic_no)+1), breaks=min(genesis_rows2$topic_no):max(genesis_rows2$topic_no))+
  guides(y.sec = guide_axis_manual(labels = rev(unique_labels$topic_win_label2)))+
    theme(legend.position = "bottom")

It could also be important to know how many documents in the Bible are associated with every topic.

Show the code part 1
reshaped_df3$rowno<-as.numeric(rownames(reshaped_df3))

reshaped_count<-subset(reshaped_df3, rank==1)
topic_counts <- data.frame(table(reshaped_count$topic))
names(topic_counts)[1]<-"topic"
names(topic_counts)[2]<-"no_documents"
topic_counts$topic<-as.numeric(gsub("topic_", "", topic_counts$topic))

#Assigning topic labels to topics
topic_counts$topic_label<-NA
for (key in 1:length(topics_with_labels$topic_no_num)) {
  # Extracting current label pair
  current_label_pair <- topics_with_labels$topic_no_num[[key]]
  assignment <- topics_with_labels$labels[[key]]
  # Finding rows where Topic matches the first element of the label pair
  rows_to_update <- topic_counts$topic == current_label_pair[1]
  # Updating topic_label where the condition is met
  topic_counts$topic_label[rows_to_update] <- assignment
}
Show the code part 2
topic_counts$topic<-as.factor(topic_counts$topic)
topics_bert<-ggplot(topic_counts, aes(y =topic, x = no_documents)) +
  geom_bar(stat = "identity")+
  guides(y.sec = guide_axis_manual(labels = topic_counts$topic_label))+
  labs(fill = "Keyword\nContribution\nScores")+
  ylab("Topic")+
  theme(legend.position = "bottom")+
  ggtitle("Topics BERTopic")
topics_bert

Show the code part 2
#sum(topic_counts$no_documents)

5 Comparison LDA vs. BERTopic

The following table compares LDA to BERTopic. It summarizes some of the strengths and limitations that were discussed earlier.

Metric LDA BERTopic
Data preprocessing Pre-processing is essential Pre-processing is not needed in most cases
Optimal No. topics No. of topics can be identified based on coherence No. of topics can be identified based on coherence
Topic relationship for each document Documents are a mixture of topics Documents are a mixture of topics
Topic representation Bag-of-words representation, disregards semantics Semantic embeddings lead to more meaningful and coherent topics.
Longer input documents Document length does not matter There is a limit on the number of input tokens
Shorter input documents Document length does not matter Shorter documents perform better
Small datasets (<1000 docs) Can handle fewer than 1000 docs May be less effective with small datasets
Large datasets (>1000 docs) Can handle more than 1000 docs Performs well with larger corpora
Speed & Resources Fast and computationally light More resource-demanding and computationally expensive

Finally, it is useful to compare the number of documents assigned to each topic by the two methods:

Show the code
library(ggpubr)
ggarrange(topics_lda, topics_bert, ncol=2)

Equally, it could be useful to show again the topics identified in both LDA and BERTopic.

Show the code
library(ggpubr)
ggarrange(topics_lda_keys, topics_bert_keys, ncol=1, nrow=2)

It is difficult to say whether there is substantial overlap among the topics identified. Choosing between LDA and BERTopic hinges on the specific requirements and dataset characteristics. LDA is relevant if a well-established, computationally efficient method for uncovering latent themes, without fine-grained semantic analysis, is a priority. However, if a deeper contextual understanding is needed and we are dealing with over 1,000 documents, BERTopic may be the better choice. BERTopic leverages transformer-based models like BERT to capture nuanced semantic relationships between words, allowing for more granular topic distinctions based on word embeddings and contextual information.

Additional Resources

  • https://cees-roele.medium.com/a-term-score-matrix-for-bertopic-821e78e198ee

  • https://github.com/ceesroele/disinformation/blob/main/TermScoreMatrix.ipynb

  • https://www.pinecone.io/learn/bertopic/

  • https://umu.diva-portal.org/smash/get/diva2:1763637/FULLTEXT01.pdf