In this report, I perform topic modeling on the Bible using two methods: Latent Dirichlet Allocation (LDA) and word embeddings with BERTopic. Both are unsupervised machine learning approaches for extracting latent topics from text. Comprising 66 books written by multiple authors over many centuries, the Bible encompasses a diverse array of genres, including historical and religious narratives, poetry, epistles, and prophecies. Applying topic modeling to it may surface fresh insights into this important text.
Thus, the main questions asked in this report are:
What are the general topics in the Bible?
How do the two topic modeling methods, LDA and BERTopic, compare when applied to the Bible?
Before applying LDA and BERTopic, I perform some descriptive analysis.
2 Descriptive Approach
Data cleaning and preprocessing are important steps when analyzing textual data before vectorizing it. The preprocessing and data cleaning techniques include tokenization, lemmatization, stopword removal, lowercasing and non-alphabetical character removal. These techniques aim to prepare the data for NLP computations and further remove noise and meaningless information from the data.
We first download the Bible from the following URL and read the response as text.
import requests

bible_url = 'https://bereanbible.com/bsb.txt'
path = "./data/bereanbible.txt"
# with open(path, 'rb') as f:
#     text = f.read()

# Requesting the URL
response = requests.get(bible_url)
raw_bible = response.text

# Printing the first 400 characters
raw_bible[:400]
"The Holy Bible, Berean Standard Bible, BSB is produced in cooperation with Bible Hub, Discovery Bible, OpenBible.com, and the Berean Bible Translation Committee. \t\nThis text of God's Word has been dedicated to the public domain. Free resources and databases are available at BereanBible.com.\t\nVerse\tBerean Standard Bible\nGenesis 1:1\tIn the beginning God created the heavens and the earth.\nGenesis 1:2"
The following command splits the text into lines.
bible_list = raw_bible.splitlines()
# This prints the first 2 entries.
bible_list[:2]
['The Holy Bible, Berean Standard Bible, BSB is produced in cooperation with Bible Hub, Discovery Bible, OpenBible.com, and the Berean Bible Translation Committee. \t', "This text of God's Word has been dedicated to the public domain. Free resources and databases are available at BereanBible.com.\t"]
The next command takes the list of strings (bible_list), splits each string at the tab character, and stores the result in a NumPy array (bible_array).
import numpy as np

bible_array = np.array([item.split('\t') for item in bible_list])
bible_array[:2]
array([['The Holy Bible, Berean Standard Bible, BSB is produced in cooperation with Bible Hub, Discovery Bible, OpenBible.com, and the Berean Bible Translation Committee. ',
''],
["This text of God's Word has been dedicated to the public domain. Free resources and databases are available at BereanBible.com.",
'']], dtype='<U400')
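The splitting step can also be sketched in plain Python, which makes the row structure explicit. The snippet below is a minimal illustration on a few hard-coded lines rather than the full file. The helper name `parse_verses` and the rule of skipping rows without a chapter:verse reference are assumptions made for this example, not part of the original pipeline.

```python
def parse_verses(lines):
    """Split tab-separated lines into (reference, text) pairs.

    Rows whose first field does not look like "Book chapter:verse"
    (for example the licence header) are skipped.
    """
    verses = []
    for line in lines:
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        reference, text = fields[0].strip(), fields[1].strip()
        # Keep only rows with a chapter:verse reference and non-empty text
        if ":" in reference and text:
            verses.append((reference, text))
    return verses


# Hypothetical sample mimicking the first lines of the downloaded file
sample = [
    "This text of God's Word has been dedicated to the public domain.\t",
    "Verse\tBerean Standard Bible",
    "Genesis 1:1\tIn the beginning God created the heavens and the earth.",
]
verses = parse_verses(sample)
print(verses)
```

Filtering on the reference pattern is one alternative to slicing off a fixed number of header rows. It may be more robust if the header length changes.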
In the next step, we want to examine more closely what our data looks like:
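import pandas as pd
from IPython.display import HTML

# Skip the three header rows and keep the verse reference and the verse text
df = pd.DataFrame(bible_array[3:, :], columns=['reference', 'text'])
# Display the first three rows as an HTML table
df_3 = df.head(3)
HTML(df_3.to_html(index=False))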
reference | text
--- | ---
Genesis 1:1 | In the beginning God created the heavens and the earth.
Genesis 1:2 | Now the earth was formless and void, and darkness was over the surface of the deep. And the Spirit of God was hovering over the surface of the waters.
Genesis 1:3 | And God said, “Let there be light,” and there was light.
2.1 Extracting Chapter Names
In the next lines, I create a function that extracts chapter names from verse references. For example, any verse reference such as “Genesis 1:1” or “Genesis 1:2” turns into “Genesis 1”. The logic is that chapters like “Genesis 1” will become the documents in our topic modeling analysis. In NLP, a document can be any unit of text: a journal paper, a book, an article, a chapter, or even a sentence. In our case, a document is a chapter, so “Genesis 1” and “Genesis 2” are two separate documents. Note that the code stores these chapter names in a column called book_names.
import re

# Extracts the chapter name: e.g. for "Genesis 1:1", it returns "Genesis 1"
def extract_book_name(verse_reference):
    # Use a regular expression to extract the book name and chapter number
    match = re.match(r'(.+?)\s+(\d+):\d+', verse_reference)
    if match:
        book_name = match.group(1)
        chapter_number = match.group(2)
        return f"{book_name} {chapter_number}"
    else:
        return None

# Extracts only the book name: e.g. for "Genesis 1:1", it returns "Genesis"
def extract_book_name2(verse_reference):
    # Use a regular expression to extract the book name
    match = re.match(r'(.+?)\s+\d+:\d+', verse_reference)
    if match:
        book_name = match.group(1)
        return book_name
    else:
        return None

# Assumed step, not shown in the original: store the chapter name in the
# 'book_names' column that the grouping code relies on
df['book_names'] = df['reference'].apply(extract_book_name)
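To see how the regular expression behaves on book names that contain digits or spaces, here is a small standalone check. It reuses the same pattern as above. The function name `chapter_of` and the sample references are hypothetical and chosen only for illustration.

```python
import re

# Same pattern as in the function above: lazy book name, chapter, verse
VERSE_PATTERN = re.compile(r"(.+?)\s+(\d+):\d+")


def chapter_of(reference):
    """Return "Book chapter" for a "Book chapter:verse" reference, else None."""
    match = VERSE_PATTERN.match(reference)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"


examples = ["Genesis 1:1", "1 John 3:16", "Song of Solomon 2:1", "Verse"]
chapters = [chapter_of(ref) for ref in examples]
print(chapters)
```

Because the first group is lazy, it keeps growing until the remainder of the string looks like a chapter number, a colon, and a verse number. That is why a leading digit in a book name such as “1 John” is not mistaken for a chapter number.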
Next, I group text by chapter entry, while keeping their original order.
# Add a column to store the original order
df['original_order'] = range(len(df))
merged_df = df.groupby('book_names')['text'].apply(' '.join).reset_index()
# Merge back the original order
merged_df = pd.merge(merged_df, df[['book_names', 'original_order']], on='book_names', how='left')
# Sort the DataFrame based on the original order
merged_df.sort_values('original_order', inplace=True)
# Drop the original_order column if it's no longer needed
merged_df.drop(columns='original_order', inplace=True)
merged_df = merged_df[['book_names', 'text']].drop_duplicates()
# Examining the first two entries
df_2 = merged_df.head(2)
HTML(df_2.to_html(index=False))
book_names
text
Genesis 1
In the beginning God created the heavens and the earth. Now the earth was formless and void, and darkness was over the surface of the deep. And the Spirit of God was hovering over the surface of the waters. And God said, “Let there be light,” and there was light. And God saw that the light was good, and He separated the light from the darkness. God called the light “day,” and the darkness He called “night.” And there was evening, and there was morning—the first day. And God said, “Let there be an expanse between the waters, to separate the waters from the waters.” So God made the expanse and separated the waters beneath it from the waters above. And it was so. God called the expanse “sky.” And there was evening, and there was morning—the second day. And God said, “Let the waters under the sky be gathered into one place, so that the dry land may appear.” And it was so. God called the dry land “earth,” and the gathering of waters He called “seas.” And God saw that it was good. Then God said, “Let the earth bring forth vegetation: seed-bearing plants and fruit trees, each bearing fruit with seed according to its kind.” And it was so. The earth produced vegetation: seed-bearing plants according to their kinds and trees bearing fruit with seed according to their kinds. And God saw that it was good. And there was evening, and there was morning—the third day. And God said, “Let there be lights in the expanse of the sky to distinguish between the day and the night, and let them be signs to mark the seasons and days and years. And let them serve as lights in the expanse of the sky to shine upon the earth.” And it was so. God made two great lights: the greater light to rule the day and the lesser light to rule the night. And He made the stars as well. God set these lights in the expanse of the sky to shine upon the earth, to preside over the day and the night, and to separate the light from the darkness. And God saw that it was good. 
And there was evening, and there was morning—the fourth day. And God said, “Let the waters teem with living creatures, and let birds fly above the earth in the open expanse of the sky.” So God created the great sea creatures and every living thing that moves, with which the waters teemed according to their kinds, and every bird of flight after its kind. And God saw that it was good. Then God blessed them and said, “Be fruitful and multiply and fill the waters of the seas, and let birds multiply on the earth.” And there was evening, and there was morning—the fifth day. And God said, “Let the earth bring forth living creatures according to their kinds: livestock, land crawlers, and beasts of the earth according to their kinds.” And it was so. God made the beasts of the earth according to their kinds, the livestock according to their kinds, and everything that crawls upon the earth according to its kind. And God saw that it was good. Then God said, “Let Us make man in Our image, after Our likeness, to rule over the fish of the sea and the birds of the air, over the livestock, and over all the earth itself and every creature that crawls upon it.” So God created man in His own image; in the image of God He created him; male and female He created them. God blessed them and said to them, “Be fruitful and multiply, and fill the earth and subdue it; rule over the fish of the sea and the birds of the air and every creature that crawls upon the earth.” Then God said, “Behold, I have given you every seed-bearing plant on the face of all the earth, and every tree whose fruit contains seed. They will be yours for food. And to every beast of the earth and every bird of the air and every creature that crawls upon the earth—everything that has the breath of life in it—I have given every green plant for food.” And it was so. And God looked upon all that He had made, and indeed, it was very good. And there was evening, and there was morning—the sixth day.
Genesis 2
Thus the heavens and the earth were completed in all their vast array. And by the seventh day God had finished the work He had been doing; so on that day He rested from all His work. Then God blessed the seventh day and sanctified it, because on that day He rested from all the work of creation that He had accomplished. This is the account of the heavens and the earth when they were created, in the day that the LORD God made them. Now no shrub of the field had yet appeared on the earth, nor had any plant of the field sprouted; for the LORD God had not yet sent rain upon the earth, and there was no man to cultivate the ground. But springs welled up from the earth and watered the whole surface of the ground. Then the LORD God formed man from the dust of the ground and breathed the breath of life into his nostrils, and the man became a living being. And the LORD God planted a garden in Eden, in the east, where He placed the man He had formed. Out of the ground the LORD God gave growth to every tree that is pleasing to the eye and good for food. And in the middle of the garden were the tree of life and the tree of the knowledge of good and evil. Now a river flowed out of Eden to water the garden, and from there it branched into four headwaters: The name of the first river is Pishon; it winds through the whole land of Havilah, where there is gold. And the gold of that land is pure, and bdellium and onyx are found there. The name of the second river is Gihon; it winds through the whole land of Cush. The name of the third river is Hiddekel; it runs along the east side of Assyria. And the fourth river is the Euphrates. Then the LORD God took the man and placed him in the Garden of Eden to cultivate and keep it. 
And the LORD God commanded him, “You may eat freely from every tree of the garden, but you must not eat from the tree of the knowledge of good and evil; for in the day that you eat of it, you will surely die.” The LORD God also said, “It is not good for the man to be alone. I will make for him a suitable helper.” And out of the ground the LORD God formed every beast of the field and every bird of the air, and He brought them to the man to see what he would name each one. And whatever the man called each living creature, that was its name. The man gave names to all the livestock, to the birds of the air, and to every beast of the field. But for Adam no suitable helper was found. So the LORD God caused the man to fall into a deep sleep, and while he slept, He took one of the man’s ribs and closed up the area with flesh. And from the rib that the LORD God had taken from the man, He made a woman and brought her to him. And the man said: “This is now bone of my bones and flesh of my flesh; she shall be called ‘woman,’ for out of man she was taken.” For this reason a man will leave his father and mother and be united to his wife, and they will become one flesh. And the man and his wife were both naked, and they were not ashamed.
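The reason for tracking the original order is that a plain alphabetical sort of chapter names would place “Genesis 10” before “Genesis 2”. The idea of grouping while keeping first-seen order can be sketched without pandas. The function `group_in_order` and the placeholder texts below are assumptions for illustration only.

```python
def group_in_order(pairs):
    """Join texts that share a key, keeping keys in first-seen order."""
    grouped = {}
    for key, text in pairs:
        # Regular dicts preserve insertion order (Python 3.7+)
        grouped.setdefault(key, []).append(text)
    return [(key, " ".join(texts)) for key, texts in grouped.items()]


# Placeholder verse texts; only the ordering of the keys matters here
pairs = [
    ("Genesis 2", "verse a"),
    ("Genesis 2", "verse b"),
    ("Genesis 10", "verse c"),
]
result = group_in_order(pairs)
print(result)

# An alphabetical sort would reorder the chapters
print(sorted(key for key, _ in result))
```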
2.2 Pre-processing the Text
We will first pre-process the text. This involves a variety of steps: eliminating double spaces, lowercasing, removing punctuation, removing stop words, removing the most frequent words, removing special characters, stemming, lemmatizing, and POS tagging. We will then use this processed text to create a word cloud. We use the NLTK package to preprocess the text for the LDA analysis.
Step 1: Eliminating double spaces
We now eliminate stray characters such as mis-encoded quotation marks, apostrophes, and dashes, and then collapse repeated whitespace into single spaces.
# Step 1: Eliminating mis-encoded quotation marks, apostrophes, dashes, etc.
merged_df['text'] = merged_df['text'].str.replace(r'[\x93\x90\x92\x94\x97]+', ' ', regex=True)
# Meaning of these code points when read as Windows-1252 bytes:
# \x93 - the left double quotation mark
# \x90 - an invisible control character with no standard visual representation
# \x92 - an apostrophe (right single quotation mark)
# \x94 - the right double quotation mark
# \x97 - a dash (em dash)

# Step 2: Eliminating double spaces
merged_df['text'] = merged_df['text'].str.replace(r'\s+', ' ', regex=True)
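One plausible explanation for these code points, which the report itself does not confirm, is that the file uses Windows-1252 punctuation but was decoded as Latin-1, which turns curly quotes into invisible control characters. The sketch below illustrates that assumption on a hard-coded byte string and then applies the same two substitutions as above.

```python
import re

# Hypothetical raw bytes: a verse with Windows-1252 curly quotes (0x93, 0x94)
raw = b"\x93Let there be light,\x94 and there was light."

as_cp1252 = raw.decode("cp1252")    # curly quotes are recovered
as_latin1 = raw.decode("latin-1")   # bytes become control characters U+0093/U+0094

# Same two substitutions as in the cleaning step above
no_debris = re.sub(r"[\x93\x90\x92\x94\x97]+", " ", as_latin1)
cleaned = re.sub(r"\s+", " ", no_debris).strip()

print(repr(as_cp1252))
print(repr(as_latin1))
print(repr(cleaned))
```

If this assumption holds for the real file, decoding the response with the matching encoding would be an alternative to stripping the characters afterwards.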
import string

# Assumed step, not shown in the original: lowercase the text to create
# the 'text_lc' column used below
merged_df['text_lc'] = merged_df['text'].str.lower()

all_punctuation = string.punctuation

# Creating a function that removes punctuation
def remove_punct(text):
    return text.translate(str.maketrans('', '', all_punctuation))

# Applying the function
merged_df = merged_df.copy()
merged_df['text_np'] = merged_df['text_lc'].apply(lambda x: remove_punct(x))
merged_df = merged_df[["book_names", "text", "text_np"]]
# df_2 = merged_df.head(2)
# HTML(df_2.to_html(index=False))
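Lowercasing and punctuation removal can be illustrated on a single sentence. This is a minimal sketch using only the standard library. The helper name `normalize` is an assumption for this example. Note that `string.punctuation` covers ASCII punctuation only, so curly quotes would survive this step.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(PUNCT_TABLE)


sentence = normalize("And God said, Let there be light, and there was light.")
hyphenated = normalize("seed-bearing plants")
curly = normalize("\u201cLet Us\u201d")

print(sentence)
print(hyphenated)
print(curly)
```

Deleting the hyphen rather than replacing it with a space is why compounds such as “seed-bearing” end up as the single token “seedbearing” in the cleaned text.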
Step 4: Removing Stop words
In the next step, I remove all stop words, such as “the”, “and”, and “so”. These are words that are used frequently but add little meaning.
nltk.download('stopwords')
True
[nltk_data] Downloading package stopwords to
[nltk_data] /Users/bgpopescu/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
from nltk.corpus import stopwords

# Build the stop word set once so it is not recomputed for every word
stops = set(stopwords.words('english'))

# Define the function to remove stopwords
def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in stops])

# Using the function for the stop words
merged_df = merged_df.copy()
merged_df['text_nostop'] = merged_df['text_np'].apply(remove_stopwords)
# df_2 = merged_df.head(2)
# HTML(df_2.to_html(index=False))
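The filtering logic can be tried without downloading any NLTK data. The sketch below uses a tiny hand-written stop word set, which is an assumption for illustration and much smaller than NLTK's English list.

```python
# Tiny illustrative stop word set (NOT the NLTK list)
SMALL_STOPWORDS = frozenset({"the", "and", "so", "in", "of", "was", "it"})


def remove_stopwords(text, stopwords=SMALL_STOPWORDS):
    """Drop every whitespace-separated token that is in the stop word set."""
    return " ".join(word for word in text.split() if word not in stopwords)


sample = "in the beginning god created the heavens and the earth"
filtered = remove_stopwords(sample)
print(filtered)
```

Using a set (or frozenset) keeps each membership check at roughly constant time, which matters when the function is applied to every token of every chapter.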
Step 6: Removing Special Characters
# Function to remove special characters, i.e. characters that are neither alphanumeric nor whitespace
def remove_special_ch(text):
    # Delete every character that is not a letter, a digit, or whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

# Applying it to the text
merged_df = merged_df.copy()
merged_df['clean_text'] = merged_df['text_nostop'].apply(remove_special_ch)
merged_df = merged_df[["book_names", "text", "clean_text"]]
Let us now examine the clean text, which has no special characters, no stop words, no punctuation, is in lower case, and has no double spaces.
In the beginning God created the heavens and the earth. Now the earth was formless and void, and darkness was over the surface of the deep. And the Spirit of God was hovering over the surface of the waters. And God said, Let there be light, and there was light. And God saw that the light was good, and He separated the light from the darkness. God called the light day, and the darkness He called night. And there was evening, and there was morning the first day. And God said, Let there be an expanse between the waters, to separate the waters from the waters. So God made the expanse and separated the waters beneath it from the waters above. And it was so. God called the expanse sky. And there was evening, and there was morning the second day. And God said, Let the waters under the sky be gathered into one place, so that the dry land may appear. And it was so. God called the dry land earth, and the gathering of waters He called seas. And God saw that it was good. Then God said, Let the earth bring forth vegetation: seed-bearing plants and fruit trees, each bearing fruit with seed according to its kind. And it was so. The earth produced vegetation: seed-bearing plants according to their kinds and trees bearing fruit with seed according to their kinds. And God saw that it was good. And there was evening, and there was morning the third day. And God said, Let there be lights in the expanse of the sky to distinguish between the day and the night, and let them be signs to mark the seasons and days and years. And let them serve as lights in the expanse of the sky to shine upon the earth. And it was so. God made two great lights: the greater light to rule the day and the lesser light to rule the night. And He made the stars as well. God set these lights in the expanse of the sky to shine upon the earth, to preside over the day and the night, and to separate the light from the darkness. And God saw that it was good. And there was evening, and there was morning the fourth day. 
And God said, Let the waters teem with living creatures, and let birds fly above the earth in the open expanse of the sky. So God created the great sea creatures and every living thing that moves, with which the waters teemed according to their kinds, and every bird of flight after its kind. And God saw that it was good. Then God blessed them and said, Be fruitful and multiply and fill the waters of the seas, and let birds multiply on the earth. And there was evening, and there was morning the fifth day. And God said, Let the earth bring forth living creatures according to their kinds: livestock, land crawlers, and beasts of the earth according to their kinds. And it was so. God made the beasts of the earth according to their kinds, the livestock according to their kinds, and everything that crawls upon the earth according to its kind. And God saw that it was good. Then God said, Let Us make man in Our image, after Our likeness, to rule over the fish of the sea and the birds of the air, over the livestock, and over all the earth itself and every creature that crawls upon it. So God created man in His own image; in the image of God He created him; male and female He created them. God blessed them and said to them, Be fruitful and multiply, and fill the earth and subdue it; rule over the fish of the sea and the birds of the air and every creature that crawls upon the earth. Then God said, Behold, I have given you every seed-bearing plant on the face of all the earth, and every tree whose fruit contains seed. They will be yours for food. And to every beast of the earth and every bird of the air and every creature that crawls upon the earth everything that has the breath of life in it I have given every green plant for food. And it was so. And God looked upon all that He had made, and indeed, it was very good. And there was evening, and there was morning the sixth day.
beginning god created heavens earth earth formless void darkness surface deep spirit god hovering surface waters god said let light light god saw light good separated light darkness god called light day darkness called night evening morning first day god said let expanse waters separate waters waters god made expanse separated waters beneath waters god called expanse sky evening morning second day god said let waters sky gathered one place dry land may appear god called dry land earth gathering waters called seas god saw good god said let earth bring forth vegetation seedbearing plants fruit trees bearing fruit seed according kind earth produced vegetation seedbearing plants according kinds trees bearing fruit seed according kinds god saw good evening morning third day god said let lights expanse sky distinguish day night let signs mark seasons days years let serve lights expanse sky shine upon earth god made two great lights greater light rule day lesser light rule night made stars well god set lights expanse sky shine upon earth preside day night separate light darkness god saw good evening morning fourth day god said let waters teem living creatures let birds fly earth open expanse sky god created great sea creatures every living thing moves waters teemed according kinds every bird flight kind god saw good god blessed said fruitful multiply fill waters seas let birds multiply earth evening morning fifth day god said let earth bring forth living creatures according kinds livestock land crawlers beasts earth according kinds god made beasts earth according kinds livestock according kinds everything crawls upon earth according kind god saw good god said let us make man image likeness rule fish sea birds air livestock earth every creature crawls upon god created man image image god created male female created god blessed said fruitful multiply fill earth subdue rule fish sea birds air every creature crawls upon earth god said behold given every seedbearing plant face 
earth every tree whose fruit contains seed food every beast earth every bird air every creature crawls upon earth everything breath life given every green plant food god looked upon made indeed good evening morning sixth day
Genesis 2
Thus the heavens and the earth were completed in all their vast array. And by the seventh day God had finished the work He had been doing; so on that day He rested from all His work. Then God blessed the seventh day and sanctified it, because on that day He rested from all the work of creation that He had accomplished. This is the account of the heavens and the earth when they were created, in the day that the LORD God made them. Now no shrub of the field had yet appeared on the earth, nor had any plant of the field sprouted; for the LORD God had not yet sent rain upon the earth, and there was no man to cultivate the ground. But springs welled up from the earth and watered the whole surface of the ground. Then the LORD God formed man from the dust of the ground and breathed the breath of life into his nostrils, and the man became a living being. And the LORD God planted a garden in Eden, in the east, where He placed the man He had formed. Out of the ground the LORD God gave growth to every tree that is pleasing to the eye and good for food. And in the middle of the garden were the tree of life and the tree of the knowledge of good and evil. Now a river flowed out of Eden to water the garden, and from there it branched into four headwaters: The name of the first river is Pishon; it winds through the whole land of Havilah, where there is gold. And the gold of that land is pure, and bdellium and onyx are found there. The name of the second river is Gihon; it winds through the whole land of Cush. The name of the third river is Hiddekel; it runs along the east side of Assyria. And the fourth river is the Euphrates. Then the LORD God took the man and placed him in the Garden of Eden to cultivate and keep it. And the LORD God commanded him, You may eat freely from every tree of the garden, but you must not eat from the tree of the knowledge of good and evil; for in the day that you eat of it, you will surely die. 
The LORD God also said, It is not good for the man to be alone. I will make for him a suitable helper. And out of the ground the LORD God formed every beast of the field and every bird of the air, and He brought them to the man to see what he would name each one. And whatever the man called each living creature, that was its name. The man gave names to all the livestock, to the birds of the air, and to every beast of the field. But for Adam no suitable helper was found. So the LORD God caused the man to fall into a deep sleep, and while he slept, He took one of the man s ribs and closed up the area with flesh. And from the rib that the LORD God had taken from the man, He made a woman and brought her to him. And the man said: This is now bone of my bones and flesh of my flesh; she shall be called ‘woman, for out of man she was taken. For this reason a man will leave his father and mother and be united to his wife, and they will become one flesh. And the man and his wife were both naked, and they were not ashamed.
thus heavens earth completed vast array seventh day god finished work day rested work god blessed seventh day sanctified day rested work creation accomplished account heavens earth created day lord god made shrub field yet appeared earth plant field sprouted lord god yet sent rain upon earth man cultivate ground springs welled earth watered whole surface ground lord god formed man dust ground breathed breath life nostrils man became living lord god planted garden eden east placed man formed ground lord god gave growth every tree pleasing eye good food middle garden tree life tree knowledge good evil river flowed eden water garden branched four headwaters name first river pishon winds whole land havilah gold gold land pure bdellium onyx found name second river gihon winds whole land cush name third river hiddekel runs along east side assyria fourth river euphrates lord god took man placed garden eden cultivate keep lord god commanded may eat freely every tree garden must eat tree knowledge good evil day eat surely die lord god also said good man alone make suitable helper ground lord god formed every beast field every bird air brought man see would name one whatever man called living creature name man gave names livestock birds air every beast field adam suitable helper found lord god caused man fall deep sleep slept took one man ribs closed area flesh rib lord god taken man made woman brought man said bone bones flesh flesh shall called woman man taken reason man leave father mother united wife become one flesh man wife naked ashamed
Step 7: Stemming
Stemming is a crude method: it chops off the end of a word, so the result is not necessarily a real word. For example, “bosses” becomes “boss”, while “replacement” becomes “replac” (not a real word). Both stemming and lemmatization reduce vocabulary size and vector dimensionality, which speeds up processing.
There are multiple stemming algorithms. A popular one is the Porter stemmer, available in NLTK. For example:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Stemming the word "walking"
ps.stem("walking")
'walk'
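To make the “chopping” intuition concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases. The suffix list, the minimum stem length, and the name `crude_stem` are all assumptions chosen for this toy example.

```python
# Suffixes are tried in order; the first match is removed
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["walking", "created", "waters", "bosses", "god"]
stems = [crude_stem(w) for w in words]
print(stems)
```

The output for “created” shows the typical side effect of stemming: the stem is a truncated string rather than a dictionary word.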
I first define a function that stems the words.
# Function to stem the words
def stem_words(text):
    ps = PorterStemmer()
    return " ".join([ps.stem(word) for word in text.split()])

# We now apply this to our text
merged_df = merged_df.copy()
merged_df['stemmed_text'] = merged_df["clean_text"].apply(lambda x: stem_words(x))
merged_df = merged_df[["book_names", "text", "clean_text", "stemmed_text"]]
Step 8: Lemmatization & POS Tagging
Unlike stemming, lemmatization is more sophisticated: it uses the vocabulary and morphological rules of the language to return the dictionary form (lemma) of a word. Like stemming, it reduces vocabulary size and vector dimensionality, which speeds up processing.
For example, lemmatizing “better” as an adjective yields “good”, and lemmatizing “was” as a verb yields “be”.
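The role of the part-of-speech tag can be illustrated with a toy lookup table. This is not how WordNet works internally. The table entries, the single-letter tags, and the name `lookup_lemma` are assumptions made purely to show why the same surface form can map to different lemmas depending on its tag.

```python
# Toy lemma table keyed by (word, part of speech); entries are illustrative
LEMMA_TABLE = {
    ("better", "a"): "good",
    ("was", "v"): "be",
    ("waters", "n"): "water",
    ("created", "v"): "create",
}


def lookup_lemma(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return LEMMA_TABLE.get((word, pos), word)


as_adjective = lookup_lemma("better", "a")
as_default_noun = lookup_lemma("better")
as_verb = lookup_lemma("was", "v")
print(as_adjective, as_default_noun, as_verb)
```

When the tag is missing or wrong, the word falls through unchanged. This is similar in spirit to what happens when a lemmatizer falls back to treating every token as a noun.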
True
[nltk_data] Downloading package wordnet to
[nltk_data] /Users/bgpopescu/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
True
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /Users/bgpopescu/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
I then define a function that lemmatizes the text.
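from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Assumed setup, not shown in the original: the lemmatizer and a map from the
# first letter of a Penn Treebank tag to a WordNet part of speech
lemmatizer = WordNetLemmatizer()
wordnet_map = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}

# Defining the function
def lemmatize_words(text):
    # Find POS tags
    pos_text = pos_tag(text.split())
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_text])

# Applying that function to the text.
merged_df = merged_df.copy()
merged_df['lemmatized'] = merged_df["clean_text"].apply(lambda x: lemmatize_words(x))
merged_df = merged_df[["book_names", "text", "clean_text", "stemmed_text", "lemmatized"]]
merged_df2 = merged_df[["book_names", "lemmatized"]]
# Investigating the result
df_2 = merged_df2.head(2)
HTML(df_2.to_html(index=False))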
book_names
lemmatized
Genesis 1
begin god create heaven earth earth formless void darkness surface deep spirit god hover surface water god say let light light god saw light good separate light darkness god call light day darkness call night even morning first day god say let expanse water separate water water god make expanse separated water beneath water god call expanse sky even morning second day god say let water sky gather one place dry land may appear god call dry land earth gather water call sea god saw good god say let earth bring forth vegetation seedbearing plant fruit tree bear fruit seed accord kind earth produce vegetation seedbearing plant accord kind tree bear fruit seed accord kind god saw good evening morning third day god say let light expanse sky distinguish day night let sign mark season day year let serve light expanse sky shine upon earth god make two great light great light rule day lesser light rule night make star well god set light expanse sky shine upon earth preside day night separate light darkness god saw good evening morning fourth day god say let water teem live creature let bird fly earth open expanse sky god create great sea creature every live thing move water teem accord kind every bird flight kind god saw good god bless say fruitful multiply fill water seas let bird multiply earth even morning fifth day god say let earth bring forth living creature accord kind livestock land crawler beasts earth accord kind god make beast earth accord kind livestock accord kind everything crawl upon earth accord kind god saw good god say let u make man image likeness rule fish sea bird air livestock earth every creature crawl upon god create man image image god create male female create god bless say fruitful multiply fill earth subdue rule fish sea bird air every creature crawl upon earth god say behold give every seedbearing plant face earth every tree whose fruit contain seed food every beast earth every bird air every creature crawl upon earth everything breath life give 
every green plant food god look upon make indeed good even morning sixth day
Genesis 2
thus heaven earth complete vast array seventh day god finished work day rest work god bless seventh day sanctify day rest work creation accomplish account heaven earth create day lord god make shrub field yet appear earth plant field sprout lord god yet send rain upon earth man cultivate ground spring well earth watered whole surface ground lord god form man dust ground breathe breath life nostril man become living lord god plant garden eden east place man form ground lord god give growth every tree please eye good food middle garden tree life tree knowledge good evil river flow eden water garden branch four headwater name first river pishon wind whole land havilah gold gold land pure bdellium onyx find name second river gihon wind whole land cush name third river hiddekel run along east side assyria fourth river euphrates lord god take man place garden eden cultivate keep lord god command may eat freely every tree garden must eat tree knowledge good evil day eat surely die lord god also say good man alone make suitable helper ground lord god form every beast field every bird air bring man see would name one whatever man call living creature name man give name livestock bird air every beast field adam suitable helper find lord god cause man fall deep sleep sleep take one man rib close area flesh rib lord god take man make woman bring man say bone bone flesh flesh shall call woman man take reason man leave father mother united wife become one flesh man wife naked ashamed
As explained, the stemmed text differs from the lemmatized text.
2.3 Original Text vs. Clean Text
Text cleaning may have a significant effect on the length of the text, particularly if the cleaning process involves the removal of a large number of stop words or the stemming of words to their root form. This could result in a shorter text that is more focused on the core content of the text and less cluttered with common words or repetitive phrases.
We can now count the number of tokens within each type of text:
# Tokenize the text into words
merged_df['word_count_full'] = merged_df['text'].apply(lambda x: len(x.split()))
merged_df['word_count_lem'] = merged_df['lemmatized'].apply(lambda x: len(x.split()))
merged_df2 = merged_df[["book_names", "word_count_full", "word_count_lem"]]
# Investigating the result
df_10 = merged_df2.head(10)
HTML(df_10.to_html(index=False))
book_names    word_count_full    word_count_lem
Genesis 1     747                359
Genesis 2     608                262
Genesis 3     664                264
Genesis 4     622                275
Genesis 5     497                244
Genesis 6     507                228
Genesis 7     496                251
Genesis 8     515                246
Genesis 9     609                272
Genesis 10    432                227
I now switch to R to make use of ggplot’s good graphic capabilities to analyze the difference in the number of words in the original text vs. the lemmatized text.
library(reticulate)
library(ggplot2)
library(ggpubr)

df <- reticulate::py$merged_df

unclean <- ggplot(df, aes(x = word_count_full)) +
  geom_histogram(color = "white", alpha = 0.5) +
  geom_density(aes(y = after_stat(density) * 80000),
               # alpha = 0.5,
               fill = NA) +
  # Adding a secondary axis
  scale_y_continuous(name = "No. Book Chapters (e.g. Genesis 1, Genesis 2, etc)",
                     sec.axis = sec_axis(~ .x / 80000, name = "density")) +
  labs(fill = NULL) +
  xlab("No. Words") +
  theme_bw() +
  ggtitle("Original")

clean <- ggplot(df, aes(x = word_count_lem)) +
  geom_histogram(color = "white", alpha = 0.5) +
  # Adding density and multiplying by a manually defined value
  # for visibility
  geom_density(aes(y = after_stat(density) * 40000),
               # alpha = 0.5,
               fill = NA) +
  # Adding a secondary axis
  scale_y_continuous(name = "No. Book Chapters (e.g. Genesis 1, Genesis 2, etc)",
                     sec.axis = sec_axis(~ .x / 40000, name = "density")) +
  labs(fill = NULL) +
  xlab("No. Words") +
  theme_bw() +
  ggtitle("Lemmatized")

ggarrange(unclean, clean, ncol = 2)
Thus, by performing basic text processing methods, we reduced the size of the text substantially.
2.4 Most Frequent Words
It could also be important to compare the most frequent words before and after cleaning. The most frequent words after preprocessing the text may reveal important themes and trends present in the text. These words can provide valuable insights into its content and structure, and can inform the topic modeling process.
To do that, I need to run:
# Function to tokenize text and count words
def count_words(text):
    words = re.findall(r'\w+', text.lower())
    return Counter(words)

# Combine all texts into a single string
all_text = ' '.join(merged_df['text'])
# Count the words
word_counts_unclean = count_words(all_text)
# Create a new DataFrame from word_counts
word_counts_unclean_df = pd.DataFrame(word_counts_unclean.items(), columns=['word', 'count'])

# Combine all texts into a single string
all_text = ' '.join(merged_df['lemmatized'])
# Count the words
word_counts_clean = count_words(all_text)
# Create a new DataFrame from word_counts
word_counts_clean_df = pd.DataFrame(word_counts_clean.items(), columns=['word', 'count'])
We can now plot the most frequent words in the original, unaltered text, versus the one that was cleaned. We notice some substantial differences.
df_unclean <- reticulate::py$word_counts_unclean_df
df_clean <- reticulate::py$word_counts_clean_df

# Order the dataframe by count
df_unclean <- df_unclean[order(-df_unclean$count), ][1:10, ]
df_clean <- df_clean[order(-df_clean$count), ][1:10, ]

# Plot the dataframe using ggplot2
unclean <- ggplot(df_unclean, aes(x = reorder(word, -count), y = count)) +
  geom_bar(stat = "identity") +
  labs(title = "Word Count", x = "Word", y = "Count") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

clean <- ggplot(df_clean, aes(x = reorder(word, -count), y = count)) +
  geom_bar(stat = "identity") +
  labs(title = "Word Count", x = "Word", y = "Count") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggarrange(unclean, clean, ncol = 2)
2.5 Bigrams
Bigrams are pairs of words that appear together in a text. They can provide valuable insights into the content and structure of a text.
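As a minimal illustration of the idea, the sketch below builds bigrams from a short token list using only the standard library. The helper name make_bigrams and the toy tokens are assumptions made for this example; the analysis itself relies on NLTK.

```python
from collections import Counter

def make_bigrams(tokens):
    """Return the list of adjacent token pairs (hypothetical helper)."""
    return list(zip(tokens, tokens[1:]))

# Toy token list, illustrative only (not taken from the processed corpus)
tokens = "lord god form man lord god plant garden".split()

# Count how often each adjacent pair occurs
bigram_counts = Counter(make_bigrams(tokens))
most_common_pair, count = bigram_counts.most_common(1)[0]
print(most_common_pair, count)
```

A token list of length n yields n - 1 bigrams, so the counts are dominated by pairs that recur across many verses.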
# Joining together the entire text
text = ' '.join(merged_df['lemmatized'])
# Tokenizing the text
words = word_tokenize(text)
# Creating a list of bigrams
text_bigrams = list(bigrams(words))
# Count the occurrences of each bigram
bigram_counts = Counter(text_bigrams)
# Convert the bigram counts to a DataFrame
df_bigram_counts = pd.DataFrame(bigram_counts.items(), columns=['bigram', 'count'])
# Concatenate the individual words in the Bigram column
df_bigram_counts['bigram'] = df_bigram_counts['bigram'].apply(lambda x: ', '.join(x))
# Sort the DataFrame by count in descending order
df_bigram_counts = df_bigram_counts.sort_values(by='count', ascending=False)
We now visualize the most important bigrams.
df_bigram_counts <- reticulate::py$df_bigram_counts

# Order the dataframe by count
df_bigram_counts <- df_bigram_counts[order(-df_bigram_counts$count), ][1:10, ]

# The second most common bigram is "let, u", which is in fact "let, us".
# This is because wordnet._morphy uses a rule for nouns which replaces ending "s" with "".
# See also https://stackoverflow.com/questions/54784287/nltk-wordnetlemmatizer-processes-us-as-u
# So all the "us" became "u". I am thus transforming it back
df_bigram_counts$bigram[df_bigram_counts$bigram == "let, u"] <- "let, us"

# Plot the dataframe using ggplot2
bigram <- ggplot(df_bigram_counts, aes(x = reorder(bigram, -count), y = count)) +
  theme_bw() +
  geom_bar(stat = "identity") +
  labs(title = "Bigrams", x = "Pair", y = "Count") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
bigram
2.6 Wordcloud
We can now create a wordcloud to examine the most commonly used words.
dict_from_df = merged_df.to_dict(orient='index')

# Extracting lemmatized text
text = ""
for chapter in dict_from_df.values():
    text += chapter['lemmatized'] + " "

# Generate word cloud
wordcloud = WordCloud(collocations=False,
                      width=1200,
                      height=600,
                      background_color='white',
                      max_font_size=100,
                      max_words=200,
                      scale=3).generate(text)

# Display the generated word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
3 Using LDA
Latent Dirichlet Allocation (LDA) is a generative statistical model used for topic modeling. It assumes that documents are probability distributions over topics, and topics are probability distributions over words. The goal of LDA is to uncover the latent (hidden) topics in a collection of documents and the distribution of words in those topics. LDA is unsupervised, meaning it doesn’t require labeled data. It is often used for tasks like document clustering, summarization, and content recommendation.
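To make the generative story concrete, here is a rough sketch of how a single document would be produced under the assumptions LDA makes: draw a topic mixture from a Dirichlet distribution, then draw a topic and a word for each position. The vocabulary, the two topic-word distributions, and the function names are invented for illustration; fitting LDA means inferring such distributions from data, which this sketch does not attempt.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocab, alpha, n_words, rng):
    """Generate one toy document following the LDA generative process."""
    # Document-level distribution over topics
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this position, then a word from that topic
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words

rng = random.Random(0)
vocab = ["king", "throne", "lamb", "altar"]  # hypothetical vocabulary
topic_word_probs = [
    [0.45, 0.45, 0.05, 0.05],  # a "monarchy"-like topic
    [0.05, 0.05, 0.45, 0.45],  # an "offerings"-like topic
]
theta, words = generate_document(topic_word_probs, vocab,
                                 alpha=[0.5, 0.5], n_words=8, rng=rng)
print(theta, words)
```

Small values of alpha tend to produce documents dominated by one topic, which is the kind of sparse mixture that low alpha settings in LDA encourage.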
LDA is widely regarded as a standard method for topic modeling, and there are several reasons why it is still widely used today, 20 years after its introduction in 2003. Its practical advantages are that it is effective and computationally inexpensive, that it handles both shorter and longer input documents, and that it provides topics that are easily interpretable.
3.1 Limitations of LDA
One limitation of LDA lies in its difficulty to ascertain its efficacy due to its nature of producing soft-clusters for topics. Moreover, the reliability of coherence and perplexity scores, commonly used to evaluate LDA, can sometimes be questioned. However, perhaps the most significant constraint of LDA, one that BERTopic seeks to address, is rooted in its fundamental premise - the bag-of-words representation. In LDA, documents are perceived as probabilistic blends of latent topics, with each topic characterized by a probability distribution over words, and documents represented through a bag-of-words model. While this representation suffices for uncovering latent themes, it lacks depth in capturing a document’s semantic structure. Neglecting the semantic relationships between words can critically impact the accuracy of results from a topic model. For instance, consider the sentence “The girl became the queen of England”; the bag-of-words approach fails to recognize the semantic connection between “girl” and “queen.” Consequently, LDA may overlook the nuanced meaning of a sentence, especially when semantic structure significantly influences word interpretation. Despite these limitations, it is worth running the LDA.
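The loss of structure can be seen in a small sketch: two sentences with different meanings produce identical bag-of-words representations because word order is discarded. The helper bag_of_words and the second sentence are assumptions made for this example.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercase, split on whitespace, and count words (order is lost)."""
    return Counter(sentence.lower().split())

a = bag_of_words("The girl became the queen of England")
b = bag_of_words("The queen of England became the girl")

# Both sentences map to the same counts despite different meanings
print(a == b)
```

Any model that sees only these counts cannot tell the two sentences apart, which is the limitation that embedding-based approaches try to address.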
3.2 Creating a Dictionary and Corpus
Before we run the algorithm, we need to create a dictionary. A dictionary maps each unique word in the entire corpus to a unique integer id. The dictionary is used to convert text documents into a bag-of-words representation, a model of text based on an unordered collection (or “bag”) of words. Each document is then represented as a sparse vector of word counts, sparse because most of its elements are zero.
import gensim
from gensim import corpora

# Tokenize text
tokenized_texts = [text.split() for text in merged_df['lemmatized']]
# Create dictionary
dictionary = corpora.Dictionary(tokenized_texts)

# Tokenize the text and count word frequencies
word_freq = Counter()
for text in merged_df['lemmatized']:
    words = word_tokenize(text.lower())  # Tokenize text and convert to lowercase
    word_freq.update(words)

# Create human-readable dictionary with frequency
dictionary_with_frequency = dict(word_freq)
#len(dictionary_with_frequency)
A corpus is a collection of documents, in this case, the preprocessed text of the Bible. It is typically represented as a matrix, with each row representing a document and each column representing a word in the dictionary. The corpus allows the topic modeling algorithm to analyze the relationships between words and documents, and to identify the underlying themes and topics present in the text.
corpus = [dictionary.doc2bow(text) for text in tokenized_texts]
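The dictionary and corpus steps can be sketched without gensim to show what the resulting data looks like. The functions build_dictionary and doc_to_bow are hypothetical stand-ins written for this example; the order in which ids are assigned here is an assumption and may differ from the ids gensim produces.

```python
def build_dictionary(tokenized_docs):
    """Map each unique token to an integer id (first-seen order, an assumption)."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc_to_bow(doc, token2id):
    """Convert a token list into a sorted list of (token_id, count) pairs."""
    counts = {}
    for token in doc:
        token_id = token2id[token]
        counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())

# Toy documents, illustrative only
docs = [["god", "create", "heaven", "earth"],
        ["god", "bless", "day", "god"]]

token2id = build_dictionary(docs)
bow = [doc_to_bow(d, token2id) for d in docs]
print(token2id)
print(bow)
```

Only the words that occur in a document appear in its list of pairs, which is why the representation stays compact even when the dictionary holds thousands of words.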
Once the dictionary and corpus have been created from the preprocessed text, the next step is to build the model and view the topics. To build the LDA model, it is important to specify the number of topics for the model to identify, as well as any other parameters that may be relevant for the analysis (e.g., the number of iterations, the learning rate). The model is then trained on the corpus, and the resulting topics are generated.
Choosing the right number of topics is crucial to ensure that the model is effective at capturing the underlying structure and themes of the text, and is interpretable and informative. One common approach is to use a measure such as the perplexity score or the coherence score to evaluate the model’s performance for a range of different numbers of topics.
Coherence measures are used in NLP to evaluate how well a topic model captures the underlying themes in a corpus of text. Topic coherence has been proposed as an intrinsic evaluation method for topic models and is defined as the average or median of the pairwise word similarities formed by the top words of a given topic.
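The definition above can be illustrated with a deliberately simplified sketch. It is not the c_v measure used later in this report; it assumes, purely for illustration, that the similarity of two words is the Jaccard overlap of the sets of documents in which they occur, and then averages that similarity over all pairs of top words.

```python
from itertools import combinations

def pairwise_coherence(top_words, tokenized_docs):
    """Average document co-occurrence (Jaccard) over all pairs of top words."""
    doc_sets = [set(doc) for doc in tokenized_docs]
    scores = []
    for w1, w2 in combinations(top_words, 2):
        both = sum(1 for d in doc_sets if w1 in d and w2 in d)
        either = sum(1 for d in doc_sets if w1 in d or w2 in d)
        scores.append(both / either if either else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

# Toy corpus, illustrative only
docs = [
    ["king", "throne", "reign"],
    ["king", "throne", "city"],
    ["lamb", "altar", "offering"],
    ["lamb", "altar", "priest"],
]

coherent = pairwise_coherence(["king", "throne"], docs)
mixed = pairwise_coherence(["king", "altar"], docs)
print(coherent, mixed)
```

Words that tend to appear in the same documents score high, while a topic that mixes unrelated words scores low, which is the intuition that measures such as c_v refine.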
The following code allows us to do exactly that: to calculate coherence and perplexity score by the number of topics.
from gensim.models import CoherenceModel

# Initialize lists to store results
topic_nums = list(range(3, 17))
coherence_scores = []
perplexity_scores = []
#random_state - this t
# Iterate over different numbers of topics
for num_topics in topic_nums:
    # Build LDA model
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                id2word=dictionary,
                                                num_topics=num_topics,
                                                random_state=100,
                                                update_every=1,
                                                chunksize=100,
                                                passes=10,
                                                alpha=[0.01]*num_topics,
                                                per_word_topics=True,
                                                eta=[0.01]*len(dictionary.keys()))
    # Compute perplexity (log_perplexity returns the per-word likelihood bound)
    perplexity = lda_model.log_perplexity(corpus)
    # Compute coherence score
    coherence_model_lda = CoherenceModel(model=lda_model,
                                         texts=tokenized_texts,
                                         dictionary=dictionary,
                                         coherence='c_v')
    coherence_score = coherence_model_lda.get_coherence()
    # Append scores to lists
    coherence_scores.append(coherence_score)
    perplexity_scores.append(perplexity)
    print(f"Iteration {num_topics -1}: Coherence Score = {coherence_score}, Perplexity = {perplexity}")
Based on this graph, we can see the number of topics that maximizes coherence.
# Find the number of topics that maximizes coherence
max_coherence_index = results_df['Coherence Score'].idxmax()
optimal_topics_coherence = results_df.loc[max_coherence_index, 'Num Topics']
Thus, it seems that 15 is the number of topics that is most coherent. So I am now re-running the model with 15 topics.
Coherence Score: 0.4692925394327871 for 15 topics.
The output from LDA can be used to identify the topics that are present in the corpus and the words that are associated with each topic. Thus, I create a function that shows the keywords associated with every topic for every document. Note that only the first few keywords are computed, namely those that contribute about 90% of a topic within that document. The rest add very little to that topic.
def format_topics_sent(ldamodel, corpus, texts):
    formatted_topics = []
    for i, row in enumerate(ldamodel[corpus]):
        # With per_word_topics=True, row[0] holds the document's topic distribution
        row = sorted(row[0], key=lambda x: x[1], reverse=True)
        topics_info = []
        # Iterate over the topics returned for this document
        for j, (topic_num, prop_topic) in enumerate(row):
            wp = ldamodel.show_topic(topic_num)
            topic_keywords = ", ".join([word for word, prop in wp])
            topic_info = {
                'topic': int(topic_num),
                'contrib': round(prop_topic, 4),
                'keywords': topic_keywords,
                'doc': i + 1
            }
            topics_info.append(topic_info)
        formatted_topics.extend(topics_info)
    return formatted_topics

# Using the function to create a dataframe.
sent_topics_df = format_topics_sent(lda_model, corpus, tokenized_texts)
sent_topics_df = pd.DataFrame(sent_topics_df)
sent_topics_df
We can now finally create a plot that shows the contribution of every topic to each document.
library(tidyr)
library(tidyverse)
library(reticulate)

merged_topics_df <- reticulate::py$merged_df2
reshaped_df2 <- merged_topics_df %>%
  group_by(doc) %>%
  mutate(rank = rank(-contrib, ties.method = "min"))
reshaped_df2$rowno <- as.numeric(rownames(reshaped_df2))
reshaped_df2_756 <- subset(reshaped_df2, doc <= 50)

topics_genesis_lda <- ggplot(reshaped_df2_756, aes(topic, reorder(book_names, -doc))) +
  geom_tile(aes(fill = contrib),
            color = ifelse(reshaped_df2_756$rank == 1, "red", NA),
            linewidth = 1) +
  scale_fill_viridis_c() +
  scale_x_continuous(limits = c(min(reshaped_df2_756$topic) - 1, max(reshaped_df2_756$topic) + 1),
                     breaks = min(reshaped_df2_756$topic):max(reshaped_df2_756$topic)) +
  ylab("Document") +
  ggtitle("Contribution of topics to every document") +
  labs(fill = "Topic\nContribution\nScores")
topics_genesis_lda
We can now identify the top 10 keywords associated with every topic.
topic_keys = {'Topic_' + str(i): [(token, score) for token, score in lda_model.show_topic(i, topn=10)]
              for i in range(0, lda_model.num_topics)}

# Initialize an empty list to store data
data = []
# Iterate through the topic_keys dictionary and flatten the data
for topic, keywords in topic_keys.items():
    for token, score in keywords:
        data.append([topic, token, score])

# Create a DataFrame
df = pd.DataFrame(data, columns=['Topic', 'Token', 'Score'])
# Seeing the result
df_15 = df.head(15)
HTML(df_15.to_html(index=False))
Topic      Token         Score
Topic_0    woman         0.082425
Topic_0    wife          0.059742
Topic_0    must          0.053713
Topic_0    husband       0.045201
Topic_0    man           0.038121
Topic_0    commit        0.031362
Topic_0    lord          0.027722
Topic_0    give          0.025856
Topic_0    sister        0.023506
Topic_0    mother        0.021462
Topic_1    tent          0.193075
Topic_1    ark           0.135989
Topic_1    tabernacle    0.106218
Topic_1    set           0.048365
Topic_1    curtain       0.046702
We will now graph the keywords associated with every topic.
# Creating a list of topics
original_topics <- unique(df2c$topic_name)
# Creating a list of labels
labels <- c("Family Roles", "Sacred Spaces", "Divine Covenant", "God\'s Love",
            "Ancestral Lineage", "Communication Verbs", "Holy Measurements",
            "City Declarations", "Sacred Offerings", "Daily Life",
            "Monarchy and Nations", "Lineage of Levites", "Future Prophecies",
            "Universal Identity", "Measurement Units")
# Numbering the topics
topic_no_num <- as.numeric(gsub("\\D", "", original_topics))
topic_no_num_reorg <- topic_no_num + 2

# Create a list combining original topics and their corresponding labels
topics_with_labels <- list(original_topics = original_topics,
                           labels = labels,
                           topic_no_num = topic_no_num,
                           topic_no_num_reorg = topic_no_num_reorg)

# Assigning the labels to the dataframe
df2d <- df2c
df2d$topic_label <- NA
for (key in 1:length(topics_with_labels$original_topics)) {
  # Extracting current label pair
  current_label_pair <- topics_with_labels$original_topics[[key]]
  assignment <- topics_with_labels$labels[[key]]
  # Finding rows where Topic matches the first element of the label pair
  rows_to_update <- df2d$topic_name == current_label_pair[1]
  # Updating topic_label where the condition is met
  df2d$topic_label[rows_to_update] <- assignment
}

# Including the names
df2d$up_label <- paste(df2d$topic_no, df2d$topic_label, sep = "\n ")

# Arranging the topics
library(ggh4x)
df2e <- subset(df2d, !duplicated(Topic))
df2e <- df2e %>% arrange(topic)

# Creating a graph with the topics
topics_lda_keys <- ggplot(df2c, aes(Rank, reorder(topic_no, -topic))) +
  geom_tile(aes(fill = Score)) +
  scale_fill_viridis_c() +
  geom_label(aes(y = topic_no, x = Rank, label = Token), size = 1.9) +
  scale_x_continuous(limits = c(min(df2b$Rank) - 1, max(df2b$Rank) + 1),
                     breaks = min(df2b$Rank):max(df2b$Rank)) +
  guides(y.sec = guide_axis_manual(labels = rev(df2e$topic_label))) +
  labs(fill = "Keyword\nContribution\nScores") +
  ylab("Topic") +
  theme(legend.position = "bottom")
topics_lda_keys
Finally, we can count the number of documents associated with every topic.
reshaped_count <- subset(reshaped_df2, rank == 1)
topic_counts <- data.frame(table(reshaped_count$topic))
names(topic_counts)[1] <- "topic"
names(topic_counts)[2] <- "no_documents"

# Assigning labels
topic_counts$topic_label <- NA
for (key in 1:length(topics_with_labels$topic_no_num)) {
  # Extracting current label pair
  current_label_pair <- topics_with_labels$topic_no_num[[key]]
  assignment <- topics_with_labels$labels[[key]]
  # Finding rows where Topic matches the first element of the label pair
  rows_to_update <- topic_counts$topic == current_label_pair[1]
  # Updating topic_label where the condition is met
  topic_counts$topic_label[rows_to_update] <- assignment
}

# Graphing the dataframe
topics_lda <- ggplot(topic_counts, aes(y = topic, x = no_documents)) +
  geom_bar(stat = "identity") +
  guides(y.sec = guide_axis_manual(labels = topic_counts$topic_label)) +
  labs(fill = "Keyword\nContribution\nScores") +
  ylab("Topic") +
  theme(legend.position = "bottom") +
  ggtitle("Topics LDA")
topics_lda

#sum(topic_counts$no_documents)
4 Using BERTopic
BERTopic takes its name from BERT, which stands for Bidirectional Encoder Representations from Transformers. It is a topic modeling technique based on embeddings. Word embeddings are contextualized word representations learned from a large corpus of text using a deep neural network. These capture rich conceptual information about each word, getting at semantic meaning and contextual relationships among words. Specifically, the method aims to assign a real-valued vector to each word that encodes its meaning and semantics in such a way that similar words have vectors that are close together in the vector space.
BERTopic leverages transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions. BERTopic clusters documents and adds meaning to them by linking them with terms and ranked c-TF-IDF scores. This gives an impression of which words are relatively prevalent in one topic. BERTopic has been reported to outperform other topic modeling techniques, such as LDA, in terms of topic coherence and interpretability, although such results depend on the corpus and the settings used. The suggested minimum amount of data for BERTopic is 1000 documents.
There are a variety of parameters in BERTopic. The first is the choice of model for encoding text to dense vector embeddings. sentence-transformers/all-MiniLM-L6-v2 is a pre-trained transformer-based sentence-embedding model available through the Sentence Transformers library on Hugging Face. It maps sentences and paragraphs to 384-dimensional dense vectors that can be used for tasks such as clustering and semantic search. Other important parameters that can be changed in BERTopic include:
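The notion of vectors being "close together" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings come from a trained model and have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (not produced by any real model)
toy_embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "river": [0.10, 0.20, 0.95],
}

sim_related = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_unrelated = cosine_similarity(toy_embeddings["king"], toy_embeddings["river"])
print(sim_related > sim_unrelated)
```

Clustering algorithms operating on such vectors group texts whose embeddings point in similar directions, which is how semantically related chapters can end up in the same topic.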
UMAP
Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality reduction technique that aims to preserve both local and global structure of high-dimensional data. It is based on the idea of constructing a topological representation of the data manifold using a graph-based approach, where points close to each other in the high-dimensional space are connected by edges in the graph.
HDBSCAN
A variety of different clustering models are available for use, such as HDBSCAN, k-Means and BIRCH. The default setting, HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), is a density-based clustering algorithm. HDBSCAN utilizes a hierarchical density-based approach to cluster data points and is designed to be effective, even for large and complex datasets. It applies a cluster stability analysis to select the optimal clustering solution. When using HDBSCAN in BERTopic, a number of outlier documents might also be created if they do not fall within any of the created topics. They will be assigned to a topic labeled “-1” which will contain the outlier documents that cannot be assigned to a topic given an automatically calculated probability threshold.
CountVectorizer
The CountVectorizer is a class in scikit-learn that converts a collection of text documents to a matrix of token counts to extract features from text. Specifically, it converts the text documents into a bag-of-words representation, where each document is represented by a vector that counts the frequency of each word in the document. In addition, it performs several text pre-processing steps such as tokenizing the text, lowercasing and, when configured to do so, removing stop words. Thus, within BERTopic, there is no need to pre-process the text beforehand.
Weighting Scheme: c-TF-IDF
The goal is to discern what sets apart one cluster from another based on the generated bag-of-word representation. We seek distinctive words that characterize each cluster, setting it apart from the rest. To achieve this, BERTopic employs a modified version of TF-IDF, known as c-TF-IDF. This approach assigns each cluster a specific topic, facilitating the identification of distinguishing features within each cluster’s word representation. Unlike traditional TF-IDF, which focuses on individual documents, c-TF-IDF evaluates the significance of words within clusters of documents. Consequently, this method enables the creation of topic-word distributions for each cluster, shedding light on the unique characteristics of each topic.
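A rough sketch of the weighting may help. It follows the c-TF-IDF formula as I understand it from the BERTopic documentation: the frequency of a term within a class, normalized by the class size, multiplied by log(1 + A / f), where A is the average number of words per class and f is the term's frequency across all classes. The function name c_tf_idf and the toy classes are assumptions for this example, and options such as reduce_frequent_words are not reproduced.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """Score terms per class; class_docs maps a label to its concatenated tokens."""
    class_counts = {c: Counter(tokens) for c, tokens in class_docs.items()}
    # Frequency of each term across all classes
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # Average number of words per class
    avg_words_per_class = sum(total_counts.values()) / len(class_docs)
    scores = {}
    for c, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[c] = {
            term: (freq / n_words) * math.log(1 + avg_words_per_class / total_counts[term])
            for term, freq in counts.items()
        }
    return scores

# Toy clusters, illustrative only
classes = {
    "kings": ["king", "throne", "king", "lord"],
    "offerings": ["lamb", "altar", "lamb", "lord"],
}
scores = c_tf_idf(classes)
top_term_kings = max(scores["kings"], key=scores["kings"].get)
print(top_term_kings)
```

In this toy case "lord" appears in both clusters and therefore scores lower than "king" within the first cluster, which mirrors how words shared across clusters are down-weighted.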
To recap, running BERTopic involved the following steps:
The text was embedded using BERT embeddings: SentenceTransformers was used to obtain a context-based representation with the pre-trained model all-MiniLM-L6-v2.
After the numerical representation of the documents was created, its dimensionality was reduced with UMAP.
The reduced BERT embeddings were then clustered with HDBSCAN.
Within each cluster, a bag-of-words representation was generated through the CountVectorizer(). This function also performs several text pre-processing steps such as tokenizing the text, lowercasing and, when configured, removing stopwords.
From the generated bag-of-words representation of each cluster, the clusters were distinguished from one another using c-TF-IDF, computed through the ClassTfidfTransformer().
4.1 Limitations of BERTopic
BERTopic is an improvement over LDA in that it addresses the problem of semantic understanding by using the embedded component. Thanks to the support for hierarchical topic reduction, BERTopic allows for a more nuanced understanding of the relationships between topics.
Some of the limitations include the fact that more time is needed for running and fine-tuning the model, which is significantly more resource demanding and computationally expensive than LDA. Since it is a deep learning model, it also scales better with larger corpora and can have difficulties with giving accurate topic representations on smaller datasets.
import pandas as pd
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer
from hdbscan import HDBSCAN
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

# Define your dataframe merged_df here
coherence_scores = []
for nr_topics in range(3, 17):  # Iterate over nr_topics from 3 to 16
    #print(f"Calculating coherence for {nr_topics} topics...")
    # Define and fit BERTopic model
    umap_model = UMAP(n_neighbors=25, n_components=20, min_dist=0.0, metric='cosine', low_memory=False, random_state=42)
    ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
    hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', prediction_data=True)
    vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
    topic_model = BERTopic(min_topic_size=10,
                           top_n_words=20,
                           umap_model=umap_model,
                           ctfidf_model=ctfidf_model,
                           vectorizer_model=vectorizer_model,
                           calculate_probabilities=True,
                           hdbscan_model=hdbscan_model,
                           nr_topics=nr_topics)
    topics, probs = topic_model.fit_transform(merged_df['text'])

    # Preprocess Documents
    documents = pd.DataFrame({"Document": merged_df['text'], "ID": range(len(merged_df['text'])), "Topic": topics})
    documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
    cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

    # Extract vectorizer and analyzer from BERTopic
    vectorizer = topic_model.vectorizer_model
    analyzer = vectorizer.build_analyzer()

    # Extract features for Topic Coherence evaluation
    words = vectorizer.get_feature_names_out()
    tokens = [analyzer(doc) for doc in cleaned_docs]
    dictionary = corpora.Dictionary(tokens)
    corpus = [dictionary.doc2bow(token) for token in tokens]
    topic_words = [[words for words, _ in topic_model.get_topic(topic)]
                   for topic in range(len(set(topics)) - 1)]

    # Evaluate Coherence
    coherence_model = CoherenceModel(topics=topic_words,
                                     texts=tokens,
                                     corpus=corpus,
                                     dictionary=dictionary,
                                     coherence='c_v')
    coherence = coherence_model.get_coherence()
    coherence_scores.append({'nr_topics': nr_topics, 'coherence_score': coherence})
    print(f"Calculated coherence for {nr_topics} topics.")
Calculated coherence for 3 topics.
Calculated coherence for 4 topics.
Calculated coherence for 5 topics.
Calculated coherence for 6 topics.
Calculated coherence for 7 topics.
Calculated coherence for 8 topics.
Calculated coherence for 9 topics.
Calculated coherence for 10 topics.
Calculated coherence for 11 topics.
Calculated coherence for 12 topics.
Calculated coherence for 13 topics.
Calculated coherence for 14 topics.
Calculated coherence for 15 topics.
Calculated coherence for 16 topics.
# Create DataFrame to store coherence scores
coherence_df = pd.DataFrame(coherence_scores)
We can now visualize the coherence scores.
library(ggplot2)
library(ggpubr)

results_df_r <- reticulate::py$coherence_df

# Plot the dataframe using ggplot2
coh_graph <- ggplot(data = results_df_r, aes(x = nr_topics, y = coherence_score)) +
  geom_line() +
  scale_x_continuous(breaks = unique(results_df_r$nr_topics)) + # Specify x-axis breaks
  theme_bw()
coh_graph
4.2 Rerunning BERTopic for topic no with highest coherence
In order to make the topic models somewhat comparable to the LDA, and to make the topics more granular, I choose 11 topics, although 4 seems to have the highest coherence. Nevertheless, 11 is still a reasonable number with a reasonable coherence.
The following line allows us to save the topics and the probabilities. It is important to note that BERTopic does not need lemmatized text. This is because BERTopic uses transformers that were trained on natural text, not on text that has been stripped of stopwords or reduced to lemmas. In addition, BERTopic uses the CountVectorizer component from scikit-learn to build topic representations; it performs tokenization and lowercasing and, with the stop_words="english" setting used above, stopword removal. Documents that do not fit any cluster are treated as noise and are assigned to topic_id = -1.
# The following code describes how to calculate coherence scores.
# https://github.com/MaartenGr/BERTopic/issues/90
from bertopic import BERTopic
import gensim.corpora as corpora
from gensim.models.coherencemodel import CoherenceModel

# Preprocess Documents
docs = merged_df['text']
documents = pd.DataFrame({"Document": docs,
                          "ID": range(len(docs)),
                          "Topic": topics})
documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
cleaned_docs = topic_model._preprocess_text(documents_per_topic.Document.values)

# Extract vectorizer and analyzer from BERTopic
vectorizer = topic_model.vectorizer_model
analyzer = vectorizer.build_analyzer()

# Extract features for Topic Coherence evaluation
words = vectorizer.get_feature_names_out()
tokens = [analyzer(doc) for doc in cleaned_docs]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(token) for token in tokens]
topic_words = [[words for words, _ in topic_model.get_topic(topic)]
               for topic in range(len(set(topics)) - 1)]

# Evaluate
coherence_model = CoherenceModel(topics=topic_words,
                                 texts=tokens,
                                 corpus=corpus,
                                 dictionary=dictionary,
                                 coherence='c_v')
coherence_score = coherence_model.get_coherence()
coherence_score
0.6514031139119772
BERTopic created 11 topics. The first one is topic -1. In BERTopic, a topic with the label “-1” typically indicates that the document does not significantly align with any of the defined topics in the model. It is often considered noise or an outlier in the context of the topics learned by the model.
The 11 topics created are listed below.
# Creating a list with unique topics
topics_relabeled = list(set(topics))
# Ordering the topic numbers
ordered_topics = sorted(topics_relabeled)
# These are the topics
ordered_topics
[-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
BERTopic also created an array with probabilities that has the following dimensions:
probs.shape
(1189, 10)
The first number, 1189, corresponds to the number of documents. The second number, 10, corresponds to the probabilities associated with each of the 10 regular topics. Note that no probabilities were computed for the residual topic (-1).
Therefore, I am now creating a dataframe that associates these probabilities with the 1189 documents in our analysis, adding an empty column for topic -1 so that all 11 topics are represented.
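Once each document has a row of topic probabilities, a natural next step is to pick the most probable topic per document. The sketch below shows this with plain Python lists; the function name dominant_topics and the toy probabilities are assumptions for this example and do not reproduce the values computed in this report.

```python
def dominant_topics(prob_rows):
    """Return, for each row, the index of the topic with the highest probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]

# Toy probabilities for two documents over four topics (illustrative only)
rows = [
    [0.06, 0.12, 0.03, 0.14],
    [0.25, 0.06, 0.02, 0.04],
]
print(dominant_topics(rows))
```

Note that the rows need not sum to one, since outlier mass assigned to topic -1 is not part of the probability array.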
probs_df = pd.DataFrame(probs)
probs_df.rename(columns={col: f'topic_{col}' for col in probs_df.columns}, inplace=True)
# Adding column "topic_-1"
probs_df.insert(0, "topic_-1", None)
df_5 = probs_df.head(5)
HTML(df_5.to_html(index=False))
topic_-1  topic_0        topic_1        topic_2        topic_3        topic_4        topic_5        topic_6        topic_7        topic_8   topic_9
None      6.398443e-02   1.220270e-01   3.308001e-02   4.360875e-02   3.036763e-02   2.980985e-02   1.379953e-01   2.669987e-02   0.021563  4.201700e-02
None      6.674281e-02   1.216345e-01   3.250206e-02   4.572841e-02   3.035448e-02   3.098863e-02   1.593987e-01   2.652198e-02   0.022427  4.094005e-02
None      6.387558e-02   8.711585e-02   2.637531e-02   6.316551e-02   2.774398e-02   2.930832e-02   1.437205e-01   2.244477e-02   0.022091  3.242878e-02
None      2.494071e-01   6.160130e-02   1.606678e-02   2.990693e-02   1.672392e-02   4.181984e-02   3.949162e-02   1.483390e-02   0.049668  1.809949e-02
None      1.202258e-307  3.806346e-308  8.786780e-309  1.248075e-308  8.377273e-309  2.034558e-308  1.721047e-308  8.052732e-309  1.000000  9.428069e-309
I now try to include the book names in the dataframe.
We can use the Term Score Matrix to examine the term score decline for topics and compare scores for terms of the same rank for different topics. The Term Score Decline diagram helps us decide whether we can cut off the number of terms we want to distinguish. E.g. We might conclude that only the top 5 terms are important enough for us to consider in order to assign meaningful topic names.
all_topics_data = []
# Loop through each topic id in the 'topics' list
for topic_id in list(set(topics)):
    # Get the topic
    t = topic_model.get_topic(topic_id)
    # Select top 9 terms and scores
    top_terms = [x[0] for x in t[:9]]
    top_scores = [x[1] for x in t[:9]]
    # Append data for the current topic to the list
    for term, score in zip(top_terms, top_scores):
        all_topics_data.append({'Topic': topic_model.topic_labels_[topic_id], 'Term': term, 'Score': score})
# Create DataFrame from the collected data
all_topics_df = pd.DataFrame(all_topics_data)
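To compare scores of terms with the same rank across topics, the long-format frame can be pivoted into a rank-by-topic matrix. The following is a minimal sketch, assuming `all_topics_df` has the `Topic`, `Term` and `Score` columns created above. The toy rows and the `Rank` column are illustrative additions, not values from the analysis.

```python
import pandas as pd

# Toy stand-in for all_topics_df (made-up topics, terms and scores).
all_topics_df = pd.DataFrame({
    "Topic": ["0_a", "0_a", "0_a", "1_b", "1_b", "1_b"],
    "Term": ["lord", "king", "israel", "love", "faith", "grace"],
    "Score": [0.09, 0.05, 0.02, 0.07, 0.06, 0.01],
})

# Rank terms within each topic, 1 = highest score.
all_topics_df["Rank"] = (
    all_topics_df.groupby("Topic")["Score"]
    .rank(method="first", ascending=False)
    .astype(int)
)

# Rows are ranks, columns are topics, cells are term scores.
score_matrix = all_topics_df.pivot(index="Rank", columns="Topic", values="Score")
print(score_matrix)
```

Reading down a column shows how quickly the scores of one topic decline, while reading across a row compares terms of the same rank between topics.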
It could also be important to know how many documents in the Bible are associated with every topic.
reshaped_df3$rowno <- as.numeric(rownames(reshaped_df3))
reshaped_count <- subset(reshaped_df3, rank == 1)
topic_counts <- data.frame(table(reshaped_count$topic))
names(topic_counts)[1] <- "topic"
names(topic_counts)[2] <- "no_documents"
topic_counts$topic <- as.numeric(gsub("topic_", "", topic_counts$topic))
# Assigning topic labels to topics
topic_counts$topic_label <- NA
for (key in 1:length(topics_with_labels$topic_no_num)) {
  # Extracting current label pair
  current_label_pair <- topics_with_labels$topic_no_num[[key]]
  assignment <- topics_with_labels$labels[[key]]
  # Finding rows where Topic matches the first element of the label pair
  rows_to_update <- topic_counts$topic == current_label_pair[1]
  # Updating topic_label where the condition is met
  topic_counts$topic_label[rows_to_update] <- assignment
}
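For readers following the Python part of the report, a similar count can be sketched in pandas by assigning each document to its most probable topic. This assumes probability columns named `topic_0`, `topic_1` and so on. The values are toy data, and the residual topic (-1) is not handled here.

```python
import pandas as pd

# Toy stand-in for the per-document topic probabilities.
probs_df = pd.DataFrame({
    "topic_0": [0.7, 0.2, 0.6, 0.1],
    "topic_1": [0.3, 0.8, 0.4, 0.9],
})

# Most probable topic per document (column label of the row-wise maximum).
top_topic = probs_df.idxmax(axis=1)

# Count how many documents fall under each topic.
topic_counts = (
    top_topic.value_counts()
    .rename_axis("topic")
    .reset_index(name="no_documents")
)

# Strip the prefix to get a numeric topic id, as the R code does with gsub.
topic_counts["topic"] = (
    topic_counts["topic"].str.replace("topic_", "", regex=False).astype(int)
)
print(topic_counts)
```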
It is difficult to say whether there is substantial overlap among the topics identified by the two methods. Choosing between LDA and BERTopic depends on the specific requirements and the characteristics of the dataset. LDA is a reasonable choice when the priority is a well-established, computationally efficient method for uncovering latent themes in the text without fine-grained semantic analysis. However, if a deeper contextual understanding is needed and we are dealing with over 1,000 documents, BERTopic may be the better choice. BERTopic leverages transformer-based models like BERT to capture nuanced semantic relationships between words, allowing for more granular topic distinctions based on word embeddings and contextual information.