Topic Modeling Using LDA vs BERTopic

Application to King James’ Bible

Author

Bogdan G. Popescu

1 Introduction

In this report, I perform topic modeling on the Bible using two methods: Latent Dirichlet Allocation (LDA) and BERTopic, which relies on word embeddings. Both are unsupervised machine learning approaches for extracting latent topics from text. Comprising 66 books written by multiple authors over many centuries, the Bible spans a diverse array of genres, including historical and religious narratives, poetry, epistles, and prophecies. Topic modeling can offer fresh insights into this important text.

Thus, the main questions asked in this report are:

  • What are the general topics in the Bible?

  • How do the two topic modeling methods, LDA and BERTopic, compare when applied to the Bible?

Before applying LDA and BERTopic, I perform some descriptive analysis.

2 Descriptive Approach

Data cleaning and preprocessing are important steps when analyzing textual data before vectorizing it. The preprocessing and data cleaning techniques include tokenization, lemmatization, stopword removal, lowercasing and non-alphabetical character removal. These techniques aim to prepare the data for NLP computations and further remove noise and meaningless information from the data.

We first load the relevant libraries.

#Importing the relevant libraries
import requests
import numpy as np
import string
import pandas as pd
import re
import matplotlib.pyplot as plt

#NLTK
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize, bigrams
from collections import Counter
from wordcloud import WordCloud

#Bertopic
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from umap import UMAP
from sklearn.feature_extraction.text import CountVectorizer
from hdbscan import HDBSCAN

#HTML display
from IPython.display import HTML

We then extract the Bible from the following URL and turn the response into actual text.

bible_url = 'https://bereanbible.com/bsb.txt'
path = "./data/bereanbible.txt"
#with open(path, 'rb') as f:
#  text = f.read()
#Requesting the URL
response = requests.get(bible_url)
raw_bible = response.text
#Printing the first 400 characters
raw_bible[:400]
"The Holy Bible, Berean Standard Bible, BSB is produced in cooperation with Bible Hub, Discovery Bible, OpenBible.com, and the Berean Bible Translation Committee. \t\nThis text of God's Word has been dedicated to the public domain. Free resources and databases are available at BereanBible.com.\t\nVerse\tBerean Standard Bible\nGenesis 1:1\tIn the beginning God created the heavens and the earth.\nGenesis 1:2"

The following command splits the text into lines.

bible_list = raw_bible.splitlines()
#This prints the first 2 entries.
bible_list[:2]
['The Holy Bible, Berean Standard Bible, BSB is produced in cooperation with Bible Hub, Discovery Bible, OpenBible.com, and the Berean Bible Translation Committee. \t', "This text of God's Word has been dedicated to the public domain. Free resources and databases are available at BereanBible.com.\t"]

The next line takes the list of strings (bible_list), splits each string at the tab character, and stores the result in a NumPy array (bible_array).

bible_array = np.array([item.split('\t') for item in bible_list])
bible_array[:2]
array([['The Holy Bible, Berean Standard Bible, BSB is produced in cooperation with Bible Hub, Discovery Bible, OpenBible.com, and the Berean Bible Translation Committee. ',
        ''],
       ["This text of God's Word has been dedicated to the public domain. Free resources and databases are available at BereanBible.com.",
        '']], dtype='<U400')
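As an alternative to splitting the lines by hand, the same tab-separated layout could be read directly into a DataFrame. The sketch below is only an illustration under stated assumptions: `sample_text` is a made-up stand-in for `raw_bible` with the same layout (two banner lines, one header line, then one verse per line), and the variable name `verses` is hypothetical.

```python
import csv
import io

import pandas as pd

# Toy stand-in for raw_bible: two banner lines, one header line, then verses.
sample_text = (
    "Banner line one \t\n"
    "Banner line two\t\n"
    "Verse\tBerean Standard Bible\n"
    "Genesis 1:1\tIn the beginning God created the heavens and the earth.\n"
    "Genesis 1:3\tAnd God said, \"Let there be light,\" and there was light.\n"
)

verses = pd.read_csv(
    io.StringIO(sample_text),
    sep="\t",
    skiprows=3,                  # skip the two banner lines and the header line
    names=["reference", "text"],
    quoting=csv.QUOTE_NONE,      # treat quotation marks as ordinary characters
)
print(verses)
```

Setting `quoting=csv.QUOTE_NONE` matters for this kind of text, because verses contain quotation marks that a CSV parser would otherwise try to interpret as field delimiters.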

In the next step, we want to examine more closely what our data looks like:

df = pd.DataFrame(bible_array[3:, :], columns=['reference','text'])
# Displaying the first three verses as an HTML table
df_3 = df.head(3)
HTML(df_3.to_html(index=False))
reference text
Genesis 1:1 In the beginning God created the heavens and the earth.
Genesis 1:2 Now the earth was formless and void, and darkness was over the surface of the deep. And the Spirit of God was hovering over the surface of the waters.
Genesis 1:3 And God said, “Let there be light,” and there was light.

2.1 Extracting Chapter Names

In the next lines, I create a function that extracts the chapter label from each verse reference. For example, a reference such as “Genesis 1:1” or “Genesis 1:2” turns into “Genesis 1”. The logic is that chapters like “Genesis 1” will become the documents in our topic modeling analysis. In NLP, a document can be any unit of text: a journal paper, a book, an article, a chapter, or even a sentence. In our case, it is a chapter: “Genesis 1” and “Genesis 2” are two separate documents. Note that the resulting column is named book_names even though it stores chapter labels.

#The next function only extracts the chapter name: e.g. for Genesis 1:1, it extracts Genesis 1   
def extract_book_name(verse_reference):
    # Use regular expression to extract the book name and chapter number
    match = re.match(r'(.+?)\s+(\d+):\d+', verse_reference)
    if match:
        book_name = match.group(1)
        chapter_number = match.group(2)
        return f"{book_name} {chapter_number}"
    else:
        return None
      
#The next function only extracts the book name: e.g. for Genesis 1:1, it extracts Genesis
def extract_book_name2(verse_reference):
    # Use regular expression to extract the book name
    match = re.match(r'(.+?)\s+\d+:\d+', verse_reference)
    if match:
        book_name = match.group(1)
        return book_name
    else:
        return None
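To see how the regular expression behaves on book names that contain digits or several words, here is a small self-contained sketch. The helper name `chapter_label` is hypothetical (it mirrors the logic of extract_book_name), and the sample references are illustrative strings rather than rows taken from the data.

```python
import re

def chapter_label(verse_reference):
    """Return 'Book Chapter' for a reference such as 'Genesis 1:1'."""
    # Lazy (.+?) grows until it is followed by whitespace, digits, a colon, digits
    match = re.match(r'(.+?)\s+(\d+):\d+', verse_reference)
    return f"{match.group(1)} {match.group(2)}" if match else None

for ref in ["Genesis 1:1", "1 Samuel 17:4", "Song of Solomon 2:1", "Verse"]:
    print(ref, "->", chapter_label(ref))
```

Because the first group is lazy but must be followed by a `chapter:verse` pattern, a leading digit or an internal space in the book name does not end the match early. Strings without that pattern, such as a header cell, return None.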

Now I apply the function.

df["book_names"] = df["reference"].apply(extract_book_name)
df_5 = df.head(5)
#len(df["book_names"].unique())
HTML(df_5.to_html(index=False))
reference text book_names
Genesis 1:1 In the beginning God created the heavens and the earth. Genesis 1
Genesis 1:2 Now the earth was formless and void, and darkness was over the surface of the deep. And the Spirit of God was hovering over the surface of the waters. Genesis 1
Genesis 1:3 And God said, “Let there be light,” and there was light. Genesis 1
Genesis 1:4 And God saw that the light was good, and He separated the light from the darkness. Genesis 1
Genesis 1:5 God called the light “day,” and the darkness He called “night.” And there was evening, and there was morning—the first day. Genesis 1

In the next step, I inspect the unique chapter names.

unique_entries = df["book_names"].unique()
unique_entries
array(['Genesis 1', 'Genesis 2', 'Genesis 3', ..., 'Revelation 20',
       'Revelation 21', 'Revelation 22'], dtype=object)

Next, I join the verses of each chapter into a single text, while keeping the chapters in their original order.

# Add a column to store the original order
df['original_order'] = range(len(df))
merged_df = df.groupby('book_names')['text'].apply(' '.join).reset_index()
# Merge back the original order
merged_df = pd.merge(merged_df, df[['book_names', 'original_order']], on='book_names', how='left')
# Sort the DataFrame based on the original order
merged_df.sort_values('original_order', inplace=True)
# Drop the original_order column if it's no longer needed
merged_df.drop(columns='original_order', inplace=True)
merged_df = merged_df[['book_names', 'text']].drop_duplicates()
#Examining the first two entries
df_2 = merged_df.head(2)
HTML(df_2.to_html(index=False))
book_names text
Genesis 1 In the beginning God created the heavens and the earth. Now the earth was formless and void, and darkness was over the surface of the deep. And the Spirit of God was hovering over the surface of the waters. And God said, “Let there be light,” and there was light. And God saw that the light was good, and He separated the light from the darkness. God called the light “day,” and the darkness He called “night.” And there was evening, and there was morning—the first day. And God said, “Let there be an expanse between the waters, to separate the waters from the waters.” So God made the expanse and separated the waters beneath it from the waters above. And it was so. God called the expanse “sky.” And there was evening, and there was morning—the second day. And God said, “Let the waters under the sky be gathered into one place, so that the dry land may appear.” And it was so. God called the dry land “earth,” and the gathering of waters He called “seas.” And God saw that it was good. Then God said, “Let the earth bring forth vegetation: seed-bearing plants and fruit trees, each bearing fruit with seed according to its kind.” And it was so. The earth produced vegetation: seed-bearing plants according to their kinds and trees bearing fruit with seed according to their kinds. And God saw that it was good. And there was evening, and there was morning—the third day. And God said, “Let there be lights in the expanse of the sky to distinguish between the day and the night, and let them be signs to mark the seasons and days and years. And let them serve as lights in the expanse of the sky to shine upon the earth.” And it was so. God made two great lights: the greater light to rule the day and the lesser light to rule the night. And He made the stars as well. God set these lights in the expanse of the sky to shine upon the earth, to preside over the day and the night, and to separate the light from the darkness. And God saw that it was good. 
And there was evening, and there was morning—the fourth day. And God said, “Let the waters teem with living creatures, and let birds fly above the earth in the open expanse of the sky.” So God created the great sea creatures and every living thing that moves, with which the waters teemed according to their kinds, and every bird of flight after its kind. And God saw that it was good. Then God blessed them and said, “Be fruitful and multiply and fill the waters of the seas, and let birds multiply on the earth.” And there was evening, and there was morning—the fifth day. And God said, “Let the earth bring forth living creatures according to their kinds: livestock, land crawlers, and beasts of the earth according to their kinds.” And it was so. God made the beasts of the earth according to their kinds, the livestock according to their kinds, and everything that crawls upon the earth according to its kind. And God saw that it was good. Then God said, “Let Us make man in Our image, after Our likeness, to rule over the fish of the sea and the birds of the air, over the livestock, and over all the earth itself and every creature that crawls upon it.” So God created man in His own image; in the image of God He created him; male and female He created them. God blessed them and said to them, “Be fruitful and multiply, and fill the earth and subdue it; rule over the fish of the sea and the birds of the air and every creature that crawls upon the earth.” Then God said, “Behold, I have given you every seed-bearing plant on the face of all the earth, and every tree whose fruit contains seed. They will be yours for food. And to every beast of the earth and every bird of the air and every creature that crawls upon the earth—everything that has the breath of life in it—I have given every green plant for food.” And it was so. And God looked upon all that He had made, and indeed, it was very good. And there was evening, and there was morning—the sixth day.
Genesis 2 Thus the heavens and the earth were completed in all their vast array. And by the seventh day God had finished the work He had been doing; so on that day He rested from all His work. Then God blessed the seventh day and sanctified it, because on that day He rested from all the work of creation that He had accomplished. This is the account of the heavens and the earth when they were created, in the day that the LORD God made them. Now no shrub of the field had yet appeared on the earth, nor had any plant of the field sprouted; for the LORD God had not yet sent rain upon the earth, and there was no man to cultivate the ground. But springs welled up from the earth and watered the whole surface of the ground. Then the LORD God formed man from the dust of the ground and breathed the breath of life into his nostrils, and the man became a living being. And the LORD God planted a garden in Eden, in the east, where He placed the man He had formed. Out of the ground the LORD God gave growth to every tree that is pleasing to the eye and good for food. And in the middle of the garden were the tree of life and the tree of the knowledge of good and evil. Now a river flowed out of Eden to water the garden, and from there it branched into four headwaters: The name of the first river is Pishon; it winds through the whole land of Havilah, where there is gold. And the gold of that land is pure, and bdellium and onyx are found there. The name of the second river is Gihon; it winds through the whole land of Cush. The name of the third river is Hiddekel; it runs along the east side of Assyria. And the fourth river is the Euphrates. Then the LORD God took the man and placed him in the Garden of Eden to cultivate and keep it. 
And the LORD God commanded him, “You may eat freely from every tree of the garden, but you must not eat from the tree of the knowledge of good and evil; for in the day that you eat of it, you will surely die.” The LORD God also said, “It is not good for the man to be alone. I will make for him a suitable helper.” And out of the ground the LORD God formed every beast of the field and every bird of the air, and He brought them to the man to see what he would name each one. And whatever the man called each living creature, that was its name. The man gave names to all the livestock, to the birds of the air, and to every beast of the field. But for Adam no suitable helper was found. So the LORD God caused the man to fall into a deep sleep, and while he slept, He took one of the man’s ribs and closed up the area with flesh. And from the rib that the LORD God had taken from the man, He made a woman and brought her to him. And the man said: “This is now bone of my bones and flesh of my flesh; she shall be called ‘woman,’ for out of man she was taken.” For this reason a man will leave his father and mother and be united to his wife, and they will become one flesh. And the man and his wife were both naked, and they were not ashamed.
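The grouping step above restores the original order by merging and re-sorting, because groupby sorts its keys alphabetically by default. A more compact variant, shown here only as a sketch on a made-up toy DataFrame, is to pass `sort=False`, which keeps groups in order of first appearance. The names `toy` and `grouped` are hypothetical.

```python
import pandas as pd

# Toy verses: the chapter order is deliberately not alphabetical
toy = pd.DataFrame({
    "book_names": ["Genesis 1", "Genesis 1", "Genesis 2", "Exodus 1"],
    "text": ["a", "b", "c", "d"],
})

# sort=False keeps the groups in the order in which they first appear
grouped = (
    toy.groupby("book_names", sort=False)["text"]
       .apply(" ".join)
       .reset_index()
)
print(grouped)
```

With the default `sort=True`, "Exodus 1" would come first. With `sort=False`, the chapters stay in reading order without a helper column.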

2.2 Pre-processing the Text

We will first pre-process the text. This involves several steps: eliminating double spaces, converting to lower case, removing punctuation, removing stop words, removing the most frequent words, removing special characters, stemming, lemmatizing, and POS tagging. We will then use the processed text to create a word cloud. We use the NLTK package to pre-process the text for the LDA analysis.

Step 1: Eliminating special characters and double spaces

We first replace special characters, such as quotation marks, apostrophes, and dashes, with spaces, and then collapse double spaces.

#Part 1: Replacing quotation marks, apostrophes, dashes, etc. with a space
merged_df['text'] = merged_df['text'].str.replace(r'[\x93\x90\x92\x94\x97]+', ' ', regex=True)
# \x93 - the left double quotation mark
# \x90 - a control character with no standard visual representation
# \x92 - an apostrophe
# \x94 - the right double quotation mark
# \x97 - a dash

#Part 2: Collapsing runs of whitespace into a single space
merged_df['text'] = merged_df['text'].str.replace(r'\s+', ' ', regex=True)
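The two replacements can be traced on a single string. This is a minimal sketch using Python's re module directly rather than the pandas string accessor; the sample sentence is made up, and the `\x93`, `\x94`, and `\x97` escapes stand in for the characters listed above.

```python
import re

# Toy string: \x93 and \x94 stand in for the quotation marks, \x97 for the dash;
# the doubled space before "light" is deliberate.
sample = "And God said, \x93Let there be  light,\x94 and there was morning\x97the first day."

step_a = re.sub(r'[\x93\x90\x92\x94\x97]+', ' ', sample)  # special characters -> space
step_b = re.sub(r'\s+', ' ', step_a)                      # collapse whitespace runs

print(step_b)
```

Replacing with a space rather than an empty string keeps words that were separated only by a dash from being glued together. The second pass then removes the extra spaces this creates.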

Step 2: Using Lower Case

In this step, I make everything lower case.

merged_df = merged_df.copy()
merged_df['text_lc'] = merged_df['text'].str.lower()

Step 3: Removing Punctuation

I then remove punctuation.

all_punctuation = string.punctuation

#Creating a function that removes punctuation
def remove_punct(text):
    return text.translate(str.maketrans('', '', all_punctuation))

#Applying the function
merged_df = merged_df.copy()
merged_df['text_np'] = merged_df['text_lc'].apply(lambda x: remove_punct(x))
merged_df = merged_df[["book_names", "text", "text_np"]]

#df_2 = merged_df.head(2)
#HTML(df_2.to_html(index=False))
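A quick check of what this step does and does not remove may help when reading the cleaned text. The sketch below assumes only the standard library; the helper name `strip_ascii_punct` is hypothetical and mirrors remove_punct.

```python
import string

def strip_ascii_punct(text):
    # string.punctuation contains ASCII punctuation only
    return text.translate(str.maketrans('', '', string.punctuation))

hyphenated = strip_ascii_punct("seed-bearing plants: fruit.")
curly = strip_ascii_punct("\u201clight\u201d")  # curly double quotes

print(hyphenated)
print(curly)
```

Hyphens are deleted rather than replaced by a space, which is why compounds such as “seed-bearing” become a single token. Non-ASCII punctuation such as curly quotes is left untouched by this step.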

Step 4: Removing Stop words

In the next step, I remove the stop words, such as “the”, “and”, and “so”. These are words that are used frequently but add limited meaning.

nltk.download('stopwords')
True

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/bgpopescu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
stops = set(stopwords.words('english'))

# Define the function to remove stopwords, reusing the `stops` set built above
# so that the stop word list is not rebuilt for every single word
def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in stops])

#Using the function for the stop words
merged_df = merged_df.copy()
merged_df['text_nostop'] = merged_df['text_np'].apply(remove_stopwords)
#df_2 = merged_df.head(2)
#HTML(df_2.to_html(index=False))

Step 6: Removing Special Characters

#Function to remove special characters, i.e. anything that is not a letter, a digit, or whitespace
def remove_special_ch(text):
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Delete every character that is not alphanumeric or whitespace
    return text

#Applying it to the text
merged_df = merged_df.copy()
merged_df['clean_text'] = merged_df['text_nostop'].apply(remove_special_ch)
merged_df= merged_df[["book_names", "text", "clean_text"]]
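The character class used in this step keeps only ASCII letters, digits, and whitespace. The following sketch illustrates the effect on two made-up strings; the helper name `keep_alnum_and_space` is hypothetical and mirrors remove_special_ch.

```python
import re

def keep_alnum_and_space(text):
    # Delete every character that is not an ASCII letter, a digit, or whitespace
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

quotes_removed = keep_alnum_and_space("god said \u201clet light\u201d")
accent_removed = keep_alnum_and_space("caf\u00e9 7 days")

print(quotes_removed)
print(accent_removed)
```

Any remaining curly quotes are dropped at this point. The same rule also deletes accented letters, which is harmless for this English text but worth keeping in mind for other corpora.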

Let us now examine the clean text, which has no special characters, no stop words, no punctuation, is in lower case, and has no double spaces.

df_2 = merged_df.head(2)
HTML(df_2.to_html(index=False))
book_names text clean_text
Genesis 1 In the beginning God created the heavens and the earth. Now the earth was formless and void, and darkness was over the surface of the deep. And the Spirit of God was hovering over the surface of the waters. And God said, Let there be light, and there was light. And God saw that the light was good, and He separated the light from the darkness. God called the light day, and the darkness He called night. And there was evening, and there was morning the first day. And God said, Let there be an expanse between the waters, to separate the waters from the waters. So God made the expanse and separated the waters beneath it from the waters above. And it was so. God called the expanse sky. And there was evening, and there was morning the second day. And God said, Let the waters under the sky be gathered into one place, so that the dry land may appear. And it was so. God called the dry land earth, and the gathering of waters He called seas. And God saw that it was good. Then God said, Let the earth bring forth vegetation: seed-bearing plants and fruit trees, each bearing fruit with seed according to its kind. And it was so. The earth produced vegetation: seed-bearing plants according to their kinds and trees bearing fruit with seed according to their kinds. And God saw that it was good. And there was evening, and there was morning the third day. And God said, Let there be lights in the expanse of the sky to distinguish between the day and the night, and let them be signs to mark the seasons and days and years. And let them serve as lights in the expanse of the sky to shine upon the earth. And it was so. God made two great lights: the greater light to rule the day and the lesser light to rule the night. And He made the stars as well. God set these lights in the expanse of the sky to shine upon the earth, to preside over the day and the night, and to separate the light from the darkness. And God saw that it was good. 
And there was evening, and there was morning the fourth day. And God said, Let the waters teem with living creatures, and let birds fly above the earth in the open expanse of the sky. So God created the great sea creatures and every living thing that moves, with which the waters teemed according to their kinds, and every bird of flight after its kind. And God saw that it was good. Then God blessed them and said, Be fruitful and multiply and fill the waters of the seas, and let birds multiply on the earth. And there was evening, and there was morning the fifth day. And God said, Let the earth bring forth living creatures according to their kinds: livestock, land crawlers, and beasts of the earth according to their kinds. And it was so. God made the beasts of the earth according to their kinds, the livestock according to their kinds, and everything that crawls upon the earth according to its kind. And God saw that it was good. Then God said, Let Us make man in Our image, after Our likeness, to rule over the fish of the sea and the birds of the air, over the livestock, and over all the earth itself and every creature that crawls upon it. So God created man in His own image; in the image of God He created him; male and female He created them. God blessed them and said to them, Be fruitful and multiply, and fill the earth and subdue it; rule over the fish of the sea and the birds of the air and every creature that crawls upon the earth. Then God said, Behold, I have given you every seed-bearing plant on the face of all the earth, and every tree whose fruit contains seed. They will be yours for food. And to every beast of the earth and every bird of the air and every creature that crawls upon the earth everything that has the breath of life in it I have given every green plant for food. And it was so. And God looked upon all that He had made, and indeed, it was very good. And there was evening, and there was morning the sixth day. 
beginning god created heavens earth earth formless void darkness surface deep spirit god hovering surface waters god said let light light god saw light good separated light darkness god called light day darkness called night evening morning first day god said let expanse waters separate waters waters god made expanse separated waters beneath waters god called expanse sky evening morning second day god said let waters sky gathered one place dry land may appear god called dry land earth gathering waters called seas god saw good god said let earth bring forth vegetation seedbearing plants fruit trees bearing fruit seed according kind earth produced vegetation seedbearing plants according kinds trees bearing fruit seed according kinds god saw good evening morning third day god said let lights expanse sky distinguish day night let signs mark seasons days years let serve lights expanse sky shine upon earth god made two great lights greater light rule day lesser light rule night made stars well god set lights expanse sky shine upon earth preside day night separate light darkness god saw good evening morning fourth day god said let waters teem living creatures let birds fly earth open expanse sky god created great sea creatures every living thing moves waters teemed according kinds every bird flight kind god saw good god blessed said fruitful multiply fill waters seas let birds multiply earth evening morning fifth day god said let earth bring forth living creatures according kinds livestock land crawlers beasts earth according kinds god made beasts earth according kinds livestock according kinds everything crawls upon earth according kind god saw good god said let us make man image likeness rule fish sea birds air livestock earth every creature crawls upon god created man image image god created male female created god blessed said fruitful multiply fill earth subdue rule fish sea birds air every creature crawls upon earth god said behold given every seedbearing plant face 
earth every tree whose fruit contains seed food every beast earth every bird air every creature crawls upon earth everything breath life given every green plant food god looked upon made indeed good evening morning sixth day
Genesis 2 Thus the heavens and the earth were completed in all their vast array. And by the seventh day God had finished the work He had been doing; so on that day He rested from all His work. Then God blessed the seventh day and sanctified it, because on that day He rested from all the work of creation that He had accomplished. This is the account of the heavens and the earth when they were created, in the day that the LORD God made them. Now no shrub of the field had yet appeared on the earth, nor had any plant of the field sprouted; for the LORD God had not yet sent rain upon the earth, and there was no man to cultivate the ground. But springs welled up from the earth and watered the whole surface of the ground. Then the LORD God formed man from the dust of the ground and breathed the breath of life into his nostrils, and the man became a living being. And the LORD God planted a garden in Eden, in the east, where He placed the man He had formed. Out of the ground the LORD God gave growth to every tree that is pleasing to the eye and good for food. And in the middle of the garden were the tree of life and the tree of the knowledge of good and evil. Now a river flowed out of Eden to water the garden, and from there it branched into four headwaters: The name of the first river is Pishon; it winds through the whole land of Havilah, where there is gold. And the gold of that land is pure, and bdellium and onyx are found there. The name of the second river is Gihon; it winds through the whole land of Cush. The name of the third river is Hiddekel; it runs along the east side of Assyria. And the fourth river is the Euphrates. Then the LORD God took the man and placed him in the Garden of Eden to cultivate and keep it. And the LORD God commanded him, You may eat freely from every tree of the garden, but you must not eat from the tree of the knowledge of good and evil; for in the day that you eat of it, you will surely die. 
The LORD God also said, It is not good for the man to be alone. I will make for him a suitable helper. And out of the ground the LORD God formed every beast of the field and every bird of the air, and He brought them to the man to see what he would name each one. And whatever the man called each living creature, that was its name. The man gave names to all the livestock, to the birds of the air, and to every beast of the field. But for Adam no suitable helper was found. So the LORD God caused the man to fall into a deep sleep, and while he slept, He took one of the man s ribs and closed up the area with flesh. And from the rib that the LORD God had taken from the man, He made a woman and brought her to him. And the man said: This is now bone of my bones and flesh of my flesh; she shall be called ‘woman, for out of man she was taken. For this reason a man will leave his father and mother and be united to his wife, and they will become one flesh. And the man and his wife were both naked, and they were not ashamed. 
thus heavens earth completed vast array seventh day god finished work day rested work god blessed seventh day sanctified day rested work creation accomplished account heavens earth created day lord god made shrub field yet appeared earth plant field sprouted lord god yet sent rain upon earth man cultivate ground springs welled earth watered whole surface ground lord god formed man dust ground breathed breath life nostrils man became living lord god planted garden eden east placed man formed ground lord god gave growth every tree pleasing eye good food middle garden tree life tree knowledge good evil river flowed eden water garden branched four headwaters name first river pishon winds whole land havilah gold gold land pure bdellium onyx found name second river gihon winds whole land cush name third river hiddekel runs along east side assyria fourth river euphrates lord god took man placed garden eden cultivate keep lord god commanded may eat freely every tree garden must eat tree knowledge good evil day eat surely die lord god also said good man alone make suitable helper ground lord god formed every beast field every bird air brought man see would name one whatever man called living creature name man gave names livestock birds air every beast field adam suitable helper found lord god caused man fall deep sleep slept took one man ribs closed area flesh rib lord god taken man made woman brought man said bone bones flesh flesh shall called woman man taken reason man leave father mother united wife become one flesh man wife naked ashamed

Step 7: Stemming

Stemming is a crude method: it chops off the end of a word, and the result is not necessarily a real word. For example, “bosses” becomes “boss”, while “replacement” becomes “replac” (not a real word). Like lemmatization, stemming reduces vocabulary size and vector dimensionality, which speeds up processing.

There are multiple stemming algorithms. One of the most popular is the Porter stemmer, available in NLTK. For example:

ps = PorterStemmer()
#Stemming the word "walking"
ps.stem("walking")
'walk'

I first define a function that stems the words.

#Function to stem the words
def stem_words(text):
    ps = PorterStemmer()
    return " ".join([ps.stem(word) for word in text.split()])

#We now apply this to our text
merged_df=merged_df.copy()
merged_df['stemmed_text'] = merged_df["clean_text"].apply(lambda x: stem_words(x))
merged_df= merged_df[["book_names", "text", "clean_text", "stemmed_text"]]

Step 8: Lemmatization & POS Tagging

Unlike stemming, lemmatization is more sophisticated: it uses a vocabulary and the morphological rules of the language to return the dictionary form (lemma) of a word. As with stemming, it reduces vocabulary size and vector dimensionality, which speeds up processing.

For example, when the part of speech is supplied, lemmatizing “better” as an adjective yields “good”, and lemmatizing “was” as a verb yields “be”.

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
True

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/bgpopescu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
True

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/bgpopescu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!

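Next, I map the first letter of each Penn Treebank POS tag returned by pos_tag to the corresponding WordNet part-of-speech constant.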

wordnet_map = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}
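The lookup on the first letter of the tag can be illustrated without loading NLTK. In this sketch the dictionary values are plain strings that mirror the WordNet constants ("n", "v", "a", "r"); the tagged words are written by hand for illustration and are not output from pos_tag, and the names `pos_map` and `to_wordnet_pos` are hypothetical.

```python
# Plain-string stand-ins: in NLTK, wordnet.NOUN, wordnet.VERB, wordnet.ADJ and
# wordnet.ADV are the strings "n", "v", "a" and "r".
pos_map = {"N": "n", "V": "v", "J": "a", "R": "r"}

def to_wordnet_pos(treebank_tag, default="n"):
    # Only the first letter of the Penn Treebank tag is needed for the lookup
    return pos_map.get(treebank_tag[0], default)

# Hand-written (word, tag) pairs for illustration
tagged = [("waters", "NNS"), ("created", "VBD"), ("greater", "JJR"),
          ("freely", "RB"), ("upon", "IN")]

for word, tag in tagged:
    print(word, tag, to_wordnet_pos(tag))
```

Tags outside the four mapped families, such as the preposition tag "IN", fall back to the noun default, which matches the fallback passed to the lemmatizer in the next function.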

I then define a function that lemmatizes the text.

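#Creating the lemmatizer once so that it can be reused for every word
lemmatizer = WordNetLemmatizer()

#Defining the function
def lemmatize_words(text):
    # Find the POS tag of each word
    pos_text = pos_tag(text.split())
    # Map each tag to a WordNet POS (defaulting to noun) and lemmatize the word
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_text])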

#Applying that function to the text.
merged_df = merged_df.copy()
merged_df['lemmatized'] = merged_df["clean_text"].apply(lambda x: lemmatize_words(x))
merged_df= merged_df[["book_names", "text", "clean_text", "stemmed_text", "lemmatized"]]
merged_df2 =  merged_df[["book_names", "lemmatized"]]

#Investigating the result
df_2 = merged_df2.head(2)
HTML(df_2.to_html(index=False))
book_names lemmatized
Genesis 1 begin god create heaven earth earth formless void darkness surface deep spirit god hover surface water god say let light light god saw light good separate light darkness god call light day darkness call night even morning first day god say let expanse water separate water water god make expanse separated water beneath water god call expanse sky even morning second day god say let water sky gather one place dry land may appear god call dry land earth gather water call sea god saw good god say let earth bring forth vegetation seedbearing plant fruit tree bear fruit seed accord kind earth produce vegetation seedbearing plant accord kind tree bear fruit seed accord kind god saw good evening morning third day god say let light expanse sky distinguish day night let sign mark season day year let serve light expanse sky shine upon earth god make two great light great light rule day lesser light rule night make star well god set light expanse sky shine upon earth preside day night separate light darkness god saw good evening morning fourth day god say let water teem live creature let bird fly earth open expanse sky god create great sea creature every live thing move water teem accord kind every bird flight kind god saw good god bless say fruitful multiply fill water seas let bird multiply earth even morning fifth day god say let earth bring forth living creature accord kind livestock land crawler beasts earth accord kind god make beast earth accord kind livestock accord kind everything crawl upon earth accord kind god saw good god say let u make man image likeness rule fish sea bird air livestock earth every creature crawl upon god create man image image god create male female create god bless say fruitful multiply fill earth subdue rule fish sea bird air every creature crawl upon earth god say behold give every seedbearing plant face earth every tree whose fruit contain seed food every beast earth every bird air every creature crawl upon earth everything breath 
life give every green plant food god look upon make indeed good even morning sixth day
Genesis 2 thus heaven earth complete vast array seventh day god finished work day rest work god bless seventh day sanctify day rest work creation accomplish account heaven earth create day lord god make shrub field yet appear earth plant field sprout lord god yet send rain upon earth man cultivate ground spring well earth watered whole surface ground lord god form man dust ground breathe breath life nostril man become living lord god plant garden eden east place man form ground lord god give growth every tree please eye good food middle garden tree life tree knowledge good evil river flow eden water garden branch four headwater name first river pishon wind whole land havilah gold gold land pure bdellium onyx find name second river gihon wind whole land cush name third river hiddekel run along east side assyria fourth river euphrates lord god take man place garden eden cultivate keep lord god command may eat freely every tree garden must eat tree knowledge good evil day eat surely die lord god also say good man alone make suitable helper ground lord god form every beast field every bird air bring man see would name one whatever man call living creature name man give name livestock bird air every beast field adam suitable helper find lord god cause man fall deep sleep sleep take one man rib close area flesh rib lord god take man make woman bring man say bone bone flesh flesh shall call woman man take reason man leave father mother united wife become one flesh man wife naked ashamed

As explained, the stemmed text differs from the lemmatized text.

2.3 Original Text vs. Clean Text

Text cleaning can substantially shorten a text. The reduction comes from removing stop words, punctuation, and other non-alphabetical characters; lemmatization (like stemming) maps each word to a base form but leaves the number of tokens unchanged. The result is a shorter text that concentrates on content-bearing words and is less cluttered with common function words.
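
To see where the reduction comes from, here is a minimal sketch on a single sentence. The stop-word set and the lemma lookup are small hand-written stand-ins (assumptions for illustration, not NLTK's actual resources), so the counts will differ from those produced by the pipeline used in this report.

```python
import re

# Hypothetical, hand-picked stop words (a stand-in for NLTK's English list)
STOP_WORDS = {"in", "the", "and", "a", "of", "was", "were"}

# Hypothetical lemma lookup (a stand-in for a real lemmatizer)
LEMMAS = {"created": "create", "heavens": "heaven"}


def clean(text):
    """Lowercase, drop non-alphabetical characters, remove stop words, lemmatize."""
    tokens = re.sub(r"[^a-z\s]", "", text.lower()).split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in kept]


verse = "In the beginning God created the heavens and the earth."
raw_tokens = verse.split()
clean_tokens = clean(verse)

print(len(raw_tokens), raw_tokens)
print(len(clean_tokens), clean_tokens)
```

In this toy case the stop-word filter removes half of the tokens, while the lemma lookup only changes word forms and leaves the count as it is.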

We can now count the number of tokens within each type of text:

from IPython.display import HTML  # needed for HTML(); harmless if already imported

# Count whitespace-separated tokens in the original and the lemmatized text
merged_df['word_count_full'] = merged_df['text'].apply(lambda x: len(x.split()))
merged_df['word_count_lem'] = merged_df['lemmatized'].apply(lambda x: len(x.split()))
merged_df2 = merged_df[["book_names", "word_count_full", "word_count_lem"]]

# Inspect the first ten chapters
df_10 = merged_df2.head(10)
HTML(df_10.to_html(index=False))
book_names word_count_full word_count_lem
Genesis 1 747 359
Genesis 2 608 262
Genesis 3 664 264
Genesis 4 622 275
Genesis 5 497 244
Genesis 6 507 228
Genesis 7 496 251
Genesis 8 515 246
Genesis 9 609 272
Genesis 10 432 227
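
As a rough sketch of how this reduction could be quantified, the snippet below computes the share of tokens retained after cleaning for the ten chapters shown above. The counts are copied from the table; the variable names are illustrative and not part of the original pipeline.

```python
# (chapter, word_count_full, word_count_lem), copied from the table above
counts = [
    ("Genesis 1", 747, 359),
    ("Genesis 2", 608, 262),
    ("Genesis 3", 664, 264),
    ("Genesis 4", 622, 275),
    ("Genesis 5", 497, 244),
    ("Genesis 6", 507, 228),
    ("Genesis 7", 496, 251),
    ("Genesis 8", 515, 246),
    ("Genesis 9", 609, 272),
    ("Genesis 10", 432, 227),
]

# Share of tokens that survive cleaning, per chapter
retained = {name: lem / full for name, full, lem in counts}

for name, share in retained.items():
    print(f"{name}: {share:.1%} of tokens retained")

# Overall share across these ten chapters
total_full = sum(full for _, full, _ in counts)
total_lem = sum(lem for _, _, lem in counts)
print(f"Overall: {total_lem / total_full:.1%} of tokens retained")
```

For these ten chapters, cleaning keeps between about 40% and 53% of the original tokens.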

I now switch to R and use ggplot2 to compare the distribution of word counts per chapter in the original text with that in the lemmatized text.

library(reticulate)
library(ggplot2)
library(ggpubr)

# Pull the pandas data frame from the Python session into R
df <- reticulate::py$merged_df

# Histogram of word counts per chapter in the original text
unclean <- ggplot(df, aes(x = word_count_full)) +
  geom_histogram(bins = 30, color = "white", alpha = 0.5) +
  # Overlay the density, multiplied by a manually defined value for visibility
  geom_density(aes(y = after_stat(density) * 80000), fill = NA) +
  # Secondary axis that undoes the scaling to show the density
  scale_y_continuous(
    name = "No. Book Chapters (e.g. Genesis 1, Genesis 2, etc.)",
    sec.axis = sec_axis(~ .x / 80000, name = "density")
  ) +
  xlab("No. Words") +
  theme_bw() +
  ggtitle("Original")

# Same plot for the lemmatized text, with a smaller scaling factor
clean <- ggplot(df, aes(x = word_count_lem)) +
  geom_histogram(bins = 30, color = "white", alpha = 0.5) +
  # Overlay the density, multiplied by a manually defined value for visibility
  geom_density(aes(y = after_stat(density) * 40000), fill = NA) +
  # Secondary axis that undoes the scaling to show the density
  scale_y_continuous(
    name = "No. Book Chapters (e.g. Genesis 1, Genesis 2, etc.)",
    sec.axis = sec_axis(~ .x / 40000, name = "density")
  ) +
  xlab("No. Words") +
  theme_bw() +
  ggtitle("Lemmatized")

# Place the two panels side by side
ggarrange(unclean, clean, ncol = 2)