Lecture 15: NLP Pipelines

Bogdan G. Popescu

John Cabot University

Introduction

We finally get into text analysis. In this lecture, we cover:

  • Preprocessing Data (lowercase, remove punctuation, remove numbers)
  • Stopwords
  • Tokenization
  • Lemmatization
  • Stemming

Text

Let us imagine that we have the following text:

#Step1: Load library
import pandas as pd

#Step2: text
text = [
    "Because I could not stop for Death -",
    "He kindly stopped for me -",
    "The Carriage held but just Ourselves -",
    "and Immortality"
]

Text

This is how we turn it into a DataFrame:

text_df = pd.DataFrame({'text': text})
text_df
                                     text
0    Because I could not stop for Death -
1              He kindly stopped for me -
2  The Carriage held but just Ourselves -
3                         and Immortality

Pre-processing Text

As a first step, we want to count the number of tokens in our text.

Tokens are meaningful units of text, most often single words, that we are interested in using for further analysis.

Before we do that, we need to follow a few steps:

  • turning words to lowercase
  • removing punctuation
  • removing numbers
  • tokenization
  • lemmatization
  • stemming

Pre-processing Text:

Step 1: Turning words to lowercase

This is how we make everything lower case:

text_df["text2"] = text_df['text'].str.lower()
text_df
                                     text                                   text2
0    Because I could not stop for Death -    because i could not stop for death -
1              He kindly stopped for me -              he kindly stopped for me -
2  The Carriage held but just Ourselves -  the carriage held but just ourselves -
3                         and Immortality                         and immortality

Pre-processing Text:

Step 2: Remove punctuation

This is how we remove punctuation:

import string
text_df['text2'] = text_df['text2'].str.translate(str.maketrans('', '', string.punctuation))
text_df
                                     text                                  text2
0    Because I could not stop for Death -    because i could not stop for death 
1              He kindly stopped for me -              he kindly stopped for me 
2  The Carriage held but just Ourselves -  the carriage held but just ourselves 
3                         and Immortality                        and immortality

This approach is typically faster than regular expressions such as re.sub because translate applies a precomputed per-character translation table at the C level.
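
As a rough, illustrative check (exact timings vary by machine and string length, so treat this as a sketch):

Python
import re
import string
import timeit

sample = "Because I could not stop for Death - " * 200
table = str.maketrans('', '', string.punctuation)
punct_re = re.compile(f"[{re.escape(string.punctuation)}]")

# Time both ways of stripping punctuation from the same string
t_translate = timeit.timeit(lambda: sample.translate(table), number=2000)
t_resub = timeit.timeit(lambda: punct_re.sub('', sample), number=2000)
print(f"translate: {t_translate:.3f}s   re.sub: {t_resub:.3f}s")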

Pre-processing Text:

Step 2: Remove punctuation

import string
text_df['text2'] = text_df['text2'].str.translate(str.maketrans('', '', string.punctuation))

This is what each piece of that line means:

  • str.translate():
    • str allows you to apply string functions to each element of the text2 column.
    • .translate() is a string method that modifies a string according to a translation table.

Pre-processing Text:

Step 2: Remove punctuation

import string
text_df['text2'] = text_df['text2'].str.translate(str.maketrans('', '', string.punctuation))
  • str.maketrans('', '', string.punctuation)
    • str.maketrans('', '', string.punctuation) creates a translation table that tells Python how to modify characters in the string.
    • The first two empty strings '' indicate that no characters should be mapped or replaced
    • The third argument, string.punctuation, is a string containing all ASCII punctuation characters (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~).
    • The translation table maps each punctuation character to None, which means that any punctuation character in the string will be removed (see the sketch below).
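
To make the translation table concrete, here is a minimal sketch applying it to a plain string, outside of pandas:

Python
import string
table = str.maketrans('', '', string.punctuation)
print("He kindly stopped, for me -".translate(table))
# -> 'He kindly stopped for me ' (comma and dash removed; spaces kept)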

Pre-processing Text:

Step 3: Remove numbers

#remove numbers
text_df['text2'] = text_df['text2'].str.replace(r'[0-9]+', " ", regex=True)
text_df
                                     text                                  text2
0    Because I could not stop for Death -    because i could not stop for death 
1              He kindly stopped for me -              he kindly stopped for me 
2  The Carriage held but just Ourselves -  the carriage held but just ourselves 
3                         and Immortality                        and immortality
  • r'[0-9]+' is a regular expression (regex) pattern that matches any sequence of one or more digits (0–9).

  • The + ensures that the pattern matches a whole run of one or more digits at once, rather than replacing each digit individually (see the sketch after this list).

  • regex=True specifies that the search pattern is a regular expression. Note that in pandas 2.0 and later, str.replace defaults to regex=False, so the flag must be set explicitly.
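
A minimal sketch of what the + quantifier changes, using Python's re directly on a plain string:

Python
import re
s = "room 101, floor 2"
print(re.sub(r'[0-9]+', " ", s))  # 'room  , floor  '  (each run of digits -> one space)
print(re.sub(r'[0-9]', " ", s))   # 'room    , floor  ' (each digit -> its own space)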

Pre-processing Text:

Step 4: Tokenize

This is how we finally tokenize the data:

import nltk
from nltk.tokenize import word_tokenize
# Tokenize the text (word_tokenize needs the 'punkt' models: nltk.download('punkt'))
text_df['text2'] = text_df['text2'].apply(lambda tokens: ' '.join(word_tokenize(tokens)))
text_df
                                     text                                 text2
0    Because I could not stop for Death -    because i could not stop for death
1              He kindly stopped for me -              he kindly stopped for me
2  The Carriage held but just Ourselves -  the carriage held but just ourselves
3                         and Immortality                       and immortality
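
The column looks almost unchanged because the punctuation is already gone; the useful part is the intermediate token list that word_tokenize returns before we join it back together. A minimal sketch:

Python
from nltk.tokenize import word_tokenize
# nltk.download('punkt')  # needed once per environment
print(word_tokenize("he kindly stopped for me"))
# ['he', 'kindly', 'stopped', 'for', 'me']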

Pre-processing Text:

Step 5: Lemmatization

import nltk
from nltk.tokenize import word_tokenize
# The lemmatizer needs the WordNet data: nltk.download('wordnet')
lemmatizer = nltk.stem.WordNetLemmatizer()

# Define function
def lemmatize_text(text):
    return [lemmatizer.lemmatize(word) for word in word_tokenize(text)]

# Apply function and join tokens with a space
text_df['text2'] = text_df['text2'].apply(lambda tokens: ' '.join(lemmatize_text(tokens)))
text_df
                                     text                                 text2
0    Because I could not stop for Death -    because i could not stop for death
1              He kindly stopped for me -              he kindly stopped for me
2  The Carriage held but just Ourselves -  the carriage held but just ourselves
3                         and Immortality                       and immortality
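
Note that text2 is unchanged here: lemmatize() treats every word as a noun by default, so an inflected verb like "stopped" is only reduced if we pass a part-of-speech tag. A small sketch (assuming the WordNet data is downloaded):

Python
import nltk
lemmatizer = nltk.stem.WordNetLemmatizer()
print(lemmatizer.lemmatize("carriages"))         # carriage (default POS is noun)
print(lemmatizer.lemmatize("stopped"))           # stopped  (treated as a noun)
print(lemmatizer.lemmatize("stopped", pos="v"))  # stop     (lemmatized as a verb)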

Pre-processing Text:

Step 6: Stemming

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Initialize Lemmatizer and Stemmer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Define function to stem tokens
def stem_text(text):
    return [stemmer.stem(word) for word in word_tokenize(text)]  # Tokenize text before stemming

# Apply function and join tokens with a space
text_df['text2'] = text_df['text2'].apply(lambda tokens: ' '.join(stem_text(tokens)))
text_df
                                     text                              text2
0    Because I could not stop for Death -  becaus i could not stop for death
1              He kindly stopped for me -              he kindli stop for me
2  The Carriage held but just Ourselves -  the carriag held but just ourselv
3                         and Immortality                         and immort
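
Unlike the lemmatizer, the Porter stemmer chops suffixes by rule, which is why the output contains non-words such as "becaus" and "carriag". Looking at individual words makes this clear:

Python
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
for word in ["because", "kindly", "carriage", "immortality"]:
    print(word, "->", stemmer.stem(word))
# because -> becaus, kindly -> kindli, carriage -> carriag, immortality -> immort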

Pre-processing Text:

Step 7: Listing words line by line

This is how we transform the text so that every word appears on its own line:

# Unnest the tokens into individual rows
text_df['text2'] = text_df['text2'].apply(lambda text: text.split())
# Then use explode() to unnest each word into a separate row
text_df2 = text_df.explode('text2').reset_index(drop=True)
text_df2
                                      text    text2
0     Because I could not stop for Death -   becaus
1     Because I could not stop for Death -        i
2     Because I could not stop for Death -    could
3     Because I could not stop for Death -      not
4     Because I could not stop for Death -     stop
5     Because I could not stop for Death -      for
6     Because I could not stop for Death -    death
7               He kindly stopped for me -       he
8               He kindly stopped for me -   kindli
9               He kindly stopped for me -     stop
10              He kindly stopped for me -      for
11              He kindly stopped for me -       me
12  The Carriage held but just Ourselves -      the
13  The Carriage held but just Ourselves -  carriag
14  The Carriage held but just Ourselves -     held
15  The Carriage held but just Ourselves -      but
16  The Carriage held but just Ourselves -     just
17  The Carriage held but just Ourselves -  ourselv
18                         and Immortality      and
19                         and Immortality   immort
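
The split/explode pair is what turns one line per row into one word per row (the analogue of tidytext's unnest_tokens in R). A minimal standalone sketch:

Python
import pandas as pd
demo = pd.DataFrame({'line': ['and Immortality']})
demo['word'] = demo['line'].str.split()  # each row becomes a list of words
print(demo.explode('word'))              # each list element gets its own row,
                                         # keeping the original index (0, 0)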

Examining the Works of Jane Austen

Let us now use Jane Austen's works to see how to apply these Python methods for text analysis.

R
library(janeaustenr)
library(reticulate)
original_books = as.data.frame(austen_books())
df = r_to_py(original_books)

Examining the Works of Jane Austen

We continue with Austen's works to see how to apply the Python methods.

Python
#Step 0: Renaming the df 
df = r.df
#Step 1: Group the DataFrame by the book Column
grouped = df.groupby('book')
#Step 2: Count the Row Numbers within Each Group
df['linenumber'] = grouped.cumcount()
#Step 3: Identify if each row is a chapter
df['is_chapter'] = df['text'].str.contains(r'^chapter [\divxlc]', case=False, regex=True)
#Step 4: Group by book and create a cumulative chapter count
df['chapter'] = df.groupby('book')['is_chapter'].cumsum()
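
To see how Steps 3 and 4 assign chapter numbers, here is a toy example on hypothetical lines (not from the corpus): the regex flags chapter headings written in digits or Roman numerals, and the cumulative sum carries the current chapter number forward over the lines that follow.

Python
import pandas as pd
demo = pd.DataFrame({
    'book': ['A', 'A', 'A', 'A'],
    'text': ['CHAPTER 1', 'some text', 'Chapter II', 'more text'],
})
demo['is_chapter'] = demo['text'].str.contains(r'^chapter [\divxlc]', case=False, regex=True)
demo['chapter'] = demo.groupby('book')['is_chapter'].cumsum()
print(demo['chapter'].tolist())  # [1, 1, 2, 2]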

Examining the Works of Jane Austen

We continue with Austen's works to see how to apply the Python methods.

Python
#Step 5: Clean up
df2 = df.drop(columns=['is_chapter'])
df2
                        text                 book  linenumber  chapter
0      SENSE AND SENSIBILITY  Sense & Sensibility           0        0
1                             Sense & Sensibility           1        0
2             by Jane Austen  Sense & Sensibility           2        0
3                             Sense & Sensibility           3        0
4                     (1811)  Sense & Sensibility           4        0
...                      ...                  ...         ...      ...
73417   national importance.           Persuasion        8323       24
73418                                  Persuasion        8324       24
73419                                  Persuasion        8325       24
73420                                  Persuasion        8326       24
73421                  Finis           Persuasion        8327       24

[73422 rows x 4 columns]

Works of Jane Austen: NLP Analysis

We now need to transform everything into a one-token-per-row format.

This is how we would do it:

Python
# Step1: Ensure the 'text' column contains only strings
df2['text'] = df2['text'].astype(str)

### Step2: Turning words to Lowercase
df2["text2"] = df2['text'].str.lower()

### Step3: Remove punctuation
import string
df2['text2'] = df2['text2'].str.translate(str.maketrans('', '', string.punctuation))

### Step4: Remove numbers
df2['text2'] = df2['text2'].str.replace(r'[0-9]+', " ", regex=True)

### Step5: Tokenize
from nltk.tokenize import word_tokenize
# Tokenize the text
df2['text2'] = df2['text2'].apply(lambda tokens: ' '.join(word_tokenize(tokens)))

### Step6: Lemmatization
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Define function
def lemmatize_text(text):
    return [lemmatizer.lemmatize(word) for word in word_tokenize(text)]

# Apply function and join tokens with a space
df2['text2'] = df2['text2'].apply(lambda tokens: ' '.join(lemmatize_text(tokens)))

### Step7: Stemming
from nltk.stem import PorterStemmer

# Initialize Stemmer
stemmer = PorterStemmer()

# Define function to stem tokens
def stem_text(text):
    return [stemmer.stem(word) for word in word_tokenize(text)]  # Tokenize text before stemming

# Apply function and join tokens with a space
df2['text2'] = df2['text2'].apply(lambda tokens: ' '.join(stem_text(tokens)))

### Step8: Listing words line by line
# Unnest the tokens into individual rows
df2['text2'] = df2['text2'].apply(lambda text: text.split())
# Then use explode() to unnest each word into a separate row
df2 = df2.explode('text2').reset_index(drop=True)

### Step9: Remove rows with NaN values in the 'text2' column
df2 = df2.dropna(subset=['text2'])

# Final DataFrame
df2
                                                     text  ...    text2
0                                   SENSE AND SENSIBILITY  ...     sens
1                                   SENSE AND SENSIBILITY  ...      and
2                                   SENSE AND SENSIBILITY  ...  sensibl
4                                          by Jane Austen  ...       by
5                                          by Jane Austen  ...     jane
...                                                   ...  ...      ...
729021  possible, more distinguished in its domestic v...  ...       in
729022  possible, more distinguished in its domestic v...  ...       it
729023                               national importance.  ...   nation
729024                               national importance.  ...   import
729028                                              Finis  ...     fini

[717876 rows x 5 columns]

Works of Jane Austen: Tokenization

This is what the final result looks like:

Python
df2
                                                     text  ...    text2
0                                   SENSE AND SENSIBILITY  ...     sens
1                                   SENSE AND SENSIBILITY  ...      and
2                                   SENSE AND SENSIBILITY  ...  sensibl
4                                          by Jane Austen  ...       by
5                                          by Jane Austen  ...     jane
...                                                   ...  ...      ...
729021  possible, more distinguished in its domestic v...  ...       in
729022  possible, more distinguished in its domestic v...  ...       it
729023                               national importance.  ...   nation
729024                               national importance.  ...   import
729028                                              Finis  ...     fini

[717876 rows x 5 columns]

Works of Jane Austen:

Step 1: Removing stop words

The next step is to remove stop words

Stop words are words that are not useful for analysis: “the”, “of”, “to”, etc.

Python
import nltk
from nltk.corpus import stopwords
#nltk.download('stopwords')
stops = set(stopwords.words('english'))
#Using the function for the stop words
df2 = df2.copy()
df2['text_no_stop'] = df2['text2'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stops)]))
df2.head(5)
                    text                 book  ...    text2  text_no_stop
0  SENSE AND SENSIBILITY  Sense & Sensibility  ...     sens          sens
1  SENSE AND SENSIBILITY  Sense & Sensibility  ...      and              
2  SENSE AND SENSIBILITY  Sense & Sensibility  ...  sensibl       sensibl
4         by Jane Austen  Sense & Sensibility  ...       by              
5         by Jane Austen  Sense & Sensibility  ...     jane          jane

[5 rows x 6 columns]

Works of Jane Austen:

Step 1: Removing stop words

Python
import nltk
from nltk.corpus import stopwords
#nltk.download('stopwords')
stops = set(stopwords.words('english'))
#Using the function for the stop words
df2 = df2.copy()
df2['text_no_stop'] = df2['text2'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stops)]))
df2.head(5)

There are three steps in the lambda function (illustrated in the sketch below):

  • Split the text - x.split() breaks each string into a list of words.
  • Filter out the stopwords - Remove words that are in the stops set.
  • Rejoin the words - Join the remaining words back into a single string.
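
Here is the same filter as a minimal sketch on a plain string (assuming the stopwords corpus is downloaded):

Python
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))
line = "the carriage held but just ourselves"
print(' '.join(word for word in line.split() if word not in stops))
# carriage held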

Works of Jane Austen:

Step 1: Removing stop words

This is what the text with no stop words looks like:

Python
df2.head(5)
                    text                 book  ...    text2  text_no_stop
0  SENSE AND SENSIBILITY  Sense & Sensibility  ...     sens          sens
1  SENSE AND SENSIBILITY  Sense & Sensibility  ...      and              
2  SENSE AND SENSIBILITY  Sense & Sensibility  ...  sensibl       sensibl
4         by Jane Austen  Sense & Sensibility  ...       by              
5         by Jane Austen  Sense & Sensibility  ...     jane          jane

[5 rows x 6 columns]

Works of Jane Austen:

Step 2: Removing NA words

We can now drop rows where text_no_stop is missing or empty.

Python
#Step 6: Remove rows with NaN values in the 'text_no_stop' column
df3 = df2.dropna(subset=['text_no_stop'])
#Step 7: Remove rows that are empty or only whitespace
df3 = df3[df3['text_no_stop'].str.strip() != '']
df3.head(5)
                     text                 book  ...    text2  text_no_stop
0   SENSE AND SENSIBILITY  Sense & Sensibility  ...     sens          sens
2   SENSE AND SENSIBILITY  Sense & Sensibility  ...  sensibl       sensibl
5          by Jane Austen  Sense & Sensibility  ...     jane          jane
6          by Jane Austen  Sense & Sensibility  ...   austen        austen
13              CHAPTER 1  Sense & Sensibility  ...  chapter       chapter

[5 rows x 6 columns]

Works of Jane Austen:

Step 2: Removing NA words

This is how we see the difference in row counts:

Python
#With NA
df2.shape[0] 
717876
Python
#Without NA
df3.shape[0] 
356749

Combining Everything into one Function

It is a good idea to create a function that performs all these operations in one step.

We first load the necessary packages:

Python
import pandas as pd
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords

# Make sure to download the necessary NLTK resources if you haven't already
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
True
True
True

Combining Everything into one Function

We then define our auxiliary functions:

Python
# Define lemmatization function
lemmatizer = WordNetLemmatizer()
def lemmatize_text(text):
    return [lemmatizer.lemmatize(word) for word in word_tokenize(text)]

# Define stemming function
stemmer = PorterStemmer()
def stem_text(text):
    return [stemmer.stem(word) for word in word_tokenize(text)]

Combining Everything into one Function

We can finally define our big function:

Python
def preprocess_text(df, text_column):
    # Step1: Ensure the specified column contains only strings
    df[text_column] = df[text_column].astype(str)

    ### Step2: Turning words to Lowercase
    df["text2"] = df[text_column].str.lower()

    ### Step3: Remove punctuation
    df['text2'] = df['text2'].str.translate(str.maketrans('', '', string.punctuation))

    ### Step4: Remove numbers
    df['text2'] = df['text2'].str.replace(r'[0-9]+', " ", regex=True)

    ### Step5: Tokenize
    df['text2'] = df['text2'].apply(lambda tokens: ' '.join(word_tokenize(tokens)))

    ### Step6: Lemmatization
    df['text2'] = df['text2'].apply(lambda tokens: ' '.join(lemmatize_text(tokens)))

    ### Step7: Stemming
    df['text2'] = df['text2'].apply(lambda tokens: ' '.join(stem_text(tokens)))

    ### Step8: Listing words line by line
    # Unnest the tokens into individual rows
    df['text2'] = df['text2'].apply(lambda text: text.split())
    # Then use explode() to unnest each word into a separate row
    df = df.explode('text2').reset_index(drop=True)

    ### Step9: Remove stop words
    stop_words = set(stopwords.words('english'))
    df['text2'] = df['text2'].apply(lambda word: word if word not in stop_words else None)

    ### Step10: Remove rows with NaN values in the 'text2' column
    df = df.dropna(subset=['text2'])

    return df
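
One caveat: preprocess_text writes its working column into the DataFrame it receives. If you want to leave the original untouched, a thin wrapper (a hypothetical helper, not part of the lecture code) can pass in a copy:

Python
def preprocess_text_safe(df, text_column):
    # Work on a copy so the caller's DataFrame is not modified in place
    return preprocess_text(df.copy(), text_column)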

Combining Everything into one Function

And this is how we now use that function:

Python
# Example usage:
df4b = preprocess_text(df, 'text')
df4b.head(5)
                     text                 book  ...  chapter    text2
0   SENSE AND SENSIBILITY  Sense & Sensibility  ...        0     sens
2   SENSE AND SENSIBILITY  Sense & Sensibility  ...        0  sensibl
5          by Jane Austen  Sense & Sensibility  ...        0     jane
6          by Jane Austen  Sense & Sensibility  ...        0   austen
13              CHAPTER 1  Sense & Sensibility  ...        1  chapter

[5 rows x 6 columns]

Works of Jane Austen:

Counting words

We can now count the most common words:

Python
# Count the occurrences of each word and sort by count in descending order
word_counts = df4b['text2'].value_counts().reset_index()

# Rename the columns to match the output of count in R
word_counts.columns = ['text2', 'n']

# Sort by count in descending order (optional if you want to ensure sorting)
word_counts = word_counts.sort_values(by='n', ascending=False)
word_counts
          text2      n
0            wa  11159
1            hi   5945
2            mr   5395
3          veri   3720
4         could   3595
...         ...    ...
8541    circumv      1
8540    readmit      1
8539   strenuou      1
8538     hammer      1
13004    wheedl      1

[13005 rows x 2 columns]

Works of Jane Austen:

Counting words

We should also make sure that no empty strings remain among the counted words. Here the row count is unchanged, which confirms there were none:

Python
word_counts2 =  word_counts[word_counts['text2'].str.strip() != '']
word_counts2
          text2      n
0            wa  11159
1            hi   5945
2            mr   5395
3          veri   3720
4         could   3595
...         ...    ...
8541    circumv      1
8540    readmit      1
8539   strenuou      1
8538     hammer      1
13004    wheedl      1

[13005 rows x 2 columns]

Works of Jane Austen:

Counting words

We can now visualize the most common words. First, we keep only the words that appear at least 1,000 times:

Python
word_counts2 = word_counts2[word_counts2['n'] >= 1000]

Works of Jane Austen:

Counting words

We can now try to visualize the most common words.

Show the code
R
word_counts2 <-  reticulate::py$word_counts2
library(forcats)
library(ggplot2)
ggplot(data = word_counts2, aes(x = n, y = fct_reorder(text2, n))) +
  geom_col()

Works of Jane Austen:

Counting words

So this graph tells us that words such as "mr", "would", "could", and "must" are prevalent in Jane Austen's work.

It could also be helpful to calculate the proportions of these words.

To do that, we go back to the original books to calculate the total word count.

R
library(janeaustenr)
library(reticulate)
original_books = as.data.frame(austen_books())
df = r_to_py(original_books)
Python
#Step 0: Renaming the df 
df = r.df
df["total_text"] = df['text'].str.split().str.len()
total = df['total_text'].sum()
total
717537
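
Here str.split().str.len() counts whitespace-separated words per line, and blank lines count as zero. A tiny check:

Python
import pandas as pd
print(pd.Series(["and Immortality", ""]).str.split().str.len())
# 0    2
# 1    0
# dtype: int64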

Calculating frequency

We can now use this total to calculate word frequencies:

Python
word_counts2["frequency"] = word_counts2["n"]/total
word_counts2
      text2      n  frequency
0        wa  11159   0.015552
1        hi   5945   0.008285
2        mr   5395   0.007519
3      veri   3720   0.005184
4     could   3595   0.005010
5     would   3233   0.004506
6       thi   2429   0.003385
7      must   2079   0.002897
8       ani   2075   0.002892
9      said   2020   0.002815
10     much   1914   0.002667
11     miss   1889   0.002633
12      one   1879   0.002619
13    think   1684   0.002347
14     onli   1624   0.002263
15     know   1505   0.002097
16    everi   1438   0.002004
17     time   1429   0.001992
18     well   1369   0.001908
19    might   1367   0.001905
20    never   1341   0.001869
21      say   1333   0.001858
22    littl   1299   0.001810
23      see   1278   0.001781
24     look   1254   0.001748
25     good   1232   0.001717
26     feel   1190   0.001658
27    befor   1180   0.001645
28       go   1158   0.001614
29     noth   1147   0.001599
30       ha   1086   0.001514
31     ladi   1077   0.001501
32     soon   1053   0.001468
33     make   1052   0.001466
34   sister   1051   0.001465
35   though   1049   0.001462
36     even   1046   0.001458
37     like   1046   0.001458
38  without   1043   0.001454
39     wish   1012   0.001410
40  thought   1009   0.001406

Word Clouds

A very common way to visualize word counts is with word clouds.

Show the code
Python
# Loading libraries
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Convert the Series to a single string
text = " ".join(word_counts2["text2"].astype(str))
# Generate the word cloud
wordcloud = WordCloud(max_font_size=80, max_words=200, background_color="white").generate(text)
# Display the word cloud
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
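
One caveat: joining the distinct words into a single string discards their counts, so generate() sees every word once and the sizes no longer reflect frequency. To size words by their actual counts, we can pass a word-to-count mapping to generate_from_frequencies:

Python
# Build a {word: count} dict from the counts table
freqs = dict(zip(word_counts2["text2"], word_counts2["n"]))
wordcloud = WordCloud(max_font_size=80, max_words=200,
                      background_color="white").generate_from_frequencies(freqs)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()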

Word Clouds

And this is how we save the word cloud to a file:

Python
# Saving the file
wordcloud.to_file("figures/example_wordcloud.png")
<wordcloud.wordcloud.WordCloud object at 0x312fd5e80>

Conclusion