We finally get into text analysis
Let us imagine that we have the following text.
This is how we turn it into a DataFrame:
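The text and the DataFrame-building code are not reproduced above; a minimal sketch, using the stanza of Emily Dickinson's "Because I could not stop for Death" that appears in the text column of the tables below and the text_df name used in the later code, would be:
import pandas as pd

# The four lines of the stanza, one per row
text = [
    "Because I could not stop for Death -",
    "He kindly stopped for me -",
    "The Carriage held but just Ourselves -",
    "and Immortality",
]

# Build a DataFrame with one line of the poem per row
text_df = pd.DataFrame({'text': text})
text_df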
First, we want to count the number of tokens in our text.
Tokens are meaningful units of text, most often words, that we want to use for further analysis.
Before we do that, we need to follow a few steps: make everything lower case, remove punctuation and numbers, tokenize, and lemmatize or stem the text.
This is how we make everything lower case:
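The lowercasing call itself is not shown; a minimal sketch, assuming the text_df DataFrame from the setup sketch above, is:
# Create a working copy of the text in 'text2' and make everything lower case
text_df['text2'] = text_df['text'].str.lower()
text_df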
text text2
0 Because I could not stop for Death - because i could not stop for death -
1 He kindly stopped for me - he kindly stopped for me -
2 The Carriage held but just Ourselves - the carriage held but just ourselves -
3 and Immortality and immortality
This is how we remove punctuation:
import string
text_df['text2'] = text_df['text2'].str.translate(str.maketrans('', '', string.punctuation))
text_df
text text2
0 Because I could not stop for Death - because i could not stop for death
1 He kindly stopped for me - he kindly stopped for me
2 The Carriage held but just Ourselves - the carriage held but just ourselves
3 and Immortality and immortality
This approach is faster than regular expressions such as re.sub
because translate works at a lower level.
This is what each part of that line means:
- str.translate(): str allows you to apply string functions to each element of the text2 column, and .translate() is a string method that modifies a string according to a translation table.
- str.maketrans('', '', string.punctuation): creates a translation table that tells Python how to modify characters in the string. The two empty strings '' indicate that no characters should be mapped or replaced, and string.punctuation is a string containing all punctuation characters (e.g. !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~).
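The digit-removal step is not shown for the poem example; the same call used later in the full pipeline would work here (its pieces are explained just below):
# Replace any run of digits with a single space
text_df['text2'] = text_df['text2'].str.replace(r'[0-9]+', " ", regex=True)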
r'[0-9]+' is a regular expression (regex) pattern that matches any sequence of one or more digits (0-9).
The + ensures that the pattern matches one or more digits together, rather than matching digits one at a time.
regex=True specifies that the search pattern is a regular expression; in recent versions of pandas this is not the default for str.replace(), so it should be set explicitly.
This is how we finally tokenize the data:
import nltk
from nltk.tokenize import word_tokenize
# Tokenize the text
text_df['text2'] = text_df['text2'].apply(lambda tokens: ' '.join(word_tokenize(tokens)))
text_df
text text2
0 Because I could not stop for Death - because i could not stop for death
1 He kindly stopped for me - he kindly stopped for me
2 The Carriage held but just Ourselves - the carriage held but just ourselves
3 and Immortality and immortality
This is how we lemmatize the tokens:
import nltk
from nltk.tokenize import word_tokenize
lemmatizer = nltk.stem.WordNetLemmatizer()
# Define function
def lemmatize_text(text):
    return [lemmatizer.lemmatize(word) for word in word_tokenize(text)]
# Apply function and join tokens with a space
text_df['text2'] = text_df['text2'].apply(lambda tokens: ' '.join(lemmatize_text(tokens)))
text_df
text text2
0 Because I could not stop for Death - because i could not stop for death
1 He kindly stopped for me - he kindly stopped for me
2 The Carriage held but just Ourselves - the carriage held but just ourselves
3 and Immortality and immortality
This is how we stem the tokens:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
# Initialize Lemmatizer and Stemmer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
# Define function to stem tokens
def stem_text(text):
    return [stemmer.stem(word) for word in word_tokenize(text)]  # Tokenize text before stemming
# Apply function and join tokens with a space
text_df['text2'] = text_df['text2'].apply(lambda tokens: ' '.join(stem_text(tokens)))
text_df
text text2
0 Because I could not stop for Death - becaus i could not stop for death
1 He kindly stopped for me - he kindli stop for me
2 The Carriage held but just Ourselves - the carriag held but just ourselv
3 and Immortality and immort
This is how we can transform the text so that every word appears on its own line:
# Unnest the tokens into individual rows
text_df['text2'] = text_df['text2'].apply(lambda text: text.split())
# Then use explode() to unnest each word into a separate row
text_df2 = text_df.explode('text2').reset_index(drop=True)
text_df2
text text2
0 Because I could not stop for Death - becaus
1 Because I could not stop for Death - i
2 Because I could not stop for Death - could
3 Because I could not stop for Death - not
4 Because I could not stop for Death - stop
5 Because I could not stop for Death - for
6 Because I could not stop for Death - death
7 He kindly stopped for me - he
8 He kindly stopped for me - kindli
9 He kindly stopped for me - stop
10 He kindly stopped for me - for
11 He kindly stopped for me - me
12 The Carriage held but just Ourselves - the
13 The Carriage held but just Ourselves - carriag
14 The Carriage held but just Ourselves - held
15 The Carriage held but just Ourselves - but
16 The Carriage held but just Ourselves - just
17 The Carriage held but just Ourselves - ourselv
18 and Immortality and
19 and Immortality immort
Now let us use Jane Austen's works to see how to apply these Python methods for text analysis.
Python
#Step 0: Renaming the df
df = r.df
#Step 1: Group the DataFrame by the book Column
grouped = df.groupby('book')
#Step 2: Count the Row Numbers within Each Group
df['linenumber'] = grouped.cumcount()
#Step 3: Identify if each row is a chapter
df['is_chapter'] = df['text'].str.contains(r'^chapter [\divxlc]', case=False, regex=True)
#Step 4: Group by book and create a cumulative chapter count
df['chapter'] = df.groupby('book')['is_chapter'].cumsum()
This is what the resulting DataFrame looks like:
text book linenumber chapter
0 SENSE AND SENSIBILITY Sense & Sensibility 0 0
1 Sense & Sensibility 1 0
2 by Jane Austen Sense & Sensibility 2 0
3 Sense & Sensibility 3 0
4 (1811) Sense & Sensibility 4 0
... ... ... ... ...
73417 national importance. Persuasion 8323 24
73418 Persuasion 8324 24
73419 Persuasion 8325 24
73420 Persuasion 8326 24
73421 Finis Persuasion 8327 24
[73422 rows x 4 columns]
We now need to transform everything into a one-token-per-row format.
This is how we would do it:
Python
# Step1: Work on a copy of df and ensure the 'text' column contains only strings
df2 = df.copy()
df2['text'] = df2['text'].astype(str)
### Step2: Turning words to Lowercase
df2["text2"] = df2['text'].str.lower()
### Step3: Remove punctuation
import string
df2['text2'] = df2['text2'].str.translate(str.maketrans('', '', string.punctuation))
### Step4: Remove numbers
df2['text2'] = df2['text2'].str.replace(r'[0-9]+', " ", regex=True)
### Step5: Tokenize
from nltk.tokenize import word_tokenize
# Tokenize the text
df2['text2'] = df2['text2'].apply(lambda tokens: ' '.join(word_tokenize(tokens)))
### Step6: Lemmatization
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Define function
def lemmatize_text(text):
    return [lemmatizer.lemmatize(word) for word in word_tokenize(text)]
# Apply function and join tokens with a space
df2['text2'] = df2['text2'].apply(lambda tokens: ' '.join(lemmatize_text(tokens)))
### Step7: Stemming
from nltk.stem import PorterStemmer
# Initialize Stemmer
stemmer = PorterStemmer()
# Define function to stem tokens
def stem_text(text):
    return [stemmer.stem(word) for word in word_tokenize(text)]  # Tokenize text before stemming
# Apply function and join tokens with a space
df2['text2'] = df2['text2'].apply(lambda tokens: ' '.join(stem_text(tokens)))
### Step8: Listing words line by line
# Unnest the tokens into individual rows
df2['text2'] = df2['text2'].apply(lambda text: text.split())
# Then use explode() to unnest each word into a separate row
df2 = df2.explode('text2').reset_index(drop=True)
### Step9: Remove rows with NaN values in the 'text2' column
df2 = df2.dropna(subset=['text2'])
# Final DataFrame
df2
text ... text2
0 SENSE AND SENSIBILITY ... sens
1 SENSE AND SENSIBILITY ... and
2 SENSE AND SENSIBILITY ... sensibl
4 by Jane Austen ... by
5 by Jane Austen ... jane
... ... ... ...
729021 possible, more distinguished in its domestic v... ... in
729022 possible, more distinguished in its domestic v... ... it
729023 national importance. ... nation
729024 national importance. ... import
729028 Finis ... fini
[717876 rows x 5 columns]
The next step is to remove stop words
Stop words are words that are not useful for analysis: “the”, “of”, “to”, etc.
Python
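The stop-word removal code is not reproduced above; a minimal sketch, modeled on the stop-word step of the preprocess_text function defined further below (the text_no_stop column name comes from the output, and the use of None for removed words is an assumption), is:
from nltk.corpus import stopwords

# Build the set of English stop words
stop_words = set(stopwords.words('english'))

# Keep a word only if it is not a stop word; stop words become None (blank in the output)
df2['text_no_stop'] = df2['text2'].apply(lambda word: word if word not in stop_words else None)
df2.head()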
text book ... text2 text_no_stop
0 SENSE AND SENSIBILITY Sense & Sensibility ... sens sens
1 SENSE AND SENSIBILITY Sense & Sensibility ... and
2 SENSE AND SENSIBILITY Sense & Sensibility ... sensibl sensibl
4 by Jane Austen Sense & Sensibility ... by
5 by Jane Austen Sense & Sensibility ... jane jane
[5 rows x 6 columns]
There are three steps in the lambda function used above: it takes each token from the text2 column, checks whether it appears in the stop-word set, and keeps it only if it does not (otherwise it returns None).
We can now remove the NAs and keep only the rows that have a value in the text_no_stop column.
Python
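The code is not shown; a minimal sketch, which drops the rows where text_no_stop is missing (df3 is a hypothetical name for the filtered DataFrame), is:
# Keep only rows that still have a word after stop-word removal
df3 = df2.dropna(subset=['text_no_stop'])
df3.head()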
text book ... text2 text_no_stop
0 SENSE AND SENSIBILITY Sense & Sensibility ... sens sens
2 SENSE AND SENSIBILITY Sense & Sensibility ... sensibl sensibl
5 by Jane Austen Sense & Sensibility ... jane jane
6 by Jane Austen Sense & Sensibility ... austen austen
13 CHAPTER 1 Sense & Sensibility ... chapter chapter
[5 rows x 6 columns]
This is how we see the differences:
It is a good idea to create a function that performs all these operations in one step.
We first load the necessary packages:
Python
import pandas as pd
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
# Make sure to download the necessary NLTK resources if you haven't already
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
True
True
True
We then define our auxiliary functions
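These are the same lemmatize_text and stem_text helpers used earlier, restated here so that the preprocess_text function below is self-contained:
Python
# Initialize lemmatizer and stemmer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Lemmatize each token of a string
def lemmatize_text(text):
    return [lemmatizer.lemmatize(word) for word in word_tokenize(text)]

# Stem each token of a string
def stem_text(text):
    return [stemmer.stem(word) for word in word_tokenize(text)]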
We can finally define our big function
Python
def preprocess_text(df, text_column):
    # Step1: Ensure the specified column contains only strings
    df[text_column] = df[text_column].astype(str)
    ### Step2: Turning words to Lowercase
    df["text2"] = df[text_column].str.lower()
    ### Step3: Remove punctuation
    df['text2'] = df['text2'].str.translate(str.maketrans('', '', string.punctuation))
    ### Step4: Remove numbers
    df['text2'] = df['text2'].str.replace(r'[0-9]+', " ", regex=True)
    ### Step5: Tokenize
    df['text2'] = df['text2'].apply(lambda tokens: ' '.join(word_tokenize(tokens)))
    ### Step6: Lemmatization
    df['text2'] = df['text2'].apply(lambda tokens: ' '.join(lemmatize_text(tokens)))
    ### Step7: Stemming
    df['text2'] = df['text2'].apply(lambda tokens: ' '.join(stem_text(tokens)))
    ### Step8: Listing words line by line
    # Unnest the tokens into individual rows
    df['text2'] = df['text2'].apply(lambda text: text.split())
    # Then use explode() to unnest each word into a separate row
    df = df.explode('text2').reset_index(drop=True)
    ### Step9: Remove stop words
    stop_words = set(stopwords.words('english'))
    df['text2'] = df['text2'].apply(lambda word: word if word not in stop_words else None)
    ### Step10: Remove rows with NaN values in the 'text2' column
    df = df.dropna(subset=['text2'])
    return df
And this is how we now use that function:
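The call itself is not shown; assuming the result is stored as df4b (the name used in the word-count step below), it would look like this:
Python
# Apply the full preprocessing pipeline to the Austen data
df4b = preprocess_text(df, 'text')
df4b.head()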
text book ... chapter text2
0 SENSE AND SENSIBILITY Sense & Sensibility ... 0 sens
2 SENSE AND SENSIBILITY Sense & Sensibility ... 0 sensibl
5 by Jane Austen Sense & Sensibility ... 0 jane
6 by Jane Austen Sense & Sensibility ... 0 austen
13 CHAPTER 1 Sense & Sensibility ... 1 chapter
[5 rows x 6 columns]
We can now count the most common words:
Python
# Count the occurrences of each word and sort by count in descending order
word_counts = df4b['text2'].value_counts().reset_index()
# Rename the columns to match the output of count in R
word_counts.columns = ['text2', 'n']
# Sort by count in descending order (optional if you want to ensure sorting)
word_counts = word_counts.sort_values(by='n', ascending=False)
word_counts
text2 n
0 wa 11159
1 hi 5945
2 mr 5395
3 veri 3720
4 could 3595
... ... ...
8541 circumv 1
8540 readmit 1
8539 strenuou 1
8538 hammer 1
13004 wheedl 1
[13005 rows x 2 columns]
If an empty string shows up among the most common "words", we should remove it before going further.
We can now try to visualize the most common words.
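The plot itself is not reproduced here; a minimal sketch of one way to draw it with matplotlib (plotting the top 20 words is an assumption) is:
Python
import matplotlib.pyplot as plt

# Horizontal bar chart of the 20 most frequent (stemmed) words
top_words = word_counts.head(20).sort_values(by='n')
plt.barh(top_words['text2'], top_words['n'])
plt.xlabel('n')
plt.title("Most common words in Jane Austen's novels")
plt.tight_layout()
plt.show()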
So this graph tells us that words such as "mr", "would", "could", and "must" are prevalent in Jane Austen's work.
It could also be helpful to calculate the proportions of these words.
To do that, we need to go back to the original books to calculate the total number of words.
We can now use it to make frequency calculations
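The frequency code is not shown; a minimal sketch, assuming the denominator is the total number of tokens in the one-token-per-row corpus before stop-word removal (df2 from the pipeline above) and that the result is stored as word_counts2 (the name used in the word-cloud code below), is:
Python
# Total number of tokens before stop-word removal (assumption about the denominator)
total_words = len(df2)

# Frequency of each word relative to that total
word_counts2 = word_counts.copy()
word_counts2['frequency'] = word_counts2['n'] / total_words
word_counts2.head(41)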
text2 n frequency
0 wa 11159 0.015552
1 hi 5945 0.008285
2 mr 5395 0.007519
3 veri 3720 0.005184
4 could 3595 0.005010
5 would 3233 0.004506
6 thi 2429 0.003385
7 must 2079 0.002897
8 ani 2075 0.002892
9 said 2020 0.002815
10 much 1914 0.002667
11 miss 1889 0.002633
12 one 1879 0.002619
13 think 1684 0.002347
14 onli 1624 0.002263
15 know 1505 0.002097
16 everi 1438 0.002004
17 time 1429 0.001992
18 well 1369 0.001908
19 might 1367 0.001905
20 never 1341 0.001869
21 say 1333 0.001858
22 littl 1299 0.001810
23 see 1278 0.001781
24 look 1254 0.001748
25 good 1232 0.001717
26 feel 1190 0.001658
27 befor 1180 0.001645
28 go 1158 0.001614
29 noth 1147 0.001599
30 ha 1086 0.001514
31 ladi 1077 0.001501
32 soon 1053 0.001468
33 make 1052 0.001466
34 sister 1051 0.001465
35 though 1049 0.001462
36 even 1046 0.001458
37 like 1046 0.001458
38 without 1043 0.001454
39 wish 1012 0.001410
40 thought 1009 0.001406
A very common way to visualize word counts is by using word clouds
Python
# Loading libraries
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Convert the Series to a single string
text = " ".join(word_counts2["text2"].astype(str))
# Generate the word cloud
wordcloud = WordCloud(max_font_size=80, max_words=200, background_color="white").generate(text)
# Display the word cloud
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
And this is how we save the file.
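The saving code is not shown; a minimal sketch using WordCloud's to_file method (the filename is illustrative) is:
Python
# Save the word cloud image to disk
wordcloud.to_file("austen_wordcloud.png")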
Popescu (JCU): Lecture 15