L20: Supervised Learning

Bogdan G. Popescu

John Cabot University

Introduction

Supervised learning allows us to classify documents into pre-defined categories using labelled data

Unsupervised learning, by contrast, analyzes and clusters unlabelled data sets. These algorithms discover hidden patterns in data without the need for human intervention (hence, they are “unsupervised”).

Supervised learning can be conceptualized as a generalization of dictionary methods. In the dictionary case:

  • specific words are associated with specific categories defined by the researcher (they play the role of pre-labelled training data)
  • words have a weight of 0 or 1, rather than a weight learned from their relative prevalence in each category
  • documents are scored based on the words they contain

Introduction

Within supervised learning, the features associated with each category (and their relative weight) are learned from the data.

Supervised learning methods will often outperform dictionary methods in classification tasks, particularly when the training sample is large.

Components of Supervised Learning

Labelled Datasets

There is a labelled dataset (usually hand-coded) that places texts into different categories

  • training set: used to train the classifier
  • test set: used to validate the classifier

Classification method

  • this will be used to learn the relationship between coded texts and words

Components of Supervised Learning

Validation method

  • we use classification performance metrics: confusion matrix, accuracy, sensitivity, specificity, etc.

Out-of-sample prediction

  • We will use the model to predict categories for documents that do not have labels

Creating a labelled dataset

We can label data through manual or automated annotation. E.g.:

  • undergraduate students code texts into particular categories
  • crowd-sourced workers on the internet code texts into particular categories
  • ChatGPT can be used to code texts into particular categories

Intuition: Naive Bayes

Imagine you’re sorting emails (Spam vs. Not Spam)

Naive Bayes helps classify emails as spam or not spam based on the words they contain. For example:

  • Words like “win,” “free,” and “prize” might appear more in spam emails.
  • Words like “meeting,” “project,” and “agenda” might appear more in not-spam emails.

The goal is to calculate which category (spam or not spam) an email is most likely to belong to.

Intuition: Naive Bayes

Step 1: What’s the Question?

We want to figure out:

Note

How likely is it that this email belongs to a specific category (e.g., spam) given the words it contains?

This is called the posterior probability:

\[P(\text{Category|Words})\]

Intuition: Naive Bayes

Step 2: What Information Do We Use?

To make this decision, Naive Bayes combines:

1. How common is the category overall? (Prior Probability)

\[P(\text{Category})\]

For instance, if 60% of emails are spam, then \(P(\text{Spam})=0.6\)

2. How likely are the words, given the category? (Likelihood)

\[P(\text{Words|Category})\]

For example, the word “win” might appear in 50% of spam emails but only 2% of not-spam emails.

3. How often do the words appear overall? (Normalization Factor):

\[P(\text{Words})\] This ensures probabilities add up to 1 across all categories.

Intuition: Naive Bayes

Step 3: Simplify the Math with an Analogy

Imagine you’re at a dog park. Some dogs are Labradors, and others are Poodles. If you see a black dog, you might ask:

Note

Is this dog more likely to be a Labrador or a Poodle?

1. Prior Probability: Start with what you know about the park:

  • 70% of the dogs are Labradors: \(P(\text{Labrador})=0.7\)
  • 30% are Poodles: \(P(\text{Poodle})=0.3\)

2. Likelihood: Look at traits like coat color:

  • Black Labradors are common: \(P(\text{Black|Labrador}) = 0.8\)
  • Black Poodles are less common: \(P(\text{Black|Poodle}) = 0.3\)

Intuition: Naive Bayes

Step 3: Simplify the Math with an Analogy

Imagine you’re at a dog park. Some dogs are Labradors, and others are Poodles. If you see a black dog, you might ask:

Note

Is this dog more likely to be a Labrador or a Poodle?

3. Combine Information: Naive Bayes combines these probabilities to estimate (worked out below):

  • \(P(\text{Labrador|Black})\)
  • \(P(\text{Poodle|Black})\)
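Plugging in the numbers from above makes the comparison concrete:

\[ P(\text{Labrador|Black}) = \frac{0.7 \times 0.8}{0.7 \times 0.8 + 0.3 \times 0.3} = \frac{0.56}{0.65} \approx 0.86, \qquad P(\text{Poodle|Black}) = \frac{0.09}{0.65} \approx 0.14 \]

So the black dog is far more likely to be a Labrador.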

Intuition: Naive Bayes

Step 4: The Formula in Action

Naive Bayes calculates:

\[P(\text{Category|Words}) = \frac{P(\text{Category}) \times P(\text{Words|Category})}{P(\text{Words})}\]

Let’s apply this to the email example:

  • Suppose the prior probability of spam is \(P(\text{Spam})=0.6\)
  • Suppose the likelihood of the word “win” in spam emails is \(P(\text{"win"|Spam})=0.5\)
  • Suppose the overall probability of the word “win” is \(P(\text{"win"}) = 0.2\)

The posterior probability of spam given the word “win” is:

\[ P(\text{Spam|"win"}) = \frac{P(\text{Spam}) \times P(\text{"win"|Spam})}{P(\text{"win"})} = \frac{0.6 \times 0.5}{0.2} = 1.5 \]

A true probability can never exceed 1: the value of 1.5 arises only because these illustrative numbers are not mutually consistent (with the numbers above, \(P(\text{"win"})\) would have to be at least \(0.6 \times 0.5 = 0.3\)). In practice, classifiers skip the denominator, which is the same for every category, and simply compare the numerators \(P(\text{Category}) \times P(\text{Words|Category})\), normalizing them at the end so they sum to 1.
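A minimal Python sketch of this comparison, reusing the illustrative numbers above (priors of 0.6 and 0.4, \(P(\text{"win"|Spam})=0.5\), and the 2% figure for not-spam emails from the earlier example):

Python
# Minimal sketch: compare the Naive Bayes numerators for the word "win".
p_spam, p_not_spam = 0.6, 0.4
p_win_given_spam, p_win_given_not_spam = 0.5, 0.02

# Numerators of Bayes' rule (the denominator P("win") is the same for both
# categories, so it can be ignored when choosing the most likely class).
num_spam = p_spam * p_win_given_spam              # 0.30
num_not_spam = p_not_spam * p_win_given_not_spam  # 0.008

# Normalizing the numerators turns them into proper posterior probabilities.
total = num_spam + num_not_spam
print(f"P(Spam | 'win')     = {num_spam / total:.3f}")      # ~0.974
print(f"P(Not Spam | 'win') = {num_not_spam / total:.3f}")  # ~0.026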

Intuition: Naive Bayes

Step 5: Make the Final Decision

Compare the posterior probabilities for all categories (spam and not spam).

We assign the email to the category with the highest probability.

Methods: Naive Bayes

Steps

For example, we can try to label aggression in political speeches, based on a small sample coded with ChatGPT.

1. Train Language Models for Each Category

Example: Build a model for Aggressive speeches and another for Non-Aggressive speeches.
Each model calculates the probability of words appearing in its category.

  • Aggressive speeches often use words like “battle,” “destroy,” “win,” “horrible.”
  • Non-Aggressive speeches often use non-aggressive, neutral words like “diplomacy,” “support,” or “grow.”

2. Get a New Document
Example: A campaign speech or policy statement.

Methods: Naive Bayes

Steps

3. Calculate Probabilities for Each Model

Compute the likelihood that the text was “written” by the Aggressive language model vs the Non-Aggressive model:

  • Aggressive Model: High probabilities for “battle,” “destroy,” “win,” “horrible.”
  • Non-Aggressive Model: High probabilities for “diplomacy,” “support,” or “grow.”

4. Assign the Most Likely Category

  • If the text mentions “battle” and “destroy” → Aggressive.
  • If the text emphasizes “diplomacy” and “support” → Non-Aggressive.

Language Models

We can represent these different “models” for language using a probability distribution over the words in the vocabulary:

A probability distribution over a discrete variable must have three properties:

  • Each element must be greater than or equal to zero
  • Each element must be less than or equal to one
  • The sum of the elements must be 1
Tone            Battle  Destroy  Win   Diplomacy  Support  Grow
Aggressive      0.30    0.25     0.20  0.05       0.10     0.10
Non-Aggressive  0.05    0.10     0.15  0.30       0.25     0.15
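As a quick sketch, we can verify in Python that each row of this table satisfies the three properties (the numbers are the ones in the table):

Python
# Check that each tone's word distribution is a valid probability distribution.
tones = {
    "Aggressive":     [0.30, 0.25, 0.20, 0.05, 0.10, 0.10],
    "Non-Aggressive": [0.05, 0.10, 0.15, 0.30, 0.25, 0.15],
}
for tone, probs in tones.items():
    assert all(0 <= p <= 1 for p in probs)  # each element is between 0 and 1
    assert abs(sum(probs) - 1) < 1e-9       # the elements sum to 1
    print(f"{tone}: valid probability distribution over {len(probs)} words")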

Probability Distributions

  • Definition:
    • A mathematical function that shows how likely different outcomes are.
    • For discrete events, it assigns probabilities to each possible event.
  • Key Properties:
    1. Probabilities are always between 0 and 1.
    2. The sum of all probabilities equals 1.

Example: Rolling a Die 🎲

  • Possible Outcomes:

\[( 1, 2, 3, 4, 5, 6)\]

  • Uniform Distribution (Fair Die):

\[P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac{1}{6}\]

  • The sum of probabilities: \[ \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = 1 \]

Relevance to Language Models

In Language Models:

  • The “outcomes” are words.
  • A probability distribution predicts how likely each word is to appear next.
  • Example: \[P(\text{'hello'}) = 0.4, \; P(\text{'world'}) = 0.3, \; P(\text{'everyone'}) = 0.2, \; \dots\]
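As a small illustration, we can sample the “next word” from such a distribution (a sketch; the fourth word and its probability of 0.1 are assumed here only so that the probabilities sum to 1):

Python
import numpy as np

# Hypothetical next-word distribution; the word "friends" and its 0.1
# probability are assumptions added so that the distribution sums to 1.
words = ["hello", "world", "everyone", "friends"]
probs = [0.4, 0.3, 0.2, 0.1]

rng = np.random.default_rng(42)
print(rng.choice(words, size=5, p=probs))  # five draws from the distribution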

Language Models

Given these categories, we can calculate the probability that a given set of word counts (i.e. a document) would be drawn from each distribution.

\[ P(W_i|\mu) = \frac{M_i !}{\prod_{j=1}^J W_{i,j}!} \prod_{j=1}^J \mu_{j}^{W_{i,j}} \]

Where:

  • \(\mu_j\): probability of observing word \(j\) under a given category
  • \(W_{i,j}\): the number of times word \(j\) appears in document \(i\) (i.e. an element of a DFM—document-feature matrix)
  • \(M_i\): the total number of words in document \(i\)
  • \(!\): factorial operator, e.g., \(4! = 4 \times 3 \times 2 \times 1\)
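Here is a small sketch that evaluates this formula in Python for the example document \(W_1\) used on the next slides (counts of 5, 3, 0, 0, 2, 1 over the six words), under both word distributions:

Python
from math import factorial, prod

def multinomial_prob(counts, mu):
    """P(W_i | mu): the multinomial probability defined above."""
    M = sum(counts)  # M_i: total number of words in the document
    coef = factorial(M) / prod(factorial(w) for w in counts)
    return coef * prod(m ** w for m, w in zip(mu, counts))

mu_aggressive     = [0.30, 0.25, 0.20, 0.05, 0.10, 0.10]
mu_non_aggressive = [0.05, 0.10, 0.15, 0.30, 0.25, 0.15]
W1 = [5, 3, 0, 0, 2, 1]  # Battle, Destroy, Win, Diplomacy, Support, Grow

print(multinomial_prob(W1, mu_aggressive))      # ~0.00105
print(multinomial_prob(W1, mu_non_aggressive))  # ~0.000000081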

Language Models

Tone            Battle  Destroy  Win   Diplomacy  Support  Grow
Aggressive      0.30    0.25     0.20  0.05       0.10     0.10
Non-Aggressive  0.05    0.10     0.15  0.30       0.25     0.15

Let’s imagine that we have the following DFM:

Document  Battle  Destroy  Win  Diplomacy  Support  Grow
\(W_1\)        5        3    0          0        2     1
\(W_2\)        0        1    4          3        0     2

We can now calculate the probability for each document:

\[ \begin{aligned} P(W_1|\mu_{\text{Aggressive}}) &= \frac{M_1 !}{\prod_{j=1}^J W_{1,j}!} \prod_{j=1}^J \mu_{j}^{W_{1,j}}\\ &= \frac{11!}{5! \cdot 3! \cdot 2! \cdot 1!} \times 0.30^5 \times 0.25^3 \times 0.10^2 \times 0.10^1 \\ &\approx 0.00105 \end{aligned} \]

Language Models

Tone            Battle  Destroy  Win   Diplomacy  Support  Grow
Aggressive      0.30    0.25     0.20  0.05       0.10     0.10
Non-Aggressive  0.05    0.10     0.15  0.30       0.25     0.15

Let’s imagine that we have the following DFM:

Document  Battle  Destroy  Win  Diplomacy  Support  Grow
\(W_1\)        5        3    0          0        2     1
\(W_2\)        0        1    4          3        0     2

\[ \begin{aligned} P(W_1|\mu_{\text{Non-Aggressive}}) &= \frac{M_1 !}{\prod_{j=1}^J W_{1,j}!} \prod_{j=1}^J \mu_{j}^{W_{1,j}}\\ &= \frac{11!}{5! \cdot 3! \cdot 2! \cdot 1!} \times 0.05^5 \times 0.10^3 \times 0.25^2 \times 0.15^1 \\ &\approx 0.000000081 \end{aligned} \]

The probability of observing \(W_1\) is higher under \(\mu_{\text{Aggressive}}\) than under \(\mu_{\text{Non-Aggressive}}\).

Language Models

Tone            Battle  Destroy  Win   Diplomacy  Support  Grow
Aggressive      0.30    0.25     0.20  0.05       0.10     0.10
Non-Aggressive  0.05    0.10     0.15  0.30       0.25     0.15

Let’s imagine that we have the following DFM:

Document  Battle  Destroy  Win  Diplomacy  Support  Grow
\(W_1\)        5        3    0          0        2     1
\(W_2\)        0        1    4          3        0     2

We can now calculate the probability for each document:

\[ \begin{aligned} P(W_2|\mu_{\text{Aggressive}}) &= \frac{M_2 !}{\prod_{j=1}^J W_{2,j}!} \prod_{j=1}^J \mu_{j}^{W_{2,j}}\\ &= \frac{10!}{1! \cdot 4! \cdot 3! \cdot 2!} \times 0.25^1 \times 0.20^4 \times 0.05^3 \times 0.10^2 \\ &\approx 0.0000063 \end{aligned} \]

Language Models

Tone            Battle  Destroy  Win   Diplomacy  Support  Grow
Aggressive      0.30    0.25     0.20  0.05       0.10     0.10
Non-Aggressive  0.05    0.10     0.15  0.30       0.25     0.15

Let’s imagine that we have the following DFM:

Document  Battle  Destroy  Win  Diplomacy  Support  Grow
\(W_1\)        5        3    0          0        2     1
\(W_2\)        0        1    4          3        0     2

We can now calculate the probability for each document:

\[ \begin{aligned} P(W_2|\mu_{\text{Non-Aggressive}}) &= \frac{M_2 !}{\prod_{j=1}^J W_{2,j}!} \prod_{j=1}^J \mu_{j}^{W_{2,j}}\\ &= \frac{10!}{1! \cdot 4! \cdot 3! \cdot 2!} \times 0.10^1 \times 0.15^4 \times 0.30^3 \times 0.15^2 \\ &\approx 0.00039 \end{aligned} \]

The probability of observing \(W_2\) is higher under \(\mu_{\text{Non-Aggressive}}\) than under \(\mu_{\text{Aggressive}}\)

Implications

Given a set of probabilities, we can work out which model was most likely to have generated any given document.

The likelihood of a document under a given model is:

  • larger when the model gives high probabilities to the words that occur frequently in the document
  • smaller when the model gives high probabilities to words that occur rarely (or not at all) in the document

Naive Bayes

Naive Bayes is a model that classifies documents into categories on the basis of the words they contain.

\[P(y_i=C_k | W_i) = \frac{P(y_i=C_k) P(W_i|y_i = C_k)}{P(W_i)}\]

Naive Bayes

Naive Bayes is a model that classifies documents into categories on the basis of the words they contain.

\[\color{red}{P(y_i=C_k | W_i)} = \frac{P(y_i=C_k) P(W_i|y_i = C_k)}{P(W_i)}\]

  • \(\color{red}{P(y_i=C_k | W_i)}\) - posterior distribution
    • the probability that document \(i\) is in category \(k\), given the words in the document and the prior probability of category \(k\) (the likelihood of the document belonging to a category before we observe any text.)

Naive Bayes

Naive Bayes is a model that classifies documents into categories on the basis of the words they contain.

\[P(y_i=C_k | W_i) = \frac{P(y_i=C_k) \color{red}{P(W_i|y_i = C_k)}}{P(W_i)}\]

  • \(P(y_i=C_k | W_i)\) - posterior distribution
    • the probability that document \(i\) is in category \(k\), given the words in the document and the prior probability of category \(k\) (the likelihood of the document belonging to a category before we observe any text.)
  • \(\color{red}{P(W_i|y_i = C_k)}\) - conditional probability
    • the probability that we would observe the words in \(W_i\) if the document were from category \(k\)

Naive Bayes

Naive Bayes is a model that classifies documents into categories on the basis of the words they contain.

\[P(y_i=C_k | W_i) = \frac{\color{red}{P(y_i=C_k)} P(W_i|y_i = C_k)}{P(W_i)}\]

  • \(P(y_i=C_k | W_i)\) - posterior distribution
    • the probability that document \(i\) is in category \(k\), given the words in the document and the prior probability of category \(k\) (the likelihood of the document belonging to a category before we observe any text.)
  • \(P(W_i|y_i = C_k)\) - conditional probability
    • the probability that we would observe the words in \(W_i\) if the document were from category \(k\)
  • \(\color{red}{P(y_i=C_k)}\) - prior probability that the document is from category \(k\)
    • the probability of the category of the document, absent any information about the words it contains

Naive Bayes

Naive Bayes is a model that classifies documents into categories on the basis of the words they contain.

\[P(y_i=C_k | W_i) = \frac{P(y_i=C_k) P(W_i|y_i = C_k)}{\color{red}{P(W_i)}}\]

  • \(P(y_i=C_k | W_i)\) - posterior distribution
    • the probability that document \(i\) is in category \(k\), given the words in the document and the prior probability of category \(k\) (the likelihood of the document belonging to a category before we observe any text.)
  • \(P(W_i|y_i = C_k)\) - conditional probability
    • the probability that we would observe the words in \(W_i\) if the document were from category \(k\)
  • \(P(y_i=C_k)\) - prior probability that the document is from category \(k\)
    • the probability of the category of the document, absent any information about the words it contains
  • \(\color{red}{P(W_i)}\) - unconditional probability of the words in document i
    • the probability that we would observe the words in \(W_i\) across all categories

Naive Bayes Caveats

By treating documents as bag-of-words, we assume:

  • Conditional independence - Knowing a document contains one word doesn’t tell us anything about the probability of observing other words in that document
  • Positional independence of word counts - The position of a word within a document doesn’t give us any information about the category of that document

Naive Bayes Classification

The classification decision made by the Naive Bayes model is simple: we assign document \(i\) to the category, \(C_k\), for which it has the highest posterior probability:

\[ \hat{Y_i} = \underset{k \in \{1,...,K\}}{\mathrm{argmax}} \; P(y_i = C_k) \times P (W_i|y_i=C_k) \] where \(\mathrm{argmax}\) means “the category \(k\) with the maximum posterior probability”.

Logic (see the sketch below):

  • assign documents to categories when the probability of observing the words in that document is high given the probability distribution for that category (i.e. when \(P(W_i|y_i = C_k)\) is large)
  • assign more documents to categories that contain more documents in the training data (i.e. when \(P(y_i = C_k)\) is large)
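A sketch of this decision rule applied to the toy example from earlier, using log probabilities for numerical stability (the equal 50/50 priors are assumed here purely for illustration):

Python
import numpy as np

# Word distributions from the toy example; equal priors are an assumption.
mu = {
    "Aggressive":     np.array([0.30, 0.25, 0.20, 0.05, 0.10, 0.10]),
    "Non-Aggressive": np.array([0.05, 0.10, 0.15, 0.30, 0.25, 0.15]),
}
prior = {"Aggressive": 0.5, "Non-Aggressive": 0.5}

W2 = np.array([0, 1, 4, 3, 0, 2])  # word counts for document W_2

# Score each category as log P(C_k) + sum_j W_j * log(mu_jk).
# (The multinomial coefficient is identical across categories, so it drops out.)
scores = {k: np.log(prior[k]) + (W2 * np.log(mu[k])).sum() for k in mu}
prediction = max(scores, key=scores.get)
print(scores)      # the higher (less negative) score wins
print(prediction)  # Non-Aggressive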

Naive Bayes Application

Here is how we apply Naive Bayes in Python:

Python
import pandas as pd
import numpy as np
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
# Saving only specific columns
aggression_texts = aggression_texts[["name", "party_short", "year", "body", "gender", "aggression_rating"]]
# Recode aggression_rating to new columns for aggressive and non-aggressive
aggression_texts['aggressive'] = (aggression_texts['aggression_rating'] == 1).astype(int)
aggression_texts['non_aggressive'] = (aggression_texts['aggression_rating'] == 0).astype(int)
# Make a working copy of the data frame
texts = aggression_texts.copy()
# Step 1: Add an identifier column
texts["id"] = texts.index
texts.head(8)
                  name   party_short  year  ... aggressive non_aggressive  id
0  Mr Gerry Bermingham        Labour  1992  ...          1              0   0
1       Richard Benyon  Conservative  2018  ...          0              1   1
2       Penny Mordaunt  Conservative  2014  ...          0              1   2
3      Gerry Sutcliffe        Labour  2004  ...          0              1   3
4      Debbie Abrahams        Labour  2014  ...          1              0   4
5     Stephen Metcalfe  Conservative  2012  ...          0              1   5
6      Mr John Gunnell        Labour  1996  ...          0              1   6
7         Julian Lewis  Conservative  2000  ...          1              0   7

[8 rows x 9 columns]

Pre-processing Text

Python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
import nltk
# Download the NLTK stopword list if not already done
nltk.download('stopwords')

# Get the English stopwords list
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
texts['body'] = texts['body'].apply(remove_stopwords)
True

Test-set approach

We randomly divide the available set of samples into two parts: a training set and a test set.

The model is fit on the training set, and the fitted model is used to predict the responses for the test set.

We then calculate classification performance scores (accuracy, sensitivity, specificity, etc) for the test set.

Naive Bayes Application

Here is how we apply Naive Bayes in Python:

Python
# Import necessary libraries for text processing and machine learning
# For converting text into numerical data
from sklearn.feature_extraction.text import CountVectorizer
# For training a Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
# For splitting data into training and testing sets
from sklearn.model_selection import train_test_split
# For evaluating the model's performance
from sklearn.metrics import classification_report

# Step 1: Convert text data into a Document-Feature Matrix
# `CountVectorizer` turns text data into a matrix where rows are documents and columns are unique words.
# Each cell contains the count of a word's occurrence in a document.
# (Common stopwords were already removed from the "body" column in the pre-processing step above.)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts["body"])  # Convert the "body" column of the dataset into numerical data
y = texts["aggressive"]  # Set the target variable to the "aggressive" column, which contains category labels (0 or 1)

Naive Bayes Application

The model is fit on the training set, and the fitted model is used to predict the responses for the test set.

Python
# Step 2: Split the data into training and testing sets
# Training data is used to build the model, and testing data is used to evaluate it.
# `test_size=0.33` means 33% of the data will be used for testing.
# `random_state=125` ensures consistent results each time you run the code.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=125)

How many Aggressive speeches are there in the training and test sets?

Python
# Count Aggressive speeches in the training set
aggressive_train_count = (y_train == 1).sum()
non_aggressive_train_count = (y_train == 0).sum()
aggressive_test_count = (y_test == 1).sum()
non_aggressive_test_count = (y_test == 0).sum()
Python
print(f"Aggressive speeches train count: {aggressive_train_count}")
Aggressive speeches train count: 7104
print(f"Aggressive speeches test count: {aggressive_test_count}")
Aggressive speeches test count: 3494
print(f"Non-aggressive speeches train count: {non_aggressive_train_count}")
Non-aggressive speeches train count: 12088
print(f"Non-Aggressive speeches test count: {non_aggressive_test_count}")
Non-Aggressive speeches test count: 5959

Naive Bayes Application

The model is fit on the training set.

Python
# Step 3: Train a Naive Bayes model
# Multinomial Naive Bayes is a simple algorithm often used for text classification.
# The model learns from the training data to associate words with the target labels (e.g., Non-Aggressive or Aggressive).
model = MultinomialNB()
model.fit(X_train, y_train)  # Train the model using the training data
MultinomialNB()

The fitted model is used to predict the responses for the test set.

Naive Bayes Application

And finally, we predict the category of each speech in the test set:

Python
# Step 4: Evaluate the model's performance
# Predict the labels for the testing data
y_pred = model.predict(X_test)

We then calculate classification performance scores (accuracy, sensitivity, specificity, etc) for the test set.

Python
# Print a detailed report showing metrics like precision, recall, and F1-score for each class
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.79      0.80      0.80      5959
           1       0.65      0.64      0.65      3494

    accuracy                           0.74      9453
   macro avg       0.72      0.72      0.72      9453
weighted avg       0.74      0.74      0.74      9453

Naive Bayes Application

Python
# Print a detailed report showing metrics like precision, recall, and F1-score for each class
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.79      0.80      0.80      5959
           1       0.65      0.64      0.65      3494

    accuracy                           0.74      9453
   macro avg       0.72      0.72      0.72      9453
weighted avg       0.74      0.74      0.74      9453
  • Precision: of the documents predicted to be in a class, the share that actually belong to it.
  • Recall: of the documents that actually belong to a class, the share that are correctly predicted.
  • F1-Score: the harmonic mean of precision and recall.
  • Support: 5959 (Class 0), 3494 (Class 1).

Insights:

  • Performance is noticeably better for class 0 (non-aggressive) than for class 1 (aggressive).
  • The lower recall for class 1 (0.64) means a substantial share of aggressive speeches are missed (false negatives).

Naive Bayes Application

This is how we identify the most relevant words associated with each category.

Python
import numpy as np

# Step 5: Directly exponentiate the model's log probabilities to get probabilities
feature_probs_exp = np.exp(model.feature_log_prob_)  # Convert log probabilities to regular probabilities

# Step 6: Relabel classes
classes = ['non_aggressive', 'aggressive']
feature_names = vectorizer.get_feature_names_out()  # Vocabulary from vectorizer

feature_probs_df = pd.DataFrame(
    feature_probs_exp,  # Use the probabilities
    columns=vectorizer.get_feature_names_out(),  # Feature names as column names
    index=classes  # Class labels as row index
)

# Step 7: Transpose the DataFrame for easier analysis
# Flip the DataFrame so rows become words and columns are the classes
feature_probs_transposed = feature_probs_df.T

# Print the final DataFrame showing the probabilities of words for each class
print(feature_probs_transposed)
               non_aggressive  aggressive
00                   0.000001    0.000006
000                  0.001239    0.001526
00035                0.000001    0.000002
000s                 0.000004    0.000002
000th                0.000003    0.000002
...                       ...         ...
zoomusicology        0.000003    0.000002
zsa                  0.000001    0.000005
zsolt                0.000003    0.000002
zuckerman            0.000001    0.000003
zurich               0.000001    0.000005

[38769 rows x 2 columns]

Naive Bayes Application

Remember that we estimate the probability of observing word \(j\) given class \(k\):

\[ \mu_{j(k)}=\frac{W_{j(k)}}{\sum_{j \in V} W_{j(k)}} \] These are the word probabilities estimated from our speeches.

We can examine the probability of each word. Here are the top 10 words.

Highest probability for Non-Aggressive (i.e. \(P(w_j|c_k = \text{Non-Aggressive})\))

Python
# Sort the DataFrame by Non Aggressive column in descending order
sorted_non_aggressive = feature_probs_transposed.sort_values(by="non_aggressive", ascending=False)
# Display the top rows
print(sorted_non_aggressive.head(10))
            non_aggressive  aggressive
honourable        0.016144    0.012421
government        0.007651    0.011537
friend            0.007021    0.004315
people            0.006734    0.007688
right             0.006161    0.005326
member            0.005261    0.005197
house             0.004348    0.004068
minister          0.004067    0.005968
gentleman         0.004003    0.003303
new               0.003643    0.002762

Highest probability for Aggressive (i.e. \(P(w_j|c_k = \text{Aggressive})\))

Python
# Sort the DataFrame by the aggressive column in descending order
sorted_aggressive = feature_probs_transposed.sort_values(by="aggressive", ascending=False)
# Display the top rows
print(sorted_aggressive.head(10))
            non_aggressive  aggressive
honourable        0.016144    0.012421
government        0.007651    0.011537
people            0.006734    0.007688
minister          0.004067    0.005968
right             0.006161    0.005326
member            0.005261    0.005197
friend            0.007021    0.004315
house             0.004348    0.004068
secretary         0.002694    0.003868
said              0.002781    0.003827

Naive Bayes Application

What are the probabilities that the following speech is aggressive?

Python
# Transform the example speech using the trained vectorizer
example_speech = vectorizer.transform(["You are terrible. I hate you"])

# Calculate probabilities using the trained model
mod_probs = model.predict_proba(example_speech)

# Extract the probabilities for 'Non-Aggressive' and 'Aggressive'
# Probability for class 0 (Non-Aggressive)
avg_prob_nonaggressive = mod_probs[0, 0]
# Probability for class 1 (Aggressive)
avg_prob_aggressive = mod_probs[0, 1]
Python
# Display the results
print(f"Average probability for Nonaggressive: {avg_prob_nonaggressive:.2f}")
Average probability for Nonaggressive: 0.32
print(f"Average probability for Aggressive: {avg_prob_aggressive:.2f}")
Average probability for Aggressive: 0.68

Naive Bayes Application

Plotting aggression over time by gender:

Python
import pandas as pd
import numpy as np
from scipy.stats import t

# Re-create the document-feature matrix for the full corpus
# (same texts as before, so the vocabulary matches the one the model was trained on)
X = vectorizer.fit_transform(texts["body"])
y = texts["aggressive"]
# Predict labels for all texts in the dataset
texts["predicted_label"] = model.predict(X)

# Aggregate the predicted aggression labels by year and gender
aggression_trends = (texts
    .groupby(['year', 'gender'], as_index=False)
    .agg(mean_aggression=('predicted_label', 'mean'),
         sd_aggression=('predicted_label', 'std'),
         n=('predicted_label', 'size'))
)

# Calculating standard error, confidence interval lower and upper bounds
aggression_trends['se'] = aggression_trends['sd_aggression'] / np.sqrt(aggression_trends['n'])
aggression_trends['ci_lower'] = aggression_trends['mean_aggression'] - t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
aggression_trends['ci_upper'] = aggression_trends['mean_aggression'] + t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
R
library(reticulate)
library(ggplot2)
aggression_trends2 <- reticulate::py$aggression_trends
aggression_trends3<-subset(aggression_trends2, year>1997)
ggplot(aggression_trends3, aes(x = year, y = mean_aggression, color = gender)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = gender), alpha = 0.2) +
  labs(
    title = "Aggression Rating Over Time by Gender",
    x = "Year",
    y = "Mean Aggression Rating"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(aggression_trends3$year), max(aggression_trends3$year), by = 1))+
    geom_hline(yintercept = 0)+
    theme_bw()

Naive Bayes Application

It differs slightly from the ChatGPT predictions:

Python
import pandas as pd
import numpy as np
from scipy.stats import t

# Assuming aggression_texts_df2 is your DataFrame
aggression_trends = (aggression_texts
    .groupby(['year', 'gender'], as_index=False)
    .agg(mean_aggression=('aggression_rating', 'mean'),
         sd_aggression=('aggression_rating', 'std'),
         n=('aggression_rating', 'size'))
)

# Calculating standard error, confidence interval lower and upper bounds
aggression_trends['se'] = aggression_trends['sd_aggression'] / np.sqrt(aggression_trends['n'])
aggression_trends['ci_lower'] = aggression_trends['mean_aggression'] - t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
aggression_trends['ci_upper'] = aggression_trends['mean_aggression'] + t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
R
library(reticulate)
library(ggplot2)
aggression_trends2 <- reticulate::py$aggression_trends
aggression_trends3<-subset(aggression_trends2, year>1997)
ggplot(aggression_trends3, aes(x = year, y = mean_aggression, color = gender)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = gender), alpha = 0.2) +
  labs(
    title = "Aggression Rating Over Time by Gender",
    x = "Year",
    y = "Mean Aggression Rating"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(aggression_trends3$year), max(aggression_trends3$year), by = 1))+
    geom_hline(yintercept = 0)+
    theme_bw()

Purpose of Naive Bayes

The overall purpose of Naive Bayes is to make out-of-sample predictions

Generally, the idea is that we have a small hand-coded training dataset and then we predict for lots of other speeches.

Let us now try another text and see the probability that it is aggressive or not.

Python
# Define the simple sentence
simple_sentence = ["We need to invest in healthcare and education for the future."]

# Transform the sentence using the trained vectorizer
simple_X = vectorizer.transform(simple_sentence)

# Predict probabilities using the trained model
simple_probs = model.predict_proba(simple_X)

# Extract probabilities
prob_non_aggressive = simple_probs[0, 0]  # Probability for Non-Aggressive (class 0)
prob_aggressive = simple_probs[0, 1]        # Probability for Aggressive (class 1)
Python
# Display the results
print(f"Probability of Non-Aggressive: {prob_non_aggressive:.2f}")
Probability of Non-Aggressive: 0.86
print(f"Probability of Aggressive: {prob_aggressive:.2f}")
Probability of Aggressive: 0.14

Notes about Naive Bayes

Naive Bayes is simple to apply:

  • it can be used over multiple categories
  • it can be used over various text representations, such as bigrams and trigrams (see the sketch below)
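For instance, switching to unigrams plus bigrams is a one-line change to the vectorizer (a sketch reusing the texts data frame from the application above):

Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Count unigrams and bigrams instead of single words only.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_bigrams = bigram_vectorizer.fit_transform(texts["body"])

bigram_model = MultinomialNB()
bigram_model.fit(X_bigrams, texts["aggressive"])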

Naive Bayes and the Independence Assumption

  • Naive Bayes (NB) assumes that all words in a text contribute to predictions independently.
  • This means it cannot account for word combinations or interactions.
  • Example:
    • “Crush the opposition” may lean Aggressive.
    • “Collaborate with others” may lean Non-Aggressive.
    • NB might misclassify based on the words “opposition” and “collaborate” alone.
  • This assumption often oversimplifies the complexity of political speeches.

Challenges with Overconfidence

  • NB treats each word as a new piece of information.
    • Frequent words like “battle” or “support” can overly influence predictions.
  • Context is ignored:
    • “Grow through diplomacy” suggests Non-Aggressive policies.
    • “Destroy all opposition” suggests Aggressive priorities.
  • Political language often relies on nuances and interactions that NB cannot capture.

Alternatives to Naive Bayes

  • Other methods, like Support Vector Machines (SVM), address NB’s limitations:
    • Allow non-linear interactions between words.
    • Adjust probabilities based on word frequency and combinations.
  • Example:
    • Repeated mentions of “battle” strengthen Aggressive classification.
    • Repeated mentions of “support” strengthen Non-Aggressive classification.
  • SVM and other advanced models are better suited for nuanced text analysis.

Validating Supervised Learning

An important step is to measure the degree to which the predictions we make correspond to the observed data

The important measures are:

  • accuracy - proportion of all predictions that match the observed data
  • sensitivity - proportion of actual positive cases that are correctly predicted (the “true positive” rate)
  • specificity - proportion of actual negative cases that are correctly predicted (the “true negative” rate)

To get informative estimates of these quantities, we need to distinguish between the performance of the classifier on the training set and the test set
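Here is a sketch of how these quantities can be computed for the test-set predictions from the application above (it assumes y_test and y_pred from earlier, with class 1 = aggressive as the “positive” class):

Python
from sklearn.metrics import confusion_matrix, accuracy_score

# Rows of the confusion matrix are observed classes, columns are predictions.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy    = accuracy_score(y_test, y_pred)  # (tp + tn) / all predictions
sensitivity = tp / (tp + fn)                  # true positive rate
specificity = tn / (tn + fp)                  # true negative rate

print(f"Accuracy:    {accuracy:.3f}")
print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")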

K-Fold Validation

An alternative to the train-test split method is K-Fold Validation

This approach involves randomly dividing the set of observations into \(k\) groups, or folds, of approximately equal size.

K-Fold Validation reduces the variance of the performance estimate and allows you to use more data for training.

It also helps you avoid overfitting, as it exposes your model to different subsets of the data.

The typical choice has been \(k=10\). For each fold, K-Fold Validation entails 3 steps:

  • train the Naive Bayes model on all observations not included in the fold
  • generate predictions for the observations in the held-out fold
  • calculate performance metrics for the predictions in the held-out fold

K-Fold Validation

Here is how we implement K-Fold validation

Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold  # For k-fold cross-validation
from sklearn.metrics import classification_report, accuracy_score
from pretty_html_table import build_table
from IPython.display import display, HTML

# Step 1: Convert text data into a Document-Feature Matrix
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts["body"])  # Transform text into numerical features
y = texts["aggressive"]  # Target variable

# Step 2: Initialize the k-fold cross-validator
k = 10  # Number of folds
kf = StratifiedKFold(n_splits=k, shuffle=True, random_state=125)  # Stratified to preserve class distribution

# Step 3: Perform k-fold validation
fold = 1  # To track the fold number
for train_index, test_index in kf.split(X, y):
    print(f"Fold {fold}:")
    
    # Split data into training and testing sets for the current fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Train the Naive Bayes model
    model = MultinomialNB()
    model.fit(X_train, y_train)
    
    # Test the model
    y_pred = model.predict(X_test)
    
    # Print evaluation metrics
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print("-" * 50)
    
    fold += 1

K-Fold Validation

Here are the per-fold classification results:


precision recall f1-score support fold
0 0.793028 0.806648 0.79978 1805.0 1
1 0.660836 0.641509 0.651029 1060.0 1
accuracy 0.74555 1
macro avg 0.726932 0.724079 0.725405 2865.0 1
weighted avg 0.744119 0.74555 0.744745 2865.0 1
0 0.786868 0.809972 0.798253 1805.0 2
1 0.659384 0.626415 0.642477 1060.0 2
accuracy 0.742059 2
macro avg 0.723126 0.718194 0.720365 2865.0 2
weighted avg 0.739701 0.742059 0.740618 2865.0 2
0 0.79564 0.808864 0.802198 1805.0 3
1 0.665049 0.646226 0.655502 1060.0 3
accuracy 0.748691 3
macro avg 0.730344 0.727545 0.72885 2865.0 3
weighted avg 0.747324 0.748691 0.747923 2865.0 3
0 0.79593 0.801662 0.798786 1805.0 4
1 0.658071 0.65 0.65401 1060.0 4
accuracy 0.74555 4
macro avg 0.727 0.725831 0.726398 2865.0 4
weighted avg 0.744924 0.74555 0.745221 2865.0 4
0 0.78453 0.786704 0.785615 1805.0 5
1 0.635071 0.632075 0.63357 1060.0 5
accuracy 0.729494 5
macro avg 0.709801 0.70939 0.709593 2865.0 5
weighted avg 0.729233 0.729494 0.729361 2865.0 5
0 0.785329 0.807095 0.796063 1804.0 6
1 0.655446 0.624528 0.639614 1060.0 6
accuracy 0.739525 6
macro avg 0.720387 0.715812 0.717838 2864.0 6
weighted avg 0.737258 0.739525 0.738159 2864.0 6
0 0.792504 0.808758 0.800549 1804.0 7
1 0.662757 0.639623 0.650984 1060.0 7
accuracy 0.746159 7
macro avg 0.72763 0.72419 0.725766 2864.0 7
weighted avg 0.744483 0.746159 0.745193 2864.0 7
0 0.785286 0.79878 0.791976 1804.0 8
1 0.64723 0.628302 0.637626 1060.0 8
accuracy 0.735684 8
macro avg 0.716258 0.713541 0.714801 2864.0 8
weighted avg 0.73419 0.735684 0.734849 2864.0 8
0 0.788663 0.793906 0.791276 1805.0 9
1 0.644699 0.637394 0.641026 1059.0 9
accuracy 0.736034 9
macro avg 0.716681 0.71565 0.716151 2864.0 9
weighted avg 0.73543 0.736034 0.735719 2864.0 9
0 0.795817 0.801108 0.798454 1805.0 10
1 0.657116 0.649669 0.653371 1059.0 10
accuracy 0.745112 10
macro avg 0.726466 0.725389 0.725913 2864.0 10
weighted avg 0.744531 0.745112 0.744808 2864.0 10

Comments

Accuracy ranges between 72.9% and 74.8%, indicating stable performance.

Performance slightly favors class 0 due to imbalance.

Class-Specific Observations:

  • Class 0: Higher precision and recall; model predicts this class more reliably.
  • Class 1: Lower precision and recall; more difficult to predict due to fewer samples.

Key Takeaways:

  • The model is reliable for the majority class (0), with room for improvement for class 1.
  • Potential next steps include addressing class imbalance (oversampling/undersampling), as sketched below
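A minimal sketch of the oversampling idea, assuming the texts data frame and the fitted vectorizer from earlier: duplicate minority-class rows in the training portion only, then refit and evaluate.

Python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.utils import resample
import pandas as pd

# Split the labelled data frame, then rebalance only the training portion.
train_df, test_df = train_test_split(texts, test_size=0.33, random_state=125)
majority = train_df[train_df["aggressive"] == 0]
minority = train_df[train_df["aggressive"] == 1]

# Sample the minority class with replacement until the two classes are balanced.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=125)
balanced = pd.concat([majority, minority_upsampled])

# Reuse the fitted vectorizer to build DFMs, then refit and evaluate.
model_bal = MultinomialNB().fit(vectorizer.transform(balanced["body"]),
                                balanced["aggressive"])
y_pred_bal = model_bal.predict(vectorizer.transform(test_df["body"]))
print(classification_report(test_df["aggressive"], y_pred_bal))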

Other Models

Naive Bayes is only one of many models that we can use for out-of-sample classification

Other types of models include:

  1. Regularized Logistic Regression
    • Directly models the probability that each document is in class \(k\) using logistic regression
    • Regularization is required to prevent overfitting the data
  2. Support Vector Machines
    • SVMs draw a hyperplane through the multidimensional word space that best separates documents into different classes
    • Can accommodate non-linear boundaries between classes
  3. Tree-based Methods
    • Tree-based methods separate classes by segmenting the predictors (word counts) into a number of distinct regions
    • Like the SVM, this allows for non-linear relationships between features and categories
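A sketch of how these alternatives can be tried on the same document-feature matrix, reusing the train/test split from earlier (default hyperparameters, for illustration only):

Python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Default settings; none of these models is tuned here.
models = {
    "Regularized logistic regression": LogisticRegression(max_iter=1000),  # L2 penalty by default
    "Linear SVM": LinearSVC(),
    "Random forest (tree-based)": RandomForestClassifier(n_estimators=200, random_state=125),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}")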

Conclusion

Supervised learning for text data enables us to identify patterns and associations between words and specific outcome categories.

The Naive Bayes model, a simple yet efficient approach, is quick to implement and often delivers strong classification results, despite relying on certain assumptions.

Once supervised learning classifiers are trained, it is crucial to validate their performance on a test set that was not used during model training.

Cross-validation is a widely used technique for out-of-sample evaluation, helping to compare and select the best-performing models.