Supervised learning lets us classify documents into pre-defined categories using labelled data.
Unsupervised learning analyzes and clusters unlabelled data sets. These algorithms discover hidden patterns in data without the need for human intervention (hence, they are “unsupervised”).
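Supervised learning can be conceptualized as a generalization of dictionary methods.
In dictionary methods, the words associated with each category are specified in advance by the researcher; in supervised learning, these associations are learned from pre-labelled training data.
In dictionary methods, words typically have a weight of 0 or 1; in supervised learning, words are weighted according to their relative prevalence in each category.
In both cases, documents are scored based on the words they contain.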
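To make the dictionary side of this comparison concrete, here is a minimal sketch. The two word lists and the `score_document` helper are hypothetical and chosen only for illustration; they are not part of the lecture data.

```python
# Hypothetical dictionary: each category lists the words that carry weight 1;
# every other word implicitly has weight 0.
DICTIONARY = {
    "aggressive": {"battle", "destroy", "horrible"},
    "non_aggressive": {"diplomacy", "support", "grow"},
}


def score_document(text, dictionary=DICTIONARY):
    """Count, for each category, how many tokens of `text` are in its word list."""
    tokens = text.lower().split()
    return {
        category: sum(1 for token in tokens if token in words)
        for category, words in dictionary.items()
    }


scores = score_document("We will destroy them in battle and support our allies")
print(scores)  # {'aggressive': 2, 'non_aggressive': 1}
```

A supervised classifier keeps the same scoring idea but replaces the hand-written 0/1 weights with weights estimated from labelled documents.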
Introduction
Within supervised learning, the features associated with each category (and their relative weight) are learned from the data.
Supervised learning methods will often outperform dictionary methods in classification tasks, particularly when the training sample is large.
Components of Supervised Learning
Labelled Datasets
There is a labelled dataset (usually hand-coded) that assigns texts to different categories
training set: used to train the classifier
test set: used to validate the classifier
Classification method
this will be used to learn the relationship between coded texts and words
Validation method
we use validation metrics such as the confusion matrix, accuracy, sensitivity, and specificity
Out-of-sample prediction
We will use the model to predict categories for documents that do not have labels
Creating a labelled dataset
We can label data by asking annotators to code texts. E.g.:
undergraduate students code texts into particular categories
crowd-sourced workers on the internet code texts into particular categories
ChatGPT can be used to code texts into particular categories
Intuition: Naive Bayes
Imagine you’re sorting emails (Spam vs. Not Spam)
Naive Bayes helps classify emails as spam or not spam based on the words they contain. For example:
Words like “win,” “free,” and “prize” might appear more in spam emails.
Words like “meeting,” “project,” and “agenda” might appear more in not-spam emails.
The goal is to calculate which category (spam or not spam) an email is most likely to belong to.
Step 1: What’s the Question?
We want to figure out:
Note
How likely is it that this email belongs to a specific category (e.g., spam) given the words it contains?
This is called the posterior probability:
\[P(\text{Category} \mid \text{Words})\]
Step 2: What Information Do We Use?
To make this decision, Naive Bayes combines:
1. How common is the category overall? (Prior Probability)
\[P(\text{Category})\]
For instance, if 60% of emails are spam, then \(P(\text{Spam})=0.6\)
2. How likely are these words within each category? (Likelihood)
\[P(\text{Words} \mid \text{Category})\]
For example, the word “win” might appear in 50% of spam emails but only 2% of not-spam emails.
3. How often do the words appear overall? (Normalization Factor):
\[P(\text{Words})\] This ensures probabilities add up to 1 across all categories.
Step 3: Simplify the Math with an Analogy
Imagine you’re at a dog park. Some dogs are Labradors, and others are Poodles. If you see a black dog, you might ask:
Note
Is this dog more likely to be a Labrador or a Poodle?
1. Prior Probability: Start with what you know about the park:
70% of the dogs are Labradors: \(P(\text{Labrador})=0.7\)
30% are Poodles: \(P(\text{Poodle})=0.3\)
2. Likelihood: Look at traits like coat color:
Black Labradors are common: \(P(\text{Black} \mid \text{Labrador}) = 0.8\)
Black Poodles are less common: \(P(\text{Black} \mid \text{Poodle}) = 0.3\)
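3. Combine Information: Naive Bayes combines these probabilities using Bayes' rule. Returning to the spam example:
Suppose the prior probability of spam is \(P(\text{Spam})=0.6\), so that \(P(\text{Not Spam})=0.4\)
Suppose the likelihood of the word “win” is \(P(\text{"win"} \mid \text{Spam})=0.5\) in spam emails and \(P(\text{"win"} \mid \text{Not Spam})=0.02\) in not-spam emails
The overall probability of the word “win” is then \(P(\text{"win"}) = 0.6 \times 0.5 + 0.4 \times 0.02 = 0.308\)
The posterior probability of spam given the word “win” is:
\[
P(\text{Spam} \mid \text{"win"}) = \frac{P(\text{Spam}) \times P(\text{"win"} \mid \text{Spam})}{P(\text{"win"})} = \frac{0.6 \times 0.5}{0.308} \approx 0.97
\]
Because \(P(\text{"win"})\) is computed across all categories, the posteriors for spam and not spam add up to 1.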
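The same calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming exactly two categories in each example; the `posterior` helper is a hypothetical name, and the numbers are the ones used in the dog park analogy and in Step 2 (60% spam, “win” in 50% of spam and 2% of not-spam emails).

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule to one piece of evidence.

    priors:      {category: P(category)}
    likelihoods: {category: P(evidence | category)}
    Returns      {category: P(category | evidence)}.
    """
    # Numerator of Bayes' rule for each category: prior x likelihood.
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    # Normalization factor P(evidence), via the law of total probability.
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Dog park: is a black dog more likely to be a Labrador or a Poodle?
dogs = posterior(
    priors={"Labrador": 0.7, "Poodle": 0.3},
    likelihoods={"Labrador": 0.8, "Poodle": 0.3},
)
print({k: round(v, 3) for k, v in dogs.items()})

# Spam: how likely is an email containing "win" to be spam?
spam = posterior(
    priors={"spam": 0.6, "not_spam": 0.4},
    likelihoods={"spam": 0.5, "not_spam": 0.02},
)
print({k: round(v, 3) for k, v in spam.items()})
```

Dividing by the sum of the numerators is what keeps every posterior between 0 and 1; a value above 1 signals that the normalization factor is inconsistent with the priors and likelihoods.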
Step 4: Make the Final Decision
Compare the posterior probabilities for all categories (spam and not spam).
We assign the email to the category with the highest probability.
Methods: Naive Bayes
Steps
For example, we can try to label aggression in political speeches, based on a small sample coded with ChatGPT.
1. Train Language Models for Each Category
Example: Build a model for Aggressive speeches and another for Non-Aggressive speeches.
Each model calculates the probability of words appearing in its category.
Aggressive speeches often use words like “battle,” “destroy,” “win,” “horrible.”
Non-Aggressive speeches often use non-aggressive, neutral words like “diplomacy,” “support,” or “grow.”
2. Get a New Document
Example: A campaign speech or policy statement.
3. Calculate Probabilities for Each Model
Compute the likelihood that the text was “written” by the Aggressive language model vs the Non-Aggressive model:
Aggressive Model: High probabilities for “battle,” “destroy,” “win,” “horrible.”
Non-Aggressive Model: High probabilities for “diplomacy,” “support,” or “grow.”
4. Assign the Most Likely Category
If the text mentions “battle” and “destroy” - Aggressive.
If the text emphasizes “diplomacy” and “support” - Non-Aggressive.
Language Models
We can represent these different “models” for language using a probability distribution over the words in the vocabulary:
A probability distribution over a discrete variable must have three properties:
Each element must be greater than or equal to zero
Each element must be less than or equal to one
The sum of the elements must be 1
Tone            Battle  Destroy  Win   Diplomacy  Support  Grow
Aggressive      0.30    0.25     0.20  0.05       0.10     0.10
Non-Aggressive  0.05    0.10     0.15  0.30       0.25     0.15
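As a rough illustration of how these two language models score a text, the sketch below checks the three properties of a probability distribution and then computes how likely a short document is under each model. The word probabilities are copied from the table; the `log_likelihood` helper and the three-word document are assumptions made for the example.

```python
import math

# Word probabilities copied from the table above.
LANGUAGE_MODELS = {
    "Aggressive": {"battle": 0.30, "destroy": 0.25, "win": 0.20,
                   "diplomacy": 0.05, "support": 0.10, "grow": 0.10},
    "Non-Aggressive": {"battle": 0.05, "destroy": 0.10, "win": 0.15,
                       "diplomacy": 0.30, "support": 0.25, "grow": 0.15},
}

# Check the three properties of a discrete probability distribution.
for name, model in LANGUAGE_MODELS.items():
    assert all(0.0 <= p <= 1.0 for p in model.values()), name
    assert math.isclose(sum(model.values()), 1.0), name


def log_likelihood(tokens, model):
    """Sum of log word probabilities; words outside the vocabulary are skipped."""
    return sum(math.log(model[t]) for t in tokens if t in model)


document = "battle destroy win".split()
for name, model in LANGUAGE_MODELS.items():
    # Exponentiate to show the likelihood on the probability scale.
    print(name, round(math.exp(log_likelihood(document, model)), 5))
```

Working with sums of logs rather than products of probabilities avoids numerical underflow once documents contain hundreds of words.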
Probability Distributions
Definition:
A mathematical function that shows how likely different outcomes are.
For discrete events, it assigns probabilities to each possible event.
Naive Bayes
Naive Bayes is a model that classifies documents into categories on the basis of the words they contain. It applies Bayes' rule:
\[
P(y_i = C_k | W_i) = \frac{P(y_i = C_k) \times P(W_i | y_i = C_k)}{P(W_i)}
\]
\(P(y_i=C_k | W_i)\) - posterior probability
the probability that document \(i\) is in category \(k\), given the words in the document and the prior probability of category \(k\)
\(P(W_i|y_i = C_k)\) - conditional probability
the probability that we would observe the words in \(W_i\) if the document were from category \(k\)
\(P(y_i=C_k)\) - prior probability that the document is from category \(k\)
the probability of the category of the document, absent any information about the words it contains (that is, before we observe any text)
\(P(W_i)\) - unconditional probability of the words in document \(i\)
the probability that we would observe the words in \(W_i\) across all categories
Naive Bayes Caveats
By treating documents as bag-of-words, we assume:
Conditional independence - Knowing a document contains one word doesn’t tell us anything about the probability of observing other words in that document
Positional independence of word counts - The position of a word within a document doesn’t give us any information about the category of that document
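A small sketch of what the bag-of-words assumption throws away. The two sentences are invented for illustration: they contain the same words in a different order and mean opposite things, yet they produce identical word counts, so any classifier that only sees counts must treat them the same.

```python
from collections import Counter


def bag_of_words(text):
    """Lower-case, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())


a = bag_of_words("we will not fight we will negotiate")
b = bag_of_words("we will fight we will not negotiate")

print(a == b)  # True: same counts, so Naive Bayes scores both identically
```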
Naive Bayes Classification
The classification decision made by the Naive Bayes model is simple: we assign document \(i\) to the category, \(C_k\), for which it has the highest posterior probability:
\[
\hat{Y_i} = \underset{k \in \{1,...,K\}}{\mathrm{argmax}} P(y_i = C_k) \times P (W_i|y_i=C_k)
\] where \(\mathrm{argmax}\) means “which category \(k\) has the maximum posterior probability”. The denominator \(P(W_i)\) is the same for every category, so it can be dropped without changing which category wins.
Logic:
assign documents to categories when the probability of observing the words in that document is high given the probability distribution for that category (i.e. when \(P(W_i|y_i = C_k)\) is large)
assign more documents to categories that contain more documents in the training data (i.e. when \(P(y_i = C_k)\) is large)
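Both parts of this logic can be seen in a minimal sketch of the decision rule, written in log space. The `classify` helper, the two-word vocabulary, and the priors are hypothetical; the point is only that the same document can change category when the prior changes.

```python
import math


def classify(tokens, priors, word_probs):
    """Return (best_category, scores), where each score is
    log P(category) + sum of log P(word | category)."""
    scores = {}
    for category, prior in priors.items():
        score = math.log(prior)
        for token in tokens:
            if token in word_probs[category]:  # ignore out-of-vocabulary words
                score += math.log(word_probs[category][token])
        scores[category] = score
    best = max(scores, key=scores.get)
    return best, scores


# Illustrative numbers only.
word_probs = {
    "Aggressive": {"win": 0.20, "support": 0.10},
    "Non-Aggressive": {"win": 0.15, "support": 0.25},
}
tokens = ["win"]

# Equal priors: the likelihood decides.
label_equal, _ = classify(tokens, {"Aggressive": 0.5, "Non-Aggressive": 0.5}, word_probs)
# Skewed priors: the more common category wins despite a lower likelihood.
label_skewed, _ = classify(tokens, {"Aggressive": 0.2, "Non-Aggressive": 0.8}, word_probs)

print(label_equal, label_skewed)
```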
Naive Bayes Application
Here is how we apply Naive Bayes in Python:
Python
import pandas as pd
import numpy as np

aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")

# Saving only specific columns
aggression_texts = aggression_texts[["name", "party_short", "year", "body", "gender", "aggression_rating"]]

# Recode aggression_rating to new columns for aggressive and non-aggressive
aggression_texts['aggressive'] = (aggression_texts['aggression_rating'] == 1).astype(int)
aggression_texts['non_aggressive'] = (aggression_texts['aggression_rating'] == 0).astype(int)

# Work on a copy of the data (no rows are dropped here)
texts = aggression_texts.copy()

# Step 1: Add an identifier column
texts["id"] = texts.index
texts.head(8)
name party_short year ... aggressive non_aggressive id
0 Mr Gerry Bermingham Labour 1992 ... 1 0 0
1 Richard Benyon Conservative 2018 ... 0 1 1
2 Penny Mordaunt Conservative 2014 ... 0 1 2
3 Gerry Sutcliffe Labour 2004 ... 0 1 3
4 Debbie Abrahams Labour 2014 ... 1 0 4
5 Stephen Metcalfe Conservative 2012 ... 0 1 5
6 Mr John Gunnell Labour 1996 ... 0 1 6
7 Julian Lewis Conservative 2000 ... 1 0 7
[8 rows x 9 columns]
Pre-processing Text
Python
from nltk.corpus import stopwords
import nltk

# Download the NLTK stopword list if not already done
nltk.download('stopwords')

# Get the English stopwords list
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to each speech
texts['body'] = texts['body'].apply(remove_stopwords)
True
Test-set approach
We randomly divide the available set of samples into two parts: a training set and a test set.
The model is fit on the training set, and the fitted model is used to predict the responses for the test set.
We then calculate classification performance scores (accuracy, sensitivity, specificity, etc) for the test set.
Naive Bayes Application
Here is how we apply Naive Bayes in Python:
Python
# Import necessary libraries for text processing and machine learning
# For converting text into numerical data
from sklearn.feature_extraction.text import CountVectorizer
# For training a Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
# For splitting data into training and testing sets
from sklearn.model_selection import train_test_split
# For evaluating the model's performance
from sklearn.metrics import classification_report

# Step 1: Convert text data into a Document-Feature Matrix
# `CountVectorizer` turns text data into a matrix where rows are documents and columns are unique words.
# Each cell contains the count of a word's occurrence in a document.
# No `stop_words` argument is passed here because stopwords were already removed with NLTK above.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts["body"])  # Convert the "body" column of the dataset into numerical data
y = texts["aggressive"]  # Set the target variable to the "aggressive" column, which contains category labels (0 or 1)
The model is fit on the training set, and the fitted model is used to predict the responses for the test set.
Python
# Step 2: Split the data into training and testing sets
# Training data is used to build the model, and testing data is used to evaluate it.
# `test_size=0.33` means 33% of the data will be used for testing.
# `random_state=125` ensures consistent results each time you run the code.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=125)
How many Aggressive speeches are there in the training and test sets?
Python
# Count Aggressive and Non-Aggressive speeches in the training and test sets
aggressive_train_count = (y_train == 1).sum()
non_aggressive_train_count = (y_train == 0).sum()
aggressive_test_count = (y_test == 1).sum()
non_aggressive_test_count = (y_test == 0).sum()
print(f"Non-Aggressive speeches test count: {non_aggressive_test_count}")
Non-Aggressive speeches test count: 5959
The model is fit on the training set.
Python
# Step 3: Train a Naive Bayes model
# Multinomial Naive Bayes is a simple algorithm often used for text classification.
# The model learns from the training data to associate words with the target labels (e.g., Non-Aggressive or Aggressive).
model = MultinomialNB()
model.fit(X_train, y_train)  # Train the model using the training data
MultinomialNB()
The fitted model is used to predict the responses for the test set.
And finally, we predict the category of each speech in the test set:
Python
# Step 4: Evaluate the model's performance
# Predict the labels for the testing data
y_pred = model.predict(X_test)
We then calculate classification performance scores (accuracy, sensitivity, specificity, etc) for the test set.
Python
# Print a detailed report showing metrics like precision, recall, and F1-score for each class
print(classification_report(y_test, y_pred))
Precision: Correct predictions for each class relative to predictions made.
Recall: Correct predictions relative to actual instances.
F1-Score: Balance of precision and recall.
Support: 5959 (Class 0), 3494 (Class 1).
Insights:
The model struggles to balance precision and recall for both classes.
Better recall for Class 1.0 indicates fewer false negatives.
This is how we figure out the most relevant words associated with every category.
Python
import numpy as np

# Step 5: Exponentiate the model's log probabilities to get probabilities
feature_probs_exp = np.exp(model.feature_log_prob_)  # Convert log probabilities to regular probabilities

# Step 6: Relabel classes (rows follow the order of model.classes_, i.e. 0 then 1)
classes = ['non_aggressive', 'aggressive']
feature_names = vectorizer.get_feature_names_out()  # Vocabulary from vectorizer
feature_probs_df = pd.DataFrame(
    feature_probs_exp,      # Use the probabilities
    columns=feature_names,  # Feature names as column names
    index=classes           # Class labels as row index
)

# Step 7: Transpose the DataFrame for easier analysis
# Flip the DataFrame so rows become words and columns are the classes
feature_probs_transposed = feature_probs_df.T

# Print the final DataFrame showing the probabilities of words for each class
print(feature_probs_transposed)
Remember that we try to measure the probability of observing word \(j\) given class \(k\):
\[
\mu_{j(k)}=\frac{W_{j(k)}}{\sum_{j \in V} W_{j(k)}}
\] These are the probabilities for our speeches
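The formula above is a plain share of word counts. As a hedged aside, scikit-learn's `MultinomialNB` applies additive smoothing by default (`alpha=1.0`), so the values in `feature_log_prob_` correspond to a smoothed version of this share rather than the raw one. A minimal sketch of the difference, using hypothetical counts and a hypothetical `word_probabilities` helper:

```python
def word_probabilities(counts, alpha=1.0):
    """Estimate P(word | class) from word counts within one class.

    counts: {word: number of occurrences in the class's training documents}
    alpha:  additive smoothing constant; alpha=0 gives the raw share
            W_j / sum_j W_j, alpha=1 is add-one (Laplace) smoothing.
    """
    total = sum(counts.values()) + alpha * len(counts)
    return {word: (n + alpha) / total for word, n in counts.items()}


# Hypothetical counts for one class over a four-word vocabulary.
counts = {"honourable": 6, "government": 3, "friend": 1, "battle": 0}

raw = word_probabilities(counts, alpha=0.0)
smoothed = word_probabilities(counts, alpha=1.0)

# A word never seen in this class gets probability 0 without smoothing,
# which would zero out the likelihood of any document containing it.
print(raw["battle"], round(smoothed["battle"], 4))
```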
We can examine the probability of each word. Here are the top 10 words.
Highest probability for Non-Aggressive (i.e. \(P(w_j|c_k = \text{"Non-Aggressive"})\))
Python
# Sort the DataFrame by the non_aggressive column in descending order
sorted_non_aggressive = feature_probs_transposed.sort_values(by="non_aggressive", ascending=False)
# Display the top rows
print(sorted_non_aggressive.head(10))
non_aggressive aggressive
honourable 0.016144 0.012421
government 0.007651 0.011537
friend 0.007021 0.004315
people 0.006734 0.007688
right 0.006161 0.005326
member 0.005261 0.005197
house 0.004348 0.004068
minister 0.004067 0.005968
gentleman 0.004003 0.003303
new 0.003643 0.002762
Highest probability for Aggressive (i.e. \(P(w_j|c_k = \text{"Aggressive"})\))
Python
# Sort the DataFrame by the aggressive column in descending order
sorted_aggressive = feature_probs_transposed.sort_values(by="aggressive", ascending=False)
# Display the top rows
print(sorted_aggressive.head(10))
non_aggressive aggressive
honourable 0.016144 0.012421
government 0.007651 0.011537
people 0.006734 0.007688
minister 0.004067 0.005968
right 0.006161 0.005326
member 0.005261 0.005197
friend 0.007021 0.004315
house 0.004348 0.004068
secretary 0.002694 0.003868
said 0.002781 0.003827
What is the probability that the following speech is aggressive?
Python
# Transform the text using the trained vectorizer
example_speech = vectorizer.transform(["You are terrible. I hate you"])
# Calculate probabilities using the trained model
mod_probs = model.predict_proba(example_speech)
# Probability for class 0 (Non-Aggressive)
avg_prob_nonaggressive = mod_probs[0, 0]
# Probability for class 1 (Aggressive)
avg_prob_aggressive = mod_probs[0, 1]
Python
# Display the results
print(f"Average probability for Nonaggressive: {avg_prob_nonaggressive:.2f}")
Average probability for Nonaggressive: 0.32
print(f"Average probability for Aggressive: {avg_prob_aggressive:.2f}")
Average probability for Aggressive: 0.68
Plotting aggression over time by gender:
Python
import pandas as pd
import numpy as np
from scipy.stats import t

# Reuse the vectorizer fitted earlier: transform (rather than fit_transform) keeps the
# columns aligned with the vocabulary the model was trained on
X = vectorizer.transform(texts["body"])
y = texts["aggressive"]

# Predict labels for all texts in the dataset
texts["predicted_label"] = model.predict(X)

# Mean predicted aggression, its standard deviation, and group size by year and gender
aggression_trends = (
    texts
    .groupby(['year', 'gender'], as_index=False)
    .agg(mean_aggression=('predicted_label', 'mean'),
         sd_aggression=('predicted_label', 'std'),
         n=('predicted_label', 'size'))
)

# Calculating standard error, confidence interval lower and upper bounds
aggression_trends['se'] = aggression_trends['sd_aggression'] / np.sqrt(aggression_trends['n'])
t_crit = t.ppf(0.975, df=aggression_trends['n'] - 1)
aggression_trends['ci_lower'] = aggression_trends['mean_aggression'] - t_crit * aggression_trends['se']
aggression_trends['ci_upper'] = aggression_trends['mean_aggression'] + t_crit * aggression_trends['se']
R
library(reticulate)
library(ggplot2)

aggression_trends2 <- reticulate::py$aggression_trends
aggression_trends3 <- subset(aggression_trends2, year > 1997)

ggplot(aggression_trends3, aes(x = year, y = mean_aggression, color = gender)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = gender), alpha = 0.2) +
  labs(title = "Aggression Rating Over Time by Gender",
       x = "Year",
       y = "Mean Aggression Rating") +
  scale_x_continuous(breaks = seq(min(aggression_trends3$year), max(aggression_trends3$year), by = 1)) +
  geom_hline(yintercept = 0) +
  # theme_bw() goes before theme() so that it does not reset the rotated axis labels
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
It differs slightly from the ChatGPT predictions:
Python
import pandas as pd
import numpy as np
from scipy.stats import t

# Same summary as for the model predictions, but using the original ChatGPT ratings
aggression_trends = (
    aggression_texts
    .groupby(['year', 'gender'], as_index=False)
    .agg(mean_aggression=('aggression_rating', 'mean'),
         sd_aggression=('aggression_rating', 'std'),
         n=('aggression_rating', 'size'))
)

# Calculating standard error, confidence interval lower and upper bounds
aggression_trends['se'] = aggression_trends['sd_aggression'] / np.sqrt(aggression_trends['n'])
t_crit = t.ppf(0.975, df=aggression_trends['n'] - 1)
aggression_trends['ci_lower'] = aggression_trends['mean_aggression'] - t_crit * aggression_trends['se']
aggression_trends['ci_upper'] = aggression_trends['mean_aggression'] + t_crit * aggression_trends['se']
R
library(reticulate)
library(ggplot2)

aggression_trends2 <- reticulate::py$aggression_trends
aggression_trends3 <- subset(aggression_trends2, year > 1997)

ggplot(aggression_trends3, aes(x = year, y = mean_aggression, color = gender)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = gender), alpha = 0.2) +
  labs(title = "Aggression Rating Over Time by Gender",
       x = "Year",
       y = "Mean Aggression Rating") +
  scale_x_continuous(breaks = seq(min(aggression_trends3$year), max(aggression_trends3$year), by = 1)) +
  geom_hline(yintercept = 0) +
  # theme_bw() goes before theme() so that it does not reset the rotated axis labels
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Purpose of Naive Bayes
The overall purpose of Naive Bayes is to make out-of-sample predictions
Generally, the idea is that we have a small hand-coded training dataset and then we predict for lots of other speeches.
Let us now try another text and see the probability that it is aggressive or not.
Python
# Define the simple sentence
simple_sentence = ["We need to invest in healthcare and education for the future."]
# Transform the sentence using the trained vectorizer
simple_X = vectorizer.transform(simple_sentence)
# Predict probabilities using the trained model
simple_probs = model.predict_proba(simple_X)
# Extract probabilities
prob_non_aggressive = simple_probs[0, 0]  # Probability for Non-Aggressive (class 0)
prob_aggressive = simple_probs[0, 1]  # Probability for Aggressive (class 1)
Python
# Display the results
print(f"Probability of Non-Aggressive: {prob_non_aggressive:.2f}")
Probability of Non-Aggressive: 0.86
print(f"Probability of Aggressive: {prob_aggressive:.2f}")
Probability of Aggressive: 0.14
Notes about Naive Bayes
Naive Bayes can be simple to apply
it can be used over multiple categories
it can be used over various text representations: bigrams, trigrams
Naive Bayes and the Independence Assumption
Naive Bayes (NB) assumes that all words in a text contribute to predictions independently.
This means it cannot account for word combinations or interactions.
Example:
“Crush the opposition” may lean Aggressive.
“Collaborate with others” may lean Non-Aggressive.
NB might misclassify based on the words opposition and collaborate alone
This assumption often oversimplifies the complexity of political speeches.
Challenges with Overconfidence
NB treats each word as a new piece of information.
Frequent words like “battle” or “support” can overly influence predictions.
Context is ignored:
“Grow through diplomacy” suggests Non-Aggressive policies.
“Destroy all opposition” suggests Aggressive priorities.
Political language often relies on nuances and interactions that NB cannot capture.
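A rough sketch of the overconfidence problem, assuming equal priors and using the “battle” probabilities from the language-model table (0.30 under Aggressive, 0.05 under Non-Aggressive). The `posterior_aggressive` helper is hypothetical. Because every repetition is counted as fresh, independent evidence, the posterior races toward 1.

```python
def posterior_aggressive(n_repeats, p_word_aggr=0.30, p_word_non=0.05, prior=0.5):
    """P(Aggressive | one word repeated n times) under the independence assumption."""
    joint_aggr = prior * p_word_aggr ** n_repeats
    joint_non = (1 - prior) * p_word_non ** n_repeats
    return joint_aggr / (joint_aggr + joint_non)


for n in (1, 2, 5):
    print(n, round(posterior_aggressive(n), 4))
```

In real speeches, repeated words are rarely independent pieces of evidence, which is why Naive Bayes probabilities tend to sit closer to 0 or 1 than they should.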
Alternatives to Naive Bayes
Other methods, such as Support Vector Machines (SVMs), address some of NB’s limitations:
With a non-linear kernel, they can capture interactions between words.
They estimate word weights jointly rather than treating each word as independent evidence, so correlated words are less likely to be double-counted.
Example:
Repeated mentions of “battle” strengthen Aggressive classification.
Repeated mentions of “support” strengthen Non-Aggressive classification.
SVMs and other more flexible models are often better suited for nuanced text analysis.
Validating Supervised Learning
An important step is to measure the degree to which the predictions we make correspond to the observed data
The important measures are:
accuracy - proportion of all predictions that match the observed data
sensitivity - proportion of observed positives that are correctly predicted as positive (“true positives”)
specificity - proportion of observed negatives that are correctly predicted as negative (“true negatives”)
To get informative estimates of these quantities, we need to distinguish between the performance of the classifier on the training set and the test set
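These three measures can be computed directly from the four cells of a confusion matrix. The sketch below uses made-up labels and hypothetical helper names, treating 1 (Aggressive) as the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, tn, fn


def metrics(y_true, y_pred, positive=1):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # share of observed positives found (recall)
        "specificity": tn / (tn + fp),  # share of observed negatives found
    }


# Made-up labels: 1 = Aggressive, 0 = Non-Aggressive.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

result = metrics(y_true, y_pred)
print(result)
```

Sensitivity for the positive class is the same quantity that `classification_report` prints as recall for class 1; specificity is the recall for class 0.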
K-Fold Validation
An alternative to the train/test split method is K-Fold Validation
This approach involves randomly dividing the set of observations into \(k\) groups, or folds, of approximately equal size.
K-Fold Validation reduces the variance of the performance estimate and allows you to use more data for training.
It also helps you detect overfitting, as it evaluates your model on different subsets of the data.
A typical choice is \(k=10\). For each fold, K-Fold Validation entails 3 steps:
train the naive Bayes model on all observations not included in the fold
generate predictions for the observations in the held-out fold
calculate performance scores by comparing these predictions with the observed labels
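The fold mechanics can be sketched without any library. This is a simplified, unstratified version with a hypothetical `k_fold_indices` helper; scikit-learn's `StratifiedKFold` additionally preserves the class proportions within each fold.

```python
import random


def k_fold_indices(n_samples, k, seed=125):
    """Yield (train_indices, test_indices) for each of k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        # Training data is everything outside the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


splits = list(k_fold_indices(n_samples=23, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Every observation is held out exactly once, so each one contributes to the performance estimate while never being scored by a model that was trained on it.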
K-Fold Validation
Here is how we implement K-Fold validation
Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold  # For k-fold cross-validation
from sklearn.metrics import classification_report, accuracy_score

# Step 1: Convert text data into a Document-Feature Matrix
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts["body"])  # Transform text into numerical features
y = texts["aggressive"]  # Target variable

# Step 2: Initialize the k-fold cross-validator
k = 10  # Number of folds
kf = StratifiedKFold(n_splits=k, shuffle=True, random_state=125)  # Stratified to preserve class distribution

# Step 3: Perform k-fold validation
fold = 1  # To track the fold number
for train_index, test_index in kf.split(X, y):
    print(f"Fold {fold}:")
    # Split data into training and testing sets for the current fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # Train the Naive Bayes model
    model = MultinomialNB()
    model.fit(X_train, y_train)
    # Test the model
    y_pred = model.predict(X_test)
    # Print evaluation metrics
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print("-" * 50)
    fold += 1
              precision  recall    f1-score  support  fold
0             0.793028   0.806648  0.79978   1805.0   1
1             0.660836   0.641509  0.651029  1060.0   1
accuracy      0.74555                                 1
macro avg     0.726932   0.724079  0.725405  2865.0   1
weighted avg  0.744119   0.74555   0.744745  2865.0   1
0             0.786868   0.809972  0.798253  1805.0   2
1             0.659384   0.626415  0.642477  1060.0   2
accuracy      0.742059                                2
macro avg     0.723126   0.718194  0.720365  2865.0   2
weighted avg  0.739701   0.742059  0.740618  2865.0   2
0             0.79564    0.808864  0.802198  1805.0   3
1             0.665049   0.646226  0.655502  1060.0   3
accuracy      0.748691                                3
macro avg     0.730344   0.727545  0.72885   2865.0   3
weighted avg  0.747324   0.748691  0.747923  2865.0   3
0             0.79593    0.801662  0.798786  1805.0   4
1             0.658071   0.65      0.65401   1060.0   4
accuracy      0.74555                                 4
macro avg     0.727      0.725831  0.726398  2865.0   4
weighted avg  0.744924   0.74555   0.745221  2865.0   4
0             0.78453    0.786704  0.785615  1805.0   5
1             0.635071   0.632075  0.63357   1060.0   5
accuracy      0.729494                                5
macro avg     0.709801   0.70939   0.709593  2865.0   5
weighted avg  0.729233   0.729494  0.729361  2865.0   5
0             0.785329   0.807095  0.796063  1804.0   6
1             0.655446   0.624528  0.639614  1060.0   6
accuracy      0.739525                                6
macro avg     0.720387   0.715812  0.717838  2864.0   6
weighted avg  0.737258   0.739525  0.738159  2864.0   6
0             0.792504   0.808758  0.800549  1804.0   7
1             0.662757   0.639623  0.650984  1060.0   7
accuracy      0.746159                                7
macro avg     0.72763    0.72419   0.725766  2864.0   7
weighted avg  0.744483   0.746159  0.745193  2864.0   7
0             0.785286   0.79878   0.791976  1804.0   8
1             0.64723    0.628302  0.637626  1060.0   8
accuracy      0.735684                                8
macro avg     0.716258   0.713541  0.714801  2864.0   8
weighted avg  0.73419    0.735684  0.734849  2864.0   8
0             0.788663   0.793906  0.791276  1805.0   9
1             0.644699   0.637394  0.641026  1059.0   9
accuracy      0.736034                                9
macro avg     0.716681   0.71565   0.716151  2864.0   9
weighted avg  0.73543    0.736034  0.735719  2864.0   9
0             0.795817   0.801108  0.798454  1805.0   10
1             0.657116   0.649669  0.653371  1059.0   10
accuracy      0.745112                                10
macro avg     0.726466   0.725389  0.725913  2864.0   10
weighted avg  0.744531   0.745112  0.744808  2864.0   10
Comments
Accuracy ranges between 72.9% and 74.9% across the ten folds, indicating stable performance.
Performance slightly favors class 0 due to imbalance.
Class-Specific Observations:
Class 0: Higher precision and recall; model predicts this class more reliably.
Class 1: Lower precision and recall; more difficult to predict, likely in part because it has fewer samples.
Key Takeaways:
The model is reliable for the majority class (0), with room for improvement for class 1.
Potential next steps include addressing class imbalance (oversampling/undersampling)
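As one possible way to act on the class imbalance, here is a minimal sketch of random oversampling. The `random_oversample` helper and the toy data are hypothetical. It should be applied to the training portion only, so that duplicated examples never leak into the test set.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=125):
    """Duplicate randomly chosen minority-class examples until all classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra examples with replacement to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Toy training data: four Non-Aggressive (0) and two Aggressive (1) examples.
train_texts = ["calm a", "calm b", "calm c", "calm d", "angry a", "angry b"]
train_labels = [0, 0, 0, 0, 1, 1]

balanced_texts, balanced_labels = random_oversample(train_texts, train_labels)
print(Counter(balanced_labels))
```

Oversampling changes the class proportions the model sees, which for Naive Bayes also shifts the estimated prior; whether that helps depends on which kind of error matters more for the application.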
Other Models
Naive Bayes is only one of the many models that we can use to make out-of-sample classifications
Other types of models include:
Regularized Logistic Regression
Directly models the probability that each document is in each class using logistic regression
Regularization is required to prevent overfitting the training data
Support Vector Machines
SVMs draw a hyperplane through the multidimensional word space that best separates documents into different classes
Can accommodate non-linear boundaries between classes
Tree-based Methods
Tree-based methods separate classes by segmenting the predictors (word counts) into a number of distinct regions
Like the SVM, this allows for non-linear relationships between features and categories
Conclusion
Supervised learning for text data enables us to identify patterns and associations between words and specific outcome categories.
The Naive Bayes model, a simple yet efficient approach, is quick to implement and often delivers strong classification results, despite relying on certain assumptions.
Once supervised learning classifiers are trained, it is crucial to validate their performance on a test set that was not used during model training.
Cross-validation is a widely used technique for out-of-sample evaluation, helping to compare and select the best-performing models.
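One detail worth noting about the code earlier in this lecture: the vectorizer is fitted on every document before the data are split, so the vocabulary is learned partly from the held-out documents. A minimal sketch of one way to avoid this, assuming scikit-learn is installed and using a toy corpus invented for illustration, wraps the vectorizer and the classifier in a pipeline so that both are refitted on the training portion of each fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration (1 = Aggressive, 0 = Non-Aggressive).
docs = [
    "we will destroy the opposition", "this is a horrible battle",
    "they want to crush and destroy", "a horrible and shameful attack",
    "we support diplomacy and growth", "let us grow through cooperation",
    "diplomacy will support our allies", "we welcome support and cooperation",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline refits the vectorizer inside every training fold.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=125)
scores = cross_val_score(pipeline, docs, labels, cv=cv, scoring="accuracy")

print(scores.mean())
```

The same pipeline object can then be compared against alternatives (for example, a regularized logistic regression in place of `MultinomialNB`) by swapping the final step and rerunning the cross-validation.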