L20: Supervised Learning

Bogdan G. Popescu

John Cabot University

Introduction

Supervised learning allows us to classify documents into pre-defined categories using labelled data

Unsupervised learning, by contrast, analyzes and clusters unlabelled data sets. These algorithms discover hidden patterns in data without the need for human intervention (hence, they are “unsupervised”).

Supervised learning can be conceptualized as a generalization of dictionary methods. In the dictionary case:

  • specific words are associated with specific categories defined by the researcher (they play the role of pre-labelled training data)
  • words have a weight of 0 or 1, rather than a weight learned from their relative prevalence in each category
  • documents are scored based on the words they contain

Introduction

Within supervised learning, the features associated with each category (and their relative weight) are learned from the data.

Supervised learning methods will often outperform dictionary methods in classification tasks, particularly when the training sample is large.

Components of Supervised Learning

Labelled Datasets

There is a labelled dataset (usually hand-coded) that places texts into different categories

  • training set: used to train the classifier
  • test set: used to validate the classifier

Classification method

  • this will be used to learn the relationship between coded texts and words

Components of Supervised Learning

Validation method

  • we use classification performance metrics: confusion matrix, accuracy, sensitivity, specificity, etc.

Out-of-sample prediction

  • We will use the model to predict categories for documents that do not have labels

Creating a labelled dataset

We can label data through manual or automated annotation. E.g.:

  • undergraduate students code texts into particular categories
  • crowd-sourced workers on the internet code texts into particular categories
  • ChatGPT can be used to code texts into particular categories

Intuition: Naive Bayes

Imagine you’re sorting emails (Spam vs. Not Spam)

Naive Bayes helps classify emails as spam or not spam based on the words they contain. For example:

  • Words like “win,” “free,” and “prize” might appear more in spam emails.
  • Words like “meeting,” “project,” and “agenda” might appear more in not-spam emails.

The goal is to calculate which category (spam or not spam) an email is most likely to belong to.

Intuition: Naive Bayes

Step 1: What’s the Question?

We want to figure out:

Note

How likely is it that this email belongs to a specific category (e.g., spam) given the words it contains?

This is called the posterior probability:

\[P(\text{Category|Words})\]

Intuition: Naive Bayes

Step 2: What Information Do We Use?

To make this decision, Naive Bayes combines:

1. How common is the category overall? (Prior Probability)

\[P(\text{Category})\]

For instance, if 60% of emails are spam, then \(P(\text{Spam})=0.6\)

2. How likely are the words, given the category? (Likelihood)

\[P(\text{Words|Category})\]

For example, the word “win” might appear in 50% of spam emails but only 2% of not-spam emails.

3. How often do the words appear overall? (Normalization Factor):

\[P(\text{Words})\] This ensures probabilities add up to 1 across all categories.

Intuition: Naive Bayes

Step 3: Simplify the Math with an Analogy

Imagine you’re at a dog park. Some dogs are Labradors, and others are Poodles. If you see a black dog, you might ask:

Note

Is this dog more likely to be a Labrador or a Poodle?

1. Prior Probability: Start with what you know about the park:

  • 70% of the dogs are Labradors: \(P(\text{Labrador})=0.7\)
  • 30% are Poodles: \(P(\text{Poodle})=0.3\)

2. Likelihood: Look at traits like coat color:

  • Black Labradors are common: \(P(\text{Black|Labrador}) = 0.8\)
  • Black Poodles are less common: \(P(\text{Black|Poodle}) = 0.3\)

Intuition: Naive Bayes

Step 3: Simplify the Math with an Analogy

Imagine you’re at a dog park. Some dogs are Labradors, and others are Poodles. If you see a black dog, you might ask:

Note

Is this dog more likely to be a Labrador or a Poodle?

3. Combine Information: Naive Bayes combines these probabilities to estimate (worked out below):

  • \(P(\text{Labrador|Black})\)
  • \(P(\text{Poodle|Black})\)
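Plugging in the numbers from above makes the comparison concrete:

\[ P(\text{Labrador|Black}) = \frac{0.7 \times 0.8}{0.7 \times 0.8 + 0.3 \times 0.3} = \frac{0.56}{0.65} \approx 0.86, \qquad P(\text{Poodle|Black}) = \frac{0.09}{0.65} \approx 0.14 \]

So the black dog is far more likely to be a Labrador.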

Intuition: Naive Bayes

Step 4: The Formula in Action

Naive Bayes calculates:

\[P(\text{Category|Words}) = \frac{P(\text{Category}) \times P(\text{Words|Category})}{P(\text{Words})}\]

Let’s apply this to the email example:

  • Suppose the prior probability of spam is \(P(\text{Spam})=0.6\)
  • Suppose the likelihood of the word “win” in spam emails is \(P(\text{"win"|Spam})=0.5\)
  • Suppose the overall probability of the word “win” is \(P(\text{"win"}) = 0.2\)

The posterior probability of spam given the word “win” is:

\[ P(\text{Spam|"win"}) = \frac{P(\text{Spam}) \times P(\text{"win"|Spam})}{P(\text{"win"})} = \frac{0.6 \times 0.5}{0.2} = 1.5 \]

A true probability can never exceed 1: the value of 1.5 arises only because these illustrative numbers are not mutually consistent (with the numbers above, \(P(\text{"win"})\) would have to be at least \(0.6 \times 0.5 = 0.3\)). In practice, classifiers skip the denominator, which is the same for every category, and simply compare the numerators \(P(\text{Category}) \times P(\text{Words|Category})\), normalizing them at the end so they sum to 1.
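A minimal Python sketch of this comparison, reusing the illustrative numbers above (priors of 0.6 and 0.4, \(P(\text{"win"|Spam})=0.5\), and the 2% figure for not-spam emails from the earlier example):

Python
# Minimal sketch: compare the Naive Bayes numerators for the word "win".
p_spam, p_not_spam = 0.6, 0.4
p_win_given_spam, p_win_given_not_spam = 0.5, 0.02

# Numerators of Bayes' rule (the denominator P("win") is the same for both
# categories, so it can be ignored when choosing the most likely class).
num_spam = p_spam * p_win_given_spam              # 0.30
num_not_spam = p_not_spam * p_win_given_not_spam  # 0.008

# Normalizing the numerators turns them into proper posterior probabilities.
total = num_spam + num_not_spam
print(f"P(Spam | 'win')     = {num_spam / total:.3f}")      # ~0.974
print(f"P(Not Spam | 'win') = {num_not_spam / total:.3f}")  # ~0.026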

Intuition: Naive Bayes

Step 5: Make the Final Decision

Compare the posterior probabilities for all categories (spam and not spam).

We assign the email to the category with the highest probability.

Methods: Naive Bayes

Steps

For example, we can try to label aggression in political speeches, based on a small sample coded with ChatGPT.

1. Train Language Models for Each Category

Example: Build a model for Aggressive speeches and another for Non-Aggressive speeches.
Each model calculates the probability of words appearing in its category.

  • Aggressive speeches often use words like “battle,” “destroy,” “win,” “horrible.”
  • Non-Aggressive speeches often use non-aggressive, neutral words like “diplomacy,” “support,” or “grow.”

2. Get a New Document
Example: A campaign speech or policy statement.

Methods: Naive Bayes

Steps

3. Calculate Probabilities for Each Model

Compute the likelihood that the text was “written” by the Aggressive language model vs the Non-Aggressive model:

  • Aggressive Model: High probabilities for “battle,” “destroy,” “win,” “horrible.”
  • Non-Aggressive Model: High probabilities for “diplomacy,” “support,” or “grow.”

4. Assign the Most Likely Category

  • If the text mentions “battle” and “destroy” → Aggressive.
  • If the text emphasizes “diplomacy” and “support” → Non-Aggressive.

Language Models

We can represent these different “models” for language using a probability distribution over the words in the vocabulary:

A probability distribution over a discrete variable must have three properties:

  • Each element must be greater than or equal to zero
  • Each element must be less than or equal to one
  • The sum of the elements must be 1
Tone            Battle  Destroy  Win   Diplomacy  Support  Grow
Aggressive      0.30    0.25     0.20  0.05       0.10     0.10
Non-Aggressive  0.05    0.10     0.15  0.30       0.25     0.15
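As a quick sketch, we can verify in Python that each row of this table satisfies the three properties (the numbers are the ones in the table):

Python
# Check that each tone's word distribution is a valid probability distribution.
tones = {
    "Aggressive":     [0.30, 0.25, 0.20, 0.05, 0.10, 0.10],
    "Non-Aggressive": [0.05, 0.10, 0.15, 0.30, 0.25, 0.15],
}
for tone, probs in tones.items():
    assert all(0 <= p <= 1 for p in probs)  # each element is between 0 and 1
    assert abs(sum(probs) - 1) < 1e-9       # the elements sum to 1
    print(f"{tone}: valid probability distribution over {len(probs)} words")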

Probability Distributions

  • Definition:
    • A mathematical function that shows how likely different outcomes are.
    • For discrete events, it assigns probabilities to each possible event.
  • Key Properties:
    1. Probabilities are always between 0 and 1.
    2. The sum of all probabilities equals 1.

Example: Rolling a Die 🎲

  • Possible Outcomes:

\[( 1, 2, 3, 4, 5, 6)\]

  • Uniform Distribution (Fair Die):

\[P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = \frac{1}{6}\]

  • The sum of probabilities: \[ \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = 1 \]

Relevance to Language Models

In Language Models:

  • The “outcomes” are words.
  • A probability distribution predicts how likely each word is to appear next.
  • Example: \[P(\text{'hello'}) = 0.4, \; P(\text{'world'}) = 0.3, \; P(\text{'everyone'}) = 0.2, \; \dots\]
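As a small illustration, we can sample the “next word” from such a distribution (a sketch; the fourth word and its probability of 0.1 are assumed here only so that the probabilities sum to 1):

Python
import numpy as np

# Hypothetical next-word distribution; the word "friends" and its 0.1
# probability are assumptions added so that the distribution sums to 1.
words = ["hello", "world", "everyone", "friends"]
probs = [0.4, 0.3, 0.2, 0.1]

rng = np.random.default_rng(42)
print(rng.choice(words, size=5, p=probs))  # five draws from the distribution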

Language Models

Given these categories, we can calculate the probability that a given set of word counts (i.e. a document) would be drawn from each distribution.

\[ P(W_i|\mu) = \frac{M_i !}{\prod_{j=1}^J W_{i,j}!} \prod_{j=1}^J \mu_{j}^{W_{i,j}} \]

Where:

  • \(\mu_j\): probability of observing word \(j\) under a given category
  • \(W_{i,j}\): the number of times word \(j\) appears in document \(i\) (i.e. an element of a DFM—document-feature matrix)
  • \(M_i\): the total number of words in document \(i\)
  • \(!\): factorial operator, e.g., \(4! = 4 \times 3 \times 2 \times 1\)
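Here is a small sketch that evaluates this formula in Python for the example document \(W_1\) used on the next slides (counts of 5, 3, 0, 0, 2, 1 over the six words), under both word distributions:

Python
from math import factorial, prod

def multinomial_prob(counts, mu):
    """P(W_i | mu): the multinomial probability defined above."""
    M = sum(counts)  # M_i: total number of words in the document
    coef = factorial(M) / prod(factorial(w) for w in counts)
    return coef * prod(m ** w for m, w in zip(mu, counts))

mu_aggressive     = [0.30, 0.25, 0.20, 0.05, 0.10, 0.10]
mu_non_aggressive = [0.05, 0.10, 0.15, 0.30, 0.25, 0.15]
W1 = [5, 3, 0, 0, 2, 1]  # Battle, Destroy, Win, Diplomacy, Support, Grow

print(multinomial_prob(W1, mu_aggressive))      # ~0.00105
print(multinomial_prob(W1, mu_non_aggressive))  # ~0.000000081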

Language Models

Tone            Battle  Destroy  Win   Diplomacy  Support  Grow
Aggressive      0.30    0.25     0.20  0.05       0.10     0.10
Non-Aggressive  0.05    0.10     0.15  0.30       0.25     0.15

Let’s imagine that we have the following DFM:

Document  Battle  Destroy  Win  Diplomacy  Support  Grow
\(W_1\)        5        3    0          0        2     1
\(W_2\)        0        1    4          3        0     2

We can now calculate the probability for each document:

\[ \begin{aligned} P(W_1|\mu_{\text{Aggressive}}) &= \frac{M_1 !}{\prod_{j=1}^J W_{1,j}!} \prod_{j=1}^J \mu_{j}^{W_{1,j}}\\ &= \frac{11!}{5! \cdot 3! \cdot 2! \cdot 1!} \times 0.30^5 \times 0.25^3 \times 0.10^2 \times 0.10^1 \\ &\approx 0.00105 \end{aligned} \]

Language Models

Tone            Battle  Destroy  Win   Diplomacy  Support  Grow
Aggressive      0.30    0.25     0.20  0.05       0.10     0.10
Non-Aggressive  0.05    0.10     0.15  0.30       0.25     0.15

Let’s imagine that we have the following DFM:

Document  Battle  Destroy  Win  Diplomacy  Support  Grow
\(W_1\)        5        3    0          0        2     1
\(W_2\)        0        1    4          3        0     2

\[ \begin{aligned} P(W_1|\mu_{\text{Non-Aggressive}}) &= \frac{M_1 !}{\prod_{j=1}^J W_{1,j}!} \prod_{j=1}^J \mu_{j}^{W_{1,j}}\\ &= \frac{11!}{5! \cdot 3! \cdot 2! \cdot 1!} \times 0.05^5 \times 0.10^3 \times 0.25^2 \times 0.15^1 \\ &\approx 0.000000081 \end{aligned} \]

The probability of observing \(W_1\) is higher under \(\mu_{\text{Aggressive}}\) than under \(\mu_{\text{Non-Aggressive}}\).

Language Models

Tone            Battle  Destroy  Win   Diplomacy  Support  Grow
Aggressive      0.30    0.25     0.20  0.05       0.10     0.10
Non-Aggressive  0.05    0.10     0.15  0.30       0.25     0.15

Let’s imagine that we have the following DFM:

Document  Battle  Destroy  Win  Diplomacy  Support  Grow
\(W_1\)        5        3    0          0        2     1
\(W_2\)        0        1    4          3        0     2

We can now calculate the probability for each document:

\[ \begin{aligned} P(W_2|\mu_{\text{Aggressive}}) &= \frac{M_2 !}{\prod_{j=1}^J W_{2,j}!} \prod_{j=1}^J \mu_{j}^{W_{2,j}}\\ &= \frac{10!}{1! \cdot 4! \cdot 3! \cdot 2!} \times 0.25^1 \times 0.20^4 \times 0.05^3 \times 0.10^2 \\ &\approx 0.0000063 \end{aligned} \]

Language Models

Tone            Battle  Destroy  Win   Diplomacy  Support  Grow
Aggressive      0.30    0.25     0.20  0.05       0.10     0.10
Non-Aggressive  0.05    0.10     0.15  0.30       0.25     0.15

Let’s imagine that we have the following DFM:

Document  Battle  Destroy  Win  Diplomacy  Support  Grow
\(W_1\)        5        3    0          0        2     1
\(W_2\)        0        1    4          3        0     2

We can now calculate the probability for each document:

\[ \begin{aligned} P(W_2|\mu_{\text{Non-Aggressive}}) &= \frac{M_2 !}{\prod_{j=1}^J W_{2,j}!} \prod_{j=1}^J \mu_{j}^{W_{2,j}}\\ &= \frac{10!}{1! \cdot 4! \cdot 3! \cdot 2!} \times 0.10^1 \times 0.15^4 \times 0.30^3 \times 0.15^2 \\ &\approx 0.00039 \end{aligned} \]

The probability of observing \(W_2\) is higher under \(\mu_{\text{Non-Aggressive}}\) than under \(\mu_{\text{Aggressive}}\)

Implications

Given a set of probabilities, we can work out which model was most likely to have generated any given document.

The likelihood of a document under a given model is:

  • larger when the model gives high probabilities to the words that occur frequently in the document
  • smaller when the model gives high probabilities to words that occur rarely (or not at all) in the document

Naive Bayes

Naive Bayes is a model that classifies documents into categories on the basis of the words they contain.

\[P(y_i=C_k | W_i) = \frac{P(y_i=C_k) P(W_i|y_i = C_k)}{P(W_i)}\]

Naive Bayes

Naive Bayes is a model that classifies documents into categories on the basis of the words they contain.

\[\color{red}{P(y_i=C_k | W_i)} = \frac{P(y_i=C_k) P(W_i|y_i = C_k)}{P(W_i)}\]

  • \(\color{red}{P(y_i=C_k | W_i)}\) - posterior distribution
    • the probability that document \(i\) is in category \(k\), given the words in the document and the prior probability of category \(k\) (the likelihood of the document belonging to a category before we observe any text.)

Naive Bayes

Naive Bayes is a model that classifies documents into categories on the basis of the words they contain.

\[P(y_i=C_k | W_i) = \frac{P(y_i=C_k) \color{red}{P(W_i|y_i = C_k)}}{P(W_i)}\]

  • \(P(y_i=C_k | W_i)\) - posterior distribution
    • the probability that document \(i\) is in category \(k\), given the words in the document and the prior probability of category \(k\) (the likelihood of the document belonging to a category before we observe any text.)
  • \(\color{red}{P(W_i|y_i = C_k)}\) - conditional probability
    • the probability that we would observe the words in \(W_i\) if the document were from category \(k\)

Naive Bayes

Naive Bayes is a model that classifies documents into categories on the basis of the words they contain.

\[P(y_i=C_k | W_i) = \frac{\color{red}{P(y_i=C_k)} P(W_i|y_i = C_k)}{P(W_i)}\]

  • \(P(y_i=C_k | W_i)\) - posterior distribution
    • the probability that document \(i\) is in category \(k\), given the words in the document and the prior probability of category \(k\) (the likelihood of the document belonging to a category before we observe any text.)
  • \(P(W_i|y_i = C_k)\) - conditional probability
    • the probability that we would observe the words in \(W_i\) if the document were from category \(k\)
  • \(\color{red}{P(y_i=C_k)}\) - prior probability that the document is from category \(k\)
    • the probability of the category of the document, absent any information about the words it contains

Naive Bayes

Naive Bayes is a model that classifies documents into categories on the basis of the words they contain.

\[P(y_i=C_k | W_i) = \frac{P(y_i=C_k) P(W_i|y_i = C_k)}{\color{red}{P(W_i)}}\]

  • \(P(y_i=C_k | W_i)\) - posterior distribution
    • the probability that document \(i\) is in category \(k\), given the words in the document and the prior probability of category \(k\) (the likelihood of the document belonging to a category before we observe any text.)
  • \(P(W_i|y_i = C_k)\) - conditional probability
    • the probability that we would observe the words in \(W_i\) if the document were from category \(k\)
  • \(P(y_i=C_k)\) - prior probability that the document is from category \(k\)
    • the probability of the category of the document, absent any information about the words it contains
  • \(\color{red}{P(W_i)}\) - unconditional probability of the words in document i
    • the probability that we would observe the words in \(W_i\) across all categories

Naive Bayes Caveats

By treating documents as bag-of-words, we assume:

  • Conditional independence - Knowing a document contains one word doesn’t tell us anything about the probability of observing other words in that document
  • Positional independence of word counts - The position of a word within a document doesn’t give us any information about the category of that document

Naive Bayes Classification

The classification decision made by the Naive Bayes model is simple: we assign document \(i\) to the category, \(C_k\), for which it has the highest posterior probability:

\[ \hat{Y_i} = \underset{k \in \{1,...,K\}}{\mathrm{argmax}} \; P(y_i = C_k) \times P (W_i|y_i=C_k) \] where \(\mathrm{argmax}\) means “the category \(k\) with the maximum posterior probability”.

Logic (see the sketch below):

  • assign documents to categories when the probability of observing the words in that document is high given the probability distribution for that category (i.e. when \(P(W_i|y_i = C_k)\) is large)
  • assign more documents to categories that contain more documents in the training data (i.e. when \(P(y_i = C_k)\) is large)
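A sketch of this decision rule applied to the toy example from earlier, using log probabilities for numerical stability (the equal 50/50 priors are assumed here purely for illustration):

Python
import numpy as np

# Word distributions from the toy example; equal priors are an assumption.
mu = {
    "Aggressive":     np.array([0.30, 0.25, 0.20, 0.05, 0.10, 0.10]),
    "Non-Aggressive": np.array([0.05, 0.10, 0.15, 0.30, 0.25, 0.15]),
}
prior = {"Aggressive": 0.5, "Non-Aggressive": 0.5}

W2 = np.array([0, 1, 4, 3, 0, 2])  # word counts for document W_2

# Score each category as log P(C_k) + sum_j W_j * log(mu_jk).
# (The multinomial coefficient is identical across categories, so it drops out.)
scores = {k: np.log(prior[k]) + (W2 * np.log(mu[k])).sum() for k in mu}
prediction = max(scores, key=scores.get)
print(scores)      # the higher (less negative) score wins
print(prediction)  # Non-Aggressive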

Naive Bayes Application

Here is how we apply Naive Bayes in Python:

Python
import pandas as pd
import numpy as np
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
# Saving only specific columns
aggression_texts = aggression_texts[["name", "party_short", "year", "body", "gender", "aggression_rating"]]
# Recode aggression_rating to new columns for aggressive and non-aggressive
aggression_texts['aggressive'] = (aggression_texts['aggression_rating'] == 1).astype(int)
aggression_texts['non_aggressive'] = (aggression_texts['aggression_rating'] == 0).astype(int)
# Make a working copy of the data frame
texts = aggression_texts.copy()
# Step 1: Add an identifier column
texts["id"] = texts.index
texts.head(8)
                  name   party_short  year  ... aggressive non_aggressive  id
0  Mr Gerry Bermingham        Labour  1992  ...          1              0   0
1       Richard Benyon  Conservative  2018  ...          0              1   1
2       Penny Mordaunt  Conservative  2014  ...          0              1   2
3      Gerry Sutcliffe        Labour  2004  ...          0              1   3
4      Debbie Abrahams        Labour  2014  ...          1              0   4
5     Stephen Metcalfe  Conservative  2012  ...          0              1   5
6      Mr John Gunnell        Labour  1996  ...          0              1   6
7         Julian Lewis  Conservative  2000  ...          1              0   7

[8 rows x 9 columns]

Pre-processing Text

Python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
import nltk
# Download the NLTK stopword list if not already done
nltk.download('stopwords')

# Get the English stopwords list
stop_words = set(stopwords.words('english'))

# Remove stopwords
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

# Apply the stopword removal function to the aggregated text
texts['body'] = texts['body'].apply(remove_stopwords)
True

Test-set approach

We randomly divide the available set of samples into two parts: a training set and a test set.

The model is fit on the training set, and the fitted model is used to predict the responses for the test set.

We then calculate classification performance scores (accuracy, sensitivity, specificity, etc) for the test set.

Naive Bayes Application

Here is how we apply Naive Bayes in Python:

Python
# Import necessary libraries for text processing and machine learning
# For converting text into numerical data
from sklearn.feature_extraction.text import CountVectorizer
# For training a Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
# For splitting data into training and testing sets
from sklearn.model_selection import train_test_split
# For evaluating the model's performance
from sklearn.metrics import classification_report

# Step 1: Convert text data into a Document-Feature Matrix
# `CountVectorizer` turns text data into a matrix where rows are documents and columns are unique words.
# Each cell contains the count of a word's occurrence in a document.
# (Common stopwords were already removed from the "body" column in the pre-processing step above.)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts["body"])  # Convert the "body" column of the dataset into numerical data
y = texts["aggressive"]  # Set the target variable to the "aggressive" column, which contains category labels (0 or 1)

Naive Bayes Application

The model is fit on the training set, and the fitted model is used to predict the responses for the test set.

Python
# Step 2: Split the data into training and testing sets
# Training data is used to build the model, and testing data is used to evaluate it.
# `test_size=0.33` means 33% of the data will be used for testing.
# `random_state=125` ensures consistent results each time you run the code.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=125)

How many Aggressive speeches are there in the training and test sets?

Python
# Count Aggressive speeches in the training set
aggressive_train_count = (y_train == 1).sum()
non_aggressive_train_count = (y_train == 0).sum()
aggressive_test_count = (y_test == 1).sum()
non_aggressive_test_count = (y_test == 0).sum()
Python
print(f"Aggressive speeches train count: {aggressive_train_count}")
Aggressive speeches train count: 7104
print(f"Aggressive speeches test count: {aggressive_test_count}")
Aggressive speeches test count: 3494
print(f"Non-aggressive speeches train count: {non_aggressive_train_count}")
Non-aggressive speeches train count: 12088
print(f"Non-Aggressive speeches test count: {non_aggressive_test_count}")
Non-Aggressive speeches test count: 5959

Naive Bayes Application

The model is fit on the training set.

Python
# Step 3: Train a Naive Bayes model
# Multinomial Naive Bayes is a simple algorithm often used for text classification.
# The model learns from the training data to associate words with the target labels (e.g., Non-Aggressive or Aggressive).
model = MultinomialNB()
model.fit(X_train, y_train)  # Train the model using the training data
MultinomialNB()

The fitted model is used to predict the responses for the test set.

Naive Bayes Application

And finally, we predict the category of each speech in the test set:

Python
# Step 4: Evaluate the model's performance
# Predict the labels for the testing data
y_pred = model.predict(X_test)

We then calculate classification performance scores (accuracy, sensitivity, specificity, etc) for the test set.

Python
# Print a detailed report showing metrics like precision, recall, and F1-score for each class
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.79      0.80      0.80      5959
           1       0.65      0.64      0.65      3494

    accuracy                           0.74      9453
   macro avg       0.72      0.72      0.72      9453
weighted avg       0.74      0.74      0.74      9453

Naive Bayes Application

Python
# Print a detailed report showing metrics like precision, recall, and F1-score for each class
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.79      0.80      0.80      5959
           1       0.65      0.64      0.65      3494

    accuracy                           0.74      9453
   macro avg       0.72      0.72      0.72      9453
weighted avg       0.74      0.74      0.74      9453
  • Precision: of the documents predicted to be in a class, the share that actually belong to it.
  • Recall: of the documents that actually belong to a class, the share that are correctly predicted.
  • F1-Score: the harmonic mean of precision and recall.
  • Support: 5959 (Class 0), 3494 (Class 1).

Insights:

  • Performance is noticeably better for class 0 (non-aggressive) than for class 1 (aggressive).
  • The lower recall for class 1 (0.64) means a substantial share of aggressive speeches are missed (false negatives).

Naive Bayes Application

This is how we identify the most relevant words associated with each category.

Python
import numpy as np

# Step 5: Directly exponentiate the model's log probabilities to get probabilities
feature_probs_exp = np.exp(model.feature_log_prob_)  # Convert log probabilities to regular probabilities

# Step 6: Relabel classes
classes = ['non_aggressive', 'aggressive']
feature_names = vectorizer.get_feature_names_out()  # Vocabulary from vectorizer

feature_probs_df = pd.DataFrame(
    feature_probs_exp,  # Use the probabilities
    columns=vectorizer.get_feature_names_out(),  # Feature names as column names
    index=classes  # Class labels as row index
)

# Step 7: Transpose the DataFrame for easier analysis
# Flip the DataFrame so rows become words and columns are the classes
feature_probs_transposed = feature_probs_df.T

# Print the final DataFrame showing the probabilities of words for each class
print(feature_probs_transposed)
               non_aggressive  aggressive
00                   0.000001    0.000006
000                  0.001239    0.001526
00035                0.000001    0.000002
000s                 0.000004    0.000002
000th                0.000003    0.000002
...                       ...         ...
zoomusicology        0.000003    0.000002
zsa                  0.000001    0.000005
zsolt                0.000003    0.000002
zuckerman            0.000001    0.000003
zurich               0.000001    0.000005

[38769 rows x 2 columns]

Naive Bayes Application

Remember that we estimate the probability of observing word \(j\) given class \(k\):

\[ \mu_{j(k)}=\frac{W_{j(k)}}{\sum_{j \in V} W_{j(k)}} \] These are the word probabilities estimated from our speeches.

We can examine the probability of each word. Here are the top 10 words.

Highest probability for Non-Aggressive (i.e. \(P(w_j|c_k = \text{Non-Aggressive})\))

Python
# Sort the DataFrame by Non Aggressive column in descending order
sorted_non_aggressive = feature_probs_transposed.sort_values(by="non_aggressive", ascending=False)
# Display the top rows
print(sorted_non_aggressive.head(10))
            non_aggressive  aggressive
honourable        0.016144    0.012421
government        0.007651    0.011537
friend            0.007021    0.004315
people            0.006734    0.007688
right             0.006161    0.005326
member            0.005261    0.005197
house             0.004348    0.004068
minister          0.004067    0.005968
gentleman         0.004003    0.003303
new               0.003643    0.002762

Highest probability for Aggressive (i.e. \(P(w_j|c_k = \text{Aggressive})\))

Python
# Sort the DataFrame by the aggressive column in descending order
sorted_aggressive = feature_probs_transposed.sort_values(by="aggressive", ascending=False)
# Display the top rows
print(sorted_aggressive.head(10))
            non_aggressive  aggressive
honourable        0.016144    0.012421
government        0.007651    0.011537
people            0.006734    0.007688
minister          0.004067    0.005968
right             0.006161    0.005326
member            0.005261    0.005197
friend            0.007021    0.004315
house             0.004348    0.004068
secretary         0.002694    0.003868
said              0.002781    0.003827

Naive Bayes Application

What are the probabilities that the following speech is aggressive?

Python
# Transform the example speech using the trained vectorizer
example_speech = vectorizer.transform(["You are terrible. I hate you"])

# Calculate probabilities using the trained model
mod_probs = model.predict_proba(example_speech)

# Extract the probabilities for 'Non-Aggressive' and 'Aggressive'
# Probability for class 0 (Non-Aggressive)
avg_prob_nonaggressive = mod_probs[0, 0]
# Probability for class 1 (Aggressive)
avg_prob_aggressive = mod_probs[0, 1]
Python
# Display the results
print(f"Average probability for Nonaggressive: {avg_prob_nonaggressive:.2f}")
Average probability for Nonaggressive: 0.32
print(f"Average probability for Aggressive: {avg_prob_aggressive:.2f}")
Average probability for Aggressive: 0.68

Naive Bayes Application

Plotting aggression over time by gender:

Python
import pandas as pd
import numpy as np
from scipy.stats import t

# Re-create the document-feature matrix for the full corpus
# (same texts as before, so the vocabulary matches the one the model was trained on)
X = vectorizer.fit_transform(texts["body"])
y = texts["aggressive"]
# Predict labels for all texts in the dataset
texts["predicted_label"] = model.predict(X)

# Aggregate the predicted aggression labels by year and gender
aggression_trends = (texts
    .groupby(['year', 'gender'], as_index=False)
    .agg(mean_aggression=('predicted_label', 'mean'),
         sd_aggression=('predicted_label', 'std'),
         n=('predicted_label', 'size'))
)

# Calculating standard error, confidence interval lower and upper bounds
aggression_trends['se'] = aggression_trends['sd_aggression'] / np.sqrt(aggression_trends['n'])
aggression_trends['ci_lower'] = aggression_trends['mean_aggression'] - t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
aggression_trends['ci_upper'] = aggression_trends['mean_aggression'] + t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
R
library(reticulate)
library(ggplot2)
aggression_trends2 <- reticulate::py$aggression_trends
aggression_trends3<-subset(aggression_trends2, year>1997)
ggplot(aggression_trends3, aes(x = year, y = mean_aggression, color = gender)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = gender), alpha = 0.2) +
  labs(
    title = "Aggression Rating Over Time by Gender",
    x = "Year",
    y = "Mean Aggression Rating"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(aggression_trends3$year), max(aggression_trends3$year), by = 1))+
    geom_hline(yintercept = 0)+
    theme_bw()

Naive Bayes Application

It differs slightly from the ChatGPT predictions:

Python
import pandas as pd
import numpy as np
from scipy.stats import t

# Assuming aggression_texts_df2 is your DataFrame
aggression_trends = (aggression_texts
    .groupby(['year', 'gender'], as_index=False)
    .agg(mean_aggression=('aggression_rating', 'mean'),
         sd_aggression=('aggression_rating', 'std'),
         n=('aggression_rating', 'size'))
)

# Calculating standard error, confidence interval lower and upper bounds
aggression_trends['se'] = aggression_trends['sd_aggression'] / np.sqrt(aggression_trends['n'])
aggression_trends['ci_lower'] = aggression_trends['mean_aggression'] - t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
aggression_trends['ci_upper'] = aggression_trends['mean_aggression'] + t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
R
library(reticulate)
library(ggplot2)
aggression_trends2 <- reticulate::py$aggression_trends
aggression_trends3<-subset(aggression_trends2, year>1997)
ggplot(aggression_trends3, aes(x = year, y = mean_aggression, color = gender)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = gender), alpha = 0.2) +
  labs(
    title = "Aggression Rating Over Time by Gender",
    x = "Year",
    y = "Mean Aggression Rating"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(aggression_trends3$year), max(aggression_trends3$year), by = 1))+
    geom_hline(yintercept = 0)+
    theme_bw()

Purpose of Naive Bayes

The overall purpose of Naive Bayes is to make out-of-sample predictions

Generally, the idea is that we have a small hand-coded training dataset and then we predict for lots of other speeches.

Let us now try another text and see the probability that it is aggressive or not.

Python
# Define the simple sentence
simple_sentence = ["We need to invest in healthcare and education for the future."]

# Transform the sentence using the trained vectorizer
simple_X = vectorizer.transform(simple_sentence)

# Predict probabilities using the trained model
simple_probs = model.predict_proba(simple_X)

# Extract probabilities
prob_non_aggressive = simple_probs[0, 0]  # Probability for Non-Aggressive (class 0)
prob_aggressive = simple_probs[0, 1]        # Probability for Aggressive (class 1)
Python
# Display the results
print(f"Probability of Non-Aggressive: {prob_non_aggressive:.2f}")
Probability of Non-Aggressive: 0.86
print(f"Probability of Aggressive: {prob_aggressive:.2f}")
Probability of Aggressive: 0.14

Notes about Naive Bayes

Naive Bayes is simple to apply:

  • it can be used over multiple categories
  • it can be used over various text representations, such as bigrams and trigrams (see the sketch below)
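For instance, switching to unigrams plus bigrams is a one-line change to the vectorizer (a sketch reusing the texts data frame from the application above):

Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Count unigrams and bigrams instead of single words only.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_bigrams = bigram_vectorizer.fit_transform(texts["body"])

bigram_model = MultinomialNB()
bigram_model.fit(X_bigrams, texts["aggressive"])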

Naive Bayes and the Independence Assumption

  • Naive Bayes (NB) assumes that all words in a text contribute to predictions independently.
  • This means it cannot account for word combinations or interactions.
  • Example:
    • “Crush the opposition” may lean Aggressive.
    • “Collaborate with others” may lean Non-Aggressive.
    • NB might misclassify based on the words “opposition” and “collaborate” alone.
  • This assumption often oversimplifies the complexity of political speeches.

Challenges with Overconfidence

  • NB treats each word as a new piece of information.
    • Frequent words like “battle” or “support” can overly influence predictions.
  • Context is ignored:
    • “Grow through diplomacy” suggests Non-Aggressive policies.
    • “Destroy all opposition” suggests Aggressive priorities.
  • Political language often relies on nuances and interactions that NB cannot capture.

Alternatives to Naive Bayes

  • Other methods, like Support Vector Machines (SVM), address NB’s limitations:
    • Allow non-linear interactions between words.
    • Adjust probabilities based on word frequency and combinations.
  • Example:
    • Repeated mentions of “battle” strengthen Aggressive classification.
    • Repeated mentions of “support” strengthen Non-Aggressive classification.
  • SVM and other advanced models are better suited for nuanced text analysis.

Validating Supervised Learning

An important step is to measure the degree to which the predictions we make correspond to the observed data

The important measures are:

  • accuracy - proportion of all predictions that match the observed data
  • sensitivity - proportion of actual positive cases that are correctly predicted (the “true positive” rate)
  • specificity - proportion of actual negative cases that are correctly predicted (the “true negative” rate)

To get informative estimates of these quantities, we need to distinguish between the performance of the classifier on the training set and the test set
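Here is a sketch of how these quantities can be computed for the test-set predictions from the application above (it assumes y_test and y_pred from earlier, with class 1 = aggressive as the “positive” class):

Python
from sklearn.metrics import confusion_matrix, accuracy_score

# Rows of the confusion matrix are observed classes, columns are predictions.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy    = accuracy_score(y_test, y_pred)  # (tp + tn) / all predictions
sensitivity = tp / (tp + fn)                  # true positive rate
specificity = tn / (tn + fp)                  # true negative rate

print(f"Accuracy:    {accuracy:.3f}")
print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")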

K-Fold Validation

An alternative to the train-test split method is K-Fold Validation

This approach involves randomly dividing the set of observations into \(k\) groups, or folds, of approximately equal size.

K-Fold Validation reduces the variance of the performance estimate and allows you to use more data for training.

It also helps you avoid overfitting, as it exposes your model to different subsets of the data.

The typical choice has been \(k=10\). For each fold, K-Fold Validation entails 3 steps:

  • train the Naive Bayes model on all observations not included in the fold
  • generate predictions for the observations in the held-out fold
  • calculate performance metrics for the predictions in the held-out fold

K-Fold Validation

Here is how we implement K-Fold validation

Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold  # For k-fold cross-validation
from sklearn.metrics import classification_report, accuracy_score
from pretty_html_table import build_table
from IPython.display import display, HTML

# Step 1: Convert text data into a Document-Feature Matrix
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts["body"])  # Transform text into numerical features
y = texts["aggressive"]  # Target variable

# Step 2: Initialize the k-fold cross-validator
k = 10  # Number of folds
kf = StratifiedKFold(n_splits=k, shuffle=True, random_state=125)  # Stratified to preserve class distribution

# Step 3: Perform k-fold validation
fold = 1  # To track the fold number
for train_index, test_index in kf.split(X, y):
    print(f"Fold {fold}:")
    
    # Split data into training and testing sets for the current fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Train the Naive Bayes model
    model = MultinomialNB()
    model.fit(X_train, y_train)
    
    # Test the model
    y_pred = model.predict(X_test)
    
    # Print evaluation metrics
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print("-" * 50)
    
    fold += 1

K-Fold Validation

Here are the per-fold classification results:


precision recall f1-score support fold
0 0.793028 0.806648 0.79978 1805.0 1
1 0.660836 0.641509 0.651029 1060.0 1
accuracy 0.74555 1
macro avg 0.726932 0.724079 0.725405 2865.0 1
weighted avg 0.744119 0.74555 0.744745 2865.0 1
0 0.786868 0.809972 0.798253 1805.0 2
1 0.659384 0.626415 0.642477 1060.0 2
accuracy 0.742059 2
macro avg 0.723126 0.718194 0.720365 2865.0 2
weighted avg 0.739701 0.742059 0.740618 2865.0 2
0 0.79564 0.808864 0.802198 1805.0 3
1 0.665049 0.646226 0.655502 1060.0 3
accuracy 0.748691 3
macro avg 0.730344 0.727545 0.72885 2865.0 3
weighted avg 0.747324 0.748691 0.747923 2865.0 3
0 0.79593 0.801662 0.798786 1805.0 4
1 0.658071 0.65 0.65401 1060.0 4
accuracy 0.74555 4
macro avg 0.727 0.725831 0.726398 2865.0 4
weighted avg 0.744924 0.74555 0.745221 2865.0 4
0 0.78453 0.786704 0.785615 1805.0 5
1 0.635071 0.632075 0.63357 1060.0 5
accuracy 0.729494 5
macro avg 0.709801 0.70939 0.709593 2865.0 5
weighted avg 0.729233 0.729494 0.729361 2865.0 5
0 0.785329 0.807095 0.796063 1804.0 6
1 0.655446 0.624528 0.639614 1060.0 6
accuracy 0.739525 6
macro avg 0.720387 0.715812 0.717838 2864.0 6
weighted avg 0.737258 0.739525 0.738159 2864.0 6
0 0.792504 0.808758 0.800549 1804.0 7
1 0.662757 0.639623 0.650984 1060.0 7
accuracy 0.746159 7
macro avg 0.72763 0.72419 0.725766 2864.0 7
weighted avg 0.744483 0.746159 0.745193 2864.0 7
0 0.785286 0.79878 0.791976 1804.0 8
1 0.64723 0.628302 0.637626 1060.0 8
accuracy 0.735684 8
macro avg 0.716258 0.713541 0.714801 2864.0 8
weighted avg 0.73419 0.735684 0.734849 2864.0 8
0 0.788663 0.793906 0.791276 1805.0 9
1 0.644699 0.637394 0.641026 1059.0 9
accuracy 0.736034 9
macro avg 0.716681 0.71565 0.716151 2864.0 9
weighted avg 0.73543 0.736034 0.735719 2864.0 9
0 0.795817 0.801108 0.798454 1805.0 10
1 0.657116 0.649669 0.653371 1059.0 10
accuracy 0.745112 10
macro avg 0.726466 0.725389 0.725913 2864.0 10
weighted avg 0.744531 0.745112 0.744808 2864.0 10

Comments

Accuracy ranges between 72.9% and 74.8%, indicating stable performance.

Performance slightly favors class 0 due to imbalance.

Class-Specific Observations:

  • Class 0: Higher precision and recall; model predicts this class more reliably.
  • Class 1: Lower precision and recall; more difficult to predict due to fewer samples.

Key Takeaways:

  • The model is reliable for the majority class (0), with room for improvement for class 1.
  • Potential next steps include addressing class imbalance (oversampling/undersampling), as sketched below
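A minimal sketch of the oversampling idea, assuming the texts data frame and the fitted vectorizer from earlier: duplicate minority-class rows in the training portion only, then refit and evaluate.

Python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.utils import resample
import pandas as pd

# Split the labelled data frame, then rebalance only the training portion.
train_df, test_df = train_test_split(texts, test_size=0.33, random_state=125)
majority = train_df[train_df["aggressive"] == 0]
minority = train_df[train_df["aggressive"] == 1]

# Sample the minority class with replacement until the two classes are balanced.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=125)
balanced = pd.concat([majority, minority_upsampled])

# Reuse the fitted vectorizer to build DFMs, then refit and evaluate.
model_bal = MultinomialNB().fit(vectorizer.transform(balanced["body"]),
                                balanced["aggressive"])
y_pred_bal = model_bal.predict(vectorizer.transform(test_df["body"]))
print(classification_report(test_df["aggressive"], y_pred_bal))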

Other Models

Naive Bayes is only one of many models that we can use for out-of-sample classification

Other types of models include:

  1. Regularized Logistic Regression
    • Directly models the probability that each document is in class \(k\) using logistic regression
    • Regularization is required to prevent overfitting the data
  2. Support Vector Machines
    • SVMs draw a hyperplane through the multidimensional word space that best separates documents into different classes
    • Can accommodate non-linear boundaries between classes
  3. Tree-based Methods
    • Tree-based methods separate classes by segmenting the predictors (word counts) into a number of distinct regions
    • Like the SVM, this allows for non-linear relationships between features and categories
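A sketch of how these alternatives can be tried on the same document-feature matrix, reusing the train/test split from earlier (default hyperparameters, for illustration only):

Python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Default settings; none of these models is tuned here.
models = {
    "Regularized logistic regression": LogisticRegression(max_iter=1000),  # L2 penalty by default
    "Linear SVM": LinearSVC(),
    "Random forest (tree-based)": RandomForestClassifier(n_estimators=200, random_state=125),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: accuracy = {acc:.3f}")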

Conclusion

Supervised learning for text data enables us to identify patterns and associations between words and specific outcome categories.

The Naive Bayes model, a simple yet efficient approach, is quick to implement and often delivers strong classification results, despite relying on certain assumptions.

Once supervised learning classifiers are trained, it is crucial to validate their performance on a test set that was not used during model training.

Cross-validation is a widely used technique for out-of-sample evaluation, helping to compare and select the best-performing models.