Lecture 16: Dictionary Methods

Bogdan G. Popescu

John Cabot University

Introduction

Remember that text analysis is about assigning numbers to words to measure latent concepts in a text.

Before we assign numbers, it is important to be clear about the phenomenon we are trying to capture.

Theorizing and exploratory data analysis play a role in conceptualization

  • theorizing helps identify which aspects of a problem are likely to be important
  • exploratory data analysis helps reveal previously hidden dimensions of a text

Dimensions of a Good Measure

It is important for our measure to be unbiased, accurate, and valid.

  • Does it measure what it is supposed to measure?

Our measure also needs to be reliable.

  • If we measure our concept again, will we get the same answer?

Quantitative text analyses tend to do well on the second dimension (reliability), but more work is usually needed to demonstrate that the measures are also valid.

Dictionaries

Dictionary methods blend qualitative and quantitative approaches in text analysis.

  • Qualitative Aspect: Involves identifying concepts and creating categories or keys, associating them with specific textual features.
  • Contextual Interpretation: Building dictionaries requires qualitative judgment and understanding of context.
  • Quantitative Aspect: An algorithm is applied to large datasets, generating statistical summaries of text data.
  • Reliability: Once constructed, dictionaries offer high reliability as analysis involves no further human interpretation.

Why Use Dictionaries?

Rather than counting all words, dictionaries associate specific words with predefined meanings, enhancing interpretability.

Components of a Dictionary:

  • Key: Represents a concept or category.
  • Values: Terms or patterns associated with the key, functioning as equivalent instances of the concept.

Key       Values
Emotion   Happiness, Sadness, Anger, Joy
Finance   Investment, Budget, Debt, Savings
Health    Exercise, Nutrition, Medicine, Wellness
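
In code, such a dictionary can be represented as a simple mapping from each key to its list of terms; here is a minimal Python sketch mirroring the illustrative table above:

Python
# Each key (concept) maps to the terms treated as instances of that concept
concept_dictionary = {
    "Emotion": ["happiness", "sadness", "anger", "joy"],
    "Finance": ["investment", "budget", "debt", "savings"],
    "Health": ["exercise", "nutrition", "medicine", "wellness"],
}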

Counting words

A dictionary is just a list of words \((m = 1, \ldots, M)\) related to a common concept.

Aggression
fool
irritated
stupid
stubborn
accusation
accuse
ignorant

Counting words

Applying a dictionary to a corpus of texts \((i = 1, \ldots, N)\) simply requires counting the number of times each dictionary word occurs in each text and summing the counts.

Thus, the proportion of dictionary words in document \(i\) can be defined as:

\[t_{i}=\frac{\sum^{M}_{m=1} W_{im}}{N_i}\] where:

\(W_{im}\) - number of times word \(m\) appears in text \(i\)
\(N_{i}\) - number of words within a text

We need to divide by the number of words as we do not want longer texts to mechanically be assigned higher scores.

Counting words

Note

“That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind.”

\[t_{i}=\frac{\sum^{M}_{m=1} W_{im}}{N_i}=\frac{1+1}{24}=0.083\]
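
As a quick sketch of this computation (the regex tokenizer here is an illustrative choice; any tokenizer that lowercases and strips punctuation yields the same counts for this sentence):

Python
import re

aggression_words = {"fool", "irritated", "stupid", "stubborn",
                    "accusation", "accuse", "ignorant"}

text = ("That statement is as barbaric as it is downright stupid; "
        "it is nothing more than an ignorant, cruel and deliberate "
        "misconception to hide behind.")

# Lowercase and keep only alphabetic tokens, so "stupid;" matches "stupid"
tokens = re.findall(r"[a-z]+", text.lower())
matches = sum(token in aggression_words for token in tokens)
print(matches, len(tokens), matches / len(tokens))  # 2 24 0.0833...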

Counting weighted words

A slight variation of the dictionary approach is to use weights.

The weights represent how important different words are for a specific concept.

For example, “stupid” is more aggressive than “accuse”.

Aggression   Weight
fool         0.6
irritated    0.5
stupid       0.8
stubborn     0.4
accusation   0.3
accuse       0.3
ignorant     0.3

Counting weighted words

Note

“That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind.”

We now adjust the formula, where \(s_m\) is the weight of word \(m\). Using the table's weights for “stupid” (0.8) and “ignorant” (0.3):

\[t_{i}=\frac{\sum^{M}_{m=1} s_{m} W_{im}}{N_i}=\frac{(1 \times 0.8)+(1 \times 0.3)}{24} \approx 0.0458\]
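
And the weighted variant of the sketch, reusing the same illustrative tokenizer:

Python
import re

weights = {"fool": 0.6, "irritated": 0.5, "stupid": 0.8, "stubborn": 0.4,
           "accusation": 0.3, "accuse": 0.3, "ignorant": 0.3}

text = ("That statement is as barbaric as it is downright stupid; "
        "it is nothing more than an ignorant, cruel and deliberate "
        "misconception to hide behind.")

tokens = re.findall(r"[a-z]+", text.lower())
score = sum(weights.get(token, 0.0) for token in tokens)  # sum of s_m * W_im
print(score / len(tokens))  # (0.8 + 0.3) / 24 = 0.0458...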

When to Use Weights and When Not?

Many applications use unweighted dictionaries.

Not using weights simply means that all words count equally.

Sometimes, however, we may want weights to express that some words are more reflective of a specific concept than others.

Examples of Dictionaries

There are many dictionaries to measure different concepts:

  1. Linguistic Inquiry and Word Count (LIWC)
  • Different word categories reflecting psychological states, emotions, thinking styles, and social concerns
  2. Lexicoder Sentiment Dictionary
  • Negative and positive sentiment words designed for the automated coding of sentiment in news coverage, legislative speech, and other texts
  3. Moral Foundations Dictionary
  • Multiple-category dictionary of moral terms

Examples of Dictionaries

There are many dictionaries to measure different concepts:

  4. Loughran-McDonald Sentiment Dictionary
  • Sentiment dictionary developed specifically for financial analyses
  5. Martindale’s Regressive Imagery Dictionary
  • Multiple-category dictionary designed to measure primordial vs. conceptual thinking

Some Caveats

Applying already-existing dictionaries to new contexts may be problematic.

  1. Words can have multiple meanings
  • Loughran and McDonald classify sentiment for a corpus of 50,115 firm-year 10-K filings from 1994–2008
  • Almost three-fourths of the negative words in general-purpose dictionaries are not negative in a financial context: e.g. tax, cost, liability, foreign, vice
  2. Dictionaries can lack important words in some contexts
  • For example, some words are negative chiefly in financial settings: felony, litigation, restated, misstatement

Some Caveats

Applying already-existing dictionaries to new contexts may be problematic.

  3. Some dictionaries pick up the topic of a document more than its tone

Note

“Applying dictionaries outside the domain for which they were developed can lead to serious errors” (Grimmer and Stewart, 2013, 268)

Some Caveats

Applying already-existing dictionaries to new contexts may be problematic.

  4. Some dictionaries may miss important words.
  • For example, the word “barbaric” will be missed because it is not in the dictionary

Note

“That statement is as barbaric as it is downright stupid; it is nothing more than an ignorant, cruel and deliberate misconception to hide behind.”

  • The word “barbaric” should probably be taken into account.

Some Caveats

Applying already-existing dictionaries to new contexts may be problematic.

  5. Dictionaries do not typically capture modifiers
  • in “downright stupid”, the modifier amplifies the aggression, yet only “stupid” is counted.
  6. Dictionaries may miscount aggression

Note

“Terrible acts of brutality and violence have been carried out against the Rohingya people.”

  • these are descriptions of aggression, not an aggressive tone

Motivation

How does aggression within political speech compare between men and women?

This is the question asked by Hargrave and Blumenau (2022).


Reading the Data

Reading the Debates

Let us first download the UK debates data.

We can then read all the debates:

Python
import pandas as pd
aggression_texts = pd.read_csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/aggression_texts.csv")
aggression_texts.head(3)
   person_id                 name  ... body_word_count aggression_rating
0      10042  Mr Gerry Bermingham  ...              87               1.0
1      11727       Richard Benyon  ...             109               0.0
2      24938       Penny Mordaunt  ...              81               0.0

[3 rows x 26 columns]

Reading the Data

Reading the Debates

We have the following columns within the dataframe:

Python
print(aggression_texts.columns)
Index(['person_id', 'name', 'constituency', 'entered_house', 'left_house',
       'date_of_birth', 'age_years', 'gender', 'house_start', 'days_in_house',
       'party', 'party_short', 'year', 'parliamentary_term', 'session',
       'pct_con', 'pct_lab', 'pct_ld', 'pct_other', 'margin', 'body',
       'question_time', 'debate_type', 'aggressive_word_count',
       'body_word_count', 'aggression_rating'],
      dtype='object')

Reading the Data

Reading the Debates

The relevant column is body:

Python
# Selecting the relevant column
all_texts = aggression_texts[['body']]
top_three_texts = all_texts.head(3)
# Print the entries
print(top_three_texts)

body
Does the Minister agree that if one does not provide litigants with legal aid as speedily and appropriately as possible, one builds up a backlog? Does he agree with the Lord Chancellor's circular that the way in which cases are banked up has delayed matters, just as the failure to grant expert witnesses and the failure to grant civil legal aid to most people have done? In all, there is no justice in this country because the Conservative party has sought to destroy the whole litigation system.
My right honourable Friend will know that there is no greater critic of the common fisheries policy than me, but I am sure he would agree that even had we not gone into it, we would probably still have a problem, because man's technical ability to harvest vast quantities from the sea has been a problem the world over. I very much hope that the White Paper contains a firm commitment to an ecosystems approach to fisheries management and that within that there is the possibility of rebalancing fishing opportunity to try to assist the smaller, more local fishing fleet and give it a fairer cut of the opportunity.
I congratulate my honourable Friend on his campaign. He is quite right that this is an issue that restricts growth locally. We recognise that and have introduced restricting the use of CCTV to enforce parking, grace periods for on-street parking, and have made it possible for local people to require their local councils to review parking. I draw his attention to the Great British High Street portal, which demonstrates that if local authorities reduce their parking rates they receive greater revenue.

Reading the Data

Reading the Aggression Dictionary

Let us first download the aggression dictionary.

We can then read the aggression dictionary:

Python
import pandas as pd
aggression_words = pd.read_csv('/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week8/lecture8b/data/LH_aggression_seed.csv', header=None)
#aggression_words

And then examine the first 50 aggression words:

Python
# Convert aggression words to a set for faster lookup,
# stripping the trailing whitespace present in some raw entries
# (otherwise entries like 'stupid ' would never match a token)
aggression_words_set = set(aggression_words[0].str.strip())
print(list(aggression_words_set)[:50])
['scaremongering', 'hatred', 'fraudulent', 'laughable', 'offend', 'ironic', 'scaremonger', 'ignorant', 'failure', 'mislead', 'irritated', 'disgrace', 'distasteful', 'dodgy', 'furious', 'shambolic', 'trick', 'absurd', 'stupid', 'accusation', 'antagonistic', 'cruel', 'demonise', 'stubborn', 'demonised', 'betray', 'abysmal', 'gimmick', 'condemn', 'shenanigans', 'patronise', 'assaulting', 'blunder', 'confront', 'fail', 'sham', 'prejudices', 'provoke', 'hypocrisy', 'dishonest', 'chaos', 'dishonourable', 'unacceptbale', 'cruelty', 'despicable', 'silliness', 'debased', 'fool', 'atrocious', 'needless']

Counting Aggressive Words in Speech

Python
# Count aggressive words for each row, handling NaN values
# (note: split() does not strip punctuation, so a token like "stupid;"
#  will not match; a regex tokenizer would be more robust)
aggression_texts['aggressive_word_count'] = aggression_texts['body'].apply(
    lambda sentence: sum(word.lower() in aggression_words_set for word in str(sentence).split()))

Counting Aggressive Words in Speech

Short Digression: lambda functions

  • Used for short, single-expression functions without a def.
  • Syntax: lambda arguments: expression.
  • Best for concise operations; avoid for complex logic.

Example 1:

Python
add = lambda x, y: x + y
print(add(3, 5))  # Output: 8
8


Example 2:

Python
df_example = pd.DataFrame({'numbers': [1, 2, 3]})
df_example['squared'] = df_example['numbers'].apply(lambda x: x ** 2)
print(df_example)
   numbers  squared
0        1        1
1        2        4
2        3        9

Counting Aggressive Words in Speech

With the aggressive-word counts in place, we can tabulate how often each count occurs:
Python
# Step 1: Create a frequency table of aggressive word counts
aggressive_word_freq = aggression_texts['aggressive_word_count'].value_counts().reset_index()

# Step 2: Renaming columns
aggressive_word_freq.columns = ['aggressive_word_count', 'frequency']

# Step 3: Convert aggressive_word_count to numeric for correct ordering
aggressive_word_freq['aggressive_word_count'] = aggressive_word_freq['aggressive_word_count'].astype(int)

# Step 4: Sorting by aggressive_word_count
aggressive_word_freq = aggressive_word_freq.sort_values(by='aggressive_word_count')

Counting Aggressive Words in Speech

R
aggressive_word_freq <- reticulate::py$aggressive_word_freq

# Step 4: Graphing
library("ggplot2")
ggplot(aggressive_word_freq, aes(x = aggressive_word_count, y = frequency)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(breaks = seq(1, max(aggressive_word_freq$aggressive_word_count), by = 1)) +
  labs(x = "Number of Aggressive Words", y = "Frequency", title = "Frequency of Aggressive Word Counts per Speech")+
  theme_bw()

Applying Dictionaries in Python

Python
# Step 1: Count total words in each 'body' entry, handling NaN values
aggression_texts['body_word_count'] = aggression_texts['body'].apply(
    lambda sentence: len(str(sentence).split())
)
# Compute each speech's proportion of aggressive words
aggression_texts["proportion"] = aggression_texts['aggressive_word_count'] / aggression_texts['body_word_count']

# Step 2: Create a frequency table of proportions
prop_aggressive_freq = aggression_texts['proportion'].value_counts().reset_index()

# Step 3: Renaming columns
prop_aggressive_freq.columns = ['aggressive_proportion', 'frequency']

# Step 4: Convert aggressive_proportion to numeric for correct ordering
prop_aggressive_freq['aggressive_proportion'] = prop_aggressive_freq['aggressive_proportion'].astype(float)

# Step 5: Sorting the DataFrame by aggressive_proportion
prop_aggressive_freq = prop_aggressive_freq.sort_values(by='aggressive_proportion')

Applying Dictionaries in Python

R
# Step 1: Turning the pandas DataFrame into an R data frame using reticulate
prop_aggressive_freq <- reticulate::py$prop_aggressive_freq
# Step 2: Creating a histogram
library("ggplot2")
ggplot(prop_aggressive_freq, aes(x = aggressive_proportion)) +
  geom_histogram(binwidth = 0.005, color = "white")+
  labs(x = "Proportion of Aggressive Words relative to Total Length of Speech", 
       y = "Frequency", title = "Proportion of Aggressive Word Counts relative to Speech")+
  theme_bw()

Validation

To assess the extent to which we have measured our text with error, it is important to conduct validation tests.

The main concern here is whether texts are flagged for reasons that have little to do with aggression.

There may be different types of validation depending on the research context.

  • For example, British debates are known to be much more aggressive during Question Time.

So, should we expect more aggression during specific times?

Question Time

One good way is to see whether aggression increases during Question Time.

The following line tells us how many debates there are per debate category:

Python
# Assuming aggression_texts is your DataFrame
debate_type_counts = aggression_texts['debate_type'].value_counts()

# To display the result
print(debate_type_counts)
debate_type
prime_ministers_questions    10981
question_time                10289
legislation                   6403
opposition day                 972
Name: count, dtype: int64

Question Time

We can also see these differences if we calculate the average proportion by debate type:

Python
# Group by 'debate_type' and calculate the mean of 'proportions'
mean_dictionary_by_debate_type = (
    aggression_texts
    .groupby('debate_type', as_index=False)  # Group by debate_type, keeping it as a column
    .agg(mean_dictionary=('proportion', 'mean'))  # Calculate mean of proportions
)

# Display the result
print(mean_dictionary_by_debate_type)
                 debate_type  mean_dictionary
0                legislation         0.000459
1             opposition day         0.000485
2  prime_ministers_questions         0.000513
3              question_time         0.000613

Question Time

We can also check whether these differences are statistically significant.

R
aggression_texts_df <- reticulate::py$aggression_texts
summary(lm(proportion ~ debate_type, data = aggression_texts_df))

Call:
lm(formula = proportion ~ debate_type, data = aggression_texts_df)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.000613 -0.000613 -0.000513 -0.000459  0.124541 

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)                          4.591e-04  4.120e-05  11.143  < 2e-16 ***
debate_typeopposition day            2.620e-05  1.135e-04   0.231  0.81740    
debate_typeprime_ministers_questions 5.402e-05  5.184e-05   1.042  0.29734    
debate_typequestion_time             1.534e-04  5.247e-05   2.924  0.00346 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.003297 on 28641 degrees of freedom
Multiple R-squared:  0.0003416, Adjusted R-squared:  0.0002369 
F-statistic: 3.263 on 3 and 28641 DF,  p-value: 0.02047

The question_time coefficient is positive and statistically significant: speeches during Question Time contain a higher proportion of aggressive words than speeches on legislation.

Validity

Let us look at some examples, ordered by aggression score. Note that several of the highest-scoring speeches merely mention fraud or abuse as a topic rather than expressing an aggressive tone:

proportion body
0.125 I thought that I might provoke an intervention.
0.1 Will the honourable Gentleman give way on that ridiculous point?
0.076923 The Electoral Administration Bill includes several provisions to prevent fraud in postal voting.
0.071429 What recent assessment she has made of the effectiveness of the Action Fraud helpline.
0.071429 There can be no defence against the abuse of democratic rights in any country.
0.065217 In two constituency cases I have been told, "It is not in the Serious Fraud Office's remit and the police will not look at the corporate fraud because they do not have the money." So how do we get these corporate fraud cases properly looked at?
0.0625 Order. That is unsatisfactory. No honourable Member would be misleading - perhaps misinformed, but not misleading.
0.0625 What recent representations he has received on the operation of the Rehabilitation of Offenders Act 1974.
0.058824 I hope that the honourable Gentleman will not be so foolish as to trust the Government again.
0.058824 What steps she is taking to ensure that all forms of domestic abuse are recognised and investigated.

Human Judgment as the “Gold Standard”

Human judgment is typically considered the gold standard: would humans code aggression in the same way?

However, even human judgment can be subject to many biases, including:

  • misinterpretation
  • subjective, idiosyncratic conceptualizations of aggression
  • lack of coder training

Asking human subjects to code speeches can also be expensive.

Comparison to ChatGPT Judgment

Large Language Models such as ChatGPT could be an easier way to cross-validate the dictionary method here.

The aggression_texts DataFrame includes a variable, aggression_rating.

This is the variable that contains the ChatGPT rating.

ChatGPT was asked to rate each speech as aggressive (1) or not aggressive (0).

Comparison to ChatGPT Judgment

The following code shows the extent to which the dictionary method variable coincides with the ChatGPT rating:

Python
contingency_table = pd.crosstab(
    aggression_texts['proportion'] > 0,
    aggression_texts['aggression_rating'],
    rownames=['dictionary'],
    colnames=['ChatGPT']
)
print(contingency_table)
ChatGPT       0.0   1.0
dictionary             
False       17437  9400
True          608  1198

Comparison to ChatGPT Judgment

And now we can extract these values separately:

Python
# Extract the values as specified
false_false = contingency_table.loc[False, 0]  # dictionary = False, ChatGPT = 0
true_false = contingency_table.loc[True, 0]    # dictionary = True, ChatGPT = 0
false_true = contingency_table.loc[False, 1]   # dictionary = False, ChatGPT = 1
true_true = contingency_table.loc[True, 1]     # dictionary = True, ChatGPT = 1
Python
print(false_false, true_false, false_true, true_true)
17437 608 9400 1198

There are 1,198 speeches that were categorized as aggressive by both the dictionary method and ChatGPT.

  • these could be considered as true positives


There are 17,437 speeches that were categorized as non-aggressive by both the dictionary method and ChatGPT.

  • these could be considered as true negatives

Comparison to ChatGPT Judgment: Accuracy

We can thus add the true positives (TP) and true negatives (TN) and divide them by the total number of observations:

Python
(true_true+false_false)/(true_true+false_false+false_true+true_false)
0.6505952588765144

We can think of this calculation as a measure of accuracy.

This measures the proportion of correctly classified cases (both positives and negatives) out of the total cases.

\[Accuracy = \frac{\textrm{True Positives}+ \textrm{True Negatives}}{\textrm{Total Observations}}\]

Comparison to ChatGPT Judgment: Naïve Guess

Another metric is the naïve guess, which predicts the most frequent class observed in the dataset for all instances.

Applying it entails a few steps:

  1. Sum the counts for each ChatGPT class label:
  • Class 0: 17437 (false_false) + 608 (true_false) = 18045
  • Class 1: 9400 (false_true) + 1198 (true_true) = 10598
  • Class 0 is more frequent
  2. The naïve guess prediction: predict 0 for all instances

Comparison to ChatGPT Judgment: Naïve Guess

  3. Calculate the accuracy of the naïve guess:

Accuracy = (Number of correct predictions) / (Total number of instances)

  • Correct predictions (all predicted as Class 0): True negatives (actual Class 0): 18045
  • Total instances: 18045 (Class 0) + 10598 (Class 1) = 28643
  • Naïve Guess Accuracy = 18045 / 28643 ≈ 0.63 (63%)

Our model’s accuracy (65%) exceeds the naïve-guess accuracy (63%).

The model captures some patterns beyond the most frequent class, but the improvement is modest.
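
In code, the baseline takes only a few lines (the cell counts come from the contingency table above):

Python
# Naïve baseline: always predict the majority ChatGPT class (0 = non-aggressive)
class_0 = 17437 + 608   # false_false + true_false
class_1 = 9400 + 1198   # false_true + true_true
print(max(class_0, class_1) / (class_0 + class_1))  # 18045 / 28643 ≈ 0.63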

Sensitivity (True Positive Rate)

Sensitivity Definition:

  • Measures the proportion of actual aggressive texts correctly identified.
  • Formula: \(\text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}\)

Calculation: true_true / (true_true + true_false), treating the dictionary labels as the reference (so true_false plays the role of a false negative)

Sensitivity = 1198/(1198 + 608) ≈ 66.33%

  • Interpretation: ChatGPT flags about 66% of the texts that the dictionary marks as aggressive.

Specificity (True Negative Rate)

Specificity Definition:

  • Measures the proportion of non-aggressive texts correctly identified.
  • Formula: \(\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives + False Positives}}\)

Calculation: false_false / (false_false + false_true), where false_true plays the role of a false positive

Specificity = 17437/(17437 + 9400) ≈ 64.97%

  • Interpretation: ChatGPT rates as non-aggressive approximately 64.97% of the texts the dictionary marks as non-aggressive.
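
The three metrics can be reproduced together from the four contingency-table cells; a minimal sketch:

Python
true_true, false_false = 1198, 17437  # dictionary and ChatGPT agree
true_false, false_true = 608, 9400    # dictionary and ChatGPT disagree
total = true_true + false_false + true_false + false_true

accuracy = (true_true + false_false) / total
sensitivity = true_true / (true_true + true_false)
specificity = false_false / (false_false + false_true)

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
# 0.6506 0.6633 0.6497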

Key Evaluation Metrics in Classification

1. Accuracy: The proportion of all correctly classified texts (both aggressive and non-aggressive).

\[ \text{Accuracy} = \frac{\text{True Positives + True Negatives}}{\text{Total Observations}} \]

  • Tells us how often the model makes the correct prediction overall.
  • Limitation: Can be misleading if one class is much more common (class imbalance).

Sensitivity, Specificity, and the Naïve Guess

2. Sensitivity (Recall): Measures the model’s ability to detect aggressive texts.

\[ \text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}} \]

  • High sensitivity means fewer aggressive texts are missed.

3. Specificity: Measures the model’s ability to detect non-aggressive texts.

\[ \text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives + False Positives}} \]

  • High specificity means fewer non-aggressive texts are incorrectly marked as aggressive.

Sensitivity, Specificity, and the Naïve Guess

4. Naïve Guess: A simple baseline where we predict the most frequent class for all texts.

  • Helps us see if the model performs better than a basic guess.
  • Provides a minimum benchmark to assess model improvement.

Which Error should we minimize?

Examples of Error Rate Prioritization

An important question here: what type of error should we try to minimize?

  • Judicial Decisions:
    • Often, we may prefer false negatives over false positives.
    • Question: Would you rather put an innocent person in jail or let a guilty one go free?
  • COVID-19 Testing:
    • Here, false positives might be more acceptable than false negatives.
    • Question: Would you rather isolate unnecessarily or risk unknowingly spreading the virus?

Error Rates in Dictionary-Based Text Analysis

  • False Positives vs. False Negatives:
    • In dictionary-based text analysis, false-positive rates are often lower than false-negative rates.
    • Implication: we may miss some true positives (aggressive texts, for example).
    • Solution: read the missed texts, adjust the dictionary, and reapply it to improve detection.

Our Case

Evaluating Classifier Performance

  • Accuracy: Proportion of correctly classified texts.
    • In our case: (true_true + false_false) / (true_true + false_false + true_false + false_true)
    • Calculation: (1198 + 17437) / 28643 ≈ 65.06%
  • Sensitivity (Recall): Ability to correctly identify aggressive texts.
    • In our case: true_true / (true_true + true_false)
    • Calculation: 1198 / (1198 + 608) ≈ 66.33%
  • Specificity: Ability to correctly identify non-aggressive texts.
    • In our case: false_false / (false_false + false_true)
    • Calculation: 17437 / (17437 + 9400) ≈ 64.97%

Our Case

  • Insights:
    • Accuracy provides an overall performance measure but may be misleading in imbalanced datasets.
    • Sensitivity and specificity offer deeper insights into the classifier’s effectiveness in detecting aggressive and non-aggressive texts, respectively.
  • In our case:
    • The dictionary shows only moderate agreement with ChatGPT in detecting aggressive texts, with limitations in both directions.
    • It misses a substantial portion of the texts that ChatGPT rates as aggressive.
    • It sometimes flags texts that ChatGPT does not rate as aggressive.

Which Errors Should We Minimize?

Our prioritization of false-negative and false-positive rates will often depend on the application.

For judicial decisions, we may prefer false negatives to false positives.
- Would you rather put an innocent person in jail or let a guilty one go free?

For COVID tests, we might be happier to accept false positives than false negatives.
- Would you rather isolate for no reason or spread the virus unknowingly?

How has aggression changed over time?

So, what can we say about aggression within political speeches over time?

Python
import pandas as pd
import numpy as np
from scipy.stats import t

# Aggregate the ChatGPT aggression ratings by year and gender
aggression_trends = (aggression_texts
    .groupby(['year', 'gender'], as_index=False)
    .agg(mean_aggression=('aggression_rating', 'mean'),
         sd_aggression=('aggression_rating', 'std'),
         n=('aggression_rating', 'size'))
)

# Calculating standard error, confidence interval lower and upper bounds
aggression_trends['se'] = aggression_trends['sd_aggression'] / np.sqrt(aggression_trends['n'])
aggression_trends['ci_lower'] = aggression_trends['mean_aggression'] - t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
aggression_trends['ci_upper'] = aggression_trends['mean_aggression'] + t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
R
library(reticulate)
aggression_trends2 <- reticulate::py$aggression_trends
ggplot(aggression_trends2, aes(x = year, y = mean_aggression, color = gender)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = gender), alpha = 0.2) +
  labs(
    title = "Aggression Rating Over Time by Gender",
    x = "Year",
    y = "Mean Aggression Rating"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(aggression_trends2$year), max(aggression_trends2$year), by = 1))+
    geom_hline(yintercept = 0)+
    theme_bw()

How has aggression changed over time?

We can ask the same question using the dictionary-based proportion measure:

Python
import pandas as pd
import numpy as np
from scipy.stats import t

# Aggregate the dictionary proportions by year and gender
aggression_trends = (aggression_texts
    .groupby(['year', 'gender'], as_index=False)
    .agg(mean_aggression=('proportion', 'mean'),
         sd_aggression=('proportion', 'std'),
         n=('aggression_rating', 'size'))
)

# Calculating standard error, confidence interval lower and upper bounds
aggression_trends['se'] = aggression_trends['sd_aggression'] / np.sqrt(aggression_trends['n'])
aggression_trends['ci_lower'] = aggression_trends['mean_aggression'] - t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
aggression_trends['ci_upper'] = aggression_trends['mean_aggression'] + t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
R
aggression_trends2 <- reticulate::py$aggression_trends
# Plot the mean dictionary proportion over time with confidence intervals
ggplot(aggression_trends2, aes(x = year, y = mean_aggression, color = gender)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = gender), alpha = 0.2) +
  labs(
    title = "Dictionary Aggression Proportion Over Time by Gender",
    x = "Year",
    y = "Mean Proportion of Aggressive Words"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(aggression_trends2$year), max(aggression_trends2$year), by = 1))+
    geom_hline(yintercept = 0)+
    theme_bw()

Observations

The graph based on the ChatGPT ratings, restricted to the period after 1997, is very similar to the one in the original article:

Python
import pandas as pd
import numpy as np
from scipy.stats import t

# Aggregate the ChatGPT aggression ratings by year and gender
aggression_trends = (aggression_texts
    .groupby(['year', 'gender'], as_index=False)
    .agg(mean_aggression=('aggression_rating', 'mean'),
         sd_aggression=('aggression_rating', 'std'),
         n=('aggression_rating', 'size'))
)

# Calculating standard error, confidence interval lower and upper bounds
aggression_trends['se'] = aggression_trends['sd_aggression'] / np.sqrt(aggression_trends['n'])
aggression_trends['ci_lower'] = aggression_trends['mean_aggression'] - t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
aggression_trends['ci_upper'] = aggression_trends['mean_aggression'] + t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
R
library(reticulate)
aggression_trends2 <- reticulate::py$aggression_trends
aggression_trends3<-subset(aggression_trends2, year>1997)
ggplot(aggression_trends3, aes(x = year, y = mean_aggression, color = gender)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = gender), alpha = 0.2) +
  labs(
    title = "Aggression Rating Over Time by Gender",
    x = "Year",
    y = "Mean Aggression Rating"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(aggression_trends3$year), max(aggression_trends3$year), by = 1))+
    geom_hline(yintercept = 0)+
    theme_bw()

Observations

Hargrave and Blumenau (2022) enhance aggression measurement by adopting a more refined approach:

  • Start with a pre-existing dictionary of aggressive terms.
  • Leverage word embeddings to:
    • expand the dictionary with terms relevant to parliamentary language;
    • upweight words frequently used in similar contexts within parliamentary speech;
    • downweight words less commonly used in similar contexts.
  • Use these adjusted word lists to score speeches (a sketch of the expansion idea follows below).
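
A minimal sketch of the expansion step, assuming a pretrained word2vec-format embedding file (the file name and seed list are hypothetical; gensim is one of several libraries that could be used, and this sketch captures only the similarity-based expansion, not the paper's full weighting scheme):

Python
from gensim.models import KeyedVectors

# Hypothetical embeddings trained on parliamentary speech
kv = KeyedVectors.load_word2vec_format("parliament_vectors.txt")

seeds = ["stupid", "ignorant", "accuse"]  # illustrative seed terms

# Expand the dictionary with each seed's nearest neighbours,
# keeping cosine similarity as a candidate weight
expanded = {}
for seed in seeds:
    if seed in kv:
        for word, similarity in kv.most_similar(seed, topn=10):
            expanded[word] = max(expanded.get(word, 0.0), similarity)

# Highest-similarity candidates first, for manual review before adoption
print(sorted(expanded.items(), key=lambda item: -item[1])[:10])

Candidates produced this way would still be reviewed by hand before being added, since nearest neighbours can include topically related but non-aggressive words.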

Conclusion

Limitations of Dictionary Methods: Dictionaries often struggle to accurately capture aggression in speeches due to limited contextual understanding.

Evaluating Classifier Performance: Beyond overall accuracy, it’s essential to consider within-class metrics like specificity and sensitivity for a nuanced assessment.

Benefits of Dictionary Methods: Fast and straightforward with many ready-to-use implementations, making them easy to apply.

Context Sensitivity: The validity of dictionaries depends on the context in which they were created and applied, which can limit their generalizability.

Conclusion

ChatGPT has been shown to be as accurate as, or even more accurate than, human coding for tasks such as identifying aggression within speeches.

  • Deep Contextual Understanding: LLMs grasp complex language patterns and nuances.
  • Minimal Feature Engineering: Automatically learn relevant features from raw text.
  • Scalability: Efficiently handle large volumes of text.