# Selecting the relevant column
all_texts = aggression_texts[['body']]
top_three_texts = all_texts.head(3)

# Print the entries
print(top_three_texts)
body
Does the Minister agree that if one does not provide litigants with legal aid as speedily and appropriately as possible, one builds up a backlog? Does he agree with the Lord Chancellor's circular that the way in which cases are banked up has delayed matters, just as the failure to grant expert witnesses and the failure to grant civil legal aid to most people have done? In all, there is no justice in this country because the Conservative party has sought to destroy the whole litigation system.
My right honourable Friend will know that there is no greater critic of the common fisheries policy than me, but I am sure he would agree that even had we not gone into it, we would probably still have a problem, because man's technical ability to harvest vast quantities from the sea has been a problem the world over. I very much hope that the White Paper contains a firm commitment to an ecosystems approach to fisheries management and that within that there is the possibility of rebalancing fishing opportunity to try to assist the smaller, more local fishing fleet and give it a fairer cut of the opportunity.
I congratulate my honourable Friend on his campaign. He is quite right that this is an issue that restricts growth locally. We recognise that and have introduced restricting the use of CCTV to enforce parking, grace periods for on-street parking, and have made it possible for local people to require their local councils to review parking. I draw his attention to the Great British High Street portal, which demonstrates that if local authorities reduce their parking rates they receive greater revenue.
# Count aggressive words for each row, handling NaN values
aggression_texts['aggressive_word_count'] = aggression_texts['body'].apply(
    lambda sentence: sum(word.lower() in aggression_words_set for word in str(sentence).split())
)
Counting Aggressive Words in Speech
Short Digression: lambda functions
Used for short, single-expression functions without a def.
Syntax: lambda arguments: expression.
Best for concise operations; avoid for complex logic (a short example follows).
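As a quick illustration (the function names here are purely for demonstration), the two definitions below are equivalent; the lambda form is the one we pass to .apply():

```python
# A regular named function ...
def count_words(sentence):
    return len(str(sentence).split())

# ... and the equivalent lambda, handy inline inside .apply()
count_words_lambda = lambda sentence: len(str(sentence).split())

print(count_words("Order, order!"))         # 2
print(count_words_lambda("Order, order!"))  # 2
```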
# Count aggressive words for each row, handling NaN values
aggression_texts['aggressive_word_count'] = aggression_texts['body'].apply(
    lambda sentence: sum(word.lower() in aggression_words_set for word in str(sentence).split())
)
Python
# Step 1: Create a frequency table of aggressive word counts
aggressive_word_freq = aggression_texts['aggressive_word_count'].value_counts().reset_index()

# Step 2: Renaming columns
aggressive_word_freq.columns = ['aggressive_word_count', 'frequency']

# Step 3: Convert aggressive_word_count to numeric for correct ordering
aggressive_word_freq['aggressive_word_count'] = aggressive_word_freq['aggressive_word_count'].astype(int)

# Step 4: Sorting by aggressive_word_count
aggressive_word_freq = aggressive_word_freq.sort_values(by='aggressive_word_count')
Counting Aggressive Words in Speech
See code
R
aggressive_word_freq <- reticulate::py$aggressive_word_freq

# Step 4: Graphing
library("ggplot2")
ggplot(aggressive_word_freq, aes(x = aggressive_word_count, y = frequency)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(breaks = seq(1, max(aggressive_word_freq$aggressive_word_count), by = 1)) +
  labs(x = "Number of Aggressive Words", y = "Frequency",
       title = "Frequency of Aggressive Word Counts per Speech") +
  theme_bw()
Applying Dictionaries in Python
Python
# Count total words in each 'body' entry, handling NaN values
aggression_texts['body_word_count'] = aggression_texts['body'].apply(lambda sentence: len(str(sentence).split()))
aggression_texts["proportion"] = aggression_texts['aggressive_word_count'] / aggression_texts['body_word_count']

# Step 2: Create a frequency table of proportions
prop_aggressive_freq = aggression_texts['proportion'].value_counts().reset_index()

# Step 3: Renaming columns
prop_aggressive_freq.columns = ['aggressive_proportion', 'frequency']

# Step 4: Convert aggressive_proportion to numeric for correct ordering
prop_aggressive_freq['aggressive_proportion'] = prop_aggressive_freq['aggressive_proportion'].astype(float)

# Step 5: Sorting the DataFrame by aggressive_proportion
prop_aggressive_freq = prop_aggressive_freq.sort_values(by='aggressive_proportion')
Applying Dictionaries in Python
See code
R
# Step 1: Turning the pandas DataFrame into an R data frame using reticulate
prop_aggressive_freq <- reticulate::py$prop_aggressive_freq

# Step 2: Creating a histogram
library("ggplot2")
ggplot(prop_aggressive_freq, aes(x = aggressive_proportion)) +
  geom_histogram(binwidth = 0.005, color = "white") +
  labs(x = "Proportion of Aggressive Words relative to Total Length of Speech",
       y = "Frequency",
       title = "Proportion of Aggressive Word Counts relative to Speech") +
  theme_bw()
Validation
To gauge how much measurement error our dictionary introduces, it is important to conduct validation tests.
The main concern here is whether texts are flagged for other reasons that have less to do with aggression.
There may be different types of validation depending on the research context.
For example, British debates are much more aggressive during Question Time.
So, should we expect more aggression during specific times?
Question Time
One good way is to see whether aggression increases during Question Time.
The following code tells us how many speeches there are in each debate category:
Python
# Assuming aggression_texts is your DataFrame
debate_type_counts = aggression_texts['debate_type'].value_counts()

# To display the result
print(debate_type_counts)
debate_type
prime_ministers_questions 10981
question_time 10289
legislation 6403
opposition day 972
Name: count, dtype: int64
Question Time
We can also see these differences if we calculate the average proportion by debate type:
Python
# Group by 'debate_type' and calculate the mean of 'proportion'
mean_dictionary_by_debate_type = (
    aggression_texts
    .groupby('debate_type', as_index=False)         # Group by debate_type, keeping it as a column
    .agg(mean_dictionary=('proportion', 'mean'))    # Calculate the mean of proportion
)

# Display the result
print(mean_dictionary_by_debate_type)
debate_type mean_dictionary
0 legislation 0.000459
1 opposition day 0.000485
2 prime_ministers_questions 0.000513
3 question_time 0.000613
Question Time
We can also check whether these differences are statistically significant.
R
aggression_texts_df <- reticulate::py$aggression_texts
summary(lm(proportion ~ debate_type, data = aggression_texts_df))
Call:
lm(formula = proportion ~ debate_type, data = aggression_texts_df)
Residuals:
Min 1Q Median 3Q Max
-0.000613 -0.000613 -0.000513 -0.000459 0.124541
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.591e-04 4.120e-05 11.143 < 2e-16 ***
debate_typeopposition day 2.620e-05 1.135e-04 0.231 0.81740
debate_typeprime_ministers_questions 5.402e-05 5.184e-05 1.042 0.29734
debate_typequestion_time 1.534e-04 5.247e-05 2.924 0.00346 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.003297 on 28641 degrees of freedom
Multiple R-squared: 0.0003416, Adjusted R-squared: 0.0002369
F-statistic: 3.263 on 3 and 28641 DF, p-value: 0.02047
Validity
Let us look at some examples, ordered by their aggression score (a sketch of the code that produces this listing follows the table):
proportion  body
0.125       I thought that I might provoke an intervention.
0.1         Will the honourable Gentleman give way on that ridiculous point?
0.076923    The Electoral Administration Bill includes several provisions to prevent fraud in postal voting.
0.071429    What recent assessment she has made of the effectiveness of the Action Fraud helpline.
0.071429    There can be no defence against the abuse of democratic rights in any country.
0.065217    In two constituency cases I have been told, "It is not in the Serious Fraud Office's remit and the police will not look at the corporate fraud because they do not have the money." So how do we get these corporate fraud cases properly looked at?
0.0625      Order. That is unsatisfactory. No honourable Member would be misleading - perhaps misinformed, but not misleading.
0.0625      What recent representations he has received on the operation of the Rehabilitation of Offenders Act 1974.
0.058824    I hope that the honourable Gentleman will not be so foolish as to trust the Government again.
0.058824    What steps she is taking to ensure that all forms of domestic abuse are recognised and investigated.
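A listing like the one above can be produced by sorting on the proportion column. This is a minimal sketch rather than the exact code behind the table; the number of rows shown is illustrative:

```python
# Sort speeches by the share of aggressive words and show the most aggressive ones
top_aggressive = (
    aggression_texts
    .sort_values('proportion', ascending=False)
    [['proportion', 'body']]
    .head(10)
)
print(top_aggressive)
```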
Human Judgment as the “Gold Standard”
Human judgment can be considered the typical standard: would humans code aggression in the same way?
However, even human judgment could be subject to many biases including:
misinterpretation
subjective, idiosyncratic conceptualizations of aggression
lack of coder training
Asking human subjects to code speeches can also be expensive.
Comparison to ChatGPT Judgment
Large Language Models such as ChatGPT could be an easier way to cross-validate the dictionary method here.
The aggression_texts data.frame includes a variable, aggression_rating.
This is the variable that contains the ChatGPT rating.
ChatGPT was asked to rate each text as aggressive (1) or not aggressive (0).
Comparison to ChatGPT Judgment
The following code shows the extent to which the dictionary method variable coincides with the ChatGPT rating:
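The comparison code itself is not reproduced on this slide. A minimal sketch, assuming we turn the dictionary count into a binary flag (at least one aggressive word; the dictionary_flag name is a hypothetical addition), could look like this:

```python
import pandas as pd

# Hypothetical binary flag from the dictionary method: 1 if any aggressive word was found
aggression_texts['dictionary_flag'] = (aggression_texts['aggressive_word_count'] > 0).astype(int)

# Cross-tabulate the ChatGPT rating (rows) against the dictionary flag (columns)
confusion = pd.crosstab(aggression_texts['aggression_rating'],
                        aggression_texts['dictionary_flag'])
print(confusion)
```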
High specificity means fewer non-aggressive texts are incorrectly marked as aggressive.
Sensitivity, Specificity, and the Naïve Guess
Naïve Guess: A simple baseline where we predict the most frequent class for all texts (a small example follows below).
Helps us see if the model performs better than a basic guess.
Provides a minimum benchmark to assess model improvement.
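A minimal sketch of the naïve-guess baseline, using a tiny made-up label vector (the numbers are illustrative, not from our data):

```python
import pandas as pd

# Hypothetical 0/1 labels: 1 = aggressive, 0 = not aggressive
labels = pd.Series([0, 0, 0, 1, 0, 0, 1, 0])

# The naïve guess always predicts the most frequent class
majority_class = labels.mode()[0]
naive_accuracy = (labels == majority_class).mean()

print(majority_class)   # 0
print(naive_accuracy)   # 0.75
```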
Which error should we minimize?
Examples of Error Rate Prioritization
An important question here: what type of error should we try to minimize?
Judicial Decisions:
Often, we may prefer false negatives over false positives.
Question: Would you rather put an innocent person in jail or let a guilty one go free?
COVID-19 Testing:
Here, false positives might be more acceptable than false negatives.
Question: Would you rather isolate unnecessarily or risk unknowingly spreading the virus?
Error Rates in Dictionary-Based Text Analysis
False Positives vs. False Negatives:
In dictionary-based text analysis, the false-positive rate is often lower than the false-negative rate.
Implication: We may miss some genuinely aggressive texts (false negatives).
Solution: Read missed texts, adjust the dictionary, and reapply to improve detection (a sketch of how to pull out the missed texts follows).
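For example, a minimal sketch of how one might pull out speeches the dictionary missed, assuming the ChatGPT aggression_rating (1 = aggressive) serves as the reference:

```python
# Speeches rated as aggressive by ChatGPT but with no dictionary hits
missed = aggression_texts[
    (aggression_texts['aggression_rating'] == 1)
    & (aggression_texts['aggressive_word_count'] == 0)
]

# Read a few of them to spot candidate words to add to the dictionary
for text in missed['body'].head(5):
    print(text, "\n---")
```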
Our Case
Evaluating Classifier Performance
Accuracy: Proportion of correctly classified texts.
In our case: (true_true + false_false) / (true_true + false_false + true_false + false_true)
Calculation: (1198 + 17437) / (28643) ≈ 65.06%
Sensitivity (Recall): Ability to correctly identify aggressive texts.
In our case: (true_true) / (true_true + true_false)
Calculation: 1198 / (1198 + 608) ≈ 66.33%
Specificity: Ability to correctly identify non-aggressive texts.
In our case: (false_false) / (false_false + false_true)
Calculation: 17437 / (17437 + 9400) ≈ 64.97%
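Reading the cell labels as chatgpt_dictionary (an assumption about the naming convention), the figures above can be reproduced directly from the counts; the naïve-guess baseline is included for comparison:

```python
# Confusion-matrix cells as reported on the slide
true_true = 1198      # ChatGPT aggressive, dictionary aggressive
true_false = 608      # ChatGPT aggressive, dictionary non-aggressive
false_false = 17437   # ChatGPT non-aggressive, dictionary non-aggressive
false_true = 9400     # ChatGPT non-aggressive, dictionary aggressive

total = true_true + true_false + false_false + false_true

accuracy = (true_true + false_false) / total             # ~0.6506
sensitivity = true_true / (true_true + true_false)       # ~0.6633
specificity = false_false / (false_false + false_true)   # ~0.6497

# Naïve guess: always predict the more frequent (non-aggressive) class
naive_accuracy = (false_false + false_true) / total      # ~0.94

print(accuracy, sensitivity, specificity, naive_accuracy)
```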
Our Case
Insights:
Accuracy provides an overall performance measure but may be misleading in imbalanced datasets.
Sensitivity and specificity offer deeper insights into the classifier’s effectiveness in detecting aggressive and non-aggressive texts, respectively.
In our case
Our aggression dictionary has a moderate success in detecting aggressive texts but shows limitations in both precision and recall.
It frequently misses a substantial portion of aggressive texts, resulting in moderate recall.
It sometimes misclassifies non-aggressive texts as aggressive.
Which errors should we minimize?
Our prioritization of false-negative and false-positive rates will often depend on the application.
For judicial decisions, we might prefer false negatives to false positives.
- Would you rather put an innocent person in jail or let a guilty one go free?
For COVID tests, we might be happier to accept false positives than false negatives.
- Would you rather isolate for no reason or spread the virus unknowingly?
How has aggression changed over time?
So, what can we say about aggression within political speeches over time?
See code
Python
import pandas as pd
import numpy as np
from scipy.stats import t

# Group by year and gender and summarise the ChatGPT aggression rating
aggression_trends = (
    aggression_texts
    .groupby(['year', 'gender'], as_index=False)
    .agg(mean_aggression=('aggression_rating', 'mean'),
         sd_aggression=('aggression_rating', 'std'),
         n=('aggression_rating', 'size'))
)

# Calculating standard error, confidence interval lower and upper bounds
aggression_trends['se'] = aggression_trends['sd_aggression'] / np.sqrt(aggression_trends['n'])
aggression_trends['ci_lower'] = aggression_trends['mean_aggression'] - t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
aggression_trends['ci_upper'] = aggression_trends['mean_aggression'] + t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
See code
R
library(reticulate)
aggression_trends2 <- reticulate::py$aggression_trends

ggplot(aggression_trends2, aes(x = year, y = mean_aggression, color = gender)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = gender), alpha = 0.2) +
  labs(title = "Aggression Rating Over Time by Gender",
       x = "Year",
       y = "Mean Aggression Rating") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(aggression_trends2$year), max(aggression_trends2$year), by = 1)) +
  geom_hline(yintercept = 0) +
  theme_bw()
How has aggression changed over time?
So, what can we say about aggression within political speeches over time?
See code
Python
import pandas as pd
import numpy as np
from scipy.stats import t

# Group by year and gender and summarise the dictionary-based proportion of aggressive words
aggression_trends = (
    aggression_texts
    .groupby(['year', 'gender'], as_index=False)
    .agg(mean_aggression=('proportion', 'mean'),
         sd_aggression=('proportion', 'std'),
         n=('aggression_rating', 'size'))
)

# Calculating standard error, confidence interval lower and upper bounds
aggression_trends['se'] = aggression_trends['sd_aggression'] / np.sqrt(aggression_trends['n'])
aggression_trends['ci_lower'] = aggression_trends['mean_aggression'] - t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
aggression_trends['ci_upper'] = aggression_trends['mean_aggression'] + t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
See code
R
aggression_trends2 <- reticulate::py$aggression_trends

# Plot the mean aggressive-word proportion over time with confidence intervals
ggplot(aggression_trends2, aes(x = year, y = mean_aggression, color = gender)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = gender), alpha = 0.2) +
  labs(title = "Aggressive-Word Proportion Over Time by Gender",
       x = "Year",
       y = "Mean Proportion of Aggressive Words") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(aggression_trends2$year), max(aggression_trends2$year), by = 1)) +
  geom_hline(yintercept = 0) +
  theme_bw()
Observations
The graph using the ChatGPT ratings is very similar to the one in the original article:
See code
Python
import pandas as pd
import numpy as np
from scipy.stats import t

# Group by year and gender and summarise the ChatGPT aggression rating
aggression_trends = (
    aggression_texts
    .groupby(['year', 'gender'], as_index=False)
    .agg(mean_aggression=('aggression_rating', 'mean'),
         sd_aggression=('aggression_rating', 'std'),
         n=('aggression_rating', 'size'))
)

# Calculating standard error, confidence interval lower and upper bounds
aggression_trends['se'] = aggression_trends['sd_aggression'] / np.sqrt(aggression_trends['n'])
aggression_trends['ci_lower'] = aggression_trends['mean_aggression'] - t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
aggression_trends['ci_upper'] = aggression_trends['mean_aggression'] + t.ppf(0.975, df=aggression_trends['n'] - 1) * aggression_trends['se']
See code
R
library(reticulate)
aggression_trends2 <- reticulate::py$aggression_trends
aggression_trends3 <- subset(aggression_trends2, year > 1997)

ggplot(aggression_trends3, aes(x = year, y = mean_aggression, color = gender)) +
  geom_line() +
  geom_point() +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper, fill = gender), alpha = 0.2) +
  labs(title = "Aggression Rating Over Time by Gender",
       x = "Year",
       y = "Mean Aggression Rating") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_x_continuous(breaks = seq(min(aggression_trends3$year), max(aggression_trends3$year), by = 1)) +
  geom_hline(yintercept = 0) +
  theme_bw()
Start with a pre-existing dictionary of aggressive terms.
Leverage word embeddings to:
Expand the dictionary with terms relevant to parliamentary language.
Upweight words frequently used in similar contexts within parliamentary speech.
Downweight words less commonly used in similar contexts.
Use these adjusted word lists to score speeches effectively (a sketch of the expansion step follows).
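A minimal sketch of the expansion step, using a generic pretrained embedding from gensim; the model name, seed words, and similarity cut-off are illustrative assumptions, not the procedure from the original study:

```python
import gensim.downloader as api

# Load a generic pretrained embedding (illustrative choice of model)
vectors = api.load("glove-wiki-gigaword-100")

# Hypothetical seed terms from a pre-existing aggression dictionary
seed_words = ["attack", "insult", "hostile"]

# Expand: collect the nearest neighbours of each seed word
expanded = set(seed_words)
for word in seed_words:
    if word in vectors:
        for neighbour, similarity in vectors.most_similar(word, topn=10):
            if similarity > 0.6:  # illustrative cut-off for "used in similar contexts"
                expanded.add(neighbour)

print(sorted(expanded))
```

The same similarity scores could, in principle, be kept as weights to upweight terms common in parliamentary contexts and downweight the rest before scoring speeches.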
Conclusion
Limitations of Dictionary Methods: Dictionaries often struggle to accurately capture aggression in speeches due to limited contextual understanding.
Evaluating Classifier Performance: Beyond overall accuracy, it’s essential to consider within-class metrics like specificity and sensitivity for a nuanced assessment.
Benefits of Dictionary Methods: Fast and straightforward with many ready-to-use implementations, making them easy to apply.
Context Sensitivity: The validity of dictionaries depends on the context in which they were created and applied, which can limit their generalizability.
Conclusion
ChatGPT has been shown to be as accurate as, or even more accurate than, human coding for tasks such as identifying aggression within speeches.