Assignment 4
Instructions
Complete the questions by adding the code under the relevant chunks.
Please submit your assignment to me via email as an HTML file. Please name your file "yourlastname_assignment4.html". For example, I would submit a file called "popescu_assignment4.html". Please make sure that it is all in lower case.
To obtain the same look as this template, make sure that your assignment preamble (YAML header) looks like the one below:
```yaml
---
title: "Assignment 4"
author: "Your Name"
date: "September 13, 2024"
format:
  html:
    toc: true
    number-sections: true
    colorlinks: true
    smooth-scroll: true
    embed-resources: true
---
```
Note that headers such as "Question" need to be marked with `#` (i.e., `# Question` will appear as a header in bold, larger font after you render the document, and will thus appear in the table of contents). Regular text can just be written on the line underneath.
Download the following British parliamentary speeches dataset and load it into your computer’s memory.
1 Question
Calculate and plot the average Flesch Reading Ease Score for the Conservative and the Labour Party in the UK. Your graph should look like below. Has political speech become easier or harder to understand over time?
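One way to compute the score is to implement the Flesch formula directly with a heuristic syllable counter, as sketched below; in practice a package such as `textstat` gives a more careful implementation.

```python
import re

def count_syllables(word):
    # Heuristic: count runs of consecutive vowels, subtracting a silent final 'e'.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_reading_ease(text):
    # Flesch Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

From there, grouping the per-speech scores by year and party (column names such as `year` and `party` are assumptions about the dataset) gives the series to plot.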
2 Question
Jeremy Corbyn is a politician who has delivered speeches from 1993 all the way to 2020. Make a time trend of how Corbyn’s Reading Ease Score evolved over time. Make a 3-year rolling average. Your graph should look like below. Is Corbyn easier or harder to understand as we approach the present time?
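The rolling average itself is a one-liner in pandas; a minimal sketch on made-up yearly scores (the real series would come from grouping Corbyn's speeches by year):

```python
import pandas as pd

# Hypothetical yearly average scores, standing in for the real data.
yearly = pd.Series([55.0, 60.0, 50.0, 65.0, 70.0],
                   index=[1993, 1994, 1995, 1996, 1997])

# 3-year rolling mean; min_periods=1 keeps the first two years on the plot.
rolling = yearly.rolling(window=3, min_periods=1).mean()
```

Dropping `min_periods=1` would leave the first two points as NaN instead of averaging over fewer than three years.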
3 Question
Identify and print out Corbyn’s easiest and hardest speech to comprehend.
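Assuming each speech already has a Flesch score, `idxmax`/`idxmin` pick out the rows directly; a sketch on toy data (column names are assumptions):

```python
import pandas as pd

# Toy stand-in for a DataFrame of Corbyn's speeches with precomputed scores.
corbyn = pd.DataFrame({
    "speech": ["speech A", "speech B", "speech C"],
    "flesch": [42.1, 71.5, 30.8],
})

easiest = corbyn.loc[corbyn["flesch"].idxmax()]   # highest score = easiest
hardest = corbyn.loc[corbyn["flesch"].idxmin()]   # lowest score = hardest
```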
4 Question
The topic of immigration is something that is current in political debates. From all the speeches that you have in your dataset, only select the ones that contain the following keywords: "immigration", "migrant", "migrants", "refugee", "refugees", "asylum", "deportation", "visa", "immigrant", "illegal immigrant". How many speeches contain those words?
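One way to build the filter is a single word-boundary regex over all the keywords; a sketch on three toy speeches (the `text` column name is an assumption about the dataset):

```python
import re
import pandas as pd

keywords = ["immigration", "migrant", "migrants", "refugee", "refugees",
            "asylum", "deportation", "visa", "immigrant", "illegal immigrant"]

# Toy stand-in for the real speeches DataFrame.
speeches = pd.DataFrame({"text": [
    "We must reform the visa system.",
    "The budget deficit is growing.",
    "Refugees deserve protection.",
]})

# \b word boundaries avoid partial matches; longer phrases are listed first
# so "illegal immigrant" is tried before plain "immigrant".
pattern = (r"\b(?:"
           + "|".join(sorted(map(re.escape, keywords), key=len, reverse=True))
           + r")\b")
mask = speeches["text"].str.contains(pattern, case=False, regex=True)
speeches_immigration = speeches[mask]
print(len(speeches_immigration))  # number of matching speeches
```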
5 Question
Create a 3-year rolling average showing the percentage of speeches that contain those keywords related to immigration over time. Your graph should look like below.
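Assuming each speech carries a boolean immigration flag (e.g., the mask from the keyword filter), the yearly percentage and its rolling average can be sketched as:

```python
import pandas as pd

# Stand-in: one row per speech, with its year and an immigration flag.
df = pd.DataFrame({
    "year":        [1990, 1990, 1991, 1991, 1992, 1992],
    "immigration": [True, False, True, True, False, False],
})

# Share of immigration speeches per year, then a 3-year rolling average.
pct_by_year = df.groupby("year")["immigration"].mean() * 100
rolling_pct = pct_by_year.rolling(window=3, min_periods=1).mean()
```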
6 Question
For the speeches identified as being about immigration, perform LDA. As part of the text-cleaning procedure, please also remove the following words: "honourable", "member", "government", "minister", "gentleman", "lady", "secretary", "right", "friend".
Your graph should look like below (approximately).
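A full LDA pipeline (e.g., gensim's `Dictionary`, `doc2bow`, and `LdaModel`) starts from cleaned token lists; the sketch below shows only the custom-stopword step from the question (in practice you would also remove standard English stopwords):

```python
import re

# Extra domain stopwords listed in the question; a real pipeline would also
# strip standard English stopwords before building the LDA dictionary/corpus.
extra_stopwords = {"honourable", "member", "government", "minister",
                   "gentleman", "lady", "secretary", "right", "friend"}

def clean_tokens(text):
    # Lowercase, keep alphabetic tokens, drop extra stopwords and very short words.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in extra_stopwords and len(t) > 2]

print(clean_tokens("My Right Honourable Friend raised immigration policy."))
```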
7 Question
What is the right number of topics in LDA here? Show coherence scores for your model for up to 20 topics. Your graph should look like this (approximately).
8 Question
Use your trained `lda_model` and `corpus` to extract the topic distribution for each speech. (1) Create a DataFrame with one column per topic, storing the probability that each speech belongs to that topic. (2) Merge this topic distribution with your `speeches_immigration` DataFrame. (3) For each topic, find the speech with the highest probability. (4) Store the top speeches in a new DataFrame and map each one to its top 10 terms using `lda_model.show_topic`. (5) Extract and display the full text and speaker name of the top speech for topic 0, along with the associated topic keywords.
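The merge-and-pick logic can be sketched with a hypothetical probability matrix standing in for what `lda_model.get_document_topics(corpus)` would produce:

```python
import pandas as pd

# Hypothetical per-speech topic probabilities (rows = speeches, cols = topics).
topic_probs = pd.DataFrame(
    [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.3, 0.3, 0.4]],
    columns=["topic_0", "topic_1", "topic_2"],
)

# Toy stand-in for the speeches_immigration DataFrame.
speeches_immigration = pd.DataFrame({"speaker": ["A", "B", "C"],
                                     "text": ["...", "...", "..."]})
merged = pd.concat([speeches_immigration.reset_index(drop=True), topic_probs],
                   axis=1)

# For each topic, the row index of the speech with the highest probability.
top_per_topic = {col: merged[col].idxmax() for col in topic_probs.columns}
```

The remaining steps would then call `lda_model.show_topic(k, topn=10)` on each winning topic index to attach the keywords.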
9 Question
Apply a BERT-based topic model (e.g., BERTopic) to the same immigration speeches. The output should look like below:
10 Question
Which topic does the following speech belong to? Print the main keywords associated with that topic.
By implication, surely we are talking about a proposed location. The Government will not go to the expense of building an accommodation centre before they ask the independent monitor whether it meets the needs of asylum seekers. Clearly, the Government will have to ask the independent monitor to express that view on a proposed location before an accommodation centre is built. Otherwise, there will be considerable nugatory expenditure.
11 Question
Select a random sample of 1,000 speeches from all the speeches.
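pandas' `sample` does this directly; a sketch with 30 toy speeches and n=10 (for the assignment, n=1000 on the full dataset):

```python
import pandas as pd

speeches = pd.DataFrame({"text": [f"speech {i}" for i in range(30)]})

# random_state makes the sample reproducible across runs.
sample = speeches.sample(n=10, random_state=42)
```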
12 Question
Use the OpenAI API to classify the sentiment of those 1,000 speeches using chain-of-thought reasoning. The outcome categories should be positive, negative, and neutral.
To get full points, you need to:
Include the full prompt
Create a barplot indicating how many speeches are neutral, positive, and negative. The results don’t need to be exactly the same. See the example below.
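One possible shape for the chain-of-thought setup is a prompt that forces a machine-readable final line, plus a parser for that line. The actual API call (e.g., via the `openai` client) is omitted here, and the prompt wording and `FINAL:` convention are assumptions, not the required solution:

```python
# Hypothetical chain-of-thought prompt template; {speech} is filled per speech.
PROMPT = (
    "You are a sentiment classifier for parliamentary speeches.\n"
    "Think step by step: summarise the speech, list positive and negative cues,\n"
    "then weigh them. Finish with exactly one line:\n"
    "FINAL: positive, FINAL: negative, or FINAL: neutral.\n\n"
    "Speech:\n{speech}"
)

def parse_label(response_text):
    # Take the last 'FINAL:' line so the reasoning itself is never mistaken
    # for the answer.
    for line in reversed(response_text.strip().splitlines()):
        if line.strip().upper().startswith("FINAL:"):
            label = line.split(":", 1)[1].strip().lower().rstrip(".")
            if label in {"positive", "negative", "neutral"}:
                return label
    return "neutral"  # fall back when the model deviates from the format
```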
13 Question
Use two machine learning algorithms (multinomial naive Bayes and logistic regression) and print the classification reports. They should look like below (note: they don't need to be exactly the same).
TfidfVectorizer(max_features=10000, stop_words='english')
MultinomialNB()
```
              precision    recall  f1-score   support

    negative       0.56      0.69      0.62        58
     neutral       0.68      0.35      0.46        49
    positive       0.63      0.70      0.66        93

    accuracy                           0.61       200
   macro avg       0.62      0.58      0.58       200
weighted avg       0.62      0.61      0.60       200
```
LogisticRegression(class_weight='balanced', max_iter=1000)
```
              precision    recall  f1-score   support

    negative       0.54      0.55      0.55        58
     neutral       0.48      0.55      0.51        49
    positive       0.73      0.67      0.70        93

    accuracy                           0.60       200
   macro avg       0.58      0.59      0.59       200
weighted avg       0.61      0.60      0.61       200
```
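Both models share the same TF-IDF features; a self-contained sketch on a tiny synthetic corpus (the real inputs are the 1,000 labelled speeches and their OpenAI sentiment labels):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Tiny stand-in corpus; texts and labels cycle in step (pos, neg, neu).
# Duplicates make this trivially separable; real data needs a proper split.
texts = ["great success for our country", "terrible failure and decline",
         "the committee met on tuesday"] * 10
labels = ["positive", "negative", "neutral"] * 10

vec = TfidfVectorizer(max_features=10000, stop_words="english")
X = vec.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42, stratify=labels)

reports = {}
for model in (MultinomialNB(),
              LogisticRegression(class_weight="balanced", max_iter=1000)):
    model.fit(X_train, y_train)
    reports[type(model).__name__] = classification_report(
        y_test, model.predict(X_test), zero_division=0)
    print(reports[type(model).__name__])
```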
14 Question
Comment on which model is better. Discuss the F1 score, recall, and accuracy.
15 Question
Use the better machine learning model to classify the remaining approximately 28,000 speeches.
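The key step is pushing the unlabelled speeches through the *same* fitted vectorizer before predicting; a minimal sketch with logistic regression standing in for whichever model scored better:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Labelled sample (stand-in for the 1,000 OpenAI-labelled speeches).
train_texts = ["a great and welcome success", "a terrible and damaging failure",
               "the session opened at noon"] * 4
train_labels = ["positive", "negative", "neutral"] * 4

vec = TfidfVectorizer(stop_words="english")
X_train = vec.fit_transform(train_texts)

clf = LogisticRegression(class_weight="balanced",
                         max_iter=1000).fit(X_train, train_labels)

# The remaining ~28,000 speeches must use transform (not fit_transform)
# so they are projected onto the vocabulary learned from the labelled set.
remaining = ["what a welcome success for everyone", "a damaging failure again"]
preds = clf.predict(vec.transform(remaining))
```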
16 Question
Create a barplot to show how many speeches are positive, negative, and neutral.
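Once every speech has a predicted sentiment, `value_counts` feeds the barplot directly; a sketch on toy predictions (the Agg backend just lets it run without a display):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Stand-in for the predicted sentiment of every speech.
sentiments = pd.Series(["positive", "negative", "neutral",
                        "positive", "neutral", "positive"])

counts = sentiments.value_counts()
ax = counts.plot(kind="bar")
ax.set_ylabel("Number of speeches")
plt.tight_layout()
```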
17 Question
Also create a time trend showing the number of positive, negative, and neutral speeches over time. Your graph should look like below (approximately).
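For the time trend, a year-by-sentiment cross-tabulation gives one series per sentiment; sketched on toy data (calling `.plot()` on the result draws the lines):

```python
import pandas as pd

# Stand-in: one row per speech with its year and predicted sentiment.
df = pd.DataFrame({
    "year":      [1990, 1990, 1991, 1991, 1991, 1992],
    "sentiment": ["positive", "negative", "neutral",
                  "positive", "positive", "negative"],
})

# One column per sentiment, one row per year.
trend = pd.crosstab(df["year"], df["sentiment"])
```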