Assignment 4

Author

Your Name

Published

September 13, 2024

Instructions

Complete the questions by adding your code under the relevant chunks.

Please submit your assignment to me via email as an HTML file. Please name your file “yourlastname_assignment4.html”. For example, I would submit something called “popescu_assignment4.html”. Please make sure that it is all in lower case.

To obtain the same look as this template, make sure that your assignment preamble (YAML) looks like below:

---
title: "Assignment 4"
author: "Your Name"
date: "September 13, 2024"
format:
  html:
    toc: true
    number-sections: true
    colorlinks: true
    smooth-scroll: true
    embed-resources: true
---

Note that headers such as “Question” need to be marked with # (i.e., # Question will appear as a header in bold and larger font after you render the document, and will thus appear in the table of contents). Regular text can just be written on the line underneath.

Download the following British parliamentary speeches dataset and load it into your computer’s memory.

1 Question

Calculate and plot the average Flesch Reading Ease Score for the Conservative and the Labour Party in the UK. Your graph should look like below. Has political speech become easier or harder to understand over time?
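A minimal sketch of one way to approach this, assuming the dataset has `text`, `party`, and `year` columns (adjust the names to the actual dataset). The syllable counter here is a rough self-contained heuristic; in practice a library such as `textstat` gives the standard Flesch Reading Ease score.

```python
import re
import pandas as pd

def flesch_reading_ease(text):
    """Rough Flesch Reading Ease using a heuristic syllable count."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

# Toy frame standing in for the real speeches dataset.
speeches = pd.DataFrame({
    "party": ["Conservative", "Labour", "Conservative", "Labour"],
    "year": [1990, 1990, 1991, 1991],
    "text": ["Tax cuts help. Growth will come.",
             "We stand with working people.",
             "Fiscal consolidation necessitates considerable prudence.",
             "Jobs first. Homes next."],
})
speeches["flesch"] = speeches["text"].apply(flesch_reading_ease)
avg = speeches.groupby(["year", "party"])["flesch"].mean().unstack("party")
# avg.plot() draws one line per party across years.
```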

2 Question

Jeremy Corbyn is a politician who has delivered speeches from 1993 all the way to 2020. Make a time trend of how Corbyn’s Reading Ease Score evolved over time. Make a 3-year rolling average. Your graph should look like below. Is Corbyn easier or harder to understand as we approach the present time?
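The rolling average itself is plain pandas; a sketch assuming you have already computed one average Flesch score per year for Corbyn (the numbers below are hypothetical):

```python
import pandas as pd

# Hypothetical yearly averages of Corbyn's Flesch scores.
yearly = pd.Series(
    [60.0, 62.0, 58.0, 64.0, 61.0],
    index=pd.Index([1993, 1994, 1995, 1996, 1997], name="year"),
)
# min_periods=1 keeps the first two years instead of producing NaN.
rolling = yearly.rolling(window=3, min_periods=1).mean()
# rolling.plot() draws the smoothed time trend.
```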

3 Question

Identify and print out Corbyn’s easiest and hardest speech to comprehend.
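With a per-speech score column, `idxmax`/`idxmin` locate the extremes; a sketch on hypothetical toy data:

```python
import pandas as pd

# Toy frame: one Flesch score per Corbyn speech (scores are hypothetical).
corbyn = pd.DataFrame({
    "text": ["Short and plain.",
             "A labyrinthine disquisition on parliamentary procedure.",
             "A speech of middling difficulty."],
    "flesch": [85.0, 20.0, 55.0],
})
easiest = corbyn.loc[corbyn["flesch"].idxmax(), "text"]   # highest score
hardest = corbyn.loc[corbyn["flesch"].idxmin(), "text"]   # lowest score
print(easiest)
print(hardest)
```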

4 Question

The topic of immigration is current in political debates. From all the speeches in your dataset, select only those that contain the following keywords: “immigration”, “migrant”, “migrants”, “refugee”, “refugees”, “asylum”, “deportation”, “visa”, “immigrant”, “illegal immigrant”. How many speeches contain those words?
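One sketch of the keyword filter, on a toy frame whose column names are assumptions. Word boundaries (`\b`) keep short keywords like “migrant” from matching inside unrelated words, and `case=False` makes the match case-insensitive:

```python
import re
import pandas as pd

keywords = ["immigration", "migrant", "migrants", "refugee", "refugees",
            "asylum", "deportation", "visa", "immigrant", "illegal immigrant"]
# \b word boundaries stop e.g. "migrant" matching inside other words.
pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"

# Toy frame; the real dataset's text column name may differ.
speeches = pd.DataFrame({"text": [
    "We must reform immigration policy.",
    "The budget deficit is growing.",
    "Refugees deserve protection.",
]})
mask = speeches["text"].str.contains(pattern, case=False, regex=True)
speeches_immigration = speeches[mask]
n_matching = int(mask.sum())
```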

5 Question

Create a 3-year rolling average showing the percentage of speeches that contain those keywords related to immigration over time. Your graph should look like below.
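A sketch of the share-then-smooth step, assuming you have already flagged each speech as immigration-related (the flag column is hypothetical):

```python
import pandas as pd

# Toy flags standing in for "speech mentions an immigration keyword".
df = pd.DataFrame({
    "year": [1990, 1990, 1991, 1991, 1992, 1992],
    "about_immigration": [True, False, True, True, False, False],
})
share = df.groupby("year")["about_immigration"].mean() * 100   # % per year
rolling_share = share.rolling(window=3, min_periods=1).mean()  # 3-year smooth
# rolling_share.plot() draws the smoothed trend.
```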

6 Question

For the speeches identified as being about immigration, perform LDA. As part of the text-cleaning procedure, please also remove the following words: “honourable”, “member”, “government”, “minister”, “gentleman”, “lady”, “secretary”, “right”, “friend”.

Your graph should look like below (approximately).

7 Question

What is the right number of topics in LDA here? Show coherence scores for your model for up to 20 topics. Your graph should look like this (approximately).

8 Question

Use your trained lda_model and corpus to extract the topic distribution for each speech. (1) Create a DataFrame with one column per topic, storing the probability that each speech belongs to that topic. (2) Merge this topic distribution with your speeches_immigration DataFrame. (3) For each topic, find the speech with the highest probability. (4) Store these top speeches in a new DataFrame and map each one to its top 10 terms using lda_model.show_topic. (5) Extract and display the full text and speaker name of the top speech for topic 0, along with the associated topic keywords.

9 Question

Perform BERT-based topic modelling on the same immigration speeches. The output should look like below:

10 Question

Which topic does the following speech belong to? Print the main keywords associated with that topic.

Tip

By implication, surely we are talking about a proposed location. The Government will not go to the expense of building an accommodation centre before they ask the independent monitor whether it meets the needs of asylum seekers. Clearly, the Government will have to ask the independent monitor to express that view on a proposed location before an accommodation centre is built. Otherwise, there will be considerable nugatory expenditure.

11 Question

Select a random sample of 1,000 speeches from all the speeches.
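A sketch on a toy frame; `random_state` pins the draw so the same 1,000 speeches come back on every rerun:

```python
import pandas as pd

# Toy frame standing in for the full corpus of speeches.
speeches = pd.DataFrame({"text": [f"speech {i}" for i in range(5000)]})
# random_state makes the sample reproducible across reruns.
sample = speeches.sample(n=1000, random_state=42)
```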

12 Question

Use the OpenAI API to classify the sentiment of those 1,000 speeches using chain-of-thought reasoning. The possible outcomes should be positive, negative, and neutral.

To get full points, you need to:

  1. Include the full prompt

  2. Create a barplot indicating how many speeches are neutral, positive, and negative. The results don’t need to be exactly the same. See the example below.
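A sketch of the prompt side, with the chain-of-thought steps spelled out. The client call and model name are assumptions based on the current OpenAI Python SDK and are left commented out so the snippet runs without an API key:

```python
# Chain-of-thought sentiment prompt; adapt the reasoning steps as you like.
PROMPT_TEMPLATE = """You are a political-speech sentiment classifier.

Speech:
\"\"\"{speech}\"\"\"

Think step by step:
1. Summarise the speech's main claim in one sentence.
2. List words or phrases that carry positive or negative tone.
3. Weigh them against each other.

Then answer with exactly one word: positive, negative, or neutral."""

def build_prompt(speech: str) -> str:
    return PROMPT_TEMPLATE.format(speech=speech)

# Assumed SDK usage (requires an API key; model name is an assumption):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": build_prompt(text)}],
# )
# label = resp.choices[0].message.content.strip().lower()
```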

13 Question

Use two machine learning algorithms (multinomial naive Bayes and logistic regression) and print the classification reports. They should look like below (note: they don’t need to be exactly the same).

TfidfVectorizer(max_features=10000, stop_words='english')
MultinomialNB()
              precision    recall  f1-score   support

    negative       0.56      0.69      0.62        58
     neutral       0.68      0.35      0.46        49
    positive       0.63      0.70      0.66        93

    accuracy                           0.61       200
   macro avg       0.62      0.58      0.58       200
weighted avg       0.62      0.61      0.60       200
LogisticRegression(class_weight='balanced', max_iter=1000)
              precision    recall  f1-score   support

    negative       0.54      0.55      0.55        58
     neutral       0.48      0.55      0.51        49
    positive       0.73      0.67      0.70        93

    accuracy                           0.60       200
   macro avg       0.58      0.59      0.59       200
weighted avg       0.61      0.60      0.61       200
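The reports above can be reproduced in outline with scikit-learn; a sketch on toy labelled data standing in for the 1,000 GPT-labelled speeches:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy labelled data standing in for the GPT-labelled sample.
texts = ["great success for workers", "terrible failure of policy",
         "the committee met on tuesday", "a wonderful achievement",
         "a disastrous outcome", "the session was adjourned"] * 20
labels = ["positive", "negative", "neutral"] * 40

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

vec = TfidfVectorizer(max_features=10000, stop_words="english")
X_train_tf = vec.fit_transform(X_train)
X_test_tf = vec.transform(X_test)

nb = MultinomialNB().fit(X_train_tf, y_train)
lr = LogisticRegression(class_weight="balanced",
                        max_iter=1000).fit(X_train_tf, y_train)

print(classification_report(y_test, nb.predict(X_test_tf)))
print(classification_report(y_test, lr.predict(X_test_tf)))
```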

14 Question

Comment on which model is better. Discuss the F1 score, recall, and accuracy.

15 Question

Use the better machine learning model to classify all the other approx. 28,000 speeches.
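A sketch of the key mechanic: the vectorizer fitted on the labelled data must be reused on the unlabelled speeches with `transform`, not `fit_transform`, or the feature columns will not line up with what the model was trained on:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labelled sample (stands in for the 1,000 GPT-labelled speeches).
labelled = ["a great success", "a terrible failure", "the meeting was held"]
labels = ["positive", "negative", "neutral"]

vec = TfidfVectorizer()
model = LogisticRegression(max_iter=1000).fit(vec.fit_transform(labelled), labels)

# Reuse the SAME fitted vectorizer on the unlabelled speeches.
remaining = ["a great outcome", "a terrible mistake"]
predictions = model.predict(vec.transform(remaining))
```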

16 Question

Create a barplot showing how many speeches are positive, negative, and neutral.
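A sketch with `value_counts`; the predictions below are hypothetical, and the Agg backend keeps the plot headless-safe:

```python
import matplotlib
matplotlib.use("Agg")                    # headless-safe backend
import matplotlib.pyplot as plt
import pandas as pd

# Toy predictions standing in for the classified speeches.
preds = pd.Series(["positive", "negative", "neutral",
                   "positive", "neutral", "positive"])
counts = preds.value_counts()
counts.plot(kind="bar")
plt.xlabel("sentiment")
plt.ylabel("number of speeches")
plt.tight_layout()
```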

17 Question

Also create a time trend showing the number of positive, negative, and neutral speeches over time. Your graph should look like below (approximately).
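`pd.crosstab` gives one count column per sentiment class, ready to plot; a toy sketch with hypothetical labels:

```python
import pandas as pd

# Toy labels standing in for the classified speeches.
df = pd.DataFrame({
    "year": [1990, 1990, 1991, 1991, 1991, 1992],
    "sentiment": ["positive", "negative", "neutral",
                  "positive", "positive", "negative"],
})
trend = pd.crosstab(df["year"], df["sentiment"])
# trend.plot() draws one line per sentiment class over time.
```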