L: Designing Projects using Text Analysis

Bogdan G. Popescu

John Cabot University

Agenda

Tips for Final Project

  • how to find a topic
  • should you start with data or with a question?
  • where to find data

Essentials of Scraping Websites

Ideas for Projects

Intro

The final project is your chance to shine — to apply everything you’ve learned in this course and explore a text-based research question that excites you.

This is your opportunity to move from learning methods to actually doing research with them.

  • Use at least one Text-as-Data method studied in this course
  • Formulate and answer a clear research question
  • Demonstrate your ability to use and interpret the method

Criteria for Grading

  • Clear and Answerable question
  • Description of the selected method
  • Description of the Dataset used in the analysis
  • Implementation of the chosen method
  • Accurate interpretation of the method output
  • Attempts to validate the approach
  • Code Quality
  • Additional points for originality and ambition

How to Find a Question

Read papers that use text analysis
Think about what excites you in text analysis
You can start with a question or a dataset

What Should Come First?

Questions or Data? Which one comes first?

Both approaches are valid — here’s a quick comparison:

Start with a Question

Advantages

  • Likely to lead to more interesting questions
  • Less time dedicated to exploring different methods
  • The question will guide the analysis

Disadvantages

  • You will likely spend a lot of time collecting data
  • Data may not exist in an easily accessible format

Start with Data

Advantages

  • You will realize soon if there are any findings worth discussing
  • You won’t spend extensive amounts of time trying to find data

Disadvantages

  • The research questions may be less interesting

How to Design a Strong Project

  1. Explain the concept that you are trying to measure.
    • are you measuring aggression, hatred, use of scientific facts?
    • is the way you measure the concept new?
  2. Validate your measure.
    • hand-code a few documents to show that the quantitative text measure is similar to the way a human would code the text
    • show some validity checks: does the measure correlate with events in a way that makes sense?
  3. Use your measure to show variation (over time, by geography, etc.).
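
The hand-coding check in step 2 can be made concrete with a small agreement calculation. The sketch below is only an illustration: the labels are invented, the function names are hypothetical, and Cohen's kappa is one common agreement statistic among several that could be used.

```python
from collections import Counter


def percent_agreement(human, machine):
    """Share of documents where the two codings match."""
    if len(human) != len(machine) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(h == m for h, m in zip(human, machine))
    return matches / len(human)


def cohens_kappa(human, machine):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = percent_agreement(human, machine)
    human_counts = Counter(human)
    machine_counts = Counter(machine)
    labels = set(human_counts) | set(machine_counts)
    # Chance agreement: probability both coders pick the same label at random
    expected = sum(human_counts[lab] * machine_counts[lab] for lab in labels) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented codes for ten documents: 1 = hostile, 0 = not hostile
hand_coded = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
text_measure = [1, 0, 0, 1, 0, 0, 0, 1, 1, 0]

agreement = percent_agreement(hand_coded, text_measure)
kappa = cohens_kappa(hand_coded, text_measure)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

In a real project, `hand_coded` would hold your own labels for a random sample of documents and `text_measure` the labels produced by your method for the same documents.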

How to Find a Research Question

You can pick up on research ideas from academic articles.

  • you can pick up techniques, especially if you examine the replication files
  • however, your questions may not be the most original

Journals

  • American Journal of Political Science
  • American Political Science Review
  • Journal of Politics
  • British Journal of Political Science

Magazines

  • The Economist
  • The Financial Times
  • Podcasts

Existing Datasets

Example: Harvard Dataverse

The Harvard Dataverse is a data and code repository for many social science journals.

This is an excellent source of data for your projects!

Finding the right files inside a repository can take some time.

Even better, you can download the entire dataset and read the “Read Me” file.
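
Downloading the entire dataset can also be scripted. The sketch below assumes the Dataverse data access API offers a "download all files as a zip" endpoint addressed by DOI, which is my understanding of the public API; check the Dataverse API guide before relying on it. The helper names are hypothetical, and the DOI is one of the replication DOIs listed later in these slides.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

DATAVERSE = "https://dataverse.harvard.edu"


def dataset_zip_url(doi, base=DATAVERSE):
    """Build the URL that bundles all files of a dataset into one zip."""
    query = urlencode({"persistentId": f"doi:{doi}"})
    return f"{base}/api/access/dataset/:persistentId/?{query}"


def download_dataset(doi, target="replication_files.zip"):
    """Download the zip archive (needs internet; very large datasets may be refused)."""
    path, _headers = urlretrieve(dataset_zip_url(doi), target)
    return path


url = dataset_zip_url("10.7910/DVN/PPSFLT")
print(url)

# To actually download the files, uncomment the next line:
# download_dataset("10.7910/DVN/PPSFLT")
```

Some datasets contain restricted files or are too large to bundle, in which case downloading individual files from the web page is the fallback.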

Existing Datasets

Example: Kaggle

Kaggle is a platform with a wide variety of text datasets.

Some of its datasets could be useful to people studying the social sciences.

Be careful: always identify the source of the data and all other relevant information.

Scraping Data

Newspaper3k is an article scraping and curation library for Python.

This is how we can use it:

# pip install newspaper3k lxml_html_clean
from newspaper import Article

# URL of the article you want to scrape
url = "https://www.bbc.com/news/articles/cr52yrgq48no"

# Create an Article object with the given URL and language (e.g., 'en' for English)
toi_article = Article(url, language="en")

# To download the article
toi_article.download()

# To parse the article (i.e., extract the content)
toi_article.parse()

This is the result:

# To extract the article's full text
print(toi_article.text)
Five takeaways from leaked US top military chat group

6 days ago Share Save Paulin Kola BBC News Share Save

Watch: Key reactions to reports of a leaked group chat involving Trump officials

Washington DC is still digesting a serious security breach at the heart of the Trump administration. It's the story of how a journalist - the Atlantic magazine's Jeffrey Goldberg - was added to a Signal platform messaging group which apparently included Vice-President JD Vance and Defence Secretary Pete Hegseth, in addition to National Security Adviser Mike Waltz. The topic being discussed was attacking the Iran-backed Houthi group in Yemen. Goldberg said he had seen classified military plans for the strikes, including weapons packages, targets and timing, two hours before the bombs struck. What are the main revelations in a nutshell?

EPA Trump and his top aides have consistently raised concerns about footing the bill for European defence

Vance questions Trump's thinking

On the military action, Goldberg reported that the account named JD Vance wrote: "I think we are making a mistake." The vice-president said targeting Houthi forces that are attacking vessels in the Suez Canal serves European interests more than the US, because Europe has more trade running through the canal. Vance added that his boss was perhaps unaware of how US action could help Europe. "I am not sure the president is aware how inconsistent this is with his message on Europe right now," Vance said. "There's a further risk that we see moderate to severe spike in oil prices." The vice-president went on to say, according to Goldberg, he would support the consensus but would prefer to delay it by a month. US launches wave of air strikes on Yemen's Houthis Goldberg reported in his article that spokesman for JD Vance had later sent him a statement underlining that Trump and Vance had had "subsequent conversations about this matter and are in complete agreement". Since coming to power, Trump has castigated his European Nato allies, urged them to increase defence spending and generally insisted that Europe needs to take responsibility for protecting its own interests.

Blame for 'free-loading' Europe

Arguments over why the US could - and should - carry out the military strike against the Houthis did not sway Vance. He said to the defence secretary, "If you think we should do it let's go. I just hate bailing Europe out again." Hegseth reciprocated: "I fully share your loathing of European free-loading. It's PATHETIC." A group member, only identified as "SM" suggested that after the strike, the US should "make clear to Egypt and Europe what we expect in return". "If Europe doesn't remunerate, then what?" he asked. "If the US successfully restores freedom of navigation at great cost there needs to be some further economic gain extracted in return," the user continues.

After the strike: Emojis and prayers

According to Goldberg, the US national security chief posted three emojis after the strike: "a fist, an American flag, and fire". The Middle East special envoy, Steve Witkoff, responded with five emojis, Goldberg said: "two hands-praying, a flexed bicep, and two American flags". Secretary of State Marco Rubio and White House chief of staff Susie Wiles voiced messages of support, he said. "I will say a prayer for victory," Vance said as updates on the strikes were given. Two others members added prayer emojis, Goldberg reported.

Controlling the message: Blame Biden

To Vance's concerns that the action may be seen as going against Trump's message on Europe, the US defence secretary wrote: "VP: I understand your concerns – and fully support you raising w/ POTUS [Trump]. Important considerations, most of which are tough to know how they play out (economy, Ukraine peace, Gaza, etc). "I think messaging is going to be tough no matter what – nobody knows who the Houthis are – which is why we would need to stay focused on: 1) Biden failed & 2) Iran funded." The Trump administration has consistently blamed Joe Biden for being too lenient with Iran.

Watch: President Trump says he knows 'nothing' about journalist in Houthi strike group chat

Waltz in the spotlight
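
Once the text is extracted, a quick word count is a cheap sanity check before applying any method from the course. The sketch below uses only the standard library; the stopword list is a tiny hypothetical one, the sample sentences are invented, and `top_words` is an illustrative helper, not part of Newspaper3k.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real project would use a fuller one
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "that", "it", "on", "for"}


def top_words(text, n=5, stopwords=STOPWORDS):
    """Return the n most frequent lowercase words, ignoring stopwords."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


# Invented sample text standing in for a scraped article
sample = (
    "The minister said the budget helps farmers. "
    "The opposition said the budget hurts farmers."
)
print(top_words(sample, n=3))
```

With a scraped article, the call would be `top_words(toi_article.text)`. Scraped text often contains page residue such as sharing buttons or captions, so frequent but meaningless tokens are a hint that some cleaning is needed.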

Suppose you want to scrape multiple articles:

url1 = "https://www.bbc.com/news/articles/cr52yrgq48no"
url2 = "https://www.bbc.com/news/articles/cvgwjllld1ro"
url3 = "https://www.bbc.com/news/articles/c93n05z48ldo"

We would loop over a list of URLs and collect the results in a data frame:

import pandas as pd
from newspaper import Article

# List of article URLs
urls = [
    "https://www.bbc.com/news/articles/cr52yrgq48no",
    "https://www.bbc.com/news/articles/cvgwjllld1ro",
    "https://www.bbc.com/news/articles/c93n05z48ldo"]
    
# List to hold article dictionaries
articles_data = []

# Loop through the URLs and process each article
for url in urls:
    article = Article(url, language="en")
    article.download()
    article.parse()
    article.nlp()  # Optional, provides .summary and .keywords (needs NLTK's 'punkt' tokenizer data)

    article_dict = {
        "url": url,
        "title": article.title,
        "summary": article.summary,
        "keywords": article.keywords,
        "text": article.text,
        "date": article.publish_date
    }
    articles_data.append(article_dict)
    
articles_df = pd.DataFrame(articles_data)

The resulting data frame has one row per article. The row for the first URL, shown field by field (the text field is abbreviated with [...] because it repeats the article text printed earlier):

url: https://www.bbc.com/news/articles/cr52yrgq48no
title: Signal war plans chat: Five takeaways from leaked US top military meeting
summary: Goldberg said he had seen classified military plans for the strikes, including weapons packages, targets and timing, two hours before the bombs struck.\nBlame for 'free-loading' EuropeArguments over why the US could - and should - carry out the military strike against the Houthis did not sway Vance.\nThe Middle East special envoy, Steve Witkoff, responded with five emojis, Goldberg said: "two hands-praying, a flexed bicep, and two American flags".\nTwo others members added prayer emojis, Goldberg reported.\nWatch: President Trump says he knows 'nothing' about journalist in Houthi strike group chatWaltz in the spotlight
keywords: [chat, vance, group, secretary, military, european, trump, meeting, europe, takeaways, war, plans, strike, leaked, emojis, signal, goldberg]
text: Five takeaways from leaked US top military chat group\n\n6 days ago Share Save Paulin Kola BBC News Share Save\n\n[...]\n\nWaltz in the spotlight
date: None
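
When scraping many articles, a single failed download would otherwise stop the whole loop. The sketch below shows one way to guard against that: it pauses between requests and records failures instead of crashing. `scrape_many` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the download-and-parse step so the example runs without internet access.

```python
import time


def scrape_many(urls, fetch, delay=1.0):
    """Fetch each URL, recording failures instead of stopping the loop."""
    rows, failures = [], []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # be polite: pause between requests
        try:
            rows.append(fetch(url))
        except Exception as err:
            failures.append({"url": url, "error": str(err)})
    return rows, failures


def fake_fetch(url):
    """Stand-in for the real download/parse step; fails on one URL on purpose."""
    if url.endswith("broken"):
        raise ValueError("download failed")
    return {"url": url, "title": "placeholder title", "text": "placeholder text"}


rows, failures = scrape_many(
    ["https://example.com/a", "https://example.com/broken", "https://example.com/b"],
    fake_fetch,
    delay=0,
)
print(len(rows), "scraped;", len(failures), "failed")
```

In real use, `fetch` would be a function that creates the Article object, calls download() and parse(), and returns the dictionary of fields; `rows` can then be passed to pd.DataFrame, and `failures` tells you which URLs to retry.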

Applications of Quantitative Text Analysis

  1. When did Western Political Thought start diverging from Islamic political thought?

https://doi.org/10.7910/DVN/CV9AYE

  • The replication files contain all 46 books, already split into 9,838 sections
  • These are Islamic and Christian books
  • The authors identify how topics in these two traditions evolve from 600 to 1600

Some interesting project ideas based on the same article:

  • How do different eras describe the qualities of a good military leader?
  • How is the metaphor of family or marriage used to talk about ruling?
  • What behaviors are praised or condemned in different political systems?

Applications of Quantitative Text Analysis

  1. Do men and women debate differently?

https://doi.org/10.7910/DVN/PPSFLT

  • The replication files contain all 1,614,634 speeches from 1992 to 2019
  • These are Parliamentary Speeches in the UK
  • The authors identify how rhetorical styles evolve over time for men and women

Some interesting project ideas based on the same article:

  • Do Labour and Conservative MPs exhibit different stylistic trends over time, and how do those intersect with gender?
  • Does being in a leadership role (e.g. frontbench, committee chair) affect the stylistic choices of MPs? Is this effect gendered?
  • Are some styles more prominent in debates on certain topics (e.g., defense, health, education)? Does gender affect this?
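
Ideas like these come down to computing a text measure per speech and comparing its average across groups and years. The sketch below does this with a simple dictionary measure. Everything in it is invented for illustration: the term list, the speeches, and the field names `year`, `group`, and `text` are assumptions, not the structure of the replication files.

```python
import re
from collections import defaultdict

# Hypothetical dictionary of "aggressive" debate terms (illustration only)
AGGRESSIVE_TERMS = {"shameful", "disgrace", "nonsense", "failure"}


def term_share(text, terms):
    """Share of tokens in `text` that belong to `terms`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in terms for t in tokens) / len(tokens)


def mean_by_group(records, terms):
    """Average term share for each (year, group) pair."""
    scores = defaultdict(list)
    for rec in records:
        scores[(rec["year"], rec["group"])].append(term_share(rec["text"], terms))
    return {key: sum(vals) / len(vals) for key, vals in sorted(scores.items())}


# Invented speeches
speeches = [
    {"year": 1995, "group": "women", "text": "This policy is a failure"},
    {"year": 1995, "group": "men", "text": "What a disgrace and what nonsense"},
    {"year": 2015, "group": "women", "text": "We welcome this sensible proposal"},
    {"year": 2015, "group": "men", "text": "The plan is shameful"},
]

averages = mean_by_group(speeches, AGGRESSIVE_TERMS)
for (year, group), share in averages.items():
    print(year, group, round(share, 3))
```

Replacing `group` with party, leadership role, or debate topic gives the other comparisons listed above; the measure itself would need to be validated before interpreting any differences.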

Applications of Quantitative Text Analysis

  1. How does bureaucratic reputation vary over time in the UK and the US?

https://doi.org/10.7910/DVN/KL36TP

  • The replication files contain 2,528,833 politician speeches in the US and 2,370,131 in the UK
  • They span a period of 40 years

Some interesting project ideas based on the same article:

  • Do independent agencies enjoy more stable or higher reputation than executive departments?
  • Do similar types of agencies (e.g., central banks) have similar reputations across countries? Compare the Fed and Bank of England.
  • Do agencies with greater formal independence (or insulation from politics) have more stable or polarized reputations?

Applications of Quantitative Text Analysis

  1. How do incumbents eligible for reelection focus on policies that have high yield in Mexico?

https://doi.org/10.7910/DVN/NOMC0H

  • The replication files contain 6,890 legislative speeches from Mexico
  • They span from 2011 to 2018

Some interesting project ideas based on the same article:

  • Do male and female legislators differ in their attention to particularistic legislation?
  • Does the day of the week or proximity to an election affect legislative focus on particularistic topics?
  • Are legislators more likely to focus on physical infrastructure (roads, bridges) or social programs (scholarships, aid) when facing reelection?

Ready to Get Started?

Think about:

  • What social or political issue fascinates you?
  • What dataset might help you explore it?
  • What method would give you insight?

Let’s build something cool.

  • Come and discuss your ideas with me next week (week 11)