L1: Introduction

Bogdan G. Popescu

John Cabot University

Introduction

Learning Outcomes

Upon successful completion of this course, the students will be able to:

  • Write Python programs that use loops, conditional statements, and function definitions.
  • Employ quantitative techniques to process and analyze textual data.
  • Utilize Python libraries like NumPy, pandas, and NLTK for text manipulation and analysis.
  • Apply advanced methods such as sentiment analysis, topic modeling, word embeddings, and supervised learning to text data.
  • Use ChatGPT and prompt engineering to enhance text analysis tasks, including summarization, classification, and generating structured outputs.
  • Create and publish a professional website showcasing their portfolio and analytical capabilities.

Jobs where these skills are valued

Data Science and Analytics

  • Data Scientist: Developing predictive models and conducting advanced analysis with text data.
  • Text/Language Data Analyst: Extracting insights from text data for business or research.

Natural Language Processing (NLP) and AI

  • NLP Engineer: Working on synthesizing customer reviews, language translation, and sentiment analysis.
  • AI Prompt Engineer: Designing and optimizing prompts for large language models like ChatGPT.

Research and Academia

  • Research Scientist: Conducting text-based research in political science, sociology, or computational linguistics.
  • Academic/Teaching Positions: Teaching Python and text analysis at universities or boot camps.

Communication and Technical Writing

  • Technical Writer: Explaining complex computational methods clearly and in a structured manner.
  • Science Communicator: Building accessible content from technical analyses.

Business and Consulting

  • Business Analyst: Using Python and text analysis to drive data-driven decision-making.
  • Consultant: Advising clients on leveraging data and text-based insights.

Logistics

  • Hours: Mondays and Wednesdays, 10:00-11:15 AM
  • Room: Garibaldi Computer Lab
  • Office Hours: by appointment
  • You will use your own laptop (Mac or Windows), ideally with 8 GB of RAM

Evaluation

Grading

You will be graded on four problem sets during the semester (each worth 12.5% of your grade) and on a final report and presentation (each worth 25%).

  • 4 problem sets: 50% of the final grade (12.5% each)
  • final presentation: 25% of the final grade
  • final project report: 25% of the final grade

Initial Individual Submission

  • This component contributes 50% of the overall grade for the problem set.
  • You complete the problem set independently and submit it to the instructor; your grade for this component is based on the quality of that independent work.

Final Submission After Group Consultation

  • This component also contributes 50% of the overall grade for the problem set.
  • After discussing the problem set with your group members and documenting the correct answers, you will submit this revised version.
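
As a worked example of the weighting, the sketch below combines the two problem-set components with the course-level weights. Only the percentages come from the syllabus; the function names and all scores are made up for illustration.

```python
def problem_set_grade(individual, group):
    """One problem set: 50% individual submission, 50% post-consultation submission."""
    return 0.5 * individual + 0.5 * group


def final_grade(problem_sets, presentation, report):
    """Four problem sets at 12.5% each, presentation 25%, report 25%."""
    if len(problem_sets) != 4:
        raise ValueError("expected exactly four problem sets")
    ps_part = sum(0.125 * problem_set_grade(ind, grp) for ind, grp in problem_sets)
    return ps_part + 0.25 * presentation + 0.25 * report


# Hypothetical scores on a 0-100 scale: (individual, group) per problem set.
scores = [(80, 90), (70, 90), (85, 95), (90, 100)]
print(final_grade(scores, presentation=88, report=92))
```

With these invented numbers, the problem sets average 87.5 and contribute half of the final grade.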

Final Project

You will undertake a text analysis project emphasizing the practical application of text analysis techniques using Python programming.

The project entails a few steps:

  • choose a topic that involves text data
  • acquire data either from the course materials or from external sources
  • employ text analysis procedures in Python
  • craft a well-structured two-page report containing an introduction to the problem, objectives, data sources, methodology, results, and conclusion
  • include an appendix with the Python code for the text analysis procedures

Contents of the Course

The contents of the course can be found at:
https://bgpopescu.net/files/teaching/text_analysis/text_analysis_syllabus.html

What is Python? What is R?

Python and R are versatile programming languages widely used for general-purpose programming, data analysis, and machine learning.

They are both open-source ecosystems (i.e., they are free, and anyone can contribute to their development).

They are compatible with Windows, Mac, and Linux.

We will use Python for text analysis tasks and R for data visualization.

Libraries

A vast array of libraries exists, enabling you to perform tasks quickly, such as:

  • Clean and process data: Libraries like pandas and numpy allow for efficient data manipulation and analysis.
  • Visualize data: Libraries such as matplotlib, seaborn, and plotly create good visualizations.
  • We will, however, use ggplot2 in R to create visualizations.
  • Typeset and create visually appealing articles and presentations: Tools like Quarto will help us create rich, interactive documents and presentations

Companies that use Python and R

Examples of companies that use Python and R include:

  • Airbnb
  • Microsoft
  • Uber
  • Meta
  • Google

Relevant Books

  • General Books on Python
  • Text Analysis

Text Analysis

Language is at the core of nearly all forms of social interaction:

  • Laws are written and codified.
  • Political debates shape public opinion.
  • Historical events are recorded and remembered.
  • People communicate ideas, emotions, and stories.

However, until recently, analyzing these interactions at scale was nearly impossible.

Today, thanks to the digitization of vast textual archives (books, articles, comments, and more), we have access to unprecedented amounts of text.

Even better, we now have cutting-edge tools and techniques to make sense of it all.

What is Quantitative Text Analysis?

Quantitative text analysis involves converting text into numbers, assigning values to words and documents, and uncovering hidden meanings.
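
To make "converting text into numbers" concrete, here is a minimal sketch that turns two made-up sentences into a table of word counts (a document-term matrix). The tokenizer and the example documents are illustrative assumptions, not the course's actual pipeline.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


# Two made-up documents for illustration.
docs = [
    "The economy is growing and the economy is strong.",
    "Taxes are rising while the economy slows.",
]

counts = [Counter(tokenize(d)) for d in docs]
vocabulary = sorted(set().union(*counts))

# Document-term matrix: one row per document, one column per vocabulary word.
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each document is now a row of numbers, which is the starting point for the techniques that follow.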

Our focus will be on identifying latent concepts:

  • We cannot directly observe abstract ideas, like political ideology, aggression, or sentiment.

Instead, we infer them by analyzing patterns in text.

To achieve this, we will develop strategies for assigning meaningful scores to words and documents, enabling us to extract insights from even the largest text datasets.

Examples of Latent Concepts based on Observed Text

Quantitative text analysis can uncover concepts like:

  1. Aggression in political speeches.
  2. Economic themes in news coverage.
  3. Hate speech in online forums.
  4. Ideological leanings in party platforms.

Quantitative Text Analysis Techniques

1. Dictionaries

  • Predefined word lists categorize text (e.g., “positive” vs. “negative” words).
  • Documents are scored based on the presence or absence of these words.
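
A minimal sketch of dictionary scoring, assuming two tiny hand-made word lists. This variant counts how often listed words occur and divides by document length, rather than recording presence or absence only; the word lists and the function name are illustrative.

```python
import re

# Tiny hand-made word lists; real dictionaries are far larger.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}


def dictionary_score(text):
    """Return (positive hits - negative hits) / number of tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    return (pos - neg) / len(tokens)


print(dictionary_score("Great console, I love it"))        # above zero
print(dictionary_score("Terrible battery and poor grip"))  # below zero
```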

2. Supervised Learning

  • Words are weighted based on their usage in labeled data (e.g., spam vs. not spam).
  • Documents are classified by their resemblance to different categories.
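
One simple way to weight words from labeled data is a naive Bayes classifier. The sketch below is a bare-bones version with add-one smoothing, written from scratch for illustration; the training sentences are made up, and the choice of naive Bayes is an assumption rather than a method named in these slides.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_docs):
    """Count word usage per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Pick the label with the highest log-probability (add-one smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            weight = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(weight)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training examples.
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting moved to monday", "not spam"),
    ("see you at lunch on monday", "not spam"),
]
word_counts, doc_counts = train(training)
print(classify("claim your free prize", word_counts, doc_counts))
print(classify("lunch meeting on monday", word_counts, doc_counts))
```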

3. Text Scaling

  • Words are given weights reflecting their patterns of use across contexts.
  • Documents are positioned on scales such as “liberal” to “conservative.”
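
The slides do not name a specific scaling method. As one possibility, the sketch below loosely follows the idea behind Wordscores-style scaling: words inherit scores from reference texts with known positions, and a new document is scored by averaging the scores of its words. The reference texts, the positions (-1 and +1), and the function names are all invented for illustration.

```python
import re
from collections import Counter


def relative_freqs(text):
    """Share of each word among all tokens in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def word_scores(reference_texts):
    """reference_texts: list of (text, known_position) pairs."""
    freqs = [(relative_freqs(text), position) for text, position in reference_texts]
    vocab = set().union(*(f for f, _ in freqs))
    scores = {}
    for word in vocab:
        total = sum(f.get(word, 0.0) for f, _ in freqs)
        # Weight each reference position by how strongly the word is tied to that text.
        scores[word] = sum(f.get(word, 0.0) / total * pos for f, pos in freqs)
    return scores


def scale_document(text, scores):
    """Frequency-weighted average score of the document's scored words."""
    freqs = relative_freqs(text)
    scored = {w: f for w, f in freqs.items() if w in scores}
    total = sum(scored.values())
    if total == 0:
        return None  # no overlap with the reference vocabulary
    return sum(f / total * scores[w] for w, f in scored.items())


# Made-up reference texts: -1 stands for "liberal", +1 for "conservative".
references = [
    ("expand public healthcare and protect workers", -1.0),
    ("cut taxes and protect borders", +1.0),
]
scores = word_scores(references)
print(scale_document("protect workers and expand healthcare", scores))
print(scale_document("cut taxes", scores))
```

Words that appear in both reference texts end up near the middle of the scale, so they pull a document's position less than words unique to one side.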

4. Topic Models

  • Words are grouped into topics based on co-occurrence patterns.
  • Documents are represented as combinations of these topics.
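
The sketch below is not a topic model. It only counts how often pairs of words appear in the same document, which is the kind of co-occurrence signal that topic models build on. The four tokenized documents are made up.

```python
from collections import Counter
from itertools import combinations

# Made-up, already tokenized documents.
docs = [
    ["tax", "budget", "deficit"],
    ["tax", "budget", "spending"],
    ["hospital", "nurse", "doctor"],
    ["hospital", "doctor", "budget"],
]

cooccurrence = Counter()
for tokens in docs:
    # Count each unordered word pair once per document.
    for pair in combinations(sorted(set(tokens)), 2):
        cooccurrence[pair] += 1

for pair, n in cooccurrence.most_common(3):
    print(pair, n)
```

Pairs such as budget/tax and doctor/hospital co-occur most often here, hinting at an "economy" group and a "health" group of words.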

5. Word Embedding Models

  • Words are encoded as vectors that capture their meanings in context.
  • Documents are analyzed by aggregating the vectors of the words they contain.
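
A minimal sketch of aggregating word vectors into document vectors and comparing documents with cosine similarity. The three-dimensional vectors are hand-made for illustration only; real embeddings are learned from large corpora and have many more dimensions.

```python
import math

# Hand-made toy vectors (assumption for illustration).
EMBEDDINGS = {
    "tax":      [0.9, 0.1, 0.0],
    "budget":   [0.8, 0.2, 0.1],
    "doctor":   [0.1, 0.9, 0.0],
    "hospital": [0.0, 0.8, 0.2],
}


def document_vector(tokens):
    """Average the vectors of the known words in a document."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return None
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


economy_doc = document_vector(["tax", "budget"])
health_doc = document_vector(["doctor", "hospital"])
mixed_doc = document_vector(["budget", "tax", "hospital"])

print(cosine(economy_doc, mixed_doc))  # closer to the economy document
print(cosine(health_doc, mixed_doc))
```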

Some Cool Things That You Will Learn

  1. Learning how to use OpenAI's ChatGPT for text analysis

We will learn how to:

  • do prompt engineering: writing prompts to get the language model to do what we want it to do
  • provide targeted summaries of customer reviews for relevant company departments: e.g., summarize customer feedback about product design for the company’s design department
  • determine the sentiment of texts and extract labels, names, locations, and topics
  • translate text and fix grammar mistakes

Example 1: Targeted Summaries

We want more targeted summaries to inform specific departments within companies.

Python
# Prompt template: {review_text_here} is a placeholder to fill in later,
# e.g. with prompt_template2.format(review_text_here=longest_review),
# assuming longest_review is defined and contains text.
prompt_template2 = """
Task: Summarize the e-commerce product review below, delimited by triple backticks, \
to provide feedback to the engineering department responsible for the product's design.

Instructions:

1. Create a concise summary in at most 30 words.
2. Focus on aspects relevant to the product's design, craftsmanship, or engineering.
3. If the review does not mention design or engineering, respond with precisely \
the word "None" (without quotes or additional text).

Review: 
```{review_text_here}```
"""

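A template like this still needs the review text inserted before it is sent to a model. The sketch below shows one way to do that with str.format. The shortened template, the helper name build_prompt, and the sample review are illustrative assumptions, and the call to the language model itself is left out.

```python
FENCE = "`" * 3  # triple backticks, built in code to keep this example readable

# Shortened stand-in for the full template (assumption for illustration).
template = (
    "Task: Summarize the product review below, delimited by triple backticks, "
    "for the engineering department.\n\n"
    "Review:\n"
    "{fence}{review_text_here}{fence}\n"
)


def build_prompt(review_text):
    """Insert one review into the template and return the finished prompt."""
    return template.format(fence=FENCE, review_text_here=review_text.strip())


sample_review = "The controller feels great, but the stand is flimsy."  # made up
prompt = build_prompt(sample_review)
print(prompt)
```

Keeping the template separate from the review text makes it easy to reuse the same instructions for every row of a review dataset.
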
This is the original customer feedback on a PS5.

  • profile_name: J.F. Carroll
  • rating: 5.0 out of 5 stars
  • rating_date: Reviewed in the United States on May 6, 2024
  • title: 5.0 out of 5 stars\nIt's a PS5, everyone wants one, was there any doubt?\n
  • review_text: \nNobody could get their hands on this until just recently. It's a PS5, the craftmanship is awesome.Operates well, system works well with store, everything is easy, this console is a masterpiece and I keep it close. Haven't got into too many games yet. It looks sexy as heck, it looks dumb in ads but in real life it's fantastic looking. And you can buy sexy plates for it as well.Controller feels great. Considering the upgrade controller, but I haven't decided. The game choices are phenomenal. It's official: XBOX lost, buy a Playstation.Playstation all the way.\n
  • word_count: 95

This is what we get: the same row of customer feedback with a new targeted_summary column.

  • targeted_summary: The review praises the PS5's craftsmanship, ease of use, and aesthetics. Positive feedback on controller and game choices. No negative comments on design or engineering.

Example 2: Fixing Grammar Mistakes

We can also use ChatGPT to proofread and fix grammar mistakes.

Python
text_to_fix = """
This is a mock short paragraf to see how ChatGPT performs when it comes to
grammer and spelling. These is a wrong sentence. Speling is of.
"""

Here is the prompt:

Python
prompt = f"""
Proofread and correct the following text, which
is delimited with triple backticks.

Rewrite the corrected version in a simple string.

If you don't find any errors, return the text unchanged.

```{text_to_fix}```
"""

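And this is the output, shown as corrections to the original text (original → corrected):

  • paragraf → paragraph
  • grammer → grammar
  • These → There
  • Speling → Spelling
  • of. → off.

Corrected text: This is a mock short paragraph to see how ChatGPT performs when it comes to grammar and spelling. There is a wrong sentence. Spelling is off.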
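
A word-level comparison like the one in this output can be produced with Python's standard difflib module. The sketch below is one way to do it, not necessarily how the slide's output was generated. The corrected string is typed in by hand here rather than returned by a model, and both strings are shortened.

```python
import difflib

original = "This is a mock short paragraf. These is a wrong sentence. Speling is of."
corrected = "This is a mock short paragraph. There is a wrong sentence. Spelling is off."

old_words = original.split()
new_words = corrected.split()

# Collect every stretch of words that differs between the two versions.
changes = []
matcher = difflib.SequenceMatcher(a=old_words, b=new_words)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        changes.append((" ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])))

for old, new in changes:
    print(f"{old!r} -> {new!r}")
```

Adjacent changed words are reported together, so "paragraf. These" appears as a single replacement.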

Example 3: Comparing speeches by politicians

Which politicians give speeches most similar to Boris Johnson's?

Example 4: Wordclouds

How do we create wordclouds?

Example 5: Language Complexity

How do we measure language complexity?

Example 6: Measuring Aggression

Have women become more aggressive in their political communication over time?

Example 7: Topics in Speeches

What does Boris Johnson talk about in his political speeches?

Installing R, RStudio, and Anaconda

We will install three programs.

  • R: programming language
  • RStudio: interface
  • Python through Anaconda: programming language

Installing R and RStudio

We can do that by going to: https://posit.co/download/rstudio-desktop/

Install the version of RStudio that is relevant to your OS.

Note that there are different files for Apple silicon (M1/M2) and Intel Macs.

R panels

The RStudio interface is divided into several panels.

Quarto Files

Quarto is the successor to R Markdown from Posit (formerly RStudio); it lets us run code and write text in the same document.

Quarto files have the .qmd extension.

You can produce a wide variety of output types:

  • executable code blocks
  • plots
  • tabular output from data frames
  • plain text output

Installation

You can now start typing.

To use Quarto with R, you should install the rmarkdown R package:

install.packages("rmarkdown")

Using R

Let us now use R to understand how it works.

Let's create a new Quarto document and save it in your “week2” folder.

Installing Anaconda

Now go to https://www.anaconda.com

Setting Python to Work with RStudio

macOS

Open the macOS Terminal (search for “Terminal” in the search box) and type:

which python

This provides the location of Python. In my case, this is:

/opt/anaconda3/bin/python

Copy that path, paste it into the Python interpreter field in RStudio, and click “Apply”.

Windows

On Windows, press “Select” in the same dialog.

In the window that opens, choose “Conda Environments”.

Now, choose “Apply”.

Installing packages in Python

macOS

To install Python packages on macOS, open the Terminal (you can search for “Terminal” in your search box).

Then type pip install followed by the package name.

For example, pip install pandas

Windows

To install Python packages on Windows, search for Anaconda Prompt.

In it, type pip install followed by the package name.

For example, pip install pandas
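
After installing, it can be useful to confirm that Python actually finds the packages. The sketch below is one way to check from within Python using only the standard library; the helper name is_installed and the list of packages are illustrative.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


for name in ["pandas", "numpy", "nltk"]:
    status = "installed" if is_installed(name) else "missing, try: pip install " + name
    print(f"{name}: {status}")
```

Note that the name used in import statements sometimes differs from the name used with pip, so a package can be installed even if a check under its pip name fails.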

Using Python in Quarto

Let us now install the reticulate package in R:

install.packages("reticulate")

The final step for using Python in Quarto is to add the following at the beginning of the Quarto document, adjusting the path to the location of Python on your machine:

macOS

```{r}
reticulate::use_python("/opt/anaconda3/bin/python", required = TRUE)
```

Windows

```{r}
reticulate::use_python("C:/ProgramData/anaconda3/python.exe", required = TRUE)
```
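
Once this setup chunk has run, a quick way to confirm which interpreter is active is to run a couple of lines of Python, for instance inside a Python chunk of the same document. This is a small sanity check, assuming the setup above succeeded.

```python
import sys

# Path of the Python interpreter running this code; it should match the path
# passed to reticulate::use_python() in the setup chunk.
print(sys.executable)

# Version number of that interpreter, e.g. "3.x.y".
print(sys.version.split()[0])
```

If the printed path is not the Anaconda one, the path given to use_python probably needs adjusting.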

Conclusion

  • Build a strong foundation in Python for text analysis tasks.
  • Learn to process, analyze, and visualize textual data effectively.
  • Gain hands-on experience with essential Python libraries and tools.
  • Understand real-world applications of text analysis across various fields.
  • Develop confidence to explore advanced topics in future courses.