PL/CS 362-1: Applied Computational Methods for Social Sciences

Introduction to Text Analysis with Python

NOTE: This course is not complete. The lectures are not in their final version. New lectures and labs will also be added during the course.

1 Details

Instructor: Bogdan G. Popescu
Hours: MW 10:00-11:15AM
Total Hours of Contact: 2:30 per week
Room: G.K.1.4-Guarini Campus, Kushlan Wing, First Floor, Room 4

Credits: 3
Prerequisites: None
Office Hours: TBA

2 Course Description

This is an introductory course on text-as-data in Python, a free, general-purpose programming language widely used for data analysis. The first part of the course introduces the Python programming language. The second part focuses on Python’s ability to analyze text data, including social media posts, email correspondence, customer reviews, and presidential debates. Through hands-on exercises, the course introduces students to the concepts and applications of text analysis.

Much information about the world is stored in text. Knowing how to analyze text in a systematic and replicable way can provide new insights about the world around us.

Note: You should have a computer (either Windows or Mac) with at least 8GB of RAM.

3 Summary of Course Content

This introductory course on text data in Python explores the language’s capabilities for data analysis and visualization, with a specific focus on text data manipulation. The first segment of the course acquaints students with the fundamentals of Python, a versatile and free tool. The course then turns to Python’s ability to handle text data. Participants learn to perform text analyses in Python, with particular emphasis on lab sessions that apply techniques and methodologies to real-world scenarios. The skills acquired in this course are relevant in political science, economics, business, and marketing.

4 Learning Outcomes

Upon successful completion of this course, students will be able to:

  • execute basic programming tasks in Python (e.g., for loops, conditional statements, while loops)
  • understand basic text analysis terms and concepts
  • use Python to conduct text analysis
  • create a professional website containing a portfolio

5 Assessment

5.1 Assessment methods:

You will be graded on four problem sets during the semester (each 12.5% of your grade) and a final report and presentation (each 25% of your grade).

  • 4 problem sets: 50% of the final grade (12.5% each)
  • final presentation: 25% of the final grade
  • final project report: 25% of the final grade

5.2 Problem Sets

Each problem set is graded in two components:

1. Initial Individual Submission: This component contributes 50% of the grade for each problem set. You complete the problem set independently and submit it to the instructor. This component is graded on the quality of your independent work.

2. Final Submission After Group Consultation: This component contributes the remaining 50% of the grade for each problem set. After discussing the problem set with your group members and documenting the correct answers, you submit a revised version individually. Note: no group submission is permitted. Each student must submit the second attempt individually. This component is graded on the quality of your final submission after group consultation.
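
As an illustration of how the weights in Sections 5.1 and 5.2 combine, the short sketch below computes a final course grade. The function names and the sample scores are hypothetical, and all scores are assumed to be on a 0-100 scale.

```python
def problem_set_grade(initial, final):
    """Average the individual and post-consultation submissions (50% each)."""
    return 0.5 * initial + 0.5 * final


def course_grade(problem_sets, presentation, report):
    """Combine the components using the weights listed in Section 5.1."""
    if len(problem_sets) != 4:
        raise ValueError("expected exactly four problem sets")
    # Each problem set is worth 12.5% of the final grade
    ps_total = sum(0.125 * problem_set_grade(i, f) for i, f in problem_sets)
    # Presentation and report are worth 25% each
    return ps_total + 0.25 * presentation + 0.25 * report


# Hypothetical scores: (initial, final) for each of the four problem sets
scores = [(80, 90), (70, 90), (85, 95), (90, 100)]
print(course_grade(scores, presentation=88, report=92))  # 88.75
```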

5.3 Final Project

Students will undertake a text analysis project emphasizing the practical application of text analysis techniques using the Python programming language. This project offers an opportunity to showcase the acquired skills in manipulating text data and conducting meaningful analyses. Participants are encouraged to choose a topic of interest or relevance, using datasets provided in the course or exploring new ones. The project entails a few steps:

  • choosing a topic that involves text data
  • acquiring data either from the course materials or from external sources

Students must submit a two-page report written in Quarto and give a 20-minute in-class presentation, also prepared in Quarto.

Grading Criteria for the project:

  • Relevance
  • Methodology: demonstration of at least five distinct text analysis procedures in Python.
  • Analysis
  • Presentation
  • Code Quality

6 Attendance Requirements

Students are required to attend classes following the University’s policies. Students with more than two unexcused absences are assumed to have withdrawn from the course. Students with a justified reason not to attend class must email me ahead of class explaining why they cannot attend and must submit a form to the Dean’s Office.

7 Academic Honesty

As stated in the university catalog, any student who commits an act of academic dishonesty will receive a failing grade on the work in which the dishonesty occurred. In addition, acts of academic dishonesty, irrespective of the weight of the assignment, may result in the student receiving a failing grade in the course. Instances of academic dishonesty will be reported to the Dean of Academic Affairs. A student who is reported twice for academic dishonesty is subject to summary dismissal from the University. In such a case, the Academic Council will then make a recommendation to the President, who will make the final decision.

8 Students with Learning or Other Disabilities

John Cabot University does not discriminate on the basis of disability or handicap. Students with approved accommodations must inform their professors at the beginning of the term. Please see the website for the complete policy.

9 Required Books

General Books on Python

McKinney, Wes, 2022. Python for Data Analysis. O’Reilly. https://wesmckinney.com/book/.

Nguyen, Mike, 2023. A Guide on Data Analysis. https://bookdown.org/mike/data_analysis/

Text Analysis

Hovy, Dirk. Text Analysis in Python for Social Scientists. Cambridge University Press. https://www.cambridge.org/core/elements/abs/text-analysis-in-python-for-social-scientists/BFAB0A3604C7E29F6198EA2F7941DFF3

Bengfort, Benjamin, Rebecca Bilbro, and Tony Ojeda. 2018. Applied Text Analysis with Python. O’Reilly Media, Inc. https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/

Hvitfeldt, Emil, and Julia Silge. 2022. Supervised Machine Learning for Text Analysis in R. CRC Press. https://smltar.com

Grimmer, Justin, Brandon M. Stewart, and Margaret E. Roberts. 2022. Text As Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.

Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. First edition. O’Reilly.

Week 1

Class 1: Intro to Python and Text Analysis
01/15/2024 - Mon - Lecture

  • Introduction and course overview
  • Introduction to Python programming and its applications in text analysis
  • Overview of text data (e.g., social media, debates, reviews)

Class 2: Intro to Python, Jupyter notebooks, and R Quarto
01/17/2024 - Wed - Lecture

  • Installing Python and setting up your environment
  • Quarto files: notebooks vs. presentations
  • The Python environment

Reading
McKinney, Wes, 2022. Python for Data Analysis. O’Reilly. https://wesmckinney.com/book/. C1-2

Week 2

Class 1: Variables, Strings, Numbers
01/22/2024 - Mon - Lecture

  • Variables
  • Strings
  • Combining strings
  • Numbers
  • Booleans
  • Conditionals: if and elif
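
A minimal sketch of the kind of code this class covers. The variable names and values are hypothetical and chosen only for illustration.

```python
# Variables holding a string, an integer, and a float
speaker = "Senator Smith"   # hypothetical name
word_count = 1250
share_positive = 0.62

# Combining strings with an f-string
summary = f"{speaker} spoke {word_count} words."

# Booleans result from comparisons
is_long = word_count > 1000

# Conditionals: if / elif / else
if share_positive > 0.6:
    tone = "positive"
elif share_positive < 0.4:
    tone = "negative"
else:
    tone = "mixed"

print(summary, tone, is_long)
```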

Class 2: Conditional Statements, Lists, and Loops
01/24/2024 - Wed - Lecture

  • Comparison Operators
  • Conditional Statements
  • Error Handling
  • Lists
  • Loops
  • Breaks
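
The topics above can be combined in one short loop. This is only a sketch: the input list is made up, and the word "stop" is an arbitrary signal chosen for the example.

```python
raw_values = ["12", "7", "n/a", "30", "stop", "99"]  # hypothetical input
numbers = []

for value in raw_values:
    if value == "stop":
        break  # leave the loop early; "99" is never reached
    try:
        numbers.append(int(value))
    except ValueError:
        # int() raises ValueError for text such as "n/a"; skip that item
        continue

# Comparison operator inside a list comprehension
large = [n for n in numbers if n >= 10]
print(numbers, large)
```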

Reading
McKinney, Wes, 2022. Python for Data Analysis. O’Reilly. https://wesmckinney.com/book/. C3

Assignment 1

Week 3

Class 1: Operations with Lists, Tuples
01/29/2024 - Mon - Lecture

  • Lists
  • List Operations
  • List Slicing
  • Tuples
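
A brief sketch of list operations, slicing, and tuples, using a made-up list of words:

```python
words = ["we", "the", "people", "of", "the", "united", "states"]

first_three = words[:3]       # slice: items at positions 0, 1, 2
last_two = words[-2:]         # slice: final two items
every_other = words[::2]      # slice with a step of 2

words.append("in")            # lists are mutable
combined = words + ["order"]  # concatenation creates a new list

# Tuples are immutable; they suit fixed records such as (word, count)
pair = ("the", words.count("the"))
word, count = pair            # tuple unpacking
```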

Class 2: While Loops and Functions
01/31/2024 - Wed - Lecture

  • While loops
  • Functions
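
As a small illustration, the function below wraps a while loop. The function name `count_until` is hypothetical and not taken from the course materials.

```python
def count_until(words, stop_word):
    """Count how many words come before the first occurrence of stop_word."""
    i = 0
    # Keep going while there are words left and the stop word is not reached
    while i < len(words) and words[i] != stop_word:
        i += 1
    return i


print(count_until(["a", "b", "c", "d"], "c"))  # 2
```

If the stop word never appears, the loop ends when the list is exhausted and the function returns the length of the list.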

Reading
McKinney, Wes, 2022. Python for Data Analysis. O’Reilly. https://wesmckinney.com/book/. C3

Week 4

Class 1: Dictionaries and Sets
02/05/2024 - Mon - Lecture

  • dictionaries
  • extracting elements from dictionary
  • sorting dictionary keys
  • nesting dictionaries
  • dictionaries within dictionaries
  • sets
  • operations with sets
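
A minimal sketch of nested dictionaries and set operations. The speakers and counts are invented for the example.

```python
# Word counts per speaker, stored as dictionaries within a dictionary
counts = {
    "speaker_a": {"tax": 4, "jobs": 2},
    "speaker_b": {"jobs": 5, "health": 3},
}

jobs_a = counts["speaker_a"]["jobs"]             # nested lookup
health_a = counts["speaker_a"].get("health", 0)  # default if the key is missing
speakers = sorted(counts)                        # sorted dictionary keys

# Sets of distinct words used by each speaker
vocab_a = set(counts["speaker_a"])
vocab_b = set(counts["speaker_b"])
shared = vocab_a & vocab_b      # intersection
all_words = vocab_a | vocab_b   # union
only_a = vocab_a - vocab_b      # difference
```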

Class 2: NumPy
02/07/2024 - Wed - Lecture

  • NumPy
  • Arrays
  • NumPy functions
  • Indexing and slicing
  • Operations with Arrays
  • Multidimensional arrays
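
A short sketch of array indexing, slicing, and arithmetic, assuming NumPy is installed. The small matrix of word counts is hypothetical.

```python
import numpy as np

# Rows are documents, columns are counts for three hypothetical terms
dtm = np.array([[2, 0, 1],
                [0, 3, 1],
                [4, 1, 0]])

print(dtm.shape)                      # (3, 3)
first_row = dtm[0]                    # indexing a row
first_two_cols = dtm[:, :2]           # slicing columns
doc_lengths = dtm.sum(axis=1)         # total count per document
term_totals = dtm.sum(axis=0)         # total count per term
shares = dtm / doc_lengths[:, None]   # broadcasting: row-wise proportions
```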

Week 5

Class 1: Pandas
02/07/2024 - Mon - Lecture

  • Dataframes and Series
  • Extracting elements from Pandas Series
  • Appending
  • Dataframe Methods
  • Missing Data
  • Assigning Value

Class 2: Pandas Data Wrangling
02/09/2024 - Wed - Lecture

  • Setting Indexes
  • Changing, ordering column names
  • Identifying and Removing duplicates
  • Filtering Data
  • Aggregating Data
  • Modifying Data Structures
  • Aggregating by Group
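
A minimal sketch of removing duplicates, filtering, and aggregating by group, assuming pandas is installed. The column names and values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "party": ["A", "A", "B", "B", "B"],
    "speaker": ["x", "x", "y", "z", "z"],
    "n_words": [100, 100, 250, 80, 120],
})

df = df.drop_duplicates()                        # drops the repeated first row
long_speeches = df[df["n_words"] >= 100]         # boolean filtering
by_party = df.groupby("party")["n_words"].sum()  # aggregation by group
print(by_party)
```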

Week 6

Class 1: Merging Data
02/09/2024 - Mon - Lecture

  • Inner, Outer Joins
  • Left, Right Joins
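
The four join types can be compared on two small tables. This is a sketch assuming pandas is installed; the tables and the `speaker_id` key are hypothetical.

```python
import pandas as pd

speeches = pd.DataFrame({"speaker_id": [1, 2, 3], "n_words": [500, 320, 410]})
parties = pd.DataFrame({"speaker_id": [2, 3, 4], "party": ["A", "B", "A"]})

# Inner: only keys present in both tables (2 and 3)
inner = speeches.merge(parties, on="speaker_id", how="inner")
# Left: all rows from speeches; party is missing for speaker 1
left = speeches.merge(parties, on="speaker_id", how="left")
# Right: all rows from parties; n_words is missing for speaker 4
right = speeches.merge(parties, on="speaker_id", how="right")
# Outer: every key from either table (1, 2, 3, 4)
outer = speeches.merge(parties, on="speaker_id", how="outer")

print(len(inner), len(left), len(right), len(outer))  # 2 3 3 4
```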

Class 2: Data Visualization 1
02/09/2024 - Wed - Lecture

  • Color contrasts
  • Intro to ggplot
  • Aesthetics and geoms
  • Labels and Facets

Week 7

Class 1: Data Visualization 2
02/07/2024 - Mon - Lecture

  • Barplots
  • Uncertainty
  • Boxplots and Violin Plots
  • Annotations
  • Temporal Plots

Class 2: Text Analysis Intro
02/09/2024 - Wed - Lecture

  • Text tokenization
  • turning words to lowercase
  • removing punctuation
  • removing numbers
  • lemmatization
  • stemming
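
A sketch of these preprocessing steps using only the Python standard library. The `crude_stem` function is a deliberately simple toy, not the Porter algorithm. Proper stemming and lemmatization usually rely on a library such as NLTK or spaCy and are not shown here.

```python
import re
import string


def preprocess(text):
    """Lowercase, strip ASCII punctuation and digits, then split on whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)
    return text.split()


def crude_stem(token):
    """Toy stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


tokens = preprocess("In 2020, voters were voting; candidates debated taxes!")
stems = [crude_stem(t) for t in tokens]
print(tokens)
print(stems)
```

The output shows why crude suffix stripping is only a starting point: "voting" becomes "vot" and "taxes" becomes "taxe".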

Week 8

Class 1: NLP Pipelines
02/07/2024 - Mon - Lecture

  • Using text analysis in dataframes
  • Using NLP pipelines within functions
  • Word Counts
  • Word Clouds
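
A minimal sketch of a word-count step wrapped in a function and applied to several documents. The documents are invented, and the tokenization is a plain whitespace split.

```python
from collections import Counter

docs = [
    "the economy is strong",
    "the economy needs reform",
    "reform the tax code",
]  # hypothetical documents


def word_counts(text):
    """Return a Counter of lowercase whitespace-separated tokens."""
    return Counter(text.lower().split())


# Accumulate counts across all documents
total = Counter()
for doc in docs:
    total.update(word_counts(doc))

top_two = total.most_common(2)
print(top_two)
```

The same function could be applied to a text column of a dataframe, one row at a time.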

Class 2: Sentiment Analysis
02/07/2024 - Wed - Lecture

  • Lexicon-based sentiment analysis
  • Bag-of-Words sentiment analysis
  • Confusion Matrix
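
A toy illustration of lexicon-based sentiment analysis and a confusion matrix. The two word lists and the labeled reviews are made up for the example. Real lexicons are far larger.

```python
POSITIVE = {"good", "great", "excellent"}  # toy lexicon, for illustration only
NEGATIVE = {"bad", "poor", "terrible"}


def lexicon_label(text):
    """Label a text by comparing counts of positive and negative lexicon words."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "pos" if score > 0 else "neg"


reviews = [
    ("great food and excellent service", "pos"),
    ("terrible and poor quality", "neg"),
    ("not good", "neg"),   # negation fools a simple lexicon
    ("good value", "pos"),
]

# Confusion matrix as a dictionary keyed by (true label, predicted label)
confusion = {(t, p): 0 for t in ("pos", "neg") for p in ("pos", "neg")}
for text, truth in reviews:
    confusion[(truth, lexicon_label(text))] += 1

accuracy = (confusion[("pos", "pos")] + confusion[("neg", "neg")]) / len(reviews)
print(confusion, accuracy)
```

The misclassified "not good" review lands in the (true "neg", predicted "pos") cell, which is the kind of error a confusion matrix makes visible.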

Week 9

Class 1: Topic Modeling
02/07/2024 - Mon - Lecture

  • Introduction to topic modeling
  • Latent Dirichlet Allocation (LDA)
  • Evaluating model performance

Class 2: Word Embeddings

  • Word Embeddings and their importance in NLP
  • BERT (Bidirectional Encoder Representations from Transformers) architecture
  • Transformer architecture and its components
  • Applications of BERT

Week 10

Class 1: Text Classification
02/07/2024 - Mon - Lecture

  • Introduction to text classification
  • Supervised vs. unsupervised learning
  • Building and evaluating a classification model (e.g., spam detection)

Class 2: Named Entity Recognition (NER)
02/07/2024 - Wed - Lecture

  • Understanding NER
  • Using libraries like spaCy for NER
  • Applications of NER in social science

Week 11

Class 1: Ethical Considerations in Text Analysis
02/07/2024 - Mon - Lecture

  • Ethical implications of text data collection
  • Privacy concerns and data usage
  • Discussing case studies in text analysis ethics

Class 2: Challenges in Text Analysis
02/07/2024 - Wed - Lecture

  • Handling biased datasets
  • Dealing with noise in text data
  • Limitations of text analysis techniques

Week 12

Class 1: Project Development Time
02/07/2024 - Mon

  • Students work on their final projects
  • In-class support and troubleshooting

Class 2: Student Proposal Presentations
02/07/2024 - Wed

  • Question
  • Data sources
  • Proposed text analysis methods

Week 13

Class 1: Project Development Time
02/07/2024 - Mon

Week 14

Class 1: Course Reflection

04/24/2024 - Mon - Lecture

  • Overview of the Course
  • Skills Acquired
  • Jobs where these skills are valued

Class 2: Final Student Presentations
02/07/2024 - Wed

Supplementary Tutorials

Making a website on GitHub: username.github.io

  • Making a website on Quarto
  • Storing a website on GitHub
  • Structuring a professional website for Data Analytics Jobs