Working with Data in R

Dataframes, Lists, External Files, Paths

Bogdan G. Popescu

John Cabot University

What You’ll Learn Today

  • Creating and Subsetting Dataframes
  • Loading External files: .csv and .xlsx
  • Relative vs. Absolute Paths
  • Handling file or directory errors
  • Examining entries in a dataframe

Using R

Press Cmd + A (Mac) or Ctrl + A (Windows) to select everything, then press Delete

Then type:

---
title: "Notebook"
author: "Your Name"
date: "July 26, 2025"
format:
  html:
    toc: true
    number-sections: true
    colorlinks: true
    smooth-scroll: true
    embed-resources: true
---

Creating a Dataframe

# Define student names and grades
students <- c("Alex", "Jane", "Tom", "Lilly", "Turner", "Ruby", "Nick")
grade <- c(77, 81, 89, 83, 99, 92, 97)
# Create a data frame from the vectors
df <- data.frame(student = students, grade = grade)
df

df is a dataframe

student and grade are variables within that dataframe
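
A quick way to confirm this is to ask R for the shape and the column names of the dataframe. The sketch below rebuilds the same df so that it runs on its own; names(), nrow(), ncol(), and str() are all base R functions.

```r
# Rebuild the dataframe from the previous slide
students <- c("Alex", "Jane", "Tom", "Lilly", "Turner", "Ruby", "Nick")
grade <- c(77, 81, 89, 83, 99, 92, 97)
df <- data.frame(student = students, grade = grade)

# Names of the variables (columns)
names(df)
# Number of observations (rows) and number of variables (columns)
nrow(df)
ncol(df)
# Compact overview: one line per variable, with its type
str(df)
```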

Subset rows for specific students

This is how we select specific observations

subset_df <- subset(df, student %in% c("Alex", "Jane", "Turner"))
subset_df
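
The same selection can also be written with square brackets, a form that appears often in base R code. This is a sketch that rebuilds df so it runs alone; subset_brackets is a name chosen here for illustration.

```r
# Rebuild the dataframe from the earlier slide
students <- c("Alex", "Jane", "Tom", "Lilly", "Turner", "Ruby", "Nick")
grade <- c(77, 81, 89, 83, 99, 92, 97)
df <- data.frame(student = students, grade = grade)

# Keep the rows where the condition is TRUE;
# the empty slot after the comma keeps all columns
subset_brackets <- df[df$student %in% c("Alex", "Jane", "Turner"), ]
subset_brackets

# Conditions can also use numbers, e.g. students with a grade above 90
high_grades <- df[df$grade > 90, ]
high_grades
```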

Summary Statistics

This is how we can calculate the average grade in each of the two dataframes

  • Dataframe 1: df has two variables: student and grade
  • Dataframe 2: subset_df has two variables: student and grade

Summary Statistics

This is how we calculate the mean

# Mean grade for the whole class
mean(df$grade)
[1] 88.28571
# Mean grade for Alex, Jane, and Turner
mean(subset_df$grade)
[1] 85.66667

Summary Statistics

This is how we identify Max & Min

# Find the highest and lowest grades
max_grade <- max(df$grade)
min_grade <- min(df$grade)

# Find the students with those grades
df$student[df$grade == max_grade]
[1] "Turner"
df$student[df$grade == min_grade]
[1] "Alex"
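
An alternative sketch uses which.max() and which.min(), which return the row position of the highest and lowest value. One difference worth knowing: if two students tie, the == comparison above returns all of them, while which.max() and which.min() return only the first.

```r
# Rebuild the dataframe from the earlier slide
students <- c("Alex", "Jane", "Tom", "Lilly", "Turner", "Ruby", "Nick")
grade <- c(77, 81, 89, 83, 99, 92, 97)
df <- data.frame(student = students, grade = grade)

# Row position of the highest and of the lowest grade
top_row <- which.max(df$grade)
bottom_row <- which.min(df$grade)

# Look up the full rows at those positions
df[top_row, ]
df[bottom_row, ]

# range() returns the minimum and the maximum in one call
range(df$grade)
```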

Indexing Lists

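This is how we index elements by position. Strictly speaking, c() creates a character vector rather than an R list, but positions work the same way with single brackets.

# Create a character vector (c() builds a vector, not a list)
list_new <- c("el1", "el2", "el3")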

# Get the second element
list_new[2]
[1] "el2"
# Get the last element
list_new[length(list_new)]
[1] "el3"
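
For comparison, here is a sketch of a true R list, created with list(). The object built with c() above is a character vector; a list can hold elements of different types, and double brackets [[ ]] extract the element itself while single brackets [ ] return a shorter list. The name true_list and its contents are made up for illustration.

```r
# A list can mix types: a string, a number, a logical value
true_list <- list("el1", 2, TRUE)

# Single brackets return a list of length one
true_list[2]
# Double brackets return the element itself (here, the number 2)
true_list[[2]]
# Last element, using the same length() trick as above
true_list[[length(true_list)]]
```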

Clearing Memory

We can remove all objects from the R workspace (not files on your computer) with the following command:

# Remove all objects from the workspace
rm(list = ls())

Clearing Memory

Notice the difference before and after
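
The before-and-after difference can also be checked in code with ls(), which lists the objects that currently exist. To avoid wiping your own session, this sketch works inside a separate environment called demo_env (a name made up here); rm(list = ls()) at the top level does the same thing to your workspace.

```r
# A separate environment stands in for the workspace
demo_env <- new.env()
assign("students", c("Alex", "Jane"), envir = demo_env)
assign("grade", c(77, 81), envir = demo_env)

# Before: two objects are listed
ls(demo_env)

# Remove everything inside demo_env
rm(list = ls(demo_env), envir = demo_env)

# After: nothing is left
ls(demo_env)
```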

Loading External Datasets

Why Load External Files?

  • Real-world data usually comes from external sources
  • You’ll work with .csv, .xlsx, .txt, or .tsv files
  • Goal: Load the file → turn it into a data frame → analyze it

Loading External Datasets

Common File Types

File Type   Description              R Function
.csv        Comma-separated values   read.csv()
.tsv        Tab-separated values     read.delim()
.txt        Generic text file        read.table()
.xlsx       Excel spreadsheet        readxl::read_excel()
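
As a sketch of how these functions are called, the block below writes a tiny made-up table to temporary files and reads it back. The data, the temporary files, and the object names are all placeholders. read_excel() is left out because it needs the readxl package and an existing .xlsx file.

```r
# Made-up example data
toy <- data.frame(country = c("A", "B"), value = c(1.5, 2.5))

# Write it as .csv and as .tsv into temporary files
csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")
write.csv(toy, csv_file, row.names = FALSE)
write.table(toy, tsv_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the files back into data frames
from_csv <- read.csv(csv_file)
from_tsv <- read.delim(tsv_file)
# read.table() is the general function: separator and header must be stated
from_txt <- read.table(tsv_file, sep = "\t", header = TRUE)
```

read.csv() and read.delim() are convenience versions of read.table() with the separator and header already set.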

Opening a File

Download the following datasets from Dropbox:

Opening a File

Now put them in your working directory

Place them in a folder called “data” inside the working directory (e.g. “week2/lab/” below)
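
A quick sketch for checking the setup from R itself: getwd() prints the working directory, and list.files() shows what is inside the data folder. The folder name "data" comes from the instruction above; if the folder does not exist yet, list.files() simply returns an empty result.

```r
# Where is R currently looking for files?
working_dir <- getwd()
working_dir

# Does a folder called "data" exist inside the working directory?
has_data_folder <- dir.exists("data")
has_data_folder

# Which files does it contain?
data_files <- list.files("data")
data_files
```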

Opening a File

To find the path to the file, add a new code chunk and type:

# This opens a file dialog to select your file
file_path <- file.choose()
file_path

Opening a File

This is what you should see.

The part in red will differ from computer to computer

Opening a File

Notice how the path reflects your folder structure

".../week2/lab/data/life-expectancy.csv"

Relative paths

We can now work with relative paths

Remember this?

Relative paths

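This is how we read the csv file. We store the folder location once in path_data and build the path to each file from it with file.path(). Note that path_data below starts at the root of the file system (it begins with /Users), so it is itself an absolute path; if the data folder sits inside your working directory, the relative equivalent is simply "data".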

# Defining Paths
path_data <- "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/lecture6/data/"
# Use file.path() to construct full path
life_expectancy_df <- read.csv(file.path(path_data, "life-expectancy.csv"))
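
To see what file.path() does, the sketch below builds a path from a folder and a file name without reading anything. It uses the short folder name "data" as a stand-in for your own path_data; folder_demo and full_path are names made up for this example. file.exists() then reports whether a file is really at that location.

```r
# Stand-in for path_data: a folder inside the working directory
folder_demo <- "data"

# file.path() joins the pieces with a forward slash
full_path <- file.path(folder_demo, "life-expectancy.csv")
full_path
#> [1] "data/life-expectancy.csv"

# TRUE only if the file is actually there
file.exists(full_path)
```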

Relative paths

We can now also load the other dataframe

# Use file.path() to construct full path
urbanization_df <- read.csv(file.path(path_data, "share-of-population-urban.csv"))

Recap

Relative Paths vs. Absolute Paths

Notice the difference: a relative path is resolved from your working directory, while an absolute path starts at the root of the file system and does not depend on the working directory.

Relative Paths

# Defining Paths: the "data" folder sits inside the working directory
path_data <- "data"
# Use file.path() to construct the path to the file
life_expectancy_df <- read.csv(file.path(path_data, "life-expectancy.csv"))

Absolute Paths

# Defining Paths: the full location, starting from the root of the file system
life_expectancy_df <- read.csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/lecture6/data/life-expectancy.csv")
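
To make the distinction concrete, the following sketch shows how a relative path is completed with the working directory. is_absolute_path() is a small helper written here for illustration, not a built-in function, and its rule (a leading "/", "~", or a drive letter) is a simplification. The resolved path depends on your machine, so no output is shown.

```r
# Hypothetical helper: does the path start at the root of the file system?
is_absolute_path <- function(path) {
  # "/" or "~" on macOS/Linux, a drive letter such as "C:" on Windows
  grepl("^(/|~|[A-Za-z]:)", path)
}

# A relative path: no leading slash or drive letter
relative_path <- file.path("data", "life-expectancy.csv")

# R completes a relative path by starting from the working directory
resolved_path <- file.path(getwd(), relative_path)

is_absolute_path(relative_path)   # relative
is_absolute_path(resolved_path)   # absolute
```

The relative version keeps working when the project folder is moved or shared, as long as the data folder travels with it.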

Common Error

One common error is the following

Common Error

If you get that error, your path is not correct

Go back to the previous steps and identify the path to your file.

# This opens a file dialog to select your file
file_path <- file.choose()
file_path
# Defining Paths
path_data <- "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/lecture6/data/"
# Use file.path() to construct full path
life_expectancy_df <- read.csv(file.path(path_data, "life-expectancy.csv"))
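
One defensive pattern, sketched here on the assumption that the error comes from R not finding the file, is to test the path with file.exists() before reading. The helper read_csv_checked() and the folder name no_such_folder are hypothetical, written only for this illustration; the helper stops with a message that prints the path R actually tried.

```r
# Hypothetical helper: read a csv only if the file is really there
read_csv_checked <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path, "\nWorking directory: ", getwd())
  }
  read.csv(path)
}

# tryCatch() shows the message instead of halting the notebook
result <- tryCatch(
  read_csv_checked(file.path("no_such_folder", "life-expectancy.csv")),
  error = function(e) {
    message(conditionMessage(e))
    NULL
  }
)
```

Printing the working directory next to the path often makes the mismatch obvious.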

Our Data

Let us now investigate our two datasets

Our Data

Examining the First Entries

This is how we can examine the first five entries

head(life_expectancy_df, n=5)

You should see:

Our Data

Examining the First Entries

This is how we can examine the first five entries

head(urbanization_df, n=5)

You should see:
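
head() has a few companions in base R. The sketch below uses a five-row stand-in called le_demo, built from the first Afghanistan values that appear in the glimpse() output in these slides, so that it runs without the csv file; with the real data you would pass life_expectancy_df instead.

```r
# Five-row stand-in for life_expectancy_df
le_demo <- data.frame(
  Entity = rep("Afghanistan", 5),
  Code = rep("AFG", 5),
  Year = 1950:1954,
  Life.expectancy.at.birth..historical. = c(27.7, 28.0, 28.4, 28.9, 29.2)
)

head(le_demo, n = 2)   # first two rows
tail(le_demo, n = 2)   # last two rows
dim(le_demo)           # number of rows, then number of columns
```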

Installing Packages

This is how you install a package in R

install.packages("tidyverse")

Notice that you need to have quotes: "tidyverse"

Once you are done:

  • delete install.packages("tidyverse")
  • or comment it out by adding “#”
# install.packages("tidyverse")

If you don’t delete it or comment it out, R will try to install the package again every time you render, which can cause errors.

Once you install the package, it stays on your machine, so you only need to install it once.
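
Another option, sketched below, is to install a package only when it is missing, so the line can stay in the notebook. install_if_missing() is a hypothetical helper written for this example; requireNamespace() and install.packages() are base R. The demonstration uses "stats", which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only when it is not yet available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be used
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" is already installed, so install.packages() is never reached here
install_if_missing("stats")
```

In your own notebook the call would be install_if_missing("tidyverse").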

Installing Packages

Errors

As we progress, some commands might result in an error such as:

Error in library(pillar) : there is no package called 'pillar'

If you see this error, the package is not installed yet (or its name is misspelled). Check the name and install the package:

install.packages("pillar")

Once the package is installed, comment the line out or delete it

# install.packages("pillar")

Installing Packages

Loading Packages

To use the commands associated with the package, you need to load it

library("pillar")

You need to do this in every new R session before using the package’s functions

Our Data

Examining the First Entries using glimpse

We will examine our data using glimpse

library("pillar")
glimpse(life_expectancy_df)
Rows: 20,445
Columns: 4
$ Entity                                <chr> "Afghanistan", "Afghanistan", "A…
$ Code                                  <chr> "AFG", "AFG", "AFG", "AFG", "AFG…
$ Year                                  <int> 1950, 1951, 1952, 1953, 1954, 19…
$ Life.expectancy.at.birth..historical. <dbl> 27.7, 28.0, 28.4, 28.9, 29.2, 29…

or, without loading the package first, call the function with its package prefix:

pillar::glimpse(life_expectancy_df)
Rows: 20,445
Columns: 4
$ Entity                                <chr> "Afghanistan", "Afghanistan", "A…
$ Code                                  <chr> "AFG", "AFG", "AFG", "AFG", "AFG…
$ Year                                  <int> 1950, 1951, 1952, 1953, 1954, 19…
$ Life.expectancy.at.birth..historical. <dbl> 27.7, 28.0, 28.4, 28.9, 29.2, 29…

We have four variables within our dataframe:

  • Entity: string or character variable
  • Code: string or character variable
  • Year: numeric or integer variable
  • Life.expectancy.at.birth..historical.: numeric or double precision variable
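
The types can also be checked directly. This sketch uses a three-row stand-in (type_demo, a made-up name) with values taken from the glimpse() output above; note that class() reports double precision columns as "numeric".

```r
# Three-row stand-in with the same column types as life_expectancy_df
type_demo <- data.frame(
  Entity = rep("Afghanistan", 3),
  Code = rep("AFG", 3),
  Year = 1950:1952,
  Life.expectancy.at.birth..historical. = c(27.7, 28.0, 28.4),
  stringsAsFactors = FALSE
)

# Type of every column in one call
column_types <- sapply(type_demo, class)
column_types

# Individual checks
is.character(type_demo$Entity)
is.integer(type_demo$Year)
is.double(type_demo$Life.expectancy.at.birth..historical.)
```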

Each variable records the following:

  • Entity: the country: “Afghanistan”, “Albania”, “Algeria”, etc.
  • Code: the country code: “AFG”, “ALB”, “DZA”, etc
  • Year: year 1950, 1951, 1952, etc.
  • Life.expectancy.at.birth..historical.: life expectancy corresponding to that year
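
The dots in the last column name most likely come from read.csv() itself: by default it passes column headers through make.names(), which replaces spaces and brackets with dots. The sketch assumes the csv header reads "Life expectancy at birth (historical)", which is a guess based on the converted name.

```r
# Assumed header text in the csv file (a guess, not taken from the file)
original_header <- "Life expectancy at birth (historical)"

# make.names() turns it into a syntactically valid R name
converted <- make.names(original_header)
converted
#> [1] "Life.expectancy.at.birth..historical."
```

Passing check.names = FALSE to read.csv() keeps the headers as written, at the cost of needing backticks around such names.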

We Learned a Few Things Today

  • Creating and Subsetting Dataframes
  • Loading External files: .csv and .xlsx
  • Relative vs. Absolute Paths
  • Handling file or directory errors
  • Examining entries in a dataframe