Working with Data in R

Dataframes, Lists, External Files, Paths

Bogdan G. Popescu

bogdan.popescu@johncabot.edu

John Cabot University

What You’ll Learn Today

Creating and Subsetting Dataframes
Loading External Files: .csv and .xlsx
Relative vs. Absolute Paths
Handling file or directory errors
Examining entries in a dataframe

Using R

Press Cmd + A (Mac) or Ctrl + A (Windows) to select everything, then press Delete

Then type:

---
title: "Notebook"
author: "Your Name"
date: "July 26, 2025"
format:
  html:
    toc: true
    number-sections: true
    colorlinks: true
    smooth-scroll: true
    embed-resources: true
---

Creating a Dataframe

# Define student names and grades
students <- c("Alex", "Jane", "Tom", "Lilly", "Turner", "Ruby", "Nick")
grade <- c(77, 81, 89, 83, 99, 92, 97)

# Create a data frame from the vectors
df <- data.frame(student = students, grade = grade)

df

df is a dataframe

student and grade are variables within that dataframe
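
To make the distinction between a dataframe and its variables concrete, the sketch below rebuilds `df` from the slide above and inspects it. It is only an illustration and assumes nothing beyond base R.

```r
# Rebuild the dataframe from the slide above
students <- c("Alex", "Jane", "Tom", "Lilly", "Turner", "Ruby", "Nick")
grade <- c(77, 81, 89, 83, 99, 92, 97)
df <- data.frame(student = students, grade = grade)

# Size of the dataframe: rows are observations, columns are variables
n_rows <- nrow(df)
n_cols <- ncol(df)

# Names of the variables stored in the dataframe
variable_names <- names(df)

# Access a single variable with the $ operator
all_grades <- df$grade

# Compact overview of the structure (types and first values)
str(df)
```

The `$` operator is the same one used later in `mean(df$grade)`.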

Subset rows for specific students

This is how we select specific observations

subset_df <- subset(df, student %in% c("Alex", "Jane", "Turner"))
subset_df
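
As a possible alternative to `subset()`, the same rows can be selected with square brackets. The sketch below rebuilds `df` so it runs on its own; `bracket_df` and `top_df` are names introduced here for illustration only.

```r
# Rebuild the dataframe from the earlier slide
students <- c("Alex", "Jane", "Tom", "Lilly", "Turner", "Ruby", "Nick")
grade <- c(77, 81, 89, 83, 99, 92, 97)
df <- data.frame(student = students, grade = grade)

# subset() as in the slide
subset_df <- subset(df, student %in% c("Alex", "Jane", "Turner"))

# Same selection with square brackets: df[rows, columns]
# Leaving the column position empty keeps all columns
bracket_df <- df[df$student %in% c("Alex", "Jane", "Turner"), ]

# Select on a numeric condition instead: students with a grade above 90
top_df <- df[df$grade > 90, ]
top_df$student
```

Both forms keep the original row numbers, which is why the subset shows rows 1, 2, and 5.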

Summary Statistics

This is how we can calculate the average grade in each of the two dataframes

Dataframe 1: df has two variables: student and grade
Dataframe 2: subset_df has two variables: student and grade

This is how we calculate the mean

# Mean grade for the whole class
mean(df$grade)

[1] 88.28571

# Mean grade for Alex, Jane, and Turner
mean(subset_df$grade)

[1] 85.66667

Summary Statistics

This is how we identify Max & Min

# Find the highest and lowest grades
max_grade <- max(df$grade)
min_grade <- min(df$grade)

# Find the students with those grades
df$student[df$grade == max_grade]

[1] "Turner"

df$student[df$grade == min_grade]

[1] "Alex"
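
A shorter route to the same answer is sketched below using `which.max()` and `which.min()`, which return the position of the extreme value. The variable names `top_student`, `bottom_student`, and `grade_range` are introduced here for illustration.

```r
# Rebuild the dataframe from the earlier slide
students <- c("Alex", "Jane", "Tom", "Lilly", "Turner", "Ruby", "Nick")
grade <- c(77, 81, 89, 83, 99, 92, 97)
df <- data.frame(student = students, grade = grade)

# which.max() / which.min() give the row position of the first extreme value
top_student <- df$student[which.max(df$grade)]
bottom_student <- df$student[which.min(df$grade)]

# range() returns the minimum and the maximum together
grade_range <- range(df$grade)

# summary() reports minimum, quartiles, mean, and maximum in one call
summary(df$grade)
```

If two students shared the top grade, `which.max()` would return only the first of them, while the `df$grade == max_grade` comparison used above returns all of them.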

Indexing Lists

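This is how we select elements by their position. The object below is a character vector, the simplest sequence of values in R.

# Create a character vector (c() combines values into a vector)
list_new <- c("el1", "el2", "el3")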

# Get the second element
list_new[2]

[1] "el2"

# Get the last element
list_new[length(list_new)]

[1] "el3"
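
Strictly speaking, `c()` creates a vector, while `list()` creates a true R list. The sketch below, with illustrative names such as `list_mixed`, shows how indexing differs between the two.

```r
# An atomic vector: all elements share one type
vector_new <- c("el1", "el2", "el3")
vector_new[2]                 # second element, still a character vector

# A list: elements may have different types
list_mixed <- list("el1", 2, TRUE)

# Single brackets keep the container: the result is a list of length one
second_as_list <- list_mixed[2]

# Double brackets extract the element itself: here the number 2
second_value <- list_mixed[[2]]

is.list(second_as_list)
is.numeric(second_value)
```

For vectors, single brackets are all you need; double brackets matter once you store mixed content in a list.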

Clearing Memory

We can remove every object from R’s workspace (the Environment pane in RStudio) with the following command:

# Remove all objects from the workspace
rm(list = ls())

Clearing Memory

Notice the difference before and after
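
The before-and-after difference can also be checked in code with `ls()`, which lists the names of existing objects. The sketch below uses a separate scratch environment (an assumption made here so that running it does not delete your own objects).

```r
# A scratch environment stands in for the workspace
scratch <- new.env()
assign("df", data.frame(x = 1:3), envir = scratch)
assign("grade", c(77, 81), envir = scratch)

# Before: the names of the objects that exist
before <- ls(envir = scratch)

# Same idea as rm(list = ls()), but applied to the scratch environment
rm(list = ls(envir = scratch), envir = scratch)

# After: nothing is left
after <- ls(envir = scratch)

before
after
```

In your own session, running `ls()` before and after `rm(list = ls())` shows the same change.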

Loading External Datasets

Why Load External Files?

Real-world data usually comes from external sources
You’ll work with .csv, .xlsx, .txt, or .tsv files
Goal: Load the file → turn it into a data frame → analyze it

Loading External Datasets

Common File Types

File Type	Description	R Function
`.csv`	Comma-separated values	`read.csv()`
`.tsv`	Tab-separated values	`read.delim()`
`.txt`	Generic text file	`read.table()`
`.xlsx`	Excel spreadsheet	`readxl::read_excel()`
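
As a rough illustration of the first three functions in the table, the sketch below writes a tiny invented dataset to temporary files and reads it back. The data and object names are made up for this example; `.xlsx` is left out because it needs the readxl package.

```r
# Invented toy data
example_df <- data.frame(country = c("Italy", "France"),
                         year = c(2000L, 2001L))

# Write the same data as comma-separated and as tab-separated text
csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")
write.csv(example_df, csv_file, row.names = FALSE)
write.table(example_df, tsv_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the files back into dataframes
from_csv <- read.csv(csv_file)
from_tsv <- read.delim(tsv_file)

# read.table() is the general function: header and separator are stated explicitly
from_txt <- read.table(tsv_file, header = TRUE, sep = "\t")

from_csv
```

All three calls return a dataframe with the same two columns.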

Opening a File

Download the following datasets from Dropbox:

Opening a File

Now put them in your working directory

Place them in a folder called “data” inside your working directory (e.g. “week2/lab/” below)

Opening a File

To find the path to the file, add a new code chunk and type:

# This opens a file dialog to select your file
file_path <- file.choose()
file_path

Opening a File

This is what you should see.

The part in red will differ from computer to computer
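
A path returned by `file.choose()` can be split into its folder and its file name. The sketch below uses a hypothetical path (the `/Users/yourname` prefix is invented); yours will start differently.

```r
# Hypothetical path; the beginning will differ on your computer
file_path <- "/Users/yourname/week2/lab/data/life-expectancy.csv"

# The file name is the last part of the path
file_name <- basename(file_path)

# The folder is everything before the file name
data_folder <- dirname(file_path)

# file.path() joins the two parts back together with "/"
rebuilt <- file.path(data_folder, file_name)

file_name
data_folder
```

Storing the folder once and joining it with each file name is the pattern used on the following slides.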

Opening a File

Notice how the path reflects your folder structure

".../week2/lab/data/life-expectancy.csv"

Relative paths

We can now work with relative paths

Remember this?

Relative paths

This is how we read the csv file.

# Defining Paths
# A relative path starts from the working directory (for a notebook, normally the folder that contains it),
# so "data" points to the "data" folder created earlier
path_data <- "data"
# Use file.path() to construct the path to the file
life_expectancy_df <- read.csv(file.path(path_data, "life-expectancy.csv"))

Relative paths

We can now also load the other dataset

# Use file.path() to construct full path
urbanization_df <- read.csv(file.path(path_data, "share-of-population-urban.csv"))

Recap

Relative Paths vs. Absolute Paths

Notice the difference between relative paths and absolute paths

Relative Paths

# Defining Paths
# A relative path starts from the working directory (for a notebook, normally the folder that contains it)
path_data <- "data"
# Use file.path() to construct the path to the file
life_expectancy_df <- read.csv(file.path(path_data, "life-expectancy.csv"))

Absolute Paths

# An absolute path starts from the root of the computer and only works on that computer
life_expectancy_df <- read.csv("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/lecture6/data/life-expectancy.csv")
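
The difference can be checked directly: a relative path is resolved against the working directory, so it and the matching absolute path point to the same file. The sketch below builds a throwaway folder for this purpose; `demo_project` and `demo.csv` are invented names.

```r
# Build a temporary project folder with a "data" subfolder
project_dir <- file.path(tempdir(), "demo_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(x = 1:3),
          file.path(project_dir, "data", "demo.csv"),
          row.names = FALSE)

# Make the project folder the working directory
# setwd() returns the previous working directory, which we restore below
old_wd <- setwd(project_dir)

# Relative path: starts from the working directory
relative_path <- file.path("data", "demo.csv")

# Absolute path: the full location of the same file
absolute_path <- normalizePath(relative_path)

from_relative <- read.csv(relative_path)
from_absolute <- read.csv(absolute_path)

# Restore the original working directory
setwd(old_wd)

identical(from_relative, from_absolute)
```

The relative version keeps working when the whole project folder is moved or shared; the absolute version does not.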

Common Error

One common error is the following

Common Error

If you get that error, your path is not correct: R cannot find the file at the location you gave it.

Go back to the previous steps and identify the path to your file.

# This opens a file dialog to select your file
file_path <- file.choose()
file_path

# Defining Paths
path_data <- "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/lecture6/data/"
# Use file.path() to construct full path
life_expectancy_df <- read.csv(file.path(path_data, "life-expectancy.csv"))
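
One way to catch a wrong path before reading is to test it with `file.exists()`. The helper `read_csv_checked()` below is hypothetical, written for this illustration; it is a sketch, not part of base R.

```r
# Hypothetical helper: read a csv only if the file is really there
read_csv_checked <- function(path) {
  if (!file.exists(path)) {
    # Report what R looked for and where it is currently working
    message("File not found: ", path)
    message("Current working directory: ", getwd())
    return(NULL)
  }
  read.csv(path)
}

# A file that does not exist: the helper prints the messages and returns NULL
missing_result <- read_csv_checked(file.path(tempdir(), "no-such-file.csv"))

# list.files() shows which files R can see in a folder
list.files(tempdir())
```

Comparing the reported working directory with the folder shown by `file.choose()` usually reveals which part of the path is wrong.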

Our Data

Let us now investigate our two datasets

Examining the First Entries

This is how we can examine the first five entries

head(life_expectancy_df, n=5)

You should see:

Our Data

Examining the First Entries

This is how we can examine the first five entries

head(urbanization_df, n=5)

You should see:
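
Alongside `head()`, a few other base R functions give a quick first look at a dataframe. The sketch below uses a small invented dataframe, `toy_df`, as a stand-in so that it runs without the downloaded files.

```r
# Invented stand-in data with 8 rows
toy_df <- data.frame(id = 1:8, value = seq(10, 80, by = 10))

# First five rows, as on the slide
first_five <- head(toy_df, n = 5)

# Last two rows
last_two <- tail(toy_df, n = 2)

# Number of rows and columns
dim(toy_df)

# Column names
names(toy_df)
```

The same calls work on `life_expectancy_df` and `urbanization_df` once they are loaded.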

Installing Packages

This is how you install a package in R

install.packages("tidyverse")

Notice that you need to have quotes: "tidyverse"

Once you are done:

delete install.packages("tidyverse")
or comment it out by adding “#”

# install.packages("tidyverse")

If you don’t delete it or comment it out, it runs again every time you render the notebook, which is slow and can cause rendering errors.

Once you install the package, it stays on your machine, so you only need to install it once.
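
A possible alternative to commenting the line out is to check whether the package is already installed. The helper `is_installed()` below is hypothetical and only wraps the base R function `requireNamespace()`.

```r
# Hypothetical helper: TRUE if a package is already installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this is TRUE on every installation
is_installed("stats")

# Pattern for a notebook: only report when the package is missing
if (!is_installed("tidyverse")) {
  message("tidyverse is not installed; run install.packages(\"tidyverse\") once in the console")
}
```

Because nothing is installed during rendering, this chunk is safe to leave in the notebook.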

Installing Packages

Errors

As we progress, you might run commands that result in:

Error in library(pillar) : there is no package called 'pillar'

If you see this error, first check the spelling of the package name. If the name is correct, the package is not installed yet, so install it:

install.packages("pillar")

Once the package is installed, comment it out or delete it

# install.packages("pillar")

Installing Packages

Loading Packages

To use the commands associated with the package, you need to load it

library("pillar")

You need to load the package again in every new R session that uses its functions

Our Data

Examining the First Entries using `glimpse`

We will examine our data using glimpse

library("pillar")
glimpse(life_expectancy_df)

Rows: 20,445
Columns: 4
$ Entity                                <chr> "Afghanistan", "Afghanistan", "A…
$ Code                                  <chr> "AFG", "AFG", "AFG", "AFG", "AFG…
$ Year                                  <int> 1950, 1951, 1952, 1953, 1954, 19…
$ Life.expectancy.at.birth..historical. <dbl> 27.7, 28.0, 28.4, 28.9, 29.2, 29…

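or, naming the package explicitly with ::

# pillar:: tells R which package glimpse() comes from, so library() is not required
pillar::glimpse(life_expectancy_df)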

Rows: 20,445
Columns: 4
$ Entity                                <chr> "Afghanistan", "Afghanistan", "A…
$ Code                                  <chr> "AFG", "AFG", "AFG", "AFG", "AFG…
$ Year                                  <int> 1950, 1951, 1952, 1953, 1954, 19…
$ Life.expectancy.at.birth..historical. <dbl> 27.7, 28.0, 28.4, 28.9, 29.2, 29…

We have four variables within our dataframe:

Entity: string or character variable
Code: string or character variable
Year: numeric or integer variable
Life.expectancy.at.birth..historical.: numeric or double precision variable
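
The variable types can also be checked with `class()`. The sketch below uses a stand-in dataframe, `example_df`, built from the first two rows shown in the output above, so it runs without the csv file.

```r
# Stand-in with the first two rows from the glimpse() output
example_df <- data.frame(
  Entity = c("Afghanistan", "Afghanistan"),
  Code = c("AFG", "AFG"),
  Year = c(1950L, 1951L),
  Life.expectancy.at.birth..historical. = c(27.7, 28.0),
  stringsAsFactors = FALSE
)

# Type of a single variable
class(example_df$Year)

# Type of every variable at once
column_classes <- sapply(example_df, class)
column_classes
```

Note that `class()` labels double precision columns as "numeric", where `glimpse()` shows `<dbl>`.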

Here is what each variable contains:

Entity: the country: “Afghanistan”, “Albania”, “Algeria”, etc.
Code: the country code: “AFG”, “ALB”, “DZA”, etc.
Year: year 1950, 1951, 1952, etc.
Life.expectancy.at.birth..historical.: life expectancy corresponding to that year
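
The dots in the last variable name most likely come from `read.csv()`, which by default turns spaces and brackets in column headers into dots. The sketch below assumes the original header was "Life expectancy at birth (historical)", which the slides do not confirm, and shows one way to give the variable a shorter name (`life_expectancy` is a name chosen here).

```r
# Assumed original header in the csv file (not confirmed by the slides)
original_header <- "Life expectancy at birth (historical)"

# make.names() applies the same cleaning rule that read.csv() uses by default
cleaned_header <- make.names(original_header)
cleaned_header

# Stand-in dataframe with the long variable name
example_df <- data.frame(
  Year = c(1950L, 1951L),
  Life.expectancy.at.birth..historical. = c(27.7, 28.0)
)

# Rename the long variable to something easier to type
names(example_df)[names(example_df) == cleaned_header] <- "life_expectancy"
names(example_df)
```

A shorter name makes later commands such as `mean(example_df$life_expectancy)` easier to read.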

We Learned a Few Things Today

Creating and Subsetting Dataframes
Loading External Files: .csv and .xlsx
Relative vs. Absolute Paths
Handling file or directory errors
Examining entries in a dataframe