L4: Histograms, Barplots, Lineplots

Bogdan G. Popescu

John Cabot University

Intro

Overview

Today, we’ll learn how to:

  • Produce histograms to explore distributions
  • Create bar plots from frequency tables
  • Generate line plots to examine trends over time

Creating a New Quarto Document

We will now create a new Quarto document for the week 4 lab.

Using R

Press Cmd + A (macOS) or Ctrl + A (Windows/Linux) to select everything, and then press Delete

Then type:

---
title: "Notebook"
author: "Your Name"
date: "July 26, 2025"
format:
  html:
    toc: true
    number-sections: true
    colorlinks: true
    smooth-scroll: true
    embed-resources: true
---

Opening the File

We first clear any objects left over from a previous session:

# Remove all objects from memory
rm(list = ls())

Opening a File

Download the following datasets from Dropbox:

To get the file path, we open a file dialog, go to the relevant folder, and select the file:

# This opens a file dialog to select your file
file_path <- file.choose()
file_path
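Once a file has been selected, base R's dirname() and basename() can split the result into a folder and a file name, so the folder can be reused for the second dataset. This is only a minimal sketch: the path below is a made-up placeholder standing in for whatever file.choose() returns on your machine.

```r
# Hypothetical path, standing in for the value returned by file.choose()
file_path <- "/home/student/week4/data/life-expectancy.csv"

# Folder that contains the selected file
path_data <- dirname(file_path)

# File name without the folder
file_name <- basename(file_path)

# Build the path to a second file stored in the same folder
other_file <- file.path(path_data, "share-of-population-urban.csv")

path_data
file_name
other_file
```

Storing the folder once and combining it with file.path() avoids typing the full path for every file.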

Once we have the path, we can now read the files:

# Defining Paths
path_data <- "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/essential_stats/week3/lab/data/"
# Use file.path() to construct full path
life_expectancy_df <- read.csv(file.path(path_data, "life-expectancy.csv"))
urbanization_df <- read.csv(file.path(path_data, "share-of-population-urban.csv"))

Please change the path to your computer’s path.

What We Want to Do

Explanation

These are our datasets

Doing it

Calculating Average by Country Code

This is what this looks like in code for life_expectancy_df2:

library(dplyr)
life_expectancy_df2<-life_expectancy_df%>%
  dplyr::group_by(Code)%>%
  dplyr::summarize(life_exp_mean=mean(Life.expectancy.at.birth..historical.))

This is what this looks like in code for urbanization_df2:

urbanization_df2<-urbanization_df%>%
  dplyr::group_by(Code)%>%
  dplyr::summarize(urb_mean=mean(Urban.population....of.total.population.))
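To see what group_by() and summarize() do, it can help to run them on a tiny data frame first. The sketch below uses made-up values (toy_df and its numbers are hypothetical, not taken from the datasets). It also adds na.rm = TRUE, which tells mean() to skip missing values; whether that is appropriate for the real data is a judgment call.

```r
library(dplyr)

# Made-up data: two countries, a few years each, one missing value
toy_df <- data.frame(
  Code = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  Year = c(2000, 2001, 2002, 2000, 2001),
  life_exp = c(70, 72, NA, 60, 62)
)

# One row per country code, holding the mean over all available years
toy_means <- toy_df %>%
  group_by(Code) %>%
  summarize(life_exp_mean = mean(life_exp, na.rm = TRUE))

toy_means
```

Without na.rm = TRUE, the mean for "AAA" would be NA because one of its values is missing.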

Cleaning the Data

Removing Countries with Strange Labels, Left Merge, Removing NA Values

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_life_expectancy_df<-subset(life_expectancy_df2, !(Code %in% weird_labels))
clean_urbanization_df<-subset(urbanization_df2, !(Code %in% weird_labels))

We will now perform a left join to combine urbanization data with life expectancy data based on Code.

merged_data<-left_join(clean_life_expectancy_df, clean_urbanization_df, by = c("Code"="Code"))

This is how we remove rows with NA values:

merged_data2<-na.omit(merged_data)
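The effect of the left join and of na.omit() is easier to see on a small example. The sketch below uses made-up country codes and values (toy_life and toy_urb are hypothetical); it is meant only to show which rows survive each step.

```r
library(dplyr)

# Made-up summaries: "CCC" has no urbanization value, "DDD" has no life expectancy
toy_life <- data.frame(Code = c("AAA", "BBB", "CCC"),
                       life_exp_mean = c(71, 61, 55))
toy_urb <- data.frame(Code = c("AAA", "BBB", "DDD"),
                      urb_mean = c(80, 40, 30))

# Left join: every row of toy_life is kept; unmatched rows get NA
toy_merged <- left_join(toy_life, toy_urb, by = "Code")
toy_merged

# na.omit() drops any row that contains at least one NA
toy_complete <- na.omit(toy_merged)
toy_complete
```

"DDD" never appears because it is missing from the left table, and "CCC" is dropped by na.omit() because its urbanization value is NA.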

Histograms

What Is a Histogram?

A histogram is a type of bar chart that shows the distribution of numerical data.

It breaks the range of values into intervals (called bins) and counts how many values fall into each bin.

It helps us answer:

  • What does the data look like?
  • Where are most values concentrated?

Histograms

Example – Student Test Scores

SAT Score Range    Number of Students
400–800            1
800–1200           4
1200–1600          5
1600–2000          3
2000–2400          2
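The binning behind a table like this can be sketched in base R with cut() and table(). The individual scores below are made up, chosen only so that the counts match the table; the choice that a score sitting exactly on a boundary (for example 800) goes into the higher bin is also an assumption.

```r
# Made-up scores, chosen only so that the counts match the table above
scores <- c(650, 900, 1000, 1100, 1150, 1250, 1300, 1400, 1500, 1550,
            1700, 1800, 1900, 2100, 2300)

# Bin edges: 400, 800, ..., 2400
breaks <- seq(400, 2400, by = 400)

# Assign each score to a bin; right = FALSE makes bins closed on the left,
# so 800 would fall into the 800 to 1200 bin
bins <- cut(scores, breaks = breaks, right = FALSE, dig.lab = 4)

# Count how many scores fall into each bin
bin_counts <- table(bins)
bin_counts
```

A histogram draws one bar per bin, with the bar height equal to these counts.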

Creating a Bar Plot from a Frequency Table

Step1: Rounding the values

#Step1: Rounding the values
merged_data2$life_exp_mean_rounded<-round(merged_data2$life_exp_mean, 0)
merged_data2$urb_mean_rounded<-round(merged_data2$urb_mean, 0)
head(merged_data2, n=5)

Creating a Bar Plot from a Frequency Table

Step2: Creating a frequency table

#Step2: Creating a frequency table
freq_table<-table(merged_data2$life_exp_mean_rounded)
freq_table

38 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 
 1  2  3  4  4  5  5  2  2  4 11  6  5  6  4  5  6  6 10 10  6  6 12 11  5 11 
68 69 70 71 72 73 74 75 77 
 9 11 12 12  4  4  1  6  3 
#Step3: Turning the table into a dataframe
freq_table<-data.frame(freq_table)

#Step4:Identifying how to extract the variable names
names(freq_table)
[1] "Var1" "Freq"

#Step5: Providing more intuitive names
names(freq_table)[1]<-"life_exp_mean_rounded"
names(freq_table)[2]<-"frequency"
#Step6: Turning factor variables into numeric
freq_table$life_exp_mean_rounded<-as.numeric(as.character(freq_table$life_exp_mean_rounded))
str(freq_table)
'data.frame':   35 obs. of  2 variables:
 $ life_exp_mean_rounded: num  38 43 44 45 46 47 48 49 50 51 ...
 $ frequency            : int  1 2 3 4 4 5 5 2 2 4 ...
names(freq_table)
[1] "life_exp_mean_rounded" "frequency"            
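The same frequency table can also be built in one step with dplyr's count(). This is an alternative sketch rather than the approach used in the steps above, and toy_df with its values is made up for illustration.

```r
library(dplyr)

# Made-up rounded values standing in for life_exp_mean_rounded
toy_df <- data.frame(life_exp_mean_rounded = c(70, 71, 70, 65, 71, 70))

# count() returns one row per distinct value plus a column of counts;
# name = "frequency" sets the name of the count column
toy_freq <- toy_df %>%
  count(life_exp_mean_rounded, name = "frequency")

toy_freq
str(toy_freq)
```

Because count() works on the data frame directly, the value column stays numeric, so the factor-to-numeric conversion is not needed here.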

Inspecting the new dataframe

freq_table

Creating a Bar Plot from a Frequency Table

Step3: Creating the Barplot

library(ggplot2)
#Step7: Creating the barplot
fig5<-ggplot(data = freq_table, aes(x=life_exp_mean_rounded, y=frequency))+
  geom_bar(stat="identity")+
  theme_bw()

fig5
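ggplot2 also provides geom_col(), which draws bars whose heights come straight from the y variable, so it can stand in for geom_bar(stat="identity"). A minimal sketch on a made-up frequency table (toy_freq is hypothetical):

```r
library(ggplot2)

# Made-up frequency table with the same column names as freq_table
toy_freq <- data.frame(
  life_exp_mean_rounded = c(65, 70, 71),
  frequency = c(1, 3, 2)
)

# geom_col() uses the y values as bar heights, with no counting step
fig_col <- ggplot(data = toy_freq,
                  aes(x = life_exp_mean_rounded, y = frequency)) +
  geom_col() +
  theme_bw()

fig_col
```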

Creating a Bar Plot from a Frequency Table

Step4: Creating a Histogram using geom_histogram

fig5_b<-ggplot(data = merged_data2, aes(x=life_exp_mean_rounded))+
  geom_histogram()+
  theme_bw()
fig5_b

This is how we can control the number of bins (here, 50):

fig5_b<-ggplot(data = merged_data2, aes(x=life_exp_mean_rounded))+
  geom_histogram(bins = 50, col="white")+
  theme_bw()
fig5_b

This is how we can control the number of bins (here, 35):

fig5_b<-ggplot(data = merged_data2, aes(x=life_exp_mean_rounded))+
  geom_histogram(bins = 35, col="white")+
  theme_bw()
fig5_b
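Instead of fixing the number of bins, geom_histogram() can also be given the width of each bin through binwidth. The sketch below uses made-up values (toy_df is hypothetical) and a width of 5 years, chosen only for illustration.

```r
library(ggplot2)

# Made-up values standing in for life_exp_mean_rounded
toy_df <- data.frame(life_exp_mean_rounded = c(60, 61, 61, 63, 65, 65, 65, 70))

# binwidth = 5 asks for bins that are 5 years wide,
# instead of asking for a fixed number of bins
fig_binwidth <- ggplot(data = toy_df, aes(x = life_exp_mean_rounded)) +
  geom_histogram(binwidth = 5, col = "white") +
  theme_bw()

fig_binwidth
```

Setting the width is often easier to interpret than the number of bins, because each bar then covers a known range of years.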

Creating a Bar Plot from a Frequency Table

This is how we put them side by side

library(gridExtra)
grid.arrange(fig5, fig5_b, ncol=2)

Creating a line plot

We can create a line plot using our original data

head(life_expectancy_df, n=6)

Creating a line plot

Renaming variable

The life expectancy column (the fourth one) has a long name, so we give it a shorter one:

#Renaming variable
names(life_expectancy_df)[4]<-"life_exp_yearly"
head(life_expectancy_df, n=6)

Creating a line plot

Making a Lineplot

fig7<-ggplot(data = life_expectancy_df, aes(x = Year, y = life_exp_yearly))+
  geom_line(aes(color = Entity))+
  theme_bw()+
  guides(color = "none")
fig7

Creating a line plot

Subsetting the sample to fewer countries

countries_of_interest<-c("United States", "United Kingdom")
df_us_uk<-subset(life_expectancy_df, Entity %in% countries_of_interest)
head(df_us_uk, n=10)

fig8<-ggplot(data = df_us_uk, aes(x = Year, y = life_exp_yearly))+
  geom_line(aes(color = Entity))+
  theme_bw()

fig8

Creating a line plot

Subsetting the sample to fewer years

It might be a good idea to restrict our sample to the period after 1900.

df_us_uk_after1900<-subset(df_us_uk, Year>1900)
head(df_us_uk_after1900, n=10)

Creating a line plot

Making a Lineplot

fig9<-ggplot(data = df_us_uk_after1900, aes(x = Year, y = life_exp_yearly))+
  geom_line(aes(color = Entity))+
  theme_bw()

fig9

This is how we change the colors

fig9<-ggplot(data = df_us_uk_after1900, aes(x = Year, y = life_exp_yearly))+
  geom_line(aes(color = Entity))+
  theme_bw()+
  scale_color_manual(values=c('Red','Blue'))

fig9
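With an unnamed vector, the colors are matched to the countries by position, which for a character column typically means alphabetical order. A named vector makes the mapping explicit. The sketch below uses made-up data in the same shape as df_us_uk_after1900 (toy_df and country_colors are hypothetical names).

```r
library(ggplot2)

# Made-up data in the same shape as df_us_uk_after1900
toy_df <- data.frame(
  Entity = rep(c("United Kingdom", "United States"), each = 3),
  Year = rep(c(1950, 1960, 1970), times = 2),
  life_exp_yearly = c(68, 71, 72, 68, 70, 71)
)

# Named vector: each color is tied to a country by name, not by position
country_colors <- c("United States" = "red", "United Kingdom" = "blue")

fig_colors <- ggplot(data = toy_df, aes(x = Year, y = life_exp_yearly)) +
  geom_line(aes(color = Entity)) +
  theme_bw() +
  scale_color_manual(values = country_colors)

fig_colors
```

With names, the colors stay attached to the right country even if the order of the countries changes.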

What do you think caused the dip in life expectancy?

Exploring Data

What is the year with the lowest life expectancy for the US?

#Creating a new df with the US
df_us_after1900<-subset(df_us_uk_after1900, Entity=="United States")
df_us_after1900$Year[df_us_after1900$life_exp_yearly==min(df_us_after1900$life_exp_yearly)]
[1] 1918
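An alternative sketch uses which.min(), which returns the position of the smallest value. The numbers in toy_df are made up and are not the actual US figures.

```r
# Made-up yearly values for a single country
toy_df <- data.frame(
  Year = c(1916, 1917, 1918, 1919),
  life_exp_yearly = c(54, 53, 47, 55)
)

# which.min() returns the row position of the smallest value
row_of_min <- which.min(toy_df$life_exp_yearly)

# Use that position to look up the year
toy_df$Year[row_of_min]
```

If several years tie for the minimum, which.min() returns only the first one, whereas the comparison with min() returns all of them.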

Exploring Data

What is the year with the second lowest life expectancy for the US?

#Arranging and creating a new dataframe (sorted from lowest to highest)
df <- df_us_after1900 %>% arrange(life_exp_yearly)
#Selecting the second lowest
second_lowest_life_expectancy <- df$life_exp_yearly[2]
#Selecting the year with the second lowest
df$Year[df$life_exp_yearly==second_lowest_life_expectancy]
[1] 1901
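The same question can be sketched with dplyr's slice_min(), which keeps the rows with the lowest values. The data below are made up for illustration; the explicit arrange() is added so that the row order does not depend on how slice_min() returns its result.

```r
library(dplyr)

# Made-up yearly values for a single country
toy_df <- data.frame(
  Year = c(1901, 1902, 1918, 1919),
  life_exp_yearly = c(49, 51, 47, 55)
)

# Keep the two rows with the lowest values, then sort them from lowest upward
two_lowest <- toy_df %>%
  slice_min(life_exp_yearly, n = 2) %>%
  arrange(life_exp_yearly)

two_lowest

# The second row holds the second lowest value
two_lowest$Year[2]
```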

Exploring Data

What is the year with the lowest life expectancy for the UK?

#Creating a new df with the UK
df_uk_after1900<-subset(df_us_uk_after1900, Entity=="United Kingdom")
df_uk_after1900$Year[df_uk_after1900$life_exp_yearly==min(df_uk_after1900$life_exp_yearly)]
[1] 1901

Conclusion

What Did We Learn Today?

  • How to import and clean real-world data
  • The difference between histograms, bar plots, and line plots
  • How to use ggplot2 for data visualization
  • How to explore patterns and trends in life expectancy and urbanization