Visualizing Data Distributions in R
Histograms, Bar Plots, and the Logic of Central Tendency
Bogdan G. Popescu
John Cabot University
Intro
Overview
Today, we’ll learn how to:
Produce histograms to explore distributions
Create bar plots from frequency tables
Generate line plots to examine trends over time
Using R
Press Cmd + A (macOS) or Ctrl + A (Windows/Linux) to select everything, then press Delete.
Then type:
---
title: "Notebook"
author: "Your Name"
date: "July 26, 2025"
format:
  html:
    toc: true
    number-sections: true
    colorlinks: true
    smooth-scroll: true
    embed-resources: true
---
Opening the File
We first clear any objects left over from a previous session.
# Remove all objects from memory
rm(list = ls())
Opening a File
Download the following datasets from Dropbox:
Place them in your working directory or folder.
Opening the File
To get the file path, we can browse to the file interactively:
# This opens a file dialog to select your file
file_path <- file.choose()
file_path
Opening the File
Once we have the path, we can read the files:
# Define the folder that holds the data
file_path <- "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/lecture8/data/"
# Use file.path() to construct the full path to each file
life_expectancy_df <- read.csv(file.path(file_path, "life-expectancy.csv"))
urbanization_df <- read.csv(file.path(file_path, "share-of-population-urban.csv"))
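As a small illustration of what `file.path()` does, the sketch below joins a folder and a file name. The folder name `data_dir` is a hypothetical placeholder, not the course path.

```r
# Hypothetical folder name, used only for illustration
data_dir <- "my_project/data"

# file.path() joins the pieces with a forward slash
full_path <- file.path(data_dir, "life-expectancy.csv")
full_path

# file.exists() is a quick check before calling read.csv()
file.exists(full_path)
```

Building paths this way avoids typing separators by hand and makes it easy to reuse the same folder for several files.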
What We Want to Do
Explanation
These are our datasets
Doing it
Calculating Average by Country Code
This is what this looks like in code for life_expectancy_df2:
library(dplyr)
life_expectancy_df2 <- life_expectancy_df %>%
  dplyr::group_by(Code) %>%
  dplyr::summarize(life_exp_mean = mean(Life.expectancy.at.birth..historical.))
This is what this looks like in code for urbanization_df2:
urbanization_df2 <- urbanization_df %>%
  dplyr::group_by(Code) %>%
  dplyr::summarize(urb_mean = mean(Urban.population....of.total.population.))
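To see what grouping and summarizing does, here is a minimal sketch on a toy data frame. The values and the names `toy_df`, `value`, and `toy_means` are invented for illustration and are not part of the course data.

```r
library(dplyr)

# Toy data (invented values): two country codes, two observations each
toy_df <- data.frame(
  Code = c("AAA", "AAA", "BBB", "BBB"),
  value = c(60, 70, 50, 54)
)

# One row per Code, holding the mean of `value` within that Code
toy_means <- toy_df %>%
  group_by(Code) %>%
  summarize(value_mean = mean(value))

toy_means
```

The result has one row per country code, which is the shape we need before merging the two datasets.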
Cleaning the Data
Removing Countries with Strange Labels, Left Merge, Removing NA Values
weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_life_expectancy_df <- subset(life_expectancy_df2, !(Code %in% weird_labels))
clean_urbanization_df <- subset(urbanization_df2, !(Code %in% weird_labels))
We now perform a left join that adds the urbanization data to the life expectancy data, matching rows on Code.
merged_data <- left_join(clean_life_expectancy_df, clean_urbanization_df, by = "Code")
This is how we remove rows with NA values:
merged_data2 <- na.omit(merged_data)
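The sketch below shows, on invented toy tables, how a left join keeps every row of the left-hand table and how `na.omit()` then drops the rows that found no match. The names `left_df`, `right_df`, `life`, and `urb` are hypothetical.

```r
library(dplyr)

# Toy tables (invented values); "CCC" has no match in the right-hand table
left_df <- data.frame(Code = c("AAA", "BBB", "CCC"), life = c(65, 52, 71))
right_df <- data.frame(Code = c("AAA", "BBB"), urb = c(40, 55))

# left_join() keeps every row of left_df; unmatched rows get NA in urb
joined <- left_join(left_df, right_df, by = "Code")

# na.omit() drops any row that contains at least one NA
joined_complete <- na.omit(joined)

joined
joined_complete
```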
Histograms
What Is a Histogram?
A histogram is a type of bar chart that shows the distribution of numerical data.
It breaks the range of values into intervals (called bins) and counts how many values fall into each bin.
It helps us answer:
What does the data look like?
Where are most values concentrated?
Histograms
Example – Student Test Scores
Score range    Count
400–800        1
800–1200       4
1200–1600      5
1600–2000      3
2000–2400      2
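The binning-and-counting idea can be sketched in base R with `cut()` and `table()`. The individual scores below are hypothetical, invented only so that the counts match the table above, and the choice to treat each bin as including its lower edge is an assumption.

```r
# Hypothetical scores, invented so that the counts match the table above
scores <- c(650,
            850, 900, 1000, 1150,
            1250, 1300, 1400, 1500, 1550,
            1650, 1800, 1950,
            2100, 2300)

# Bin edges every 400 points from 400 to 2400
edges <- seq(400, 2400, by = 400)

# cut() assigns each score to a bin; right = FALSE makes bins [a, b)
bins <- cut(scores, breaks = edges, right = FALSE, dig.lab = 4)

# table() counts how many scores fall into each bin
bin_counts <- table(bins)
bin_counts
```

A histogram draws one bar per bin, with the bar height equal to these counts.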
Creating a Bar Plot from a Frequency Table
Step1: Rounding the values
# Step 1: Rounding the values
merged_data2$life_exp_mean_rounded <- round(merged_data2$life_exp_mean, 0)
merged_data2$urb_mean_rounded <- round(merged_data2$urb_mean, 0)
head(merged_data2, n = 5)
Creating a Bar Plot from a Frequency Table
Step2: Creating a frequency table
# Step 2: Creating a frequency table
freq_table <- table(merged_data2$life_exp_mean_rounded)
freq_table
38 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
1 2 3 4 4 5 5 2 2 4 11 6 5 6 4 5 6 6 10 10 6 6 12 11 5 11
68 69 70 71 72 73 74 75 77
9 11 12 12 4 4 1 6 3
# Step 3: Turning the table into a data frame
freq_table <- data.frame(freq_table)
# Step 4: Inspecting the names
names(freq_table)
Creating a Bar Plot from a Frequency Table
Step2: Creating a frequency table
# Step 5: Providing more intuitive names
names(freq_table)[1] <- "life_exp_mean_rounded"
names(freq_table)[2] <- "frequency"
# Step 6: Turning factor variables into numeric
freq_table$life_exp_mean_rounded <- as.numeric(as.character(freq_table$life_exp_mean_rounded))
'data.frame': 35 obs. of 2 variables:
$ life_exp_mean_rounded: num 38 43 44 45 46 47 48 49 50 51 ...
$ frequency : int 1 2 3 4 4 5 5 2 2 4 ...
[1] "life_exp_mean_rounded" "frequency"
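The conversion in Step 6 goes through `as.character()` for a reason: calling `as.numeric()` directly on a factor returns its internal level codes, not the original numbers. A minimal sketch with invented toy values (the names `toy`, `toy_freq`, `codes`, and `values` are hypothetical):

```r
# Toy rounded values (invented)
toy <- data.frame(rounded = c(70, 71, 71, 75))

# data.frame(table(...)) stores the distinct values as a factor
toy_freq <- data.frame(table(toy$rounded))
names(toy_freq) <- c("rounded", "frequency")

# as.numeric() on a factor returns the internal level codes
codes <- as.numeric(toy_freq$rounded)

# Converting to character first recovers the original numbers
values <- as.numeric(as.character(toy_freq$rounded))

codes
values
```

Here `codes` holds 1, 2, 3 while `values` holds 70, 71, 75, which is what we want on the x-axis.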
Creating a Bar Plot from a Frequency Table
Step2: Creating a frequency table
Inspecting the new dataframe
Creating a Bar Plot from a Frequency Table
Step3: Creating the Barplot
library(ggplot2)
# Step 7: Creating the bar plot
fig5 <- ggplot(data = freq_table, aes(x = life_exp_mean_rounded, y = frequency)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  coord_cartesian(xlim = c(min(merged_data2$life_exp_mean) - 5,
                           max(merged_data2$life_exp_mean) + 5),
                  ylim = c(0, 20))
fig5
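As an aside, ggplot2 also offers `geom_col()`, which draws bars whose heights come directly from the `y` aesthetic and so behaves like `geom_bar(stat = "identity")`. The sketch below uses an invented toy frequency table, not the course data.

```r
library(ggplot2)

# Toy frequency table (invented values)
toy_freq <- data.frame(value = c(70, 71, 72), frequency = c(2, 5, 3))

# geom_col() draws bars whose heights come directly from y
toy_bar <- ggplot(toy_freq, aes(x = value, y = frequency)) +
  geom_col() +
  theme_bw()

toy_bar
```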
Creating a Bar Plot from a Frequency Table
Step4: Creating a Histogram using geom_histogram
fig5_b <- ggplot(data = merged_data2, aes(x = life_exp_mean)) +
  geom_histogram() +
  theme_bw() +
  coord_cartesian(xlim = c(min(merged_data2$life_exp_mean) - 5,
                           max(merged_data2$life_exp_mean) + 5),
                  ylim = c(0, 20))
fig5_b
Creating a Bar Plot from a Frequency Table
Step4: Creating a Histogram using geom_histogram
We can control the number of bins with the bins argument. Here we use 50 bins:
fig5_b <- ggplot(data = merged_data2, aes(x = life_exp_mean)) +
  geom_histogram(bins = 50, col = "white") +
  theme_bw() +
  coord_cartesian(xlim = c(min(merged_data2$life_exp_mean) - 5,
                           max(merged_data2$life_exp_mean) + 5),
                  ylim = c(0, 20))
fig5_b
Creating a Bar Plot from a Frequency Table
Step4: Creating a Histogram using geom_histogram
Here we use 35 bins instead:
fig5_b <- ggplot(data = merged_data2, aes(x = life_exp_mean)) +
  geom_histogram(bins = 35, col = "white") +
  theme_bw() +
  coord_cartesian(xlim = c(min(merged_data2$life_exp_mean) - 5,
                           max(merged_data2$life_exp_mean) + 5),
                  ylim = c(0, 20))
fig5_b
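Besides `bins`, `geom_histogram()` also accepts `binwidth`, which fixes how wide each bar is rather than how many bars there are. The sketch below contrasts the two on simulated toy data; the data and the names `toy_df`, `p_bins`, and `p_width` are invented for illustration.

```r
library(ggplot2)

# Toy data (invented): 200 draws from a normal distribution
set.seed(1)
toy_df <- data.frame(x = rnorm(200, mean = 65, sd = 8))

# bins fixes how many bars are drawn
p_bins <- ggplot(toy_df, aes(x = x)) +
  geom_histogram(bins = 20, col = "white") +
  theme_bw()

# binwidth fixes how wide each bar is, in units of x
p_width <- ggplot(toy_df, aes(x = x)) +
  geom_histogram(binwidth = 2, col = "white") +
  theme_bw()

p_bins
p_width
```

Setting `binwidth` can be easier to interpret when the variable has natural units, such as years of life expectancy.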
Creating a Bar Plot from a Frequency Table
This is how we put them side by side
library(gridExtra)
grid.arrange(fig5, fig5_b, ncol = 2)
Mapping the Measures of Central Tendency
Calculating Mean
The mean describes the average value of a variable.
It is calculated as:
\[
\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x_i}{n}
\]
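The formula can be checked by hand on a few invented numbers. This sketch computes the sum divided by the count and compares it with R's built-in `mean()`.

```r
# Toy values (invented)
x <- c(60, 70, 80, 90)

# Sum of the values divided by how many there are
manual_mean <- sum(x) / length(x)

# Built-in equivalent
builtin_mean <- mean(x)

manual_mean
builtin_mean
```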
Mapping the Measures of Central Tendency
Calculating Mean
In our case, we can calculate the mean of all the life expectancy values:
# Calculate the mean
life_exp_mean <- mean(merged_data2$life_exp_mean, na.rm = TRUE)
life_exp_mean
# Label and vertical position for the annotation
mean_label <- paste("Mean (x̄): \n ", round(life_exp_mean, 2))
y_coord <- 17
fig5_b <- ggplot(data = merged_data2, aes(x = life_exp_mean)) +
  geom_histogram(bins = 35, col = "white") +
  geom_vline(xintercept = life_exp_mean, linetype = "dashed", col = "red") +
  annotate(geom = "text",
           x = life_exp_mean - 2,
           y = y_coord,
           label = mean_label,
           color = "red") +
  theme_bw() +
  coord_cartesian(xlim = c(min(merged_data2$life_exp_mean) - 5,
                           max(merged_data2$life_exp_mean) + 5),
                  ylim = c(0, 20))
fig5_b
Mapping the Measures of Central Tendency
The median is the middle value in a dataset ordered from smallest to largest.
It splits the dataset into two equal halves.
If the number of observations is odd:
Median = middle number
Example:
Data = [1, 3, 7] → Median = 3
If the number of observations is even:
Median = average of the two middle values
Example:
Data = [1, 3, 5, 7] → Median = (3 + 5) / 2 = 4
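Both cases can be reproduced in R. The sketch below uses the same two small datasets and also spells out the even case by hand; the variable names are hypothetical.

```r
# The two examples from the text
odd_data <- c(7, 1, 3)       # unsorted on purpose; median() sorts internally
even_data <- c(1, 3, 5, 7)

median_odd <- median(odd_data)
median_even <- median(even_data)

# Manual version of the even case: average the two middle values
sorted <- sort(even_data)
n <- length(sorted)
manual_median <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

median_odd
median_even
manual_median
```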
Mapping the Measures of Central Tendency
In our case, we can calculate the median of all the life expectancy values:
# Calculate the median
life_exp_median <- median(merged_data2$life_exp_mean)
life_exp_median
mean_label <- paste("Mean (x̄): \n ", round(life_exp_mean, 2))
median_label <- paste("Median: \n ", round(life_exp_median, 2))
y_coord <- 17
fig5_b <- ggplot(data = merged_data2, aes(x = life_exp_mean)) +
  geom_histogram(bins = 35, col = "white") +
  geom_vline(xintercept = life_exp_mean, linetype = "dashed", color = "red") +
  geom_vline(xintercept = life_exp_median, linetype = "dashed", color = "blue") +
  annotate("text", x = life_exp_mean - 2, y = y_coord, label = mean_label, color = "red") +
  annotate("text", x = life_exp_median + 2, y = y_coord, label = median_label, color = "blue") +
  theme_bw() +
  coord_cartesian(xlim = c(min(merged_data2$life_exp_mean) - 5,
                           max(merged_data2$life_exp_mean) + 5),
                  ylim = c(0, 20))
fig5_b
Mapping the Measures of Central Tendency
Another way to do this is through a boxplot
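The slides do not include the boxplot code, so the following is only a sketch of one way to draw it with ggplot2, using invented toy values rather than the course data. The names `toy_df`, `life_exp`, `toy_box`, and `toy_quartiles` are hypothetical.

```r
library(ggplot2)

# Toy data (invented values)
toy_df <- data.frame(life_exp = c(45, 52, 58, 61, 64, 66, 69, 71, 74, 77))

# The box spans the middle half of the data; the line inside marks the median
toy_box <- ggplot(toy_df, aes(x = life_exp)) +
  geom_boxplot() +
  theme_bw()

# The quartiles that summarize the same information numerically
toy_quartiles <- quantile(toy_df$life_exp, probs = c(0.25, 0.5, 0.75))

toy_box
toy_quartiles
```

With the course data, the same idea would presumably map `life_exp_mean` from `merged_data2` to the x aesthetic.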
Measures of Dispersion
Common Measures
Standard Deviation:
\[s = \sqrt{s^2}\] where \(s^2\) is the variance (the standard deviation gives the typical distance of the values from the mean)
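To make the formula concrete, the sketch below computes the sample variance and standard deviation by hand on invented numbers and compares them with R's `var()` and `sd()`, which divide by n - 1.

```r
# Toy values (invented)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample variance: squared deviations from the mean, divided by n - 1
x_bar <- mean(x)
s2 <- sum((x - x_bar)^2) / (length(x) - 1)

# Standard deviation: square root of the variance
s <- sqrt(s2)

s2
s
sd(x)
```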
Mapping the Measures of Dispersion
Standard Deviation
# Calculating the standard deviation
life_exp_sd <- sd(merged_data2$life_exp_mean)
# Tick positions at the mean and at 1, 2, and 3 standard deviations on either side
sigma_breaks <- life_exp_mean + (-3:3) * life_exp_sd
# Labels that replace the numeric x-axis ticks
sigma_labels <- c("-3s", "-2s", "-1s", "x̄", "+1s", "+2s", "+3s")
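The expression for `sigma_breaks` relies on R's vector arithmetic: `-3:3` is the sequence from -3 to 3, and multiplying it by the standard deviation and adding the mean gives seven tick positions. A sketch with invented round numbers (`toy_mean` and `toy_sd` are hypothetical):

```r
# Invented round numbers to make the arithmetic easy to follow
toy_mean <- 60
toy_sd <- 10

# -3:3 expands to -3, -2, -1, 0, 1, 2, 3
toy_breaks <- toy_mean + (-3:3) * toy_sd

toy_breaks
```

The middle position equals the mean itself, which is why the middle label is x̄.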
Mapping the Measures of Dispersion
Standard Deviation
This is how we create the graph.
median_label <- paste("Median: \n ", round(life_exp_median, 2))
y_coord <- 17
# Histogram with SD-based axis
fig5_b <- ggplot(data = merged_data2, aes(x = life_exp_mean)) +
  geom_histogram(bins = 35, col = "white") +
  geom_vline(xintercept = life_exp_mean, linetype = "dashed", color = "red") +
  # geom_vline(xintercept = median(life_expectancy_df$Life.expectancy.at.birth..historical., na.rm = TRUE), linetype = "dashed", color = "blue") +
  annotate("text", x = life_exp_mean - 2, y = y_coord, label = mean_label, color = "red") +
  # annotate("text", x = life_exp_median + 2, y = y_coord, label = median_label, color = "blue") +
  scale_x_continuous(breaks = sigma_breaks, labels = sigma_labels) +
  theme_bw() +
  coord_cartesian(xlim = c(min(merged_data2$life_exp_mean) - 5,
                           max(merged_data2$life_exp_mean) + 5),
                  ylim = c(0, 20))
fig5_b
Mapping the Measures of Dispersion
Standard Deviation
This is how we can visualize the standard deviations.
Conclusion
Histograms show the shape of a continuous variable’s distribution
Bar plots visualize frequencies of discrete or grouped values
Mean and median describe central tendency
A big gap between them often signals skew
Standard deviation shows how spread out the values are
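The point about skew can be illustrated with two invented datasets: one symmetric, one with a single large value that pulls the mean upward while the median stays put.

```r
# Invented toy data
symmetric <- c(1, 2, 3, 4, 5)
skewed <- c(1, 2, 3, 4, 100)   # one extreme value on the right

# Gap between mean and median in each case
gap_symmetric <- mean(symmetric) - median(symmetric)
gap_skewed <- mean(skewed) - median(skewed)

gap_symmetric
gap_skewed
```

The gap is zero for the symmetric data and large and positive for the right-skewed data.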