Visualizing Data Distributions in R

Histograms, Bar Plots, and the Logic of Central Tendency

Bogdan G. Popescu

bogdan.popescu@johncabot.edu

John Cabot University

Intro

Overview

Today, we’ll learn how to:

Produce histograms to explore distributions
Create bar plots from frequency tables
Generate line plots to examine trends over time

Using R

Select everything in the editor (Cmd + A on macOS, Ctrl + A on Windows) and press Delete.

Using R

Then type:

---
title: "Notebook"
author: "Your Name"
date: "July 26, 2025"
format:
  html:
    toc: true
    number-sections: true
    colorlinks: true
    smooth-scroll: true
    embed-resources: true
---

Using R

Opening the File

We first clear any objects left over from a previous session:

# Remove all objects from memory
rm(list = ls())

Opening a File

Download the following datasets from Dropbox:

Place them in your working directory or folder.

Opening the File

To get the file path we simply go to the relevant folder

# This opens a file dialog to select your file
file_path <- file.choose()
file_path

Opening the File

Once we have the path, we can now read the files:

# Folder that holds the downloaded files (replace with your own path)
file_path <- "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/lecture8/data/"
# file.path() joins the folder and the file name into a full path
life_expectancy_df <- read.csv(file.path(file_path, "life-expectancy.csv"))
urbanization_df <- read.csv(file.path(file_path, "share-of-population-urban.csv"))

What We Want to Do

Explanation

These are our datasets

Doing it

Calculating Average by Country Code

This is what this looks like in code for life_expectancy_df2:

library(dplyr)
life_expectancy_df2 <- life_expectancy_df%>%
  dplyr::group_by(Code)%>%
  dplyr::summarize(life_exp_mean=mean(Life.expectancy.at.birth..historical.))

This is what this looks like in code for urbanization_df2:

urbanization_df2 <- urbanization_df%>%
  dplyr::group_by(Code)%>%
  dplyr::summarize(urb_mean=mean(Urban.population....of.total.population.))
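
To see what group_by() and summarize() do, here is a minimal sketch on a small made-up data frame. The names toy_df, value, and value_mean and all the numbers are assumptions for illustration, not the course data.

```r
library(dplyr)

# Hypothetical toy data: two country codes with several yearly values each
toy_df <- data.frame(
  Code  = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  value = c(60, 62, 64, 70, 74)
)

# One row per Code, holding the mean of that code's values
toy_means <- toy_df %>%
  group_by(Code) %>%
  summarize(value_mean = mean(value))

toy_means
```

Note that mean() returns NA for any group that contains a missing value unless na.rm = TRUE is passed.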

Cleaning the Data

Removing Countries with Strange Labels, Left Merge, Removing NA Values

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_life_expectancy_df<-subset(life_expectancy_df2, !(Code %in% weird_labels))
clean_urbanization_df<-subset(urbanization_df2, !(Code %in% weird_labels))

We will now perform a left join to combine urbanization data with life expectancy data based on Code.

merged_data <- left_join(clean_life_expectancy_df, clean_urbanization_df, by = c("Code"="Code"))

This is how we remove NA values

merged_data2<-na.omit(merged_data)
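
The join and the NA removal can be checked on a toy example. This is a sketch with hypothetical tables (left_df, right_df, and their values are made up): a left join keeps every row of the first table, fills unmatched rows with NA, and na.omit() then drops those rows.

```r
library(dplyr)

# Hypothetical toy tables keyed by Code
left_df  <- data.frame(Code = c("AAA", "BBB", "CCC"),
                       life_exp_mean = c(62, 72, 55))
right_df <- data.frame(Code = c("AAA", "BBB"),
                       urb_mean = c(40, 80))

# Keep every row of left_df; codes missing from right_df get NA in urb_mean
joined <- left_join(left_df, right_df, by = "Code")

# Drop every row that contains at least one NA
complete <- na.omit(joined)

nrow(joined)
nrow(complete)
```

Comparing the two row counts is a quick way to see how many countries were lost because they had no match.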

Histograms

What Is a Histogram?

A histogram is a type of bar chart that shows the distribution of numerical data.

It breaks the range of values into intervals (called bins) and counts how many values fall into each bin.

It helps us answer:

What does the data look like?
Where are most values concentrated?

Histograms

Example – Student Test Scores

SAT Score Range	Number of Students
400–800	1
800–1200	4
1200–1600	5
1600–2000	3
2000–2400	2
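
The table above can be reproduced from raw scores. The sketch below uses 15 made-up scores, chosen only so that the counts match the table; cut() assigns each score to a bin and table() counts the scores per bin. The names scores, edges, and sat_df are assumptions.

```r
library(ggplot2)

# Hypothetical raw scores (made up so the bin counts match the table)
scores <- c(650, 900, 1000, 1100, 1150, 1250, 1300, 1400, 1500, 1550,
            1700, 1800, 1900, 2100, 2300)

# Bin edges every 400 points, from 400 to 2400
edges <- seq(400, 2400, by = 400)

# cut() labels each score with its interval; table() counts per interval
bin_counts <- table(cut(scores, breaks = edges))
bin_counts

# Turn the counts into a data frame and draw one bar per bin
sat_df <- as.data.frame(bin_counts)
fig_sat <- ggplot(sat_df, aes(x = Var1, y = Freq)) +
  geom_col() +
  labs(x = "SAT Score Range", y = "Number of Students") +
  theme_bw()
fig_sat
```

By default cut() builds intervals that are open on the left and closed on the right, so a score of exactly 800 would fall in the first bin.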

Creating a Bar Plot from a Frequency Table

Step1: Rounding the values

#Step1: Rounding the values
merged_data2$life_exp_mean_rounded<-round(merged_data2$life_exp_mean, 0)
merged_data2$urb_mean_rounded<-round(merged_data2$urb_mean, 0)
head(merged_data2, n=5)

Creating a Bar Plot from a Frequency Table

Step2: Creating a frequency table

#Step2: Creating a frequency table
freq_table<-table(merged_data2$life_exp_mean_rounded)
freq_table


38 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 
 1  2  3  4  4  5  5  2  2  4 11  6  5  6  4  5  6  6 10 10  6  6 12 11  5 11 
68 69 70 71 72 73 74 75 77 
 9 11 12 12  4  4  1  6  3

#Step3: Turning the table into a dataframe
freq_table<-data.frame(freq_table)

#Step4: Inspecting the names
names(freq_table)

[1] "Var1" "Freq"

Creating a Bar Plot from a Frequency Table

Step2: Creating a frequency table

#Step5: Providing more intuitive names
names(freq_table)[1]<-"life_exp_mean_rounded"
names(freq_table)[2]<-"frequency"
#Step6: Turning factor variables into numeric
freq_table$life_exp_mean_rounded<-as.numeric(as.character(freq_table$life_exp_mean_rounded))

str(freq_table)

'data.frame':   35 obs. of  2 variables:
 $ life_exp_mean_rounded: num  38 43 44 45 46 47 48 49 50 51 ...
 $ frequency            : int  1 2 3 4 4 5 5 2 2 4 ...

names(freq_table)

[1] "life_exp_mean_rounded" "frequency"
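
Why Step 6 converts through as.character(): the first column of a frequency table is stored as a factor, and calling as.numeric() on a factor returns its internal level codes rather than the labels. A minimal sketch with a made-up factor f:

```r
# A factor whose labels look like numbers, as in the Var1 column
f <- factor(c("38", "43", "77"))

# Directly converting a factor gives the level codes 1, 2, 3
codes <- as.numeric(f)

# Converting to character first recovers the labelled values
values <- as.numeric(as.character(f))

codes
values
```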

Creating a Bar Plot from a Frequency Table

Step2: Creating a frequency table

Inspecting the new dataframe

freq_table

Creating a Bar Plot from a Frequency Table

Step3: Creating the Barplot

library(ggplot2)
#Step7: Creating the barplot
fig5<-ggplot(data = freq_table, aes(x=life_exp_mean_rounded, y=frequency))+
  geom_bar(stat="identity")+
  theme_bw()+
  coord_cartesian(xlim = c(min(merged_data2$life_exp_mean)-5,
                    max(merged_data2$life_exp_mean)+5),
           ylim = c(0, 20))
fig5

Creating a Bar Plot from a Frequency Table

Step4: Creating a Histogram using `geom_histogram`

fig5_b<-ggplot(data = merged_data2, aes(x=life_exp_mean))+
  geom_histogram()+
  theme_bw()+
    coord_cartesian(xlim = c(min(merged_data2$life_exp_mean)-5,
                    max(merged_data2$life_exp_mean)+5),
           ylim = c(0, 20))
fig5_b

Creating a Bar Plot from a Frequency Table

Step4: Creating a Histogram using `geom_histogram`

This is how we control the number of bins (here, 50):

fig5_b<-ggplot(data = merged_data2, aes(x=life_exp_mean))+
  geom_histogram(bins = 50, col="white")+
  theme_bw()+
    coord_cartesian(xlim = c(min(merged_data2$life_exp_mean)-5,
                    max(merged_data2$life_exp_mean)+5),
           ylim = c(0, 20))
fig5_b

Creating a Bar Plot from a Frequency Table

Step4: Creating a Histogram using `geom_histogram`

This is how we control the number of bins (here, 35):

fig5_b<-ggplot(data = merged_data2, aes(x=life_exp_mean))+
  geom_histogram(bins = 35, col="white")+
  theme_bw()+
    coord_cartesian(xlim = c(min(merged_data2$life_exp_mean)-5,
                    max(merged_data2$life_exp_mean)+5),
           ylim = c(0, 20))
fig5_b
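
The bins argument sets how many bars are drawn. geom_histogram() also accepts binwidth, which sets how wide each bar is in units of the variable. The sketch below compares the two on simulated data; sim_df and its values are assumptions, not the course data.

```r
library(ggplot2)

# Hypothetical data: 200 simulated values
set.seed(1)
sim_df <- data.frame(x = rnorm(200, mean = 60, sd = 10))

# bins fixes the number of bars
p_bins <- ggplot(sim_df, aes(x = x)) +
  geom_histogram(bins = 35, col = "white") +
  theme_bw()

# binwidth fixes the width of each bar (here 2 units of x)
p_width <- ggplot(sim_df, aes(x = x)) +
  geom_histogram(binwidth = 2, col = "white") +
  theme_bw()

# layer_data() returns the computed bins, one row per bar
bins_data  <- layer_data(p_bins)
width_data <- layer_data(p_width)

# Either way, every observation is counted exactly once
sum(bins_data$count)
sum(width_data$count)
```

Choosing binwidth is often easier to explain to a reader ("each bar covers two years"), while bins is convenient when the scale of the variable is not known in advance.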

Creating a Bar Plot from a Frequency Table

This is how we put them side by side

library(gridExtra)
grid.arrange(fig5, fig5_b, ncol=2)

Mapping the Measures of Central Tendency

Calculating Mean

The mean describes the average value of a variable.
It is calculated as:

\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x_i}{n} \]
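
A short worked example of the formula, using five made-up values (x is an assumption for illustration): the sum divided by the number of observations gives the same result as the built-in mean().

```r
# Hypothetical values
x <- c(52, 60, 64, 70, 74)

# Sum of the values divided by how many there are
mean_by_hand <- sum(x) / length(x)

# Built-in equivalent
mean_builtin <- mean(x)

mean_by_hand
mean_builtin
```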

Mapping the Measures of Central Tendency

Calculating Mean

In our case, we can calculate the mean of all the life expectancy values:

# Calculate the mean
life_exp_mean <- mean(merged_data2$life_exp_mean, na.rm = TRUE)
life_exp_mean

[1] 61.26972

mean_label <- paste("Mean (x̄):\n", round(life_exp_mean, 2))
y_coord<-17
fig5_b<-ggplot(data = merged_data2, aes(x=life_exp_mean))+
  geom_histogram(bins = 35, col="white")+
  geom_vline(xintercept=life_exp_mean, linetype='dashed', col = 'red')+
  annotate(geom="text", 
           x=life_exp_mean-2, 
           y=y_coord, 
           label=mean_label,
           color="red")+
  theme_bw()+
    coord_cartesian(xlim = c(min(merged_data2$life_exp_mean)-5,
                    max(merged_data2$life_exp_mean)+5),
           ylim = c(0, 20))
fig5_b

Mapping the Measures of Central Tendency

Central Tendency: Median

The median is the middle value in a dataset ordered from smallest to largest.
It splits the dataset into two equal halves.

If the number of observations is odd:
- Median = middle number
- Example:
  Data = [1, 3, 7] → Median = 3

If the number of observations is even:
- Median = average of the two middle values
- Example:
  Data = [1, 3, 5, 7] → Median = (3 + 5) / 2 = 4
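
The two cases can be checked in R. This sketch uses the same small examples; median() sorts the values itself, so the input order does not matter. The names odd_data, even_data, and by_hand are assumptions.

```r
# Odd number of observations: the middle value after sorting
odd_data <- c(7, 1, 3)
median_odd <- median(odd_data)

# Even number of observations: the average of the two middle values
even_data <- c(1, 3, 5, 7)
median_even <- median(even_data)

# The even case computed by hand
sorted  <- sort(even_data)
n       <- length(sorted)
by_hand <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2

median_odd
median_even
by_hand
```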

Mapping the Measures of Central Tendency

Central Tendency: Median

In our case, we can calculate the median of all the life expectancy values:

# Calculate the median
life_exp_median <- median(merged_data2$life_exp_mean)
life_exp_median

[1] 62.57153

mean_label <- paste("Mean (x̄):\n", round(life_exp_mean, 2))
median_label <- paste("Median:\n", round(life_exp_median, 2))
y_coord<-17
fig5_b <- ggplot(data = merged_data2, aes(x = life_exp_mean)) +
  geom_histogram(bins = 35, col = "white") +
  geom_vline(xintercept = life_exp_mean, linetype = "dashed", color = "red") +
  geom_vline(xintercept = life_exp_median, linetype = "dashed", color = "blue") +
  annotate("text", x = life_exp_mean - 2, y = y_coord, label = mean_label, color = "red") +
  annotate("text", x = life_exp_median + 2, y = y_coord, label = median_label, color = "blue") +
  theme_bw()+
    coord_cartesian(xlim = c(min(merged_data2$life_exp_mean)-5,
                    max(merged_data2$life_exp_mean)+5),
           ylim = c(0, 20))
fig5_b

Mapping the Measures of Central Tendency

Central Tendency: Median and Boxplot

Another way to do this is through a boxplot
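
A minimal boxplot sketch, assuming ggplot2's geom_boxplot() and a small made-up data frame box_df (not the course data). The line inside the box marks the median and the box spans the first to third quartile.

```r
library(ggplot2)

# Hypothetical values
box_df <- data.frame(x = c(45, 52, 58, 61, 63, 66, 70, 74, 77))

# Horizontal boxplot of a single variable
fig_box <- ggplot(box_df, aes(x = x)) +
  geom_boxplot() +
  theme_bw()
fig_box

# The median that the line inside the box represents
box_median <- median(box_df$x)
box_median
```

With the course data, the same idea would presumably be ggplot(merged_data2, aes(x = life_exp_mean)) + geom_boxplot().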

Measures of Dispersion

Common Measures

Standard Deviation:
\[ s = \sqrt{s^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
(Roughly the typical distance of a value from the mean.)
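
A worked example on five made-up values (x is an assumption): the variance is the sum of squared deviations from the mean divided by n - 1, and the standard deviation is its square root. R's sd() uses this same sample formula.

```r
# Hypothetical values
x <- c(52, 60, 64, 70, 74)

# Mean of the values
x_bar <- mean(x)

# Sample variance: squared deviations from the mean, divided by n - 1
s2 <- sum((x - x_bar)^2) / (length(x) - 1)

# Standard deviation: square root of the variance
s <- sqrt(s2)

s
sd(x)
```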

Mapping the Measures of Dispersion

Standard Deviation

# Calculating sd
life_exp_sd <- sd(merged_data2$life_exp_mean)
# Create sequence for axis ticks
sigma_breaks <- life_exp_mean + (-3:3) * life_exp_sd
# Replacing the x-ticks with the SD
sigma_labels <- c("-3s", "-2s", "-1s", "x̄", "+1s", "+2s", "+3s")
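
To make the breaks concrete, the sketch below builds the same seven tick positions on simulated data and counts the share of values that fall within one and two standard deviations of the mean. The data and the names x, breaks, within_1s, and within_2s are assumptions for illustration.

```r
# Hypothetical data: 1000 simulated values
set.seed(1)
x <- rnorm(1000, mean = 60, sd = 10)

x_bar <- mean(x)
s     <- sd(x)

# Seven positions: mean - 3s, ..., mean, ..., mean + 3s
breaks <- x_bar + (-3:3) * s

# Share of values between -1s and +1s, and between -2s and +2s
within_1s <- mean(x >= breaks[3] & x <= breaks[5])
within_2s <- mean(x >= breaks[2] & x <= breaks[6])

within_1s
within_2s
```

For roughly bell-shaped data, about two thirds of the values lie within one standard deviation of the mean; real data such as life expectancy may deviate from that.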

Mapping the Measures of Dispersion

Standard Deviation

This is how we create the graph.

y_coord<-17
# Histogram with SD-based axis ticks; mean_label was defined in the mean section
fig5_b <- ggplot(data = merged_data2, aes(x = life_exp_mean)) +
  geom_histogram(bins = 35, col = "white") +
  geom_vline(xintercept = life_exp_mean, linetype = "dashed", color = "red") +
  annotate("text", x = life_exp_mean - 2, y = y_coord, label = mean_label, color = "red") +
  scale_x_continuous(breaks = sigma_breaks, labels = sigma_labels) +
  theme_bw() +
  coord_cartesian(xlim = c(min(merged_data2$life_exp_mean) - 5,
                           max(merged_data2$life_exp_mean) + 5),
                  ylim = c(0, 20))

fig5_b

Mapping the Measures of Dispersion

Standard Deviation

This is how we can visualize the standard deviations.

Conclusion

Histograms show the shape of a continuous variable’s distribution
Bar plots visualize frequencies of discrete or grouped values
Mean and median describe central tendency
A big gap between them often signals skew
Standard deviation shows how spread out the values are
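
A small illustration of the point about skew, using made-up values (skewed is an assumption): one very large observation pulls the mean upward while the median barely moves.

```r
# Hypothetical right-skewed values: one large outlier
skewed <- c(1, 2, 3, 4, 100)

skewed_mean   <- mean(skewed)
skewed_median <- median(skewed)

skewed_mean
skewed_median
```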
