Data Visualization with ggplot2

From Points and Bars to Boxplots and Maps in ggplot2

Bogdan G. Popescu

bogdan.popescu@johncabot.edu

John Cabot University

Learning Outcomes

By the end of this lecture, you will be able to:

Understand the logic of the Grammar of Graphics.
Create plots in R using ggplot2.
Choose the right geom for your data:
- points
- bars
- lines
- boxplots
- density plots
- maps

Purpose of Visuals

Why Visuals?

Numbers vs Visuals

Would you rather read this?

Country	LifeExpectancy
Monaco	77.25278
Andorra	77.04861
San Marino	76.88056
Guernsey	76.71944
Israel	75.49722

…or see this?

Why Visuals Matter

Visuals are powerful because they:

Reveal patterns we miss in tables

Help us explain results clearly

Persuade others more effectively

Intro to ggplot2

The ggplot2 library in R

R can help us achieve great visualizations with the help of ggplot2

The Essentials

Grammar of Graphics

At the most basic level we have:
- data
- geometries
- aesthetics

Optional layers include:
- facets
- coordinates
- annotations
- themes, etc.

We can add them together in ggplot() with +
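As a minimal sketch of how layers are combined with +, the example below builds a plot in two steps. The data frame `toy` and its columns are made up for illustration; `facet_wrap()`, `labs()`, and `theme_minimal()` stand in for the optional layers listed above.

```r
library(ggplot2)

# Hypothetical toy data (not from the lecture): two groups measured on x and y
toy <- data.frame(
  x = c(1, 2, 3, 1, 2, 3),
  y = c(2, 4, 5, 1, 3, 6),
  group = rep(c("A", "B"), each = 3)
)

# Essentials: data + aesthetics + one geometry
p <- ggplot(data = toy, aes(x = x, y = y)) +
  geom_point()

# Optional layers are added with + in the same way
p_full <- p +
  facet_wrap(~ group) +               # facets: one panel per group
  labs(title = "A layered plot") +    # annotation: plot title
  theme_minimal()                     # theme: overall look

p_full
```

Because each layer is added separately, a layer can be removed or swapped without touching the rest of the plot.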

Data Cleaning

Preparing the Data

Before we use ggplot, our data has to be tidy. This means:

Each variable is a column

Each observation has its own row

Each value has its own cell
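A short sketch of what tidying can look like, assuming the tidyr package is installed. The `wide` table and its values are invented for illustration: the variable "year" is hidden in the column names, so `pivot_longer()` is used to give it its own column.

```r
library(tidyr)

# Hypothetical "wide" table: one column per year, so the variable "year"
# is spread across column names instead of being a column itself
wide <- data.frame(
  country = c("A", "B"),
  `2000` = c(55, 60),
  `2004` = c(58, 63),
  check.names = FALSE
)

# Reshape to tidy form: one row per country-year observation
tidy <- pivot_longer(
  wide,
  cols = c(`2000`, `2004`),
  names_to = "year",
  values_to = "turnout"
)
tidy$year <- as.integer(tidy$year)

tidy
```

After reshaping, `year` and `turnout` can be mapped directly to aesthetics, for example `aes(x = year, y = turnout)`.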

Simple Geoms

Geometries

Geoms are the “shapes” your data takes on a graph.
Example:
- points (geom_point) make a scatterplot
- bars (geom_bar) make a bar chart
- lines (geom_line) make a line graph.
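To illustrate that the geom only changes the shape the data takes, the sketch below reuses one base plot with different geoms. The data frame `toy` is hypothetical and not part of the lecture data.

```r
library(ggplot2)

# Hypothetical data for illustration: a value recorded in six periods
toy <- data.frame(
  period = 1:6,
  value = c(3, 5, 4, 6, 8, 7)
)

# The data and aesthetics stay the same...
base <- ggplot(toy, aes(x = period, y = value))

# ...only the geom changes the shape the data takes
p_points <- base + geom_point()                 # scatterplot
p_line   <- base + geom_line()                  # line graph
p_both   <- base + geom_line() + geom_point()   # geoms can be combined
```

Printing `p_points`, `p_line`, or `p_both` draws the corresponding plot.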

Simple Geoms

Examples

Simple Geoms

Reference

geom	Use for
`geom_point()`	Relationships between two variables; at least 10 obs.
`geom_col()`	Totals/percent per category; ordered bars. Uses pre-computed values for bar height. You must supply both `x` and `y`
`geom_bar()`	Counts the observations in each category. You must supply `x`, not `y`
`geom_text()`	Direct labels for small N; annotate outliers
`geom_line()`	Trends in a variable across an ordered x-axis, such as time

Simple Geoms

`geom_point`

Imagine we have data about 9 countries that records their level of democracy from -10 to +10 (x-axis) and their GDP per capita in $1,000s (y-axis).

eg1 <- data.frame(
  democracy = c(-8, -7, -5, -3, 0, 2, 5, 8, 9),   # democracy score
  gdp = c(2, 9, 4, 7, 8, 20, 15, 25, 27)  # GDP per capita in $1000s
)

Simple Geoms

`geom_point`

Imagine we have data about 9 countries that records their level of democracy from -10 to +10 (x-axis) and their GDP per capita in $1,000s (y-axis).

democracy	gdp
-8	2
-7	9
-5	4
-3	7
0	8

Simple Geoms

`geom_point`

Imagine we have data about 9 countries that records their level of democracy from -10 to +10 (x-axis) and their GDP per capita in $1,000s (y-axis).

Show the code

# Scatterplot with geom_point
ggplot(data=eg1, 
       aes(x = democracy, y = gdp)) +
  geom_point()

Simple Geoms

`geom_point`

Interpretation: Each dot shows one country’s democracy score (left to right) and its GDP per person (up and down). The dots suggest that more democratic countries usually have higher GDP per person.

Show the code

# Scatterplot with geom_point
ggplot(eg1, aes(x = democracy, y = gdp)) +
  geom_point()

Simple Geoms

`geom_col`

Suppose we collected a small survey and just recorded the education level of each respondent. We want to visualize how many respondents fall into each category.

# Raw survey rows: one row per respondent (not yet aggregated)
eg2 <- data.frame(
  education = c(
    "Primary", "Primary", "High School", "High School", "High School",
    "College", "College", "College", "College"
  ))

education
Primary
Primary
High School
High School

Simple Geoms

`geom_col`

Suppose we collected a small survey and just recorded the education level of each respondent. We want to visualize how many respondents fall into each category.

library(dplyr)
# Count occurrences of each education level
eg2_transformed <- eg2 %>%
  count(education)

education	n
College	4
High School	3
Primary	2

Simple Geoms

`geom_col`

Suppose we collected a small survey and just recorded the education level of each respondent. We want to visualize how many respondents fall into each category.

Show the code

# Drawing the column
ggplot(eg2_transformed, 
       aes(x = education, y = n)) +
  geom_col()

Simple Geoms

`geom_bar`

Suppose we collected a small survey and just recorded the education level of each respondent. We want to visualize how many respondents fall into each category.

Show the code

# Drawing the bar
ggplot(eg2, 
       aes(x = education)) +
  geom_bar()

What is the difference?

`geom_col` vs. `geom_bar`

This is the difference:

# Drawing the column
ggplot(eg2_transformed, 
       aes(x = education, y = n)) +
  geom_col()

# Drawing the bar
ggplot(eg2, 
       aes(x = education)) +
  geom_bar()
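One way to check that the two approaches draw the same bars is to compare the heights ggplot2 computes. This is a sketch that rebuilds the lecture's `eg2` data and uses `layer_data()` to inspect each plot; the object names `p_bar`, `p_col`, `heights_bar`, and `heights_col` are introduced here for illustration.

```r
library(ggplot2)
library(dplyr)

# Raw rows: one row per respondent (same values as eg2 in the lecture)
eg2 <- data.frame(
  education = c(
    "Primary", "Primary", "High School", "High School", "High School",
    "College", "College", "College", "College"
  ))

# geom_bar() counts the rows itself
p_bar <- ggplot(eg2, aes(x = education)) +
  geom_bar()

# geom_col() needs the counts computed beforehand
eg2_transformed <- count(eg2, education)
p_col <- ggplot(eg2_transformed, aes(x = education, y = n)) +
  geom_col()

# layer_data() returns the values ggplot2 actually draws
heights_bar <- layer_data(p_bar)$y
heights_col <- layer_data(p_col)$y
all.equal(heights_bar, heights_col)
```

The plots look the same; the difference is only in who does the counting, ggplot2 (`geom_bar`) or you (`geom_col`).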

Simple Geoms

`geom_text`

Suppose we collected a small survey and just recorded the education level of each respondent. We want to visualize how many respondents fall into each category.

Show the code

# Column chart with a text label centered vertically in each bar
ggplot(eg2_transformed, aes(x = education, y = n)) +
  geom_col() +
  geom_text(
    aes(label = education, y = n / 2))  # y must be inside aes() to use column n

Simple Geoms

`geom_text`

We can also use geom_text with the first example:

Imagine we have data about 9 countries that records their level of democracy from -10 to +10 (x-axis) and their GDP per capita in $1,000s (y-axis).

eg1 <- data.frame(
  democracy = c(-8, -7, -5, -3, 0, 2, 5, 8, 9),   # democracy score
  gdp = c(2, 9, 4, 7, 8, 20, 15, 25, 27),  # GDP per capita in $1000s
  country = c(
    "North Korea",   # very autocratic, very poor
    "Saudi Arabia",  # autocratic, but richer due to oil
    "Zimbabwe",      # authoritarian, low GDP
    "Russia",        # hybrid regime, middle income
    "Nigeria",       # similar position
    "India",         # low–mid democracy, growing GDP
    "Brazil",        # democracy, mid GDP
    "Poland",        # consolidated democracy, higher GDP
    "South Korea"   # rich democracy
  ))

Simple Geoms

`geom_text`

We can also use geom_text with the first example:

Imagine we have data about 9 countries that records their level of democracy from -10 to +10 (x-axis) and their GDP per capita in $1,000s (y-axis).

democracy	gdp	country
-8	2	North Korea
-7	9	Saudi Arabia
-5	4	Zimbabwe
-3	7	Russia
0	8	Nigeria

Simple Geoms

`geom_text`

We can also use geom_text with the first example:

Imagine we have data about 9 countries that records their level of democracy from -10 to +10 (x-axis) and their GDP per capita in $1,000s (y-axis).

Show the code

# Scatterplot with a country label above each point
ggplot(eg1, aes(x = democracy, y = gdp)) +
  geom_point() +
  geom_text(aes(label = country), vjust = -1)

Simple Geoms

`geom_line`

Suppose we have data on average voter turnout (%) in national elections over several years. We want to see the trend in participation.

# Toy dataset
eg3 <- data.frame(
  year = c(2000, 2004, 2008, 2012, 2016, 2020),
  turnout = c(55, 58, 62, 60, 59, 65))

Simple Geoms

`geom_line`

Suppose we have data on average voter turnout (%) in national elections over several years. We want to see the trend in participation.

Show the code

# Drawing the time line
ggplot(eg3, aes(x = year, y = turnout)) +
  geom_line()

Complex Geoms

Examples

Complex Geoms

Reference

geom	Use for
`geom_boxplot()`	Compare distributions across groups
`geom_histogram()`	Distribution of a single continuous variable (frequency bins)
`geom_density()`	Smoothed distribution of a single continuous variable
`geom_violin()`	Visualizing distributions across categories (shape of data, not just box)
`geom_smooth()`	Adding fitted trend/smoothed lines (LOESS, GAM, linear, etc.)
`geom_errorbar()`	Showing uncertainty or variability (e.g., CI ranges around a mean)
`geom_sf()`	Plotting spatial (map) data stored as `sf` objects

Complex Geoms

`geom_boxplot`

Suppose we surveyed people about their trust in government on a 1–10 scale (1 = no trust, 10 = complete trust). We want to compare typical values and how spread out the answers are for men and women.

set.seed(123)

# Toy survey dataset
eg4 <- data.frame(
  gender = rep(c("Men", "Women"), each = 20),
  trust = c(
    rnorm(20, mean = 5, sd = 2),
    rnorm(20, mean = 8, sd = 1)))

Complex Geoms

`geom_boxplot`

Suppose we surveyed people about their trust in government on a 1–10 scale (1 = no trust, 10 = complete trust). We want to see the typical values and how spread out the answers are.

gender	trust
Men	3.879049
Men	4.539645
Men	8.117417
Men	5.141017
Men	5.258576
Men	8.430130

Complex Geoms

`geom_boxplot`

Suppose we surveyed people about their trust in government on a 1–10 scale (1 = no trust, 10 = complete trust). We want to see the typical values and how spread out the answers are.

Show the code

# Drawing the boxplot
ggplot(eg4, aes(y = trust)) +
  geom_boxplot()

Complex Geoms

`geom_boxplot`

set.seed(123)

# Toy survey dataset
eg5 <- data.frame(
  gender = rep(c("Men", "Women"), each = 20),
  trust = c(
    rnorm(20, mean = 5, sd = 2),
    rnorm(20, mean = 8, sd = 1)))

Complex Geoms

`geom_boxplot`

gender	trust
Men	3.879049
Men	4.539645
Men	8.117417
Men	5.141017
Men	5.258576
Men	8.430130
Men	5.921832
Men	2.469877

Complex Geoms

`geom_boxplot`

Show the code

ggplot(eg5, aes(x = gender, y = trust)) +
  geom_boxplot()

Complex Geoms

`geom_histogram`: What is a Histogram?

Histograms allow us to better understand how frequently or infrequently certain values occur in our dataset.

Imagine a set of values that are spaced out along a number line.

Complex Geoms

`geom_histogram`: What is a Histogram?

To construct a histogram, a section of the number line is divided into equal chunks, called bins.

Next, count how many data points sit inside each bin and draw one bar per bin.

The height of each bar corresponds to the number of data points in its bin.

Complex Geoms

`geom_histogram`: What is a Histogram?

Label the data (in the example below each data point is an SAT score)

Draw in a y-axis which counts the number of data points in each bin

Finally label your bins.
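The steps above can be sketched by hand in base R before letting `geom_histogram()` do the work. The SAT scores below are made up for illustration, and the bin width of 100 is an arbitrary choice.

```r
# Hypothetical SAT scores, made up for illustration
scores <- c(1010, 1080, 1120, 1150, 1190, 1210, 1240, 1290, 1330, 1480)

# Step 1: divide the number line into equal-width bins
breaks <- seq(1000, 1500, by = 100)

# Step 2: assign each score to a bin and count the scores per bin
bins <- cut(scores, breaks = breaks, right = FALSE)
counts <- table(bins)

# Step 3: each count is the height of one bar
counts
```

Every score lands in exactly one bin, so the bar heights always add up to the number of data points.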

Complex Geoms

`geom_histogram`: Example

Suppose we surveyed people about their trust in government on a 1–10 scale (1 = no trust, 10 = complete trust). We want to compare typical values and how spread out the answers are.

Show the code

# Drawing the Histogram
ggplot(eg5, aes(x = trust)) +
  geom_histogram(bins = 10, color = "white") +
  coord_cartesian(xlim = c(2, 8))

Complex Geoms

`geom_histogram`: Example

Suppose we surveyed people about their trust in government on a 1–10 scale (1 = no trust, 10 = complete trust). We want to compare typical values and how spread out the answers are.

There is a direct connection between histograms and boxplots:

Complex Geoms

`geom_histogram`: Example

The number of bins is a key choice when drawing a histogram:

bins that are too wide hide structure;
bins that are too narrow are noisy.
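A quick way to see the effect is to draw the same data with different bin counts. The sketch below rebuilds the lecture's `eg5` data; the helper name `hist_with_bins()` and the values 3 and 40 are choices made here for illustration.

```r
library(ggplot2)

# Same toy data as eg5 in the lecture
set.seed(123)
eg5 <- data.frame(
  gender = rep(c("Men", "Women"), each = 20),
  trust = c(
    rnorm(20, mean = 5, sd = 2),
    rnorm(20, mean = 8, sd = 1)))

# Helper (hypothetical name) that draws the histogram with a given bin count
hist_with_bins <- function(n_bins) {
  ggplot(eg5, aes(x = trust)) +
    geom_histogram(bins = n_bins, color = "white") +
    labs(title = paste("bins =", n_bins))
}

p_few  <- hist_with_bins(3)    # very wide bins: the shape is hidden
p_mid  <- hist_with_bins(10)   # the setting used in the lecture
p_many <- hist_with_bins(40)   # very narrow bins: gaps and spikes
```

Printing the three plots side by side shows that the data are identical and only the binning changes the impression.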

Complex Geoms

`geom_density`: What is a Density Curve?

A histogram counts how many respondents fall in each trust “bucket.”

A density curve smooths those buckets into a single curve so we see the overall shape.

The area under the entire curve = 1 (i.e., 100% of people).

The area under the curve over a range is the probability that a randomly chosen person’s trust score falls in that range, for example:
- between certain values (e.g. 5 and 7)
- above 5
- below 4

Complex Geoms

`geom_density`: What is a Density Curve?

Heights tell you nothing by themselves; areas tell you the share.

When I widen the range, the area grows, so the probability grows.

The curve is a smoothed estimate of the distribution; the bandwidth controls how smooth it is.

Here too there is a connection between density, histograms, and boxplots.
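The idea that areas, not heights, carry the probability can be checked numerically. This is a rough sketch using base R's `density()` with its default bandwidth and a trapezoid rule; the helper `area_between()` is a hypothetical name introduced here, and the resulting numbers are estimates from the toy data, not exact probabilities.

```r
# Same toy trust scores as eg5 in the lecture
set.seed(123)
trust <- c(rnorm(20, mean = 5, sd = 2), rnorm(20, mean = 8, sd = 1))

# Kernel density estimate; density() picks a default bandwidth
d <- density(trust)

# Hypothetical helper: trapezoid-rule area under the curve between two values
area_between <- function(d, lower, upper) {
  keep <- d$x >= lower & d$x <= upper
  x <- d$x[keep]
  y <- d$y[keep]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

area_total <- area_between(d, -Inf, Inf)  # close to 1
area_5_7   <- area_between(d, 5, 7)       # estimated share scoring between 5 and 7
area_4_8   <- area_between(d, 4, 8)       # wider range, larger area
```

Widening the range from 5–7 to 4–8 can only add area, which is why the estimated probability grows.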

Complex Geoms

`geom_density`: Example

Suppose we surveyed people about their trust in government on a 1–10 scale (1 = no trust, 10 = complete trust). We want to compare typical values and how spread out the answers are.

Show the code

# Rescale the density so it is on the same scale as histogram counts
N <- nrow(eg5)
binwidth <- diff(range(eg5$trust)) / 10  # approximate bin width for bins = 10

# Draw the density curve on the count scale
ggplot(eg5, aes(x = trust)) +
  geom_density(aes(y = after_stat(density) * N * binwidth), linewidth = 1) +
  labs(
    x = "trust",
    y = "count")

Complex Geoms

`geom_density`: Example

Suppose we surveyed people about their trust in government on a 1–10 scale (1 = no trust, 10 = complete trust). We want to compare typical values and how spread out the answers are.

It is more informative to include both geom_density and geom_histogram.

Show the code

# Rescale the density so it is on the same scale as histogram counts
N <- nrow(eg5)
binwidth <- diff(range(eg5$trust)) / 10  # approximate bin width for bins = 10

# Draw the histogram with the rescaled density curve on top
ggplot(eg5, aes(x = trust)) +
  geom_histogram(bins = 10, color = "white") +
  geom_density(aes(y = after_stat(density) * N * binwidth), linewidth = 1) +
  labs(
    x = "trust",
    y = "count")

Complex Geoms

`geom_density`: Example

Most people in the survey gave middle-of-the-road trust scores—think around 5 to 7.

As you move toward the extremes (very low 1–3 or very high 8–10), the curve drops, showing fewer people chose those scores.

Complex Geoms

`geom_violin`: Example

Violin plots are symmetric representations of geom_density

Complex Geoms

`geom_violin`: Example

Violin plots are symmetric representations of geom_density

This was our density

Complex Geoms

`geom_violin`: Example

Violin plots are symmetric representations of geom_density

This was our density shaded in white

Complex Geoms

`geom_violin`: Example

Violin plots are symmetric representations of geom_density

This is our density flipped

Complex Geoms

`geom_violin`: Example

Violin plots are symmetric representations of geom_density

This is what happens if we create a mirror density

Complex Geoms

`geom_violin`: Example

Violin plots are symmetric representations of geom_density

This is what geom_violin looks like.

ggplot(eg5, aes(x = "", y = trust)) +
  geom_violin()

Complex Geoms

`geom_violin`: Example

Show the code

ggplot(eg5, aes(x = gender, y = trust)) +
  geom_violin()

Complex Geoms

`geom_violin`: Example

Interpretation: The data suggests that men’s responses are more spread out (from low to high trust), while women’s are more concentrated in the middle-to-high range.

Show the code

ggplot(eg5, aes(x = gender, y = trust)) +
  geom_violin()

Complex Geoms

`geom_smooth`

This is what we had previously:

# Example data
eg1 <- data.frame(
  democracy = c(-8, -7, -5, -3, 0, 2, 5, 8, 9),   # democracy score
  gdp = c(2, 9, 4, 7, 8, 20, 15, 25, 27),  # GDP per capita in $1000s
  country = c(
    "North Korea",   # very autocratic, very poor
    "Saudi Arabia",  # autocratic, but richer due to oil
    "Zimbabwe",      # authoritarian, low GDP
    "Russia",        # hybrid regime, middle income
    "Nigeria",       # similar position
    "India",         # low–mid democracy, growing GDP
    "Brazil",        # democracy, mid GDP
    "Poland",        # consolidated democracy, higher GDP
    "South Korea"   # rich democracy
  ))

Complex Geoms

`geom_smooth`

This is what we had previously:

democracy	gdp	country
-8	2	North Korea
-7	9	Saudi Arabia
-5	4	Zimbabwe
-3	7	Russia
0	8	Nigeria
2	20	India
5	15	Brazil
8	25	Poland
9	27	South Korea

Complex Geoms

`geom_smooth`

This is what we had previously:

# Scatterplot with geom_point
ggplot(eg1, aes(x = democracy, y = gdp)) +
  geom_point()

Complex Geoms

`geom_smooth`

This is what happens when we try to fit a line:

# Scatterplot with a fitted straight line (linear model)
ggplot(eg1, aes(x = democracy, y = gdp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "black")

Complex Geoms

`geom_smooth`

Interpretation: As one moves right (more democratic), the dots usually sit higher—meaning those countries tend to be richer. The black line going up summarizes that big-picture pattern, even though a few dots don’t fit.

# Scatterplot with a fitted straight line (linear model)
ggplot(eg1, aes(x = democracy, y = gdp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "black")

Complex Geoms

`geom_errorbar`

Error bars are like “antennae” on top of a mean.

They don’t show the full spread like a boxplot, but they give us a quick picture of how precise the average is.

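Short bars → the average is estimated precisely (data points are close together and/or there are many of them).
Long bars → the average is less certain (data points are spread out and/or there are few of them).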

This is useful when we only want to compare averages.

Complex Geoms

`geom_errorbar`

A natural question is how error bars compare to boxplots.

Boxplot

Shows the distribution of the data: Median, quartiles, and possible outliers.
Good for: seeing the shape and spread

Error Bar

Summarizes just the mean + uncertainty (often mean ± standard error or confidence interval).
Doesn’t show the full shape of the data.

Complex Geoms

`geom_errorbar`

# Set a random seed so results are reproducible.
# (Without this, R will generate different random numbers each time.)
set.seed(123)

# -------------------------------------------------------------
# Step 1: Create a toy dataset: "trust in government" by gender
# --------------------------------------------------------------
eg5 <- data.frame(
  # Repeat the labels "Men" and "Women" 20 times each
  gender = rep(c("Men", "Women"), each = 20),
  
  # Generate 20 trust values for Men ~ Normal(mean=5, sd=2)
  # Generate 20 trust values for Women ~ Normal(mean=8, sd=1)
  trust = c(
    rnorm(20, mean = 5, sd = 2),
    rnorm(20, mean = 8, sd = 1)
  )
)
#Print the first 3 entries
head(eg5, n=3)

gender	trust
Men	3.879049
Men	4.539645
Men	8.117417

Complex Geoms

`geom_errorbar`

# ----------------------------------------------------------------------
# Step 2: Calculate mean trust and error bars (95% confidence intervals)
# ----------------------------------------------------------------------
df_err <- eg5 %>%
  group_by(gender) %>%       # Group data by gender
  summarise(
    mean = mean(trust),      # Average trust per gender
    se = sd(trust) / sqrt(n()), # Standard error = sd / sqrt(sample size)
  ) %>%
  mutate(
    # 95% confidence interval = mean ± 1.96 * standard error
    ymin = mean - 1.96 * se,   # Lower bound
    ymax = mean + 1.96 * se    # Upper bound
  )
head(df_err, n=3)

gender	mean	se	ymin	ymax
Men	5.283248	0.4349892	4.430669	6.135826
Women	7.948743	0.1855799	7.585006	8.312480
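The 1.96 multiplier comes from the normal distribution. With only 20 observations per group, some analysts prefer a multiplier from the t distribution, which gives slightly wider intervals. The sketch below is an optional variation, not the lecture's method; `df_err_t` and `t_crit` are names introduced here.

```r
library(dplyr)

# Same toy data as eg5 in the lecture
set.seed(123)
eg5 <- data.frame(
  gender = rep(c("Men", "Women"), each = 20),
  trust = c(
    rnorm(20, mean = 5, sd = 2),
    rnorm(20, mean = 8, sd = 1)))

df_err_t <- eg5 %>%
  group_by(gender) %>%
  summarise(
    mean = mean(trust),              # average trust per gender
    se = sd(trust) / sqrt(n()),      # standard error of the mean
    n = n()                          # group size
  ) %>%
  mutate(
    t_crit = qt(0.975, df = n - 1),  # t multiplier for a 95% interval
    ymin = mean - t_crit * se,       # lower bound
    ymax = mean + t_crit * se        # upper bound
  )

df_err_t
```

The resulting `ymin` and `ymax` columns can be passed to `geom_errorbar()` in exactly the same way as the 1.96-based ones.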

Complex Geoms

`geom_errorbar`

# --------------------------------------------------------
# Step 3: Plotting the means with error bars using ggplot2
# --------------------------------------------------------
ggplot(df_err, aes(x = gender, y = mean)) +
  geom_point(size = 4) +  # Plot mean values as large points
  geom_errorbar(
    aes(ymin = ymin, ymax = ymax), # Add vertical error bars
    width = 0.15                   # Small horizontal "cap" on error bars
  ) +
  labs(
    x = NULL,                       # Remove x-axis label
    y = "trust",      # Label y-axis
    title = "Error Bars"            # Add plot title
  ) +
  coord_cartesian(ylim = c(0, 10))  # Set y-axis limits from 0 to 10

Complex Geoms

`geom_errorbar`

Show the code

set.seed(123)

#Step1: Toy dataset: trust in government by gender
eg5 <- data.frame(
  gender = rep(c("Men", "Women"), each = 20),
  trust = c(
    rnorm(20, mean = 5, sd = 2),
    rnorm(20, mean = 8, sd = 1)))

#Step2: Calculating Error bars
df_err <- eg5 %>%
  group_by(gender) %>%
  summarise(
    mean = mean(trust),
    se = sd(trust) / sqrt(n()),
  ) %>%
  mutate(
    ymin = mean - 1.96 * se,   # 95% CI lower bound
    ymax = mean + 1.96 * se    # 95% CI upper bound
  )

#Step3: Plotting
ggplot(df_err, aes(x = gender, y = mean)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.15) +
  labs(
    x = NULL,
    y = "trust",
    title = "Error Bars"
  ) +
  coord_cartesian(ylim = c(0, 10))   # same y-axis

Complex Geoms

`geom_errorbar`

Interpretation: Women in this sample show higher average trust in government than men. The bars show our uncertainty about each average, and since the two 95% intervals do not overlap (men: 4.43 to 6.14; women: 7.59 to 8.31), the difference is likely real and not just due to chance.

Complex Geoms

`geom_sf`

geom_sf draws maps from sf objects (“simple features” = data + geometry).

It works like other geoms and draws whatever geometry the sf object contains:
- points (cities, places, settlements)
- lines (rivers, roads, straight lines)
- polygons (countries, parks, seas, etc.)

Typical workflow:

Get a shape (countries, regions, cities, rivers, roads, etc.)
Plot them with geom_sf

Complex Geoms

`geom_sf` Example

# Install once (uncomment if needed):
# install.packages(c("sf", "ggplot2", "rnaturalearth", "rnaturalearthdata"))
library(ggplot2)
library(sf)
library(rnaturalearth)
library(rnaturalearthdata)

# Step1: Get a country polygons dataset as an sf object
world <- rnaturalearth::ne_countries(scale = "medium",
                                     returnclass = "sf")
head(world, n=3)

Selected columns of the output:

name	continent	pop_est	geometry
Zimbabwe	Africa	14645468	MULTIPOLYGON (((31.28789 -2...
Zambia	Africa	17861030	MULTIPOLYGON (((30.39609 -1...
Yemen	Asia	29161922	MULTIPOLYGON (((53.08564 16...

Complex Geoms

`geom_sf` Example

So, the world data frame contains 169 variables (e.g. featurecla, scalerank, labelrank) and 242 observations.

Complex Geoms

`geom_sf` Example

# Step2: Draw it
ggplot() +
  geom_sf(data = world) +
  labs(title = "The World (sf polygons)")

Complex Geoms

`geom_sf` Example

Use coord_sf(xlim=..., ylim=...) to zoom

You can layer multiple sf objects:

Countries (polygons)
Cities (points)

Complex Geoms

`geom_sf` Example

europe_bounds <- list(x = c(-10, 40),
                      y = c(35, 70))

ggplot() +
  geom_sf(data = world) +
  coord_sf(xlim = europe_bounds$x, 
           ylim = europe_bounds$y) +
  labs(title = "European Countries")

Complex Geoms

`geom_sf` Example

europe_bounds <- list(x = c(-10, 40),
                      y = c(35, 70))

# Create a few points
cities <- data.frame(
  name = c("Rome", "Berlin", "Paris"),
  lon = c(12.5, 13.4, 2.35),
  lat = c(41.9, 52.5, 48.85)
)

# Convert to sf POINTs
cities_sf <- st_as_sf(cities, coords = c("lon", "lat"), crs = 4326)

# Mapping Them
ggplot() +
  geom_sf(data = world) +
  geom_sf(data = cities_sf) +
  coord_sf(xlim = europe_bounds$x, 
           ylim = europe_bounds$y) +
  labs(title = "European Countries")

Complex Geoms

`geom_sf` Example

europe_bounds <- list(x = c(-10, 40),
                      y = c(35, 70))

# Create a few points
cities <- data.frame(
  name = c("Rome", "Berlin", "Paris"),
  lon = c(12.5, 13.4, 2.35),
  lat = c(41.9, 52.5, 48.85)
)

# Convert to sf POINTs
cities_sf <- st_as_sf(cities, coords = c("lon", "lat"), crs = 4326)

# Create a line connecting Rome -> Berlin -> Paris
line_sf <- st_sfc(st_linestring(as.matrix(cities[, c("lon", "lat")])), crs = 4326)

# Mapping Them
ggplot() +
  geom_sf(data = world) +
  geom_sf(data = cities_sf) +
  geom_sf(data = line_sf) +
  coord_sf(xlim = europe_bounds$x, 
           ylim = europe_bounds$y) +
  labs(title = "European Countries")

Conclusion

Key Takeaways

Grammar of Graphics: Data + Aesthetics + Geoms = Visualization.
Simple geoms (points, bars, lines, text) help explore relationships and categories.
Complex geoms (histograms, density, boxplots, violins, maps) reveal deeper patterns.
Always tidy your data before plotting.
Good visualization = clarity + accuracy

Exercises

Points vs. Lines

Use the dataset below:

df <- data.frame(
  year = c(2000, 2004, 2008, 2012, 2016, 2020),
  turnout = c(55, 58, 62, 60, 59, 65)
)

Make a scatterplot of turnout by year with geom_point().
Now use geom_line().
Which geom makes more sense for these data? Why?

Exercises

Bar vs. Column

Suppose you have survey data on respondents’ education:

df <- data.frame(
  education = c("Primary", "Primary", "High School", "High School", "College", "College", "College")
)

Plot the distribution of education levels with geom_bar()
Plot the same distribution with geom_col(). Hint: geom_col() needs a y value, so count the categories first.
What is the difference between the two approaches?

Exercises

Boxplot vs. Histogram

We surveyed trust in government on a 1–10 scale:

set.seed(123)
df <- data.frame(
  trust = c(rnorm(30, mean = 5, sd = 2), rnorm(30, mean = 7, sd = 1))
)

Plot the distribution with a histogram.
Plot the same distribution with a boxplot.
What information is easier to see in the histogram? In the boxplot?

Exercises Answers

Exercises

Points vs. Lines

Use the dataset below:

df <- data.frame(
  year = c(2000, 2004, 2008, 2012, 2016, 2020),
  turnout = c(55, 58, 62, 60, 59, 65)
)

Make a scatterplot of turnout by year with geom_point().

library(ggplot2)
ggplot(df, aes(x = year, y = turnout)) +
  geom_point()

Exercises

Points vs. Lines

Use the dataset below:

df <- data.frame(
  year = c(2000, 2004, 2008, 2012, 2016, 2020),
  turnout = c(55, 58, 62, 60, 59, 65)
)

Now use geom_line().

library(ggplot2)
ggplot(df, aes(x = year, y = turnout)) +
  geom_line()

Exercises

Points vs. Lines

Which geom makes more sense for these data? Why?

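The line plot is usually more informative here: the elections are ordered in time, and connecting them shows the trend in turnout. Points alone show each election's value but leave the reader to trace the trend.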
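
The two geoms can also be layered. A minimal sketch, assuming the same turnout data, that draws the trend line and marks each election with a point:

```r
library(ggplot2)

df <- data.frame(
  year = c(2000, 2004, 2008, 2012, 2016, 2020),
  turnout = c(55, 58, 62, 60, 59, 65)
)

p <- ggplot(df, aes(x = year, y = turnout)) +
  geom_line() +                              # trend over time
  geom_point() +                             # one marker per election
  scale_x_continuous(breaks = df$year) +     # label only the election years
  labs(x = "Election year", y = "Turnout (%)")

p
```

Layering keeps the trend visible while making clear that there are only six observations behind it.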

Exercises

Bar vs. Column

Suppose you have survey data on respondents’ education:

df <- data.frame(
  education = c("Primary", "Primary", "High School", "High School", "College", "College", "College")
)

Plot the distribution of education levels with geom_bar()

# Bar plot
ggplot(df, aes(x = education)) +
  geom_bar()

Exercises

Bar vs. Column

Suppose you have survey data on respondents’ education:

df <- data.frame(
  education = c("Primary", "Primary", "High School", "High School", "College", "College", "College")
)

Pre-count the values with dplyr::count() and plot them with geom_col():

# Step 1: Pre-count the values (count() and %>% come from dplyr)
library(dplyr)
df_counts <- df %>%
  count(education)
print(df_counts)

education	n
College	3
High School	2
Primary	2

Exercises

Bar vs. Column

Suppose you have survey data on respondents’ education:

df <- data.frame(
  education = c("Primary", "Primary", "High School", "High School", "College", "College", "College")
)

Pre-count the values with dplyr::count() and plot them with geom_col():

# Step 1: Pre-count the values (count() and %>% come from dplyr)
library(dplyr)
library(ggplot2)
df_counts <- df %>%
  count(education)

# Step 2: Plot with geom_col(), supplying the counts as y
ggplot(df_counts, aes(x = education, y = n)) +
  geom_col()

Exercises

Bar vs. Column

Suppose you have survey data on respondents’ education:

df <- data.frame(
  education = c("Primary", "Primary", "High School", "High School", "College", "College", "College")
)

What is the difference between the two approaches?

The key distinction:

geom_bar() = ggplot does the counting for you.
geom_col() = you must provide the counts yourself.
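
To check that the two approaches produce the same bars, one option is to compare the heights that ggplot2 actually computes for each plot. The sketch below assumes the same education data and uses layer_data() from ggplot2 to pull out the computed values. The object names bar_heights and col_heights are introduced here for illustration.

```r
library(ggplot2)
library(dplyr)

df <- data.frame(
  education = c("Primary", "Primary", "High School", "High School",
                "College", "College", "College")
)

# geom_bar(): ggplot2 counts the rows in each category
p_bar <- ggplot(df, aes(x = education)) +
  geom_bar()

# geom_col(): we count first and pass the result as y
df_counts <- count(df, education)
p_col <- ggplot(df_counts, aes(x = education, y = n)) +
  geom_col()

# Pull the bar heights out of each plot and compare them
bar_heights <- layer_data(p_bar)$count
col_heights <- layer_data(p_col)$y
all.equal(as.numeric(bar_heights), as.numeric(col_heights))
```

The two plots look identical; the only difference is who does the counting.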

Exercises

Boxplot vs. Histogram

We surveyed trust in government on a 1–10 scale:

set.seed(123)
df <- data.frame(
  trust = c(rnorm(30, mean = 5, sd = 2), rnorm(30, mean = 7, sd = 1))
)

Plot the distribution with a histogram.

# Histogram of trust
ggplot(df, aes(x = trust)) +
  geom_histogram(binwidth = 1, color = "white")

Exercises

Boxplot vs. Histogram

We surveyed trust in government on a 1–10 scale:

set.seed(123)
df <- data.frame(
  trust = c(rnorm(30, mean = 5, sd = 2), rnorm(30, mean = 7, sd = 1))
)

Plot the same distribution with a boxplot.

# Boxplot of trust
ggplot(df, aes(y = trust)) +
  geom_boxplot()

Exercises

Boxplot vs. Histogram

What information is easier to see in the histogram? In the boxplot?

In the histogram:
Easier to see the shape of the distribution (is it unimodal, bimodal, skewed?).
You can spot clusters (e.g., many respondents around 5 and many around 7).
You get a sense of frequency across different ranges.

In the boxplot:
Easier to see the median and spread at a glance.
Shows the quartiles (the box covers the middle 50% of the data).
Highlights outliers explicitly.
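
The numbers that a boxplot summarizes can also be computed directly. This is a rough sketch, assuming the same simulated trust data and the common 1.5 x IQR rule for flagging outliers. The names q, iqr, lower_fence, upper_fence, and outliers are introduced here for illustration.

```r
set.seed(123)
df <- data.frame(
  trust = c(rnorm(30, mean = 5, sd = 2), rnorm(30, mean = 7, sd = 1))
)

# Quartiles: the edges of the box and the median line inside it
q <- quantile(df$trust, probs = c(0.25, 0.5, 0.75))
q

# Interquartile range: the height of the box
iqr <- q[["75%"]] - q[["25%"]]

# Values beyond 1.5 * IQR from the box are drawn as individual outlier points
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- df$trust[df$trust < lower_fence | df$trust > upper_fence]
outliers
```

None of these values says anything about whether the distribution has one peak or two, which is the part the histogram shows better.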
