L8: Data Visualization 2

Bogdan G. Popescu

bogdan.popescu@johncabot.edu

John Cabot University

Intro

Previously, we covered quite some ground:

importance of color choices

mapping data to aesthetics: points

facets, coordinates, labels, themes

theme option manipulation

creating time animations

Preparing the Data

Before we use ggplot, our data has to be tidy

This means:

Each variable is a column

Each observation has its own row

Each value has its own cell

Plotting Quantities

People are better at seeing height differences than angle and area differences

Plotting Quantities

People are better at seeing height differences than angle and area differences.

This how to obtain the same plots.

#Step1: Loading the library
library(ggplot2)

#Step2: Create a simple dataframe
data <- data.frame(
  name=c("A","B","C","D","E") ,  
  value=c(3,12,5,18,45)
  )

Plotting Quantities

People are better at seeing height differences than angle and area differences

This how to obtain the same plots.

ggplot(data, aes(x = "", 
                 y = value, 
                 fill = name)) +
  geom_col() +
  coord_polar(theta = "y") +
  labs(fill = "Individual") +
  theme_void()

ggplot(data, aes(x = name, 
                 y = value, 
                 fill = name)) +
  geom_col() +
  labs(fill = "Individual") +
  theme_void()

Advice for Barplots

Always start at zero to be transparent about your data

It may sometimes be more informative to visualize data using other tools than barplots

Let us download again: Life expectancy and Urbanization Data

Advice for Barplots

For example, let’s say we want to compare life expectancy in Latin America with EU

#Setting path
library("dplyr")
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week4/data/")
#Step1: Loading the data
life_expectancy <- read.csv(file = './life-expectancy.csv')
urbanization <- read.csv(file = './share-of-population-urban.csv')
#Step2: Removing countries with no 3-letter code
life_expectancy2<-subset(life_expectancy, Code!="")
#Step3: Changing variable name
names(life_expectancy2)[names(life_expectancy2)=="Life.expectancy.at.birth..historical."]<-"life_exp"
#Step4: Selecting only vars of interest
life_expectancy3<-subset(life_expectancy2, selec=c("Entity", "Code", "Year", "life_exp"))
#Step5: Removing countries with no 3-letter code
urbanization2<-subset(urbanization, Code!="")
#Step6: Changing variable name
names(urbanization2)[names(urbanization2)=="Urban.population....of.total.population."]<-"urb"
#Step7: Selecting only vars of interest
urbanization3<-subset(urbanization2, selec=c("Code", "Year", "urb"))
#Step8: Performing a merge
final<-left_join(life_expectancy3, urbanization3, by = c("Code"="Code",
                                                         "Year"="Year"))
#Step9: Removing NAs
final2<-final[complete.cases(final), ]
#Step10: Defining continents
#EU Countries
eu_countries<-c("Austria",
                "Belgium",
                "Bulgaria",
                "Croatia",
                "Cyprus",
                "Czechia",
                "Denmark",
                "Estonia",
                "Finland",
                "France",
                "Germany",
                "Greece",
                "Hungary",
                "Ireland",
                "Italy",
                "Latvia",
                "Lithuania",
                "Luxembourg",
                "Malta",
                "Netherlands",
                "Poland",
                "Portugal",
                "Romania",
                "Slovakia",
                "Slovenia",
                "Spain",
                "Sweden")

latam_countries<-c("Belize",
                   "Costa Rica",
                   "El Salvador",
                   "Guatemala",
                   "Honduras",
                   "Mexico",
                   "Nicaragua",
                   "Panama",
                   "Argentina",
                   "Bolivia",
                   "Brazil",
                   "Chile",
                   "Colombia",
                   "Ecuador",
                   "Guyana",
                   "Paraguay",
                   "Peru",
                   "Suriname",
                   "Uruguay",
                   "Venezuela",
                   "Cuba",
                   "Dominican Republic",
                   "Haiti")

#Step11: Labeling continents 
final2$continent[final2$Entity  %in% eu_countries]<-"EU"
final2$continent[final2$Entity  %in% latam_countries]<-"Latin America"
final2$continent[is.na(final2$continent)]<-"Everything Else"

Advice for Barplots

For example, let’s say we want to compare life expectancy in Latin America with EU

library(dplyr)
averages<-final2%>%
  group_by(continent)%>%
  dplyr::summarize(life_exp_mean=mean(life_exp, na.rm=TRUE))

Advice for Barplots

For example, let’s say we want to compare life expectancy in Latin America, EU, and the Rest of the World

ggplot(averages, aes(x = continent, 
                 y = life_exp_mean, 
                 fill = continent)) +
  geom_col() +
  labs(fill = "Individual") +
  theme_void()

Advice for Barplots

A more compelling way is: boxplots with points

ggplot(final2, aes(x = continent, 
                   y = life_exp,
                   color = continent)) +
  geom_boxplot()+
  geom_point(position = position_jitter(height = 0), 
             alpha = 0.05) +
  guides(color = "none")

Advice for Barplots

Another way is to combine violin with points

ggplot(final2, aes(x = continent, 
                   y = life_exp,
                   color = continent)) +
  geom_violin()+
  geom_point(position = position_jitter(height = 0), 
             alpha = 0.05) +
  guides(color = "none")

Advice for Barplots

Another way is to have overlapping ridgeplots

library(ggridges)
ggplot(final2, aes(x = life_exp, 
                   y = continent,
                   fill = continent)) +
  geom_density_ridges() +
  guides(color = "none")

Advice for Barplots

Another way is to have all of them superimposed

library(ggridges)
ggplot(final2, aes(x = life_exp, 
                   fill = continent)) +
  geom_density(alpha = 0.5)+
  guides(color = "none")

Plotting Uncertainty

As discussed it is good to add more information to your graphs to display the whole distribution of numbers

For example, the right is better than the left

Plotting Uncertainty

This is the code for left and right

ggplot(averages, aes(x = continent, 
                 y = life_exp_mean, 
                 fill = continent)) +
  geom_col() +
  labs(fill = "Individual") +
  theme_bw()

ggplot(final2, aes(x = life_exp, 
                   fill = continent)) +
  geom_histogram(binwidth = 2, 
                 color = "white") +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()

Plotting Uncertainty

It could also be helpful to play with the binwidth: binwidth = 2

ggplot(final2, aes(x = life_exp, 
                   fill = continent)) +
  geom_histogram(binwidth = 2, 
                 color = "white") +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()

Plotting Uncertainty

It could also be helpful to play with the binwidth: binwidth = 10

ggplot(final2, aes(x = life_exp, 
                   fill = continent)) +
  geom_histogram(binwidth = 10, 
                 color = "white") +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()

Plotting Uncertainty

We can obtain something similar with densities: they are a smoothed version of the histogram.

ggplot(final2, aes(x = life_exp, 
                   fill = continent)) +
  geom_density() +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()

Plotting Uncertainty

The difference is that one should count and the other one density.

The second shows the probability density function (PDF) of the variable: use calculus to find the probability of each x value

Plotting Uncertainty

We can obviously also plot them together

ggplot(final2, aes(x = life_exp, 
                   fill = continent)) +
  geom_histogram(binwidth = 2, 
                 color = "white") +
  #scale the density to a similar scale to the histogram:
  #in this case, I multiply by 4000
  #note also aes(y = ..density..* 4000)
  geom_density(aes(y = ..density..* 4000), 
               alpha = 0.5)+
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/4000, 
                                name = "density"))

Plotting Uncertainty

Having a closer look at the code

ggplot(final2, aes(x = life_exp, 
                   fill = continent)) +
  #geom_histogram with the same parameters
  geom_histogram(binwidth = 2, 
                 color = "white") +
  #note the aes(y = ..density..* 4000)
  #scale the density to a similar scale to the histogram
  #in this case, I multiply by 4000
  #otherwise it will not be visible
  #density calculates relative frequency:
  #count / sum(count): e.g. 385/1403
  geom_density(aes(y = ..density..* 4000), 
               alpha = 0.5)+
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/4000, 
                                name = "density"))

Why use a density curve vs. a histogram?

A histogram shows the counts of values in each range
It is made up of bars that touch each other

A density plot shows the proportion of values in each range
It is a smooth curve that shows the distribution of the data in a more continuous way

Frequencies and Densities

Box Plots

Here is a boxlot for life expectancy for the entire dataset

ggplot(final2, aes(x = life_exp)) +
  geom_boxplot()+
  labs(x = "Life Expectancy") +
  theme_bw()

Box Plots

Here is the interpretation

Box Plots

This is what the histogram look like:

Box Plots

This is what the density function looks like:

Box Plots

This is what the actual values look like:

Box Plots

And this is what they look like together:

Violin Plots

ggplot(final2, aes(x = life_exp, y="")) +
  geom_violin()+
  geom_boxplot()+
  labs(x = "Life Expectancy") +
  theme_bw()

Plotting Uncertainty in Scatterplots

We have already covered this, but here is the relationship between urbanization and life expectancy

setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week4/data/")
#Step1: Loading the data
life_exp_urb <- read.csv(file = './life_exp_urb.csv')

Plotting Uncertainty in Scatterplots

We have already covered this, but here is the relationship between urbanization and life expectancy

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
    geom_point()

Plotting Uncertainty in Scatterplots

And here are the slopes for the three groups

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
    geom_point()+
    geom_smooth(method = "lm", se=FALSE)

Plotting Uncertainty in Scatterplots

And here are the confidence intervals for the three slopes

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
    geom_point()+
    geom_smooth(method = "lm")

Annotations

We also need to discuss text in plots

On the most basic level, we can label actual data points with the help of options such as:

geom_text()
geom_label()
geom_text_repel()

We can also add annotations with annotate()

Annotations

geom_text is not great when we have too many points

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_text(aes(label = Entity))

Annotations

geom_label is not great when we have too many points. It covers the points

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_label(aes(label = Entity))

Annotations

ggrepel is one solution: geom_text_repel

library(ggrepel)
ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_text_repel(aes(label = Entity))

Annotations

ggrepel is one solution: geom_label_repel

library(ggrepel)
ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_label_repel(aes(label = Entity))

Highlighting

We can highlight specific observations in our data

#Creating a Variable to Identify Obs
life_exp_urb$eu<-ifelse(life_exp_urb$type=="EU", 1, 0)
#Making the variable a factor
life_exp_urb$eu<-as.factor(life_exp_urb$eu)
#Plotting
ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = eu))+
  geom_point()+
  scale_color_manual(values = c("grey70", 
                                "red"))+
  guides(color = "none")

Annotating Graphs

This is how we can add text in the graph

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = eu))+
  geom_point()+
  scale_color_manual(values = c("grey70", 
                                "red"))+
  guides(color = "none")+
  annotate(geom = "text",
           x = 75, y = 45,
           label = "EU Countries", color="red")

Annotating Graphs

This is how we can add a rectangle around our points of interest

xmin_val<-min(life_exp_urb$urb_mean[life_exp_urb$eu==1])
xmax_val<-max(life_exp_urb$urb_mean[life_exp_urb$eu==1])
ymin_val<-min(life_exp_urb$life_exp_mean[life_exp_urb$eu==1])
ymax_val<-max(life_exp_urb$life_exp_mean[life_exp_urb$eu==1])

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = eu))+
  geom_point()+
  scale_color_manual(values = c("grey70", 
                                "red"))+
  guides(color = "none")+
  annotate(geom = "text",
           x = 75, y = 45,
           label = "EU Countries", color="red")+
  annotate(geom = "rect", 
           xmin = xmin_val, 
           xmax = xmax_val,
           ymin = ymin_val, 
           ymax = ymax_val,
           fill = "red", alpha = 0.2)

Annotating Graphs

This is how we can add an arrow towards the rectangle

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = eu))+
  geom_point()+
  scale_color_manual(values = c("grey70", 
                                "red"))+
  guides(color = "none")+
  annotate(geom = "text",
           x = 75, y = 45,
           label = "EU Countries", color="red")+
  annotate(geom = "rect", 
           xmin = xmin_val, 
           xmax = xmax_val,
           ymin = ymin_val, 
           ymax = ymax_val,
           fill = "red", alpha = 0.2)+
  annotate(geom = "segment", 
           x = 75, 
           xend = 75, 
           y = 45, 
           yend = ymin_val,
           arrow = arrow(
             length = unit(0.1, "in")))

Annotating Graphs

Or let’s turn the text into a label

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = eu))+
  geom_point()+
  scale_color_manual(values = c("grey70", 
                                "red"))+
  guides(color = "none")+
  annotate(geom = "label",
           x = 75, 
           y = 45,
           label = "EU Countries", 
           color="red")+
  annotate(geom = "rect", 
           xmin = xmin_val, 
           xmax = xmax_val,
           ymin = ymin_val, 
           ymax = ymax_val,
           fill = "red", alpha = 0.2)+
  annotate(geom = "segment", 
           x = 75, 
           xend = 75, 
           y = 45, 
           yend = ymin_val,
           arrow = arrow(
             length = unit(0.1, "in")))

Visualizing Time

You may also want to plot temporal data

To do that we need to load the life expectancy data over time

#Step1: Setting path
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week4/data/")
#Step2: Loading the data
life_expectancy <- read.csv(file = './life-expectancy.csv')

Visualizing Time

Let us look again at the data

head(life_expectancy, n=3)

       Entity Code Year Life.expectancy.at.birth..historical.
1 Afghanistan  AFG 1950                                  27.7
2 Afghanistan  AFG 1951                                  28.0
3 Afghanistan  AFG 1952                                  28.4

Temporal Data

Let us imagine that we want to plot life expectancy over time

We first need to calculate average by year

life_expectancy2<-life_expectancy%>%
  group_by(Year)%>%
  dplyr::summarize(life_exp_mean=mean(Life.expectancy.at.birth..historical.))

Let us look at the transformed data.

head(life_expectancy2, n=3)

# A tibble: 3 × 2
   Year life_exp_mean
  <int>         <dbl>
1  1543          33.9
2  1548          38.8
3  1553          39.6

Temporal Data

We are now ready to plot the time trend

ggplot(life_expectancy2, aes(x = Year, y = life_exp_mean)) +
  geom_line()

Temporal Data

We can also produce an animation

#Note: you should install:
#install.packages("av")
#install.packages("magick")
#install.packages("gifski")
#install.packages("gganimate")
library(gganimate)
ggplot(life_expectancy2, aes(x = Year, y = life_exp_mean)) +
  geom_line()+
    labs(title = 'Year: {frame_along}', 
       x = 'Urbanization', 
       y = 'Life Expectancy') +
  transition_reveal(Year)

Temporal Data

To view the upward sloping trend in life expectancy, it might be worth aggregating some of the years.

ggplot(life_expectancy2, aes(x = Year, y = life_exp_mean)) +
  geom_line()+
  geom_smooth()

Temporal Data: Two Countries

Let us imagine that we want to compare Italy and the US in life expectancy

This means we go back to the original file and select the US and Italy

#Subsetting the data
life_expectancy_itus<-subset(life_expectancy, Entity %in% c("Italy", "United States"))
#Selecting only after 1900
life_expectancy_itus<-subset(life_expectancy_itus, Year>1899)
#Renaming life expectancy variable
names(life_expectancy_itus)[names(life_expectancy_itus)=="Life.expectancy.at.birth..historical."]<-"life_exp"

Let’s examine the data

head(life_expectancy_itus, n=4)

     Entity Code Year life_exp
8583  Italy  ITA 1900    41.67
8584  Italy  ITA 1901    43.54
8585  Italy  ITA 1902    42.99
8586  Italy  ITA 1903    43.11

Temporal Data: Two Countries

We can now plot life expectancy for the two countries over time

#Plotting
ggplot(life_expectancy_itus, 
       aes(x=Year, y= life_exp, color = Entity))+
  geom_line()+
  scale_colour_viridis_d()

Temporal Data: Two Countries

Changing the starting value of our y-axis can change the appearance of our data

Temporal Data: Two Countries

Changing the starting value of our y-axis can change the appearance of our data

ggplot(life_expectancy_itus, 
       aes(x=Year, 
           y= life_exp, 
           color = Entity))+
  geom_line()+
  scale_colour_viridis_d()

ggplot(life_expectancy_itus, 
       aes(x=Year, 
           y= life_exp, 
           color = Entity))+
  geom_line()+
  scale_colour_viridis_d()+
  ylim(0, NA)

Temporal Data Annotations

Annotations work very well with temporal data

ggplot(life_expectancy_itus, 
       aes(x=Year, 
           y= life_exp, 
           color = Entity))+
  geom_line()+
  scale_colour_viridis_d()+
  annotate(geom = "label",
           x = 1918, y = 80,
           label = "WW1", 
           color="red")+
  annotate(geom = "rect", 
           xmin = 1917, 
           xmax = 1920,
           ymin = 0, 
           ymax = Inf,
           fill = "red", 
           alpha = 0.2)

Visualizing Time

Another cool way to visualize data is to create density ridges for every year

We will however want fewer years to make the graph easier to read

Visualizing Time

Here is how we go about it:

#Step1: Load the library
library("forcats")
#Step2: Subset the data to fewer years
final3<-subset(final2, Year>1980)
#Step3: Turn year into factor
final3$year_factor<-as.factor(final3$Year)
#Step4: Plot
ggplot(final3, 
       aes(x = life_exp, 
           y = fct_rev(as.factor(Year)), fill = ..x..))+
  geom_density_ridges_gradient(color = "white", 
                               quantile_lines = TRUE)+
  labs(x = "Life expectancy", y = NULL, fill = NULL)+
  scale_fill_viridis_c(name = "")

Visualizing Time

This produces:

Visualizing Time

We can see this way the first, second, and third quartile

Visualizing Time

We also see that the life expectancy gets narrower over time: most people live longer lives

Saving Figures

One final aspect related to visualization is how we save the data

The command for this ggsave

Let us look at a few options

Saving Figures

Let’s imagine we want to save the following plot

ggplot(life_expectancy_itus, 
       aes(x=Year, 
           y= life_exp, 
           color = Entity))+
  geom_line()+
  scale_colour_viridis_d()+
  annotate(geom = "label",
           x = 1918, y = 80,
           label = "WW1", 
           color="red")+
  annotate(geom = "rect", 
           xmin = 1917, 
           xmax = 1920,
           ymin = 0, 
           ymax = Inf,
           fill = "red", 
           alpha = 0.2)

Saving Figures

Let’s imagine we want to save the following plot

Saving Figures

Let’s imagine we want to save the following plot

We first create an object: pic1

pic1<-ggplot(life_expectancy_itus, 
       aes(x=Year, 
           y= life_exp, 
           color = Entity))+
  geom_line()+
  scale_colour_viridis_d()+
  annotate(geom = "label",
           x = 1918, y = 80,
           label = "WW1", 
           color="red")+
  annotate(geom = "rect", 
           xmin = 1917, 
           xmax = 1920,
           ymin = 0, 
           ymax = Inf,
           fill = "red", 
           alpha = 0.2)

Saving Figures

The next step is to use the ggsave function.

This will save the object in your directory

#Saving as PDF
ggsave(pic1, filename = "./figures_lecture8/pic1.pdf")

#Saving as PNG
ggsave(pic1, filename = "./figures_lecture8/pic1.png")

#Saving as JPG
ggsave(pic1, filename = "./figures_lecture8/pic1.jpg")

Saving Figures

It is important to be aware how much space these files take

pic1.pdf - 7KB
pic1.png - 119KB
pic.jpg - 410KB

Saving Figures

Depending on your goal, you may want to opt for one of the three

For web, pngs are good

For academic publications, jpgs are acceptable

Saving Figures

One should also be aware that width and heigh options, as part of the ggsave function will alter the appearance of your figure

#Saving as JPG
ggsave(pic1, filename = "./figures_lecture8/pic1_square.jpg", 
       width = 20, 
       height = 20, 
       units = "cm")

#Saving as JPG
ggsave(pic1, filename = "./figures_lecture8/pic1_rect.jpg", 
       width = 20, 
       height = 10, 
       units = "cm")

Saving Figures

Here is what they look like:

Saving Figures

For publication quality figures, you should use a dpi of 300

#Saving as JPG
ggsave(pic1, filename = "./figures_lecture8/pic1_300dpi.jpg", 
       width = 20, 
       height = 20, 
       units = "cm",
       dpi=300)

And here is 50 dpi

#Saving as JPG
ggsave(pic1, filename = "./figures_lecture8/pic1_50dpi.jpg", 
       width = 20, 
       height = 20, 
       units = "cm",
       dpi=50)

Saving Figures

Here is what they look like:

Saving Figures

The take-away is that higher dpi result in greater clarity at the expense of space

Lower dpi means lower clarity and lower space:

pic1_50dpi - 25KB
pic1_300dpi - 471KB

Conclusion

We have covered more ground in data visualization today:

different ways of plotting quantities: pie charts vs. bar charts
boxplots, violin plots, histograms, and density plots
the relationship between histograms and density plots
labelling and annotating graphs
highlighting graphs
visualizing time
saving figures

Exercise 1: Function for Scatter Plot

Write a function called scatter_plot that takes two vectors x and y as inputs and generates a scatter plot using ggplot2.

The x and y look like:

# Example usage:
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 5, 7, 11)

scatter_plot(x, y)

Exercise 2: Function for Histogram

Write a function called histogram_plot that takes a vector data as input and generates a histogram using ggplot2.

set.seed(123)
data <- rnorm(100, mean = 0, sd = 1)
histogram_plot(data)

Exercise 3: Produce a time visualization like below:

Exercise 4:

Create the following dataframe:

# Load required library
library(ggplot2)

# Generate a fun dataset
set.seed(123) # for reproducibility
n_creatures <- 50

creatures <- data.frame(
  type = rep(c("Dragon", "Unicorn", "Yeti"), each = n_creatures),
  cuteness = c(rnorm(n_creatures, mean = 8, sd = 2),
               rnorm(n_creatures, mean = 7, sd = 2),
               rnorm(n_creatures, mean = 5, sd = 2)),
  fierceness = c(rnorm(n_creatures, mean = 7, sd = 2),
                 rnorm(n_creatures, mean = 5, sd = 2),
                 rnorm(n_creatures, mean = 8, sd = 2))
)

Exercise 4:

Produce the following graph