Data Visualization 2

Bogdan G. Popescu

John Cabot University

Intro

Previously, we covered quite a bit of ground:

  • importance of color choices
  • mapping data to aesthetics: points
  • facets, coordinates, labels, themes
  • theme option manipulation
  • creating time animations

Preparing the Data

Before we use ggplot, our data has to be tidy

This means:

  • Each variable is a column
  • Each observation has its own row
  • Each value has its own cell
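As a minimal sketch of what tidying can look like, the example below reshapes a small wide table (one column per year) into one row per observation. The table and its values are made up for illustration, and the use of `tidyr::pivot_longer()` is an assumption, since the slides do not say which tool is used for reshaping.

```r
library(tidyr)

# Made-up wide table: one column per year, so it is not tidy
wide <- data.frame(
  country = c("A", "B"),
  `2000` = c(70, 75),
  `2010` = c(72, 78),
  check.names = FALSE
)

# One row per country-year observation, one column per variable
tidy <- pivot_longer(
  wide,
  cols = -country,
  names_to = "year",
  values_to = "life_exp"
)
tidy$year <- as.integer(tidy$year)
```

After reshaping, `year` and `life_exp` are ordinary columns that can be mapped to aesthetics in ggplot.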

Plotting Quantities

People are better at seeing height differences than angle and area differences

This is how to obtain the same plots (a pie chart and a bar chart).

#Step1: Loading the library
library(ggplot2)

#Step2: Create a simple dataframe
data <- data.frame(
  name=c("A","B","C","D","E") ,  
  value=c(3,12,5,18,45)
  )

#Pie chart: values are encoded as angles and areas
ggplot(data, aes(x = "", 
                 y = value, 
                 fill = name)) +
  geom_col() +
  coord_polar(theta = "y") +
  labs(fill = "Individual") +
  theme_void()

#Bar chart: values are encoded as heights
ggplot(data, aes(x = name, 
                 y = value, 
                 fill = name)) +
  geom_col() +
  labs(fill = "Individual") +
  theme_void()

Advice for Barplots

  • Always start at zero to be transparent about your data
  • It may sometimes be more informative to visualize data using other tools than barplots

Advice for Barplots

For example, let’s say we want to compare life expectancy in Latin America with that in the EU

#Setting path
library("dplyr")
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week4/data/")
#Step1: Loading the data
life_expectancy <- read.csv(file = './life-expectancy.csv')
urbanization <- read.csv(file = './share-of-population-urban.csv')
#Step2: Removing countries with no 3-letter code
life_expectancy2<-subset(life_expectancy, Code!="")
#Step3: Changing variable name
names(life_expectancy2)[names(life_expectancy2)=="Life.expectancy.at.birth..historical."]<-"life_exp"
#Step4: Selecting only vars of interest
life_expectancy3<-subset(life_expectancy2, select=c("Entity", "Code", "Year", "life_exp"))
#Step5: Removing countries with no 3-letter code
urbanization2<-subset(urbanization, Code!="")
#Step6: Changing variable name
names(urbanization2)[names(urbanization2)=="Urban.population....of.total.population."]<-"urb"
#Step7: Selecting only vars of interest
urbanization3<-subset(urbanization2, select=c("Code", "Year", "urb"))
#Step8: Performing a merge
final<-left_join(life_expectancy3, urbanization3, by = c("Code"="Code",
                                                         "Year"="Year"))
#Step9: Removing NAs
final2<-final[complete.cases(final), ]
#Step10: Defining continents
#EU Countries
eu_countries<-c("Austria",
                "Belgium",
                "Bulgaria",
                "Croatia",
                "Cyprus",
                "Czechia",
                "Denmark",
                "Estonia",
                "Finland",
                "France",
                "Germany",
                "Greece",
                "Hungary",
                "Ireland",
                "Italy",
                "Latvia",
                "Lithuania",
                "Luxembourg",
                "Malta",
                "Netherlands",
                "Poland",
                "Portugal",
                "Romania",
                "Slovakia",
                "Slovenia",
                "Spain",
                "Sweden")

#Latin American Countries
latam_countries<-c("Belize",
                   "Costa Rica",
                   "El Salvador",
                   "Guatemala",
                   "Honduras",
                   "Mexico",
                   "Nicaragua",
                   "Panama",
                   "Argentina",
                   "Bolivia",
                   "Brazil",
                   "Chile",
                   "Colombia",
                   "Ecuador",
                   "Guyana",
                   "Paraguay",
                   "Peru",
                   "Suriname",
                   "Uruguay",
                   "Venezuela",
                   "Cuba",
                   "Dominican Republic",
                   "Haiti")

#Step11: Labeling continents
#Start with the default label, then overwrite it for the two groups of interest
final2$continent<-"Everything Else"
final2$continent[final2$Entity  %in% eu_countries]<-"EU"
final2$continent[final2$Entity  %in% latam_countries]<-"Latin America"
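The labeling step can also be written as a single expression. The sketch below uses `dplyr::case_when()` on a small made-up data frame; the shortened country vectors and the `toy` data frame are hypothetical stand-ins for `eu_countries`, `latam_countries`, and `final2`.

```r
library(dplyr)

# Shortened, illustrative stand-ins for the full country vectors
eu <- c("Italy", "France")
latam <- c("Chile", "Peru")

# Made-up data frame with the same Entity column as final2
toy <- data.frame(Entity = c("Italy", "Chile", "Japan"))

# Conditions are checked in order; TRUE acts as the fallback
toy <- toy %>%
  mutate(continent = case_when(
    Entity %in% eu ~ "EU",
    Entity %in% latam ~ "Latin America",
    TRUE ~ "Everything Else"
  ))
```

This gives the same three labels as the step-by-step assignment, with the fallback category stated explicitly.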

Advice for Barplots

For example, let’s say we want to compare life expectancy in Latin America with that in the EU

library(dplyr)
averages<-final2%>%
  group_by(continent)%>%
  dplyr::summarize(life_exp_mean=mean(life_exp, na.rm=TRUE))

Advice for Barplots

For example, let’s say we want to compare life expectancy in Latin America, the EU, and the rest of the world

ggplot(averages, aes(x = continent, 
                 y = life_exp_mean, 
                 fill = continent)) +
  geom_col() +
  labs(fill = "Continent") +
  theme_void()

Advice for Barplots

A more informative alternative is a boxplot with the individual points overlaid

ggplot(final2, aes(x = continent, 
                   y = life_exp,
                   color = continent)) +
  geom_boxplot()+
  geom_point(position = position_jitter(height = 0), 
             alpha = 0.05) +
  guides(color = "none")

Advice for Barplots

Another way is to combine violin with points

ggplot(final2, aes(x = continent, 
                   y = life_exp,
                   color = continent)) +
  geom_violin()+
  geom_point(position = position_jitter(height = 0), 
             alpha = 0.05) +
  guides(color = "none")

Advice for Barplots

Another way is to have overlapping ridgeplots

library(ggridges)
ggplot(final2, aes(x = life_exp, 
                   y = continent,
                   fill = continent)) +
  geom_density_ridges() +
  guides(fill = "none")  # Turn off legend (the mapped aesthetic is fill)

Advice for Barplots

Another way is to superimpose all the densities in a single panel

#The fill legend is kept: it is the only way to tell the curves apart
ggplot(final2, aes(x = life_exp, 
                   fill = continent)) +
  geom_density(alpha = 0.5)

Plotting Uncertainty

As discussed, it is good to add more information to your graphs so that they display the whole distribution of values, not just a summary

For example, the plot on the right is better than the one on the left

Plotting Uncertainty

This is the code for the left and right plots

#Left: bar chart of the group means
ggplot(averages, aes(x = continent, 
                 y = life_exp_mean, 
                 fill = continent)) +
  geom_col() +
  labs(fill = "Continent") +
  theme_bw()

#Right: histogram of all values, one panel per group
ggplot(final2, aes(x = life_exp, 
                   fill = continent)) +
  geom_histogram(binwidth = 2, 
                 color = "white") +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()

Plotting Uncertainty

It could also be helpful to play with the binwidth: binwidth = 2

ggplot(final2, aes(x = life_exp, 
                   fill = continent)) +
  geom_histogram(binwidth = 2, 
                 color = "white") +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()

Plotting Uncertainty

It could also be helpful to play with the binwidth: binwidth = 10

ggplot(final2, aes(x = life_exp, 
                   fill = continent)) +
  geom_histogram(binwidth = 10, 
                 color = "white") +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()

Plotting Uncertainty

We can obtain something similar with densities: they are a smoothed version of the histogram.

ggplot(final2, aes(x = life_exp, 
                   fill = continent)) +
  geom_density() +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()

Plotting Uncertainty

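The difference is that one shows counts and the other shows density.

The second shows the estimated probability density function (PDF) of the variable: the probability that a value falls in a given range is the area under the curve over that range (an integral), not the height of the curve at a single x value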
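To make the link between density and probability concrete, here is a small sketch in base R. It uses simulated values rather than the course data, and the helper `area()` is a hypothetical trapezoid-rule function written only for this illustration.

```r
set.seed(1)
# Simulated values, not the life expectancy data
x <- rnorm(1000, mean = 70, sd = 8)

# Kernel density estimate, evaluated on a grid of x values
d <- density(x)

# Trapezoid rule: approximate the area under a curve given as (xs, ys)
area <- function(xs, ys) {
  sum(diff(xs) * (head(ys, -1) + tail(ys, -1)) / 2)
}

# The whole curve encloses an area close to 1
total <- area(d$x, d$y)

# Probability of a value between 60 and 70 = area over that interval
in_range <- d$x >= 60 & d$x <= 70
p_60_70 <- area(d$x[in_range], d$y[in_range])

# Observed share of simulated values in the same interval, for comparison
share <- mean(x >= 60 & x <= 70)
```

The two numbers `p_60_70` and `share` should be close, which is the sense in which the density curve describes the proportion of values in a range.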

Plotting Uncertainty

We can also plot them together

ggplot(final2, aes(x = life_exp, 
                   fill = continent)) +
  geom_histogram(binwidth = 2, 
                 color = "white") +
  #scale the density to a similar scale to the histogram:
  #in this case, I multiply by 4000
  #note also aes(y = after_stat(density) * 4000)
  #(after_stat() replaces the deprecated ..density.. notation)
  geom_density(aes(y = after_stat(density) * 4000), 
               alpha = 0.5)+
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/4000, 
                                name = "density"))

Plotting Uncertainty

Having a closer look at the code

ggplot(final2, aes(x = life_exp, 
                   fill = continent)) +
  #geom_histogram with the same parameters
  geom_histogram(binwidth = 2, 
                 color = "white") +
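  #note the aes(y = after_stat(density) * 4000)
  #(after_stat() replaces the deprecated ..density.. notation)
  #scale the density to a similar scale to the histogram
  #in this case, I multiply by 4000
  #otherwise it will not be visible
  #density is scaled so that the area under the curve is 1,
  #so its values are much smaller than the bin counts
  #(for a histogram bin: count / (sum(count) * binwidth),
  #e.g. 385/1403 divided by the binwidth)
  geom_density(aes(y = after_stat(density) * 4000), 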
               alpha = 0.5)+
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/4000, 
                                name = "density"))
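The factor of 4000 above is a rough choice. One way to derive a factor instead, sketched below on simulated data, is to multiply the density by the number of observations times the binwidth, which puts the curve on the same scale as the bin counts. The data frame `toy` and the names `bin_width` and `scale_factor` are hypothetical and not part of the original code.

```r
library(ggplot2)

set.seed(1)
# Simulated values standing in for a single panel of the course data
toy <- data.frame(life_exp = rnorm(500, mean = 70, sd = 8))

bin_width <- 2                          # same binwidth as the histogram
scale_factor <- nrow(toy) * bin_width   # observations x binwidth

p <- ggplot(toy, aes(x = life_exp)) +
  geom_histogram(binwidth = bin_width, color = "white") +
  # density * scale_factor is on the same scale as the bin counts
  geom_density(aes(y = after_stat(density * scale_factor)),
               fill = "grey50", alpha = 0.5) +
  # The secondary axis undoes the scaling to show the density values
  scale_y_continuous(name = "count",
                     sec.axis = sec_axis(~ . / scale_factor,
                                         name = "density")) +
  theme_bw()
```

With facets, each panel has a different number of observations, so a single factor can only match one panel exactly. That may be why a round number such as 4000 is a practical compromise when one shared secondary axis is wanted.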

Why use a density curve vs. a histogram?

  • A histogram shows the counts of values in each range (bin)
  • It is made up of bars that touch each other
  • A density plot shows the proportion of values in each range as the area under the curve
  • It is a smooth curve that shows the distribution of the data in a more continuous way

Frequencies and Densities

Box Plots

Here is a boxplot of life expectancy for the entire dataset

ggplot(final2, aes(x = life_exp)) +
  geom_boxplot()+
  labs(x = "Life Expectancy") +
  theme_bw()

Box Plots

Here is the interpretation
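The elements of a boxplot can also be computed by hand. The sketch below assumes the default `geom_boxplot()` conventions (box from the first to the third quartile, line at the median, whiskers reaching the most extreme points within 1.5 times the interquartile range of the box, remaining points drawn as outliers) and uses simulated values, not the course data.

```r
set.seed(1)
# Simulated values plus one deliberately low value
x <- c(rnorm(200, mean = 70, sd = 8), 30)

# Box: first quartile, median, third quartile
q <- unname(quantile(x, c(0.25, 0.5, 0.75)))
iqr <- q[3] - q[1]

# Fences: 1.5 x IQR beyond the edges of the box
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences
whisker_low <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Points outside the fences are drawn individually as outliers
outliers <- x[x < lower_fence | x > upper_fence]
```

Here the value 30 lies below the lower fence, so it would appear as a separate point rather than as part of the whisker.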

Box Plots

This is what the histogram looks like:

Box Plots

This is what the density function looks like:

Box Plots

This is what the actual values look like:

Box Plots

And this is what they look like together:

Violin Plots

ggplot(final2, aes(x = life_exp, y="")) +
  geom_violin()+
  #A narrow boxplot keeps the violin shape visible behind it
  geom_boxplot(width = 0.1)+
  labs(x = "Life Expectancy") +
  theme_bw()