Data Visualization 2

Bogdan G. Popescu

John Cabot University

Intro

Previously, we covered a lot of ground:

  • importance of color choices
  • mapping data to aesthetics: points
  • facets, coordinates, labels, themes
  • theme option manipulation
  • creating time animations

Preparing the Data

Before we use ggplot, our data has to be tidy

This means:

  • Each variable is a column
  • Each observation has its own row
  • Each value has its own cell
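
As a rough illustration of what tidying looks like, the sketch below reshapes a small wide table into tidy form. The table and its values are made up for the example, and it assumes the tidyr package is installed (tidyr is not used elsewhere in these slides).

```r
library(tidyr)

# Hypothetical wide table: one row per country, one column per year
wide <- data.frame(
  country = c("A", "B"),
  `2000` = c(70.1, 65.3),
  `2010` = c(72.4, 68.0),
  check.names = FALSE
)

# Reshape so that each variable is a column and
# each country-year observation has its own row
tidy <- pivot_longer(wide,
                     cols = c("2000", "2010"),
                     names_to = "year",
                     values_to = "life_exp")

# The year labels come out as text, so convert them to integers
tidy$year <- as.integer(tidy$year)
```

The tidy version has one row per country and year, which is the shape ggplot expects when year goes on one axis and life_exp on the other.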

Plotting Quantities

People judge differences in height more accurately than differences in angle or area.

This is how to obtain the same plots.

#Step1: Loading the library
library(ggplot2)

#Step2: Create a simple dataframe
data <- data.frame(
  name=c("A","B","C","D","E") ,  
  value=c(3,12,5,18,45)
  )

#Step3: Pie chart (values encoded as angles and areas)
ggplot(data, aes(x = "", 
                 y = value, 
                 fill = name)) +
  geom_col() +
  coord_polar(theta = "y") +
  labs(fill = "Individual") +
  theme_void()

#Step4: Bar chart (values encoded as heights)
ggplot(data, aes(x = name, 
                 y = value, 
                 fill = name)) +
  geom_col() +
  labs(fill = "Individual") +
  theme_void()

Advice for Barplots

  • Always start the value axis at zero: bar length encodes the value, so a truncated axis exaggerates differences
  • Other chart types are sometimes more informative than barplots
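
To see why the zero baseline matters, here is a minimal sketch with two made-up values. The data frame `df` and the axis limits are assumptions chosen only to make the effect visible.

```r
library(ggplot2)

# Hypothetical values chosen only to illustrate the effect
df <- data.frame(group = c("A", "B"), value = c(70, 75))

# Bars drawn from zero: the 5-unit gap looks as small as it is
p_zero <- ggplot(df, aes(x = group, y = value)) +
  geom_col()

# Same data, but the visible range is cut to roughly 68-76:
# bar B now looks much taller than bar A
p_truncated <- p_zero +
  coord_cartesian(ylim = c(68, 76))

p_zero
p_truncated
```

Both plots show the same numbers. Only the second one suggests a large difference, because most of each bar has been cut away.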

For example, let’s say we want to compare life expectancy in Latin America with that in the EU.

#Loading the library
library(dplyr)
#Setting path
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week4/data/")
#Step1: Loading the data
life_expectancy <- read.csv(file = './life-expectancy.csv')
urbanization <- read.csv(file = './share-of-population-urban.csv')
#Step2: Removing countries with no 3-letter code
life_expectancy2<-subset(life_expectancy, Code!="")
#Step3: Changing variable name
names(life_expectancy2)[names(life_expectancy2)=="Life.expectancy.at.birth..historical."]<-"life_exp"
#Step4: Selecting only vars of interest
life_expectancy3<-subset(life_expectancy2, select=c("Entity", "Code", "Year", "life_exp"))
#Step5: Removing countries with no 3-letter code
urbanization2<-subset(urbanization, Code!="")
#Step6: Changing variable name
names(urbanization2)[names(urbanization2)=="Urban.population....of.total.population."]<-"urb"
#Step7: Selecting only vars of interest
urbanization3<-subset(urbanization2, select=c("Code", "Year", "urb"))
#Step8: Performing a merge
final<-left_join(life_expectancy3, urbanization3, by = c("Code"="Code",
                                                         "Year"="Year"))
#Step9: Removing NAs
final2<-final[complete.cases(final), ]
#Step10: Defining continents
#EU Countries
eu_countries<-c("Austria",
                "Belgium",
                "Bulgaria",
                "Croatia",
                "Cyprus",
                "Czechia",
                "Denmark",
                "Estonia",
                "Finland",
                "France",
                "Germany",
                "Greece",
                "Hungary",
                "Ireland",
                "Italy",
                "Latvia",
                "Lithuania",
                "Luxembourg",
                "Malta",
                "Netherlands",
                "Poland",
                "Portugal",
                "Romania",
                "Slovakia",
                "Slovenia",
                "Spain",
                "Sweden")

#Latin American countries
latam_countries<-c("Belize",
                   "Costa Rica",
                   "El Salvador",
                   "Guatemala",
                   "Honduras",
                   "Mexico",
                   "Nicaragua",
                   "Panama",
                   "Argentina",
                   "Bolivia",
                   "Brazil",
                   "Chile",
                   "Colombia",
                   "Ecuador",
                   "Guyana",
                   "Paraguay",
                   "Peru",
                   "Suriname",
                   "Uruguay",
                   "Venezuela",
                   "Cuba",
                   "Dominican Republic",
                   "Haiti")

#Step11: Labeling continents (anything not listed stays "Everything Else")
final2$continent<-"Everything Else"
final2$continent[final2$Entity %in% eu_countries]<-"EU"
final2$continent[final2$Entity %in% latam_countries]<-"Latin America"
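
The join, NA removal, and labeling steps can be tried on a toy example before running them on the real files. The two small tables below are hypothetical stand-ins for the CSV files, and their values are invented for illustration.

```r
library(dplyr)

# Hypothetical toy tables standing in for the two CSV files
life <- data.frame(
  Entity = c("Italy", "Italy", "Peru", "Peru"),
  Code = c("ITA", "ITA", "PER", "PER"),
  Year = c(2000, 2010, 2000, 2010),
  life_exp = c(79.4, 81.9, 70.5, 73.9)
)
urb <- data.frame(
  Code = c("ITA", "ITA", "PER"),
  Year = c(2000, 2010, 2000),
  urb = c(67.2, 68.3, 73.0)
)

# Join on both keys; Peru 2010 has no match, so urb is NA there
joined <- left_join(life, urb, by = c("Code", "Year"))

# Keep only rows with no missing values
complete <- joined[complete.cases(joined), ]

# Label groups, with a default for anything not listed
complete$group <- "Everything Else"
complete$group[complete$Entity %in% c("Italy")] <- "EU"
complete$group[complete$Entity %in% c("Peru")] <- "Latin America"

# Count rows per group as a quick sanity check
table(complete$group)
```

Checking the row counts before and after each step is a cheap way to notice a join that silently dropped or duplicated observations.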

#Step12: Computing average life expectancy by group
library(dplyr)
averages<-final2%>%
  group_by(continent)%>%
  dplyr::summarize(life_exp_mean=mean(life_exp, na.rm=TRUE))

A barplot of the averages compares life expectancy in Latin America, the EU, and the rest of the world:

ggplot(averages, aes(x = continent, 
                     y = life_exp_mean, 
                     fill = continent)) +
  geom_col() +
  labs(fill = "Continent") +
  theme_void()

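A more informative alternative is a boxplot overlaid with the individual points, which shows the spread within each group rather than only its mean: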

ggplot(final2, aes(x = continent, 
                   y = life_exp,
                   color = continent)) +
  geom_boxplot()+
  geom_point(position = position_jitter(height = 0), 
             alpha = 0.05) +
  guides(color = "none")
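
A small sketch of why the points help: the two made-up groups below have the same mean, so a barplot of averages would show two identical bars, while the boxplot with points reveals very different spreads. The data frame `toy` and its values are assumptions for illustration only.

```r
library(dplyr)
library(ggplot2)

# Hypothetical groups with identical means but different spreads
toy <- data.frame(
  group = rep(c("Narrow", "Wide"), each = 5),
  value = c(68, 69, 70, 71, 72,
            50, 60, 70, 80, 90)
)

# The means are equal, so bars of the averages would look the same
toy_means <- toy %>%
  group_by(group) %>%
  summarize(mean_value = mean(value), sd_value = sd(value))

# Boxplot plus jittered points shows the difference in spread
p_box <- ggplot(toy, aes(x = group, y = value, color = group)) +
  geom_boxplot() +
  geom_point(position = position_jitter(height = 0, width = 0.1)) +
  guides(color = "none")

toy_means
p_box
```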