L7: Data Visualization 1

Bogdan G. Popescu

John Cabot University

Intro

Visualization in R is about presenting data in an aesthetically pleasing way

More importantly, visualization is about presenting the data in a truthful way

Finding the Truth

One good way to tell if something is true or not is by using science

Truth vs. Art

However, data visualization can affect perceptions of what is scientific

The scientist can consciously or unconsciously make style choices in a graph that affect such perceptions

Thus, perception of science can sometimes be impacted by aesthetic choices and the way content is presented

Duties of a Data Analyst

  • Not manipulate data and thus, not lie
  • Not misrepresent data
  • Find the right balance between dumbing down and sounding too esoteric when communicating results
    • You are ultimately a translator
  • Emphasize the story and make data available

The Data

At the most basic level, we can present snippets of the data or basic information about the data

To demonstrate how we can present data, let us download: Life expectancy and Urbanization Data

The dataframe once you load it into R looks like the following:

setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week4/data/")
#Step1: Loading the data
life_exp_urb <- read.csv(file = './life_exp_urb.csv')
#Step2: Examining the first five entries
head(life_exp_urb, n=5)
          Entity life_exp_mean urb_mean            type
1    Afghanistan      45.38333 18.61175 Everything Else
2        Albania      68.28611 40.44416 Everything Else
3        Algeria      57.53013 52.60921 Everything Else
4 American Samoa      68.63750 79.76249 Everything Else
5        Andorra      77.04861 87.04302 Everything Else

The Data

In principle, we can just present basic information about the data:

df<-subset(life_exp_urb, select = c(life_exp_mean, urb_mean))
names(df)<-c("life_exp", "urb")
head(df, n=10)
# A tibble: 10 × 2
   life_exp   urb
      <dbl> <dbl>
 1     45.4  18.6
 2     68.3  40.4
 3     57.5  52.6
 4     68.6  79.8
 5     77.0  87.0
 6     45.1  37.5
 7     69.4  NA  
 8     71.2  32.4
 9     65.4  85.3
10     67.2  63.1
mean(df$life_exp, na.rm=TRUE)
[1] 61.93416
mean(df$urb, na.rm=TRUE)
[1] 51.36518
cor(df$urb, df$life_exp, use = "complete.obs")
[1] 0.630194

These values are reasonable.

There is a positive correlation between urbanization and life expectancy (r ≈ 0.63).

Visualizing the Relationship

But the following scatterplot is more compelling.

library(ggplot2)
ggplot(data = df, 
  mapping = aes(x=urb, y=life_exp))+
  geom_point()
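
Why can a plot be more compelling than a mean and a correlation? A minimal sketch, using Anscombe's quartet (the `anscombe` dataset that ships with base R, which is not part of the lecture data): four x/y pairs with nearly identical correlations that look very different once plotted.

```r
library(ggplot2)

# Anscombe's quartet ships with base R (datasets package)
data(anscombe)

# Correlation of each x/y pair: all four are about 0.816
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
round(cors, 3)

# Reshape to long format so that each pair becomes one facet
quartet <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(set = paste("Set", i),
             x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
}))

ggplot(data = quartet, mapping = aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(vars(set))
```

The summary numbers cannot tell the four sets apart, but the scatterplots can.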

Visualization

Good visualizations are:

  • truthful
  • functional
  • beautiful
  • insightful
  • enlightening

Visualization

“Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency.[…] [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space… And graphical excellence requires telling the truth about the data.”

Edward Tufte, The Visual Display of Quantitative Information, p. 51

Bad Visualizations 1

The figure has way too many categories

Bad Visualizations 2

The categories in a pie chart should add up to 100%

The areas should correspond to the percentage size
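
The two rules above can be checked in code before a pie chart is drawn. This is only a sketch: the `shares` data frame and its values are hypothetical and chosen for illustration.

```r
library(ggplot2)

# Hypothetical survey shares (illustrative only, not from the lecture data)
shares <- data.frame(
  answer = c("Yes", "No", "Undecided"),
  pct    = c(45, 30, 25)
)

# Guard: refuse to draw a pie chart whose slices do not sum to 100%
stopifnot(isTRUE(all.equal(sum(shares$pct), 100)))

# A pie chart in ggplot2 is a stacked bar drawn in polar coordinates,
# so the angle of each slice is proportional to its percentage
pie <- ggplot(data = shares, mapping = aes(x = "", y = pct, fill = answer)) +
  geom_col(width = 1) +
  coord_polar(theta = "y")
pie
```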

Bad Visualizations 3

The categories in a pie chart should add up to 100%

Bad Visualizations 4

Bad Visualizations 5

Goals of Visualization

R can help us achieve great visualizations with the help of ggplot2

Visualizations

Part of visualization is how we translate essential content to different forms for specific audiences

Visualizations should ultimately tell stories

Truth comes from a combination of content and form.

Colors

On the most fundamental level, we need to use the right colors for our visualizations

This is relevant for:

  • Clarity and Readability: users can distinguish among different categories
  • Accessibility: color-blind people can also see the different categories in your visualization
  • Emphasis: the right colors can be used to emphasize specific aspects of the data or analysis
  • Consistency: using a consistent color palette for the same project is helpful

Adobe Color

One excellent free source for color choice is https://color.adobe.com


Adobe Color

With Adobe Color you can:

  • Create color themes based on color theory
  • Extract themes & gradients from pictures
  • Create Accessible themes for color-blind audiences

Color Contrast

It is important to use colors with high contrast

The contrast ratio calculator from Adobe’s Create Accessible themes can be helpful
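
A contrast ratio can also be computed directly in R. The sketch below assumes the WCAG 2 definition of contrast ratio, which accessibility checkers commonly follow; whether Adobe's calculator uses exactly this formula is an assumption. The function names `relative_luminance()` and `contrast_ratio()` are hypothetical helpers written for this example.

```r
# Relative luminance of a color, following the WCAG 2.x formula
relative_luminance <- function(color) {
  rgb <- as.vector(col2rgb(color)) / 255          # channels scaled to 0-1
  lin <- ifelse(rgb <= 0.03928,
                rgb / 12.92,
                ((rgb + 0.055) / 1.055)^2.4)      # undo the sRGB gamma
  sum(c(0.2126, 0.7152, 0.0722) * lin)
}

# Contrast ratio: (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21
contrast_ratio <- function(color1, color2) {
  lum <- sort(c(relative_luminance(color1), relative_luminance(color2)))
  (lum[2] + 0.05) / (lum[1] + 0.05)
}

contrast_ratio("#000000", "#FFFFFF")  # black on white: the maximum, 21
contrast_ratio("#777777", "#888888")  # two similar greys: close to 1
```

The higher the ratio, the easier it is to tell the two colors apart.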

Color Contrast Example

Here is an example of low color contrast

Color Contrast Example

Here is an example of high color contrast

Color

About 8% of men and 0.5% of women have some form of color blindness

Thus, colors should be distinguishable by people with different forms of color blindness

Color Contrast

The Viridis palette in R allows us to create color-blind friendly graphs

These are predefined palettes that are widely used.
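
To see what makes the palette robust, we can inspect its colors. A small sketch, assuming the viridisLite package is available (it is normally installed alongside ggplot2 as a dependency); the brightness weights are the common luma approximation, used here only for illustration.

```r
# viridisLite is normally installed as a dependency of ggplot2
library(viridisLite)

# Five evenly spaced colors from the default "viridis" option
pal <- viridis(5)
pal

# Approximate perceived brightness (luma) of each color, on a 0-255 scale
rgb_vals <- col2rgb(pal)
luma <- as.vector(c(0.299, 0.587, 0.114) %*% rgb_vals)
round(luma)

# Brightness rises steadily from dark purple to yellow, so the ordering
# stays readable in greyscale and for most forms of color blindness
all(diff(luma) > 0)
```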

Not Color-blind Friendly

Color-blind Friendly

Color Contrast

This is the difference that this makes

ggplot() +
  geom_sf(data = merged, aes(fill = life_exp_mean))

Color Contrast

This is the difference that this makes

ggplot() +
    geom_sf(data = merged, aes(fill = life_exp_mean))+
    scale_fill_viridis_b(name = "Life Expectancy", option = "viridis")

Mapping data to aesthetics

On the most fundamental level, we can plot points in ggplot2

Data           aes()  geom
urb_mean       x      geom_point()
life_exp_mean  y      geom_point()
type           color  geom_point()

Mapping data to aesthetics

This is what this looks like for our data

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()
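
A related distinction is mapping an aesthetic to a column versus setting it to a fixed value. The sketch below uses a small hypothetical data frame (`toy`) that borrows the column names of life_exp_urb; the values are made up.

```r
library(ggplot2)

# Small hypothetical data frame with the same column names as life_exp_urb
toy <- data.frame(
  urb_mean      = c(20, 45, 60, 85),
  life_exp_mean = c(48, 62, 70, 78),
  type          = c("EU", "EU", "Latin America", "Everything Else")
)

# Mapping: color depends on a column, so it goes inside aes()
p_mapped <- ggplot(data = toy,
                   mapping = aes(x = urb_mean, y = life_exp_mean, color = type)) +
  geom_point()

# Setting: one fixed color for all points, so it goes outside aes()
p_fixed <- ggplot(data = toy,
                  mapping = aes(x = urb_mean, y = life_exp_mean)) +
  geom_point(color = "red")

# layer_data() shows the colors ggplot2 actually computed for each point
unique(layer_data(p_mapped)$colour)  # one color per value of type
unique(layer_data(p_fixed)$colour)   # "red" only
```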

Grammatical Layers

  • Thus, we have data, aesthetics, and geometries.
  • We can think of these as layers
  • We can add them together in ggplot() with +

Possible Aesthetics

  • color (discrete)
  • color (continuous)
  • size
  • fill
  • shape
  • alpha

Possible geoms

Example geom    What it does
geom_point()    Points
geom_col()      Bar charts
geom_text()     Text
geom_boxplot()  Boxplots
geom_sf()       Maps

Possible geoms

There are many possible geoms

Check out the layers sections of the ggplot documentation

Other Layers

There are many grammatical layers to describe graphs

We can sequentially add layers to ggplot

Example layer                     What it does
scale_x_continuous()              Make the x-axis continuous
scale_x_continuous(breaks = 1:5)  Manually specify axis ticks
scale_x_log10()                   Log the x-axis
scale_color_gradient()            Use a gradient
scale_fill_viridis_d()            Fill with discrete viridis colors

scale_x_log10()

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, y=life_exp_mean, color = type, size = life_exp_mean))+
  geom_point()+
  scale_x_log10()

scale_x_log10()

Note the difference when we don’t use scale_x_log10()

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, y=life_exp_mean, color = type, size = life_exp_mean))+
  geom_point()
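
What the log scale does to positions can be checked on a tiny example. This is a sketch with hypothetical values (`toy`), not the lecture data.

```r
library(ggplot2)

# Hypothetical values spanning several orders of magnitude
toy <- data.frame(x = c(1, 10, 100, 1000), y = c(2, 4, 6, 8))

p_linear <- ggplot(data = toy, mapping = aes(x = x, y = y)) + geom_point()
p_log    <- p_linear + scale_x_log10()

# On the log scale, points are positioned at log10(x):
# equal spacing now means equal ratios, not equal differences
layer_data(p_linear)$x  # 1 10 100 1000
layer_data(p_log)$x     # 0 1 2 3
```

The axis labels still show the original values; only the spacing between them changes.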

scale_color_viridis_d()

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, y=life_exp_mean, color = type, size = life_exp_mean))+
  geom_point()+
  scale_x_log10()+
  scale_color_viridis_d()

Facets

Example layer            What it does
facet_wrap(…, ncol = 1)  Put all facets in one column
facet_wrap(…, nrow = 1)  Put all facets in one row

Facets

facet_wrap(vars(type))

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, y=life_exp_mean, color = type, size = life_exp_mean))+
  geom_point()+
  scale_x_log10()+
  facet_wrap(vars(type))

Coordinates

Example layer                     What it does
coord_cartesian(ylim = c(1, 10))  Zoom in where y is 1–10
coord_flip()                      Switch x and y

Coordinates

This is what happens if we limit the coordinates

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, y=life_exp_mean, color = type, size = life_exp_mean))+
  geom_point()+
  coord_cartesian(ylim = c(30, 50), xlim = c(10, 40))
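
Zooming with coord_cartesian() is not the same as setting limits on a scale. The comparison with scale limits is an addition to the slides, sketched here with hypothetical data (`toy`) and a boxplot, because the difference shows up in computed statistics.

```r
library(ggplot2)

# Hypothetical data with one extreme value
toy <- data.frame(group = "a", y = c(1:9, 100))

base <- ggplot(data = toy, mapping = aes(x = group, y = y)) + geom_boxplot()

# coord_cartesian() zooms: all data still feed the boxplot statistics
p_zoom <- base + coord_cartesian(ylim = c(0, 10))

# Scale limits filter: values outside the limits are dropped (with a warning)
# before the statistics are computed
p_filter <- base + scale_y_continuous(limits = c(0, 10))

layer_data(p_zoom)$middle                      # median of all ten values: 5.5
suppressWarnings(layer_data(p_filter)$middle)  # median of 1 to 9 only: 5
```

Since zooming leaves the underlying data untouched, it is usually the safer choice when the goal is only to focus on one region of the plot.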

Coordinates

This is what happens if we flip the axes

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, y=life_exp_mean, color = type, size = life_exp_mean))+
  geom_point()+
  coord_flip()

Labels

You can add labels to the plot using a labs layer

Example layer                What it does
labs(title = "Neat title")   Title
labs(caption = "Something")  Caption
labs(y = "Something")        y-axis
labs(size = "Population")    Title of size legend

Labels

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type, 
                size = life_exp_mean))+
  geom_point()+
  labs(title = "Health and Urbanization",
       subtitle = "Insights",
       x = "Urbanization",
       y = "Life Expectancy",
       color = "Continent",
       size = "Life Expectancy", # size is mapped to life_exp_mean
       caption = "Source: Our World in Data")

Themes

You can change the appearance of the plots by changing the theme

Example layer    What it does
theme_grey()     Default grey background
theme_bw()       Black and white
theme_dark()     Dark
theme_minimal()  Minimal

theme_grey()

theme_bw()

theme_dark()

theme_minimal()

theme_economist()

This can be achieved after loading the ggthemes library

theme_wsj()

This can be achieved after loading the ggthemes library
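
A minimal sketch of how these two themes are applied, assuming the ggthemes package has been installed; the data frame `toy` is hypothetical and only gives the plot something to draw.

```r
# Assumes ggthemes is installed: install.packages("ggthemes")
library(ggplot2)
library(ggthemes)

# Hypothetical data, used only to have something to draw
toy <- data.frame(x = 1:5, y = c(2, 3, 5, 4, 6))
p <- ggplot(data = toy, mapping = aes(x = x, y = y)) + geom_point()

p + theme_economist()  # styled after The Economist
p + theme_wsj()        # styled after The Wall Street Journal
```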

Theme Options

We can make adjustments to the theme:

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type, 
                size = life_exp_mean))+
  geom_point()+
  labs(title = "Health and Urbanization",
       subtitle = "Insights",
       x = "Urbanization",
       y = "Life Expectancy",
       color = "Continent",
       size = "Life Expectancy", # size is mapped to life_exp_mean
       caption = "Source: Our World in Data")+
  theme_bw()+
  theme(legend.position = "bottom",
      plot.title = element_text(face = "bold"),
      panel.grid = element_blank(),
      axis.title.y = element_text(face = "italic"))

Anatomy of a Theme

Thus, each element of a theme can be manipulated.

There are 94 possible arguments that you can manipulate. For example:

  • Plot title = plot.title
  • Grid lines = panel.grid
  • Legend background = legend.background
  • Text-based elements = element_text()
  • Disabling elements = element_blank()
  • Something as specific as the length of the tick marks on the bottom x-axis = axis.ticks.length.x.bottom

Let us create our own theme

  • First let’s decide on our color theme.

  • Let’s say we want to recreate the color scheme from the 1992 Dracula movie.

Let us create our own theme

Once we do that we obtain the following color scheme:

Let us create our own theme

Note the color codes

Let us create our own theme

We can finally use the following colors:

Greys

  • #0D0D0B - almost black
  • #40403E - dark grey
  • #A69E94 - light grey

Reds

  • #BF0404 - red
  • #590202 - brown

Let us create our own theme

color_palette <- c("#BF0404", "#590202", "grey90")
ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type, 
                size = life_exp_mean))+
  geom_point()+theme_bw()+
  scale_color_manual(values = color_palette)+ # Assign colors manually
  theme(panel.background = element_rect(fill = "white"), # Set background color
        axis.title.x = element_text(color = "#BF0404", face = "bold"), # Set x-axis label color
        axis.title.y = element_text(color = "#BF0404", face = "bold"), # Set y-axis label color
        axis.line = element_line(color = "#BF0404", linewidth = 1.5), # linewidth replaces size for lines in ggplot2 >= 3.4.0
        panel.grid = element_blank(), # Remove all grid lines
        panel.border = element_blank(), # Remove the panel border
        panel.grid.major.y = element_line(colour = "#40403E", linewidth = 0.5, linetype = "dotted")) # Restore dotted horizontal grid lines
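
To reuse these settings across plots, they can be wrapped in a function. This is a sketch, not part of the original slides: `theme_dracula()` is a hypothetical name, `toy` is made-up data, and the `linewidth` argument assumes ggplot2 3.4.0 or later.

```r
library(ggplot2)

# theme_dracula() is a hypothetical name chosen for this sketch.
# `linewidth` assumes ggplot2 >= 3.4.0; older versions use `size` for lines.
theme_dracula <- function() {
  theme_bw() +
    theme(
      axis.title   = element_text(color = "#BF0404", face = "bold"),
      axis.line    = element_line(color = "#BF0404", linewidth = 1.5),
      panel.grid   = element_blank(),
      panel.border = element_blank(),
      panel.grid.major.y = element_line(color = "#40403E", linewidth = 0.5,
                                        linetype = "dotted")
    )
}

# Usage with hypothetical data: the theme is added like any other layer
toy <- data.frame(x = 1:5, y = c(2, 3, 5, 4, 6))
p <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point() +
  theme_dracula()
p
```

Keeping the theme in one function helps with the consistency goal: every figure in a project can share the same look with a single call.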

Anatomy of a Theme

You should check out Chapter 17 (Themes) of ggplot2: Elegant Graphics for Data Analysis (3e)

You will learn how to exercise fine control over the non-data elements of your plot.

Adding the Layers Together

We can build a plot sequentially to see how each grammatical layer changes its appearance

Adding the Layers Together

Start with data and aesthetics

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))

Adding the Layers Together

Add a geom point

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()

Adding the Layers Together

Adding geom smooth

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_smooth()

Adding the Layers Together

Getting straight lines

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")
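
The straight line is an ordinary linear regression. The sketch below, on hypothetical data (`toy`), compares the line drawn by geom_smooth(method = "lm") with predictions from lm(). In the plot above, color = type means one such line is fitted per group.

```r
library(ggplot2)

# Hypothetical data, roughly linear
toy <- data.frame(x = c(10, 20, 30, 40, 50, 60),
                  y = c(48, 55, 57, 66, 69, 75))

# The same kind of model that geom_smooth(method = "lm") fits
fit <- lm(y ~ x, data = toy)
coef(fit)  # intercept and slope of the straight line

p <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# The smoothed line (layer 2) matches the predictions from lm()
line <- layer_data(p, 2)
all.equal(line$y, unname(predict(fit, newdata = data.frame(x = line$x))))
```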

Adding the Layers Together

Use a viridis color scale

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")+
  scale_color_viridis_d()

Adding the Layers Together

Facets by continents

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")+
  scale_color_viridis_d()+
  facet_wrap(vars(type), ncol = 1)

Adding the Layers Together

Add labels

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")+
  scale_color_viridis_d()+
  facet_wrap(vars(type), ncol = 1)+
  labs(title = "Health and Urbanization",
       x = "Urbanization",
       y = "Life Expectancy",
       color = "Continent",
       size = "Population")

Adding the Layers Together

Add theme

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")+
  scale_color_viridis_d()+
  facet_wrap(vars(type), ncol = 1)+
  labs(title = "Health and Urbanization",
       x = "Urbanization",
       y = "Life Expectancy",
       color = "Continent",
       size = "Population")+
  theme_bw()

Adding the Layers Together

Modify the theme

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")+
  scale_color_viridis_d()+
  facet_wrap(vars(type), ncol = 1)+
  labs(title = "Health and Urbanization",
       x = "Urbanization",
       y = "Life Expectancy",
       color = "Continent",
       size = "Population")+
  theme_bw()+
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold"))

Describing graphs with grammar

We can map life expectancy to the x-axis, add a histogram with bins, fill and facet by continent

ggplot(data = life_exp_urb, 
       mapping = aes(x = life_exp_mean,
                     fill = type)) +
  geom_histogram(binwidth = 5, 
                 color = "white") +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(type))

What is a histogram?

Histograms allow us to better understand how frequently or infrequently certain values occur in our dataset.

Imagine a set of values that are spaced out along a number line.

What is a histogram?

To construct a histogram, a section of the number line is divided into equal chunks, called bins.

Next, count how many data points sit inside each bin, and draw bars, one for each bin

The heights of the bars correspond to the number of data points in each bin.
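
The construction steps can be followed by hand in R. A minimal sketch with hypothetical values (`scores`); the bin edges are chosen for illustration.

```r
library(ggplot2)

# Hypothetical values on a number line (not the lecture data)
scores <- c(410, 450, 520, 530, 580, 610, 640, 650, 690, 720)

# Step 1: divide the number line into equal bins
bin_edges <- seq(from = 400, to = 800, by = 100)

# Step 2: count how many values fall inside each bin
counts <- hist(scores, breaks = bin_edges, plot = FALSE)$counts
counts  # 2 3 4 1

# Step 3: draw one bar per bin, with height equal to the count
p <- ggplot(data = data.frame(score = scores), mapping = aes(x = score)) +
  geom_histogram(breaks = bin_edges, color = "white")
p
```

Changing the bin width changes the counts and therefore the shape of the histogram, so it is worth trying more than one value.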

What is a histogram?

Label the data (in the example below each data point is an SAT score)

Draw in a y-axis which counts the number of data points in each bin

Finally label your bins.

What is a histogram?

This is how it all looks together

What is a histogram?

And these are three histograms applied to our data.

ggplot(data = life_exp_urb, 
       mapping = aes(x = life_exp_mean,
                     fill = type)) +
  geom_histogram(binwidth = 5, 
                 color = "white") +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(type))

Describing graphs with grammar

We can map continent to the x-axis, life expectancy to the y-axis, add violin plots and semi-transparent boxplots, and fill by continent

ggplot(data = life_exp_urb, 
       mapping = aes(x = type,
                     y = life_exp_mean,
                     fill = type)) +
  geom_violin() +
  geom_boxplot(alpha = 0.5) +
  guides(fill = "none")  # Turn off legend

Adding a Time Dimension

library("gganimate")
#Note: you should install:
#install.packages("av")
#install.packages("magick")
#install.packages("gifski")
#install.packages("gganimate")
ggplot(merged_data_temp, aes(x = urb_yearly, 
                   y = life_exp_yearly, 
                  size = urb_yearly, 
                  color=Entity)) +
  geom_point(alpha = 0.7) +
  scale_size(range = c(2, 12)) +
  guides(size = "none", 
         color = "none") +
  facet_wrap(~continent)+
  scale_color_viridis_d()+
  #animate arguments
  labs(title = 'Year: {frame_time}', 
       x = 'Urbanization', 
       y = 'Life Expectancy') +
  transition_time(Year) +
  ease_aes('linear')

Adding a Time Dimension

Here is how I obtained the merged_data_temp dataframe:

#Setting path
library("dplyr")
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week4/data/")
#Step1: Loading the data
life_expectancy_df <- read.csv(file = './life-expectancy.csv')
urbanization_df <- read.csv(file = './share-of-population-urban.csv')
#Step2: Removing countries with no country code
weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_life_expectancy_df<-subset(life_expectancy_df, !(Code %in% weird_labels))
#Step3: Changing variable name
names(clean_life_expectancy_df)[names(clean_life_expectancy_df)=="Life.expectancy.at.birth..historical."]<-"life_exp_yearly"
#Step4: Keeping only relevant vars
clean_life_expectancy_df2<-subset(clean_life_expectancy_df, select=c("Entity", "Code", "Year", "life_exp_yearly"))
#Step5: Removing countries with no country code
weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_urbanization_df<-subset(urbanization_df, !(Code %in% weird_labels))
#Step6: Changing variable name
names(clean_urbanization_df)[names(clean_urbanization_df)=="Urban.population....of.total.population."]<-"urb_yearly"
#Step7: Keeping only relevant vars
clean_urbanization_df2<-subset(clean_urbanization_df, select=c("Code", "Year", "urb_yearly"))
#Step8: Performing a merge
merged_data_temp<-left_join(clean_life_expectancy_df2, clean_urbanization_df2, by = c("Code"="Code",
                                                         "Year"="Year"))
#Step9: Removing NAs
merged_data_temp<-merged_data_temp[complete.cases(merged_data_temp), ]

#Step10: Defining continents
#EU Countries
eu_countries<-c("Austria",
                "Belgium",
                "Bulgaria",
                "Croatia",
                "Cyprus",
                "Czechia",
                "Denmark",
                "Estonia",
                "Finland",
                "France",
                "Germany",
                "Greece",
                "Hungary",
                "Ireland",
                "Italy",
                "Latvia",
                "Lithuania",
                "Luxembourg",
                "Malta",
                "Netherlands",
                "Poland",
                "Portugal",
                "Romania",
                "Slovakia",
                "Slovenia",
                "Spain",
                "Sweden")

latam_countries<-c("Belize",
                   "Costa Rica",
                   "El Salvador",
                   "Guatemala",
                   "Honduras",
                   "Mexico",
                   "Nicaragua",
                   "Panama",
                   "Argentina",
                   "Bolivia",
                   "Brazil",
                   "Chile",
                   "Colombia",
                   "Ecuador",
                   "Guyana",
                   "Paraguay",
                   "Peru",
                   "Suriname",
                   "Uruguay",
                   "Venezuela",
                   "Cuba",
                   "Dominican Republic",
                   "Haiti")

#Step11: Labeling continents 
merged_data_temp$continent[merged_data_temp$Entity  %in% eu_countries]<-"EU"
merged_data_temp$continent[merged_data_temp$Entity  %in% latam_countries]<-"Latin America"
merged_data_temp$continent[is.na(merged_data_temp$continent)]<-"Everything Else"
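
The merge and NA-removal steps (Step8 and Step9) can be illustrated on a miniature example. This is a sketch: the data frames `life` and `urb` and all their values are hypothetical.

```r
library(dplyr)

# Hypothetical miniature versions of the two yearly tables
life <- data.frame(Code = c("ITA", "ITA", "ROU"),
                   Year = c(2000, 2001, 2000),
                   life_exp_yearly = c(79.8, 80.1, 71.2))
urb  <- data.frame(Code = c("ITA", "ROU"),
                   Year = c(2000, 2000),
                   urb_yearly = c(67.2, 53.0))

# Rows match only when BOTH Code and Year agree; unmatched rows get NA
merged <- left_join(life, urb, by = c("Code", "Year"))
merged

# complete.cases() then keeps only the rows without any NA
merged_complete <- merged[complete.cases(merged), ]
nrow(merged)           # 3
nrow(merged_complete)  # 2
```

A left join keeps every row of the first table, which is why the NA-removal step is needed afterwards.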

Conclusion

Today, we covered quite a lot of ground:

  • importance of color choices

  • mapping data to aesthetics: points

  • facets, coordinates, labels, themes

  • theme option manipulation

  • using histograms, violin plots, and boxplots

  • creating time animations