L11: Data Visualization

Bogdan G. Popescu

bogdan.popescu@johncabot.edu

John Cabot University

Introduction

Visualization in R is about presenting data in an aesthetically pleasing way

More importantly, visualization is about presenting the data in a truthful way

Finding the Truth

One good way to tell if something is true or not is by using science

Truth vs. Art

However, data visualization can affect perceptions of what is scientific

The scientist can consciously or unconsciously make particular style choices in a graph that can affect such perceptions

Thus, perception of science can sometimes be impacted by aesthetic choices and the way content is presented

Duties of a Data Analyst

Do not manipulate data, and thus do not lie

Do not misrepresent data

Find the right balance between dumbing down and sounding too esoteric when communicating results
- You are ultimately a translator

Emphasize the story and make data available

The Data

At the most basic level, we can present snippets of the data or basic information about the data

To demonstrate how we can present data, let us download: Life expectancy and Urbanization Data

Once you load it into Python, the dataframe looks like the following:

Python

import os
import pandas as pd
# Step 0: Setting the working directory
os.chdir("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week6/lecture6b/data")
#Step 1: Loading the data
life_exp_urb = pd.read_csv('./life_exp_urb.csv')
# Step 2: Examining the first three entries
life_exp_urb.head(3)

        Entity  life_exp_mean   urb_mean             type
0  Afghanistan      45.383333  18.611754  Everything Else
1      Albania      68.286111  40.444164  Everything Else
2      Algeria      57.530133  52.609213  Everything Else

The Data

In principle, we can just present basic information about the data:

Python

# Step 1: Subset the DataFrame
df = life_exp_urb[['life_exp_mean', 'urb_mean']].copy()
# Rename the columns
df.columns = ['life_exp', 'urb']
# Display the first 10 entries
print(df.head(10))

    life_exp        urb
0  45.383333  18.611754
1  68.286111  40.444164
2  57.530133  52.609213
3  68.637500  79.762491
4  77.048611  87.043017
5  45.084658  37.539705
6  69.440278        NaN
7  71.209722  32.350656
8  65.408046  85.282148
9  67.156944  63.111394

Python

df['life_exp'].mean(skipna=True)

61.93415901715886

Python

df['urb'].mean(skipna=True)

51.3651830718757

Python

df['urb'].corr(df['life_exp'])

0.6301940464135396

All of these ways of examining the data are reasonable. There is a positive correlation between urbanization and life expectancy.
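To make the correlation less of a black box, here is a minimal sketch of the same calculation in plain Python. It uses only the ten rows printed above, so its result will differ from the full-sample value of 0.63. The function name `pearson` is our own, and dropping incomplete pairs is meant to mirror how pandas treats missing values.

```python
import math

# First ten rows of (life_exp, urb) as printed above; None marks the missing value.
life_exp = [45.383333, 68.286111, 57.530133, 68.637500, 77.048611,
            45.084658, 69.440278, 71.209722, 65.408046, 67.156944]
urb = [18.611754, 40.444164, 52.609213, 79.762491, 87.043017,
       37.539705, None, 32.350656, 85.282148, 63.111394]


def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of cross-products and squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


r = pearson(urb, life_exp)
print(round(r, 3))
```

A value close to 1 means the two variables rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.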

Visualizing the Relationship

We will now make the transition to R for creating plots.

This is the dataframe in Python:

Python

df

We can turn the Python dataframe into an R dataframe in the following way:

library(reticulate)
df2 <-  reticulate::py$df

Installing Packages in R

To install packages in R, run something like the following in the “Console” tab:

install.packages("ggplot2")

Make sure you delete install.packages("ggplot2") from your script after you have installed the package.

Visualizing the Relationship

This is how we create a scatterplot in R:

library(ggplot2)
ggplot(data = df2, 
  mapping = aes(x=urb, y=life_exp))+
  geom_point()

Visualization

Good visualizations are:

truthful
functional
beautiful
insightful
enlightening

Visualization

“Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency.[…] [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space… And graphical excellence requires telling the truth about the data.”

Edward Tufte, The Visual Display of Quantitative Information, p. 51

Bad Visualizations 1

The figure has way too many categories

Bad Visualizations 2

The categories in a pie chart should add up to 100%

The areas should correspond to the percentage size
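Both pie chart rules can be checked before plotting. The sketch below is one possible way to do it in Python; the function names and the shares are hypothetical, invented only for illustration.

```python
def check_pie_shares(shares, tolerance=0.5):
    """Return True if the percentage shares are non-negative and sum to about 100."""
    total = sum(shares.values())
    return all(v >= 0 for v in shares.values()) and abs(total - 100) <= tolerance


def slice_angles(shares):
    """Angle in degrees for each slice, proportional to its percentage share."""
    return {label: share / 100 * 360 for label, share in shares.items()}


# Hypothetical shares, invented for illustration only.
good = {"A": 50.0, "B": 30.0, "C": 20.0}
bad = {"A": 60.0, "B": 63.0, "C": 70.0}  # sums to 193, so it cannot be a pie chart

print(check_pie_shares(good), check_pie_shares(bad))
print(slice_angles(good))
```

If the check fails, a bar chart is usually the safer choice, because bars do not imply parts of a whole.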

Bad Visualizations 3

The categories in a pie chart should add up to 100%

Bad Visualizations 4

Bad Visualizations 5

Goals of Visualization

Python has a variety of libraries meant for data visualization:

matplotlib
seaborn
plotnine

Plotnine is useful because it brings to Python the grammar of graphics of ggplot2, a library native to R.

ggplot2 allows us to create powerful visualizations and has better data handling abilities.

Check out this article comparing Matplotlib and ggplot2

Visualizations

Part of visualization is how we translate essential content to different forms for specific audiences

Visualizations should ultimately tell stories

Truth comes from a combination of content and form.

Colors

On the most fundamental level, we need to use the right colors for our visualizations

This is relevant for:

Clarity and Readability: users can distinguish among different categories

Accessibility: color-blind people can also see the different categories in your visualization

Emphasis: the right colors can be used to emphasize specific aspects of the data or analysis

Consistency: using a consistent color palette for the same project is helpful

Adobe Color

One excellent free source for color choice is https://color.adobe.com

Adobe Color

With Adobe Color you can:

Create color themes based on color theory

Extract themes & gradients from pictures

Create Accessible themes for color-blind audiences

Color Contrast

It is important to use color with a high contrast

The contrast ratio calculator from Adobe’s Create Accessible themes can be helpful
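For intuition about what such a calculator computes, here is a minimal Python sketch of the contrast ratio as defined in the WCAG 2.x accessibility guidelines. We are assuming, not asserting, that Adobe's tool follows the same definition. The function names and the light grey example color are our own.

```python
def _linearize(channel_8bit):
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a color written as '#RRGGBB'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(color_a, color_b):
    """Contrast ratio between two colors, from 1 (identical) to 21 (black on white)."""
    la, lb = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


print(round(contrast_ratio("#000000", "#FFFFFF"), 1))  # black on white: the maximum, 21.0
print(round(contrast_ratio("#CCCCCC", "#FFFFFF"), 1))  # light grey on white: low contrast
```

WCAG recommends a ratio of at least 4.5:1 for normal-sized text, which is a useful rule of thumb for labels and annotations in a plot.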

Color Contrast Example

Here is a low color contrast

Color Contrast Example

Here is a low color contrast

Color Contrast Example

Here is a high color contrast

Color Contrast Example

Here is a high color contrast

Color

About 8% of men and 0.5% of women have some form of color blindness

Thus, colors should be distinguishable by people with different forms of color blindness

Color Contrast

The Viridis palette in R allows us to create color-blind friendly graphs

These are predefined palettes that are widely used.

Color Contrast

This is the difference that this makes

ggplot() +
  geom_sf(data = merged, aes(fill = life_exp_mean))

Color Contrast

This is the difference that this makes

ggplot() +
    geom_sf(data = merged, aes(fill = life_exp_mean))+
    scale_fill_viridis_b(name = "Life Expectancy", option = "viridis")

Mapping data to aesthetics

On the most fundamental level, we can plot points in ggplot2

Data	`aes()`	`geom`
urb_mean	`x`	`geom_point()`
life_exp_mean	`y`	`geom_point()`
type	`color`	`geom_point()`

Mapping data to aesthetics

This is what this looks like for our data

Python

df = pd.read_csv('/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week6/lecture6b/data/life_exp_urb.csv')

df2 <-  reticulate::py$df
ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()

Grammatical Layers

Thus, we have data, aesthetics, and geometries.

We can think of these as layers

We can add them together in ggplot() with +
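The idea of adding layers with + can be illustrated with a toy Python class. This is only a sketch of the design and not how ggplot2 is implemented. `ToyPlot` is a hypothetical name, and layers are recorded as plain strings rather than drawn.

```python
class ToyPlot:
    """Hypothetical stand-in for a ggplot object; it only records layers."""

    def __init__(self, data, mapping, layers=()):
        self.data = data
        self.mapping = mapping
        self.layers = tuple(layers)

    def __add__(self, layer):
        # Return a new plot so that earlier stages remain unchanged
        return ToyPlot(self.data, self.mapping, self.layers + (layer,))

    def describe(self):
        return " + ".join(("ggplot",) + self.layers)


base = ToyPlot(data="df2",
               mapping={"x": "urb_mean", "y": "life_exp_mean", "color": "type"})
plot = base + "geom_point" + "scale_x_log10" + "theme_bw"

print(base.describe())  # the base plot has data and aesthetics but no layers yet
print(plot.describe())
```

Because each + returns a new object, a base plot can be stored once and extended in different directions.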

Possible Aesthetics

Possible geoms

	Example geom	What it does
	`geom_point()`	Points
	`geom_col()`	Bar charts
	`geom_text()`	Text
	`geom_boxplot()`	Boxplots
	`geom_sf()`	Maps

Possible geoms

There are many possible geoms

Check out the layers sections of the ggplot documentation

Other Layers

There are many grammatical layers to describe graphs

We can sequentially add layers to ggplot

Example layer	What it does
`scale_x_continuous()`	Make the x-axis continuous
`scale_x_continuous(breaks = 1:5)`	Manually specify axis ticks
`scale_x_log10()`	Log the x-axis
`scale_color_gradient()`	Use a gradient
`scale_fill_viridis_d()`	Fill with discrete viridis colors

`scale_x_log10()`

Python

df = pd.read_csv('/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week6/lecture6b/data/life_exp_urb.csv')

df2 <-  reticulate::py$df
ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean,
                size = life_exp_mean,
                color = type))+
  geom_point()+
  scale_x_log10()

`scale_x_log10()`

Note the difference when we don’t use scale_x_log10()

Python

df = pd.read_csv('/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week6/lecture6b/data/life_exp_urb.csv')

df2 <-  reticulate::py$df
ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean,
                size = life_exp_mean,
                color = type))+
  geom_point()
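To see why the two plots differ, it helps to look at what a log scale does to the spacing between values. The sketch below uses hypothetical urbanization shares, each double the previous one. They are not taken from the dataset.

```python
import math

# Hypothetical urbanization shares (%), each double the previous one.
shares = [5, 10, 20, 40, 80]

# Distance between neighbouring values on a linear axis
linear_gaps = [b - a for a, b in zip(shares, shares[1:])]

# Distance between the same neighbours on a log10 axis
log_gaps = [math.log10(b) - math.log10(a) for a, b in zip(shares, shares[1:])]

print(linear_gaps)                      # gaps grow: 5, 10, 20, 40
print([round(g, 3) for g in log_gaps])  # every gap is log10(2), about 0.301
```

On a log axis, equal ratios take up equal distances, which spreads out the small values and compresses the large ones.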

`scale_color_viridis_d()`

ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean,
                size = life_exp_mean,
                color = type))+
  geom_point()+
  scale_x_log10()+
  scale_color_viridis_d()

Facets

facet_wrap(vars(type))

ggplot(data = df2, 
  mapping = aes(x=urb_mean, y=life_exp_mean, color = type, size = life_exp_mean))+
  geom_point()+
  scale_x_log10()+
  facet_wrap(vars(type))

Coordinates

Example layer	What it does
`coord_cartesian(ylim = c(1, 10))`	Zoom in where y is 1–10
`coord_flip()`	Switch x and y

Coordinates

This is what happens if we limit the coordinates

ggplot(data = df2, 
  mapping = aes(x=urb_mean, y=life_exp_mean, color = type, size = life_exp_mean))+
  geom_point()+
  coord_cartesian(ylim = c(30, 50), xlim = c(10, 40))

Coordinates

This is what happens if we flip the axes

ggplot(data = df2, 
  mapping = aes(x=urb_mean, y=life_exp_mean, color = type, size = life_exp_mean))+
  geom_point()+
  coord_flip()

Labels

You can add labels to the plot using a labs layer

Example layer	What it does
`labs(title = "Neat title")`	Title
`labs(caption = "Something")`	Caption
`labs(y = "Something")`	y-axis
`labs(size = "Population")`	Title of size legend

Labels

ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type, 
                size = life_exp_mean))+
  geom_point()+
  labs(title = "Health and Urbanization",
       subtitle = "Insights",
       x = "Urbanization",
       y = "Life Expectancy",
       color = "Continent",
       size = "Life Expectancy",
       caption = "Source: Our World in Data")

Themes

You can change the appearance of the plots by changing the theme

Example layer	What it does
`theme_grey()`	Default grey background
`theme_bw()`	Black and white
`theme_dark()`	Dark
`theme_minimal()`	Minimal

`theme_grey()`

`theme_bw()`

`theme_dark()`

`theme_minimal()`

`theme_economist()`

This can be achieved after loading the ggthemes library

`theme_wsj()`

This can be achieved after loading the ggthemes library

Theme Options

We can make adjustments to the theme:

ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type, 
                size = life_exp_mean))+
  geom_point()+
  labs(title = "Health and Urbanization",
       subtitle = "Insights",
       x = "Urbanization",
       y = "Life Expectancy",
       color = "Continent",
       size = "Life Expectancy",
       caption = "Source: Our World in Data")+
  theme_bw()+
  theme(legend.position = "bottom",
      plot.title = element_text(face = "bold"),
      panel.grid = element_blank(),
      axis.title.y = element_text(face = "italic"))

Theme Options

We can make adjustments to the theme:

Anatomy of a Theme

Thus, each element of a theme can be manipulated.

There are 94 possible arguments that you can manipulate. For example:

Plot title = plot.title

Grid lines = panel.grid

Legend background = legend.background

Text-based elements = element_text()

Disabling elements = element_blank()

Something as specific as the length of the tick marks at the bottom = axis.ticks.length.x.bottom

Let us create our own theme

First let’s decide on our color theme.
Let’s say we want to recreate the color scheme from the 1992 Dracula movie.

Let us create our own theme

First we go to Extract themes & gradients
We upload the jpeg by dragging and dropping

Let us create our own theme

Once we do that we obtain the following color scheme:

Let us create our own theme

Note the color codes

Let us create our own theme

We can finally use the following colors:

Greys

#0D0D0B - almost black
#40403E - dark grey
#A69E94 - light grey

Reds

#BF0404 - red
#590202 - brown

Let us create our own theme

color_palette <- c("#BF0404", "#590202", "grey90")
ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type, 
                size = life_exp_mean))+
  geom_point()+
  theme_bw()+
  scale_color_manual(values = color_palette)+ # Assign colors manually
  theme(panel.background = element_rect(fill = "white"), # Set background color
        axis.title.x = element_text(color = "#BF0404", face = "bold"), # Set x-axis label color
        axis.title.y = element_text(color = "#BF0404", face = "bold"), # Set y-axis label color
        # Note: ggplot2 >= 3.4.0 uses `linewidth` for lines; older versions use `size`
        axis.line = element_line(color = "#BF0404", linewidth = 1.5), 
        panel.grid = element_blank(), 
        panel.border = element_blank(), 
        panel.grid.major.y = element_line(colour = "#40403E", linewidth = 0.5, linetype = "dotted"))

Let us create our own theme

Anatomy of a Theme

You should check out C17: Themes of ggplot2: Elegant Graphics for Data Analysis (3e)

You will learn how to exercise fine control over the non-data elements of your plot.

Adding the Layers Together

We can build a plot sequentially to see how each grammatical layer changes its appearance

Adding the Layers Together

Start with data and aesthetics

ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))

Adding the Layers Together

Add a geom point

ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()

Adding the Layers Together

Adding geom smooth

ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_smooth()

Adding the Layers Together

Getting straight lines

ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")

Adding the Layers Together

Use a viridis color scale

ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")+
  scale_color_viridis_d()

Adding the Layers Together

Facets by continents

ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")+
  scale_color_viridis_d()+
  facet_wrap(vars(type), ncol = 1)

Adding the Layers Together

Add labels

ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")+
  scale_color_viridis_d()+
  facet_wrap(vars(type), ncol = 1)+
  labs(title = "Health and Urbanization",
       x = "Urbanization",
       y = "Life Expectancy",
       color = "Continent",
       size = "Population")

Adding the Layers Together

Add theme

ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")+
  scale_color_viridis_d()+
  facet_wrap(vars(type), ncol = 1)+
  labs(title = "Health and Urbanization",
       x = "Urbanization",
       y = "Life Expectancy",
       color = "Continent",
       size = "Population")+
  theme_bw()

Adding the Layers Together

Modify the theme

ggplot(data = df2, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_smooth(method = "lm")+
  scale_color_viridis_d()+
  facet_wrap(vars(type), ncol = 1)+
  labs(title = "Health and Urbanization",
       x = "Urbanization",
       y = "Life Expectancy",
       color = "Continent",
       size = "Population")+
  theme_bw()+
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold"))

Describing graphs with grammar

We can map life expectancy to the x-axis, add a histogram with bins, fill and facet by continent

ggplot(data = df2, 
       mapping = aes(x = life_exp_mean,
                     fill = type)) +
  geom_histogram(binwidth = 5, 
                 color = "white") +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(type))

What is a histogram?

Histograms allow us to better understand how frequently or infrequently certain values occur in our dataset.

Imagine a set of values that are spaced out along a number line.

What is a histogram?

To construct a histogram, a section of the number line is divided into equal chunks, called bins.

Next, count how many data points sit inside each bin, and draw bars, one for each bin

The heights of the bars correspond to the number of data points in each bin.
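The construction just described can be sketched in a few lines of Python. The values are the first ten life expectancy figures printed earlier in the lecture, and the bin width of 5 matches the `binwidth = 5` used in our ggplot2 code. The function name `histogram_counts` is our own, and ggplot2 may place the bin edges differently.

```python
import math
from collections import Counter

# First ten life expectancy values from the dataframe shown earlier
life_exp = [45.383333, 68.286111, 57.530133, 68.637500, 77.048611,
            45.084658, 69.440278, 71.209722, 65.408046, 67.156944]


def histogram_counts(values, binwidth):
    """Count values per bin; a bin with left edge e covers [e, e + binwidth)."""
    counts = Counter(math.floor(v / binwidth) * binwidth for v in values)
    return dict(sorted(counts.items()))


counts = histogram_counts(life_exp, binwidth=5)

# Draw one text "bar" per non-empty bin
for left_edge, n in counts.items():
    print(f"[{left_edge}, {left_edge + 5}): {'#' * n}")
```

Changing `binwidth` changes the picture: wider bins smooth the distribution out, while narrower bins show more detail and more noise.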

What is a histogram?

Label the data (in the example below each data point is an SAT score)

Draw in a y-axis which counts the number of data points in each bin

Finally label your bins.

What is a histogram?

This is how it all looks together

What is a histogram?

And these are three histograms applied to our data.

ggplot(data = df2, 
       mapping = aes(x = life_exp_mean,
                     fill = type)) +
  geom_histogram(binwidth = 5, 
                 color = "white") +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(type))

Describing graphs with grammar

We can map continent to the x-axis, life expectancy to the y-axis, add violin plots and semi-transparent boxplots, fill and facet by continent

ggplot(data = df2, 
       mapping = aes(x = type,
                     y = life_exp_mean,
                     fill = type)) +
  geom_violin() +
  geom_boxplot(alpha = 0.5) +
  guides(fill = "none")  # Turn off legend

Adding a Time Dimension

library("gganimate")
#Note: you should install:
#install.packages("av")
#install.packages("magick")
#install.packages("gifski")
#install.packages("gganimate")
ggplot(merged_data_temp, aes(x = urb_yearly, 
                   y = life_exp_yearly, 
                  size = urb_yearly, 
                  color=Entity)) +
  geom_point(alpha = 0.7) +
  scale_size(range = c(2, 12)) +
  guides(size = "none", 
         color = "none") +
  facet_wrap(~continent)+
  scale_color_viridis_d()+
  #animate arguments
  labs(title = 'Year: {frame_time}', 
       x = 'Urbanization', 
       y = 'Life Expectancy') +
  transition_time(Year) +
  ease_aes('linear')

Adding a Time Dimension

Here is how I obtained the merged_data_temp dataframe:

Python

# Setting path
path = "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week6/lecture6b/data/"

# Step 1: Loading the data
life_expectancy_df = pd.read_csv(f'{path}life-expectancy.csv')
urbanization_df = pd.read_csv(f'{path}share-of-population-urban.csv')

# Step 2: Removing countries with no country code
weird_labels = ["OWID_KOS", "OWID_WRL", ""]
clean_life_expectancy_df = life_expectancy_df[~life_expectancy_df['Code'].isin(weird_labels)]

# Step 3: Changing variable name
clean_life_expectancy_df = clean_life_expectancy_df.rename(columns={"Life expectancy at birth (historical)": "life_exp_yearly"})

# Step 4: Keeping only relevant vars
clean_life_expectancy_df2 = clean_life_expectancy_df[['Entity', 'Code', 'Year', 'life_exp_yearly']]

# Step 5: Removing countries with no country code
clean_urbanization_df = urbanization_df[~urbanization_df['Code'].isin(weird_labels)]

# Step 6: Changing variable name
clean_urbanization_df = clean_urbanization_df.rename(columns={"Urban population (% of total population)": "urb_yearly"})

# Step 7: Keeping only relevant vars
clean_urbanization_df2 = clean_urbanization_df[['Code', 'Year', 'urb_yearly']]

# Step 8: Performing a merge
merged_data_temp = pd.merge(clean_life_expectancy_df2, clean_urbanization_df2, on=['Code', 'Year'], how='left')

# Step 9: Removing NAs
merged_data_temp = merged_data_temp.dropna()

# Step 10: Defining continents
eu_countries = [
    "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia", "Denmark", "Estonia", 
    "Finland", "France", "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia", 
    "Lithuania", "Luxembourg", "Malta", "Netherlands", "Poland", "Portugal", "Romania", 
    "Slovakia", "Slovenia", "Spain", "Sweden"]

latam_countries = [
    "Belize", "Costa Rica", "El Salvador", "Guatemala", "Honduras", "Mexico", "Nicaragua", "Panama", 
    "Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", "Guyana", "Paraguay", "Peru", 
    "Suriname", "Uruguay", "Venezuela", "Cuba", "Dominican Republic", "Haiti"]
    
# Step 11: Labeling continents 
merged_data_temp['continent'] = 'Everything Else'  # Default value

merged_data_temp.loc[merged_data_temp['Entity'].isin(eu_countries), 'continent'] = 'EU'
merged_data_temp.loc[merged_data_temp['Entity'].isin(latam_countries), 'continent'] = 'Latin America'
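The labeling logic in Step 11 sets a default and then overrides it for the two groups. The same rule can be sketched as a plain Python function, which makes it easy to test on single country names. The lists below are abbreviated for illustration, and `label_region` is a hypothetical helper, not part of the original pipeline.

```python
# Abbreviated lists for illustration; the full lists appear in Step 10.
eu_countries = {"Austria", "Belgium", "Bulgaria"}
latam_countries = {"Belize", "Costa Rica", "Mexico"}


def label_region(entity):
    """Return the group label for a country name, defaulting to 'Everything Else'."""
    if entity in eu_countries:
        return "EU"
    if entity in latam_countries:
        return "Latin America"
    return "Everything Else"


for name in ["Austria", "Mexico", "Afghanistan"]:
    print(name, "->", label_region(name))
```

With pandas, a function like this could be applied to the `Entity` column with `.map()`. The `.loc` assignments used in Step 11 reach the same result.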

Adding a Time Dimension

Here is how I passed the merged_data_temp dataframe from Python to R:

library(reticulate)
merged_data_temp <-  reticulate::py$merged_data_temp

Conclusion

Today, we covered quite some ground:

importance of color choices
mapping data to aesthetics: points
facets, coordinates, labels, themes
theme option manipulation
using histograms, violin, boxplot
creating time animations