L11: Data Visualization
Bogdan G. Popescu
John Cabot University
Introduction
Visualization in R is about presenting data in an aesthetically pleasing way
More importantly, visualization is about presenting the data in a truthful way
Finding the Truth
One good way to tell if something is true or not is by using science
Truth vs. Art
However, data visualization can affect perceptions of what is scientific
The scientist can consciously or unconsciously make style choices in a graph that affect such perceptions
Thus, perception of science can sometimes be impacted by aesthetic choices and the way content is presented
Duties of a Data Analyst
Not manipulate data and thus, not lie
Find the right balance between dumbing down and sounding too esoteric when communicating results
You are ultimately a translator
Emphasize the story and make data available
The Data
At the most basic level, we can present snippets of the data or basic information about the data
Once you load it into Python, the dataframe looks like the following:
import os
import pandas as pd
# Step 0: Setting the working directory
os.chdir("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week6/lecture6b/data")
# Step 1: Loading the data
life_exp_urb = pd.read_csv('./life_exp_urb.csv')
# Step 2: Examining the first three entries
life_exp_urb.head(3)
Entity life_exp_mean urb_mean type
0 Afghanistan 45.383333 18.611754 Everything Else
1 Albania 68.286111 40.444164 Everything Else
2 Algeria 57.530133 52.609213 Everything Else
The Data
In principle, we can just present basic information about the data:
# Step 1: Subset the DataFrame
df = life_exp_urb[['life_exp_mean', 'urb_mean']].copy()
# Step 2: Rename the columns
df.columns = ['life_exp', 'urb']
# Step 3: Display the first 10 entries
print(df.head(10))
life_exp urb
0 45.383333 18.611754
1 68.286111 40.444164
2 57.530133 52.609213
3 68.637500 79.762491
4 77.048611 87.043017
5 45.084658 37.539705
6 69.440278 NaN
7 71.209722 32.350656
8 65.408046 85.282148
9 67.156944 63.111394
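# Mean life expectancy, ignoring missing values
df['life_exp'].mean(skipna=True)
# Mean urbanization, ignoring missing values
df['urb'].mean(skipna=True)
# Correlation between urbanization and life expectancy
df['urb'].corr(df['life_exp'])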
All of these ways of examining the data are reasonable. There is a positive correlation between urbanization and life expectancy.
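To make the correlation step less of a black box, here is a minimal sketch of a Pearson correlation computed by hand on the ten rows printed above. It assumes that rows with a missing value are simply dropped before the calculation, which is how pandas treats missing values in corr(). The function name pearson is only illustrative, and the result on ten rows will differ from the value for the full dataset.

```python
import math

# First ten (life_exp, urb) rows shown above; None marks the missing urb value.
rows = [
    (45.383333, 18.611754),
    (68.286111, 40.444164),
    (57.530133, 52.609213),
    (68.637500, 79.762491),
    (77.048611, 87.043017),
    (45.084658, 37.539705),
    (69.440278, None),
    (71.209722, 32.350656),
    (65.408046, 85.282148),
    (67.156944, 63.111394),
]


def pearson(pairs):
    """Return (Pearson r, number of rows used), skipping incomplete rows."""
    complete = [(x, y) for x, y in pairs if x is not None and y is not None]
    n = len(complete)
    mean_x = sum(x for x, _ in complete) / n
    mean_y = sum(y for _, y in complete) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in complete)
    var_x = sum((x - mean_x) ** 2 for x, _ in complete)
    var_y = sum((y - mean_y) ** 2 for _, y in complete)
    return cov / math.sqrt(var_x * var_y), n


r, n_used = pearson(rows)
print(f"r = {r:.3f} using {n_used} complete rows")
```

A single number like this hides the shape of the relationship, which is one reason to plot the data as well.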
Visualizing the Relationship
We will now make the transition to R for creating plots.
This is the dataframe in Python:
We can turn the Python dataframe into an R dataframe in the following way:
library(reticulate)
df2 <- reticulate::py$df
Installing Packages in R
To install libraries in R, run something like the following in the “Console” tab
install.packages("ggplot2")
Make sure you delete install.packages("ggplot2") from your script after you have installed the package.
Visualizing the Relationship
This is how we create a scatterplot in R:
library(ggplot2)
ggplot(data = df2,
       mapping = aes(x = urb, y = life_exp)) +
  geom_point()
Visualization
Good visualizations are:
truthful
functional
beautiful
insightful
enlightening
Visualization
“Graphical excellence is the well-designed presentation of interesting data —a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency .[…] [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space … And graphical excellence requires telling the truth about the data.”
Edward Tufte, The Visual Display of Quantitative Information, p. 51
Bad Visualizations 1
The figure has way too many categories
Bad Visualizations 2
The categories in a pie chart should add up to 100%
The areas should correspond to the percentage size
Bad Visualizations 3
The categories in a pie chart should add up to 100%
Bad Visualizations 4
Bad Visualizations 5
Goals of Visualization
Python has a variety of libraries meant for data visualization:
matplotlib
seaborn
plotnine
plotnine is useful because it follows the syntax of ggplot2, a library native to R.
ggplot2 allows us to create powerful visualizations and has better data handling abilities.
Check out this Article comparing Matplotlib and Ggplot2
Visualizations
Part of visualization is how we translate essential content to different forms for specific audiences
Visualizations should ultimately tell stories
Truth comes from a combination of content and form.
Colors
On the most fundamental level, we need to use the right colors for our visualizations
This is relevant for:
Clarity and Readability : users can distinguish among different categories
Accessibility : color-blind people can also see the different categories in your visualization
Emphasis : the right colors can be used to emphasize specific aspects of the data or analysis
Consistency : using a consistent color palette for the same project is helpful
Adobe Color
With Adobe Color you can:
Create color themes based on color theory
Extract themes & gradients from pictures
Create Accessible themes for color-blind audiences
Color Contrast
It is important to use colors with high contrast
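Contrast can also be measured rather than judged by eye. The sketch below uses the contrast ratio defined in the WCAG accessibility guidelines, which is one common convention and not something this lecture prescribes. The function names and the grey example color are illustrative assumptions.

```python
def relative_luminance(hex_color):
    """Relative luminance of an sRGB hex color such as '#CCCCCC' (WCAG 2.x formula)."""
    hex_color = hex_color.lstrip("#")
    # Split into red, green, and blue channels scaled to the 0-1 range
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma encoding for each channel
    linear = [
        c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """Contrast ratio between two colors, from 1 (identical) to 21 (black on white)."""
    lum_a = relative_luminance(color_a)
    lum_b = relative_luminance(color_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


print(round(contrast_ratio("#000000", "#FFFFFF"), 1))  # black on white: 21.0
print(round(contrast_ratio("#CCCCCC", "#FFFFFF"), 1))  # light grey on white: much lower
```

WCAG suggests a ratio of at least 4.5 for normal-sized text, which can serve as a rough guide for labels and annotations in a plot.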
Color Contrast Example
Here is a low color contrast
Color Contrast Example
Here is a high color contrast
Color
About 8% of men and 0.5% of women have some form of color blindness
Thus, colors should be distinguishable by people with different forms of color blindness
Color Contrast
The Viridis palette in R allows us to create color-blind friendly graphs
These are predefined palettes that are widely used.
Color Contrast
This is the difference that this makes
ggplot() +
  geom_sf(data = merged, aes(fill = life_exp_mean))
Color Contrast
This is the difference that this makes
ggplot() +
  geom_sf(data = merged, aes(fill = life_exp_mean)) +
  scale_fill_viridis_b(name = "Life Expectancy", option = "viridis")
Mapping data to aesthetics
On the most fundamental level, we can plot points in ggplot2
Data             Aesthetic    Geom
urb_mean         x            geom_point()
life_exp_mean    y            geom_point()
type             color        geom_point()
Mapping data to aesthetics
This is what this looks like for our data
# Python: load the full dataset
df = pd.read_csv('/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week6/lecture6b/data/life_exp_urb.csv')
# R: bring the dataframe over and plot it
df2 <- reticulate::py$df
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     color = type)) +
  geom_point()
Grammatical Layers
Thus, we have data, aesthetics, and geometries.
We can think of these as layers
We can add them together in ggplot() with +
Possible geoms
Example geom      What it does
geom_point()      Points
geom_col()        Bar charts
geom_text()       Text
geom_boxplot()    Boxplots
geom_sf()         Maps
Possible geoms
There are many possible geoms
Check out the Layers section of the ggplot2 documentation
Other Layers
There are many grammatical layers to describe graphs
We can sequentially add layers to ggplot
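Example layer                         What it does
scale_x_continuous()                  Make the x-axis continuous
scale_x_continuous(breaks = 1:5)      Manually specify axis ticks
scale_x_log10()                       Log the x-axis
scale_color_gradient()                Use a gradient
scale_fill_viridis_d()                Fill with discrete viridis colors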
scale_x_log10()
# Python: load the full dataset
df = pd.read_csv('/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week6/lecture6b/data/life_exp_urb.csv')
# R: bring the dataframe over and plot it with a logged x-axis
df2 <- reticulate::py$df
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     size = life_exp_mean,
                     color = type)) +
  geom_point() +
  scale_x_log10()
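As a rough illustration of what a log scale does to the axis, the sketch below converts a few values to their base-10 logarithms, which is the position a log10 axis gives them. The values are illustrative and are not taken from the dataset.

```python
import math

# On a log10 axis a value is drawn at the position log10(value).
values = [1, 10, 100]
positions = [math.log10(v) for v in values]

# Each tenfold increase moves a point the same distance along the axis.
steps = [later - earlier for earlier, later in zip(positions, positions[1:])]
print(positions)
print(steps)

# On the original scale, 10 and 20 are much closer together than 20 and 100.
# On the log scale, the gap from 20 to 100 shrinks relative to the gap from 10 to 20.
gap_low = math.log10(20) - math.log10(10)
gap_high = math.log10(100) - math.log10(20)
print(round(gap_low, 3), round(gap_high, 3))
```

This is why a logged axis spreads out small values and compresses large ones.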
scale_x_log10()
Note the difference when we don’t use scale_x_log10()
# Python: load the full dataset
df = pd.read_csv('/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week6/lecture6b/data/life_exp_urb.csv')
# R: the same plot without the logged x-axis
df2 <- reticulate::py$df
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     size = life_exp_mean,
                     color = type)) +
  geom_point()
scale_color_viridis_d()
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     size = life_exp_mean,
                     color = type)) +
  geom_point() +
  scale_x_log10() +
  scale_color_viridis_d()
Facets
facet_wrap(vars(type))
ggplot(data = df2,
       mapping = aes(x = urb_mean, y = life_exp_mean, color = type, size = life_exp_mean)) +
  geom_point() +
  scale_x_log10() +
  facet_wrap(vars(type))
Coordinates
Example layer                        What it does
coord_cartesian(ylim = c(1, 10))     Zoom in where y is 1–10
coord_flip()                         Switch x and y
Coordinates
This is what happens if we limit the coordinates
ggplot(data = df2,
       mapping = aes(x = urb_mean, y = life_exp_mean, color = type, size = life_exp_mean)) +
  geom_point() +
  coord_cartesian(ylim = c(30, 50), xlim = c(10, 40))
Coordinates
This is what happens if we flip the axes
ggplot(data = df2,
       mapping = aes(x = urb_mean, y = life_exp_mean, color = type, size = life_exp_mean)) +
  geom_point() +
  coord_flip()
Labels
You can add labels to the plot using a labs() layer
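Example layer                    What it does
labs(title = "Some title")       Title
labs(caption = "Some caption")   Caption
labs(y = "Some label")           y-axis
labs(size = "Some label")        Title of size legend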
Labels
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     color = type,
                     size = life_exp_mean)) +
  geom_point() +
  labs(title = "Health and Urbanization",
       subtitle = "Insights",
       x = "Urbanization",
       y = "Life Expectancy",
       color = "Continent",
       size = "Life Expectancy",  # size is mapped to life_exp_mean
       caption = "Source: Our World in Data")
Themes
You can change the appearance of the plots by changing the theme
Example layer      What it does
theme_grey()       Default grey background
theme_bw()         Black and white
theme_dark()       Dark
theme_minimal()    Minimal
theme_grey()
theme_bw()
theme_dark()
theme_minimal()
theme_economist()
This can be achieved after loading the ggthemes library
theme_wsj()
This can be achieved after loading the ggthemes library
Theme Options
We can make adjustments to the theme:
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     color = type,
                     size = life_exp_mean)) +
  geom_point() +
  labs(title = "Health and Urbanization",
       subtitle = "Insights",
       x = "Urbanization",
       y = "Life Expectancy",
       color = "Continent",
       size = "Life Expectancy",  # size is mapped to life_exp_mean
       caption = "Source: Our World in Data") +
  theme_bw() +
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold"),
        panel.grid = element_blank(),
        axis.title.y = element_text(face = "italic"))
Anatomy of a Theme
Thus, each element of a theme can be manipulated.
There are 94 possible arguments that you can manipulate. For example:
Legend background = legend.background
Text-based elements = element_text()
Disabling elements = element_blank()
Something as specific as the length of the tick marks on the bottom axis = axis.ticks.length.x.bottom
Let us create our own theme
Once we do that we obtain the following color scheme:
Let us create our own theme
Note the color codes
Let us create our own theme
We can finally use the following colors:
Greys
#0D0D0B - almost black
#40403E - dark grey
#A69E94 - light grey
Reds
#BF0404 - red
#590202 - brown
Let us create our own theme
color_palette <- c("#BF0404", "#590202", "grey90")
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     color = type,
                     size = life_exp_mean)) +
  geom_point() +
  theme_bw() +
  scale_color_manual(values = color_palette) +  # Assign colors manually
  # Note: linewidth replaces size for line elements in ggplot2 >= 3.4.0
  theme(panel.background = element_rect(fill = "white"),  # Set background color
        axis.title.x = element_text(color = "#BF0404", face = "bold"),  # Set x-axis label color
        axis.title.y = element_text(color = "#BF0404", face = "bold"),  # Set y-axis label color
        axis.line = element_line(color = "#BF0404", linewidth = 1.5),
        panel.grid = element_blank(),
        panel.border = element_blank(),
        panel.grid.major.y = element_line(colour = "#40403E", linewidth = 0.5, linetype = "dotted"))
Let us create our own theme
Anatomy of a Theme
You should check out Chapter 17, “Themes”, of ggplot2: Elegant Graphics for Data Analysis (3e)
You will learn how to exercise fine control over the non-data elements of your plot.
Adding the Layers Together
We can build a plot sequentially to see how each grammatical layer changes its appearance
Adding the Layers Together
Start with data and aesthetics
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     color = type))
Adding the Layers Together
Add a geom point
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     color = type)) +
  geom_point()
Adding the Layers Together
Adding geom smooth
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     color = type)) +
  geom_point() +
  geom_smooth()
Adding the Layers Together
Getting straight lines
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     color = type)) +
  geom_point() +
  geom_smooth(method = "lm")
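With method = "lm", the smoother draws a least-squares regression line (one per color group here, because color is mapped to type). The sketch below shows the slope and intercept calculation behind such a line in plain Python, using the nine complete rows from the data preview shown earlier. The function name fit_line is illustrative, and the numbers will differ from a fit on the full dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Urbanization (x) and life expectancy (y) for the nine complete preview rows
urb = [18.611754, 40.444164, 52.609213, 79.762491, 87.043017,
       37.539705, 32.350656, 85.282148, 63.111394]
life_exp = [45.383333, 68.286111, 57.530133, 68.637500, 77.048611,
            45.084658, 71.209722, 65.408046, 67.156944]

slope, intercept = fit_line(urb, life_exp)
print(f"life_exp = {intercept:.2f} + {slope:.3f} * urb")
```

A positive slope means that, in these rows, higher urbanization goes together with higher life expectancy.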
Adding the Layers Together
Use a viridis color scale
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     color = type)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_color_viridis_d()
Adding the Layers Together
Facets by continents
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     color = type)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_color_viridis_d() +
  facet_wrap(vars(type), ncol = 1)
Adding the Layers Together
Add labels
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     color = type)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_color_viridis_d() +
  facet_wrap(vars(type), ncol = 1) +
  labs(title = "Health and Urbanization",
       x = "Urbanization",
       y = "Life Expectancy",
       color = "Continent",
       size = "Population")
Adding the Layers Together
Add theme
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     color = type)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_color_viridis_d() +
  facet_wrap(vars(type), ncol = 1) +
  labs(title = "Health and Urbanization",
       x = "Urbanization",
       y = "Life Expectancy",
       color = "Continent",
       size = "Population") +
  theme_bw()
Adding the Layers Together
Modify the theme
ggplot(data = df2,
       mapping = aes(x = urb_mean,
                     y = life_exp_mean,
                     color = type)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_color_viridis_d() +
  facet_wrap(vars(type), ncol = 1) +
  labs(title = "Health and Urbanization",
       x = "Urbanization",
       y = "Life Expectancy",
       color = "Continent",
       size = "Population") +
  theme_bw() +
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold"))
Describing graphs with grammar
We can map life expectancy to the x-axis, add a histogram with bins, fill and facet by continent
ggplot(data = df2,
       mapping = aes(x = life_exp_mean,
                     fill = type)) +
  geom_histogram(binwidth = 5,
                 color = "white") +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(type))
What is a histogram?
Histograms allow us to better understand how frequently or infrequently certain values occur in our dataset.
Imagine a set of values that are spaced out along a number line.
What is a histogram?
To construct a histogram, a section of the number line is divided into equal chunks, called bins.
Next, count how many data points sit inside each bin, and draw bars, one for each bin
The heights of the bars correspond to the number of data points in each bin.
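The binning steps above can be sketched in a few lines of Python. The sketch uses the ten life expectancy values from the data preview shown earlier and a bin width of 5. It assumes that bins start at multiples of the bin width, which is a simplification: ggplot2 may place its bin edges differently by default. The function name bin_counts is illustrative.

```python
import math
from collections import Counter

# Life expectancy values from the first ten rows of the data preview
life_exp = [45.383333, 68.286111, 57.530133, 68.637500, 77.048611,
            45.084658, 69.440278, 71.209722, 65.408046, 67.156944]
bin_width = 5


def bin_counts(values, width):
    """Count values per bin [left, left + width), keyed by the bin's left edge."""
    counts = Counter(math.floor(v / width) * width for v in values)
    return dict(sorted(counts.items()))


counts = bin_counts(life_exp, bin_width)
# Draw a sideways text histogram: one '#' per data point in the bin
for left_edge, count in counts.items():
    print(f"[{left_edge}, {left_edge + bin_width}): {'#' * count}")
```

Changing bin_width changes the picture: wider bins smooth the distribution, and narrower bins reveal more detail but also more noise.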
What is a histogram?
Label the data (in the example below each data point is an SAT score)
Draw in a y-axis which counts the number of data points in each bin
What is a histogram?
This is how it all looks together
What is a histogram?
And these are three histograms applied to our data.
ggplot(data = df2,
       mapping = aes(x = life_exp_mean,
                     fill = type)) +
  geom_histogram(binwidth = 5,
                 color = "white") +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(type))
Describing graphs with grammar
We can map continent to the x-axis, life expectancy to the y-axis, add violin plots and semi-transparent boxplots, fill and facet by continent
ggplot(data = df2,
       mapping = aes(x = type,
                     y = life_exp_mean,
                     fill = type)) +
  geom_violin() +
  geom_boxplot(alpha = 0.5) +
  guides(fill = "none")  # Turn off legend
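A boxplot summarizes each group with a median, two quartiles, whiskers, and any outlying points. The sketch below computes these quantities with Python's statistics module for the ten life expectancy values from the data preview shown earlier. It assumes the common 1.5 × IQR rule for whiskers and the "inclusive" quantile method, so the numbers should be close to, but are not guaranteed to match, what ggplot2 draws.

```python
import statistics

# Life expectancy values from the first ten rows of the data preview
life_exp = [45.383333, 68.286111, 57.530133, 68.637500, 77.048611,
            45.084658, 69.440278, 71.209722, 65.408046, 67.156944]

# Quartiles: the box spans q1 to q3, with a line at the median
q1, median, q3 = statistics.quantiles(life_exp, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
whisker_low = min(v for v in life_exp if v >= lower_fence)
whisker_high = max(v for v in life_exp if v <= upper_fence)

# Anything beyond the fences would be drawn as an individual point
outliers = [v for v in life_exp if v < lower_fence or v > upper_fence]

print(f"box: {q1:.2f} to {q3:.2f}, median {median:.2f}")
print(f"whiskers: {whisker_low:.2f} to {whisker_high:.2f}, outliers: {outliers}")
```

The violin plot drawn underneath complements this summary by showing the full shape of the distribution.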
Adding a Time Dimension
library("gganimate")
# Note: you should install:
# install.packages("av")
# install.packages("magick")
# install.packages("gifski")
# install.packages("gganimate")
ggplot(merged_data_temp, aes(x = urb_yearly,
                             y = life_exp_yearly,
                             size = urb_yearly,
                             color = Entity)) +
  geom_point(alpha = 0.7) +
  scale_size(range = c(2, 12)) +
  guides(size = "none",
         color = "none") +
  facet_wrap(~ continent) +
  scale_color_viridis_d() +
  # Animation arguments
  labs(title = 'Year: {frame_time}',
       x = 'Urbanization',
       y = 'Life Expectancy') +
  transition_time(Year) +
  ease_aes('linear')