Visualization in R is about presenting data in an aesthetically pleasing way
More importantly, visualization is about presenting the data in a truthful way
One good way to tell if something is true or not is by using science
However, data visualization can affect perceptions of what is scientific
The scientist can consciously or unconsciously make particular style choices in a graph that can affect such perceptions
Thus, perception of science can sometimes be impacted by aesthetic choices and the way content is presented
At the most basic level, we can present snippets of the data or basic information about the data
To demonstrate how we can present data, let us download the Life Expectancy and Urbanization data
Once you load it into Python, the dataframe looks like the following:
import os
import pandas as pd
# Step 0: Setting the working directory
os.chdir("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week6/lecture6b/data")
# Step 1: Loading the data
life_exp_urb = pd.read_csv('./life_exp_urb.csv')
# Step 2: Examining the first three entries
life_exp_urb.head(3)
Entity life_exp_mean urb_mean type
0 Afghanistan 45.383333 18.611754 Everything Else
1 Albania 68.286111 40.444164 Everything Else
2 Algeria 57.530133 52.609213 Everything Else
In principle, we can just present basic information about the data:
# Step 1: Subset the DataFrame
df = life_exp_urb[['life_exp_mean', 'urb_mean']].copy()
# Rename the columns
df.columns = ['life_exp', 'urb']
# Display the first 10 entries
print(df.head(10))
life_exp urb
0 45.383333 18.611754
1 68.286111 40.444164
2 57.530133 52.609213
3 68.637500 79.762491
4 77.048611 87.043017
5 45.084658 37.539705
6 69.440278 NaN
7 71.209722 32.350656
8 65.408046 85.282148
9 67.156944 63.111394
All of these ways of examining the data are reasonable. There is a positive correlation between urbanization and life expectancy.
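As a rough check on that claim, the correlation can be computed by hand. The sketch below uses only the ten rows printed above (not the full dataset), skips the row with a missing urbanization value, and relies on a small helper, `pearson_r`, which is a name introduced here for illustration.

```python
import math

# The ten (life_exp, urb) rows printed above; None marks the NaN entry.
rows = [
    (45.383333, 18.611754),
    (68.286111, 40.444164),
    (57.530133, 52.609213),
    (68.637500, 79.762491),
    (77.048611, 87.043017),
    (45.084658, 37.539705),
    (69.440278, None),
    (71.209722, 32.350656),
    (65.408046, 85.282148),
    (67.156944, 63.111394),
]


def pearson_r(pairs):
    """Pearson correlation of (x, y) pairs, skipping pairs with a missing value."""
    complete = [(x, y) for x, y in pairs if x is not None and y is not None]
    n = len(complete)
    mean_x = sum(x for x, _ in complete) / n
    mean_y = sum(y for _, y in complete) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in complete)
    var_x = sum((x - mean_x) ** 2 for x, _ in complete)
    var_y = sum((y - mean_y) ** 2 for _, y in complete)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(rows)
print(f"Pearson r on the nine complete rows: {r:.2f}")
```

A single number like this summarizes the direction of the relationship, but it hides outliers, clusters, and missing values.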
But the following scatterplot is more compelling.
Good visualizations are:
“Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency.[…] [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space… And graphical excellence requires telling the truth about the data.”
Edward Tufte, The Visual Display of Quantitative Information, p. 51
The figure has way too many categories
The categories in a pie chart should add up to 100%
The areas should correspond to the percentage size
Python has a variety of libraries meant for data visualization:
Plotnine is useful as it brings the syntax of ggplot2, a library native to R, to Python
ggplot2 allows us to create powerful visualizations and has better data handling abilities.
Check out this article comparing Matplotlib and ggplot2
Part of visualization is how we translate essential content to different forms for specific audiences
Visualizations should ultimately tell stories
Truth comes from a combination of content and form.
On the most fundamental level, we need to use the right colors for our visualizations
This is relevant for:
One excellent free source for color choice is https://color.adobe.com
With Adobe Color you can:
It is important to use colors with high contrast
The contrast ratio calculator from Adobe’s Create Accessible themes can be helpful
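A contrast ratio can also be computed directly. The sketch below follows the WCAG definition of relative luminance and contrast ratio as I understand it; the function names are introduced here for illustration, and `#E5E5E5` is assumed to be the hex equivalent of R's `grey90`.

```python
def hex_to_rgb(hex_color):
    """Convert '#RRGGBB' to three floats in the range 0-1."""
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4))


def relative_luminance(hex_color):
    """Relative luminance of an sRGB color (WCAG formula)."""
    def linearize(channel):
        # Undo the sRGB gamma encoding for one channel.
        if channel <= 0.04045:
            return channel / 12.92
        return ((channel + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in hex_to_rgb(hex_color))
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """Contrast ratio between two colors, from 1 (identical) to 21 (black on white)."""
    lum_a = relative_luminance(color_a)
    lum_b = relative_luminance(color_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


# Compare a dark red and a light grey against a white background.
for color in ("#590202", "#E5E5E5"):
    print(color, "on white:", round(contrast_ratio(color, "#FFFFFF"), 2))
```

WCAG level AA asks for a ratio of at least 4.5:1 for normal-size text, which is a useful rule of thumb for labels and annotations on a plot.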
Here is a low color contrast
Here is a high color contrast
About 8% of men and 0.5% of women have some form of color blindness
Thus, colors should be distinguishable by people with different forms of color blindness
The Viridis palette in R allows us to create color-blind friendly graphs
These are predefined palettes that are widely used.
This is the difference that this makes
On the most fundamental level, we can plot points in ggplot2
Data | aes() | geom |
---|---|---|
urb_mean | x | geom_point() |
life_exp_mean | y | geom_point() |
type | color | geom_point() |
This is what this looks like for our data
Layers are added to ggplot() with +
Example geom | What it does |
---|---|
geom_point() | Points |
geom_col() | Bar charts |
geom_text() | Text |
geom_boxplot() | Boxplots |
geom_sf() | Maps |
There are many possible geoms
Check out the Layers section of the ggplot2 documentation
There are many grammatical layers to describe graphs
We can sequentially add layers to ggplot
Example layer | What it does |
---|---|
scale_x_continuous() | Make the x-axis continuous |
scale_x_continuous(breaks = 1:5) | Manually specify axis ticks |
scale_x_log10() | Log the x-axis |
scale_color_gradient() | Use a gradient |
scale_fill_viridis_d() | Fill with discrete viridis colors |
scale_x_log10()
Note the difference when we don’t use scale_x_log10()
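To see why a log scale changes the picture, it helps to look at where values land on each kind of axis. The sketch below uses made-up values (powers of ten, not taken from the dataset) and compares the gaps between neighbouring values on a linear axis with the gaps after a base-10 log transform.

```python
import math

# Illustrative values only: each one is ten times the previous one.
values = [1, 10, 100, 1000, 10000]

# On a linear axis the gaps between neighbours grow very quickly.
linear_gaps = [b - a for a, b in zip(values, values[1:])]

# On a log10 axis each value is placed at its base-10 logarithm,
# so equal ratios become equal distances.
log_positions = [math.log10(v) for v in values]
log_gaps = [b - a for a, b in zip(log_positions, log_positions[1:])]

print("linear gaps:", linear_gaps)
print("log10 gaps: ", [round(g, 6) for g in log_gaps])
```

This is why points that bunch up near zero on a linear axis spread out once the axis is logged.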
scale_color_viridis_d()
facet_wrap(vars(continent))
Example layer | What it does |
---|---|
coord_cartesian(ylim = c(1, 10)) | Zoom in where y is 1–10 |
coord_flip() | Switch x and y |
This is what happens if we limit the coordinates
This is what happens if we flip the axes
You can add labels to the plot using a labs() layer
Example layer | What it does |
---|---|
labs(title = "Neat title") | Title |
labs(caption = "Something") | Caption |
labs(y = "Something") | y-axis |
labs(size = "Population") | Title of size legend |
ggplot(data = df2,
mapping = aes(x=urb_mean,
y=life_exp_mean,
color = type,
size = life_exp_mean))+
geom_point()+
labs(title = "Health and Urbanization",
subtitle = "Insights",
x = "Urbanization",
y = "Life Expectancy",
color = "Continent",
size = "Population",
caption = "Source: Our World in Data")
You can change the appearance of the plots by changing the theme
Example layer | What it does |
---|---|
theme_grey() | Default grey background |
theme_bw() | Black and white |
theme_dark() | Dark |
theme_minimal() | Minimal |
theme_grey()
theme_bw()
theme_dark()
theme_minimal()
theme_economist()
This can be achieved after loading the ggthemes library
theme_wsj()
This can be achieved after loading the ggthemes library
We can make adjustments to the theme:
ggplot(data = df2,
mapping = aes(x=urb_mean,
y=life_exp_mean,
color = type,
size = life_exp_mean))+
geom_point()+
labs(title = "Health and Urbanization",
subtitle = "Insights",
x = "Urbanization",
y = "Life Expectancy",
color = "Continent",
size = "Population",
caption = "Source: Our World in Data")+
theme_bw()+
theme(legend.position = "bottom",
plot.title = element_text(face = "bold"),
panel.grid = element_blank(),
axis.title.y = element_text(face = "italic"))
Thus, each element of a theme can be manipulated.
There are 94 possible arguments that you can manipulate. For example:
plot.title
panel.grid
legend.background
element_text()
element_blank()
axis.ticks.length.x.bottom
First let’s decide on our color theme.
Let’s say we want to recreate the color scheme from the 1992 Dracula movie.
First we go to Extract themes & gradients
We upload the jpeg by dragging and dropping
Once we do that we obtain the following color scheme:
Note the color codes
We can finally use the following colors:
Greys
Reds
color_palette <- c("#BF0404", "#590202", "grey90")
ggplot(data = df2,
mapping = aes(x=urb_mean,
y=life_exp_mean,
color = type,
size = life_exp_mean))+
geom_point()+theme_bw()+
scale_color_manual(values = color_palette)+ # Assign colors manually
theme(panel.background = element_rect(fill = "white"), # Set background color
axis.title.x = element_text(color = "#BF0404", face = "bold"), # Set x-axis label color
axis.title.y = element_text(color = "#BF0404", face = "bold"),
axis.line = element_line(color = "#BF0404", size = 1.5),
panel.grid = element_blank(),
panel.border = element_blank(),
panel.grid.major.y = element_line(colour = "#40403E", size = 0.5, linetype = "dotted"))
You should check out C17: Themes of ggplot2: Elegant Graphics for Data Analysis (3e)
You will learn how to exercise fine control over the non-data elements of your plot.
We can build a plot sequentially to see how each grammatical layer changes its appearance
Start with data and aesthetics
Add a geom point
Adding geom smooth
Getting straight lines
Use a viridis color scale
Facets by continents
Add labels
ggplot(data = life_exp_urb,
mapping = aes(x=urb_mean,
y=life_exp_mean,
color = type))+
geom_point()+
geom_smooth(method = "lm")+
scale_color_viridis_d()+
facet_wrap(vars(type), ncol = 1)+
labs(title = "Health and Urbanization",
x = "Urbanization",
y = "Life Expectancy",
color = "Continent",
size = "Population")
Add theme
ggplot(data = df2,
mapping = aes(x=urb_mean,
y=life_exp_mean,
color = type))+
geom_point()+
geom_smooth(method = "lm")+
scale_color_viridis_d()+
facet_wrap(vars(type), ncol = 1)+
labs(title = "Health and Urbanization",
x = "Urbanization",
y = "Life Expectancy",
color = "Continent",
size = "Population")+
theme_bw()
Modify the theme
ggplot(data = df2,
mapping = aes(x=urb_mean,
y=life_exp_mean,
color = type))+
geom_point()+
geom_smooth(method = "lm")+
scale_color_viridis_d()+
facet_wrap(vars(type), ncol = 1)+
labs(title = "Health and Urbanization",
x = "Urbanization",
y = "Life Expectancy",
color = "Continent",
size = "Population")+
theme_bw()+
theme(legend.position = "bottom",
plot.title = element_text(face = "bold"))
We can map life expectancy to the x-axis, add a histogram with bins, fill and facet by continent
Histograms allow us to better understand how frequently or infrequently certain values occur in our dataset.
Imagine a set of values that are spaced out along a number line.
To construct a histogram, a section of the number line is divided into equal chunks, called bins.
Next, count how many data points sit inside each bin, and draw bars, one for each bin
The heights of the bars correspond to the number of data points in each bin.
Label the data (in the example below each data point is an SAT score)
Draw in a y-axis which counts the number of data points in each bin
Finally label your bins.
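The binning steps above can be sketched in a few lines. This is a minimal illustration, not how ggplot2 computes its bins: `histogram_counts` is a helper name introduced here, the bin range (40 to 80) and the number of bins (4) are arbitrary choices, and the input is the ten life expectancy values printed earlier rather than SAT scores.

```python
def histogram_counts(values, start, stop, n_bins):
    """Split [start, stop] into n_bins equal bins and count the values in each."""
    width = (stop - start) / n_bins
    counts = [0] * n_bins
    for value in values:
        if value < start or value > stop:
            continue  # ignore values outside the chosen range
        # A value equal to `stop` is placed in the last bin.
        index = min(int((value - start) / width), n_bins - 1)
        counts[index] += 1
    edges = [start + i * width for i in range(n_bins + 1)]
    return edges, counts


# The ten life expectancy values shown in the table earlier.
life_exp = [45.383333, 68.286111, 57.530133, 68.637500, 77.048611,
            45.084658, 69.440278, 71.209722, 65.408046, 67.156944]

edges, counts = histogram_counts(life_exp, start=40, stop=80, n_bins=4)

# Draw one text "bar" per bin, labelled with the bin boundaries.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:>4.0f}-{right:<4.0f} {'#' * count} ({count})")
```

With these choices the four bins hold 2, 1, 5, and 2 countries. Changing the number of bins changes the shape of the histogram, which is why bin width is a design decision and not a neutral default.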
This is how it all looks together
And these are three histograms applied to our data.
We can map continent to the x-axis, life expectancy to the y-axis, add violin plots and semi-transparent boxplots, fill and facet by continent
library("gganimate")
#Note: you should install:
#install.packages("av")
#install.packages("magick")
#install.packages("gifski")
#install.packages("gganimate")
ggplot(merged_data_temp, aes(x = urb_yearly,
y = life_exp_yearly,
size = urb_yearly,
color=Entity)) +
geom_point(alpha = 0.7) +
scale_size(range = c(2, 12)) +
guides(size = "none",
color = "none") +
facet_wrap(~continent)+
scale_color_viridis_d()+
#animate arguments
labs(title = 'Year: {frame_time}',
x = 'Urbanization',
y = 'Life Expectancy') +
transition_time(Year) +
ease_aes('linear')