L13: Data Visualization 2

Bogdan G. Popescu

bogdan.popescu@johncabot.edu

John Cabot University

Introduction

Intro

Previously, we covered quite some ground:

importance of color choices

mapping data to aesthetics: points

facets, coordinates, labels, themes

theme option manipulation

creating time animations

Preparing the Data

Before we use ggplot, our data has to be tidy

This means:

Each variable is a column

Each observation has its own row

Each value has its own cell

Plotting Quantities

People are better at seeing height differences than angle and area differences

Plotting Quantities

People are better at seeing height differences than angle and area differences.

This how to obtain the same plots.

#Step1: Load library
import pandas as pd

#Step2: Create a simple dataframe
data = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E'],
    'value': [3, 12, 5, 18, 45]
})

Plotting Quantities

People are better at seeing height differences than angle and area differences

This how to obtain the same plots.

Show the code

#Step1: Turn the data to R
library(reticulate)
data <-  reticulate::py$data

#Step2: Graph
library(ggplot2)
ggplot(data, aes(x = "", 
                 y = value, 
                 fill = name)) +
  geom_col() +
  coord_polar(theta = "y") +
  labs(fill = "Individual") +
  theme_void()

Show the code

#Step1: Turn the data to R
library(reticulate)
data <-  reticulate::py$data

#Step2: Loading the library
library(ggplot2)
ggplot(data, aes(x = name, 
                 y = value, 
                 fill = name)) +
  geom_col() +
  labs(fill = "Individual") +
  theme_void()

Advice for Barplots

Always start at zero to be transparent about your data

It may sometimes be more informative to visualize data using other tools than barplots

Let us download again: Life expectancy and Urbanization Data

Advice for Barplots

For example, let’s say we want to compare life expectancy in Latin America with EU

# Setting path
path = "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week7/lecture7a/data/"

# Step 1: Loading the data
life_expectancy_df = pd.read_csv(f'{path}life-expectancy.csv')
urbanization_df = pd.read_csv(f'{path}share-of-population-urban.csv')

# Step 2: Removing countries with no country code
weird_labels = ["OWID_KOS", "OWID_WRL", ""]
clean_life_expectancy_df = life_expectancy_df[~life_expectancy_df['Code'].isin(weird_labels)]

# Step 3: Changing variable name
clean_life_expectancy_df = clean_life_expectancy_df.rename(columns={"Life expectancy at birth (historical)": "life_exp_yearly"})

# Step 4: Keeping only relevant vars
clean_life_expectancy_df2 = clean_life_expectancy_df[['Entity', 'Code', 'Year', 'life_exp_yearly']]

# Step 5: Removing countries with no country code
clean_urbanization_df = urbanization_df[~urbanization_df['Code'].isin(weird_labels)]

# Step 6: Changing variable name
clean_urbanization_df = clean_urbanization_df.rename(columns={"Urban population (% of total population)": "urb_yearly"})

# Step 7: Keeping only relevant vars
clean_urbanization_df2 = clean_urbanization_df[['Code', 'Year', 'urb_yearly']]

# Step 8: Performing a merge
merged_data_temp = pd.merge(clean_life_expectancy_df2, clean_urbanization_df2, on=['Code', 'Year'], how='left')

# Step 9: Removing NAs
merged_data_temp = merged_data_temp.dropna()

# Step 10: Defining continents
eu_countries = [
    "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia", "Denmark", "Estonia", 
    "Finland", "France", "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia", 
    "Lithuania", "Luxembourg", "Malta", "Netherlands", "Poland", "Portugal", "Romania", 
    "Slovakia", "Slovenia", "Spain", "Sweden"]

latam_countries = [
    "Belize", "Costa Rica", "El Salvador", "Guatemala", "Honduras", "Mexico", "Nicaragua", "Panama", 
    "Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", "Guyana", "Paraguay", "Peru", 
    "Suriname", "Uruguay", "Venezuela", "Cuba", "Dominican Republic", "Haiti"]
    
# Step 11: Labeling continents 
merged_data_temp['continent'] = 'Everything Else'  # Default value

merged_data_temp.loc[merged_data_temp['Entity'].isin(eu_countries), 'continent'] = 'EU'
merged_data_temp.loc[merged_data_temp['Entity'].isin(latam_countries), 'continent'] = 'Latin America'

Advice for Barplots

For example, let’s say we want to compare life expectancy in Latin America with EU

#Calculate by group
averages = merged_data_temp.groupby(by='continent')
#Make Data Frame
averages = pd.DataFrame(averages.mean(numeric_only=True)).reset_index()
# Rename the columns
averages = averages.rename(columns={"life_exp_yearly": "life_exp_mean", "urb_yearly": "urb_mean"})
averages

         continent         Year  life_exp_mean   urb_mean
0               EU  1990.000000      74.259927  67.138362
1  Everything Else  1989.976588      62.945803  47.720601
2    Latin America  1990.000000      66.128083  59.018407

Advice for Barplots

For example, let’s say we want to compare life expectancy in Latin America, EU, and the Rest of the World

#Step1: Turn the data to R
library(reticulate)
averages <-  reticulate::py$averages

#Step2: Mapping
ggplot(averages, aes(x = continent, 
                 y = life_exp_mean, 
                 fill = continent)) +
  geom_col() +
  labs(fill = "Individual") +
  theme_void()

Advice for Barplots

A more compelling way is: boxplots with points

#Step1: Turn the data to R
library(reticulate)
final2 <-  reticulate::py$merged_data_temp

#Step2: Graph
ggplot(final2, aes(x = continent, 
                   y = life_exp_yearly,
                   color = continent)) +
  geom_boxplot()+
  geom_point(position = position_jitter(height = 0), 
             alpha = 0.05) +
  guides(color = "none")

Advice for Barplots

Another way is to combine violin with points

#Step1: Turn the data to R
library(reticulate)
final2 <-  reticulate::py$merged_data_temp

#Step2: Graph
ggplot(final2, aes(x = continent, 
                   y = life_exp_yearly,
                   color = continent)) +
  geom_violin()+
  geom_point(position = position_jitter(height = 0), 
             alpha = 0.05) +
  guides(color = "none")

Advice for Barplots

Another way is to have overlapping ridgeplots

#Step1: Turn the data to R
library(reticulate)
final2 <-  reticulate::py$merged_data_temp

#Step2: Graph
library(ggridges)
ggplot(final2, aes(x = life_exp_yearly, 
                   y = continent,
                   fill = continent)) +
  geom_density_ridges() +
  guides(color = "none")

Advice for Barplots

Another way is to have all of them superimposed

#Step1: Turn the data to R
library(reticulate)
final2 <-  reticulate::py$merged_data_temp

#Step2: Graph
library(ggridges)
ggplot(final2, aes(x = life_exp_yearly, 
                   fill = continent)) +
  geom_density(alpha = 0.5)+
  guides(color = "none")

Plotting Uncertainty

As discussed it is good to add more information to your graphs to display the whole distribution of numbers

For example, the right is better than the left

Plotting Uncertainty

This is the code for left and right

#Step1: Turn the data to R
library(reticulate)
averages <-  reticulate::py$averages

#Step2: Graph
ggplot(averages, aes(x = continent, 
                 y = life_exp_mean, 
                 fill = continent)) +
  geom_col() +
  labs(fill = "Individual") +
  theme_bw()

#Step1: Turn the data to R
library(reticulate)
final2 <-  reticulate::py$merged_data_temp

#Step2: Graph
ggplot(final2, aes(x = life_exp_yearly, 
                   fill = continent)) +
  geom_histogram(binwidth = 2, 
                 color = "white") +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()

Plotting Uncertainty

It could also be helpful to play with the binwidth: binwidth = 2

#Step1: Turn the data to R
library(reticulate)
final2 <-  reticulate::py$merged_data_temp

#Step2: Graph
ggplot(final2, aes(x = life_exp_yearly, 
                   fill = continent)) +
  geom_histogram(binwidth = 2, 
                 color = "white") +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()

Plotting Uncertainty

It could also be helpful to play with the binwidth: binwidth = 10

#Step1: Turn the data to R
library(reticulate)
final2 <-  reticulate::py$merged_data_temp

#Step2: Graph
ggplot(final2, aes(x = life_exp_yearly, 
                   fill = continent)) +
  geom_histogram(binwidth = 10, 
                 color = "white") +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()

Plotting Uncertainty

We can obtain something similar with densities: they are a smoothed version of the histogram.

#Step1: Turn the data to R
library(reticulate)
final2 <-  reticulate::py$merged_data_temp

#Step2: Graph
ggplot(final2, aes(x = life_exp_yearly, 
                   fill = continent)) +
  geom_density() +
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()

Plotting Uncertainty

The difference is that one should count and the other one density.

The second shows the probability density function (PDF) of the variable: use calculus to find the probability of each x value

Plotting Uncertainty

We can obviously also plot them together

#Step1: Turn the data to R
library(reticulate)
final2 <-  reticulate::py$merged_data_temp

#Step2: Graph
ggplot(final2, aes(x = life_exp_yearly, 
                   fill = continent)) +
  geom_histogram(binwidth = 2, 
                 color = "white") +
  #scale the density to a similar scale to the histogram:
  #in this case, I multiply by 4000
  #note also aes(y = ..density..* 4000)
  geom_density(aes(y = ..density..* 4000), 
               alpha = 0.5)+
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/4000, 
                                name = "density"))

Plotting Uncertainty

Having a closer look at the code

#Step1: Turn the data to R
library(reticulate)
final2 <-  reticulate::py$merged_data_temp

#Step2: Graph
ggplot(final2, aes(x = life_exp_yearly, 
                   fill = continent)) +
  #geom_histogram with the same parameters
  geom_histogram(binwidth = 2, 
                 color = "white") +
  #note the aes(y = ..density..* 4000)
  #scale the density to a similar scale to the histogram
  #in this case, I multiply by 4000
  #otherwise it will not be visible
  #density calculates relative frequency:
  #count / sum(count): e.g. 385/1403
  geom_density(aes(y = ..density..* 4000), 
               alpha = 0.5)+
  guides(fill = "none") +  # Turn off legend
  facet_wrap(vars(continent))+
  theme_bw()+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/4000, 
                                name = "density"))

Why use a density curve vs. a histogram?

A histogram shows the counts of values in each range
It is made up of bars that touch each other

A density plot shows the proportion of values in each range
It is a smooth curve that shows the distribution of the data in a more continuous way

Frequencies and Densities

Box Plots

Here is a boxlot for life expectancy for the entire dataset

#Step1: Turn the data to R
library(reticulate)
final2 <-  reticulate::py$merged_data_temp

#Step2: Graph
ggplot(final2, aes(x = life_exp_yearly)) +
  geom_boxplot()+
  labs(x = "Life Expectancy") +
  theme_bw()

Box Plots

Here is the interpretation

Box Plots

This is what the histogram look like:

Box Plots

This is what the density function looks like:

Box Plots

This is what the actual values look like:

Box Plots

And this is what they look like together:

Violin Plots

ggplot(final2, aes(x = life_exp_yearly, y="")) +
  geom_violin()+
  geom_boxplot()+
  labs(x = "Life Expectancy") +
  theme_bw()

Plotting Uncertainty in Scatterplots

We have already covered this, but here is the relationship between urbanization and life expectancy

# Setting path
path = "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week7/lecture7a/data/"

# Step 1: Loading the data
life_exp_urb = pd.read_csv(f'{path}life_exp_urb.csv')

Plotting Uncertainty in Scatterplots

We have already covered this, but here is the relationship between urbanization and life expectancy

#Step1: Turn the data to R
library(reticulate)
life_exp_urb <-  reticulate::py$life_exp_urb

#Step2: Graph
ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
    geom_point()

Plotting Uncertainty in Scatterplots

And here are the slopes for the three groups

#Step1: Turn the data to R
library(reticulate)
life_exp_urb <-  reticulate::py$life_exp_urb

#Step2: Graph
ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
    geom_point()+
    geom_smooth(method = "lm", se=FALSE)

Plotting Uncertainty in Scatterplots

And here are the confidence intervals for the three slopes

#Step1: Turn the data to R
library(reticulate)
life_exp_urb <-  reticulate::py$life_exp_urb

#Step2: Graph
ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
    geom_point()+
    geom_smooth(method = "lm")

Annotations

We also need to discuss text in plots

On the most basic level, we can label actual data points with the help of options such as:

geom_text()
geom_label()
geom_text_repel()

We can also add annotations with annotate()

Annotations

geom_text is not great when we have too many points

Show the code

#Step1: Turn the data to R
library(reticulate)
life_exp_urb <-  reticulate::py$life_exp_urb

#Step2: Graph
ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_text(aes(label = Entity))

Annotations

geom_label is not great when we have too many points. It covers the points

Show the code

#Step1: Turn the data to R
library(reticulate)
life_exp_urb <-  reticulate::py$life_exp_urb

#Step2: Graph
ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_label(aes(label = Entity))

Annotations

ggrepel is one solution: geom_text_repel

Show the code

#Step1: Turn the data to R
library(reticulate)
life_exp_urb <-  reticulate::py$life_exp_urb

#Step2: Graph
library(ggrepel)
ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_text_repel(aes(label = Entity))

Annotations

ggrepel is one solution: geom_label_repel

Show the code

#Step1: Turn the data to R
library(reticulate)
life_exp_urb <-  reticulate::py$life_exp_urb

#Step2: Graph
library(ggrepel)
ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = type))+
  geom_point()+
  geom_label_repel(aes(label = Entity))

Highlighting

We can highlight specific observations in our data

#Step1: Turn the data to R
library(reticulate)
life_exp_urb <-  reticulate::py$life_exp_urb

#Step2: Graph
#Creating a Variable to Identify Obs
life_exp_urb$eu<-ifelse(life_exp_urb$type=="EU", 1, 0)
#Making the variable a factor
life_exp_urb$eu<-as.factor(life_exp_urb$eu)
#Plotting
ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = eu))+
  geom_point()+
  scale_color_manual(values = c("grey70", 
                                "red"))+
  guides(color = "none")

Annotating Graphs

This is how we can add text in the graph

#Step1: Turn the data to R
library(reticulate)
life_exp_urb <-  reticulate::py$life_exp_urb
#Step2: Graph
#Creating a Variable to Identify Obs
life_exp_urb$eu<-ifelse(life_exp_urb$type=="EU", 1, 0)
#Making the variable a factor
life_exp_urb$eu<-as.factor(life_exp_urb$eu)
#Plotting
ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = eu))+
  geom_point()+
  scale_color_manual(values = c("grey70", 
                                "red"))+
  guides(color = "none")+
  annotate(geom = "text",
           x = 75, y = 45,
           label = "EU Countries", color="red")

Annotating Graphs

This is how we can add a rectangle around our points of interest

xmin_val<-min(life_exp_urb$urb_mean[life_exp_urb$eu==1])
xmax_val<-max(life_exp_urb$urb_mean[life_exp_urb$eu==1])
ymin_val<-min(life_exp_urb$life_exp_mean[life_exp_urb$eu==1])
ymax_val<-max(life_exp_urb$life_exp_mean[life_exp_urb$eu==1])

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = eu))+
  geom_point()+
  scale_color_manual(values = c("grey70", 
                                "red"))+
  guides(color = "none")+
  annotate(geom = "text",
           x = 75, y = 45,
           label = "EU Countries", color="red")+
  annotate(geom = "rect", 
           xmin = xmin_val, 
           xmax = xmax_val,
           ymin = ymin_val, 
           ymax = ymax_val,
           fill = "red", alpha = 0.2)

Annotating Graphs

This is how we can add an arrow towards the rectangle

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = eu))+
  geom_point()+
  scale_color_manual(values = c("grey70", 
                                "red"))+
  guides(color = "none")+
  annotate(geom = "text",
           x = 75, y = 45,
           label = "EU Countries", color="red")+
  annotate(geom = "rect", 
           xmin = xmin_val, 
           xmax = xmax_val,
           ymin = ymin_val, 
           ymax = ymax_val,
           fill = "red", alpha = 0.2)+
  annotate(geom = "segment", 
           x = 75, 
           xend = 75, 
           y = 45, 
           yend = ymin_val,
           arrow = arrow(
             length = unit(0.1, "in")))

Annotating Graphs

Or let’s turn the text into a label

ggplot(data = life_exp_urb, 
  mapping = aes(x=urb_mean, 
                y=life_exp_mean, 
                color = eu))+
  geom_point()+
  scale_color_manual(values = c("grey70", 
                                "red"))+
  guides(color = "none")+
  annotate(geom = "label",
           x = 75, 
           y = 45,
           label = "EU Countries", 
           color="red")+
  annotate(geom = "rect", 
           xmin = xmin_val, 
           xmax = xmax_val,
           ymin = ymin_val, 
           ymax = ymax_val,
           fill = "red", alpha = 0.2)+
  annotate(geom = "segment", 
           x = 75, 
           xend = 75, 
           y = 45, 
           yend = ymin_val,
           arrow = arrow(
             length = unit(0.1, "in")))

Visualizing Time

You may also want to plot temporal data

To do that we need to load the life expectancy data over time

# Setting path
path = "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week7/lecture7a/data/"
# Step 1: Loading the data
life_expectancy = pd.read_csv(f'{path}life-expectancy.csv')

#Step2: Turn the data to R
library(reticulate)
life_expectancy <-  reticulate::py$life_expectancy

Visualizing Time

Let us look again at the data

head(life_expectancy, n=3)

       Entity Code Year Life expectancy at birth (historical)
1 Afghanistan  AFG 1950                                  27.7
2 Afghanistan  AFG 1951                                  28.0
3 Afghanistan  AFG 1952                                  28.4

Temporal Data

Let us imagine that we want to plot life expectancy over time

We first need to calculate average by year

life_expectancy2 = life_expectancy.groupby(['Year'])['Life expectancy at birth (historical)'].agg(["mean"])
#Creating a dataframe
life_expectancy2 = pd.DataFrame(life_expectancy2).reset_index()
#Renaming the column
life_expectancy2 = life_expectancy2.rename(columns={"mean": 'life_exp_mean'})

Let us look at the transformed data.

life_expectancy2.head(3)

   Year  life_exp_mean
0  1543          33.94
1  1548          38.82
2  1553          39.59

Temporal Data

We are now ready to plot the time trend

Show the code

#Step1: Turn the data to R
library(reticulate)
life_expectancy2 <-  reticulate::py$life_expectancy2
#Step2: Graph
ggplot(life_expectancy2, aes(x = Year, y = life_exp_mean)) +
  geom_line()

Temporal Data

We can also produce an animation

Show the code

#Note: you should install:
#install.packages("av")
#install.packages("magick")
#install.packages("gifski")
#install.packages("gganimate")
library(gganimate)
ggplot(life_expectancy2, aes(x = Year, y = life_exp_mean)) +
  geom_line()+
    labs(title = 'Year: {frame_along}', 
       x = 'Urbanization', 
       y = 'Life Expectancy') +
  transition_reveal(Year)

Temporal Data

To view the upward sloping trend in life expectancy, it might be worth aggregating some of the years.

ggplot(life_expectancy2, aes(x = Year, y = life_exp_mean)) +
  geom_line()+
  geom_smooth()

Temporal Data: Two Countries

Let us imagine that we want to compare Italy and the US in life expectancy

This means we go back to the original file and select the US and Italy

# Subsetting the data
life_expectancy_itus = life_expectancy[life_expectancy['Entity'].isin(["Italy", "United States"])]

# Selecting only after 1900
life_expectancy_itus = life_expectancy_itus[life_expectancy_itus['Year'] > 1899]

# Renaming life expectancy variable
life_expectancy_itus = life_expectancy_itus.rename(columns={"Life expectancy at birth (historical)": "life_exp"})

Let’s examine the data

life_expectancy_itus.head(4)

     Entity Code  Year  life_exp
8582  Italy  ITA  1900     41.67
8583  Italy  ITA  1901     43.54
8584  Italy  ITA  1902     42.99
8585  Italy  ITA  1903     43.11

Temporal Data: Two Countries

We can now plot life expectancy for the two countries over time

#Step1: Turn the data to R
library(reticulate)
life_expectancy_itus2 <-  reticulate::py$life_expectancy_itus
#Plotting
ggplot(life_expectancy_itus2, 
       aes(x=Year, y= life_exp, color = Entity))+
  geom_line()+
  scale_colour_viridis_d()

Temporal Data: Two Countries

Changing the starting value of our y-axis can change the appearance of our data

Temporal Data: Two Countries

Changing the starting value of our y-axis can change the appearance of our data

ggplot(life_expectancy_itus2, 
       aes(x=Year, 
           y= life_exp, 
           color = Entity))+
  geom_line()+
  scale_colour_viridis_d()

ggplot(life_expectancy_itus2, 
       aes(x=Year, 
           y= life_exp, 
           color = Entity))+
  geom_line()+
  scale_colour_viridis_d()+
  ylim(0, NA)

Temporal Data Annotations

Annotations work very well with temporal data

ggplot(life_expectancy_itus2, 
       aes(x=Year, 
           y= life_exp, 
           color = Entity))+
  geom_line()+
  scale_colour_viridis_d()+
  annotate(geom = "label",
           x = 1918, y = 80,
           label = "WW1", 
           color="red")+
  annotate(geom = "rect", 
           xmin = 1917, 
           xmax = 1920,
           ymin = 0, 
           ymax = Inf,
           fill = "red", 
           alpha = 0.2)

Visualizing Time

Another cool way to visualize data is to create density ridges for every year

We will however want fewer years to make the graph easier to read

Visualizing Time

Here is how we go about it:

# Selecting only after 1900
final3 = merged_data_temp[merged_data_temp['Year'] > 1980]
# Converting the 'Year' column to a categorical data type
final3['year_factor'] = final3['Year'].astype('category')

#Step1: Turn the data to R
library(ggplot2)
library(reticulate)
library(viridis)
library(ggridges)
library(forcats)
final3 <-  reticulate::py$final3
#Step4: Plot
ggplot(final3, 
       aes(x = life_exp_yearly, 
           y = fct_rev(as.factor(year_factor)), fill = ..x..))+
  geom_density_ridges_gradient(color = "white", 
                               quantile_lines = TRUE)+
  labs(x = "Life expectancy", y = NULL, fill = NULL)+
  scale_fill_viridis_c(name = "")

Visualizing Time

This produces:

Visualizing Time

We can see this way the first, second, and third quartile

Visualizing Time

We also see that the life expectancy gets narrower over time: most people live longer lives

Saving Figures

One final aspect related to visualization is how we save the data

The command for this ggsave

Let us look at a few options

Saving Figures

Let’s imagine we want to save the following plot

ggplot(life_expectancy_itus2, 
       aes(x=Year, 
           y= life_exp, 
           color = Entity))+
  geom_line()+
  scale_colour_viridis_d()+
  annotate(geom = "label",
           x = 1918, y = 80,
           label = "WW1", 
           color="red")+
  annotate(geom = "rect", 
           xmin = 1917, 
           xmax = 1920,
           ymin = 0, 
           ymax = Inf,
           fill = "red", 
           alpha = 0.2)

Saving Figures

Let’s imagine we want to save the following plot

Saving Figures

Let’s imagine we want to save the following plot

We first create an object: pic1

pic1<-ggplot(life_expectancy_itus2, 
       aes(x=Year, 
           y= life_exp, 
           color = Entity))+
  geom_line()+
  scale_colour_viridis_d()+
  annotate(geom = "label",
           x = 1918, y = 80,
           label = "WW1", 
           color="red")+
  annotate(geom = "rect", 
           xmin = 1917, 
           xmax = 1920,
           ymin = 0, 
           ymax = Inf,
           fill = "red", 
           alpha = 0.2)

Saving Figures

The next step is to use the ggsave function.

This will save the object in your directory

#Saving as PDF
ggsave(pic1, filename = "./figures/pic1.pdf")

#Saving as PNG
ggsave(pic1, filename = "./figures/pic1.png")

#Saving as JPG
ggsave(pic1, filename = "./figures/pic1.jpg")

Saving Figures

It is important to be aware how much space these files take

pic1.pdf - 7KB
pic1.png - 119KB
pic.jpg - 410KB

Saving Figures

Depending on your goal, you may want to opt for one of the three

For web, pngs are good

For academic publications, jpgs are acceptable

Saving Figures

One should also be aware that width and heigh options, as part of the ggsave function will alter the appearance of your figure

#Saving as JPG
ggsave(pic1, filename = "./figures/pic1_square.jpg", 
       width = 20, 
       height = 20, 
       units = "cm")

#Saving as JPG
ggsave(pic1, filename = "./figures/pic1_rect.jpg", 
       width = 20, 
       height = 10, 
       units = "cm")

Saving Figures

Here is what they look like:

Saving Figures

For publication quality figures, you should use a dpi of 300

pic1<-ggplot(life_expectancy_itus2, 
       aes(x=Year, 
           y= life_exp, 
           color = Entity))+
  geom_line()+
  scale_colour_viridis_d()+
  annotate(geom = "label",
           x = 1918, y = 80,
           label = "WW1", 
           color="red")+
  annotate(geom = "rect", 
           xmin = 1917, 
           xmax = 1920,
           ymin = 0, 
           ymax = Inf,
           fill = "red", 
           alpha = 0.2)

#Saving as JPG
ggsave(pic1, filename = "./figures/pic1_300dpi.jpg", 
       width = 20, 
       height = 20, 
       units = "cm",
       dpi=300)

Saving Figures

And here is 50 dpi

#Saving as JPG
ggsave(pic1, filename = "./figures/pic1_50dpi.jpg", 
       width = 20, 
       height = 20, 
       units = "cm",
       dpi=50)

The take-away is that higher dpi result in greater clarity at the expense of space

Lower dpi means lower clarity and lower space:

pic1_50dpi - 25KB
pic1_300dpi - 471KB

Conclusion

We have covered more ground in data visualization today:

different ways of plotting quantities: pie charts vs. bar charts
boxplots, violin plots, histograms, and density plots
the relationship between histograms and density plots
labelling and annotating graphs
highlighting graphs
visualizing time
saving figures