L12: Data Visualization 2

Bogdan G. Popescu

John Cabot University

Introduction

Previously, we covered quite a bit of ground:

  • importance of color choices
  • mapping data to aesthetics: points
  • facets, coordinates, labels, themes
  • theme option manipulation
  • creating time animations

Preparing the Data

Before we use ggplot, our data has to be tidy.

This means:

  • Each variable is a column
  • Each observation has its own row
  • Each value has its own cell
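
As a minimal sketch of what "tidy" means in practice, the snippet below reshapes a small wide table into long form with pandas. The table, its column names (`country`, `life_exp`), and its values are made up for illustration and are not part of the lecture data.

```python
import pandas as pd

# Hypothetical "wide" table: one column per year, so the variable "year"
# is spread across column names instead of living in its own column.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2000": [70.1, 65.3],
    "2010": [72.4, 68.0],
})

# Reshape to tidy ("long") form: one row per country-year observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="life_exp")

# The year labels came from column names, so they are strings; make them integers.
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

After the reshape, each variable (country, year, life expectancy) has its own column, which is the layout ggplot expects when mapping variables to aesthetics.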

Plotting Quantities

People are better at seeing height differences than angle and area differences.

This is how to obtain the same plots.

Python
#Step1: Load library
import pandas as pd

#Step2: Create a simple dataframe
data = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E'],
    'value': [3, 12, 5, 18, 45]
})
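
To make the comparison concrete, this sketch computes what a pie chart would encode for the same values: each row's share of the total and the corresponding slice angle. The name `pie_data` and the added columns are illustrative and not part of the original code.

```python
import pandas as pd

# Same toy values as above, kept in a separate (hypothetical) data frame.
pie_data = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "value": [3, 12, 5, 18, 45],
})

# Share of the total for each row; a pie slice's angle is this share times 360 degrees.
pie_data["share"] = pie_data["value"] / pie_data["value"].sum()
pie_data["angle_deg"] = pie_data["share"] * 360

print(pie_data.round(3))
```

A bar chart shows `value` directly as height, whereas a pie chart asks the reader to judge `angle_deg`, which is the harder task.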

R
#Step1: Bring the data from Python into R
library(reticulate)
data <- reticulate::py$data

#Step2: Graph
library(ggplot2)
ggplot(data, aes(x = "", 
                 y = value, 
                 fill = name)) +
  geom_col() +
  coord_polar(theta = "y") +
  labs(fill = "Individual") +
  theme_void()

R
#Step1: Bring the data from Python into R
library(reticulate)
data <- reticulate::py$data

#Step2: Graph
library(ggplot2)
ggplot(data, aes(x = name, 
                 y = value, 
                 fill = name)) +
  geom_col() +
  labs(fill = "Individual") +
  theme_void()

Advice for Barplots

  • Always start the value axis at zero: bar length encodes the value, so a truncated axis exaggerates differences
  • It can sometimes be more informative to visualize the data with tools other than barplots

For example, let’s say we want to compare life expectancy in Latin America with the EU.

Python
# Setting path
path = "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week6/lecture6a/data/"

# Step 1: Loading the data
life_expectancy_df = pd.read_csv(f'{path}life-expectancy.csv')
urbanization_df = pd.read_csv(f'{path}share-of-population-urban.csv')

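# Step 2: Removing entries without a standard country code (Kosovo, World)
# Note: read_csv turns empty codes into NaN, so "" will not match them here;
# rows with a missing Code are dropped later by dropna() in Step 9.
weird_labels = ["OWID_KOS", "OWID_WRL", ""]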
clean_life_expectancy_df = life_expectancy_df[~life_expectancy_df['Code'].isin(weird_labels)]

# Step 3: Changing variable name
clean_life_expectancy_df = clean_life_expectancy_df.rename(columns={"Life expectancy at birth (historical)": "life_exp_yearly"})

# Step 4: Keeping only relevant vars
clean_life_expectancy_df2 = clean_life_expectancy_df[['Entity', 'Code', 'Year', 'life_exp_yearly']]

# Step 5: Removing countries with no country code
clean_urbanization_df = urbanization_df[~urbanization_df['Code'].isin(weird_labels)]

# Step 6: Changing variable name
clean_urbanization_df = clean_urbanization_df.rename(columns={"Urban population (% of total population)": "urb_yearly"})

# Step 7: Keeping only relevant vars
clean_urbanization_df2 = clean_urbanization_df[['Code', 'Year', 'urb_yearly']]

# Step 8: Performing a merge
merged_data_temp = pd.merge(clean_life_expectancy_df2, clean_urbanization_df2, on=['Code', 'Year'], how='left')

# Step 9: Removing NAs
merged_data_temp = merged_data_temp.dropna()

# Step 10: Defining regions (stored in a column called 'continent')
eu_countries = [
    "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia", "Denmark", "Estonia", 
    "Finland", "France", "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia", 
    "Lithuania", "Luxembourg", "Malta", "Netherlands", "Poland", "Portugal", "Romania", 
    "Slovakia", "Slovenia", "Spain", "Sweden"]

latam_countries = [
    "Belize", "Costa Rica", "El Salvador", "Guatemala", "Honduras", "Mexico", "Nicaragua", "Panama", 
    "Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", "Guyana", "Paraguay", "Peru", 
    "Suriname", "Uruguay", "Venezuela", "Cuba", "Dominican Republic", "Haiti"]
    
# Step 11: Labeling regions
merged_data_temp['continent'] = 'Everything Else'  # Default value

merged_data_temp.loc[merged_data_temp['Entity'].isin(eu_countries), 'continent'] = 'EU'
merged_data_temp.loc[merged_data_temp['Entity'].isin(latam_countries), 'continent'] = 'Latin America'
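
Labeling by name only works if the names in the lists are spelled exactly as in the `Entity` column; any mismatch silently ends up in "Everything Else". The sketch below shows one way to check for this on a toy data frame. The data, the shortened lists, and the deliberate mismatch ("Czech Republic") are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for merged_data_temp; values are made up for illustration.
toy = pd.DataFrame({
    "Entity": ["France", "Czechia", "Brazil", "Japan"],
    "life_exp_yearly": [82.0, 79.0, 75.0, 84.0],
})

# Shortened, hypothetical region lists; "Czech Republic" is a deliberate mismatch.
eu_names = ["France", "Czech Republic"]
latam_names = ["Brazil"]

# Names in the lists that never appear in the data would be left in the
# default category without any warning, so list them explicitly.
known = set(toy["Entity"])
unmatched = sorted(set(eu_names + latam_names) - known)

print(unmatched)
```

An empty result means every listed name was found in the data; anything else points to a spelling difference worth fixing before plotting.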


Python
#Calculate by group
averages = merged_data_temp.groupby(by='continent')
#Make Data Frame
averages = pd.DataFrame(averages.mean(numeric_only=True)).reset_index()
# Rename the columns
averages = averages.rename(columns={"life_exp_yearly": "life_exp_mean", "urb_yearly": "urb_mean"})
averages
         continent         Year  life_exp_mean   urb_mean
0               EU  1990.000000      74.259927  67.138362
1  Everything Else  1989.976588      62.945803  47.720601
2    Latin America  1990.000000      66.128083  59.018407
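
Note that the table above also averages `Year`, which has no useful interpretation. A possible variation, sketched here on made-up numbers, uses pandas named aggregation to pick the columns explicitly and to report the spread and group size next to the mean. The output column names (`life_exp_sd`, `n_obs`) are illustrative choices.

```python
import pandas as pd

# Toy stand-in for merged_data_temp; values are made up for illustration.
toy = pd.DataFrame({
    "continent": ["EU", "EU", "Latin America", "Latin America"],
    "Year": [1990, 2000, 1990, 2000],
    "life_exp_yearly": [75.0, 78.0, 68.0, 72.0],
    "urb_yearly": [70.0, 72.0, 65.0, 75.0],
})

# Named aggregation: each output column is (input column, function).
summary = (
    toy.groupby("continent")
       .agg(
           life_exp_mean=("life_exp_yearly", "mean"),
           life_exp_sd=("life_exp_yearly", "std"),
           urb_mean=("urb_yearly", "mean"),
           n_obs=("life_exp_yearly", "size"),
       )
       .reset_index()
)

print(summary)
```

Reporting a standard deviation and a count alongside each mean already hints at how much information a single bar per group leaves out.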

Here we compare life expectancy in Latin America, the EU, and the rest of the world.

R
#Step1: Bring the data from Python into R
library(reticulate)
averages <- reticulate::py$averages

#Step2: Graph
library(ggplot2)
ggplot(averages, aes(x = continent, 
                     y = life_exp_mean, 
                     fill = continent)) +
  geom_col() +
  labs(fill = "Region") +
  theme_void()

A more compelling alternative is a boxplot with the individual points overlaid.

R
#Step1: Bring the data from Python into R
library(reticulate)
final2 <- reticulate::py$merged_data_temp

#Step2: Graph (points are jittered horizontally only, so y values stay exact)
library(ggplot2)
ggplot(final2, aes(x = continent, 
                   y = life_exp_yearly,
                   color = continent)) +
  geom_boxplot() +
  geom_point(position = position_jitter(height = 0), 
             alpha = 0.05) +
  guides(color = "none")
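
For readers who want the numbers behind the boxes, this sketch computes the quartiles and interquartile range per group on the Python side. It uses made-up values, and the exact hinge positions in a plot may differ slightly depending on the quantile method used, so treat it as an approximation of what the boxplot summarizes.

```python
import pandas as pd

# Toy stand-in for merged_data_temp; values are made up for illustration.
toy = pd.DataFrame({
    "continent": ["EU"] * 5 + ["Latin America"] * 5,
    "life_exp_yearly": [70, 72, 74, 76, 78, 60, 64, 68, 72, 76],
})

# Quartiles per group: one row per group, one column per quantile.
quartiles = (
    toy.groupby("continent")["life_exp_yearly"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
quartiles.columns = ["q1", "median", "q3"]

# Interquartile range: the height of the box.
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]

print(quartiles)
```

Unlike a bar of means, these summaries show both the center and the spread of each group, and the overlaid points add the full distribution.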