L12: Data Visualization

Bogdan G. Popescu

John Cabot University

Introduction

Visualization in R is about presenting data in an aesthetically pleasing way

More importantly, visualization is about presenting the data in a truthful way

Finding the Truth

One good way to tell if something is true or not is by using science

Truth vs. Art

However, data visualization can affect perceptions of what is scientific

The scientist can consciously or unconsciously make style choices in a graph that affect such perceptions

Thus, perception of science can sometimes be impacted by aesthetic choices and the way content is presented

Duties of a Data Analyst

  • Not manipulate data and thus, not lie
  • Not misrepresent data
  • Find the right balance between dumbing down and sounding too esoteric when communicating results
    • You are ultimately a translator
  • Emphasize the story and make data available

The Data

At the most basic level, we can present snippets of the data or basic information about the data

To demonstrate how we can present data, let us download: Life expectancy and Urbanization Data

Once loaded into Python, the dataframe looks like this:

import os
import pandas as pd
# Step 0: Setting the working directory
os.chdir("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week6/lecture6b/data")
# Step 1: Loading the data
life_exp_urb = pd.read_csv('./life_exp_urb.csv')
# Step 2: Examining the first three entries
life_exp_urb.head(3)
        Entity  life_exp_mean   urb_mean             type
0  Afghanistan      45.383333  18.611754  Everything Else
1      Albania      68.286111  40.444164  Everything Else
2      Algeria      57.530133  52.609213  Everything Else

The Data

In principle, we can just present basic information about the data:

# Step 1: Subset the DataFrame
df = life_exp_urb[['life_exp_mean', 'urb_mean']].copy()
# Rename the columns
df.columns = ['life_exp', 'urb']
# Display the first 10 entries
print(df.head(10))
    life_exp        urb
0  45.383333  18.611754
1  68.286111  40.444164
2  57.530133  52.609213
3  68.637500  79.762491
4  77.048611  87.043017
5  45.084658  37.539705
6  69.440278        NaN
7  71.209722  32.350656
8  65.408046  85.282148
9  67.156944  63.111394
df['life_exp'].mean(skipna=True)
61.93415901715886
df['urb'].mean(skipna=True)
51.3651830718757
df['urb'].corr(df['life_exp'])
0.6301940464135396

All of these ways of examining the data are reasonable. The correlation of about 0.63 indicates a positive association between urbanization and life expectancy.
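The summary calls above quietly handle the missing urbanization value in row 6. The following is a minimal sketch of that behavior, assuming only the ten rows printed above rather than the full dataset, so its numbers will differ from the full-data results. The names `sample` and `complete` are introduced here for illustration. It checks that `mean` skips missing values by default and that `corr` uses only the rows where both columns are present.

```python
import numpy as np
import pandas as pd

# The ten rows printed above; this is NOT the full dataset.
sample = pd.DataFrame({
    "life_exp": [45.383333, 68.286111, 57.530133, 68.637500, 77.048611,
                 45.084658, 69.440278, 71.209722, 65.408046, 67.156944],
    "urb": [18.611754, 40.444164, 52.609213, 79.762491, 87.043017,
            37.539705, np.nan, 32.350656, 85.282148, 63.111394],
})

# count() reports the number of non-missing values per column
n_obs = sample.count()

# mean() skips NaN by default (skipna=True), so these two agree
urb_mean_default = sample["urb"].mean()
urb_mean_dropna = sample["urb"].dropna().mean()

# Series.corr() excludes rows where either value is missing,
# which matches computing the correlation on complete rows only
corr_default = sample["urb"].corr(sample["life_exp"])
complete = sample.dropna()
corr_complete = complete["urb"].corr(complete["life_exp"])

print(n_obs)
print(urb_mean_default, corr_default)
```

Reporting the number of observations alongside a mean or correlation tells the reader how many rows actually entered the calculation.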

Visualizing the Relationship

But the following scatterplot is more compelling.

library(reticulate)
df2 <-  reticulate::py$df

library(ggplot2)
ggplot(data = df2, 
  mapping = aes(x=urb, y=life_exp))+
  geom_point()
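For readers who prefer to stay in Python, a comparable scatterplot can be sketched with matplotlib. This is an illustrative alternative, not the code used to produce the slide's figure. The helper `scatter_urb_life` and the three-row `demo` frame (the first three rows shown earlier) are assumptions introduced here. In practice the full `df` would be passed instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def scatter_urb_life(data):
    """Plot life expectancy against urbanization (hypothetical helper)."""
    # Drop rows with a missing value in either plotted column
    plot_data = data.dropna(subset=["urb", "life_exp"])
    fig, ax = plt.subplots()
    ax.scatter(plot_data["urb"], plot_data["life_exp"])
    ax.set_xlabel("urb")
    ax.set_ylabel("life_exp")
    return fig, ax


# First three rows of the data shown earlier, for illustration only
demo = pd.DataFrame({
    "life_exp": [45.383333, 68.286111, 57.530133],
    "urb": [18.611754, 40.444164, 52.609213],
})
fig, ax = scatter_urb_life(demo)
```

Calling `fig.savefig(...)` with a file name of your choice writes the figure to disk. Returning both `fig` and `ax` leaves room to adjust labels or styling afterwards.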

Visualization

Good visualizations are:

  • truthful
  • functional
  • beautiful
  • insightful
  • enlightening

Visualization

“Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency.[…] [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space… And graphical excellence requires telling the truth about the data.”

Edward Tufte, The Visual Display of Quantitative Information, p. 51

Bad Visualizations 1