L11: Data Visualization

Bogdan G. Popescu

John Cabot University

Introduction

Visualization in R is about presenting data in an aesthetically pleasing way

More importantly, visualization is about presenting the data in a truthful way

Finding the Truth

One good way to tell if something is true or not is by using science

Truth vs. Art

However, data visualization can affect perceptions of what is scientific

The scientist can consciously or unconsciously make particular style choices in a graph that can affect such perceptions

Thus, perception of science can sometimes be impacted by aesthetic choices and the way content is presented

Duties of a Data Analyst

  • Not manipulate data and thus, not lie
  • Not misrepresent data
  • Find the right balance between dumbing down and sounding too esoteric when communicating results
    • You are ultimately a translator
  • Emphasize the story and make data available

The Data

At the most basic level, we can present snippets of the data or basic information about the data

To demonstrate how we can present data, let us download: Life expectancy and Urbanization Data

Once you load it into Python, the dataframe looks like the following:

Python
import os
import pandas as pd
# Step 0: Setting the working directory
os.chdir("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week6/lecture6b/data")
# Step 1: Loading the data
life_exp_urb = pd.read_csv('./life_exp_urb.csv')
# Step 2: Examining the first three entries
life_exp_urb.head(3)
        Entity  life_exp_mean   urb_mean             type
0  Afghanistan      45.383333  18.611754  Everything Else
1      Albania      68.286111  40.444164  Everything Else
2      Algeria      57.530133  52.609213  Everything Else

The Data

In principle, we can just present basic information about the data:

Python
# Step 1: Subset the DataFrame
df = life_exp_urb[['life_exp_mean', 'urb_mean']].copy()
# Rename the columns
df.columns = ['life_exp', 'urb']
# Display the first 10 entries
print(df.head(10))
    life_exp        urb
0  45.383333  18.611754
1  68.286111  40.444164
2  57.530133  52.609213
3  68.637500  79.762491
4  77.048611  87.043017
5  45.084658  37.539705
6  69.440278        NaN
7  71.209722  32.350656
8  65.408046  85.282148
9  67.156944  63.111394
Python
df['life_exp'].mean(skipna=True)
61.93415901715886
Python
df['urb'].mean(skipna=True)
51.3651830718757
Python
df['urb'].corr(df['life_exp'])
0.6301940464135396

All of these ways of examining the data are reasonable. They show a positive correlation (about 0.63) between urbanization and life expectancy.
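
The table above has a missing `urb` value in row 6. The following is a minimal sketch of how `corr` treats it, assuming pandas' default Pearson method: the incomplete pair is dropped and the coefficient is computed on the remaining rows. The name `df_demo` is introduced only for illustration and holds just the ten rows printed above, so its correlation will differ from the full-data value.

```python
import numpy as np
import pandas as pd

# First ten rows of the table shown above (row 6 has a missing `urb` value)
df_demo = pd.DataFrame({
    "life_exp": [45.383333, 68.286111, 57.530133, 68.637500, 77.048611,
                 45.084658, 69.440278, 71.209722, 65.408046, 67.156944],
    "urb": [18.611754, 40.444164, 52.609213, 79.762491, 87.043017,
            37.539705, np.nan, 32.350656, 85.282148, 63.111394],
})

# pandas excludes incomplete pairs before computing the Pearson correlation
r_pandas = df_demo["urb"].corr(df_demo["life_exp"])

# The same computation by hand, using only the complete rows
complete = df_demo.dropna()
x = complete["urb"] - complete["urb"].mean()
y = complete["life_exp"] - complete["life_exp"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(round(r_pandas, 4), round(r_manual, 4))
```

Both numbers should agree, which shows that no separate `dropna()` step is needed before calling `corr`.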

Visualizing the Relationship

We will now make the transition to R for creating plots.

This is the dataframe in Python:

Python
df

We can turn the Python dataframe into an R dataframe in the following way:

R
library(reticulate)
df2 <- reticulate::py$df

Installing Packages in R

To install libraries in R, run something like the following in the “Console” tab

R
install.packages("ggplot2")

Make sure you delete install.packages("ggplot2") after you have installed the package.

Visualizing the Relationship

This is how we create a scatterplot in R:

R
library(ggplot2)
ggplot(data = df2, 
  mapping = aes(x=urb, y=life_exp))+
  geom_point()

Visualization

Good visualizations are:

  • truthful
  • functional
  • beautiful
  • insightful
  • enlightening

Visualization

“Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency.[…] [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space… And graphical excellence requires telling the truth about the data.”

Edward Tufte, The Visual Display of Quantitative Information, p. 51

Bad Visualizations 1

The figure has way too many categories

Bad Visualizations 2

The categories in a pie chart should add up to 100%

The areas should correspond to the percentage size
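
The two rules above can be sketched as a small check, under the assumption that shares are given as percentages: the function refuses input that does not total 100% (within a rounding tolerance of 0.5, chosen here for illustration) and returns slice angles proportional to each share. The function name `pie_angles` and the survey numbers are hypothetical.

```python
import math

def pie_angles(shares):
    """Return the slice angle in degrees for each category.

    `shares` maps category name -> percentage; values must total 100.
    """
    total = sum(shares.values())
    if not math.isclose(total, 100.0, abs_tol=0.5):
        raise ValueError(f"shares add up to {total}%, not 100%")
    # Slice area is proportional to its angle, so angle = 360 * share / 100
    return {name: 360.0 * pct / 100.0 for name, pct in shares.items()}

# Hypothetical survey results, invented for illustration
good = {"Yes": 55.0, "No": 30.0, "Undecided": 15.0}
angles = pie_angles(good)
print(angles)
```

Calling `pie_angles({"A": 60.0, "B": 70.0})` raises a `ValueError`, which is the kind of error the bad pie charts in these slides make.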

Bad Visualizations 3

The categories in a pie chart should add up to 100%

Bad Visualizations 4

Bad Visualizations 5

Goals of Visualization

Python has a variety of libraries meant for data visualization:

  • matplotlib
  • seaborn
  • plotnine

Plotnine is useful because it brings the grammar and syntax of ggplot2, a library native to R, to Python.

ggplot2 allows us to create powerful visualizations and has better data handling abilities.

Check out this article comparing Matplotlib and ggplot2

Visualizations

Part of visualization is how we translate essential content to different forms for specific audiences

Visualizations should ultimately tell stories

Truth comes from a combination of content and form.

Colors

On the most fundamental level, we need to use the right colors for our visualizations

This is relevant for:

  • Clarity and Readability: users can distinguish among different categories
  • Accessibility: color-blind people can also see the different categories in your visualization
  • Emphasis: the right colors can be used to emphasize specific aspects of the data or analysis
  • Consistency: using a consistent color palette for the same project is helpful

Adobe Color

One excellent free source for color choice is https://color.adobe.com


Adobe Color

With Adobe Color you can:

  • Create color themes based on color theory
  • Extract themes & gradients from pictures
  • Create Accessible themes for color-blind audiences

Color Contrast

It is important to use colors with high contrast

The contrast ratio calculator from Adobe’s Create Accessible themes can be helpful
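
For readers who want to see what a contrast ratio is, the following is a minimal sketch of the WCAG 2.x formula, using only the standard library. Whether Adobe's calculator uses exactly this definition is an assumption here; the function names are chosen for illustration.

```python
def _linear(channel):
    """Convert one 0-255 sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(hex_color):
    """Relative luminance of a color given as '#RRGGBB'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear(r) + 0.7152 * _linear(g) + 0.0722 * _linear(b)

def contrast_ratio(color_a, color_b):
    """Contrast ratio between two colors, from 1 (identical) up to 21."""
    la, lb = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)

print(round(contrast_ratio("#000000", "#FFFFFF"), 2))  # black on white: 21.0
print(round(contrast_ratio("#777777", "#FFFFFF"), 2))  # mid-gray on white
```

WCAG level AA asks for a ratio of at least 4.5:1 for normal-size text, so a pair of colors can be checked against that threshold before it goes into a figure.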

Color Contrast Example

Here is a low color contrast

Color Contrast Example

Here is a high color contrast

Color

About 8% of men and 0.5% of women have some form of color blindness

Thus, colors should be distinguishable by people with different forms of color blindness

Color Contrast

The viridis palettes in R allow us to create color-blind friendly graphs

These are predefined palettes that are widely used.
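
The same palette is also available on the Python side. The sketch below assumes matplotlib 3.5 or later (for the `matplotlib.colormaps` registry) and samples five evenly spaced colors from its built-in `viridis` colormap as hex codes; the choice of five is arbitrary.

```python
import matplotlib
from matplotlib.colors import to_hex

# Look up the built-in viridis colormap
cmap = matplotlib.colormaps["viridis"]

# Sample n evenly spaced colors from dark purple (0.0) to yellow (1.0)
n = 5
colors = [to_hex(cmap(i / (n - 1))) for i in range(n)]
print(colors)
```

The resulting hex codes can be passed to any plotting library that accepts manual color values.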

Color-blind Not Friendly

Color-blind Friendly

Color Contrast

This is the difference that this makes

R
ggplot() +
  geom_sf(data = merged, aes(fill = life_exp_mean))