Data Visualization 1

Bogdan G. Popescu

John Cabot University

Intro

Visualization in R is about presenting data in an aesthetically pleasing way

More importantly, visualization is about presenting the data in a truthful way

Finding the Truth

One good way to tell whether something is true is to use science

Truth vs. Art

However, data visualization can affect perceptions of what is scientific

The scientist can consciously or unconsciously make style choices in a graph that affect such perceptions

Thus, perception of science can sometimes be impacted by aesthetic choices and the way content is presented

Duties of a Data Analyst

  • Do not manipulate data and, thus, do not lie
  • Do not misrepresent data
  • Find the right balance between dumbing down and sounding too esoteric when communicating results
    • You are ultimately a translator
  • Emphasize the story and make data available

The Data

At the most basic level, we can present snippets of the data or basic information about the data

To demonstrate how we can present data, let us download: Life expectancy and Urbanization Data

Once loaded into R, the data frame looks like this:

setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week4/data/")
#Step1: Loading the data
life_exp_urb <- read.csv(file = './life_exp_urb.csv')
#Step2: Examining the first five entries
head(life_exp_urb, n=5)
          Entity life_exp_mean urb_mean            type
1    Afghanistan      45.38333 18.61175 Everything Else
2        Albania      68.28611 40.44416 Everything Else
3        Algeria      57.53013 52.60921 Everything Else
4 American Samoa      68.63750 79.76249 Everything Else
5        Andorra      77.04861 87.04302 Everything Else

The Data

In principle, we can just present basic information about the data:

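# Keep only the two variables of interest
df <- subset(life_exp_urb, select = c(life_exp_mean, urb_mean))
# Use shorter column names
names(df) <- c("life_exp", "urb")
# Show the first ten rows
head(df, n = 10)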
# A tibble: 10 × 2
   life_exp   urb
      <dbl> <dbl>
 1     45.4  18.6
 2     68.3  40.4
 3     57.5  52.6
 4     68.6  79.8
 5     77.0  87.0
 6     45.1  37.5
 7     69.4  NA  
 8     71.2  32.4
 9     65.4  85.3
10     67.2  63.1
mean(df$life_exp, na.rm=TRUE)
[1] 61.93416
mean(df$urb, na.rm=TRUE)
[1] 51.36518
cor(df$urb, df$life_exp, use = "complete.obs")
[1] 0.630194

This is reasonable.

There is a positive correlation (about 0.63) between urbanization and life expectancy.
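The summary statistics above rely on na.rm = TRUE and use = "complete.obs" because the data contain missing values (row 7 has no urb value). The following is a minimal sketch of what those arguments do, using only the ten rows printed above rather than the full dataset; the names toy, keep, r_complete, and r_manual are made up for this illustration.

```r
# Toy data: the ten rows printed above (not the full dataset)
toy <- data.frame(
  life_exp = c(45.4, 68.3, 57.5, 68.6, 77.0, 45.1, 69.4, 71.2, 65.4, 67.2),
  urb      = c(18.6, 40.4, 52.6, 79.8, 87.0, 37.5, NA, 32.4, 85.3, 63.1)
)

# Without na.rm = TRUE, a single NA makes the whole mean NA
mean_with_na <- mean(toy$urb)
mean_without_na <- mean(toy$urb, na.rm = TRUE)

# use = "complete.obs" drops every row where either variable is missing
r_complete <- cor(toy$urb, toy$life_exp, use = "complete.obs")

# The same result, obtained by filtering the complete rows by hand
keep <- complete.cases(toy)
r_manual <- cor(toy$urb[keep], toy$life_exp[keep])

print(mean_with_na)
print(mean_without_na)
print(all.equal(r_complete, r_manual))
```

Because the toy data are only a ten-row excerpt, the values it produces will differ from the full-data results shown above.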

Visualizing the Relationship

But the following scatterplot is more compelling.

library(ggplot2)
ggplot(data = df, 
  mapping = aes(x=urb, y=life_exp))+
  geom_point()
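One possible refinement of the scatterplot, sketched here on the ten rows printed earlier rather than the full dataset, is to label the axes and overlay a linear fit. The axis labels are assumptions: the slides do not state the units, so "percent urban" and "years" are guesses based on the value ranges. The objects toy and p are hypothetical names.

```r
library(ggplot2)

# Toy data: the ten rows printed earlier (not the full dataset)
toy <- data.frame(
  life_exp = c(45.4, 68.3, 57.5, 68.6, 77.0, 45.1, 69.4, 71.2, 65.4, 67.2),
  urb      = c(18.6, 40.4, 52.6, 79.8, 87.0, 37.5, NA, 32.4, 85.3, 63.1)
)

p <- ggplot(data = toy, mapping = aes(x = urb, y = life_exp)) +
  # na.rm = TRUE silently drops the row with a missing urb value
  geom_point(na.rm = TRUE) +
  # Straight-line fit without a confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE) +
  # Assumed units; adjust to match the actual data documentation
  labs(x = "Urbanization (assumed: % urban population)",
       y = "Life expectancy (assumed: years)")

print(p)
```

Swapping toy for df would apply the same layers to the full data.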

Visualization

Good visualizations are:

  • truthful
  • functional
  • beautiful
  • insightful
  • enlightening

Visualization

“Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency.[…] [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space… And graphical excellence requires telling the truth about the data.”

Edward Tufte, The Visual Display of Quantitative Information, p. 51

Bad Visualizations 1