Previously, we covered a lot of ground:
Before we use ggplot, our data has to be tidy.
This means each variable has its own column, each observation has its own row, and each value has its own cell.
People are better at seeing height differences than angle and area differences
This is how to obtain the same plots.
For example, let’s say we want to compare life expectancy in Latin America with the EU.
#Loading packages (ggplot2 is needed for the plots further down)
library("dplyr")
library("ggplot2")
#Setting path
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week4/data/")
#Step1: Loading the data
life_expectancy <- read.csv(file = './life-expectancy.csv')
urbanization <- read.csv(file = './share-of-population-urban.csv')
#Step2: Removing countries with no 3-letter code
life_expectancy2<-subset(life_expectancy, Code!="")
#Step3: Changing variable name
names(life_expectancy2)[names(life_expectancy2)=="Life.expectancy.at.birth..historical."]<-"life_exp"
#Step4: Selecting only vars of interest
life_expectancy3<-subset(life_expectancy2, select=c("Entity", "Code", "Year", "life_exp"))
#Step5: Removing countries with no 3-letter code
urbanization2<-subset(urbanization, Code!="")
#Step6: Changing variable name
names(urbanization2)[names(urbanization2)=="Urban.population....of.total.population."]<-"urb"
#Step7: Selecting only vars of interest
urbanization3<-subset(urbanization2, select=c("Code", "Year", "urb"))
#Step8: Performing a merge
final<-left_join(life_expectancy3, urbanization3, by = c("Code"="Code",
"Year"="Year"))
#Step9: Removing NAs
final2<-final[complete.cases(final), ]
#Step10: Defining continents
#EU Countries
eu_countries<-c("Austria",
"Belgium",
"Bulgaria",
"Croatia",
"Cyprus",
"Czechia",
"Denmark",
"Estonia",
"Finland",
"France",
"Germany",
"Greece",
"Hungary",
"Ireland",
"Italy",
"Latvia",
"Lithuania",
"Luxembourg",
"Malta",
"Netherlands",
"Poland",
"Portugal",
"Romania",
"Slovakia",
"Slovenia",
"Spain",
"Sweden")
latam_countries<-c("Belize",
"Costa Rica",
"El Salvador",
"Guatemala",
"Honduras",
"Mexico",
"Nicaragua",
"Panama",
"Argentina",
"Bolivia",
"Brazil",
"Chile",
"Colombia",
"Ecuador",
"Guyana",
"Paraguay",
"Peru",
"Suriname",
"Uruguay",
"Venezuela",
"Cuba",
"Dominican Republic",
"Haiti")
#Step11: Labeling continents
final2$continent[final2$Entity %in% eu_countries]<-"EU"
final2$continent[final2$Entity %in% latam_countries]<-"Latin America"
final2$continent[is.na(final2$continent)]<-"Everything Else"
For example, let’s say we want to compare life expectancy in Latin America, the EU, and the rest of the world.
A more compelling way is: boxplots with points
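A minimal sketch of a boxplot with points, assuming ggplot2 is installed. The data frame `toy` is a simulated stand-in for `final2` (random values, not real life expectancy data), so the block runs on its own; with the real data, `toy` would be replaced by `final2`.

```r
library(ggplot2)

# Simulated stand-in for final2 (hypothetical values, not real data)
set.seed(1)
toy <- data.frame(
  continent = rep(c("EU", "Latin America", "Everything Else"), each = 100),
  life_exp  = c(rnorm(100, 75, 5), rnorm(100, 68, 7), rnorm(100, 62, 10))
)

p_box <- ggplot(toy, aes(x = continent, y = life_exp, fill = continent)) +
  # Hide the boxplot's own outlier markers so outliers are not drawn twice
  geom_boxplot(outlier.shape = NA, alpha = 0.5) +
  # Jitter horizontally only, so the y values stay exact
  geom_jitter(width = 0.15, height = 0, alpha = 0.4, size = 0.8) +
  guides(fill = "none") + # Turn off legend
  theme_bw()

p_box
```

The points show how many observations sit behind each box, which the box alone hides.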
Another way is to combine violin with points
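A minimal sketch of a violin plot with points, assuming ggplot2 is installed. As before, `toy` is a simulated stand-in for `final2`, not the real data.

```r
library(ggplot2)

# Simulated stand-in for final2 (hypothetical values, not real data)
set.seed(1)
toy <- data.frame(
  continent = rep(c("EU", "Latin America", "Everything Else"), each = 100),
  life_exp  = c(rnorm(100, 75, 5), rnorm(100, 68, 7), rnorm(100, 62, 10))
)

p_violin <- ggplot(toy, aes(x = continent, y = life_exp, fill = continent)) +
  # The violin outline is a mirrored density estimate for each group
  geom_violin(alpha = 0.5) +
  # Narrow horizontal jitter keeps the points inside the violin
  geom_jitter(width = 0.1, height = 0, alpha = 0.4, size = 0.8) +
  guides(fill = "none") + # Turn off legend
  theme_bw()

p_violin
```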
Another way is to have overlapping ridgeplots
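One way to draw overlapping ridgeplots is the ggridges package; this is an assumption, since the slides do not say which package produced the plot. A minimal sketch, with `toy` again a simulated stand-in for `final2`:

```r
library(ggplot2)
library(ggridges)

# Simulated stand-in for final2 (hypothetical values, not real data)
set.seed(1)
toy <- data.frame(
  continent = rep(c("EU", "Latin America", "Everything Else"), each = 100),
  life_exp  = c(rnorm(100, 75, 5), rnorm(100, 68, 7), rnorm(100, 62, 10))
)

p_ridges <- ggplot(toy, aes(x = life_exp, y = continent, fill = continent)) +
  # scale > 1 lets neighbouring ridges overlap; alpha keeps them readable
  geom_density_ridges(scale = 1.5, alpha = 0.6) +
  guides(fill = "none") + # Turn off legend
  theme_bw()

p_ridges
```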
Another way is to have all of them superimposed
As discussed, it is good to add more information to your graphs so that they display the whole distribution of values.
For example, the plot on the right is better than the one on the left.
This is the code for the left and right plots.
It could also be helpful to play with the binwidth: binwidth = 2
It could also be helpful to play with the binwidth: binwidth = 10
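A minimal sketch of the effect of the binwidth, assuming ggplot2 is installed. The `toy` data are simulated (two hypothetical clusters of values), not the real `final2`.

```r
library(ggplot2)

# Simulated stand-in for life_exp: two clusters of hypothetical values
set.seed(1)
toy <- data.frame(life_exp = c(rnorm(300, 60, 10), rnorm(300, 78, 4)))

# Narrow bins: more bars, more detail, more noise
p_narrow <- ggplot(toy, aes(x = life_exp)) +
  geom_histogram(binwidth = 2, color = "white") +
  theme_bw()

# Wide bins: fewer bars, smoother shape, detail can be lost
p_wide <- ggplot(toy, aes(x = life_exp)) +
  geom_histogram(binwidth = 10, color = "white") +
  theme_bw()

# layer_data() returns the computed bars, one row per bin
bars_narrow <- nrow(layer_data(p_narrow))
bars_wide   <- nrow(layer_data(p_wide))

p_narrow
p_wide
```

Both plots count the same observations; only the grouping changes, so it is worth trying several values before settling on one.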
We can obtain something similar with densities: they are a smoothed version of the histogram.
The difference is that the y-axis of one shows counts and the other shows density.
The second shows the probability density function (PDF) of the variable: the area under the curve over a range of x values (found by integration) gives the probability of falling in that range.
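The relation between counts and density can be checked numerically. The sketch below uses base R only; `x` is a simulated stand-in for `life_exp`, and `trapz` is a small helper defined here for illustration.

```r
# Simulated stand-in for life_exp (hypothetical values, not real data)
set.seed(1)
x <- rnorm(1000, mean = 70, sd = 8)

# Histogram with bins of width 2 that cover the whole range of x
binwidth <- 2
breaks <- seq(floor(min(x) / binwidth) * binwidth,
              ceiling(max(x) / binwidth) * binwidth,
              by = binwidth)
h <- hist(x, breaks = breaks, plot = FALSE)

# Density of a bar = count / (n * binwidth), so the bar areas sum to 1
same_scale <- all.equal(h$density, h$counts / (length(x) * binwidth))
hist_area  <- sum(h$density * binwidth)

# Trapezoid rule: area under a curve given as (x, y) points
trapz <- function(px, py) sum(diff(px) * (head(py, -1) + tail(py, -1)) / 2)

# Kernel density estimate: a smoothed curve whose total area is about 1
d <- density(x)
total_area <- trapz(d$x, d$y)

# Probability of a range = area under the curve over that range
in_range  <- d$x >= 60 & d$x <= 70
p_density <- trapz(d$x[in_range], d$y[in_range])
p_observed <- mean(x >= 60 & x <= 70) # share of values actually in the range
```

Because density equals count / (n * binwidth), multiplying a density by n * binwidth puts it back on the count scale.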
We can obviously also plot them together
ggplot(final2, aes(x = life_exp,
                   fill = continent)) +
  geom_histogram(binwidth = 2,
                 color = "white") +
  #scale the density to a similar scale to the histogram:
  #in this case, I multiply by 4000
  #note also aes(y = after_stat(density) * 4000);
  #after_stat() replaces the deprecated ..density.. notation
  geom_density(aes(y = after_stat(density) * 4000),
               alpha = 0.5) +
  guides(fill = "none") + # Turn off legend
  facet_wrap(vars(continent)) +
  theme_bw() +
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis =
                       sec_axis(~.x/4000,
                                name = "density"))
Having a closer look at the code
ggplot(final2, aes(x = life_exp,
                   fill = continent)) +
  #geom_histogram with the same parameters
  geom_histogram(binwidth = 2,
                 color = "white") +
  #note the aes(y = after_stat(density) * 4000);
  #after_stat() replaces the deprecated ..density.. notation
  #scale the density to a similar scale to the histogram
  #in this case, I multiply by 4000
  #otherwise it will not be visible
  #the proportion of values in a bar is
  #count / sum(count): e.g. 385/1403
  #density divides that proportion by the binwidth,
  #so that the total area is 1
  geom_density(aes(y = after_stat(density) * 4000),
               alpha = 0.5) +
  guides(fill = "none") + # Turn off legend
  facet_wrap(vars(continent)) +
  theme_bw() +
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis =
                       sec_axis(~.x/4000,
                                name = "density"))
A histogram shows the counts of values in each range
It is made up of bars that touch each other
A density plot shows the proportion of values in each range
It is a smooth curve that shows the distribution of the data in a more continuous way
Here is a boxplot for life expectancy for the entire dataset
Here is the interpretation
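The parts of a boxplot can be computed by hand. The sketch below uses base R and the convention that `geom_boxplot` documents: the box spans the first to third quartile, and the whiskers reach the most extreme values within 1.5 times the IQR of the box. The vector `x` is simulated, with two low values added on purpose so that there are outliers to find.

```r
# Simulated stand-in for life_exp; 30 and 35 are added as deliberate outliers
set.seed(1)
x <- c(rnorm(500, mean = 70, sd = 8), 30, 35)

# The box: first quartile, median, third quartile
q <- quantile(x, c(0.25, 0.5, 0.75))
iqr <- IQR(x) # width of the box, q[3] - q[1]

# Fences: 1.5 * IQR beyond each end of the box
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[3]] + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

# Anything beyond the fences is drawn as an individual point
outliers <- x[x < lower_fence | x > upper_fence]
```

Half of the values lie inside the box, and the line inside it marks the median.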
This is what the histogram looks like:
This is what the density function looks like:
This is what the actual values look like:
And this is what they look like together: