Previously, we covered quite some ground:
Before we use ggplot, our data has to be tidy
This means:
People are better at seeing height differences than angle and area differences
People are better at seeing height differences than angle and area differences.
This how to obtain the same plots.
People are better at seeing height differences than angle and area differences
This how to obtain the same plots.
For example, let’s say we want to compare life expectancy in Latin America with EU
#Setting path
library("dplyr")
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week4/data/")
#Step1: Loading the data
life_expectancy <- read.csv(file = './life-expectancy.csv')
urbanization <- read.csv(file = './share-of-population-urban.csv')
#Step2: Removing countries with no 3-letter code
life_expectancy2<-subset(life_expectancy, Code!="")
#Step3: Changing variable name
names(life_expectancy2)[names(life_expectancy2)=="Life.expectancy.at.birth..historical."]<-"life_exp"
#Step4: Selecting only vars of interest
life_expectancy3<-subset(life_expectancy2, selec=c("Entity", "Code", "Year", "life_exp"))
#Step5: Removing countries with no 3-letter code
urbanization2<-subset(urbanization, Code!="")
#Step6: Changing variable name
names(urbanization2)[names(urbanization2)=="Urban.population....of.total.population."]<-"urb"
#Step7: Selecting only vars of interest
urbanization3<-subset(urbanization2, selec=c("Code", "Year", "urb"))
#Step8: Performing a merge
final<-left_join(life_expectancy3, urbanization3, by = c("Code"="Code",
"Year"="Year"))
#Step9: Removing NAs
final2<-final[complete.cases(final), ]
#Step10: Defining continents
#EU Countries
eu_countries<-c("Austria",
"Belgium",
"Bulgaria",
"Croatia",
"Cyprus",
"Czechia",
"Denmark",
"Estonia",
"Finland",
"France",
"Germany",
"Greece",
"Hungary",
"Ireland",
"Italy",
"Latvia",
"Lithuania",
"Luxembourg",
"Malta",
"Netherlands",
"Poland",
"Portugal",
"Romania",
"Slovakia",
"Slovenia",
"Spain",
"Sweden")
latam_countries<-c("Belize",
"Costa Rica",
"El Salvador",
"Guatemala",
"Honduras",
"Mexico",
"Nicaragua",
"Panama",
"Argentina",
"Bolivia",
"Brazil",
"Chile",
"Colombia",
"Ecuador",
"Guyana",
"Paraguay",
"Peru",
"Suriname",
"Uruguay",
"Venezuela",
"Cuba",
"Dominican Republic",
"Haiti")
#Step11: Labeling continents
final2$continent[final2$Entity %in% eu_countries]<-"EU"
final2$continent[final2$Entity %in% latam_countries]<-"Latin America"
final2$continent[is.na(final2$continent)]<-"Everything Else"
For example, let’s say we want to compare life expectancy in Latin America with EU
For example, let’s say we want to compare life expectancy in Latin America, EU, and the Rest of the World
A more compelling way is: boxplots with points
Another way is to combine violin with points
Another way is to have overlapping ridgeplots
Another way is to have all of them superimposed
As discussed it is good to add more information to your graphs to display the whole distribution of numbers
For example, the right is better than the left
This is the code for left and right
It could also be helpful to play with the binwidth: binwidth = 2
It could also be helpful to play with the binwidth: binwidth = 10
We can obtain something similar with densities: they are a smoothed version of the histogram.
The difference is that one should count and the other one density.
The second shows the probability density function (PDF) of the variable: use calculus to find the probability of each x value
We can obviously also plot them together
ggplot(final2, aes(x = life_exp,
fill = continent)) +
geom_histogram(binwidth = 2,
color = "white") +
#scale the density to a similar scale to the histogram:
#in this case, I multiply by 4000
#note also aes(y = ..density..* 4000)
geom_density(aes(y = ..density..* 4000),
alpha = 0.5)+
guides(fill = "none") + # Turn off legend
facet_wrap(vars(continent))+
theme_bw()+
#Adding a secondary axis
scale_y_continuous(name = "count",
sec.axis =
sec_axis(~.x/4000,
name = "density"))
Having a closer look at the code
ggplot(final2, aes(x = life_exp,
fill = continent)) +
#geom_histogram with the same parameters
geom_histogram(binwidth = 2,
color = "white") +
#note the aes(y = ..density..* 4000)
#scale the density to a similar scale to the histogram
#in this case, I multiply by 4000
#otherwise it will not be visible
#density calculates relative frequency:
#count / sum(count): e.g. 385/1403
geom_density(aes(y = ..density..* 4000),
alpha = 0.5)+
guides(fill = "none") + # Turn off legend
facet_wrap(vars(continent))+
theme_bw()+
#Adding a secondary axis
scale_y_continuous(name = "count",
sec.axis =
sec_axis(~.x/4000,
name = "density"))
A histogram shows the counts of values in each range
It is made up of bars that touch each other
A density plot shows the proportion of values in each range
It is a smooth curve that shows the distribution of the data in a more continuous way
Here is a boxlot for life expectancy for the entire dataset
Here is the interpretation
This is what the histogram look like:
This is what the density function looks like:
This is what the actual values look like:
And this is what they look like together:
We have already covered this, but here is the relationship between urbanization and life expectancy
We have already covered this, but here is the relationship between urbanization and life expectancy
And here are the slopes for the three groups
And here are the confidence intervals for the three slopes
We also need to discuss text in plots
On the most basic level, we can label actual data points with the help of options such as:
geom_text()
geom_label()
geom_text_repel()
We can also add annotations with annotate()
geom_text
is not great when we have too many points
geom_label
is not great when we have too many points. It covers the points
ggrepel
is one solution: geom_text_repel
ggrepel
is one solution: geom_label_repel
We can highlight specific observations in our data
#Creating a Variable to Identify Obs
life_exp_urb$eu<-ifelse(life_exp_urb$type=="EU", 1, 0)
#Making the variable a factor
life_exp_urb$eu<-as.factor(life_exp_urb$eu)
#Plotting
ggplot(data = life_exp_urb,
mapping = aes(x=urb_mean,
y=life_exp_mean,
color = eu))+
geom_point()+
scale_color_manual(values = c("grey70",
"red"))+
guides(color = "none")
This is how we can add text in the graph
This is how we can add a rectangle around our points of interest
xmin_val<-min(life_exp_urb$urb_mean[life_exp_urb$eu==1])
xmax_val<-max(life_exp_urb$urb_mean[life_exp_urb$eu==1])
ymin_val<-min(life_exp_urb$life_exp_mean[life_exp_urb$eu==1])
ymax_val<-max(life_exp_urb$life_exp_mean[life_exp_urb$eu==1])
ggplot(data = life_exp_urb,
mapping = aes(x=urb_mean,
y=life_exp_mean,
color = eu))+
geom_point()+
scale_color_manual(values = c("grey70",
"red"))+
guides(color = "none")+
annotate(geom = "text",
x = 75, y = 45,
label = "EU Countries", color="red")+
annotate(geom = "rect",
xmin = xmin_val,
xmax = xmax_val,
ymin = ymin_val,
ymax = ymax_val,
fill = "red", alpha = 0.2)
This is how we can add an arrow towards the rectangle
ggplot(data = life_exp_urb,
mapping = aes(x=urb_mean,
y=life_exp_mean,
color = eu))+
geom_point()+
scale_color_manual(values = c("grey70",
"red"))+
guides(color = "none")+
annotate(geom = "text",
x = 75, y = 45,
label = "EU Countries", color="red")+
annotate(geom = "rect",
xmin = xmin_val,
xmax = xmax_val,
ymin = ymin_val,
ymax = ymax_val,
fill = "red", alpha = 0.2)+
annotate(geom = "segment",
x = 75,
xend = 75,
y = 45,
yend = ymin_val,
arrow = arrow(
length = unit(0.1, "in")))
Or let’s turn the text into a label
ggplot(data = life_exp_urb,
mapping = aes(x=urb_mean,
y=life_exp_mean,
color = eu))+
geom_point()+
scale_color_manual(values = c("grey70",
"red"))+
guides(color = "none")+
annotate(geom = "label",
x = 75,
y = 45,
label = "EU Countries",
color="red")+
annotate(geom = "rect",
xmin = xmin_val,
xmax = xmax_val,
ymin = ymin_val,
ymax = ymax_val,
fill = "red", alpha = 0.2)+
annotate(geom = "segment",
x = 75,
xend = 75,
y = 45,
yend = ymin_val,
arrow = arrow(
length = unit(0.1, "in")))
You may also want to plot temporal data
To do that we need to load the life expectancy data over time
Let us look again at the data
Let us imagine that we want to plot life expectancy over time
We first need to calculate average by year
Let us look at the transformed data.
We are now ready to plot the time trend
We can also produce an animation
#Note: you should install:
#install.packages("av")
#install.packages("magick")
#install.packages("gifski")
#install.packages("gganimate")
library(gganimate)
ggplot(life_expectancy2, aes(x = Year, y = life_exp_mean)) +
geom_line()+
labs(title = 'Year: {frame_along}',
x = 'Urbanization',
y = 'Life Expectancy') +
transition_reveal(Year)
To view the upward sloping trend in life expectancy, it might be worth aggregating some of the years.
Let us imagine that we want to compare Italy and the US in life expectancy
This means we go back to the original file and select the US and Italy
#Subsetting the data
life_expectancy_itus<-subset(life_expectancy, Entity %in% c("Italy", "United States"))
#Selecting only after 1900
life_expectancy_itus<-subset(life_expectancy_itus, Year>1899)
#Renaming life expectancy variable
names(life_expectancy_itus)[names(life_expectancy_itus)=="Life.expectancy.at.birth..historical."]<-"life_exp"
Let’s examine the data
We can now plot life expectancy for the two countries over time
Changing the starting value of our y-axis can change the appearance of our data
Changing the starting value of our y-axis can change the appearance of our data
Annotations work very well with temporal data
Another cool way to visualize data is to create density ridges for every year
We will however want fewer years to make the graph easier to read
Here is how we go about it:
#Step1: Load the library
library("forcats")
#Step2: Subset the data to fewer years
final3<-subset(final2, Year>1980)
#Step3: Turn year into factor
final3$year_factor<-as.factor(final3$Year)
#Step4: Plot
ggplot(final3,
aes(x = life_exp,
y = fct_rev(as.factor(Year)), fill = ..x..))+
geom_density_ridges_gradient(color = "white",
quantile_lines = TRUE)+
labs(x = "Life expectancy", y = NULL, fill = NULL)+
scale_fill_viridis_c(name = "")
This produces:
We can see this way the first, second, and third quartile
We also see that the life expectancy gets narrower over time: most people live longer lives
One final aspect related to visualization is how we save the data
The command for this ggsave
Let us look at a few options
Let’s imagine we want to save the following plot
Let’s imagine we want to save the following plot
Let’s imagine we want to save the following plot
We first create an object: pic1
The next step is to use the ggsave
function.
This will save the object in your directory
It is important to be aware how much space these files take
pic1.pdf - 7KB
pic1.png - 119KB
pic.jpg - 410KB
Depending on your goal, you may want to opt for one of the three
For web, pngs are good
For academic publications, jpgs are acceptable
One should also be aware that width and heigh options, as part of the ggsave
function will alter the appearance of your figure
Here is what they look like:
For publication quality figures, you should use a dpi of 300
And here is 50 dpi
Here is what they look like:
The take-away is that higher dpi result in greater clarity at the expense of space
Lower dpi means lower clarity and lower space:
pic1_50dpi - 25KB
pic1_300dpi - 471KB
We have covered more ground in data visualization today:
different ways of plotting quantities: pie charts vs. bar charts
boxplots, violin plots, histograms, and density plots
the relationship between histograms and density plots
labelling and annotating graphs
highlighting graphs
visualizing time
saving figures
Write a function called scatter_plot
that takes two vectors x
and y
as inputs and generates a scatter plot using ggplot2
.
The x
and y
look like:
Write a function called histogram_plot
that takes a vector data as input and generates a histogram using ggplot2
.
Create the following dataframe:
# Load required library
library(ggplot2)
# Generate a fun dataset
set.seed(123) # for reproducibility
n_creatures <- 50
creatures <- data.frame(
type = rep(c("Dragon", "Unicorn", "Yeti"), each = n_creatures),
cuteness = c(rnorm(n_creatures, mean = 8, sd = 2),
rnorm(n_creatures, mean = 7, sd = 2),
rnorm(n_creatures, mean = 5, sd = 2)),
fierceness = c(rnorm(n_creatures, mean = 7, sd = 2),
rnorm(n_creatures, mean = 5, sd = 2),
rnorm(n_creatures, mean = 8, sd = 2))
)
Produce the following graph
Popescu (JCU): Lecture 8