#Setting path
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week3/lab/")
<- read.csv(file = './data/life-expectancy.csv')
life_expectancy_df <- read.csv(file = './data/share-of-population-urban.csv') urbanization_df
1 Loading the Data
Let’s go back to our old datasets: life expectancy and urbanization. Download them from the following links:
2 Creating a directory
The next step is to create a sub-folder within your stats folder: Lab3. Place the csv files within the lab3.
Let’s now open them.
2.1 Option1: Opening using a relative path
2.2 Option2: Opening using an absolute path
<- read.csv(file ='/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week3/lab/data/life-expectancy.csv')
life_expectancy_df <- read.csv(file = '/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week3/lab/data/share-of-population-urban.csv') urbanization_df
We should now have in memory the two datasets.
3 Examining the data
We can either print the entire dataset, but that will lead to a very long list of observations. In addition, if the dataset is too big, this will likely freeze your computer. The way to print the dataset is by simply typing the name of the datase. However, I am commenting this out.
#life_expectancy
A more efficient method is to simply print only a few observations. For example, we can simply print the first 5 observations for each dataset.
head(life_expectancy_df, n=5)
head(urbanization_df, n=10)
Another good way to examine your data is to glimpse at the data.
#install.packages("pillar")
library("pillar")
glimpse(life_expectancy_df)
Rows: 20,445
Columns: 4
$ Entity <chr> "Afghanistan", "Afghanistan", "A…
$ Code <chr> "AFG", "AFG", "AFG", "AFG", "AFG…
$ Year <int> 1950, 1951, 1952, 1953, 1954, 19…
$ Life.expectancy.at.birth..historical. <dbl> 27.7, 28.0, 28.4, 28.9, 29.2, 29…
Note that glimpse allows us to also see what kind of variables we are dealing with. Thus glimpse allows us to see that we have four variables within our dataframe: 1) Entity; 2) Code; 3) Year; 4) Life.expectancy.at.birth..historical.
- Entity is a string or character variable that tells us the country that we are dealing with: “Afghanistan”, “Albania”, “Algeria”, etc.
- Code is also a string or character variable that gives us the country code: “AFG”, “ALB”, “DZA”, etc.
- Year is is a numeric or integer variable: 1950, 1951, 1952, etc.
- Life.expectancy.at.birth..historical. is a numeric or double precision variable: 27.7, 28.0, 28.4, etc.
4 Creating a scatterplot
Let’s say that we want to create a scatterplot depicting the relationship between urbanization and life expectancy. The first thing we need to do is to calculate averages per country. We will proceed as we did previously.
4.1 Create averages
The next step is to create averages for all the countries. The %>%
is called the pipe operator. This operator makes it possible to easily chain a sequence of calculations. “Chaining” means that you invoke multiple method calls. As each method returns an object, you can actually allow the calls to be chained together in a single statement, without needing variables to store the intermediate results.
library(dplyr)
<-life_expectancy_df%>%
life_expectancy_df2::group_by(Code)%>%
dplyr::summarize(life_exp_mean=mean(Life.expectancy.at.birth..historical.)) dplyr
<-urbanization_df%>%
urbanization_df2::group_by(Code)%>%
dplyr::summarize(urb_mean=mean(Urban.population....of.total.population.)) dplyr
4.2 Inspect the first 4 entries of each new dataframe
head(life_expectancy_df2, n=4)
head(urbanization_df2, n=4)
We see that some some entries do not have a code. Why is that? Let’s have a quick look at the original datasets and see what the issue is with those which do not have a code. Let’s first inspect the unique values in Code
unique(life_expectancy_df2$Code)
[1] "" "ABW" "AFG" "AGO" "AIA" "ALB"
[7] "AND" "ARE" "ARG" "ARM" "ASM" "ATG"
[13] "AUS" "AUT" "AZE" "BDI" "BEL" "BEN"
[19] "BES" "BFA" "BGD" "BGR" "BHR" "BHS"
[25] "BIH" "BLR" "BLZ" "BMU" "BOL" "BRA"
[31] "BRB" "BRN" "BTN" "BWA" "CAF" "CAN"
[37] "CHE" "CHL" "CHN" "CIV" "CMR" "COD"
[43] "COG" "COK" "COL" "COM" "CPV" "CRI"
[49] "CUB" "CUW" "CYM" "CYP" "CZE" "DEU"
[55] "DJI" "DMA" "DNK" "DOM" "DZA" "ECU"
[61] "EGY" "ERI" "ESH" "ESP" "EST" "ETH"
[67] "FIN" "FJI" "FLK" "FRA" "FRO" "FSM"
[73] "GAB" "GBR" "GEO" "GGY" "GHA" "GIB"
[79] "GIN" "GLP" "GMB" "GNB" "GNQ" "GRC"
[85] "GRD" "GRL" "GTM" "GUF" "GUM" "GUY"
[91] "HKG" "HND" "HRV" "HTI" "HUN" "IDN"
[97] "IMN" "IND" "IRL" "IRN" "IRQ" "ISL"
[103] "ISR" "ITA" "JAM" "JEY" "JOR" "JPN"
[109] "KAZ" "KEN" "KGZ" "KHM" "KIR" "KNA"
[115] "KOR" "KWT" "LAO" "LBN" "LBR" "LBY"
[121] "LCA" "LIE" "LKA" "LSO" "LTU" "LUX"
[127] "LVA" "MAC" "MAF" "MAR" "MCO" "MDA"
[133] "MDG" "MDV" "MEX" "MHL" "MKD" "MLI"
[139] "MLT" "MMR" "MNE" "MNG" "MNP" "MOZ"
[145] "MRT" "MSR" "MTQ" "MUS" "MWI" "MYS"
[151] "MYT" "NAM" "NCL" "NER" "NGA" "NIC"
[157] "NIU" "NLD" "NOR" "NPL" "NRU" "NZL"
[163] "OMN" "OWID_KOS" "OWID_WRL" "PAK" "PAN" "PER"
[169] "PHL" "PLW" "PNG" "POL" "PRI" "PRK"
[175] "PRT" "PRY" "PSE" "PYF" "QAT" "REU"
[181] "ROU" "RUS" "RWA" "SAU" "SDN" "SEN"
[187] "SGP" "SHN" "SLB" "SLE" "SLV" "SMR"
[193] "SOM" "SPM" "SRB" "SSD" "STP" "SUR"
[199] "SVK" "SVN" "SWE" "SWZ" "SXM" "SYC"
[205] "SYR" "TCA" "TCD" "TGO" "THA" "TJK"
[211] "TKL" "TKM" "TLS" "TON" "TTO" "TUN"
[217] "TUR" "TUV" "TWN" "TZA" "UGA" "UKR"
[223] "URY" "USA" "UZB" "VAT" "VCT" "VEN"
[229] "VGB" "VIR" "VNM" "VUT" "WLF" "WSM"
[235] "YEM" "ZAF" "ZMB" "ZWE"
So we have some unusual entries such as ““,”OWID_KOS” or “OWID_WRL”. What are these countries? We can do a subset to answer this question.
<- c("OWID_KOS", "OWID_WRL", "")
weird_labels <-subset(life_expectancy_df2, (Code %in% weird_labels)) weird_countries
Let’s inspect them by verifying unique countries
<-subset(weird_countries, !duplicated(weird_countries$Code))
weird_countries2head(weird_countries2, n=10)
So what is the problem with these data?
The problem is that these are aggregates that were decided by Our World in Data. So, we should remove them from the data. The way to do that is subsetting them based on the names that seemed problematic. Note the difference in code from before !(Code %in% weird_labels)
vs. (Code %in% weird_labels)
4.3 Cleaning our dataset
<- c("OWID_KOS", "OWID_WRL", "")
weird_labels <-subset(life_expectancy_df2, !(Code %in% weird_labels))
clean_life_expectancy_dfhead(clean_life_expectancy_df, n=5)
Great! Let’s repeat the procedure for urbanization.
<- c("OWID_KOS", "OWID_WRL", "")
weird_labels <-subset(urbanization_df2, !(Code %in% weird_labels))
clean_urbanization_dfhead(clean_urbanization_df, n=5)
4.4 Performing a left merge
We will now perform a left merge whereby we try to merge urbanization data to life expectancy based on Code.
<-left_join(clean_life_expectancy_df, clean_urbanization_df, by = c("Code"="Code"))
merged_datahead(merged_data, n=10)
4.5 Getting rid of NAs for urbanization
<-subset(na.omit(merged_data))
merged_data2head(merged_data2, n=10)
How many countries did we drop?
nrow(merged_data) - nrow(merged_data2)
[1] 21
I dropped 21.
What are those countries?
<-merged_data[!complete.cases(merged_data), ]
countries_dropped$Code countries_dropped
[1] "AIA" "BES" "COK" "ESH" "FLK" "GGY" "GLP" "GUF" "JEY" "MAF" "MSR" "MTQ"
[13] "MYT" "NIU" "REU" "SHN" "SPM" "TKL" "TWN" "VAT" "WLF"
Let us have quick look at what these codes mean.
<-subset(life_expectancy_df, Code %in% countries_dropped$Code)
cntriesunique(cntries$Entity)
[1] "Anguilla" "Bonaire Sint Eustatius and Saba"
[3] "Cook Islands" "Falkland Islands"
[5] "French Guiana" "Guadeloupe"
[7] "Guernsey" "Jersey"
[9] "Martinique" "Mayotte"
[11] "Montserrat" "Niue"
[13] "Reunion" "Saint Helena"
[15] "Saint Martin (French part)" "Saint Pierre and Miquelon"
[17] "Taiwan" "Tokelau"
[19] "Vatican" "Wallis and Futuna"
[21] "Western Sahara"
4.6 Doing a scatterplot
ggplot2 in R is one of the most elegant and most versatile graphing libraries in R. It implements the grammar of graphics, a coherent system for describing and building graphs. We will start by creating simple scatterplots and use that to introduce aesthetic mappings and geometric objects.
We are finally able to do simple scatterplot.
library(ggplot2)
<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
fig1geom_point()
fig1
4.7 Saving our first plot
We can save our plot by using the following code!
4.7.1 Relative Path Option
ggsave(fig1, file = "./graphs/fig1.jpg",
height = 20, width = 20,
units = "cm", dpi = 300)
4.7.2 Absolute Path Option
ggsave(fig1, file = "/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week3/lab/graphs/fig1.jpg",
height = 20, width = 20,
units = "cm", dpi = 300)
5 ggplot options
Let us play a bit with the options in ggplot so that we make our graph nicer.
5.1 Improving X and Y axis labels
<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
fig2geom_point()+
xlab("Urbanization") + ylab("Life Expectancy")
fig2
5.2 Adding a Title
<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
fig2geom_point()+
xlab("Urbanization") + ylab("Life Expectancy")+
ggtitle("Example Title")
fig2
5.3 Getting rid of the grey background
<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
fig2geom_point()+
xlab("Urbanization") + ylab("Life Expectancy")+
ggtitle("Example Title")+
theme_bw()
fig2
5.4 Fitting a line
Let us fit a line to see if there is any relationship between urbanization and life expectancy. Is it the case that countries with higher levels of urbanization also have higher life expectancy?
<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
fig2geom_point()+
xlab("Urbanization") + ylab("Life Expectancy")+
ggtitle("Example Title")+
theme_bw()+
geom_smooth(method = "lm", se=FALSE)
fig2
The blue line would indeed suggest that as the level of urbanization goes up, so does life expectancy.
5.5 Fixing the X and Y axis to particular values
But what if we fixed our axies to particular values. Woudl we still see the same effects?
<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
fig3geom_point()+
xlab("Urbanization") + ylab("Life Expectancy")+
ggtitle("Example Title")+
theme_bw()+
geom_smooth(method = "lm", se=FALSE)+
scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(50,100))+
scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(50,100))
fig3
Let’s put the two graphs together. The first graph is the one in which we fix the x and y axis while the second one is the one in which we don’t.
library(gridExtra)
grid.arrange(fig2, fig3, ncol=2)
This is interesting. So depending on how we present the data, the relationship may not seem as strong. However, the relationship is exactly the same. Thus, how you present the data matters!
5.6 Labelling the data
There are many ways of labeling the data in ggplot. But let us try a simple way.
First let us do a quick left merge so that we keep the original country names
<-subset(life_expectancy_df, select = c("Entity", "Code"))
label_cntry<-subset(label_cntry, !duplicated(Code))
label_cntry<-left_join(merged_data2, label_cntry, by = c("Code"="Code")) merged_data3
<-ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
fig3geom_point()+
xlab("Urbanization") + ylab("Life Expectancy")+
ggtitle("Example Title")+
theme_bw()+
geom_smooth(method = "lm", se=FALSE)+
scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
geom_text(aes(label = Entity),
size = 4,
check_overlap = TRUE,
position = position_nudge(y = 1))
fig3
What if we wanted to do selective labeling? Let’s say, we just want to label Italy and the US. How could we do that?
Well, we could create a new dataset based on a subset and then label the data.
5.7 Subsetting the data to two countries
= c("Italy", "United States")
countries_of_interest <-subset(merged_data3, Entity %in% countries_of_interest) italy_us_df
5.8 Let’s try the labelling again
<-ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
fig4geom_point()+
xlab("Urbanization") + ylab("Life Expectancy")+
ggtitle("Example Title")+
theme_bw()+
geom_smooth(method = "lm", se=FALSE)+
scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
geom_text(aes(label = Entity),
size = 4,
check_overlap = TRUE,
position = position_nudge(y = 1),
data = italy_us_df)
fig4
Tis is not ideal. You can barely see the text. We should probably find a better way. The better way is called geom_label
.
<-ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
fig4geom_point()+
xlab("Urbanization") + ylab("Life Expectancy")+
ggtitle("Example Title")+
theme_bw()+
geom_smooth(method = "lm", se=FALSE)+
scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
geom_label(aes(label = Entity),
size = 4,
position = position_nudge(y = 1),
data = italy_us_df)
fig4
This is better. But still, which one is Italy and which one is the US?
library(ggrepel)
<-ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
fig4geom_point()+
xlab("Urbanization") + ylab("Life Expectancy")+
ggtitle("Example Title")+
theme_bw()+
geom_smooth(method = "lm", se=FALSE)+
scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
geom_label_repel(box.padding = 0.5,
aes(label = Entity),
size = 4,
position = position_nudge(y = 1),
data = italy_us_df)
fig4
So much better!