Lab 3: Statistics

Data cleaning, Data merges, Scatterplots

In this lab, we’ll dive back into the life expectancy and urbanization datasets, learning how to load and examine them using R. We’ll start by organizing our files, then inspect and clean the data to remove any irrelevant entries. Finally, we’ll create scatterplots using ggplot2, exploring the relationship between urbanization and life expectancy, and customize our plots with titles, labels, and trend lines to make our findings clear and compelling.
Data cleaning
Data merges
Scatterplots
Author

Bogdan G. Popescu

1 Loading the Data

Let’s go back to our old datasets: life expectancy and urbanization. Download them from the following links:

2 Creating a directory

The next step is to create a sub-folder within your stats folder: Lab3. Place the csv files within the lab3.

Let’s now open them.

2.1 Option1: Opening using a relative path

#Setting path
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week3/lab/")
life_expectancy_df <- read.csv(file = './data/life-expectancy.csv')
urbanization_df <- read.csv(file = './data/share-of-population-urban.csv')

2.2 Option2: Opening using an absolute path

life_expectancy_df <- read.csv(file ='/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week3/lab/data/life-expectancy.csv')
urbanization_df <- read.csv(file = '/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week3/lab/data/share-of-population-urban.csv')

We should now have in memory the two datasets.

3 Examining the data

We can either print the entire dataset, but that will lead to a very long list of observations. In addition, if the dataset is too big, this will likely freeze your computer. The way to print the dataset is by simply typing the name of the datase. However, I am commenting this out.

#life_expectancy

A more efficient method is to simply print only a few observations. For example, we can simply print the first 5 observations for each dataset.

head(life_expectancy_df, n=5)
head(urbanization_df, n=10)

Another good way to examine your data is to glimpse at the data.

#install.packages("pillar")
library("pillar")
glimpse(life_expectancy_df)
Rows: 20,445
Columns: 4
$ Entity                                <chr> "Afghanistan", "Afghanistan", "A…
$ Code                                  <chr> "AFG", "AFG", "AFG", "AFG", "AFG…
$ Year                                  <int> 1950, 1951, 1952, 1953, 1954, 19…
$ Life.expectancy.at.birth..historical. <dbl> 27.7, 28.0, 28.4, 28.9, 29.2, 29…

Note that glimpse allows us to also see what kind of variables we are dealing with. Thus glimpse allows us to see that we have four variables within our dataframe: 1) Entity; 2) Code; 3) Year; 4) Life.expectancy.at.birth..historical.

  • Entity is a string or character variable that tells us the country that we are dealing with: “Afghanistan”, “Albania”, “Algeria”, etc.
  • Code is also a string or character variable that gives us the country code: “AFG”, “ALB”, “DZA”, etc.
  • Year is is a numeric or integer variable: 1950, 1951, 1952, etc.
  • Life.expectancy.at.birth..historical. is a numeric or double precision variable: 27.7, 28.0, 28.4, etc.

4 Creating a scatterplot

Let’s say that we want to create a scatterplot depicting the relationship between urbanization and life expectancy. The first thing we need to do is to calculate averages per country. We will proceed as we did previously.

4.1 Create averages

The next step is to create averages for all the countries. The %>% is called the pipe operator. This operator makes it possible to easily chain a sequence of calculations. “Chaining” means that you invoke multiple method calls. As each method returns an object, you can actually allow the calls to be chained together in a single statement, without needing variables to store the intermediate results.

library(dplyr)
life_expectancy_df2<-life_expectancy_df%>%
  dplyr::group_by(Code)%>%
  dplyr::summarize(life_exp_mean=mean(Life.expectancy.at.birth..historical.))
urbanization_df2<-urbanization_df%>%
  dplyr::group_by(Code)%>%
  dplyr::summarize(urb_mean=mean(Urban.population....of.total.population.))

4.2 Inspect the first 4 entries of each new dataframe

head(life_expectancy_df2, n=4)
head(urbanization_df2, n=4)

We see that some some entries do not have a code. Why is that? Let’s have a quick look at the original datasets and see what the issue is with those which do not have a code. Let’s first inspect the unique values in Code

unique(life_expectancy_df2$Code)
  [1] ""         "ABW"      "AFG"      "AGO"      "AIA"      "ALB"     
  [7] "AND"      "ARE"      "ARG"      "ARM"      "ASM"      "ATG"     
 [13] "AUS"      "AUT"      "AZE"      "BDI"      "BEL"      "BEN"     
 [19] "BES"      "BFA"      "BGD"      "BGR"      "BHR"      "BHS"     
 [25] "BIH"      "BLR"      "BLZ"      "BMU"      "BOL"      "BRA"     
 [31] "BRB"      "BRN"      "BTN"      "BWA"      "CAF"      "CAN"     
 [37] "CHE"      "CHL"      "CHN"      "CIV"      "CMR"      "COD"     
 [43] "COG"      "COK"      "COL"      "COM"      "CPV"      "CRI"     
 [49] "CUB"      "CUW"      "CYM"      "CYP"      "CZE"      "DEU"     
 [55] "DJI"      "DMA"      "DNK"      "DOM"      "DZA"      "ECU"     
 [61] "EGY"      "ERI"      "ESH"      "ESP"      "EST"      "ETH"     
 [67] "FIN"      "FJI"      "FLK"      "FRA"      "FRO"      "FSM"     
 [73] "GAB"      "GBR"      "GEO"      "GGY"      "GHA"      "GIB"     
 [79] "GIN"      "GLP"      "GMB"      "GNB"      "GNQ"      "GRC"     
 [85] "GRD"      "GRL"      "GTM"      "GUF"      "GUM"      "GUY"     
 [91] "HKG"      "HND"      "HRV"      "HTI"      "HUN"      "IDN"     
 [97] "IMN"      "IND"      "IRL"      "IRN"      "IRQ"      "ISL"     
[103] "ISR"      "ITA"      "JAM"      "JEY"      "JOR"      "JPN"     
[109] "KAZ"      "KEN"      "KGZ"      "KHM"      "KIR"      "KNA"     
[115] "KOR"      "KWT"      "LAO"      "LBN"      "LBR"      "LBY"     
[121] "LCA"      "LIE"      "LKA"      "LSO"      "LTU"      "LUX"     
[127] "LVA"      "MAC"      "MAF"      "MAR"      "MCO"      "MDA"     
[133] "MDG"      "MDV"      "MEX"      "MHL"      "MKD"      "MLI"     
[139] "MLT"      "MMR"      "MNE"      "MNG"      "MNP"      "MOZ"     
[145] "MRT"      "MSR"      "MTQ"      "MUS"      "MWI"      "MYS"     
[151] "MYT"      "NAM"      "NCL"      "NER"      "NGA"      "NIC"     
[157] "NIU"      "NLD"      "NOR"      "NPL"      "NRU"      "NZL"     
[163] "OMN"      "OWID_KOS" "OWID_WRL" "PAK"      "PAN"      "PER"     
[169] "PHL"      "PLW"      "PNG"      "POL"      "PRI"      "PRK"     
[175] "PRT"      "PRY"      "PSE"      "PYF"      "QAT"      "REU"     
[181] "ROU"      "RUS"      "RWA"      "SAU"      "SDN"      "SEN"     
[187] "SGP"      "SHN"      "SLB"      "SLE"      "SLV"      "SMR"     
[193] "SOM"      "SPM"      "SRB"      "SSD"      "STP"      "SUR"     
[199] "SVK"      "SVN"      "SWE"      "SWZ"      "SXM"      "SYC"     
[205] "SYR"      "TCA"      "TCD"      "TGO"      "THA"      "TJK"     
[211] "TKL"      "TKM"      "TLS"      "TON"      "TTO"      "TUN"     
[217] "TUR"      "TUV"      "TWN"      "TZA"      "UGA"      "UKR"     
[223] "URY"      "USA"      "UZB"      "VAT"      "VCT"      "VEN"     
[229] "VGB"      "VIR"      "VNM"      "VUT"      "WLF"      "WSM"     
[235] "YEM"      "ZAF"      "ZMB"      "ZWE"     

So we have some unusual entries such as ““,”OWID_KOS” or “OWID_WRL”. What are these countries? We can do a subset to answer this question.

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
weird_countries<-subset(life_expectancy_df2, (Code %in% weird_labels))

Let’s inspect them by verifying unique countries

weird_countries2<-subset(weird_countries, !duplicated(weird_countries$Code))
head(weird_countries2, n=10)

So what is the problem with these data?

The problem is that these are aggregates that were decided by Our World in Data. So, we should remove them from the data. The way to do that is subsetting them based on the names that seemed problematic. Note the difference in code from before !(Code %in% weird_labels) vs. (Code %in% weird_labels)

4.3 Cleaning our dataset

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_life_expectancy_df<-subset(life_expectancy_df2, !(Code %in% weird_labels))
head(clean_life_expectancy_df, n=5)

Great! Let’s repeat the procedure for urbanization.

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_urbanization_df<-subset(urbanization_df2, !(Code %in% weird_labels))
head(clean_urbanization_df, n=5)

4.4 Performing a left merge

We will now perform a left merge whereby we try to merge urbanization data to life expectancy based on Code.

merged_data<-left_join(clean_life_expectancy_df, clean_urbanization_df, by = c("Code"="Code"))
head(merged_data, n=10)

4.5 Getting rid of NAs for urbanization

merged_data2<-subset(na.omit(merged_data))
head(merged_data2, n=10)

How many countries did we drop?

nrow(merged_data) - nrow(merged_data2)
[1] 21

I dropped 21.

What are those countries?

countries_dropped<-merged_data[!complete.cases(merged_data), ]
countries_dropped$Code
 [1] "AIA" "BES" "COK" "ESH" "FLK" "GGY" "GLP" "GUF" "JEY" "MAF" "MSR" "MTQ"
[13] "MYT" "NIU" "REU" "SHN" "SPM" "TKL" "TWN" "VAT" "WLF"

Let us have quick look at what these codes mean.

cntries<-subset(life_expectancy_df, Code %in% countries_dropped$Code)
unique(cntries$Entity)
 [1] "Anguilla"                        "Bonaire Sint Eustatius and Saba"
 [3] "Cook Islands"                    "Falkland Islands"               
 [5] "French Guiana"                   "Guadeloupe"                     
 [7] "Guernsey"                        "Jersey"                         
 [9] "Martinique"                      "Mayotte"                        
[11] "Montserrat"                      "Niue"                           
[13] "Reunion"                         "Saint Helena"                   
[15] "Saint Martin (French part)"      "Saint Pierre and Miquelon"      
[17] "Taiwan"                          "Tokelau"                        
[19] "Vatican"                         "Wallis and Futuna"              
[21] "Western Sahara"                 

4.6 Doing a scatterplot

ggplot2 in R is one of the most elegant and most versatile graphing libraries in R. It implements the grammar of graphics, a coherent system for describing and building graphs. We will start by creating simple scatterplots and use that to introduce aesthetic mappings and geometric objects.

We are finally able to do simple scatterplot.

library(ggplot2)
fig1<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()
fig1

4.7 Saving our first plot

We can save our plot by using the following code!

4.7.1 Relative Path Option

ggsave(fig1, file = "./graphs/fig1.jpg", 
       height = 20, width = 20, 
       units = "cm", dpi = 300)

4.7.2 Absolute Path Option

ggsave(fig1, file = "/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week3/lab/graphs/fig1.jpg", 
       height = 20, width = 20, 
       units = "cm", dpi = 300)

5 ggplot options

Let us play a bit with the options in ggplot so that we make our graph nicer.

5.1 Improving X and Y axis labels

fig2<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")
fig2

5.2 Adding a Title

fig2<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")
fig2

5.3 Getting rid of the grey background

fig2<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()
fig2

5.4 Fitting a line

Let us fit a line to see if there is any relationship between urbanization and life expectancy. Is it the case that countries with higher levels of urbanization also have higher life expectancy?

fig2<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)

fig2

The blue line would indeed suggest that as the level of urbanization goes up, so does life expectancy.

5.5 Fixing the X and Y axis to particular values

But what if we fixed our axies to particular values. Woudl we still see the same effects?

fig3<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)+
  scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(50,100))+
  scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(50,100))
fig3

Let’s put the two graphs together. The first graph is the one in which we fix the x and y axis while the second one is the one in which we don’t.

library(gridExtra)
grid.arrange(fig2, fig3, ncol=2)

This is interesting. So depending on how we present the data, the relationship may not seem as strong. However, the relationship is exactly the same. Thus, how you present the data matters!

5.6 Labelling the data

There are many ways of labeling the data in ggplot. But let us try a simple way.

First let us do a quick left merge so that we keep the original country names

label_cntry<-subset(life_expectancy_df, select = c("Entity", "Code"))
label_cntry<-subset(label_cntry, !duplicated(Code))
merged_data3<-left_join(merged_data2, label_cntry, by = c("Code"="Code"))
fig3<-ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)+
  scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  geom_text(aes(label = Entity),
            size = 4, 
            check_overlap = TRUE, 
            position = position_nudge(y = 1))
fig3

What if we wanted to do selective labeling? Let’s say, we just want to label Italy and the US. How could we do that?

Well, we could create a new dataset based on a subset and then label the data.

5.7 Subsetting the data to two countries

countries_of_interest = c("Italy", "United States")
italy_us_df<-subset(merged_data3, Entity %in%  countries_of_interest)

5.8 Let’s try the labelling again

fig4<-ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)+
  scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  geom_text(aes(label = Entity),
            size = 4, 
            check_overlap = TRUE, 
            position = position_nudge(y = 1),
            data = italy_us_df)
fig4

Tis is not ideal. You can barely see the text. We should probably find a better way. The better way is called geom_label.

fig4<-ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)+
  scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  geom_label(aes(label = Entity),
            size = 4, 
            position = position_nudge(y = 1),
            data = italy_us_df)
fig4

This is better. But still, which one is Italy and which one is the US?

library(ggrepel)
fig4<-ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)+
  scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
 geom_label_repel(box.padding = 0.5,
    aes(label = Entity),
            size = 4, 
            position = position_nudge(y = 1),
            data = italy_us_df)
fig4

So much better!