Lab 3: Statistics

Data cleaning, Data merges, Scatterplots

Author

Bogdan G. Popescu

1 Loading the Data

Let’s go back to our old datasets: life expectancy and urbanization. Download them from the following links:

2 Creating a directory

The next step is to create a sub-folder within your stats folder: Lab3. Place the csv files within the lab3.

Let’s now open them.

2.1 Option1: Opening using a relative path

#Setting path
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week3/lab/")
life_expectancy_df <- read.csv(file = './data/life-expectancy.csv')
urbanization_df <- read.csv(file = './data/share-of-population-urban.csv')

2.2 Option2: Opening using an absolute path

life_expectancy_df <- read.csv(file ='/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week3/lab/data/life-expectancy.csv')
urbanization_df <- read.csv(file = '/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week3/lab/data/share-of-population-urban.csv')

We should now have in memory the two datasets.

3 Examining the data

We can either print the entire dataset, but that will lead to a very long list of observations. In addition, if the dataset is too big, this will likely freeze your computer. The way to print the dataset is by simply typing the name of the datase. However, I am commenting this out.

#life_expectancy

A more efficient method is to simply print only a few observations. For example, we can simply print the first 5 observations for each dataset.

head(life_expectancy_df, n=5)
head(urbanization_df, n=10)

Another good way to examine your data is to glimpse at the data.

#install.packages("pillar")
library("pillar")
glimpse(life_expectancy_df)
Rows: 20,445
Columns: 4
$ Entity                                <chr> "Afghanistan", "Afghanistan", "A…
$ Code                                  <chr> "AFG", "AFG", "AFG", "AFG", "AFG…
$ Year                                  <int> 1950, 1951, 1952, 1953, 1954, 19…
$ Life.expectancy.at.birth..historical. <dbl> 27.7, 28.0, 28.4, 28.9, 29.2, 29…

Note that glimpse allows us to also see what kind of variables we are dealing with. Thus glimpse allows us to see that we have four variables within our dataframe: 1) Entity; 2) Code; 3) Year; 4) Life.expectancy.at.birth..historical.

  • Entity is a string or character variable that tells us the country that we are dealing with: “Afghanistan”, “Albania”, “Algeria”, etc.
  • Code is also a string or character variable that gives us the country code: “AFG”, “ALB”, “DZA”, etc.
  • Year is is a numeric or integer variable: 1950, 1951, 1952, etc.
  • Life.expectancy.at.birth..historical. is a numeric or double precision variable: 27.7, 28.0, 28.4, etc.

4 Creating a scatterplot

Let’s say that we want to create a scatterplot depicting the relationship between urbanization and life expectancy. The first thing we need to do is to calculate averages per country. We will proceed as we did previously.

4.1 Create averages

The next step is to create averages for all the countries. The %>% is called the pipe operator. This operator makes it possible to easily chain a sequence of calculations. “Chaining” means that you invoke multiple method calls. As each method returns an object, you can actually allow the calls to be chained together in a single statement, without needing variables to store the intermediate results.

library(dplyr)
life_expectancy_df2<-life_expectancy_df%>%
  dplyr::group_by(Code)%>%
  dplyr::summarize(life_exp_mean=mean(Life.expectancy.at.birth..historical.))
urbanization_df2<-urbanization_df%>%
  dplyr::group_by(Code)%>%
  dplyr::summarize(urb_mean=mean(Urban.population....of.total.population.))

4.2 Inspect the first 4 entries of each new dataframe

head(life_expectancy_df2, n=4)
head(urbanization_df2, n=4)

We see that some some entries do not have a code. Why is that? Let’s have a quick look at the original datasets and see what the issue is with those which do not have a code. Let’s first inspect the unique values in Code

unique(life_expectancy_df2$Code)
  [1] ""         "ABW"      "AFG"      "AGO"      "AIA"      "ALB"     
  [7] "AND"      "ARE"      "ARG"      "ARM"      "ASM"      "ATG"     
 [13] "AUS"      "AUT"      "AZE"      "BDI"      "BEL"      "BEN"     
 [19] "BES"      "BFA"      "BGD"      "BGR"      "BHR"      "BHS"     
 [25] "BIH"      "BLR"      "BLZ"      "BMU"      "BOL"      "BRA"     
 [31] "BRB"      "BRN"      "BTN"      "BWA"      "CAF"      "CAN"     
 [37] "CHE"      "CHL"      "CHN"      "CIV"      "CMR"      "COD"     
 [43] "COG"      "COK"      "COL"      "COM"      "CPV"      "CRI"     
 [49] "CUB"      "CUW"      "CYM"      "CYP"      "CZE"      "DEU"     
 [55] "DJI"      "DMA"      "DNK"      "DOM"      "DZA"      "ECU"     
 [61] "EGY"      "ERI"      "ESH"      "ESP"      "EST"      "ETH"     
 [67] "FIN"      "FJI"      "FLK"      "FRA"      "FRO"      "FSM"     
 [73] "GAB"      "GBR"      "GEO"      "GGY"      "GHA"      "GIB"     
 [79] "GIN"      "GLP"      "GMB"      "GNB"      "GNQ"      "GRC"     
 [85] "GRD"      "GRL"      "GTM"      "GUF"      "GUM"      "GUY"     
 [91] "HKG"      "HND"      "HRV"      "HTI"      "HUN"      "IDN"     
 [97] "IMN"      "IND"      "IRL"      "IRN"      "IRQ"      "ISL"     
[103] "ISR"      "ITA"      "JAM"      "JEY"      "JOR"      "JPN"     
[109] "KAZ"      "KEN"      "KGZ"      "KHM"      "KIR"      "KNA"     
[115] "KOR"      "KWT"      "LAO"      "LBN"      "LBR"      "LBY"     
[121] "LCA"      "LIE"      "LKA"      "LSO"      "LTU"      "LUX"     
[127] "LVA"      "MAC"      "MAF"      "MAR"      "MCO"      "MDA"     
[133] "MDG"      "MDV"      "MEX"      "MHL"      "MKD"      "MLI"     
[139] "MLT"      "MMR"      "MNE"      "MNG"      "MNP"      "MOZ"     
[145] "MRT"      "MSR"      "MTQ"      "MUS"      "MWI"      "MYS"     
[151] "MYT"      "NAM"      "NCL"      "NER"      "NGA"      "NIC"     
[157] "NIU"      "NLD"      "NOR"      "NPL"      "NRU"      "NZL"     
[163] "OMN"      "OWID_KOS" "OWID_WRL" "PAK"      "PAN"      "PER"     
[169] "PHL"      "PLW"      "PNG"      "POL"      "PRI"      "PRK"     
[175] "PRT"      "PRY"      "PSE"      "PYF"      "QAT"      "REU"     
[181] "ROU"      "RUS"      "RWA"      "SAU"      "SDN"      "SEN"     
[187] "SGP"      "SHN"      "SLB"      "SLE"      "SLV"      "SMR"     
[193] "SOM"      "SPM"      "SRB"      "SSD"      "STP"      "SUR"     
[199] "SVK"      "SVN"      "SWE"      "SWZ"      "SXM"      "SYC"     
[205] "SYR"      "TCA"      "TCD"      "TGO"      "THA"      "TJK"     
[211] "TKL"      "TKM"      "TLS"      "TON"      "TTO"      "TUN"     
[217] "TUR"      "TUV"      "TWN"      "TZA"      "UGA"      "UKR"     
[223] "URY"      "USA"      "UZB"      "VAT"      "VCT"      "VEN"     
[229] "VGB"      "VIR"      "VNM"      "VUT"      "WLF"      "WSM"     
[235] "YEM"      "ZAF"      "ZMB"      "ZWE"     

So we have some unusual entries such as ““,”OWID_KOS” or “OWID_WRL”. What are these countries? We can do a subset to answer this question.

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
weird_countries<-subset(life_expectancy_df2, (Code %in% weird_labels))

Let’s inspect them by verifying unique countries

weird_countries2<-subset(weird_countries, !duplicated(weird_countries$Code))
head(weird_countries2, n=10)

So what is the problem with these data?

The problem is that these are aggregates that were decided by Our World in Data. So, we should remove them from the data. The way to do that is subsetting them based on the names that seemed problematic. Note the difference in code from before !(Code %in% weird_labels) vs. (Code %in% weird_labels)

4.3 Cleaning our dataset

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_life_expectancy_df<-subset(life_expectancy_df2, !(Code %in% weird_labels))
head(clean_life_expectancy_df, n=5)

Great! Let’s repeat the procedure for urbanization.

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_urbanization_df<-subset(urbanization_df2, !(Code %in% weird_labels))
head(clean_urbanization_df, n=5)

4.4 Performing a left merge

We will now perform a left merge whereby we try to merge urbanization data to life expectancy based on Code.

merged_data<-left_join(clean_life_expectancy_df, clean_urbanization_df, by = c("Code"="Code"))
head(merged_data, n=10)

4.5 Getting rid of NAs for urbanization

merged_data2<-subset(na.omit(merged_data))
head(merged_data2, n=10)

How many countries did we drop?

nrow(merged_data) - nrow(merged_data2)
[1] 21

I dropped 21.

What are those countries?

countries_dropped<-merged_data[!complete.cases(merged_data), ]
countries_dropped$Code
 [1] "AIA" "BES" "COK" "ESH" "FLK" "GGY" "GLP" "GUF" "JEY" "MAF" "MSR" "MTQ"
[13] "MYT" "NIU" "REU" "SHN" "SPM" "TKL" "TWN" "VAT" "WLF"

Let us have quick look at what these codes mean.

cntries<-subset(life_expectancy_df, Code %in% countries_dropped$Code)
unique(cntries$Entity)
 [1] "Anguilla"                        "Bonaire Sint Eustatius and Saba"
 [3] "Cook Islands"                    "Falkland Islands"               
 [5] "French Guiana"                   "Guadeloupe"                     
 [7] "Guernsey"                        "Jersey"                         
 [9] "Martinique"                      "Mayotte"                        
[11] "Montserrat"                      "Niue"                           
[13] "Reunion"                         "Saint Helena"                   
[15] "Saint Martin (French part)"      "Saint Pierre and Miquelon"      
[17] "Taiwan"                          "Tokelau"                        
[19] "Vatican"                         "Wallis and Futuna"              
[21] "Western Sahara"                 

4.6 Doing a scatterplot

ggplot2 in R is one of the most elegant and most versatile graphing libraries in R. It implements the grammar of graphics, a coherent system for describing and building graphs. We will start by creating simple scatterplots and use that to introduce aesthetic mappings and geometric objects.

We are finally able to do simple scatterplot.

library(ggplot2)
fig1<-ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()
fig1