Merging Data, Scatterplots
Today, we’ll learn how to:
Use dplyr to group and summarize values
Use ggplot2 to make scatterplots
Press CMD + A or Ctrl + A and then press Delete
Then type:
We first remove what we had previously
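A minimal sketch of this step, assuming the goal is to clear every object from the workspace with base R's rm():

```r
# Create a throwaway object so there is something to remove
old_object <- 1:10

# Remove every object in the current workspace (assumed intent of this step)
rm(list = ls())

# The workspace is now empty
ls()
```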
Download the following datasets from Dropbox:
Place them in your working directory or folder.
To get the file path we simply go to the relevant folder
Once we have the path, we can now read the files:
# Defining Paths
file_path <- "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/lecture7/data/"
# Use file.path() to construct full path
life_expectancy_df <- read.csv(file.path(file_path, "life-expectancy.csv"))
urbanization_df <- read.csv(file.path(file_path, "share-of-population-urban.csv"))
These are our datasets
What Are Measures of Central Tendency?
This is what this looks like in code:
Note that %>% is called a pipe operator.
It allows you to “pipe” an object forward to a function.
You may also sometimes encounter the |> sign.
It has the same role.
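To see that both pipes do the same job, here is a small sketch on a made-up vector. It assumes dplyr is installed (it makes %>% available) and R 4.1 or later for |>.

```r
library(dplyr)  # makes %>% available (re-exported from magrittr)

# Made-up values for illustration
values <- c(1, 4, 9)

# Nested call: read from the inside out
nested_result <- sum(sqrt(values))

# %>% pipe: read from left to right
pipe_result <- values %>% sqrt() %>% sum()

# Native pipe (R 4.1 and later)
native_result <- values |> sqrt() |> sum()

c(nested_result, pipe_result, native_result)
```

All three versions return the same number; the pipe only changes the order in which we read the steps.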
You could also write:
instead of:
We include dplyr:: because other libraries may also have functions with the same names, such as group_by.
Not including the library name can lead to unexpected behavior.
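A sketch of the explicit-namespace style on toy data. The data frame and its column names are made up. The filter() call is our own illustration: both the stats package and dplyr define a function called filter(), so the prefix removes any doubt about which one runs.

```r
# Toy data; the column names are made up for illustration
toy_df <- data.frame(Code = c("ITA", "ITA", "USA"),
                     value = c(80, 82, 77))

# Explicit namespace: works even if dplyr is not attached with library(),
# and cannot be confused with stats::filter()
italy_rows <- dplyr::filter(toy_df, Code == "ITA")

# The same idea applies to group_by() and summarize()
by_code <- dplyr::summarize(dplyr::group_by(toy_df, Code),
                            mean_value = mean(value))
by_code
```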
Finally, we can also write this, if we have NAs:
This allows us to ignore the NA values that might exist within a variable.
For simplicity, we will not include the na.rm=TRUE code.
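A small sketch of what na.rm = TRUE changes, using made-up values. The same argument can be passed to mean() inside summarize().

```r
# Made-up life expectancy values with one missing entry
life_values <- c(70, NA, 80)

# Without na.rm, a single NA makes the result NA
mean_with_na <- mean(life_values)

# With na.rm = TRUE, the NA is dropped before averaging
mean_without_na <- mean(life_values, na.rm = TRUE)

c(mean_with_na, mean_without_na)
```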
\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x_i}{n} \]
Not resistant to outliers
In our case, we can calculate the mean of all the life expectancy values
Resistant to outliers:
In our case, we can calculate the median of all the life expectancy values
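To see the difference in resistance to outliers, here is a sketch with made-up numbers (not the real life expectancy data). Replacing one value with an extreme one moves the mean a lot and the median only a little.

```r
# Made-up life expectancy values
life_values <- c(70, 72, 74, 76, 78)

# Same values, but the last one is replaced by an extreme low value
life_values_outlier <- c(70, 72, 74, 76, 20)

mean(life_values)            # 74
median(life_values)          # 74

mean(life_values_outlier)    # 62.4
median(life_values_outlier)  # 72
```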
Why Dispersion Matters
where:
- \(x_i\) stands for the ith value
- \(\bar{x}\) is the population mean of the variable
- \(n\) is the population size
In our case, we can calculate the range of all the life expectancy values
Thus, the difference is 74.5.
This is how we calculate the variance
This is how we calculate the standard deviation
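A sketch of the three dispersion measures on made-up values. Note that R's var() and sd() use the sample formulas, which divide by n - 1. The population versions, which divide by n, are written out by hand here for comparison.

```r
# Made-up values for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Range: largest value minus smallest value
value_range <- max(x) - min(x)

# var() and sd() use the sample formulas (denominator n - 1)
sample_variance <- var(x)
sample_sd <- sd(x)

# Population variance and standard deviation (denominator n), by hand
population_variance <- sum((x - mean(x))^2) / length(x)
population_sd <- sqrt(population_variance)

c(value_range, sample_variance, population_variance, population_sd)
```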
group_by and summarize
This is what this looks like in code for life_expectancy_df2:
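A minimal sketch on toy data, assuming dplyr is installed. The column name life_expectancy is hypothetical (the real CSV may name it differently); life_exp_mean matches the name used in the plots of this lecture.

```r
library(dplyr)

# Toy stand-in for life_expectancy_df; `life_expectancy` is a hypothetical column name
toy_life_df <- data.frame(
  Code = c("ITA", "ITA", "USA", "USA"),
  Year = c(2000, 2001, 2000, 2001),
  life_expectancy = c(79, 81, 76, 78)
)

# One row per country code, holding the mean across years
toy_life_df2 <- toy_life_df %>%
  dplyr::group_by(Code) %>%
  dplyr::summarize(life_exp_mean = mean(life_expectancy))

toy_life_df2
```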
This is how we inspect the first entries for life_expectancy_df2:
This corresponds to the first three entries of the dataframe:
This is how we inspect the first entries for urbanization_df2:
This corresponds to the first three entries of the dataframe:
We notice the country codes.
  [1] ""         "ABW"      "AFG"      "AGO"      "AIA"      "ALB"     
  [7] "AND"      "ARE"      "ARG"      "ARM"      "ASM"      "ATG"     
 [13] "AUS"      "AUT"      "AZE"      "BDI"      "BEL"      "BEN"     
 [19] "BES"      "BFA"      "BGD"      "BGR"      "BHR"      "BHS"     
 [25] "BIH"      "BLR"      "BLZ"      "BMU"      "BOL"      "BRA"     
 [31] "BRB"      "BRN"      "BTN"      "BWA"      "CAF"      "CAN"     
 [37] "CHE"      "CHL"      "CHN"      "CIV"      "CMR"      "COD"     
 [43] "COG"      "COK"      "COL"      "COM"      "CPV"      "CRI"     
 [49] "CUB"      "CUW"      "CYM"      "CYP"      "CZE"      "DEU"     
 [55] "DJI"      "DMA"      "DNK"      "DOM"      "DZA"      "ECU"     
 [61] "EGY"      "ERI"      "ESH"      "ESP"      "EST"      "ETH"     
 [67] "FIN"      "FJI"      "FLK"      "FRA"      "FRO"      "FSM"     
 [73] "GAB"      "GBR"      "GEO"      "GGY"      "GHA"      "GIB"     
 [79] "GIN"      "GLP"      "GMB"      "GNB"      "GNQ"      "GRC"     
 [85] "GRD"      "GRL"      "GTM"      "GUF"      "GUM"      "GUY"     
 [91] "HKG"      "HND"      "HRV"      "HTI"      "HUN"      "IDN"     
 [97] "IMN"      "IND"      "IRL"      "IRN"      "IRQ"      "ISL"     
[103] "ISR"      "ITA"      "JAM"      "JEY"      "JOR"      "JPN"     
[109] "KAZ"      "KEN"      "KGZ"      "KHM"      "KIR"      "KNA"     
[115] "KOR"      "KWT"      "LAO"      "LBN"      "LBR"      "LBY"     
[121] "LCA"      "LIE"      "LKA"      "LSO"      "LTU"      "LUX"     
[127] "LVA"      "MAC"      "MAF"      "MAR"      "MCO"      "MDA"     
[133] "MDG"      "MDV"      "MEX"      "MHL"      "MKD"      "MLI"     
[139] "MLT"      "MMR"      "MNE"      "MNG"      "MNP"      "MOZ"     
[145] "MRT"      "MSR"      "MTQ"      "MUS"      "MWI"      "MYS"     
[151] "MYT"      "NAM"      "NCL"      "NER"      "NGA"      "NIC"     
[157] "NIU"      "NLD"      "NOR"      "NPL"      "NRU"      "NZL"     
[163] "OMN"      "OWID_KOS" "OWID_WRL" "PAK"      "PAN"      "PER"     
[169] "PHL"      "PLW"      "PNG"      "POL"      "PRI"      "PRK"     
[175] "PRT"      "PRY"      "PSE"      "PYF"      "QAT"      "REU"     
[181] "ROU"      "RUS"      "RWA"      "SAU"      "SDN"      "SEN"     
[187] "SGP"      "SHN"      "SLB"      "SLE"      "SLV"      "SMR"     
[193] "SOM"      "SPM"      "SRB"      "SSD"      "STP"      "SUR"     
[199] "SVK"      "SVN"      "SWE"      "SWZ"      "SXM"      "SYC"     
[205] "SYR"      "TCA"      "TCD"      "TGO"      "THA"      "TJK"     
[211] "TKL"      "TKM"      "TLS"      "TON"      "TTO"      "TUN"     
[217] "TUR"      "TUV"      "TWN"      "TZA"      "UGA"      "UKR"     
[223] "URY"      "USA"      "UZB"      "VAT"      "VCT"      "VEN"     
[229] "VGB"      "VIR"      "VNM"      "VUT"      "WLF"      "WSM"     
[235] "YEM"      "ZAF"      "ZMB"      "ZWE"     
So we have some unusual entries such as:
Which countries do these codes refer to? We can subset the data to find out.
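A sketch of this check on toy data. The unusual codes are the ones visible in the output above; the entity names and values in the toy data frame are made up.

```r
# Toy stand-in for the life expectancy data (made-up names and values)
toy_life_df <- data.frame(
  Entity = c("Italy", "World", "Kosovo", "Asia"),
  Code = c("ITA", "OWID_WRL", "OWID_KOS", ""),
  life_expectancy = c(80, 70, 75, 72)
)

# All distinct codes, sorted alphabetically
sort(unique(toy_life_df$Code))

# Keep only the rows whose code looks unusual, then list their names
unusual_codes <- c("", "OWID_KOS", "OWID_WRL")
unusual_rows <- subset(toy_life_df, Code %in% unusual_codes)
unique(unusual_rows$Entity)
```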
We can now clean the dataset
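A sketch of one possible cleaning step on toy data. Which codes to drop is an assumption here: we remove rows with an empty code and the world aggregate OWID_WRL, and keep everything else.

```r
library(dplyr)

# Toy stand-in for the life expectancy data (made-up names and values)
toy_life_df <- data.frame(
  Entity = c("Italy", "World", "Kosovo", "Asia"),
  Code = c("ITA", "OWID_WRL", "OWID_KOS", ""),
  life_expectancy = c(80, 70, 75, 72)
)

# Assumption: aggregates (empty code, OWID_WRL) are dropped;
# the right list of codes depends on the analysis
toy_life_clean <- toy_life_df %>%
  dplyr::filter(Code != "", Code != "OWID_WRL")

toy_life_clean
```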
Let’s repeat the procedure for the other dataset
We will now perform a left join to combine urbanization data with life expectancy data based on Code.
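A minimal sketch of a left join on toy data, assuming dplyr is installed. Which table goes on the left is an assumption (life expectancy here); all values are made up.

```r
library(dplyr)

# Toy country-level summaries (made-up values)
toy_life_df2 <- data.frame(Code = c("ITA", "USA", "TWN"),
                           life_exp_mean = c(80, 77, 79))
toy_urb_df2 <- data.frame(Code = c("ITA", "USA"),
                          urb_mean = c(65, 78))

# Left join: keep every row of the left table and attach the columns of the
# right table where Code matches; rows without a match get NA
toy_merged <- dplyr::left_join(toy_life_df2, toy_urb_df2, by = "Code")

toy_merged
```

The row for TWN is kept, but its urb_mean is NA because the right table has no matching code.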
This is how we remove NA values
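A sketch of two common ways to do this, on toy data with made-up values. The object names are placeholders, not the ones from the lecture data.

```r
library(dplyr)

# Toy merged data with one missing urbanization value
toy_merged <- data.frame(Code = c("ITA", "USA", "TWN"),
                         life_exp_mean = c(80, 77, 79),
                         urb_mean = c(65, 78, NA))

# dplyr: keep only rows where both variables are observed
toy_merged2 <- toy_merged %>%
  dplyr::filter(!is.na(urb_mean), !is.na(life_exp_mean))

# Base R alternative: drop rows with an NA in any column
toy_merged2_base <- na.omit(toy_merged)

toy_merged2
```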
How many countries did we drop?
We dropped 21.
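A sketch of how such a count can be obtained, using base R on toy data with made-up values (so the result here is 1, not 21).

```r
# Toy merged data before and after removing NA values
toy_merged <- data.frame(Code = c("ITA", "USA", "TWN"),
                         life_exp_mean = c(80, 77, 79),
                         urb_mean = c(65, 78, NA))
toy_merged2 <- na.omit(toy_merged)

# Number of countries lost when removing NA values
n_dropped <- nrow(toy_merged) - nrow(toy_merged2)

# Codes present before the cleaning step but not after it
dropped_codes <- setdiff(toy_merged$Code, toy_merged2$Code)

n_dropped
dropped_codes
```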
What are those countries?
Let us have a quick look at what these codes mean.
 [1] "Anguilla"                        "Bonaire Sint Eustatius and Saba"
 [3] "Cook Islands"                    "Falkland Islands"               
 [5] "French Guiana"                   "Guadeloupe"                     
 [7] "Guernsey"                        "Jersey"                         
 [9] "Martinique"                      "Mayotte"                        
[11] "Montserrat"                      "Niue"                           
[13] "Reunion"                         "Saint Helena"                   
[15] "Saint Martin (French part)"      "Saint Pierre and Miquelon"      
[17] "Taiwan"                          "Tokelau"                        
[19] "Vatican"                         "Wallis and Futuna"              
[21] "Western Sahara"                 
ggplot2 is one of the most elegant and most versatile graphing libraries in R
Before we use ggplot we need to install and load the library:
Now, you have to load the package
Note: this has to be ggplot2, not ggplot.
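A short sketch of both steps. The install line is commented out because it only needs to run once per machine.

```r
# Install once per machine (uncomment the next line the first time)
# install.packages("ggplot2")

# Load in every new R session; the package is called ggplot2,
# while the main plotting function is ggplot()
library(ggplot2)
```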
Within your working folder, create a sub-folder called “graphs”
In this case, the working folder is the folder “lab” which is inside the “week3” folder.
I then created another folder, called “graphs” to keep things organized.
Next copy and paste the life-expectancy.csv from “data” to “graphs”
Figure out the path of the “graphs” folder by using file.choose()
Setting the working directory
Relative Path Option
Once you have figured out the path, delete file.choose() and the life-expectancy.csv
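A sketch of the two options, assuming a project folder that contains a "graphs" sub-folder. The temporary folder and the file name fig3.png are placeholders, not part of the lecture materials.

```r
# Placeholder project folder; in practice use the folder found with file.choose()
project_dir <- file.path(tempdir(), "lab")
graphs_dir <- file.path(project_dir, "graphs")
dir.create(graphs_dir, recursive = TRUE, showWarnings = FALSE)

# Option 1: set the working directory, then refer to files relative to it
old_wd <- setwd(project_dir)
relative_path <- file.path("graphs", "fig3.png")

# Option 2: build the full path without changing the working directory
absolute_path <- file.path(graphs_dir, "fig3.png")

# Restore the previous working directory
setwd(old_wd)

relative_path
```

The relative path is shorter and keeps working if the whole project folder is moved, as long as the working directory points to the project folder.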
fig3 <- ggplot(merged_data2, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)+
  scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))
fig3
There are many ways of labeling the data in ggplot. But let us try a simple way.
First let us do a quick left join so that we keep the original country names
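A sketch of this join on toy data, assuming dplyr is installed. The toy object names and values are made up; the idea is to build a one-row-per-country lookup from Code to Entity and attach it to the merged data.

```r
library(dplyr)

# Toy stand-ins (made-up values)
toy_life_df <- data.frame(Entity = c("Italy", "Italy", "United States"),
                          Code = c("ITA", "ITA", "USA"),
                          Year = c(2000, 2001, 2000))
toy_merged2 <- data.frame(Code = c("ITA", "USA"),
                          urb_mean = c(65, 78),
                          life_exp_mean = c(80, 77))

# One row per country: a lookup table from Code to Entity
country_names <- toy_life_df %>%
  dplyr::distinct(Code, Entity)

# Attach the country names to the merged data
toy_merged3 <- dplyr::left_join(toy_merged2, country_names, by = "Code")

toy_merged3
```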
fig3 <- ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)+
  scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  geom_text(aes(label = Entity),
            size = 4, 
            check_overlap = TRUE, 
            position = position_nudge(y = 1))
fig3
What if we wanted to do selective labeling? Let’s say we just want to label Italy and the US. How could we do that?
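The plotting code for selective labeling uses a subset called italy_us_df. A sketch of one way to build it, shown on toy data with made-up values and assuming the codes ITA and USA identify the two countries:

```r
library(dplyr)

# Toy stand-in for merged_data3 (made-up values)
toy_merged3 <- data.frame(Entity = c("Italy", "United States", "France"),
                          Code = c("ITA", "USA", "FRA"),
                          urb_mean = c(65, 78, 75),
                          life_exp_mean = c(80, 77, 81))

# Keep only the rows we want to label;
# with the real data, replace toy_merged3 with merged_data3
italy_us_df <- toy_merged3 %>%
  dplyr::filter(Code %in% c("ITA", "USA"))

italy_us_df
```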
fig4 <- ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)+
  scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  geom_text(aes(label = Entity),
            size = 4, 
            check_overlap = TRUE, 
            position = position_nudge(y = 1),
            data = italy_us_df)
fig4
This does not work so well.
A better option is geom_label.
fig4 <- ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)+
  scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  geom_label(aes(label = Entity),
            size = 4, 
            position = position_nudge(y = 1),
            data = italy_us_df)
fig4
But which one is Italy and which one is the US?
library(ggrepel)
fig4 <- ggplot(merged_data3, mapping = aes(x=urb_mean, y=life_exp_mean)) +
  geom_point()+
  xlab("Urbanization") + ylab("Life Expectancy")+
  ggtitle("Example Title")+
  theme_bw()+
  geom_smooth(method = "lm", se=FALSE)+
  scale_x_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  scale_y_continuous(breaks=seq(0, 100, by = 20), limits = c(0,100))+
  geom_label_repel(aes(label = Entity),
                   box.padding = 0.5,
                   size = 4,
                   position = position_nudge(y = 1),
                   data = italy_us_df)
fig4
dplyr to group, summarize, and clean data
left_join()
ggplot2
Popescu (JCU): Merging Data, Scatterplots