Lab 8: Statistics

Regressions, Tables, Coefficient plots, Residuals

Author

Bogdan G. Popescu

#Removing previous datasets in memory
rm(list = ls())
#Loading the relevant libraries
library(ggplot2)
library(gridExtra)
library(dplyr)
library(purrr)
library(sf)
library(rnaturalearth)
library(rnaturalearthdata)
library(ggrepel)
library(modelsummary)
library(broom)
library(gganimate)

1 Intro

In this lab, we learn about running regressions, interpreting regression coeffcients, making nice tables, and making coefficient plots.

2 Loading the Data

Let us go back to the old Life Expectancy and urbanization datasets. If you don’t have them anymore, you can download them from the following links:

#Setting path
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week8/lab/")
#Step1: Loading the data
life_expectancy_df <- read.csv(file = './data/life-expectancy.csv')
urbanization_df <- read.csv(file = './data/share-of-population-urban.csv')

3 Cleaning the Data

In the next few lines we average life expectancy over country (the original dataset is a panel - with countries and years). This means that we are getting rid of the time component.

#Step1: Selecting after 1900
life_expectancy_df<-subset(life_expectancy_df, Year>1900)
#Step2: Calculating the mean
life_expectancy_df2<-life_expectancy_df%>%
  dplyr::group_by(Entity, Code)%>%
  dplyr::summarize(life_exp_mean=mean(Life.expectancy.at.birth..historical.))
#Step3: Cleaning the Data
weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_life_expectancy_df<-subset(life_expectancy_df2, !(Code %in% weird_labels))
head(clean_life_expectancy_df, n=5)
#Step1: Calculating the mean
urbanization_df2<-urbanization_df%>%
  dplyr::group_by(Code)%>%
  dplyr::summarize(urb_mean=mean(Urban.population....of.total.population.))
#Step2: Cleaning the Data
weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_urbanization_df<-subset(urbanization_df2, !(Code %in% weird_labels))
head(clean_urbanization_df, n=5)

We are now left merge urbanization on life expectancy

#Step1: Performing the left_merge
merged_data<-left_join(clean_life_expectancy_df, clean_urbanization_df, by = c("Code"="Code"))
#Step2: Getting rid of the NAs
merged_data2<-subset(na.omit(merged_data))
#Step3: Sorting by Entity name
merged_data2 <- merged_data2[order(merged_data2$Entity),] 
glimpse(merged_data2)
Rows: 214
Columns: 4
Groups: Entity [214]
$ Entity        <chr> "Afghanistan", "Albania", "Algeria", "American Samoa", "…
$ Code          <chr> "AFG", "ALB", "DZA", "ASM", "AND", "AGO", "ATG", "ARG", …
$ life_exp_mean <dbl> 45.38333, 68.28611, 57.53013, 68.63750, 77.04861, 45.084…
$ urb_mean      <dbl> 18.61175, 40.44416, 52.60921, 79.76249, 87.04302, 37.539…

Let us now calculate the correlation between life expectancy and urbanization using the two ways we learned.

4 Correlation between urbanization and life expectancy

As we discussed in the previous lab, we can calculate correlations using the cor command. This is given by the following formula:

\[\rho = r = \frac{\frac{\Sigma(X_{i} - \bar{X}) (Y_i - \bar{Y})}{n}}{s_X * s_Y}\]

cor_coeff<-cor(merged_data2$urb_mean, merged_data2$life_exp_mean, method = c("pearson"))
cor_coeff
[1] 0.6906849

We can use the correlation coefficient to calculate the b, following the formula:

\[b = r \frac{s_Y}{s_X}\]

b_res<-cor_coeff*(sd(merged_data2$life_exp_mean)/sd(merged_data2$urb_mean))
b_res
[1] 0.2515915

We can also obtain the same result by running a regression

model<-lm(life_exp_mean~urb_mean, data=merged_data2)
summary(model)

Call:
lm(formula = life_exp_mean ~ urb_mean, data = merged_data2)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.371  -3.929  -0.529   4.562  18.623 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 48.96112    1.02706   47.67   <2e-16 ***
urb_mean     0.25159    0.01809   13.91   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.398 on 212 degrees of freedom
Multiple R-squared:  0.477, Adjusted R-squared:  0.4746 
F-statistic: 193.4 on 1 and 212 DF,  p-value: < 2.2e-16

The way we interpret b is the following: every unit increase in urbanization is associated with an increase in life expectancy of 0.25. This effect is highly statistically significant.

5 Creating a regression table

This is how we create a professionally-looking table.

model<-lm(life_exp_mean~urb_mean, data=merged_data2)

models<-list("Life Expectancy" = model)

cm <- c('urb_mean'='Urbanization',
        "(Intercept)"="Intercept")

modelsummary(models, stars = TRUE, coef_map = cm, gof_map = c("nobs", "r.squared"))
Life Expectancy
Urbanization 0.252***
(0.018)
Intercept 48.961***
(1.027)
Num.Obs. 214
R2 0.477
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

6 Creating a coefficient plot

We will now use a new function - tidy, which will allow us to save the regression coefficients in a dataframe.

results <- tidy(model)
results

We can plot these coefficients in the following way:

#To calculate the confidence interval, we use 1.96, 
#meaning that 95% of the area under a normal curve lies 
#within approximately 1.96 standard deviations from the mean
ggplot(results, 
      aes(x = estimate, y = term)) +
      geom_point()+
      geom_errorbar(aes(xmin = estimate-1.96*std.error, 
                        xmax = estimate+1.96*std.error), 
                linewidth = 1, width=0)