Modeling Relationships Between Variables

From Correlation Coefficients to Regression Lines and Interpretation

Bogdan G. Popescu

John Cabot University

Table of Contents

  1. Correlation
  2. Simple Linear Regression
  3. Regression Output & Interpretation

Intro

A correlation coefficient is a statistic that measures the strength and direction of the linear relationship between two variables.

For example, urbanization could be positively correlated with life expectancy.

Show the code
library(dplyr)
library(ggplot2)
setwd("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/lecture9/")

life_expectancy_df <- read.csv(file = './data/life-expectancy.csv')
urbanization_df <- read.csv(file = './data/share-of-population-urban.csv')

urbanization_df2<-urbanization_df%>%
  dplyr::group_by(Entity, Code)%>%
  dplyr::summarize(urb_mean=mean(Urban.population....of.total.population.))

life_expectancy_df2<-life_expectancy_df%>%
  dplyr::group_by(Entity, Code)%>%
  dplyr::summarize(life_expectancy=mean(Life.expectancy.at.birth..historical.))

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_life_expectancy_df<-subset(life_expectancy_df2, !(Code %in% weird_labels))

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_urbanization_df<-subset(urbanization_df2, !(Code %in% weird_labels))

clean_urbanization_df<-subset(clean_urbanization_df, select = -c(Entity))
merged_data<-left_join(clean_life_expectancy_df, clean_urbanization_df, by = c("Code"="Code"))
merged_data2<-na.omit(merged_data)

figure1<-ggplot(merged_data, aes(x=urb_mean, y=life_expectancy)) +
  geom_point() + 
#  geom_smooth(method = "lm", se=F)+
  theme_bw()+
  theme(axis.text.x = element_text(size=14),
        axis.text.y = element_text(size=14),
        axis.title=element_text(size=14),
        plot.title = element_text(hjust = 0.5))+
  xlim(0, 100) + 
  ylim(0, 100)
figure1

Intro

It turns out that the correlation between urbanization and life expectancy is about 0.63:

Show the code
# Data preparation identical to the previous code block (omitted here).
cor(merged_data$life_expectancy, merged_data$urb_mean, use = "complete.obs")
[1] 0.630194

Today, we will calculate correlations using Pearson's formula:

\[ r = \frac{\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}}{s_x \cdot s_y} \]

  • Pearson’s correlation coefficient \(r\) describes the strength and direction of the linear relationship between two variables.

The Correlation Coefficient

\[ r = \frac{\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}}{s_x \cdot s_y} \]

  • \(r\) is the sample Pearson correlation coefficient
  • \(x_i\) and \(y_i\) are the values of variables X and Y for the same observation
  • \(\bar{x}\) and \(\bar{y}\) are the sample means of variables X and Y
  • \(n\) is the sample size
  • \(s_x\) and \(s_y\) are the sample standard deviations of X and Y
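Below is a minimal sketch of computing \(r\) directly from this formula, assuming the merged_data2 data frame built in the earlier code block (xv, yv, and cov_xy are helper names introduced here):

xv <- merged_data2$urb_mean         # X: mean urbanization
yv <- merged_data2$life_expectancy  # Y: mean life expectancy
n  <- length(xv)

# Sample covariance: sum of cross-products divided by n - 1
cov_xy <- sum((xv - mean(xv)) * (yv - mean(yv))) / (n - 1)

# Divide by the product of the sample standard deviations
r <- cov_xy / (sd(xv) * sd(yv))
r
cor(xv, yv)  # built-in check: should match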

Correlation vs. Causation

A correlation signifies the relationship between two variables.

It is tempting to argue that changes in one variable cause changes in the other.

This is problematic for three reasons:

  • We do not know the direction of causation: does x produce changes in y, or does y produce changes in x?
  • There might be a third variable z, which causes changes in both x and y. 
  • Both x and y could be measured with error.

Correlation vs. Causation

All of these problems suggest that there is endogeneity in the relationship between X and Y.

The lesson: Correlation does not imply causation.

Regression Line

  • We will make predictions with the help of regressions.
  • These can be written as:

\[ \hat{y}_i = b x_i + a \]

Where:

  • \(\hat{y}_i\) – individual predicted value on the dependent variable
  • \(x_i\) – individual value of the independent variable
  • \(b\) – slope of the regression line
  • \(a\) – intercept, the predicted value when \(x_i = 0\)

Regression Line

  • We will make predictions with the help of regressions.
  • These can be written as:

\[ \hat{y}_i = b x_i + a \]

This is very similar to the formula \(y = mx + b\), which represents a linear equation in mathematics:

  • \(m\) – slope
  • \(b\) – y-intercept

Regression Line

Regression Example

In the urbanization and life expectancy example, we know:

  • \(\bar{x} = \overline{\text{urbanization}} = 51.33\)
  • \(\bar{y} = \overline{\text{life\_expectancy}} = 61.259\)
  • \(s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} = 24.18021\)
  • \(s_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n-1}} = 8.643744\)
  • \(r = \frac{\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}}{s_x \cdot s_y} = 0.6303347\)
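These quantities can be reproduced in R (a sketch, assuming merged_data2 from the earlier code blocks; sd() and cor() use the n - 1 denominator, so small differences from the rounded values above are rounding):

mean(merged_data2$urb_mean)         # x-bar
mean(merged_data2$life_expectancy)  # y-bar
sd(merged_data2$urb_mean)           # s_x
sd(merged_data2$life_expectancy)    # s_y
cor(merged_data2$life_expectancy, merged_data2$urb_mean)  # r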

Regression Line

Calculating the Slope

We can plug these numbers into the formula and obtain the following:

\[ b = r \cdot \frac{s_y}{s_x} \]

\[ b = 0.6303347 \cdot \frac{8.643744}{24.18021} \]

\[ b = 0.6303347 \cdot 0.3574718 \]

\[ b = 0.2253269 \]
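The same arithmetic in R, as a sketch (assumes merged_data2 from the earlier code blocks):

r <- cor(merged_data2$life_expectancy, merged_data2$urb_mean)
b <- r * sd(merged_data2$life_expectancy) / sd(merged_data2$urb_mean)
b  # approximately 0.225, matching the lm() slope on the next slide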

Regression Line

Calculating the Slope

We can obtain the same slope, 0.2252918 (the small difference from the hand calculation above is due to rounding), in the following way:

x<-lm(life_expectancy~urb_mean, data=merged_data2)
summary(x)

Call:
lm(formula = life_expectancy ~ urb_mean, data = merged_data2)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.6442  -4.2065  -0.5017   5.0702  18.3283 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 49.69757    1.08225   45.92   <2e-16 ***
urb_mean     0.22529    0.01906   11.82   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.742 on 212 degrees of freedom
Multiple R-squared:  0.3971,    Adjusted R-squared:  0.3943 
F-statistic: 139.7 on 1 and 212 DF,  p-value: < 2.2e-16

Regression Line

Calculating the Slope

This is what the slope looks like.

Show the code
# Fit model
x <- lm(life_expectancy ~ urb_mean, data = merged_data2)
coef_val <- round(coef(x)["urb_mean"], 3)

# Create label
coef_label <- paste("Slope (b):", coef_val)

# Create plot
figure1 <- ggplot(merged_data, aes(x = urb_mean, y = life_expectancy)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  theme(
    axis.text.x = element_text(size = 14),
    axis.text.y = element_text(size = 14),
    axis.title = element_text(size = 14),
    plot.title = element_text(hjust = 0.5)
  ) +
  xlim(0, 100) + 
  ylim(0, 100) +
  annotate("text", x = 10, y = 95, label = coef_label, size = 5, color = "blue")

figure1

Regression Line

Calculating the Slope

What if the slope were negative?

Show the code
# Simulate a negative relationship by flipping the sign of the fitted slope
coef_val <- -round(coef(x)["urb_mean"], 3)

# Create label
coef_label <- paste("Slope (b):", coef_val)
a <- round(coef(x)["(Intercept)"], 3)

# Plot: draw the line y = a + coef_val * x from x = 0 to x = 100
figure1x <- ggplot(merged_data2, aes(x = urb_mean, y = life_expectancy)) +
#  geom_point() +
  geom_segment(aes(x = 0, y = a, xend = 100, yend = coef_val * 100 + a), color = "red") +

  #geom_smooth(method = "lm", se = FALSE, color = "red") +
  theme_bw() +
  theme(
    axis.text.x = element_text(size = 14),
    axis.text.y = element_text(size = 14),
    axis.title = element_text(size = 14),
    plot.title = element_text(hjust = 0.5)
  ) +
  xlim(0, 100) +
  ylim(0, 100) +
  annotate("text", x = 10, y = 95, label = coef_label, size = 5, color = "red")

figure1x

Regression Line

Regression Example

The next step is to calculate \(a\) using the formula:

\[ \begin{align*} a = \bar{y} - b \bar{x} \\ a = 61.259 - 0.2253269 \cdot 51.33 \\ a = 61.259 - 11.56603 \\ a = 49.69297 \end{align*} \]
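A quick sketch of the same computation in R, reusing the slope \(b\) from the previous step (assumes merged_data2):

b <- cor(merged_data2$life_expectancy, merged_data2$urb_mean) *
  sd(merged_data2$life_expectancy) / sd(merged_data2$urb_mean)
a <- mean(merged_data2$life_expectancy) - b * mean(merged_data2$urb_mean)
a  # approximately 49.69, matching the lm() intercept
coef(lm(life_expectancy ~ urb_mean, data = merged_data2))  # check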

Regression Line

Regression Example

This is the intercept \(a\), marked on the y-axis:

Show the code
# Fit model
x <- lm(life_expectancy ~ urb_mean, data = merged_data2)

# Extract slope (b) and intercept (a)
slope_val      <- round(coef(x)["urb_mean"], 3)
intercept_val  <- round(coef(x)["(Intercept)"], 3)

# Labels
slope_label     <- paste("Slope (b):", slope_val)
intercept_label <- paste("Intercept (a):\n", intercept_val)

# Build plot
figure1 <- ggplot(merged_data2, aes(x = urb_mean, y = life_expectancy)) +
  geom_point(color = "black", alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE, alpha = 0.2) +  # color via aes

  # Mark the intercept on the y-axis
  geom_point(aes(x = 0, y = intercept_val), size = 3, color = "blue") +
  annotate("text",
           x = 5, 
           y = intercept_val + 2,
           label = intercept_label,
           hjust = 0, color = "blue", size = 5) +

  # Show the slope label in the upper-left
  annotate("text",
           x = 10, y = 95,
           label = slope_label,
           hjust = 0, color = "black", size = 5) +

  # Axis & theme tweaks
  xlim(0, 100) +
  ylim(0, 100) +
  theme_bw() +
  theme(
    axis.text  = element_text(size = 14),
    axis.title = element_text(size = 14),
    plot.title = element_text(hjust = 0.5)
  )
figure1

Regression Line

Regression Example

Knowing \(a\) and \(b\), we can use them to calculate the predicted value for \(y\):

\[ \begin{align*} \hat{y}_i = b x_i + a \\ \hat{y}_i = 0.2253269 \cdot x_i + 49.69297 \end{align*} \]

We are missing \(x_i\), the individual’s independent variable value.

Regression Line

Regression Example

These are the predicted values \(\hat{y}_i\) for our \(y_i\).
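One way to obtain them in R (a sketch, assuming the fitted model x from the earlier slide; y_hat is a helper name introduced here):

y_hat <- predict(x)  # one predicted value per country; fitted(x) is equivalent
head(y_hat)
# by hand: coef(x)["(Intercept)"] + coef(x)["urb_mean"] * merged_data2$urb_mean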

Regression Line

Regression Example

Actual value (circles) minus predicted value (triangles) = residual.

\[\text{residual}_i = y_i - \hat{y}_i\]
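In R, the residuals can be computed by hand or taken from the model object (a sketch, assuming model x from the earlier slide):

res <- merged_data2$life_expectancy - predict(x)  # y_i minus y-hat_i
head(res)
head(resid(x))  # R's built-in equivalent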

Regression Example

We note that there is a difference between \(y_i\) and \(\hat{y_i}\), where:

\(y_i\) is the country’s observed value on life expectancy

\(\hat{y_i}\) is the country’s predicted value on life expectancy

We can obtain each country’s \(\hat{y_i}\) by plugging their x-value into the regression equation.

The regression line is the straight line that minimizes the sum of the squared residuals, i.e., the line that makes the residuals, taken together, as small as possible.

This is why the regression line is sometimes called the line of best fit.
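A small sketch of what "best fit" means, assuming model x from above: any other slope produces a larger sum of squared residuals (ssr is a helper function introduced here).

# Sum of squared residuals for a candidate line y = a + b * x
ssr <- function(a, b) {
  sum((merged_data2$life_expectancy - (a + b * merged_data2$urb_mean))^2)
}
ssr(coef(x)[1], coef(x)[2])         # the OLS fit
ssr(coef(x)[1], coef(x)[2] + 0.05)  # a slightly steeper line does worse
ssr(coef(x)[1], coef(x)[2] - 0.05)  # so does a flatter one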

Regression Example

Interpretation

x<-lm(life_expectancy~urb_mean, data=merged_data2)
summary(x)

Call:
lm(formula = life_expectancy ~ urb_mean, data = merged_data2)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.6442  -4.2065  -0.5017   5.0702  18.3283 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 49.69757    1.08225   45.92   <2e-16 ***
urb_mean     0.22529    0.01906   11.82   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.742 on 212 degrees of freedom
Multiple R-squared:  0.3971,    Adjusted R-squared:  0.3943 
F-statistic: 139.7 on 1 and 212 DF,  p-value: < 2.2e-16

Regression Example

Interpretation

This is how we create a professional-looking table.

Show the code
library("modelsummary")
model<-lm(life_expectancy~urb_mean, data=merged_data2)

models<-list("DV: Life Expectancy" = model)

cm <- c('urb_mean'='Urbanization',
        "(Intercept)"="Intercept")

modelsummary(models, stars = TRUE, coef_map = cm, gof_map = c("nobs", "r.squared"))
                 DV: Life Expectancy
Urbanization     0.225***
                 (0.019)
Intercept        49.698***
                 (1.082)
Num.Obs.         214
R2               0.397
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Regression Example

Interpretation

This is how we create a professional-looking plot.

Show the code
library(broom)
model_scaled<-lm(scale(life_expectancy)~scale(urb_mean), data=merged_data2)
results<-tidy(model_scaled)

cm <- c('scale(urb_mean)'='Urbanization',
        "(Intercept)"="Intercept")

ggplot(results, aes(x = estimate, y = term)) +
      geom_point()+
      geom_errorbar(aes(xmin = estimate-1.96*std.error, 
                        xmax = estimate+1.96*std.error), 
                linewidth = 1, width=0)+
  scale_y_discrete(labels = cm) +  # this maps term names to nicer labels
  theme_bw()+
   theme(
    axis.text  = element_text(size = 14),
    axis.title = element_text(size = 14),
    plot.title = element_text(hjust = 0.5))

Regression Example

Interpretation: The Slope (b)

Show the code
library("modelsummary")
model<-lm(life_expectancy~urb_mean, data=merged_data2)

models<-list("DV: Life Expectancy" = model)

cm <- c('urb_mean'='Urbanization',
        "(Intercept)"="Intercept")

modelsummary(models, stars = TRUE, coef_map = cm, gof_map = c("nobs", "r.squared"))
                 DV: Life Expectancy
Urbanization     0.225***
                 (0.019)
Intercept        49.698***
                 (1.082)
Num.Obs.         214
R2               0.397
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Interpretation: For every 1 percentage point increase in urbanization (urb_mean), the model predicts an average increase of approximately 0.225 years in life expectancy. This relationship is statistically significant at the 0.1% level (p < 0.001), as indicated by the *** in the regression output.
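A sketch of the same interpretation via predictions, using hypothetical urbanization values of 50% and 51%:

model <- lm(life_expectancy ~ urb_mean, data = merged_data2)
preds <- predict(model, newdata = data.frame(urb_mean = c(50, 51)))
diff(preds)  # approximately 0.225 years: the slope coefficient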

Regression Example

Interpretation: The Standard Error (SE)

Show the code
library("modelsummary")
model<-lm(life_expectancy~urb_mean, data=merged_data2)

models<-list("DV: Life Expectancy" = model)

cm <- c('urb_mean'='Urbanization',
        "(Intercept)"="Intercept")

modelsummary(models, stars = TRUE, coef_map = cm, gof_map = c("nobs", "r.squared"))
                 DV: Life Expectancy
Urbanization     0.225***
                 (0.019)
Intercept        49.698***
                 (1.082)
Num.Obs.         214
R2               0.397
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

SE for Urbanization = 0.019. This tells you how much the estimate would vary across repeated samples: the smaller the SE, the more precise the estimate.
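A sketch of pulling the SEs and the implied 95% confidence intervals directly from the model object:

model <- lm(life_expectancy ~ urb_mean, data = merged_data2)
summary(model)$coefficients  # estimates, SEs, t values, p values
confint(model)               # roughly estimate +/- 1.96 * SE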

Regression Example

Interpretation: The Intercept (a)

Show the code
library("modelsummary")
model<-lm(life_expectancy~urb_mean, data=merged_data2)

models<-list("DV: Life Expectancy" = model)

cm <- c('urb_mean'='Urbanization',
        "(Intercept)"="Intercept")

modelsummary(models, stars = TRUE, coef_map = cm, gof_map = c("nobs", "r.squared"))
                 DV: Life Expectancy
Urbanization     0.225***
                 (0.019)
Intercept        49.698***
                 (1.082)
Num.Obs.         214
R2               0.397
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Interpretation: Intercept = 49.698. When urbanization is 0%, the predicted life expectancy is 49.698 years. This is not always meaningful in practice (e.g., no country has exactly 0% urbanization), but it is mathematically necessary.

Regression Example

Interpretation: R Squared

The \(R^2\) is 0.397

It tells us the share of the variance in our outcome variable that is explained by our predictor variable(s).

The interpretation is: “Our model explains 39.7% of the variance in our outcome variable.”

R squared ranges from 0 to 1: \(R^2 \in [0,1]\)

  • \(R^2 = 0\) means no variance is explained
  • \(R^2 = 1\) means that all variance is explained
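A sketch of computing \(R^2\) by hand as 1 minus the ratio of residual to total variation, assuming model x from earlier:

y   <- merged_data2$life_expectancy
ssr <- sum(resid(x)^2)        # unexplained (residual) variation
sst <- sum((y - mean(y))^2)   # total variation in the outcome
1 - ssr / sst                 # approximately 0.397
summary(x)$r.squared          # built-in check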

Notation

We now have the following equation:

\[ \hat{Y} = bX + a \]

This can also be written in Greek-letter notation. The population model includes an error term:

\[ y = \alpha + \beta_1 x + \epsilon \]

\[ y = \beta_0 + \beta_1 x + \epsilon \]

The fitted regression line uses the estimated coefficients and has no error term:

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]

Where:

\(a = \hat{\alpha} = \hat{\beta}_0\)

\(b = \hat{\beta}_1\)

\(\epsilon\) = error term (the part of \(y\) the line does not capture)

Conclusion

  • Correlation quantifies the strength and direction of a linear relationship between two variables.
  • Regression models this relationship to make predictions.
  • The slope (\(\beta_1\)) tells us the expected change in the outcome for a one-unit change in the predictor.
  • The intercept (\(\beta_0\)) gives the expected value when the predictor is zero.
  • \(R^2\) measures how well the model explains variation in the outcome.
  • Correlation and regression do not imply causation.