Statistical Analysis

Lecture 9: Bivariate Regression

Bogdan G. Popescu

John Cabot University

Introduction

From Correlation to Regression

  • Previously: Pearson correlation coefficient (\(r\))
    • Measures strength and direction of a linear relationship
  • Limitation: \(X\) and \(Y\) are interchangeable
    • \(r\) does not distinguish IV from DV
  • Today: regression lets us predict \(Y\) from \(X\)
    • Assigns roles: \(X\) is the predictor, \(Y\) is the outcome

Pearson Correlation (Recap)

\[ r = \frac{\frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{n}}{s_X \cdot s_Y} \]

  • Ranges from \(-1\) to \(+1\)
  • Does not imply causation
  • Treats \(X\) and \(Y\) symmetrically

Notation

Population vs. Sample Notation

Before building the regression equation, we distinguish population parameters from sample statistics:

Concept Population Sample
Mean \(\mu_X\), \(\mu_Y\) \(\bar{X}\), \(\bar{Y}\)
Std. Dev. \(\sigma_X\), \(\sigma_Y\) \(s_X\), \(s_Y\)
Size \(N\) \(n\)
  • We observe the sample but want to learn about the population

Sample Standard Deviation

\[s_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n}}\]

  • \(Y_i\): the \(i\)th value of variable \(Y\)
  • \(\bar{Y}\): sample mean of \(Y\)
  • \(n\): sample size

Regression Notation

The regression equation can be written two equivalent ways:

\[\hat{Y} = bX + a \quad \text{or} \quad \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \epsilon\]

Symbol Equivalent Meaning
\(\hat{\beta}_0\) \(a\) Intercept
\(\hat{\beta}_1\) \(b\) Slope
\(\epsilon\) Error term

Estimates vs. True Values

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \epsilon \quad \text{(sample estimate)}\]

\[y = \beta_0 + \beta_1 x + \epsilon \quad \text{(true population)}\]

  • The hat (\(\hat{}\)) denotes an estimate from sample data
  • \(\hat{\beta}_1\) is our best guess of the true \(\beta_1\)
  • \(b\) and \(\hat{\beta}_1\) refer to the same quantity

The Regression Line

The Regression Equation

We make predictions with regressions:

\[\hat{Y}_i = bX_i + a\]

  • \(\hat{Y}_i\): predicted value of the dependent variable
  • \(X_i\): observed value of the independent variable
  • \(b\) (\(\hat{\beta}_1\)): slope — change in \(Y\) per unit of \(X\)
  • \(a\) (\(\hat{\beta}_0\)): intercept — predicted \(Y\) when \(X = 0\)

Similar to \(y = mx + b\) from algebra

Computing the Slope (\(b\))

\[b = r \cdot \frac{s_Y}{s_X}\]

  • \(r\): correlation between \(X\) and \(Y\)
  • \(s_Y\): standard deviation of \(Y\)
  • \(s_X\): standard deviation of \(X\)

Computing the Intercept (\(a\))

Once we know \(b\), we compute \(a\) using the sample means:

\[a = \bar{Y} - b\bar{X}\]

  • The regression line passes through \((\bar{X}, \bar{Y})\)

Caution on interpreting \(a\): The intercept is the predicted \(Y\) when \(X = 0\). If \(X = 0\) does not exist in the data, \(a\) is an out-of-sample extrapolation and may not be substantively meaningful.

Worked Example

Our Running Example

Question: Does urbanization predict life expectancy?

  • \(X\) = urbanization (% urban population)
  • \(Y\) = life expectancy (years)
  • \(n\) = 214 countries (2015 data)

Step 1: Compute Means

Sample means
Statistic Formula Value
Mean of X (urbanization) \(\bar{X} = \frac{\sum X_i}{n}\) 60.132
Mean of Y (life expectancy) \(\bar{Y} = \frac{\sum Y_i}{n}\) 72.078

Step 2: Compute Standard Deviations

Sample standard deviations
Statistic Value
SD of X 23.9784
SD of Y 7.8963

\[s_X = \sqrt{\frac{\sum(X_i - \bar{X})^2}{n}} = 23.9784\]

\[s_Y = \sqrt{\frac{\sum(Y_i - \bar{Y})^2}{n}} = 7.8963\]

Step 3: Compute the Correlation

\[r = \frac{\frac{\sum(X_i - \bar{X})(Y_i - \bar{Y})}{n}}{s_X \cdot s_Y} = 0.5968\]

Step 4: Compute \(b\) and \(a\)

\[b = r \cdot \frac{s_Y}{s_X} = 0.5968 \times \frac{7.8963}{23.9784} = 0.1965\]

\[a = \bar{Y} - b\bar{X} = 72.078 - 0.1965 \times 60.132 = 60.2599\]

Our regression equation:

\[\hat{Y}_i = 0.1965 \cdot X_i + 60.2599\]

Summary of Parameters

Regression parameters
Parameter Value
X-bar (mean urbanization) 60.1322
Y-bar (mean life expectancy) 72.0776
s_X 23.9784
s_Y 7.8963
r (correlation) 0.5968
b (slope) 0.1965
a (intercept) 60.2599

Building the Table Step by Step

Table: Original Data

Entity Life Expectancy Urbanization
Afghanistan 62.7 24.803
Albania 78.6 57.434
Algeria 75.6 70.848
American Samoa 72.5 87.238
Zimbabwe 59.6 32.385
X-bar 60.132

Table: Adding \(\bar{Y}\)

Entity Life Expectancy Urbanization
Afghanistan 62.7 24.803
Albania 78.6 57.434
Algeria 75.6 70.848
American Samoa 72.5 87.238
Zimbabwe 59.6 32.385
X-bar 60.132
Y-bar 72.078

Table: Adding \(b\)

\(b = r \cdot \frac{s_Y}{s_X} = 0.1965\)

Entity Life Expectancy Urbanization b
Afghanistan 62.7 24.803 0.1965
Albania 78.6 57.434 0.1965
Algeria 75.6 70.848 0.1965
American Samoa 72.5 87.238 0.1965
Zimbabwe 59.6 32.385 0.1965
X-bar 60.132
Y-bar 72.078

Table: Adding \(a\)

\(a = \bar{Y} - b\bar{X} = 60.2599\)

Entity Life Exp. Urbanization b a
Afghanistan 62.7 24.803 0.1965 60.2599
Albania 78.6 57.434 0.1965 60.2599
Algeria 75.6 70.848 0.1965 60.2599
American Samoa 72.5 87.238 0.1965 60.2599
Zimbabwe 59.6 32.385 0.1965 60.2599
X-bar 60.132
Y-bar 72.078

Table: Adding \(\hat{Y}\) (Operations)

\(\hat{Y}_i = b \cdot X_i + a\)

Entity Life Exp. Urbanization b a Y-hat
Afghanistan 62.7 24.803 0.1965 60.2599 0.197 * 24.803 + 60.26
Albania 78.6 57.434 0.1965 60.2599 0.197 * 57.434 + 60.26
Algeria 75.6 70.848 0.1965 60.2599 0.197 * 70.848 + 60.26
American Samoa 72.5 87.238 0.1965 60.2599 0.197 * 87.238 + 60.26
Zimbabwe 59.6 32.385 0.1965 60.2599 0.197 * 32.385 + 60.26
X-bar 60.132
Y-bar 72.078

Table: Adding \(\hat{Y}\) (Results)

Entity Life Exp. Urbanization b a Y-hat
Afghanistan 62.7 24.803 0.1965 60.2599 65.1344
Albania 78.6 57.434 0.1965 60.2599 71.5473
Algeria 75.6 70.848 0.1965 60.2599 74.1835
American Samoa 72.5 87.238 0.1965 60.2599 77.4046
Zimbabwe 59.6 32.385 0.1965 60.2599 66.6245
X-bar 60.132
Y-bar 72.078

Five Countries: Step by Step

Five Countries: \(\hat{Y}\) Operations

\(\hat{Y}_i = b \cdot X_i + a\)

Entity Life Exp. Urbanization b a Y-hat
China 77 55.5 0.1965 60.2599 0.197 * 55.5 + 60.26
Italy 82.5 69.565 0.1965 60.2599 0.197 * 69.565 + 60.26
Spain 82.6 79.602 0.1965 60.2599 0.197 * 79.602 + 60.26
United Kingdom 80.9 82.626 0.1965 60.2599 0.197 * 82.626 + 60.26
United States 78.9 81.671 0.1965 60.2599 0.197 * 81.671 + 60.26
X-bar 60.132
Y-bar 72.078

Five Countries: \(\hat{Y}\) Results

Entity Life Exp. Urbanization b a Y-hat
China 77 55.5 0.1965 60.2599 71.1672
Italy 82.5 69.565 0.1965 60.2599 73.9314
Spain 82.6 79.602 0.1965 60.2599 75.9039
United Kingdom 80.9 82.626 0.1965 60.2599 76.4982
United States 78.9 81.671 0.1965 60.2599 76.3105
X-bar 60.132
Y-bar 72.078

Five Countries: Residual Operations

\(\text{residual}_i = Y_i - \hat{Y}_i\)

Entity Life Exp. Urb. b a Y-hat Residual
China 77 55.5 0.1965 60.2599 71.1672 77 - 71.167
Italy 82.5 69.565 0.1965 60.2599 73.9314 82.5 - 73.931
Spain 82.6 79.602 0.1965 60.2599 75.9039 82.6 - 75.904
United Kingdom 80.9 82.626 0.1965 60.2599 76.4982 80.9 - 76.498
United States 78.9 81.671 0.1965 60.2599 76.3105 78.9 - 76.311
X-bar 60.132
Y-bar 72.078

Five Countries: Residual Results

Entity Life Exp. Urb. b a Y-hat Residual
China 77 55.5 0.1965 60.2599 71.1672 5.8328
Italy 82.5 69.565 0.1965 60.2599 73.9314 8.5686
Spain 82.6 79.602 0.1965 60.2599 75.9039 6.6961
United Kingdom 80.9 82.626 0.1965 60.2599 76.4982 4.4018
United States 78.9 81.671 0.1965 60.2599 76.3105 2.5895
X-bar 60.132
Y-bar 72.078

Making Predictions

Prediction Exercise 1

If a country has urbanization of 77%, what is the predicted life expectancy?

\[\hat{Y} = 0.1965 \times 77 + 60.2599 = 75.393\]

Prediction Exercise 2

If a country has urbanization of 10%, what is the predicted life expectancy?

\[\hat{Y} = 0.1965 \times 10 + 60.2599 = 62.225\]

Plotting the Regression Line

Scatterplot: Urbanization vs. Life Expectancy

Five Countries: Observed Values

Five Countries: Predicted Values

Residuals

What Is a Residual?

The residual is the difference between observed and predicted:

\[\text{residual}_i = Y_i - \hat{Y}_i\]

  • Negative residual: country scored lower than predicted
  • Positive residual: country scored higher than predicted

The regression line minimizes the overall size of residuals — it is the line of best fit

Visualizing Residuals

Residuals Table

Residuals for five example countries
Country Y (Observed) Ŷ (Predicted) Residual
China 77.0 71.167 5.833
Italy 82.5 73.931 8.569
Spain 82.6 75.904 6.696
United Kingdom 80.9 76.498 4.402
United States 78.9 76.311 2.589

Baseline Model & R²

The Baseline Model

The simplest “model” predicts \(\bar{Y}\) for everyone:

\[\hat{Y}_i^{\text{baseline}} = \bar{Y}\]

  • Baseline residual: \(Y_i - \bar{Y}\)
  • How much does our regression improve over this?

Comparing Residuals

Regression vs. baseline residuals
Country Y Ŷ Resid (Regression) Resid (Baseline)
China 77.0 71.167 5.833 4.922
Italy 82.5 73.931 8.569 10.422
Spain 82.6 75.904 6.696 10.522
United Kingdom 80.9 76.498 4.402 8.822
United States 78.9 76.311 2.589 6.822

Visualizing the Baseline

Sum of Squared Residuals

We square residuals and sum them:

Total Sum of Squares (SST) — baseline model:

\[SST = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = 1.334313\times 10^{4}\]

Sum of Squared Residuals (SSR) — regression model:

\[SSR = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = 8590.86\]

R-Squared (\(R^2\))

\[R^2 = \frac{SST - SSR}{SST} = 1 - \frac{SSR}{SST}\]

\[R^2 = 1 - \frac{8590.86}{1.334313\times 10^{4}} = 0.3562\]

Interpretation: Urbanization explains 35.6% of the variance in life expectancy

Interpreting \(R^2\)

  • \(R^2\) ranges from 0 to 1
  • \(R^2 = 0\): model explains nothing
  • \(R^2 = 1\): model explains everything

“Our model explains X% of the variance in the outcome variable.”

Practice: If \(R^2 = 0.4532\), how do you interpret it?

“Our model explains 45% of the variance in the outcome.”

Running Regressions in R

The lm() Function

model <- lm(life_expectancy ~ urbanization, data = final)
summary(model)

Call:
lm(formula = life_expectancy ~ urbanization, data = final)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.861  -3.301   1.042   4.297  19.729 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  60.25992    1.17483   51.29   <2e-16 ***
urbanization  0.19653    0.01815   10.83   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.366 on 212 degrees of freedom
Multiple R-squared:  0.3562,    Adjusted R-squared:  0.3531 
F-statistic: 117.3 on 1 and 212 DF,  p-value: < 2.2e-16

Reading R Output: Coefficients

Term Estimate Meaning
(Intercept) 60.2599 \(a\) — predicted \(Y\) when \(X = 0\) (extrapolation!)
urbanization 0.1965 \(b\) — change in \(Y\) per unit \(X\)
  • Std. Error: precision of the estimate
  • t value: \(\text{Estimate} / \text{Std. Error}\)
  • Pr(>|t|): p-value for \(H_0: \beta = 0\)
  • ***: \(p < 0.001\) (highly significant)

Reading R Output: Model Fit

  • Multiple R-squared: 0.3562 — variance explained
  • F-statistic: overall model significance
  • Residual standard error: typical prediction error

Standard Errors

Standard Error of \(b\)

\[SE(b) = \sqrt{\frac{1}{n-2} \cdot \frac{\sum(Y_i - \hat{Y}_i)^2}{\sum(X_i - \bar{X})^2}}\]

  • Numerator: sum of squared residuals (\(SSR\))
  • Denominator: spread of \(X\) values
  • Smaller SE → more precise estimate

Computing SE(b)

\[SE(b) = \sqrt{\frac{1}{214-2} \cdot \frac{8590.86}{1.2304188\times 10^{5}}} = 0.018148\]

This matches the R output: 0.018148

OLS Assumptions

Three Assumptions for Unbiasedness

We want \(\hat{\beta}_1 = \beta_1\) (unbiased). Three assumptions:

  1. Linearity: \(Y\) is linearly related to \(X\)
  1. Random Sampling: observations are independently drawn
  1. Zero Conditional Mean: \(E(\epsilon | X) = 0\)

Zero Conditional Mean

\(E(\epsilon | X) = 0\) means: knowing \(X\) tells you nothing about the error

  • For any value of \(X\), the average error is zero
  • The unobserved factors affecting \(Y\) are uncorrelated with \(X\)

This is the assumption most frequently violated in observational data — and the most consequential

Why Zero Conditional Mean Fails

When an omitted variable affects both \(X\) and \(Y\):

Example: Urbanization \(\rightarrow\) Life Expectancy

  • Wealthier countries are more urbanized and have better healthcare
  • GDP affects both \(X\) (urbanization) and \(Y\) (life expectancy)
  • Our estimate of \(b\) absorbs the effect of GDP \(\rightarrow\) omitted variable bias

Other violations: reverse causality (\(Y\) also causes \(X\)) and selection bias (non-random sample)

What Omitted Variable Bias Means

If \(E(\epsilon | X) \neq 0\):

  • \(\hat{\beta}_1 \neq \beta_1\) — our slope estimate is biased
  • The regression still describes the data accurately
  • But it does not give us a causal effect of \(X\) on \(Y\)

Solutions: randomized experiments, instrumental variables, or controlling for confounders (future lectures)

Homoskedasticity

Beyond unbiasedness, for valid standard errors we also need:

  • Homoskedasticity: \(\text{Var}(\epsilon | X) = \sigma^2\) for all values of \(X\)
  • The spread of residuals is constant across \(X\)

If violated (heteroskedasticity):

  • \(\hat{\beta}_1\) is still unbiased
  • But standard errors, t-values, and p-values become unreliable

Diagnosing Heteroskedasticity

Plot residuals vs. \(X\) and look for patterns:

  • Homoskedastic: points form a roughly even band around zero
  • Heteroskedastic: spread fans out (or narrows) as \(X\) increases

Addressing Heteroskedasticity

If the residual plot shows fanning or clustering:

  • Use robust standard errors (Huber-White / HC standard errors)
  • These adjust SE calculations without changing \(\hat{\beta}_1\)
  • In R: lmtest::coeftest(model, vcov = sandwich::vcovHC)

Binary Explanatory Variables

Regression with a Dummy Variable

So far, \(X\) was continuous. What if \(X\) is binary (0 or 1)?

\[y = \beta_0 + \beta_1 x + \epsilon\]

  • When \(x = 0\): \(E(y | x=0) = \beta_0\)
  • When \(x = 1\): \(E(y | x=1) = \beta_0 + \beta_1\)

\[\beta_1 = E(y | x=1) - E(y | x=0)\]

\(\beta_1\) is the difference in group means

Example: EU Membership

Question: Do EU countries have higher life expectancy?

model_eu <- lm(life_expectancy ~ eu, data = final)
summary(model_eu)

Call:
lm(formula = life_expectancy ~ eu, data = final)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.898  -5.062   1.373   5.077  14.302 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  70.9979     0.5413 131.165  < 2e-16 ***
eu            8.5577     1.5239   5.616  6.1e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.402 on 212 degrees of freedom
Multiple R-squared:  0.1295,    Adjusted R-squared:  0.1254 
F-statistic: 31.54 on 1 and 212 DF,  p-value: 6.099e-08

Interpreting the EU Result

  • \(\hat{\beta}_0 = 70.998\): average life expectancy for non-EU countries
  • \(\hat{\beta}_1 = 8.558\): EU countries live 8.558 years longer on average

\[\beta_1 = E(\text{Life Exp.} | \text{EU}=1) - E(\text{Life Exp.} | \text{EU}=0)\]

Caution: this does not prove EU membership causes longer life

Counterfactuals & Policy Analysis

Binary Variables and Causality

Binary regressions are central to policy analysis:

  • Treatment group (\(x = 1\)) vs. control group (\(x = 0\))
  • \(\beta_1\) = treatment effect (causal if properly designed)

Examples of policy interventions:

  • Free iPads \(\rightarrow\) educational outcomes
  • Plastic bag ban \(\rightarrow\) pollution reduction
  • Job training \(\rightarrow\) income increase

The Average Treatment Effect

%%{init:{"theme":"base","themeVariables":{"fontSize":"22px","primaryColor":"#4a7c6f","primaryTextColor":"#1e293b","lineColor":"#334155"},"flowchart":{"useMaxWidth":true,"nodeSpacing":60,"rankSpacing":80}}}%%
flowchart LR
    A["Random<br/>Assignment"] --> B["Treatment<br/>(x = 1)"]
    A --> C["Control<br/>(x = 0)"]
    B --> D["Outcome<br/>Y₁"]
    C --> E["Outcome<br/>Y₀"]
    D --> F["ATE =<br/>Ȳ₁ − Ȳ₀"]
    E --> F

\[ATE = E(\bar{Y} | x=1) - E(\bar{Y} | x=0)\]

Example: Malaria Bed Nets in Malawi

The World Bank evaluated insecticide-treated bed nets:

  • Setting: 1,000 people in Malawi
  • Treatment: randomly assigned bed nets
  • Outcome: malaria risk

Malaria RCT Results

summary(model_rct)

Call:
lm(formula = malaria_post ~ net, data = rct_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-31.1804  -6.4519  -0.0199   6.5294  30.3102 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  46.5707     0.4262  109.27   <2e-16 ***
net         -10.3903     0.5899  -17.61   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.318 on 998 degrees of freedom
Multiple R-squared:  0.2371,    Adjusted R-squared:  0.2364 
F-statistic: 310.2 on 1 and 998 DF,  p-value: < 2.2e-16

Interpreting the Treatment Effect

\[ATE = \bar{Y}_{\text{treatment}} - \bar{Y}_{\text{control}} = 36.18 - 46.57 = -10.39\]

  • Bed nets reduce malaria risk by 10.39 units
  • The effect is statistically significant (\(p < 0.001\))
  • Random assignment ensures this is a causal estimate

Treatment vs. Control Distribution

Conclusion

Summary

  • Notation first: \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are sample estimates of population parameters
  • Regression equation: \(\hat{Y} = bX + a\) predicts the outcome from a predictor
  • Intercept: may be an out-of-sample extrapolation — interpret with care
  • Residuals: \(Y_i - \hat{Y}_i\) — the regression line minimizes these
  • \(R^2\): proportion of variance explained by the model
  • Zero conditional mean: the key OLS assumption — violated when confounders are omitted
  • Binary regressors: \(\beta_1\) is the difference in group means
  • Randomized experiments enable causal interpretation of \(\beta_1\)