Interpreting Binary and Multivariate Regression Models

Dummy Variables, Interaction Effects, and Holding Other Factors Constant

Bogdan G. Popescu

John Cabot University

Table of Contents

  1. Regression with Binary Independent Variables
  2. Interaction Effects
  3. Multivariate Regression

Regression on a Binary Independent Variable

So far, we have focused only on the case where the independent variable is continuous (e.g., urbanization measured as a percent).

We can also apply a similar procedure when X is binary (i.e., a dummy variable where X takes two values: 0 and 1).

Our original equation:

\[ y = \beta_0 + \beta_1 x + \epsilon \]

Taking expectations, and assuming the error term has mean zero given \(x\), this can be written for the two cases as:

\[ E(y|x=0) = \beta_0 + \beta_1 \cdot 0 \]

\[ E(y|x=1) = \beta_0 + \beta_1 \cdot 1 \]

Where:

  • \(E(y|x=0)\) is the expected value of \(y\) when \(x = 0\)
  • \(E(y|x=1)\) is the expected value of \(y\) when \(x = 1\)

Regression on a Binary Explanatory Variable

Simplifying each case (the error term has expectation zero, so it does not appear):

\[ E(y|x=0) = \beta_0 + \beta_1 \cdot 0 = \beta_0 \]

\[ E(y|x=1) = \beta_0 + \beta_1 \cdot 1 = \beta_0 + \beta_1 \]

So the only difference between the two equations is \(\beta_1\):

\[ \begin{aligned} E(y|x=0) &= \beta_0 \\ E(y|x=1) &= \beta_0 + \beta_1 \end{aligned} \]

This means that we can write:

\[ \beta_1 = E(y|x=1) - E(y|x=0) \]

Regression on a Binary Explanatory Variable

We can interpret \(\beta_1\) as the difference in the average value of \(y\) between the subpopulations where \(x = 1\) and \(x = 0\).

\(\beta_1\) can be descriptive or represent the causal effect of an intervention or program (if assumptions hold).

For example, to see if EU countries have a higher life expectancy:

\[ \beta_1 = E(y|x=1) - E(y|x=0) \]

\[ \beta_1 = E(\text{life expectancy} \mid \text{EU}=1) - E(\text{life expectancy} \mid \text{EU}=0) \]

Show the code
library(dplyr)
library(ggplot2)
setwd("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/lecture10/data/")

life_expectancy_df <- read.csv(file = './life-expectancy.csv')
urbanization_df <- read.csv(file = './share-of-population-urban.csv')

urbanization_df2<-urbanization_df%>%
  dplyr::group_by(Entity, Code)%>%
  dplyr::summarize(urb_mean=mean(Urban.population....of.total.population.))

life_expectancy_df2<-life_expectancy_df%>%
  dplyr::group_by(Entity, Code)%>%
  dplyr::summarize(life_expectancy=mean(Life.expectancy.at.birth..historical.))

# Drop non-standard codes (OWID_KOS, OWID_WRL) and rows without a country code
weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_life_expectancy_df<-subset(life_expectancy_df2, !(Code %in% weird_labels))
clean_urbanization_df<-subset(urbanization_df2, !(Code %in% weird_labels))

clean_urbanization_df<-subset(clean_urbanization_df, select = -c(Entity))
merged_data<-left_join(clean_life_expectancy_df, clean_urbanization_df, by = c("Code"="Code"))
merged_data2<-na.omit(merged_data)


library("modelsummary")
##############################
#Step4: Labeling EU countries#
##############################
eu_countries <- c(
  "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia", "Denmark",
  "Estonia", "Finland", "France", "Germany", "Greece", "Hungary", "Ireland",
  "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands",
  "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain", "Sweden"
)

model<-lm(life_expectancy~urb_mean, data=merged_data2)

merged_data2$eu<-0
merged_data2$eu[merged_data2$Entity %in% eu_countries] <- 1
merged_data2$eu[is.na(merged_data2$eu)]<-0

x2<-lm(life_expectancy~eu, data=merged_data2)

cm <- c("(Intercept)"="Intercept",
        'eu'='EU')

modelsummary(x2, stars = TRUE, coef_map = cm, gof_map = c("nobs", "r.squared"))
                 (1)
Intercept        60.426***
                 (0.614)
EU               6.688***
                 (1.728)
Num.Obs.         214
R2               0.066
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Show the code
library(broom)
model_scaled<-lm(scale(life_expectancy)~scale(eu), data=merged_data2)
results<-tidy(model_scaled)

cm <- c('scale(eu)'='EU',
        "(Intercept)"="Intercept")

ggplot(results, aes(x = estimate, y = term)) +
      geom_point()+
      geom_errorbar(aes(xmin = estimate-1.96*std.error, 
                        xmax = estimate+1.96*std.error), 
                linewidth = 1, width=0)+
  scale_y_discrete(labels = cm) +  # this maps term names to nicer labels
  theme_bw()+
   theme(
    axis.text  = element_text(size = 14),
    axis.title = element_text(size = 14),
    plot.title = element_text(hjust = 0.5))

Plugging in the two estimated group means (EU: 67.114, non-EU: 60.426):

\[ \beta_1 = 67.114 - 60.426 = 6.688 \]

Group Means: Life Expectancy in EU vs. Non-EU Countries

Before running a regression, let’s look at the average life expectancy in each group:

This helps us see what the regression is about to estimate.

With a single binary regressor, the estimated slope \(\beta_1\) equals the difference in sample means:

\[ \text{Mean}_{EU} - \text{Mean}_{Non-EU} \]
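A minimal sketch of this check, using simulated data rather than the lecture's dataset: the data frame `sim`, its column names, and every number in it are illustrative assumptions. It computes the two group means and group sizes, then compares their difference with the slope from `lm()`.

```r
# Simulated data: all names and values below are hypothetical
set.seed(42)
n <- 200
eu <- rbinom(n, size = 1, prob = 0.15)              # 0/1 dummy
life_expectancy <- 60 + 7 * eu + rnorm(n, sd = 8)   # outcome
sim <- data.frame(eu, life_expectancy)

# Group sizes and group means
group_sizes <- table(sim$eu)
mean_non_eu <- mean(sim$life_expectancy[sim$eu == 0])
mean_eu     <- mean(sim$life_expectancy[sim$eu == 1])
diff_in_means <- mean_eu - mean_non_eu

# OLS on the dummy: the intercept is the x = 0 group mean,
# the slope is the difference between the two group means
fit <- lm(life_expectancy ~ eu, data = sim)
intercept <- unname(coef(fit)["(Intercept)"])
slope     <- unname(coef(fit)["eu"])

print(group_sizes)
print(c(mean_non_eu = mean_non_eu, mean_eu = mean_eu))
print(c(diff_in_means = diff_in_means, slope = slope))
```

Checking the group sizes alongside the means is useful because a very small group makes the estimated difference imprecise.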

Regression on a Binary Explanatory Variable

The interpretation of \(\beta_1\):

  • EU countries have about 6.7 more years of life expectancy than non-EU countries in the sample.
  • Better said: average life expectancy is 6.688 years higher in EU countries than in non-EU countries.
  • This difference is significant at the 0.1% significance level (0.001 × 100 = 0.1).
  • Average life expectancy in EU countries is 60.426 + 6.688 = 67.114 years.
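The significance statement can be checked by hand from the reported coefficient and standard error. The sketch below copies the three numbers from the regression table; the degrees of freedom (n minus two estimated parameters) and the use of the t distribution are the usual OLS conventions, assumed here rather than taken from the slides.

```r
# Values copied from the regression table
b1    <- 6.688    # EU coefficient
se_b1 <- 1.728    # its standard error
n     <- 214      # number of observations
df    <- n - 2    # residual degrees of freedom (intercept + one slope)

# t statistic and two-sided p-value
t_stat  <- b1 / se_b1
p_value <- 2 * pt(-abs(t_stat), df = df)

# 95% confidence interval based on the t distribution
t_crit <- qt(0.975, df = df)
ci <- c(lower = b1 - t_crit * se_b1, upper = b1 + t_crit * se_b1)

print(round(t_stat, 2))
print(p_value < 0.001)   # consistent with the three stars in the table
print(round(ci, 2))
```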

Interaction Effects

What if we interacted urbanization with some of our other variables?

\[ \begin{aligned} y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 \times x_2) + \epsilon \\ \text{life expectancy} &= \beta_0 + \beta_1 \cdot \text{EU} + \beta_2 \cdot \text{urbanization} + \beta_3 \cdot (\text{EU} \times \text{urbanization}) + \epsilon \end{aligned} \]

                    (1)
Intercept           48.981***
                    (1.044)
EU                  33.095***
                    (6.611)
Urbanization        0.233***
                    (0.019)
EU * Urbanization   -0.456***
                    (0.097)
Num.Obs.            214
R2                  0.464
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

  • The EU coefficient (33.095) is not the intercept for EU countries, but rather the difference in intercept for EU countries compared to others.
  • The intercept for the EU countries: the intercept + EU = 48.981 + 33.095 = 82.076

  • Similarly, EU * urbanization = -0.456 is not the slope for urbanization for the EU, but rather the offset in slope for the EU.
  • Therefore, the slope for urbanization for EU countries is: urbanization + EU * urbanization = 0.233 − 0.456 = -0.223

Since the slope for urbanization for the rest of the world is 0.233, on average a one-percentage-point increase in urbanization in a country outside the EU is associated with an increase in life expectancy of 0.233 years.

In the EU, by contrast, a one-percentage-point increase in urbanization is associated with a decrease in life expectancy of 0.223 years.
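The group-specific intercepts and slopes can be reproduced directly from the four reported coefficients, using the same `(b0 + b2)` and `(b1 + b3)` construction as the lecture's plotting code. The helper `predict_life()` and the urbanization level of 75 are hypothetical, chosen only for illustration.

```r
# Coefficients copied from the interaction table
b0    <- 48.981    # Intercept (non-EU countries)
b_eu  <- 33.095    # EU: offset in intercept
b_urb <- 0.233     # Urbanization: slope for non-EU countries
b_int <- -0.456    # EU * Urbanization: offset in slope

# EU-specific intercept and slope
eu_intercept <- b0 + b_eu      # 48.981 + 33.095 = 82.076
eu_slope     <- b_urb + b_int  # 0.233 - 0.456 = -0.223

# Hypothetical helper: fitted life expectancy for a given group and urbanization
predict_life <- function(urbanization, eu) {
  (b0 + b_eu * eu) + (b_urb + b_int * eu) * urbanization
}

non_eu_at_75 <- predict_life(urbanization = 75, eu = 0)
eu_at_75     <- predict_life(urbanization = 75, eu = 1)

print(c(eu_intercept = eu_intercept, eu_slope = eu_slope))
print(c(non_eu_at_75 = non_eu_at_75, eu_at_75 = eu_at_75))
```

Note that the EU intercept describes a country with 0% urbanization, which lies far outside the range of the EU data, so it is mainly useful for drawing the line rather than as a substantive prediction.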

Interaction Effects

Show the code
# Step 1: Label EU countries
eu_countries <- c(
  "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia", "Denmark",
  "Estonia", "Finland", "France", "Germany", "Greece", "Hungary", "Ireland",
  "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands",
  "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain", "Sweden"
)

merged_data2$eu <- ifelse(merged_data2$Entity %in% eu_countries, 1, 0)
merged_data2$type <- ifelse(merged_data2$Entity %in% eu_countries, "EU", "Everything Else")

# Step 2: Fit interaction model
merged_data2$eu_urbanization <- merged_data2$eu * merged_data2$urb_mean
model <- lm(life_expectancy ~ urb_mean + eu + eu_urbanization, data = merged_data2)

# Step 3: Extract coefficients
b0 <- coef(model)["(Intercept)"]
b1 <- coef(model)["urb_mean"]
b2 <- coef(model)["eu"]
b3 <- coef(model)["eu_urbanization"]

# Step 4: Compute lines
x_vals <- c(0, 100)
non_eu_line <- data.frame(
  x = x_vals,
  y = b0 + b1 * x_vals,
  group = "Everything Else"
)

eu_line <- data.frame(
  x = x_vals,
  y = (b0 + b2) + (b1 + b3) * x_vals,
  group = "EU"
)

regression_lines <- rbind(non_eu_line, eu_line)

# Step 5: Plot
cols <- c("Everything Else" = "black", "EU" = "blue")
shapes <- c("Everything Else" = 16, "EU" = 3)

ggplot(merged_data2) +
    geom_point(data = merged_data2, aes(x = urb_mean, y = life_expectancy, color = type, shape = type), size=2)+
  geom_line(data = regression_lines, aes(x = x, y = y, color = group, group = group), 
            linetype = "solid", linewidth = 1, inherit.aes = FALSE)+
  scale_shape_manual(name = "", values = shapes) +
  scale_color_manual(name = "", values = cols) +

  scale_x_continuous(name = "Urbanization", breaks = seq(0, 100, 20), limits = c(0, 100)) +
  scale_y_continuous(name = "Life Expectancy", breaks = seq(0, 100, 20), limits = c(0, 100)) +
  theme_bw() +
theme(
  axis.text.x = element_text(size = 14),
  axis.title = element_text(size = 14),
  plot.title = element_text(hjust = 0.5),
  legend.position = "bottom",  # Or "right", "top"
  legend.title = element_blank(),
  legend.box.background = element_rect(fill = 'white'),
  legend.background = element_blank(),
  legend.text = element_text(size = 12)
)

Interaction Effects

Show the code
# Step 1: Label EU countries
eu_countries <- c(
  "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia", "Denmark",
  "Estonia", "Finland", "France", "Germany", "Greece", "Hungary", "Ireland",
  "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands",
  "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain", "Sweden"
)

merged_data2$eu <- ifelse(merged_data2$Entity %in% eu_countries, 1, 0)
merged_data2$type <- ifelse(merged_data2$Entity %in% eu_countries, "EU", "Everything Else")

# Step 2: Fit interaction model
merged_data2$eu_urbanization <- merged_data2$eu * merged_data2$urb_mean
model <- lm(life_expectancy ~ urb_mean + eu + eu_urbanization, data = merged_data2)

# Step 3: Extract coefficients
b0 <- coef(model)["(Intercept)"]
b1 <- coef(model)["urb_mean"]
b2 <- coef(model)["eu"]
b3 <- coef(model)["eu_urbanization"]

# Step 4: Compute lines
x_vals <- c(0, 100)
non_eu_line <- data.frame(
  x = x_vals,
  y = b0 + b1 * x_vals,
  group = "Everything Else"
)

eu_line <- data.frame(
  x = x_vals,
  y = (b0 + b2) + (b1 + b3) * x_vals,
  group = "EU"
)

regression_lines <- rbind(non_eu_line, eu_line)

# Step 5: Plot
cols <- c("Everything Else" = "black", "EU" = "blue")
shapes <- c("Everything Else" = 16, "EU" = 3)

ggplot(merged_data2) +
    geom_point(data = merged_data2, aes(x = urb_mean, y = life_expectancy, color = type, shape = type), alpha = .2, size=2)+
  geom_line(data = regression_lines, aes(x = x, y = y, color = group, group = group), 
            linetype = "solid", linewidth = 1, alpha = .2, inherit.aes = FALSE)+
  scale_shape_manual(name = "", values = shapes) +
  scale_color_manual(name = "", values = cols) +

  scale_x_continuous(name = "Urbanization", breaks = seq(0, 100, 20), limits = c(0, 100)) +
  scale_y_continuous(name = "Life Expectancy", breaks = seq(0, 100, 20), limits = c(0, 100)) +
  theme_bw() +
    # Slope annotations
  annotate("text", x = 10, y = b0 + b1 * 20 + 5, 
           label = paste0("Slope (b) = ", round(b1, 3)), 
           color = "black", size = 5) +
  annotate("text", x = 10, y = (b0 + b2) + (b1 + b3) * 20 - 5, 
           label = paste0("Slope (b) = ", round(b1 + b3, 3)), 
           color = "blue", size = 5) +
theme(
  axis.text.x = element_text(size = 14),
  axis.title = element_text(size = 14),
  plot.title = element_text(hjust = 0.5),
  legend.position = "bottom",  # Or "right", "top"
  legend.title = element_blank(),
  legend.box.background = element_rect(fill = 'white'),
  legend.background = element_blank(),
  legend.text = element_text(size = 12)
)

Interaction Effects

Show the code
# Step 1: Label EU countries
eu_countries <- c(
  "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia", "Denmark",
  "Estonia", "Finland", "France", "Germany", "Greece", "Hungary", "Ireland",
  "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands",
  "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain", "Sweden"
)

merged_data2$eu <- ifelse(merged_data2$Entity %in% eu_countries, 1, 0)
merged_data2$type <- ifelse(merged_data2$Entity %in% eu_countries, "EU", "Everything Else")

# Step 2: Fit interaction model
merged_data2$eu_urbanization <- merged_data2$eu * merged_data2$urb_mean
model <- lm(life_expectancy ~ urb_mean + eu + eu_urbanization, data = merged_data2)

# Step 3: Extract coefficients
b0 <- coef(model)["(Intercept)"]
b1 <- coef(model)["urb_mean"]
b2 <- coef(model)["eu"]
b3 <- coef(model)["eu_urbanization"]

# Step 4: Compute lines
x_vals <- c(0, 100)
non_eu_line <- data.frame(
  x = x_vals,
  y = b0 + b1 * x_vals,
  group = "Everything Else"
)

eu_line <- data.frame(
  x = x_vals,
  y = (b0 + b2) + (b1 + b3) * x_vals,
  group = "EU"
)

regression_lines <- rbind(non_eu_line, eu_line)

# Step 5: Plot
cols <- c("Everything Else" = "black", "EU" = "blue")
shapes <- c("Everything Else" = 16, "EU" = 3)

    
ggplot(merged_data2) +
  geom_point(data = merged_data2, aes(x = urb_mean, y = life_expectancy, color = type, shape = type), alpha = .2, size=2)+
  geom_line(data = regression_lines, aes(x = x, y = y, color = group, group = group), 
            linetype = "solid", linewidth = 1, alpha = .2, inherit.aes = FALSE)+
  scale_shape_manual(name = "", values = shapes) +
  scale_color_manual(name = "", values = cols) +

  scale_x_continuous(name = "Urbanization", breaks = seq(0, 100, 20), limits = c(0, 100)) +
  scale_y_continuous(name = "Life Expectancy", breaks = seq(0, 100, 20), limits = c(0, 100)) +
  theme_bw() +
  # Intercept annotations
  annotate("point", x = 0, y = b0, color = "black", size = 3) +
  annotate("text", x = 2, y = b0 + 2, 
           label = paste0("Intercept (a) = ", round(b0, 2)), 
           hjust = 0, color = "black", size = 5) +

  annotate("point", x = 0, y = b0 + b2, color = "blue", size = 3) +
  annotate("text", x = 2, y = b0 + b2 + 2, 
           label = paste0("Intercept (a) = ", round(b0 + b2, 2)), 
           hjust = 0, color = "blue", size = 5) +

theme(
  axis.text.x = element_text(size = 14),
  axis.title = element_text(size = 14),
  plot.title = element_text(hjust = 0.5),
  legend.position = "bottom",  # Or "right", "top"
  legend.title = element_blank(),
  legend.box.background = element_rect(fill = 'white'),
  legend.background = element_blank(),
  legend.text = element_text(size = 12)
)

Interaction Effects

An interaction effect allows us to identify the change that happens when combining two explanatory variables:

  • Urbanization effect
  • EU effect
  • Additional urbanization effect in the EU
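These three components map onto the three slope terms of the model. As a sketch on simulated data (the data frame `sim` and all of its values are assumptions, not the lecture's dataset), the hand-built product term used in the lecture code gives the same estimates as R's `*` formula shorthand, which expands to both main effects plus their interaction.

```r
# Simulated data: names and numbers are hypothetical
set.seed(1)
n   <- 300
urb <- runif(n, min = 10, max = 100)    # urbanization in percent
eu  <- rbinom(n, size = 1, prob = 0.2)  # 0/1 dummy
y   <- 50 + 0.25 * urb + 30 * eu - 0.4 * eu * urb + rnorm(n, sd = 5)
sim <- data.frame(y, urb, eu)

# Option 1: build the product term by hand
sim$eu_urb <- sim$eu * sim$urb
fit_manual <- lm(y ~ urb + eu + eu_urb, data = sim)

# Option 2: formula shorthand, urb * eu expands to urb + eu + urb:eu
fit_star <- lm(y ~ urb * eu, data = sim)

print(coef(fit_manual))   # urbanization effect, EU effect, additional effect in the EU
print(coef(fit_star))     # same estimates, last term labelled urb:eu
```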

Multivariate Regression

Up until now, we only had one independent variable: urbanization.

\[ \text{life expectancy} = \beta_0 + \beta_1 \text{urbanization} + \epsilon \]

But many other variables beyond urbanization could explain life expectancy.

Thus, we need to ask ourselves:

  • What characteristics do countries with longer life expectancy have in common?
  • What characteristics do countries with shorter life expectancy have in common?

Normally, we would examine the relationship between life expectancy and several types of variables:

  • income
  • urbanization
  • education

Multivariate Regression

Specification

We can now run something like:

\[ \text{life expectancy} = \beta_0 + \beta_1 \text{urbanization} + \beta_2 \log(\text{GDP per capita}) + \epsilon \]

Show the code
setwd("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/lecture10/data/")
gdp <- read.csv(file = './gdp-per-capita-maddison-2020.csv')
gdp2 <- gdp %>%
    group_by(Code) %>%
    summarize(gdp = mean(GDP.per.capita))

# Removing continents
gdp3 <- subset(gdp2, gdp2$Code != "")
merged_data3<-left_join(merged_data2, gdp3, by = c("Code"="Code"))
merged_data3<-na.omit(merged_data3)
#Taking the log
merged_data3$log_gdp<-log(merged_data3$gdp)

Regression Example

Interpretation

This is how we create a professional-looking table.

Show the code
library("modelsummary")
model<-lm(life_expectancy~urb_mean+log_gdp, data=merged_data3)

models<-list("DV: Life Expectancy" = model)

cm <- c('urb_mean'='Urbanization',
        'log_gdp'='GDP',
        "(Intercept)"="Intercept")

modelsummary(models, stars = TRUE, coef_map = cm, gof_map = c("nobs", "r.squared"))
                 DV: Life Expectancy
Urbanization     0.054+
                 (0.029)
GDP              5.947***
                 (0.699)
Intercept        7.060
                 (4.880)
Num.Obs.         164
R2               0.608
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Regression Example

Interpretation

This is how we create a professional-looking plot.

Show the code
#Step1: To make readable graphs, we need to standardize our coefficients.
# Scaling a variable means subtracting its mean from each raw value and then dividing by its standard deviation.
x5_b<-lm(scale(life_expectancy)~scale(urb_mean)+scale(log_gdp),data=merged_data3)
#Step2: Tidying the coefficients (tidy() comes from the broom package loaded earlier)
results1_b <- tidy(x5_b)

cm <- c('scale(urb_mean)'='Urbanization',
        'scale(log_gdp)'='GDP',
        "(Intercept)"="Intercept")

# Ensure the levels of 'term' match the names of cm
results1_b$term <- factor(results1_b$term, levels = names(cm))

graph_results1_b <- ggplot(results1_b, 
       aes(x = estimate, y = term)) +
    geom_point(position = position_dodge(width = 0.4), size = 4) +
    geom_errorbar(aes(xmin = estimate - 1.96 * std.error, 
                      xmax = estimate + 1.96 * std.error), 
                  linewidth = 1, width = 0,
                  position = position_dodge(width = 0.4)) +
    geom_vline(xintercept = 0, color = "black", linetype = "dashed") +
    scale_y_discrete(labels = cm) +
    theme_bw() +
    theme(axis.text.x = element_text(size = 16),
          axis.text.y = element_text(size = 16),
          legend.position = c(1, 0),
          legend.box.background = element_rect(fill = 'white'),
          legend.background = element_blank(),
          legend.text = element_text(size = 16))
graph_results1_b

Regression Example

Interpretation

We can now interpret the coefficients:

Show the code
library("modelsummary")
model<-lm(life_expectancy~urb_mean+log_gdp, data=merged_data3)

models<-list("DV: Life Expectancy" = model)

cm <- c('urb_mean'='Urbanization',
        'log_gdp'='GDP',
        "(Intercept)"="Intercept")

modelsummary(models, stars = TRUE, coef_map = cm, gof_map = c("nobs", "r.squared"))
                 DV: Life Expectancy
Urbanization     0.054+
                 (0.029)
GDP              5.947***
                 (0.699)
Intercept        7.060
                 (4.880)
Num.Obs.         164
R2               0.608
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
  • Urbanization: a one-percentage-point increase in urbanization is associated with a 0.054-year increase in life expectancy, holding log GDP constant (significant only at the 10% level).
  • Log GDP: a one-unit increase in log GDP is associated with a 5.947-year increase in life expectancy, holding urbanization constant.
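To see what "holding constant" means in practice, the sketch below plugs the reported coefficients into the fitted equation. The helper `predict_life()` and the example values (50% urbanization, log GDP of 9) are hypothetical. Raising urbanization by one point while leaving log GDP untouched changes the prediction by exactly the urbanization coefficient.

```r
# Coefficients copied from the regression table
b0    <- 7.060    # Intercept
b_urb <- 0.054    # Urbanization
b_gdp <- 5.947    # log GDP per capita

# Hypothetical helper: fitted life expectancy
predict_life <- function(urb, log_gdp) {
  b0 + b_urb * urb + b_gdp * log_gdp
}

# Two hypothetical countries with the same log GDP, one point apart in urbanization
base       <- predict_life(urb = 50, log_gdp = 9)
more_urban <- predict_life(urb = 51, log_gdp = 9)

print(base)
print(more_urban - base)   # difference equals b_urb
```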

  • Note that GDP per capita was logged before entering the model.
  • The right interpretation is therefore: a 1% increase in GDP per capita is associated with an increase of about 5.947/100 ≈ 0.059 years in life expectancy, holding urbanization constant.
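The division by 100 is an approximation that works well for small percentage changes. A short sketch of the arithmetic, using only the reported coefficient: the exact change for a relative increase in GDP is the coefficient times the log of the ratio, and the doubling case is added purely as an illustration.

```r
b_gdp <- 5.947   # coefficient on log GDP per capita, from the table

# 1% increase in GDP per capita
approx_1pct <- b_gdp / 100         # rule-of-thumb approximation
exact_1pct  <- b_gdp * log(1.01)   # exact change implied by the model

# Larger changes should use the exact formula, e.g. a doubling of GDP per capita
doubling <- b_gdp * log(2)

print(round(c(approx_1pct = approx_1pct, exact_1pct = exact_1pct), 4))
print(round(doubling, 2))
```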

Multivariate Regressions

R Sq. and Adj. R Sq.

  • The output provided by R for multiple regression has the same layout as for bivariate regression.
  • The difference is that each coefficient now refers to one predictor while the others are held constant.
  • In a multivariate regression, we care about the adjusted R squared.
  • The reason is that \(R^2\) never decreases when we add variables to the regression model.
  • Even if we add redundant variables to the model, the value of R-squared does not decrease.
  • The Adjusted R-squared, by contrast, penalizes every added predictor, so adding redundant variables can reduce it. It can even be negative, which happens when the predictors explain so little that the penalty outweighs the fit, meaning the model adds noise rather than explanatory power.
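A small simulation can illustrate the contrast. Everything below is simulated under assumed values (one real predictor plus five pure-noise columns), so the exact numbers carry no meaning. The sketch also recomputes the adjusted value by hand with the standard formula \(1 - (1 - R^2)(n - 1)/(n - p - 1)\), where \(p\) is the number of predictors, and applies that formula to the reported model (\(R^2 = 0.608\), 164 observations, 2 predictors).

```r
# Simulated data: one informative predictor and five noise predictors (hypothetical)
set.seed(123)
n <- 100
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n)
noise <- matrix(rnorm(n * 5), nrow = n)
colnames(noise) <- paste0("noise", 1:5)
sim <- data.frame(y, x, noise)

fit_small <- lm(y ~ x, data = sim)   # informative predictor only
fit_big   <- lm(y ~ ., data = sim)   # plus the five noise predictors

r2_small  <- summary(fit_small)$r.squared
r2_big    <- summary(fit_big)$r.squared
adj_small <- summary(fit_small)$adj.r.squared
adj_big   <- summary(fit_big)$adj.r.squared

# Adjusted R^2 by hand for the larger model (p = 6 predictors)
p_big <- 6
adj_big_by_hand <- 1 - (1 - r2_big) * (n - 1) / (n - p_big - 1)

# Same formula applied to the reported model
adj_reported <- 1 - (1 - 0.608) * (164 - 1) / (164 - 2 - 1)

print(c(r2_small = r2_small, r2_big = r2_big))      # R^2 does not go down
print(c(adj_small = adj_small, adj_big = adj_big))  # adjusted R^2 carries a penalty
print(round(adj_reported, 3))
```

Because the reported \(R^2\) is rounded to three decimals, the adjusted value computed from it is approximate.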

Conclusion: What You Should Remember

  • Binary regression estimates group differences:
    \[ \beta_1 = \text{mean}(y \mid x = 1) - \text{mean}(y \mid x = 0) \]

  • Always check group means and sample sizes first.

  • Interaction terms allow effects to vary across groups
    (e.g. slope/intercept differences for EU vs. non-EU).

  • Multivariate regression controls for other factors:
    coefficients show the effect of one variable holding others constant.

  • Use Adjusted \(R^2\) to assess model fit when adding predictors.