Differences in Differences

Using Regression to Understand Differences, Interactions, and Multiple Explanations

Bogdan G. Popescu

bogdan.popescu@johncabot.edu

John Cabot University

Intro

How can we identify causal effects using observational data?

Two common ways to conduct causal analysis with observational data are:

differences in differences
regression discontinuity design

First Study

Minimum Wage and Number of Jobs

Does raising the minimum wage increase the number of jobs? (Card and Krueger, 1993)

Background

New Jersey changed the minimum wage from $4.25 per hour to $5.05 in 1992.
The number of jobs per fast food restaurant before the change was: 20.44
The number of jobs per fast food restaurant after the change was: 21.03

\[ \text{New Jersey}_{\text{Before}} = 20.44\\ \text{New Jersey}_{\text{After}} = 21.03\\ \Delta = 0.59 \]

Why can’t we interpret \(\Delta = 0.59\) as causal?

Is \(\Delta = 0.59\) a causal effect?

It is not a causal effect
This is because we only look at the treatment group
It is impossible to know if the change happened because of the treatment (i.e. the change in the minimum wage), because of other factors that happened simultaneously, or because of the natural volatility of jobs.

To answer the question, Card and Krueger compare New Jersey to a neighboring state: Pennsylvania.

\[ \text{Pennsylvania}_{\text{After}} = 21.17\\ \text{New Jersey}_{\text{After}} = 21.03\\ \Delta = -0.14 \]

Is \(\Delta = -0.14\) a causal effect?

It is not a causal effect
This is because we only look at post-treatment outcomes.
It is impossible to know the effect of the treatment: New Jersey and Pennsylvania may be completely different.

Framework

Difference-in-Differences Table

Group	Before	After	Δ (After − Before)
Control	A - not treated	B - not treated	B − A
Treatment	C - not treated	D - treated	D − C
Δ (Treatment − Control)	C − A	D − B	(D − C) − (B − A) or (D − B) − (C − A)

Δ (After − Before) = within-group change over time
Δ (Treatment − Control) = across-group difference at a point in time

Filling in the Card and Krueger numbers:

Group	Before	After	Δ (After − Before)
Control	A = 23.33	B = 21.17	B − A = −2.16
Treatment	C = 20.44	D = 21.03	D − C = 0.59
Δ (Treatment − Control)	C − A = −2.89	D − B = −0.14	0.59 − (−2.16) = 2.75 or −0.14 − (−2.89) = 2.75
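The arithmetic in the table can be reproduced in a few lines of R. This is a minimal sketch that uses only the four cell means reported above; the object names (`means`, `did_by_group`, `did_by_time`) are illustrative and do not come from the original study.

```r
# Mean jobs per fast-food restaurant (values from the table above)
means <- matrix(
  c(23.33, 21.17,   # Control (Pennsylvania): before, after
    20.44, 21.03),  # Treatment (New Jersey): before, after
  nrow = 2, byrow = TRUE,
  dimnames = list(c("control", "treatment"), c("before", "after"))
)

# Within-group changes over time (After - Before)
change_control   <- means["control", "after"]   - means["control", "before"]
change_treatment <- means["treatment", "after"] - means["treatment", "before"]

# Across-group differences at each point in time (Treatment - Control)
gap_before <- means["treatment", "before"] - means["control", "before"]
gap_after  <- means["treatment", "after"]  - means["control", "after"]

# Both routes give the same difference-in-differences
did_by_group <- change_treatment - change_control
did_by_time  <- gap_after - gap_before

round(c(did_by_group = did_by_group, did_by_time = did_by_time), 2)
```

Subtracting the changes or subtracting the gaps is the same calculation with the terms rearranged, so the two results always agree.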

Card and Krueger

Controversy

Conventional wisdom (in economics)
Raising the minimum wage reduces employment due to higher labor costs.

Card & Krueger’s finding:
After New Jersey raised the minimum wage, employment increased slightly at fast-food restaurants compared to Pennsylvania.

Methodological innovation:
They used a natural experiment with a difference-in-differences approach — unusual at the time for labor economics.

Differences-in-Differences

The way we can estimate the causal effect is by running the following model:

1. Model

\[ Y_{it} = \beta_0 + \color{blue}{\beta_1 \cdot \text{Group}_i} + \color{purple}{\beta_2 \cdot \text{Time}_t} + \color{red}{\beta_3 \cdot (\text{Group}_i \times \text{Time}_t)} + \epsilon_{it} \]

2. Code

mod <- lm(outcome ~ group + time + group * time, data = final_new)

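Where:

- Group = 1 if this is the treatment group, 0 otherwise
- Time = 1 if this is the period after the intervention, 0 otherwise
- β₀: mean of the control group in the pre-treatment period
- β₁: difference between the treatment and control groups in the pre-treatment period
- β₂: change over time in the control group
- β₃: the difference-in-differences, i.e. the additional change in the treatment group relative to the control group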
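A small simulation can make the role of the interaction coefficient concrete. The sketch below is illustrative only: the data are simulated, and the sample size and the "true" parameters `b0` to `b3` are assumptions chosen for the example, not values from any study.

```r
set.seed(1)

# Hypothetical 2x2 design: 200 observations per group-period cell
n <- 200
sim <- expand.grid(obs = seq_len(n), group = c(0, 1), time = c(0, 1))

# Assumed "true" parameters for the simulation (illustrative only)
b0 <- 10; b1 <- 2; b2 <- -1; b3 <- 3
sim$outcome <- b0 + b1 * sim$group + b2 * sim$time +
  b3 * sim$group * sim$time + rnorm(nrow(sim))

# group * time expands to group + time + group:time
mod <- lm(outcome ~ group * time, data = sim)

# The same quantity computed by hand from the four cell means
cell <- tapply(sim$outcome, list(sim$group, sim$time), mean)
did_manual <- (cell["1", "1"] - cell["1", "0"]) -
  (cell["0", "1"] - cell["0", "0"])

c(beta3 = unname(coef(mod)["group:time"]), did_manual = did_manual)
```

Because the model has one parameter per cell, the interaction coefficient matches the hand-computed double difference up to floating-point error. What the regression adds is a standard error for that estimate.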

Framework for Causal Effects

1. Model

\[ Y_{it} = \beta_0 + \color{blue}{\beta_1 \cdot \text{Group}_i} + \color{purple}{\beta_2 \cdot \text{Time}_t} + \color{red}{\beta_3 \cdot (\text{Group}_i \times \text{Time}_t)} + \epsilon_{it} \]

Group	Before	After	Δ (After − Before)
Control	β₀	β₀ + β₂	β₂
Treatment	β₀ + β₁	β₀ + β₁ + β₂ + β₃	β₂ + β₃
Δ (Treatment − Control)	β₁	β₁ + β₃	β₃

\(\Delta\) within the treatment group − \(\Delta\) within the control group = difference-in-differences = causal effect (if the parallel trends assumption holds)
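To see why \(\beta_3\) is the difference-in-differences, substitute the four cell expressions from the table into the double difference. This short derivation uses the same symbols as the model:

\[ \begin{aligned} \underbrace{\left[(\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_1)\right]}_{\text{change in treatment group}} - \underbrace{\left[(\beta_0 + \beta_2) - \beta_0\right]}_{\text{change in control group}} &= (\beta_2 + \beta_3) - \beta_2 \\ &= \beta_3 \end{aligned} \]

Taking the differences across groups first gives the same result: \((\beta_1 + \beta_3) - \beta_1 = \beta_3\).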

Assumptions

Diff-in-Diff Assumptions

Parallel Trends Assumption

The treatment and the control group have the same trends prior to the intervention.

We assume that the treatment group would have changed like the control group in the absence of the treatment.
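In the A, B, C, D notation of the framework table, this assumption lets us construct the outcome the treatment group would have had without treatment. The symbol \(D^{\text{cf}}\) (the counterfactual "after" value for the treatment group) is introduced here for illustration only:

\[ D^{\text{cf}} = C + (B - A) \]

\[ \text{Effect} = D - D^{\text{cf}} = (D - C) - (B - A) \]

With the minimum wage numbers from the first study, this gives:

\[ D^{\text{cf}} = 20.44 + (21.17 - 23.33) = 18.28, \qquad D - D^{\text{cf}} = 21.03 - 18.28 = 2.75 \]

The estimate of 2.75 is only as credible as the assumption that New Jersey would have followed Pennsylvania's change of −2.16 in the absence of the minimum wage increase.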

Timing
Sometimes, units receive treatment at different times, so this can distort our estimates.

[Figure: an example where the parallel trends assumption holds]

[Figure: another example where the parallel trends assumption holds]

[Figure: an example where the parallel trends assumption is violated]

[Figure: another example where the parallel trends assumption is violated]

Treatment Timing

Units can receive treatment at different times, which can distort our estimate.

Scenario

This is based on made-up data.

The World Bank wants to reduce the risk of malaria in Malawi by providing insecticide-treated bed nets.

So they provided insecticide-treated bed nets only to city B from 2017 to 2020.

The World Bank selected 24 individuals (over 3 years) from city B and they want to investigate whether receiving such nets has any effect on people’s risk of malaria.

The insecticide data is downloadable.

library(dplyr)  # provides glimpse()

diff_data <- read.csv(file = './data/did_data_geo_malawi.csv')
glimpse(diff_data, n=10)

Rows: 8,000
Columns: 10
$ year         <int> 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2013, 201…
$ income       <dbl> 1284.551, 1285.832, 1285.997, 1285.157, 1286.313, 1287.24…
$ age          <int> 33, 34, 35, 36, 37, 38, 39, 40, 51, 52, 53, 54, 55, 56, 5…
$ sex          <chr> "Woman", "Woman", "Woman", "Woman", "Woman", "Woman", "Wo…
$ malaria_risk <dbl> 36.26529, 35.10382, 73.17664, 28.98390, 51.18721, 51.9071…
$ id           <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, …
$ lat          <dbl> -11.27476, -11.27476, -11.27476, -11.27476, -11.27476, -1…
$ lon          <dbl> 34.03006, 34.03006, 34.03006, 34.03006, 34.03006, 34.0300…
$ after        <int> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, …
$ city         <chr> "City A", "City A", "City A", "City A", "City A", "City A…

Here are the individuals:

library(sf)
library(ggplot2)
library(lemon)

# Read the Malawi first-level administrative boundaries and keep Nkhata Bay
mwi1 <- st_read("./data/malawi-latest-free/gadm41_MWI_1.shp", quiet=TRUE)
mwi1_lab <- subset(mwi1, NAME_1 == "Nkhata Bay")

panel_2013 <- subset(diff_data, year == 2013)

# Bounding box of the 2013 observations
min_lon_x <- min(panel_2013$lon)
max_lon_x <- max(panel_2013$lon)
min_lat_y <- min(panel_2013$lat)
max_lat_y <- max(panel_2013$lat)

# Padding (in degrees) added around the bounding box
error <- 0.05

map_2013 <- ggplot() +
  geom_sf(data = mwi1_lab, fill = NA, color = "blue", linewidth = 0.9) +
  geom_point(data = panel_2013,
             aes(x = lon, y = lat, color = city, size = malaria_risk),
             alpha = 0.1) +
  scale_radius(limits = c(0, NA), range = c(0, 5)) +
  theme_bw() +
  labs(x = "Longitude", y = "Latitude") +
  ggtitle("The Three Districts Selected in Nkhata Bay, Malawi\n Location of 1000 Individuals\n For the Experiment, 2013") +
  theme(axis.text.x = element_text(size = 14),
        axis.text.y = element_text(size = 14),
        axis.title = element_text(size = 14),
        plot.title = element_text(hjust = 0.5),
        legend.box.background = element_rect(fill = 'white'),
        legend.background = element_blank(),
        legend.text = element_text(size = 12)) +
  coord_sf(xlim = c(min_lon_x - 3 * error, max_lon_x + 3 * error),
           ylim = c(min_lat_y - error, max_lat_y + error),
           expand = FALSE) +
  ggspatial::annotation_scale(location = 'tr')

# Move the legend inside the plotting panel (lemon package)
map_2013 <- reposition_legend(map_2013, 'bottom left')

Summary statistics

Before we conduct any analysis, it is important to get a sense of our data.

library(modelsummary)
datasummary_skim(diff_data)

	Unique	Missing Pct.	Mean	SD	Min	Median	Max
year	8	0	2016.5	2.3	2013.0	2016.5	2020.0
income	8000	0	1249.9	160.6	836.3	1246.8	1710.7
age	87	0	32.3	16.0	1.0	31.0	87.0
malaria_risk	8000	0	48.9	12.7	0.0	49.0	100.0
id	1000	0	500.5	288.7	1.0	500.5	1000.0
lat	1000	0	-11.2	0.1	-11.3	-11.2	-11.1
lon	1000	0	34.1	0.1	34.0	34.1	34.2
after	2	0	0.4	0.5	0.0	0.0	1.0
		N	%
sex	Man	3224	40.3
	Woman	4776	59.7
city	City A	7808	97.6
	City B	192	2.4

Estimating the Difference

We are interested in the causal effect of the program, \(\beta_3\), which is the coefficient on the interaction \(\text{Group}_i \times \text{Time}_t\).

\[ Y_{it} = \beta_0 + \beta_1 \text{Group}_i + \beta_2 \text{Time}_t + \beta_3 (\text{Group}_i \times \text{Time}_t) + \epsilon_{it} \]

Or, for this scenario:

\[ \color{green}{\text{Malaria Risk}_{it}} = \beta_0 + \color{blue}{\beta_1 \text{City B}_i} + \color{purple}{\beta_2 \text{After}_t} + \color{red}{\beta_3 (\text{City B}_i \times \text{After}_t)} + \epsilon_{it} \]

We can estimate \(\beta_3\) by running:

model_did <- lm(malaria_risk ~ city + after + city * after, data = diff_data)
options(modelsummary_format_numeric_latex = "plain")
modelsummary(model_did, stars = TRUE,  gof_map = c("nobs", "r.squared"))

	(1)
(Intercept)	50.629***
	(0.179)
cityCity B	3.071**
	(1.155)
after	-4.532***
	(0.292)
cityCity B × after	-7.623***
	(1.886)
Num.Obs.	8000
R2	0.034
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

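Interpretation
In the pre-treatment period, malaria risk in City B is on average about 3.1 points higher than in City A. In City A, risk is about 4.5 points lower after 2017 than before. The interaction term is the difference-in-differences estimate: after 2017, risk in City B falls by about 7.6 points more than in City A, which we interpret as the causal effect of the bed nets if the parallel trends assumption holds.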
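To connect the regression output back to the 2×2 table, the four cell means can be rebuilt from the reported coefficients. This is a sketch that uses only the point estimates printed above; the vector `b` and the object names are illustrative, not part of the original analysis.

```r
# Coefficients as reported in the regression table above
b <- c(intercept = 50.629, city_b = 3.071, after = -4.532, city_b_after = -7.623)

# Implied mean malaria risk in each city-period cell
cells <- matrix(
  c(b["intercept"],                 # City A, before
    b["intercept"] + b["after"],    # City A, after
    b["intercept"] + b["city_b"],   # City B, before
    sum(b)),                        # City B, after
  nrow = 2, byrow = TRUE,
  dimnames = list(c("City A", "City B"), c("before", "after"))
)
cells

# Change within each city, and the difference between those changes
change_a <- cells["City A", "after"] - cells["City A", "before"]
change_b <- cells["City B", "after"] - cells["City B", "before"]
did <- change_b - change_a

round(c(change_a = change_a, change_b = change_b, did = did), 3)
```

City B's total drop of about 12.2 points combines the 4.5-point decline also seen in City A with the additional 7.6 points attributed to the program.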

Event-Study Plot

library(dplyr)
library(ggplot2)
library(lemon)

# Mean malaria risk and 95% confidence interval by year and city
plot_data <- diff_data %>%
  group_by(year, city) %>%
  summarize(mean_risk = mean(malaria_risk),
            se_risk = sd(malaria_risk) / sqrt(n()),
            upper = mean_risk + (1.96 * se_risk),
            lower = mean_risk - (1.96 * se_risk))

mean_risk <- ggplot(plot_data, aes(x = year, y = mean_risk, color = city)) +
  # Vertical line between 2017 and 2018, where the after period begins
  geom_vline(xintercept = 2017.5) +
  geom_errorbar(aes(ymin = lower, ymax = upper),
                linewidth = 1, width = 0,
                position = position_dodge(width = 0.04)) +
  geom_line() +
  geom_point(size = 2, position = position_dodge(width = 0.04)) +
  labs(x = "Year", y = "Malaria Risk") +
  scale_y_continuous(breaks = seq(40, 57, by = 3),
                     limits = c(40, 57)) +
  scale_x_continuous(breaks = seq(2013, 2020, by = 1),
                     limits = c(2012, 2021)) +
  theme_bw() +
  theme(legend.position.inside = c(1, 0),
        legend.box.background = element_rect(fill = 'white'),
        legend.background = element_blank())

# Move the legend inside the plotting panel (lemon package)
mean_risk <- reposition_legend(mean_risk, 'bottom left')