Using Regression to Understand Differences, Interactions, and Multiple Explanations
How can we identify causal effects using observational data?
One way to conduct causal analysis with observational data is to run a difference-in-differences analysis.
Does raising the minimum wage increase the number of jobs? (Card and Krueger, 1993)
Background
\[ \begin{aligned} \text{New Jersey}_{\text{Before}} &= 20.44\\ \text{New Jersey}_{\text{After}} &= 21.03\\ \Delta &= 0.59 \end{aligned} \]
Why can’t we interpret \(\Delta = 0.59\) as causal?
Is \(\Delta = 0.59\) a causal effect?
To answer the question, Card and Krueger compare New Jersey to a neighboring state: Pennsylvania.
\[ \begin{aligned} \text{Pennsylvania}_{\text{After}} &= 21.17\\ \text{New Jersey}_{\text{After}} &= 21.03\\ \Delta &= -0.14 \end{aligned} \]
Is \(\Delta = -0.14\) a causal effect?
| Group | Before | After | Δ (After - Before) | 
|---|---|---|---|
| Control | A - not treated | B - not treated | B - A | 
| Treatment | C - not treated | D - treated | D - C | 
| Δ (Treatment - Control) | C - A | D - B | (D - C) - (B - A) | 
Δ (After - Before) = within-unit change
Δ (Treatment - Control) = across-group change
The bottom-right cell can equivalently be computed across groups: (D - B) - (C - A) = (D - C) - (B - A).
| Group | Before | After | Δ (After − Before) |
|---|---|---|---|
| Control | A = 23.33 | B = 21.17 | B − A = −2.16 |
| Treatment | C = 20.44 | D = 21.03 | D − C = 0.59 |
| Δ (Treatment − Control) | C − A = −2.89 | D − B = −0.14 | 0.59 − (−2.16) = 2.75 or −0.14 − (−2.89) = 2.75 |

Δ (After − Before) = within-unit change
Δ (Treatment − Control) = across-group change
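The arithmetic in the table can be checked in a few lines of R. This is only a minimal sketch built from the four cell means quoted above; the variable names are illustrative, and the group labels follow the slides (Pennsylvania as control, New Jersey as treatment).

```r
# Cell means taken from the 2x2 table above
A <- 23.33  # control (Pennsylvania), before
B <- 21.17  # control (Pennsylvania), after
C <- 20.44  # treatment (New Jersey), before
D <- 21.03  # treatment (New Jersey), after

# Route 1: difference of the within-group changes
did_within <- (D - C) - (B - A)

# Route 2: difference of the across-group gaps
did_across <- (D - B) - (C - A)

# Both routes give the same difference-in-differences, 2.75
print(round(c(did_within, did_across), 2))
```

Both routes rearrange the same four numbers, which is why they always agree.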
Conventional wisdom (in economics)
Raising the minimum wage reduces employment due to higher labor costs.
Card & Krueger’s finding:
After New Jersey raised the minimum wage, employment increased slightly at fast-food restaurants compared to Pennsylvania.
Methodological innovation:
They used a natural experiment with a difference-in-differences approach — unusual at the time for labor economics.
The way we can estimate the causal effect is by running the following model:
1. Model
\[ Y_{it} = \beta_0 + \color{blue}{\beta_1 \cdot \text{Group}_i} + \color{purple}{\beta_2 \cdot \text{Time}_t} + \color{red}{\beta_3 \cdot (\text{Group}_i \times \text{Time}_t)} + \epsilon_{it} \]
Where:
- Group = 1 if the unit is in the treatment group
- Time = 1 if the period is after the intervention
- β₀ – mean of the control group in the pre-treatment period
- β₁ – the difference between the treatment and control groups before the intervention
- β₂ – the change over time in the control group
- β₃ – the difference-in-differences (the coefficient on the interaction)
| Group | Before | After | Δ (After − Before) | 
|---|---|---|---|
| Control | β₀ | β₀ + β₂ | β₂ | 
| Treatment | β₀ + β₁ | β₀ + β₁ + β₂ + β₃ | β₂ + β₃ | 
| Δ (Treatment − Control) | β₁ | β₁ + β₃ | β₃ | 
(\(\Delta\) within the treatment group) − (\(\Delta\) within the control group) = (β₂ + β₃) − β₂ = β₃ = difference-in-differences = causal effect, provided the parallel trends assumption holds
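To see how the coefficients map onto the cells of the table, one can fit the model to the four Card and Krueger cell means. This is a toy sketch, not the authors' estimation: with one row per cell the model fits exactly, so there are no standard errors. The data frame `cells` and its column names are made up for illustration.

```r
# One row per cell of the 2x2 table (cell means from Card and Krueger)
cells <- data.frame(
  y     = c(23.33, 21.17, 20.44, 21.03),
  group = c(0, 0, 1, 1),  # 1 = treatment group (New Jersey)
  time  = c(0, 1, 0, 1)   # 1 = after the intervention
)

# group * time expands to group + time + group:time
fit <- lm(y ~ group * time, data = cells)

# (Intercept) = A, group = C - A, time = B - A, group:time = DiD
print(round(coef(fit), 2))
```

The `group:time` coefficient equals 2.75, the same number obtained by subtracting the cells by hand.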
Parallel Trends Assumption
The treatment and the control group have the same trends prior to the intervention.
We assume that the treatment group would have changed like the control group in the absence of the treatment.
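A rough way to look at pre-intervention trends is to compare the yearly gap between the two groups. The sketch below uses made-up group means (not the course data) in which the trends are exactly parallel. With real data the gap will be noisy, and a stable pre-period gap is only suggestive, because the assumption concerns what would have happened after the intervention without treatment.

```r
# Hypothetical yearly group means (made-up numbers for illustration only)
means <- data.frame(
  year      = 2013:2017,
  control   = c(50, 51, 52, 53, 54),
  treatment = c(53, 54, 55, 56, 57)
)

# Under parallel pre-trends the treatment-control gap is constant over time
gap <- means$treatment - means$control
print(gap)

# Equivalent check: year-to-year changes are the same in both groups
same_changes <- all(diff(means$treatment) == diff(means$control))
print(same_changes)
```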
Timing
Sometimes units receive treatment at different times, which can distort our estimates.
Figure: an example where the parallel trends assumption holds.
Figure: another example where the parallel trends assumption holds.
Figure: an example where the parallel trends assumption is violated.
Figure: another example where the parallel trends assumption is violated.
Units can receive treatment at different times, which can distort our estimate.
This is based on made-up data.
The World Bank wants to reduce the risk of malaria in Malawi by providing insecticide-treated bed nets.
So it provided insecticide-treated bed nets only to City B from 2017 to 2020 (in the data, the `after` indicator equals 1 for 2018 to 2020).
The World Bank selected 24 individuals (over 3 years) from City B and wants to investigate whether receiving such nets has any effect on people's risk of malaria.
Rows: 8,000
Columns: 10
$ year         <int> 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2013, 201…
$ income       <dbl> 1284.551, 1285.832, 1285.997, 1285.157, 1286.313, 1287.24…
$ age          <int> 33, 34, 35, 36, 37, 38, 39, 40, 51, 52, 53, 54, 55, 56, 5…
$ sex          <chr> "Woman", "Woman", "Woman", "Woman", "Woman", "Woman", "Wo…
$ malaria_risk <dbl> 36.26529, 35.10382, 73.17664, 28.98390, 51.18721, 51.9071…
$ id           <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, …
$ lat          <dbl> -11.27476, -11.27476, -11.27476, -11.27476, -11.27476, -1…
$ lon          <dbl> 34.03006, 34.03006, 34.03006, 34.03006, 34.03006, 34.0300…
$ after        <int> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, …
$ city         <chr> "City A", "City A", "City A", "City A", "City A", "City A…
Here are the individuals
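library(ggplot2)  # plotting
library(sf)       # sf objects drawn by geom_sf() and coord_sf()
library(lemon)    # reposition_legend()
# Keep one row per individual (the 2013 wave) and compute the map extent
panel_2013<-subset(diff_data, year==2013)
min_lon_x<-min(panel_2013$lon)
max_lon_x<-max(panel_2013$lon)
min_lat_y<-min(panel_2013$lat)
max_lat_y<-max(panel_2013$lat)
# Padding (in degrees) added around the points
error<-0.05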
map_2013<-ggplot() +
#  geom_raster(data=hdf,aes(X,Y,alpha=Hill)) +
  #scale_alpha(name = "Altitude", guide = "none")  + 
  #geom_sf(data = world, fill=NA, color = "black", linewidth = 0.4)+
#    geom_sf(data = select_comm, fill="green", color = "green", linewidth = 0.4)+
#  geom_sf(data = mwi1, fill=NA, color = "blue", linewidth = 0.2)+
#  geom_sf(data = mwi3, fill=NA, color = "red", linewidth = 0.4)+
  geom_sf(data = mwi1_lab, fill=NA, color = "blue", linewidth = 0.9)+
  geom_point(data = panel_2013, aes(x=lon, y=lat, color = city, size = malaria_risk), alpha=0.1)+
    scale_radius(limits = c(0, NA), range = c(0, 5))+
#  scale_size_continuous(range = c(1,4))+
  #  geom_sf(data = roads_crop1, fill=NA, color = "black", linewidth = 0.3)+
  theme_bw()+
  labs(x = "Longitude", y="Latitude")+
  ggtitle("The Three Districts Selected in Nkhata Bay, Malawi\n Location of 1000 Individuals\n For the Experiment, 2013")+
    theme(axis.text.x = element_text(size=14),
        axis.text.y = element_text(size=14),
        axis.title=element_text(size=14),
        plot.title = element_text(hjust = 0.5),
        #Legend.position values should be between 0 and 1. c(0,0) corresponds to the "bottom left"
        #and c(1,1) corresponds to the "top right" position.
        legend.box.background = element_rect(fill='white'),
        legend.background = element_blank(),
        legend.text=element_text(size=12))+
      coord_sf(xlim = c(min_lon_x-3*error, max_lon_x+3*error), ylim = c(min_lat_y-error, max_lat_y+error), expand = FALSE)+
      ggspatial::annotation_scale(location = 'tr')
map_2013<-reposition_legend(map_2013, 'bottom left')
Before we conduct any analysis, it is important to get a sense of our data.
| Variable | Unique | Missing Pct. | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| year | 8 | 0 | 2016.5 | 2.3 | 2013.0 | 2016.5 | 2020.0 |
| income | 8000 | 0 | 1249.9 | 160.6 | 836.3 | 1246.8 | 1710.7 |
| age | 87 | 0 | 32.3 | 16.0 | 1.0 | 31.0 | 87.0 |
| malaria_risk | 8000 | 0 | 48.9 | 12.7 | 0.0 | 49.0 | 100.0 |
| id | 1000 | 0 | 500.5 | 288.7 | 1.0 | 500.5 | 1000.0 |
| lat | 1000 | 0 | -11.2 | 0.1 | -11.3 | -11.2 | -11.1 |
| lon | 1000 | 0 | 34.1 | 0.1 | 34.0 | 34.1 | 34.2 |
| after | 2 | 0 | 0.4 | 0.5 | 0.0 | 0.0 | 1.0 |

| Variable | Level | N | % |
|---|---|---|---|
| sex | Man | 3224 | 40.3 |
| | Woman | 4776 | 59.7 |
| city | City A | 7808 | 97.6 |
| | City B | 192 | 2.4 |
So, we are interested in the causal effect of the program, which is captured by the interaction coefficient \(\beta_3\).
\[ Y_{it} = \beta_0 + \beta_1 \text{Group}_i + \beta_2 \text{Time}_t + \beta_3 (\text{Group}_i \times \text{Time}_t) + \epsilon_{it} \]
Or, for our data:
\[ \color{green}{\text{Malaria Risk}_{it}} = \beta_0 + \color{blue}{\beta_1 \text{City B}_i} + \color{purple}{\beta_2 \text{After}_t} + \color{red}{\beta_3 (\text{City B}_i \times \text{After}_t)} + \epsilon_{it} \]
We can estimate \(\beta_3\) by running this regression:
| | (1) |
|---|---|
| (Intercept) | 50.629*** |
| | (0.179) |
| cityCity B | 3.071** |
| | (1.155) |
| after | -4.532*** |
| | (0.292) |
| cityCity B × after | -7.623*** |
| | (1.886) |
| Num.Obs. | 8000 |
| R2 | 0.034 |

Note: + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001. Standard errors in parentheses.
4. Interpretation?
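Before the intervention, being in City B is associated with a risk about 3.1 points higher than in City A on average. In City A, the period after 2017 is associated with a risk about 4.5 points lower than before. Under the parallel trends assumption, receiving bed nets in City B after 2017 lowers risk by a further 7.6 points.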
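A table like the one above can be produced with `lm()` and an interaction term. The sketch below does not use the course data: it simulates a stand-in data frame `sim` with the same column names (`city`, `after`, `malaria_risk`), and every number in the simulation, including the true effect of −7, is an assumption chosen for illustration.

```r
set.seed(42)

# Simulated stand-in for diff_data (hypothetical values, same column names)
n_id  <- 500
years <- 2013:2020
sim <- expand.grid(id = seq_len(n_id), year = years)
sim$city  <- ifelse(sim$id <= 100, "City B", "City A")
sim$after <- as.integer(sim$year > 2017)

# True effects chosen for the simulation: the program lowers risk by 7 points
treated <- sim$city == "City B" & sim$after == 1
sim$malaria_risk <- 50 + 3 * (sim$city == "City B") - 4 * sim$after -
  7 * treated + rnorm(nrow(sim), sd = 5)

# city * after expands to city + after + city:after;
# the interaction coefficient is the difference-in-differences estimate
fit <- lm(malaria_risk ~ city * after, data = sim)
print(round(coef(fit), 2))
```

The coefficient labelled `cityCity B:after` should land close to the simulated effect of −7, in the same way that `cityCity B × after` in the table estimates the effect of the bed nets.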
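library("lemon")    # reposition_legend()
library("dplyr")    # group_by(), summarize(), %>%
library("ggplot2")  # plotting
# Mean malaria risk per year and city, with 95% confidence intervals
plot_data <- diff_data %>%
  group_by(year, city) %>%
  summarize(mean_risk = mean(malaria_risk),
            se_risk = sd(malaria_risk) / sqrt(n()),
            upper = mean_risk + (1.96 * se_risk),
            lower = mean_risk - (1.96 * se_risk),
            .groups = "drop")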
mean_risk<-ggplot(plot_data, aes(x = year, y = mean_risk, color = city)) +
  geom_vline(xintercept = 2017.5) +
  geom_errorbar(aes(ymin = lower, ymax = upper), 
                linewidth = 1, width = 0,  # linewidth replaces size for lines in ggplot2 >= 3.4.0
                position=position_dodge(width=0.04))+
  geom_line() +
  geom_point(size = 2, position=position_dodge(width=0.04))+
  labs(x = "Year", y = "Malaria Risk")+
  scale_y_continuous(breaks = (seq(40, 57, by = 3)),
                    limits = c(40, 57))+
  scale_x_continuous(breaks = (seq(2013, 2020, by = 1)),
                     limits = c(2012, 2021))+
  theme_bw() +
  theme(legend.position.inside = c(1, 0),
        #Legend.position values should be between 0 and 1. c(0,0) corresponds to the "bottom left"
        #and c(1,1) corresponds to the "top right" position.
        legend.box.background = element_rect(fill='white'),
        legend.background = element_blank())
#Repositioning legend
mean_risk<-reposition_legend(mean_risk, 'bottom left')
Popescu (JCU): Differences in Differences