library("broom")
library("dplyr")
library("ggplot2")
library("ggrepel")
library("modelsummary")
library("ggdag")
library("purrr")
library("stringr")
library("ggpubr")
library("ggpattern")
# For Maps
library("sf")
library("stars")
library("sp")
library("rnaturalearth")
library("rnaturalearthdata")
library("ggspatial")
library("dichromat")
# For interactive maps
library("leaflet")
library("leaflegend")
library("htmltools")
# Removing previous files and setting the
# directory
rm(list = ls())
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week11/lab/")
1 Introduction
In this hypothetical situation (the data is made up), the World Bank is planning on launching a program designed to reduce the risk of malaria in Malawi by providing insecticide-treated bed nets. So they have funding to run a randomized controlled trial (RCT) in three districts in Malawi that are very close to the Malawi lake. The World Bank randomly selected 1000 individuals from these districts and they want to investigate whether receiving such nets has any effect on people’s risk of malaria.
In this document, we want to calculate \(E(\text{Post-Malaria Risk | Net})\). First, create a folder called lab11.
In a first instance, we want to draw a causal model which explains the effect of the insecticide-treated bed nets on participant risk of malaria, given the confounding caused by age, sex, and income. We think that the risk of malaria will be impacted by age, sex, income, prior risk of malaria, and whether people used the insecticide-treated net. See the models below (together with the code to produce them).
Step 1 Code
#Basic structure of a dag
<- dagify(post_malaria_risk ~ net + age + sex + income + pre_malaria_risk,
malaria_dag ~ income + pre_malaria_risk,
net exposure = "net",
outcome = "post_malaria_risk",
labels = c(post_malaria_risk = "Post Malaria Risk",
net = "Mosquito Net",
age = "Age",
sex = "Sex",
income = "Income",
pre_malaria_risk = "Pre Malaria Risk"),
coords = list(x = c(net = 2, post_malaria_risk=7, income = 5, age = 2,
sex = 4, pre_malaria_risk=6),
y = c(net = 3, post_malaria_risk=2, income = 1, age = 2,
sex = 4, pre_malaria_risk=4)))
#Cleaning the dags and turning it into a dataframe.
<-data.frame(tidy_dagitty(malaria_dag))
bigger_dag #Labeling variables
$type<-"Confounder"
bigger_dag$type[bigger_dag$name=="post_malaria_risk"]<-"Outcome"
bigger_dag$type[bigger_dag$name=="net"]<-"Intervention"
bigger_dag#Identifying min and max coordinate
<-min(bigger_dag$x)
min_lon_x<-max(bigger_dag$x)
max_lon_x<-min(bigger_dag$y)
min_lat_y<-max(bigger_dag$y) max_lat_y
Step 2 Code
#Defining colors
= c("Outcome"="green3",
col "Intervention"="red",
"Confounder"="grey60")
#Defining order
<-c("Outcome", "Intervention", "Confounder")
order_col
#Producing the graph
<-ggplot(data = bigger_dag, aes(x = x, y = y, xend = xend, yend = yend, color=type)) +
example_daggeom_dag_point() +
geom_dag_edges() +
coord_sf(xlim = c(min_lon_x-1, max_lon_x+1),
ylim = c(min_lat_y-1, max_lat_y+1))+
scale_colour_manual(values = col,
name = "Group",
breaks = order_col)+
geom_label_repel(data = subset(bigger_dag, !duplicated(bigger_dag$label)),
aes(label = label), fill = alpha(c("white"),0.8))+
theme_bw()+
labs(x = "", y="")
#Saving the ggplot
#NOTE: Create a folder called "graphs" in your lab folder.
ggsave(example_dag, file = "./graphs/example_dag1.jpg",
height = 20, width = 25,
units = "cm", dpi = 200)
example_dag
In the experiment, we want to manipulate access to the program (providing nets). Thus, we can calculate \(E(\text{malaria risk | net})\). Following the rules of causal diagrams, we can delete all the arrows going into the program node, because the nets are randomly assigned.
Step 3 Code
#Basic structure of a dag
<- dagify(post_malaria_risk ~ net + age + sex + income + pre_malaria_risk,
malaria_dag2 exposure = "net",
outcome = "post_malaria_risk",
labels = c(post_malaria_risk = "Post Malaria Risk",
net = "Mosquito Net",
age = "Age",
sex = "Sex",
income = "Income",
pre_malaria_risk = "Pre Malaria Risk"),
coords = list(x = c(net = 2, post_malaria_risk=7, income = 5, age = 2,
sex = 4, pre_malaria_risk=6),
y = c(net = 3, post_malaria_risk=2, income = 1, age = 2,
sex = 4, pre_malaria_risk=4)))
Step 4 Code
<-data.frame(tidy_dagitty(malaria_dag2))
bigger_dag $type<-NA
bigger_dag$type<-"Confounder"
bigger_dag$type[bigger_dag$name=="post_malaria_risk"]<-"Outcome"
bigger_dag$type[bigger_dag$name=="net"]<-"Intervention"
bigger_dag<-min(bigger_dag$x)
min_lon_x<-max(bigger_dag$x)
max_lon_x<-min(bigger_dag$y)
min_lat_y<-max(bigger_dag$y) max_lat_y
Step 5 Code
#Defining colors
= c("Outcome"="green3",
col "Intervention"="red",
"Confounder"="grey60")
#Defining order
<-c("Outcome", "Intervention", "Confounder")
order_col
#Producing the graph
<-ggplot(data = bigger_dag, aes(x = x, y = y, xend = xend, yend = yend, color=type)) +
example_dag2geom_dag_point() +
geom_dag_edges() +
# geom_text(data = bigger_dag, aes(x = x, y = y, label = label),
# hjust=0.4, vjust=0, color = "black")+
coord_sf(xlim = c(min_lon_x-1, max_lon_x+1),
ylim = c(min_lat_y-1, max_lat_y+1))+
scale_colour_manual(values = col,
name = "Group",
breaks = order_col)+
geom_label_repel(data = subset(bigger_dag, !duplicated(bigger_dag$label)),
aes(label = label), fill = alpha(c("white"),0.8))+
theme_bw()+
labs(x = "", y="")
#NOTE: Make sure that there is a folder called "graphs" in your lab folder.
#Saving the file
ggsave(example_dag2, file = "./graphs/example_dag2.jpg",
height = 20, width = 25,
units = "cm", dpi = 200)
#Displaying the file
example_dag2
2 Loading the Data
Within the lab11 folder, create another folder and call it data. Go to the following links and download the following:
Place all the data in the data folder.
3 Inspecting the Area
Let’s assume that the experiment has been conducted and the data has come in. Let us first have a first look at the area where the World Bank ran this RCT.
3.1 Loading the Shape Files
st_read
reads the shapefiles from your hardrive. read_stars
reads a tif file that shows the elevation for your area of interest. ne_countries
loads a shapefile with all the countries around the world.
#Country shapefile
<- st_read("./data/malawi-latest-free/gadm41_MWI_0.shp")
mwi #Region 1 shapefile
<- st_read("./data/malawi-latest-free/gadm41_MWI_1.shp")
mwi1 #Region 3 shapefile
<- st_read("./data/malawi-latest-free/gadm41_MWI_3.shp")
mwi3 #Selected communes shapefile
<- st_read("./data/malawi-latest-free/selected_communes3.shp")
select_comm #Raster with elevation
<- read_stars("./data/elev.tif")
elev #World Shapefile
<- ne_countries(scale = "medium", returnclass = "sf") world
3.2 Simplifying the shapefiles for better rendering
st_simplify
simplifies the polygons for quicker rendering.
#Simplifying polygons for quicker rendering
#Switched off spherical geometry (s2)
::sf_use_s2(FALSE)
sf#Note: you have to play around with dTolerance to figure out the appropriate value
#Too little - your computer will need more time to render the file
#Too much - the country polygons will be too simple
<-st_simplify(mwi, dTolerance = 0.005)
mwi<-st_simplify(mwi1, dTolerance = 0.005)
mwi1<-st_simplify(mwi3, dTolerance = 0.005)
mwi3<-st_simplify(select_comm, dTolerance = 0.0005)
select_comm<-st_simplify(world, dTolerance = 0.005) world
3.3 Calculating the appropriate zoom level for Malawi
We now calculate the zoom level for the entire country of Malawi.
#Calculating the appropriate zoom level
<- st_bbox(mwi)
mwi_extent
<-mwi_extent["xmin"]
min_lon_x<-mwi_extent["xmax"]
max_lon_x
<-mwi_extent["ymin"]
min_lat_y<-mwi_extent["ymax"] max_lat_y
We now create a new polygon to emphasize where Malawi is on the World Map.
#Creating a dataframe to emphasize Malawi and adding some error
= 5
error
= data.frame("lon" = c(min_lon_x - error, max_lon_x + error,
sp_df - error, max_lon_x + error),
min_lon_x "lat" = c(min_lat_y - error, min_lat_y - error,
+ error, max_lat_y + error)) max_lat_y
We now make a spatial polygon out of it so that we can highlight where Malawi is on the world map.
#Creating polygon to indicate zoom
<- sp_df %>%
poly st_as_sf(coords = c("lon", "lat"),
crs = st_crs(mwi)) %>%
st_bbox() %>%
st_as_sfc()
#Examining what we just created
ggplot() +
geom_sf(data = mwi, fill="red", color = "red", linewidth = 0.9)+
geom_sf(data = poly, fill=NA, color = "red", linewidth = 0.3)
3.4 Cropping the raster
We now crop our raster only to the area of interest. Again, if we have too much, our computer will have a hard time rendering the file.
#Step1: Reducing raster resolution: Higher dx/day values mean lower resolution
<-st_warp(elev, st_as_stars(st_bbox(elev), dx = 0.05, dy = 0.05))
elev2#Step2: Removing really high pixel values for better elevation contrast relevant for our area
> 2000] <- NA
elev2[elev2 #Step3: Cropping
<-st_crop(elev2, poly) dem_mwi
<-ggplot() +
map_wordgeom_sf(data = world, fill=NA, color = "black", linewidth = 0.3)+
geom_sf(data = mwi, fill="red", color = "red", linewidth = 0.7)+
geom_sf(data = poly, fill=NA, color = "red", linewidth = 0.9)+
theme_bw()+
labs(x = "Longitude", y="Latitude")+
ggtitle("Malawi on the World Map")+
theme(axis.text.x = element_text(size=14),
axis.text.y = element_text(size=14),
axis.title=element_text(size=14),
plot.title = element_text(hjust = 0.5),
legend.box.background = element_rect(fill='white'),
legend.background = element_blank(),
legend.text=element_text(size=12))
map_word
#This is how to save the map on your drive
#Make sure that within the lab folder, there is a "graphs" folder
ggsave(map_word, file="./graphs/map_world.jpg", height=15, width=30, units = "cm", dpi=300)
3.5 Mapping the region of interest - Malawi
#Selecting only the region of interest
<-subset(mwi1, NAME_1=="Nkhata Bay")
mwi1_lab#Defining colors for raster
<- colorRampPalette(c("grey90", "grey30"))
grey_palette # Generate a sequence of colors based on the number of breaks desired
<- 10 # Adjust this based on your data
num_breaks <- grey_palette(num_breaks) grey_colors
This is the region of Nkhata Bay where the World Bank ran the RCT. Within the region of Nkhata Bay, the World Bank chose three settlements close to the lake where people could be exposed to malaria.
We now want to draw a new map to emphasize where Nkhata Bay is.
<-0.2
error<-ggplot() +
map_malawigeom_stars(data=elev2, show.legend=FALSE) +
scale_fill_gradientn(colours = grey_colors)+
geom_sf(data = select_comm, fill="green", color = "green", linewidth = 0.4)+
geom_sf(data = mwi1, fill=NA, color = "blue", linewidth = 0.2)+
geom_sf(data = mwi1_lab, fill=NA, color = "blue", linewidth = 0.9)+
geom_sf_text(data=mwi1_lab, aes(label = NAME_1), size=3.5)+
theme_bw()+
labs(x = "Longitude", y="Latitude")+
ggtitle("Nkhata Bay in Malawi")+
theme(axis.text.x = element_text(size=14),
axis.text.y = element_text(size=14),
axis.title=element_text(size=14),
plot.title = element_text(hjust = 0.5),
legend.position = c(1, 0),
#Legend.position values should be between 0 and 1. c(0,0) corresponds to the "bottom left"
#and c(1,1) corresponds to the "top right" position.
legend.box.background = element_rect(fill='white'),
legend.background = element_blank(),
legend.text=element_text(size=12))+
coord_sf(xlim = c(min_lon_x-5*error, max_lon_x+5*error),
ylim = c(min_lat_y-error, max_lat_y+error))+
::annotation_scale(location = 'tr')
ggspatial#Saving the map
ggsave(map_malawi, file="./graphs/map_malawi.jpg", height=15, width=15, units = "cm", dpi=300)
map_malawi
3.6 Calculating the appropriate zoom level for the Selected Communes
We now calculate the zoom level for the region of interest - Nkhata Bay.
#Calculating the appropriate zoom level for the selected commune
<- st_bbox(select_comm)
mwi_extent
<-mwi_extent["xmin"]
min_lon_x<-mwi_extent["xmax"]
max_lon_x
<-mwi_extent["ymin"]
min_lat_y<-mwi_extent["ymax"] max_lat_y
#Re-reading the shapefiles.
#We now want more accurate polygons since we are zooming in.
#This is why we load them again
<- st_read("./data/malawi-latest-free/gadm41_MWI_0.shp")
mwi <- st_read("./data/malawi-latest-free/gadm41_MWI_1.shp")
mwi1 <- st_read("./data/malawi-latest-free/gadm41_MWI_3.shp")
mwi3 <- st_read("./data/malawi-latest-free/selected_communes3.shp")
select_comm <- read_stars("./data/elev.tif")
dem
#Step1: Reducing raster resolution: Higher dx/day values mean lower resolution
<-st_warp(dem, st_as_stars(st_bbox(elev), dx = 0.02, dy = 0.02))
elev2#Step2: Removing really high pixel values for good contrast in MWI
> 2000] <- NA
elev2[elev2
#Step3: Creating a dataframe to emphasize Malawi and adding some error
= 0.5
error = data.frame("lon" = c(min_lon_x - error, max_lon_x + error,
sp_df - error, max_lon_x + error),
min_lon_x "lat" = c(min_lat_y - error, min_lat_y - error,
+ error, max_lat_y + error))
max_lat_y
#Step4: Creating polygon to indicate zoom
<- sp_df %>%
poly st_as_sf(coords = c("lon", "lat"),
crs = st_crs(mwi)) %>%
st_bbox() %>%
st_as_sfc()
#Step5: Cropping
<-st_crop(elev2, poly)
dem_mwi
<- ne_countries(scale = "medium", returnclass = "sf")
world #Turning off spherical geometries
::sf_use_s2(FALSE) sf
#We also want to include roads in our map, to provide more context.
#Reading the OSM roads data
<-st_read("./data/malawi-latest-free/gis_osm_roads_free_1_crop2.shp")
roads#Step1: Dissolving polylines
<- st_combine(roads)
roads2 #Step2: Turning roads2 into an sf object
<-st_as_sf(roads2)
roads3#Step3: Only the roads of interest
<- sf::st_intersection(roads3, select_comm) roads_crop1
3.7 Mapping the Individuals
Let us first look at our data.
<- read.csv(file = './data/rct_data_geo_malawi.csv')
rct_data glimpse(rct_data, n=10)
Rows: 1,000
Columns: 10
$ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ lat <dbl> -11.27476, -11.15300, -11.13967, -11.25544, -11.1777…
$ lon <dbl> 34.03006, 34.14594, 34.14297, 34.14619, 34.20031, 34…
$ income <dbl> 409.4701, 520.8072, 581.3331, 324.0727, 532.1844, 53…
$ age <int> 10, 35, 7, 43, 45, 51, 24, 17, 38, 69, 10, 21, 32, 3…
$ sex <chr> "Man", "Woman", "Woman", "Woman", "Man", "Woman", "M…
$ health <dbl> 82.78928, 81.18602, 80.57399, 82.84023, 87.40737, 57…
$ net <int> 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1…
$ malaria_risk_pre <dbl> 35.74454, 36.65260, 22.87415, 35.41934, 27.49873, 42…
$ malaria_risk_post <dbl> 42.52199, 34.27589, 32.67552, 43.87256, 27.16945, 41…
<-0.05
error<-ggplot() +
map_malawi_exp#Plotting the elevation raster
geom_stars(data=elev2, show.legend=FALSE) +
#Making elevation grey
scale_fill_gradientn(colours = grey_colors)+
#Plotting the selected commune
geom_sf(data = select_comm, fill="green", color = "green", linewidth = 0.4)+
geom_sf(data = mwi1, fill=NA, color = "blue", linewidth = 0.2)+
geom_sf(data = mwi3, fill=NA, color = "red", linewidth = 0.4)+
geom_sf(data = mwi1_lab, fill=NA, color = "blue", linewidth = 0.9)+
#Plotting the roads
geom_sf(data = roads_crop1, fill=NA, color = "black", linewidth = 0.3)+
#Plotting the RCT data
geom_point(data = rct_data, aes(x = lon, y= lat), color = "red", size=0.3)+
theme_bw()+
labs(x = "Longitude", y="Latitude")+
ggtitle("The Three Districts Selected in Nkhata Bay, Malawi\n Location of 1000 Individuals\n For the Experiment")+
theme(axis.text.x = element_text(size=14),
axis.text.y = element_text(size=14),
axis.title=element_text(size=14),
plot.title = element_text(hjust = 0.5),
legend.position = c(1, 0),
#Legend.position values should be between 0 and 1. c(0,0) corresponds to the "bottom left"
#and c(1,1) corresponds to the "top right" position.
legend.box.background = element_rect(fill='white'),
legend.background = element_blank(),
legend.text=element_text(size=12))+
coord_sf(xlim = c(min_lon_x-3*error, max_lon_x+3*error),
ylim = c(min_lat_y-error, max_lat_y+error))+
#Adding a scale to the map
::annotation_scale(location = 'tr')
ggspatial
#Saving the picture in folder one your machine called "graphs"
ggsave(map_malawi_exp, file="./graphs/map_malawi_exp.jpg", height=15, width=15, units = "cm", dpi=300)
map_malawi_exp
3.8 Mapping the Individuals in an Interactive Way
Here is how we can make an interactive map.
$treat<-NA
rct_data$treat[rct_data$net==1]<-"Treatment"
rct_data$treat[rct_data$net==0]<-"Control"
rct_data
<-
pal colorFactor(palette = c("#8ED5D8", "#F2B6B2"),
levels = c("Control", "Treatment"))
leaflet(rct_data) %>% addTiles() %>%
addPolygons(data = select_comm,
group = "Nkhata Bay",
# stroke parameters
weight = 3, color = "blue",
# fill parameters
fillColor = "blue", fillOpacity = 0.0)%>%
addCircleMarkers(lng = ~lon, lat = ~lat,
label = ~htmlEscape(paste("Indiv.: ", id)), radius = 4, color = ~pal(treat))%>%
addLegendFactor(
pal=pal,
values = rct_data$treat,
position = 'topright',
shape = 'circle',
width = 10,
height = 10,
opacity = 0.5)%>%
addScaleBar(
position = c("bottomright"))
4 Data Analysis
We will now do the data analysis. Until now, we looked at various ways of mapping our data.
4.1 Summary statisics
Before we conduct any analysis, it is important to get a sense of our data, the minimum and maximum, the mean, and whether our variables are normally distributed or not. Let us now create a summary statistics table:
datasummary_skim(rct_data)
Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram | |
---|---|---|---|---|---|---|---|---|
id | 1000 | 0 | 500.5 | 288.8 | 1.0 | 500.5 | 1000.0 | |
lat | 1000 | 0 | -11.2 | 0.1 | -11.3 | -11.2 | -11.1 | |
lon | 1000 | 0 | 34.1 | 0.1 | 34.0 | 34.1 | 34.2 | |
income | 1000 | 0 | 498.0 | 74.8 | 245.3 | 497.0 | 739.7 | |
age | 76 | 0 | 28.9 | 16.1 | 1.0 | 27.0 | 89.0 | |
health | 1000 | 0 | 69.1 | 12.9 | 30.9 | 69.0 | 100.0 | |
net | 2 | 0 | 0.5 | 0.5 | 0.0 | 1.0 | 1.0 | |
malaria_risk_pre | 1000 | 0 | 38.0 | 9.8 | 5.0 | 37.9 | 70.0 | |
malaria_risk_post | 1000 | 0 | 40.4 | 10.4 | 5.0 | 40.5 | 70.0 | |
N | % | |||||||
sex | Man | 396 | 39.6 | |||||
Woman | 604 | 60.4 | ||||||
treat | Control | 478 | 47.8 | |||||
Treatment | 522 | 52.2 |
Income and malaria seem to be normally distributed. The net variable is binary, as expected. Age and health seem to be skewed to the left.
4.2 Checking balance
Before calculating the effect of the program, we should first check how well balanced assignment was. As you can tell from the results below, assignment to the program was pretty much split 50/50.
<-rct_data %>%
people_randomizedcount(net) %>%
mutate(prop = n / sum(n))
people_randomized
We also want to check to see how well balanced the treatment and control groups were in participants’ pre-treatment characteristics:
#Creating a 0-1 variable representing women
$sex_num<-as.numeric(as.factor(as.character(rct_data$sex)))
rct_data$sex_num[rct_data$sex_num==2]<-0
rct_data
#Calculating the average variably by treated vs. non-treated
<-rct_data %>%
people_randomizedgroup_by(net) %>%
summarize(prop_male = mean(sex_num),
avg_age = mean(age),
avg_malaria_risk_pre = mean(malaria_risk_pre),
avg_health = mean(health),
avg_income = mean(income))
people_randomized
These variables appear fairly well balanced. To check that there aren’t any statistically significant differences between the groups, we should also conduct some t-tests.
We have learned to do t.tests in the following way:
#Method1
<-subset(rct_data, net==1)
rct_data1<-subset(rct_data, net==0)
rct_data0t.test(rct_data1$sex_num, rct_data0$sex_num, alternative=c("two.sided"))
Welch Two Sample t-test
data: rct_data1$sex_num and rct_data0$sex_num
t = 0.16658, df = 990.67, p-value = 0.8677
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.05564916 0.06597315
sample estimates:
mean of x mean of y
0.3984674 0.3933054
Another way to get the same result is the following:
#Method2
t.test(sex_num ~ net, data = rct_data, alternative=c("two.sided"))
Welch Two Sample t-test
data: sex_num by net
t = -0.16658, df = 990.67, p-value = 0.8677
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-0.06597315 0.05564916
sample estimates:
mean in group 0 mean in group 1
0.3933054 0.3984674
Thus, we can run t-test for all the following groups:
t.test(sex_num ~ net, data = rct_data, alternative=c("two.sided"))
Welch Two Sample t-test
data: sex_num by net
t = -0.16658, df = 990.67, p-value = 0.8677
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-0.06597315 0.05564916
sample estimates:
mean in group 0 mean in group 1
0.3933054 0.3984674
t.test(age ~ net, data = rct_data, alternative=c("two.sided"))
Welch Two Sample t-test
data: age by net
t = -1.275, df = 997.37, p-value = 0.2026
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-3.2870789 0.6979223
sample estimates:
mean in group 0 mean in group 1
28.25523 29.54981
t.test(malaria_risk_pre ~ net, data = rct_data, alternative=c("two.sided"))
Welch Two Sample t-test
data: malaria_risk_pre by net
t = 0.015818, df = 997.7, p-value = 0.9874
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-1.208182 1.227819
sample estimates:
mean in group 0 mean in group 1
37.96975 37.95993
t.test(health ~ net, data = rct_data, alternative=c("two.sided"))
Welch Two Sample t-test
data: health by net
t = -0.40132, df = 996.97, p-value = 0.6883
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-1.931451 1.275577
sample estimates:
mean in group 0 mean in group 1
68.90312 69.23106
t.test(income ~ net, data = rct_data, alternative=c("two.sided"))
Welch Two Sample t-test
data: income by net
t = 0.19875, df = 991.26, p-value = 0.8425
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-8.353269 10.236022
sample estimates:
mean in group 0 mean in group 1
498.4966 497.5552
We can also simply ask to be shown the p-values.
t.test(sex_num ~ net, data = rct_data, alternative=c("two.sided"))["p.value"]
$p.value
[1] 0.8677374
t.test(age ~ net, data = rct_data, alternative=c("two.sided"))["p.value"]
$p.value
[1] 0.2026112
t.test(malaria_risk_pre ~ net, data = rct_data, alternative=c("two.sided"))["p.value"]
$p.value
[1] 0.9873827
t.test(health ~ net, data = rct_data, alternative=c("two.sided"))["p.value"]
$p.value
[1] 0.6882691
t.test(income ~ net, data = rct_data, alternative=c("two.sided"))["p.value"]
$p.value
[1] 0.8424983
Let us also plot these differences.
4.3 Plotting Differences
The following graphs plot the point ranges for the variables of interest.
#The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean
#Remember that the SE is the SD (var)/Sqrt(Sample size)
#The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean
#Remember that the SE is the SD (var)/Sqrt(Sample size)
<- ggplot(rct_data, aes(x = treat, y = sex_num, color = treat)) +
plot_diff_sex stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
guides(color = "none") +
labs(x = NULL, y = "Prop. male")
<- ggplot(rct_data, aes(x = treat, y = age, color = treat)) +
plot_diff_age stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
guides(color = "none") +
labs(x = NULL, y = "Age")
<- ggplot(rct_data, aes(x = treat, y = malaria_risk_pre, color = treat)) +
plot_diff_malaria_risk_pre stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
guides(color = "none") +
labs(x = NULL, y = "Mal. Risk Bef.")
<- ggplot(rct_data, aes(x = treat, y = health, color = treat)) +
plot_diff_health stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
guides(color = "none") +
labs(x = NULL, y = "Health")
<- ggplot(rct_data, aes(x = treat, y = income, color = treat)) +
plot_diff_income stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
guides(color = "none") +
labs(x = NULL, y = "Income")
ggarrange(plot_diff_sex, plot_diff_age,
plot_diff_malaria_risk_pre,ncol=2, nrow=3, common.legend = TRUE) plot_diff_health, plot_diff_income,
Let us now also include the distributions of these variables by group (treatment and control group).
For sex, given its binary nature, we simply need to do a barplot by group (men and women in the treatment vs. control group).
<-ggplot(rct_data, aes(pattern = sex, x = treat)) +
plot_diff_sex_dgeom_bar(data = rct_data,
position = "dodge",
aes(fill=treat),
alpha = 0.5) +
geom_bar_pattern(position = "dodge",
aes(pattern=sex),
fill = NA,
linewidth=0.5,
color="black")+
scale_pattern_manual(name = NULL,
values = c(
Man = "stripe",
Woman = "circle"))+
labs(fill = NULL)+
xlab(NULL)
plot_diff_sex_d
For the other variables, given their continuous nature, we need to plot a histogram or a density function or both. A histogram shows the frequency distribution of our variable (e.g. how many times we encounter the age 32, 40, 56, etc.). A density graph shows the probability density function of the same variable. It shows a continuous line, where the height of the line at a given point represents the probability density of observing a value within a particular range.
We can first plot a histogram with age:
<-ggplot(rct_data, aes(x = age, fill = treat, group=treat)) +
plot_diff_age_histguides(color = "none") +
geom_histogram(binwidth = 2,
color = "white",
alpha = 0.5)+
labs(fill = NULL)+
xlab(NULL)
plot_diff_age_hist
We can then plot its density function.
<-ggplot(rct_data, aes(x = age, fill = treat, group=treat)) +
plot_diff_age_denguides(color = "none") +
geom_density(aes(y = after_stat(density), color=treat),
fill=NA)+
labs(fill = NULL)+
xlab(NULL)
plot_diff_age_den
Or we can plot them together. Note the mutiplicative factor of 2000 in geom_density
in order to have comparable scales on the two y-axes.
<-ggplot(rct_data, aes(x = age, fill = treat, group=treat)) +
plot_diff_age_dguides(color = "none") +
geom_histogram(binwidth = 2,
color = "white",
alpha = 0.5)+
#Adding density and multiplying by a manually defined value
#for visibility
geom_density(aes(y = after_stat(density)* 2000, color=treat),
#alpha = 0.5,
fill=NA)+
#Adding a secondary axis
scale_y_continuous(name = "count",
sec.axis =
sec_axis(~.x/2000,
name = "density"))+
labs(fill = NULL)+
xlab(NULL)
plot_diff_age_d
For the other variables, I am simply plotting them together.
<-ggplot(rct_data, aes(x = malaria_risk_pre, fill = treat, group=treat)) +
plot_diff_malaria_risk_pre_dguides(color = "none") +
geom_histogram(binwidth = 2,
color = "white",
alpha = 0.5)+
#Adding density and multiplying by a manually defined value
#for visibility
geom_density(aes(y = after_stat(density)* 2000, color=treat),
#alpha = 0.5,
fill=NA)+
#Adding a secondary axis
scale_y_continuous(name = "count",
sec.axis =
sec_axis(~.x/2000,
name = "density"))+
labs(fill = NULL)+
xlab(NULL)
plot_diff_malaria_risk_pre_d
<-ggplot(rct_data, aes(x = health, fill = treat, group=treat)) +
plot_diff_health_dguides(color = "none") +
geom_histogram(binwidth = 2,
color = "white",
alpha = 0.5)+
#Adding density and multiplying by a manually defined value
#for visibility
geom_density(aes(y = after_stat(density)* 2000, color=treat),
#alpha = 0.5,
fill=NA)+
#Adding a secondary axis
scale_y_continuous(name = "count",
sec.axis =
sec_axis(~.x/2000,
name = "density"))+
labs(fill = NULL)+
xlab(NULL)
plot_diff_health_d
<-ggplot(rct_data, aes(x = income, fill = treat, group=treat)) +
plot_diff_income_dguides(color = "none") +
geom_histogram(binwidth = 10,
color = "white",
alpha = 0.5)+
#Adding density and multiplying by a manually defined value
#for visibility
geom_density(aes(y = after_stat(density)* 10000, color=treat),
#alpha = 0.5,
fill=NA)+
#Adding a secondary axis
scale_y_continuous(name = "count",
sec.axis =
sec_axis(~.x/10000,
name = "density"))+
labs(fill = NULL)+
xlab(NULL)
plot_diff_income_d
We can now plot the differences and the distributions side by side.
<-ggarrange(plot_diff_sex, plot_diff_sex_d, ncol=2, common.legend = TRUE)
figure1annotate_figure(figure1, top = text_grob("Sex"))
<-ggarrange(plot_diff_age, plot_diff_age_d, ncol=2, common.legend = TRUE)
figure2annotate_figure(figure2, top = text_grob("Age"))
<-ggarrange(plot_diff_malaria_risk_pre, plot_diff_malaria_risk_pre_d, ncol=2, common.legend = TRUE)
figure3annotate_figure(figure3, top = text_grob("Malaria Risk Pre-Intervention"))
<-ggarrange(plot_diff_health, plot_diff_health_d, ncol=2, common.legend = TRUE)
figure4annotate_figure(figure4, top = text_grob("Health Score"))
<-ggarrange(plot_diff_income, plot_diff_income_d, ncol=2, common.legend = TRUE)
figure5annotate_figure(figure5, top = text_grob("Income"))
4.4 Estimating the Difference
So, we are interested in the causal effect of the program - \(E(\text{Post-Malaria Risk | Net})\). We can find this causal effect by calculating the average treatment effect or ATE:
\[ ATE = E(\overline{\text{Post-Malaria Risk}} | \text{Net = 1}) - E(\overline{\text{Post-Malaria Risk}} | \text{Net = 0}) \]
This is simply the average outcome for people in the program minus the average outcome for people not in the program. We can calculate the group averages:
<-rct_data %>%
people_randomizedgroup_by(net) %>%
summarize(avg_malaria_risk_post = mean(malaria_risk_post))
That’s 35.58 - 45.66, or -10.08, which means that the program caused an decrease in the risk of malaria of -10.08, on average. We can also simply run a regression with post-program malaria risk as the outcome variable and the program indicator variable as the explanatory variable. The result is the same.
<- lm(malaria_risk_post ~ net, data = rct_data)
model_rct1 summary(model_rct1)
Call:
lm(formula = malaria_risk_post ~ net, data = rct_data)
Residuals:
Min 1Q Median 3Q Max
-30.5794 -6.4719 0.0798 6.1909 28.9747
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.6621 0.4146 110.14 <2e-16 ***
net -10.0827 0.5738 -17.57 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.064 on 998 degrees of freedom
Multiple R-squared: 0.2363, Adjusted R-squared: 0.2355
F-statistic: 308.7 on 1 and 998 DF, p-value: < 2.2e-16
Now, we can add more variables to the model. It is however important to comapare a simple model and a more complicated model like below.
<- lm(malaria_risk_post ~ net, data = rct_data)
model_rct1 <- lm(malaria_risk_post ~ net + income + age + sex_num, data = rct_data)
model_rct2
<-list("Risk of Malaria" = model_rct1,
models"Risk of Malaria" = model_rct2)
<- c('net'='Received Mosquito Net (Treatment)',
cm "income" = "income",
"age" = "age",
"sex_num" = "sex",
"(Intercept)"="Intercept")
modelsummary(models, stars = TRUE, coef_map = cm, gof_map = c("nobs", "r.squared"))
Risk of Malaria | Risk of Malaria | |
---|---|---|
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | ||
Received Mosquito Net (Treatment) | -10.083*** | -10.235*** |
(0.574) | (0.527) | |
income | -0.045*** | |
(0.004) | ||
age | 0.083*** | |
(0.016) | ||
sex | 0.518 | |
(0.539) | ||
Intercept | 45.662*** | 65.608*** |
(0.415) | (1.866) | |
Num.Obs. | 1000 | 1000 |
R2 | 0.236 | 0.359 |
Let us now also plot this difference.
ggplot(rct_data, aes(x = malaria_risk_post, fill = treat, group=treat)) +
guides(color = "none") +
geom_histogram(binwidth = 2,
color = "white",
alpha = 0.5)+
#Adding density and multiplying by a manually defined value
#for visibility
geom_density(aes(y = after_stat(density)* 1500, color=treat),
#alpha = 0.5,
fill=NA)+
#Adding a secondary axis
scale_y_continuous(name = "count",
sec.axis =
sec_axis(~.x/1500,
name = "density"))+
labs(fill = NULL)+
xlab(NULL)
ggplot(rct_data, aes(x = treat, y = malaria_risk_post, color = treat)) +
stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
guides(color = "none") +
labs(x = NULL, y = "Post Malaria Risk")
5 Conclusion
We can now conclude that indeed the provision of insecticide-treated nets reduces the risk of malaria by -10.08 units. Thus, if the we want to reduce such a health risk to the wider region, it might be effective to provide such nets to all vulnerable populations.
Let us recap the steps in the analysis of RCTs
- Step 1: Check balance
- Step 2: Calculate the difference
- Step 3: Show the difference