Lab 11: Statistics

Analysis of RCT (Randomized Control Trial) Data

In Lab 11, we’ll explore the impact of insecticide-treated bed nets on malaria risk using randomized controlled trial (RCT) data from Malawi. We’ll start by creating a causal model to understand how factors like age, sex, and income might influence malaria risk. As we work through this, we’ll use statistical tools to calculate the expected post-malaria risk and visualize the relationships within the data, helping us to grasp the underlying concepts that drive the analysis.
Causal Models
Randomized Controlled Trials
Risk Analysis
Author

Bogdan G. Popescu

1 Introduction

In this hypothetical situation (the data is made up), the World Bank is planning on launching a program designed to reduce the risk of malaria in Malawi by providing insecticide-treated bed nets. So they have funding to run a randomized controlled trial (RCT) in three districts in Malawi that are very close to the Malawi lake. The World Bank randomly selected 1000 individuals from these districts and they want to investigate whether receiving such nets has any effect on people’s risk of malaria.

In this document, we want to calculate \(E(\text{Post-Malaria Risk | Net})\). First, create a folder called lab11.

library("broom")
library("dplyr")
library("ggplot2")
library("ggrepel")
library("modelsummary")
library("ggdag")
library("purrr")
library("stringr")
library("ggpubr")
library("ggpattern")


# For Maps
library("sf")
library("stars")
library("sp")
library("rnaturalearth")
library("rnaturalearthdata")
library("ggspatial")
library("dichromat")

# For interactive maps
library("leaflet")
library("leaflegend")
library("htmltools")

# Removing previous files and setting the
# directory
rm(list = ls())
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week11/lab/")

In a first instance, we want to draw a causal model which explains the effect of the insecticide-treated bed nets on participant risk of malaria, given the confounding caused by age, sex, and income. We think that the risk of malaria will be impacted by age, sex, income, prior risk of malaria, and whether people used the insecticide-treated net. See the models below (together with the code to produce them).

Step 1 Code
#Basic structure of a dag
malaria_dag <- dagify(post_malaria_risk ~ net + age + sex + income + pre_malaria_risk,
                     net ~ income + pre_malaria_risk,
                     exposure = "net",
                     outcome = "post_malaria_risk",
                     labels = c(post_malaria_risk = "Post Malaria Risk",
                                net = "Mosquito Net",
                                age = "Age",
                                sex = "Sex",
                                income = "Income",
                                pre_malaria_risk = "Pre Malaria Risk"),
                     coords = list(x = c(net = 2, post_malaria_risk=7, income = 5, age = 2,
                                         sex = 4, pre_malaria_risk=6),
                                   y = c(net = 3, post_malaria_risk=2, income = 1, age = 2,
                                         sex = 4, pre_malaria_risk=4)))

#Cleaning the dags and turning it into a dataframe.
bigger_dag <-data.frame(tidy_dagitty(malaria_dag))
#Labeling variables
bigger_dag$type<-"Confounder"
bigger_dag$type[bigger_dag$name=="post_malaria_risk"]<-"Outcome"
bigger_dag$type[bigger_dag$name=="net"]<-"Intervention"
#Identifying min and max coordinate
min_lon_x<-min(bigger_dag$x)
max_lon_x<-max(bigger_dag$x)
min_lat_y<-min(bigger_dag$y)
max_lat_y<-max(bigger_dag$y)
Step 2 Code
#Defining colors
col = c("Outcome"="green3",
        "Intervention"="red",
        "Confounder"="grey60")

#Defining order
order_col<-c("Outcome", "Intervention", "Confounder")

#Producing the graph
example_dag<-ggplot(data = bigger_dag, aes(x = x, y = y, xend = xend, yend = yend, color=type)) +
  geom_dag_point() +
  geom_dag_edges() +
  coord_sf(xlim = c(min_lon_x-1, max_lon_x+1), 
           ylim = c(min_lat_y-1, max_lat_y+1))+
  scale_colour_manual(values = col,
                      name = "Group",
                      breaks = order_col)+
  geom_label_repel(data = subset(bigger_dag, !duplicated(bigger_dag$label)), 
                   aes(label = label), fill = alpha(c("white"),0.8))+

  theme_bw()+
  labs(x = "", y="")

#Saving the ggplot
#NOTE: Create a folder called "graphs" in your lab folder.
ggsave(example_dag, file = "./graphs/example_dag1.jpg", 
       height = 20, width = 25, 
       units = "cm", dpi = 200)

example_dag

In the experiment, we want to manipulate access to the program (providing nets). Thus, we can calculate \(E(\text{malaria risk | net})\). Following the rules of causal diagrams, we can delete all the arrows going into the program node, because the nets are randomly assigned.

Step 3 Code
#Basic structure of a dag
malaria_dag2 <- dagify(post_malaria_risk ~ net + age + sex + income + pre_malaria_risk,
                     exposure = "net",
                     outcome = "post_malaria_risk",
                     labels = c(post_malaria_risk = "Post Malaria Risk",
                                net = "Mosquito Net",
                                age = "Age",
                                sex = "Sex",
                                income = "Income",
                                pre_malaria_risk = "Pre Malaria Risk"),
                     coords = list(x = c(net = 2, post_malaria_risk=7, income = 5, age = 2,
                                         sex = 4, pre_malaria_risk=6),
                                   y = c(net = 3, post_malaria_risk=2, income = 1, age = 2,
                                         sex = 4, pre_malaria_risk=4)))
Step 4 Code
bigger_dag <-data.frame(tidy_dagitty(malaria_dag2))
bigger_dag$type<-NA
bigger_dag$type<-"Confounder"
bigger_dag$type[bigger_dag$name=="post_malaria_risk"]<-"Outcome"
bigger_dag$type[bigger_dag$name=="net"]<-"Intervention"
min_lon_x<-min(bigger_dag$x)
max_lon_x<-max(bigger_dag$x)
min_lat_y<-min(bigger_dag$y)
max_lat_y<-max(bigger_dag$y)
Step 5 Code
#Defining colors
col = c("Outcome"="green3",
        "Intervention"="red",
        "Confounder"="grey60")

#Defining order
order_col<-c("Outcome", "Intervention", "Confounder")


#Producing the graph
example_dag2<-ggplot(data = bigger_dag, aes(x = x, y = y, xend = xend, yend = yend, color=type)) +
  geom_dag_point() +
  geom_dag_edges() +
# geom_text(data = bigger_dag, aes(x = x, y = y, label = label), 
#            hjust=0.4, vjust=0, color = "black")+
  coord_sf(xlim = c(min_lon_x-1, max_lon_x+1), 
           ylim = c(min_lat_y-1, max_lat_y+1))+
  scale_colour_manual(values = col,
                      name = "Group",
                      breaks = order_col)+
  geom_label_repel(data = subset(bigger_dag, !duplicated(bigger_dag$label)), 
                   aes(label = label), fill = alpha(c("white"),0.8))+
  theme_bw()+
  labs(x = "", y="")

#NOTE: Make sure that there is a folder called "graphs" in your lab folder.
#Saving the file
ggsave(example_dag2, file = "./graphs/example_dag2.jpg", 
       height = 20, width = 25, 
       units = "cm", dpi = 200)
#Displaying the file
example_dag2

2 Loading the Data

Within the lab11 folder, create another folder and call it data. Go to the following links and download the following:

Place all the data in the data folder.

3 Inspecting the Area

Let’s assume that the experiment has been conducted and the data has come in. Let us first have a first look at the area where the World Bank ran this RCT.

3.1 Loading the Shape Files

st_read reads the shapefiles from your hardrive. read_stars reads a tif file that shows the elevation for your area of interest. ne_countries loads a shapefile with all the countries around the world.

#Country shapefile
mwi <- st_read("./data/malawi-latest-free/gadm41_MWI_0.shp")
#Region 1 shapefile
mwi1 <- st_read("./data/malawi-latest-free/gadm41_MWI_1.shp")
#Region 3 shapefile
mwi3 <- st_read("./data/malawi-latest-free/gadm41_MWI_3.shp")
#Selected communes shapefile
select_comm <- st_read("./data/malawi-latest-free/selected_communes3.shp")
#Raster with elevation
elev <- read_stars("./data/elev.tif")
#World Shapefile
world <- ne_countries(scale = "medium", returnclass = "sf")

3.2 Simplifying the shapefiles for better rendering

st_simplify simplifies the polygons for quicker rendering.

#Simplifying polygons for quicker rendering
#Switched off spherical geometry (s2)
sf::sf_use_s2(FALSE)
#Note: you have to play around with dTolerance to figure out the appropriate value
#Too little - your computer will need more time to render the file
#Too much - the country polygons will be too simple
mwi<-st_simplify(mwi,  dTolerance = 0.005)
mwi1<-st_simplify(mwi1,  dTolerance = 0.005)
mwi3<-st_simplify(mwi3,  dTolerance = 0.005)
select_comm<-st_simplify(select_comm,  dTolerance = 0.0005)
world<-st_simplify(world,  dTolerance = 0.005)

3.3 Calculating the appropriate zoom level for Malawi

We now calculate the zoom level for the entire country of Malawi.

#Calculating the appropriate zoom level
mwi_extent <- st_bbox(mwi)

min_lon_x<-mwi_extent["xmin"]
max_lon_x<-mwi_extent["xmax"]

min_lat_y<-mwi_extent["ymin"]
max_lat_y<-mwi_extent["ymax"]

We now create a new polygon to emphasize where Malawi is on the World Map.

#Creating a dataframe to emphasize Malawi and adding some error
error = 5

sp_df = data.frame("lon" = c(min_lon_x - error, max_lon_x + error, 
                             min_lon_x - error, max_lon_x + error),
                   "lat" = c(min_lat_y - error, min_lat_y - error, 
                             max_lat_y + error, max_lat_y + error))

We now make a spatial polygon out of it so that we can highlight where Malawi is on the world map.

#Creating polygon to indicate zoom
poly <- sp_df %>% 
  st_as_sf(coords = c("lon", "lat"), 
           crs = st_crs(mwi)) %>% 
  st_bbox() %>% 
  st_as_sfc()

#Examining what we just created
ggplot() +
  geom_sf(data = mwi, fill="red", color = "red", linewidth = 0.9)+
  geom_sf(data = poly, fill=NA, color = "red", linewidth = 0.3)

3.4 Cropping the raster

We now crop our raster only to the area of interest. Again, if we have too much, our computer will have a hard time rendering the file.

#Step1: Reducing raster resolution: Higher dx/day values mean lower resolution
elev2<-st_warp(elev, st_as_stars(st_bbox(elev), dx = 0.05, dy = 0.05))
#Step2: Removing really high pixel values for better elevation contrast relevant for our area
elev2[elev2 > 2000] <- NA
#Step3: Cropping
dem_mwi<-st_crop(elev2, poly)
map_word<-ggplot() +
  geom_sf(data = world, fill=NA, color = "black", linewidth = 0.3)+
  geom_sf(data = mwi, fill="red", color = "red", linewidth = 0.7)+
  geom_sf(data = poly, fill=NA, color = "red", linewidth = 0.9)+
  theme_bw()+
  labs(x = "Longitude", y="Latitude")+
  ggtitle("Malawi on the World Map")+
  theme(axis.text.x = element_text(size=14),
        axis.text.y = element_text(size=14),
        axis.title=element_text(size=14),
        plot.title = element_text(hjust = 0.5),
        legend.box.background = element_rect(fill='white'),
        legend.background = element_blank(),
        legend.text=element_text(size=12))
map_word

#This is how to save the map on your drive
#Make sure that within the lab folder, there is a "graphs" folder
ggsave(map_word, file="./graphs/map_world.jpg", height=15, width=30, units = "cm", dpi=300)

3.5 Mapping the region of interest - Malawi

#Selecting only the region of interest
mwi1_lab<-subset(mwi1, NAME_1=="Nkhata Bay")
#Defining colors for raster
grey_palette <- colorRampPalette(c("grey90", "grey30"))
# Generate a sequence of colors based on the number of breaks desired
num_breaks <- 10  # Adjust this based on your data
grey_colors <- grey_palette(num_breaks)

This is the region of Nkhata Bay where the World Bank ran the RCT. Within the region of Nkhata Bay, the World Bank chose three settlements close to the lake where people could be exposed to malaria.

We now want to draw a new map to emphasize where Nkhata Bay is.

error<-0.2
map_malawi<-ggplot() +
  geom_stars(data=elev2, show.legend=FALSE) +
  scale_fill_gradientn(colours = grey_colors)+
  geom_sf(data = select_comm, fill="green", color = "green", linewidth = 0.4)+
  geom_sf(data = mwi1, fill=NA, color = "blue", linewidth = 0.2)+
  geom_sf(data = mwi1_lab, fill=NA, color = "blue", linewidth = 0.9)+
  geom_sf_text(data=mwi1_lab, aes(label = NAME_1), size=3.5)+
  theme_bw()+
  labs(x = "Longitude", y="Latitude")+
  ggtitle("Nkhata Bay in Malawi")+
  theme(axis.text.x = element_text(size=14),
        axis.text.y = element_text(size=14),
        axis.title=element_text(size=14),
        plot.title = element_text(hjust = 0.5),
        legend.position = c(1, 0),
        #Legend.position values should be between 0 and 1. c(0,0) corresponds to the "bottom left"
        #and c(1,1) corresponds to the "top right" position.
        legend.box.background = element_rect(fill='white'),
        legend.background = element_blank(),
        legend.text=element_text(size=12))+
    coord_sf(xlim = c(min_lon_x-5*error, max_lon_x+5*error), 
             ylim = c(min_lat_y-error, max_lat_y+error))+
    ggspatial::annotation_scale(location = 'tr')
#Saving the map
ggsave(map_malawi, file="./graphs/map_malawi.jpg", height=15, width=15, units = "cm", dpi=300)
map_malawi

3.6 Calculating the appropriate zoom level for the Selected Communes

We now calculate the zoom level for the region of interest - Nkhata Bay.

#Calculating the appropriate zoom level for the selected commune
mwi_extent <- st_bbox(select_comm)

min_lon_x<-mwi_extent["xmin"]
max_lon_x<-mwi_extent["xmax"]

min_lat_y<-mwi_extent["ymin"]
max_lat_y<-mwi_extent["ymax"]
#Re-reading the shapefiles.
#We now want more accurate polygons since we are zooming in.
#This is why we load them again

mwi <- st_read("./data/malawi-latest-free/gadm41_MWI_0.shp")
mwi1 <- st_read("./data/malawi-latest-free/gadm41_MWI_1.shp")
mwi3 <- st_read("./data/malawi-latest-free/gadm41_MWI_3.shp")
select_comm <- st_read("./data/malawi-latest-free/selected_communes3.shp")
dem <- read_stars("./data/elev.tif")

#Step1: Reducing raster resolution: Higher dx/day values mean lower resolution
elev2<-st_warp(dem, st_as_stars(st_bbox(elev), dx = 0.02, dy = 0.02))
#Step2: Removing really high pixel values for good contrast in MWI
elev2[elev2 > 2000] <- NA

#Step3: Creating a dataframe to emphasize Malawi and adding some error
error = 0.5
sp_df = data.frame("lon" = c(min_lon_x - error, max_lon_x + error, 
                             min_lon_x - error, max_lon_x + error),
                   "lat" = c(min_lat_y - error, min_lat_y - error, 
                             max_lat_y + error, max_lat_y + error))

#Step4: Creating polygon to indicate zoom
poly <- sp_df %>% 
  st_as_sf(coords = c("lon", "lat"), 
           crs = st_crs(mwi)) %>% 
  st_bbox() %>% 
  st_as_sfc()

#Step5: Cropping
dem_mwi<-st_crop(elev2, poly)

world <- ne_countries(scale = "medium", returnclass = "sf")
#Turning off spherical geometries
sf::sf_use_s2(FALSE)
#We also want to include roads in our map, to provide more context.
#Reading the OSM roads data
roads<-st_read("./data/malawi-latest-free/gis_osm_roads_free_1_crop2.shp")
#Step1: Dissolving polylines
roads2 <- st_combine(roads)
#Step2: Turning roads2 into an sf object
roads3<-st_as_sf(roads2)
#Step3: Only the roads of interest
roads_crop1 <- sf::st_intersection(roads3, select_comm)

3.7 Mapping the Individuals

Let us first look at our data.

rct_data <- read.csv(file = './data/rct_data_geo_malawi.csv')
glimpse(rct_data, n=10)
Rows: 1,000
Columns: 10
$ id                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ lat               <dbl> -11.27476, -11.15300, -11.13967, -11.25544, -11.1777…
$ lon               <dbl> 34.03006, 34.14594, 34.14297, 34.14619, 34.20031, 34…
$ income            <dbl> 409.4701, 520.8072, 581.3331, 324.0727, 532.1844, 53…
$ age               <int> 10, 35, 7, 43, 45, 51, 24, 17, 38, 69, 10, 21, 32, 3…
$ sex               <chr> "Man", "Woman", "Woman", "Woman", "Man", "Woman", "M…
$ health            <dbl> 82.78928, 81.18602, 80.57399, 82.84023, 87.40737, 57…
$ net               <int> 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1…
$ malaria_risk_pre  <dbl> 35.74454, 36.65260, 22.87415, 35.41934, 27.49873, 42…
$ malaria_risk_post <dbl> 42.52199, 34.27589, 32.67552, 43.87256, 27.16945, 41…
error<-0.05
map_malawi_exp<-ggplot() +
  #Plotting the elevation raster
  geom_stars(data=elev2, show.legend=FALSE) +
  #Making elevation grey
  scale_fill_gradientn(colours = grey_colors)+
  #Plotting the selected commune
  geom_sf(data = select_comm, fill="green", color = "green", linewidth = 0.4)+
  geom_sf(data = mwi1, fill=NA, color = "blue", linewidth = 0.2)+
  geom_sf(data = mwi3, fill=NA, color = "red", linewidth = 0.4)+
  geom_sf(data = mwi1_lab, fill=NA, color = "blue", linewidth = 0.9)+
  #Plotting the roads
  geom_sf(data = roads_crop1, fill=NA, color = "black", linewidth = 0.3)+
  #Plotting the RCT data
  geom_point(data = rct_data, aes(x = lon, y= lat), color = "red", size=0.3)+
  theme_bw()+
  labs(x = "Longitude", y="Latitude")+
  ggtitle("The Three Districts Selected in Nkhata Bay, Malawi\n Location of 1000 Individuals\n For the Experiment")+
  theme(axis.text.x = element_text(size=14),
        axis.text.y = element_text(size=14),
        axis.title=element_text(size=14),
        plot.title = element_text(hjust = 0.5),
        legend.position = c(1, 0),
        #Legend.position values should be between 0 and 1. c(0,0) corresponds to the "bottom left"
        #and c(1,1) corresponds to the "top right" position.
        legend.box.background = element_rect(fill='white'),
        legend.background = element_blank(),
        legend.text=element_text(size=12))+
        coord_sf(xlim = c(min_lon_x-3*error, max_lon_x+3*error), 
                 ylim = c(min_lat_y-error, max_lat_y+error))+
  #Adding a scale to the map
  ggspatial::annotation_scale(location = 'tr')

#Saving the picture in folder one your machine called "graphs"
ggsave(map_malawi_exp, file="./graphs/map_malawi_exp.jpg", height=15, width=15, units = "cm", dpi=300)

map_malawi_exp

3.8 Mapping the Individuals in an Interactive Way

Here is how we can make an interactive map.

rct_data$treat<-NA
rct_data$treat[rct_data$net==1]<-"Treatment"
rct_data$treat[rct_data$net==0]<-"Control"


pal <- 
   colorFactor(palette = c("#8ED5D8", "#F2B6B2"), 
               levels = c("Control", "Treatment"))


leaflet(rct_data) %>% addTiles() %>%
addPolygons(data = select_comm,
                group = "Nkhata Bay",
    # stroke parameters
    weight = 3, color = "blue",
    # fill parameters
    fillColor = "blue", fillOpacity = 0.0)%>%
    
  addCircleMarkers(lng = ~lon, lat = ~lat, 
             label = ~htmlEscape(paste("Indiv.: ", id)), radius = 4, color = ~pal(treat))%>%
  addLegendFactor(
    pal=pal,
    values = rct_data$treat,
    position = 'topright',
    shape = 'circle',
    width = 10,
    height = 10,
    opacity = 0.5)%>%
  addScaleBar(
  position = c("bottomright"))

4 Data Analysis

We will now do the data analysis. Until now, we looked at various ways of mapping our data.

4.1 Summary statisics

Before we conduct any analysis, it is important to get a sense of our data, the minimum and maximum, the mean, and whether our variables are normally distributed or not. Let us now create a summary statistics table:

datasummary_skim(rct_data)
Unique Missing Pct. Mean SD Min Median Max Histogram
id 1000 0 500.5 288.8 1.0 500.5 1000.0
lat 1000 0 -11.2 0.1 -11.3 -11.2 -11.1
lon 1000 0 34.1 0.1 34.0 34.1 34.2
income 1000 0 498.0 74.8 245.3 497.0 739.7
age 76 0 28.9 16.1 1.0 27.0 89.0
health 1000 0 69.1 12.9 30.9 69.0 100.0
net 2 0 0.5 0.5 0.0 1.0 1.0
malaria_risk_pre 1000 0 38.0 9.8 5.0 37.9 70.0
malaria_risk_post 1000 0 40.4 10.4 5.0 40.5 70.0
N %
sex Man 396 39.6
Woman 604 60.4
treat Control 478 47.8
Treatment 522 52.2

Income and malaria seem to be normally distributed. The net variable is binary, as expected. Age and health seem to be skewed to the left.

4.2 Checking balance

Before calculating the effect of the program, we should first check how well balanced assignment was. As you can tell from the results below, assignment to the program was pretty much split 50/50.

people_randomized<-rct_data %>%
  count(net) %>%
  mutate(prop = n / sum(n))
people_randomized

We also want to check to see how well balanced the treatment and control groups were in participants’ pre-treatment characteristics:

#Creating a 0-1 variable representing women
rct_data$sex_num<-as.numeric(as.factor(as.character(rct_data$sex)))
rct_data$sex_num[rct_data$sex_num==2]<-0

#Calculating the average variably by treated vs. non-treated
people_randomized<-rct_data %>%
  group_by(net) %>%
  summarize(prop_male = mean(sex_num),
    avg_age = mean(age),
    avg_malaria_risk_pre = mean(malaria_risk_pre),
    avg_health = mean(health),
    avg_income = mean(income))

people_randomized

These variables appear fairly well balanced. To check that there aren’t any statistically significant differences between the groups, we should also conduct some t-tests.

We have learned to do t.tests in the following way:

#Method1
rct_data1<-subset(rct_data, net==1)
rct_data0<-subset(rct_data, net==0)
t.test(rct_data1$sex_num, rct_data0$sex_num, alternative=c("two.sided"))

    Welch Two Sample t-test

data:  rct_data1$sex_num and rct_data0$sex_num
t = 0.16658, df = 990.67, p-value = 0.8677
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.05564916  0.06597315
sample estimates:
mean of x mean of y 
0.3984674 0.3933054 

Another way to get the same result is the following:

#Method2
t.test(sex_num ~ net, data = rct_data, alternative=c("two.sided"))

    Welch Two Sample t-test

data:  sex_num by net
t = -0.16658, df = 990.67, p-value = 0.8677
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -0.06597315  0.05564916
sample estimates:
mean in group 0 mean in group 1 
      0.3933054       0.3984674 

Thus, we can run t-test for all the following groups:

t.test(sex_num ~ net, data = rct_data, alternative=c("two.sided"))

    Welch Two Sample t-test

data:  sex_num by net
t = -0.16658, df = 990.67, p-value = 0.8677
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -0.06597315  0.05564916
sample estimates:
mean in group 0 mean in group 1 
      0.3933054       0.3984674 
t.test(age ~ net, data = rct_data, alternative=c("two.sided"))

    Welch Two Sample t-test

data:  age by net
t = -1.275, df = 997.37, p-value = 0.2026
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -3.2870789  0.6979223
sample estimates:
mean in group 0 mean in group 1 
       28.25523        29.54981 
t.test(malaria_risk_pre ~ net, data = rct_data, alternative=c("two.sided"))

    Welch Two Sample t-test

data:  malaria_risk_pre by net
t = 0.015818, df = 997.7, p-value = 0.9874
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -1.208182  1.227819
sample estimates:
mean in group 0 mean in group 1 
       37.96975        37.95993 
t.test(health ~ net, data = rct_data, alternative=c("two.sided"))

    Welch Two Sample t-test

data:  health by net
t = -0.40132, df = 996.97, p-value = 0.6883
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -1.931451  1.275577
sample estimates:
mean in group 0 mean in group 1 
       68.90312        69.23106 
t.test(income ~ net, data = rct_data, alternative=c("two.sided"))

    Welch Two Sample t-test

data:  income by net
t = 0.19875, df = 991.26, p-value = 0.8425
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -8.353269 10.236022
sample estimates:
mean in group 0 mean in group 1 
       498.4966        497.5552 

We can also simply ask to be shown the p-values.

t.test(sex_num ~ net, data = rct_data, alternative=c("two.sided"))["p.value"]
$p.value
[1] 0.8677374
t.test(age ~ net, data = rct_data, alternative=c("two.sided"))["p.value"]
$p.value
[1] 0.2026112
t.test(malaria_risk_pre ~ net, data = rct_data, alternative=c("two.sided"))["p.value"]
$p.value
[1] 0.9873827
t.test(health ~ net, data = rct_data, alternative=c("two.sided"))["p.value"]
$p.value
[1] 0.6882691
t.test(income ~ net, data = rct_data, alternative=c("two.sided"))["p.value"]
$p.value
[1] 0.8424983

Let us also plot these differences.

4.3 Plotting Differences

The following graphs plot the point ranges for the variables of interest.

#The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean
#Remember that the SE is the SD (var)/Sqrt(Sample size)

#The value of 1.96 is based on the fact that 95% of the area of a normal distribution is within 1.96 standard deviations of the mean
#Remember that the SE is the SD (var)/Sqrt(Sample size)

plot_diff_sex <- ggplot(rct_data, aes(x = treat, y = sex_num, color = treat)) +
  stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
  guides(color = "none") +
  labs(x = NULL, y = "Prop. male")

plot_diff_age <- ggplot(rct_data, aes(x = treat, y = age, color = treat)) +
  stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
  guides(color = "none") +
  labs(x = NULL, y = "Age")

plot_diff_malaria_risk_pre <- ggplot(rct_data, aes(x = treat, y = malaria_risk_pre, color = treat)) +
  stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
  guides(color = "none") +
  labs(x = NULL, y = "Mal. Risk Bef.")


plot_diff_health <- ggplot(rct_data, aes(x = treat, y = health, color = treat)) +
  stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
  guides(color = "none") +
  labs(x = NULL, y = "Health")

plot_diff_income <- ggplot(rct_data, aes(x = treat, y = income, color = treat)) +
  stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
  guides(color = "none") +
  labs(x = NULL, y = "Income")

ggarrange(plot_diff_sex, plot_diff_age,
             plot_diff_malaria_risk_pre,
             plot_diff_health, plot_diff_income, ncol=2, nrow=3, common.legend = TRUE)

Let us now also include the distributions of these variables by group (treatment and control group).

For sex, given its binary nature, we simply need to do a barplot by group (men and women in the treatment vs. control group).

plot_diff_sex_d<-ggplot(rct_data, aes(pattern = sex, x = treat)) +
  geom_bar(data = rct_data,
          position = "dodge",
          aes(fill=treat),
          alpha = 0.5) +
  geom_bar_pattern(position = "dodge",
                   aes(pattern=sex),
                   fill = NA,
                   linewidth=0.5,
                  color="black")+
  scale_pattern_manual(name = NULL,
                       values = c(
                         Man = "stripe",
                         Woman = "circle"))+
  labs(fill = NULL)+
  xlab(NULL)
plot_diff_sex_d

For the other variables, given their continuous nature, we need to plot a histogram or a density function or both. A histogram shows the frequency distribution of our variable (e.g. how many times we encounter the age 32, 40, 56, etc.). A density graph shows the probability density function of the same variable. It shows a continuous line, where the height of the line at a given point represents the probability density of observing a value within a particular range.

We can first plot a histogram with age:

plot_diff_age_hist<-ggplot(rct_data, aes(x = age, fill = treat, group=treat)) +
  guides(color = "none") +
  geom_histogram(binwidth = 2, 
                 color = "white",
                 alpha = 0.5)+
  labs(fill = NULL)+
  xlab(NULL)
plot_diff_age_hist

We can then plot its density function.

plot_diff_age_den<-ggplot(rct_data, aes(x = age, fill = treat, group=treat)) +
  guides(color = "none") +
  geom_density(aes(y = after_stat(density), color=treat), 
               fill=NA)+
  labs(fill = NULL)+
  xlab(NULL)
plot_diff_age_den

Or we can plot them together. Note the mutiplicative factor of 2000 in geom_density in order to have comparable scales on the two y-axes.

plot_diff_age_d<-ggplot(rct_data, aes(x = age, fill = treat, group=treat)) +
  guides(color = "none") +
  geom_histogram(binwidth = 2, 
                 color = "white",
                 alpha = 0.5)+
  #Adding density and multiplying by a manually defined value
  #for visibility
  geom_density(aes(y = after_stat(density)* 2000, color=treat), 
               #alpha = 0.5,
               fill=NA)+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/2000, 
                                name = "density"))+
  labs(fill = NULL)+
  xlab(NULL)
plot_diff_age_d

For the other variables, I am simply plotting them together.

plot_diff_malaria_risk_pre_d<-ggplot(rct_data, aes(x = malaria_risk_pre, fill = treat, group=treat)) +
  guides(color = "none") +
  geom_histogram(binwidth = 2, 
                 color = "white",
                 alpha = 0.5)+
  #Adding density and multiplying by a manually defined value
  #for visibility
  geom_density(aes(y = after_stat(density)* 2000, color=treat), 
               #alpha = 0.5,
               fill=NA)+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/2000, 
                                name = "density"))+
  labs(fill = NULL)+
  xlab(NULL)
plot_diff_malaria_risk_pre_d

plot_diff_health_d<-ggplot(rct_data, aes(x = health, fill = treat, group=treat)) +
  guides(color = "none") +
  geom_histogram(binwidth = 2, 
                 color = "white",
                 alpha = 0.5)+
  #Adding density and multiplying by a manually defined value
  #for visibility
  geom_density(aes(y = after_stat(density)* 2000, color=treat), 
               #alpha = 0.5,
               fill=NA)+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/2000, 
                                name = "density"))+
  labs(fill = NULL)+
  xlab(NULL)
plot_diff_health_d

plot_diff_income_d<-ggplot(rct_data, aes(x = income, fill = treat, group=treat)) +
  guides(color = "none") +
  geom_histogram(binwidth = 10, 
                 color = "white",
                 alpha = 0.5)+
  #Adding density and multiplying by a manually defined value
  #for visibility
  geom_density(aes(y = after_stat(density)* 10000, color=treat), 
               #alpha = 0.5,
               fill=NA)+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/10000, 
                                name = "density"))+
  labs(fill = NULL)+
  xlab(NULL)
plot_diff_income_d

We can now plot the differences and the distributions side by side.

figure1<-ggarrange(plot_diff_sex, plot_diff_sex_d, ncol=2, common.legend = TRUE)
annotate_figure(figure1, top = text_grob("Sex"))

figure2<-ggarrange(plot_diff_age, plot_diff_age_d, ncol=2, common.legend = TRUE)
annotate_figure(figure2, top = text_grob("Age"))

figure3<-ggarrange(plot_diff_malaria_risk_pre, plot_diff_malaria_risk_pre_d, ncol=2, common.legend = TRUE)
annotate_figure(figure3, top = text_grob("Malaria Risk Pre-Intervention"))

figure4<-ggarrange(plot_diff_health, plot_diff_health_d, ncol=2, common.legend = TRUE)
annotate_figure(figure4, top = text_grob("Health Score"))

figure5<-ggarrange(plot_diff_income, plot_diff_income_d, ncol=2, common.legend = TRUE)
annotate_figure(figure5, top = text_grob("Income"))

4.4 Estimating the Difference

So, we are interested in the causal effect of the program - \(E(\text{Post-Malaria Risk | Net})\). We can find this causal effect by calculating the average treatment effect or ATE:

\[ ATE = E(\overline{\text{Post-Malaria Risk}} | \text{Net = 1}) - E(\overline{\text{Post-Malaria Risk}} | \text{Net = 0}) \]

This is simply the average outcome for people in the program minus the average outcome for people not in the program. We can calculate the group averages:

people_randomized<-rct_data %>%
  group_by(net) %>%
  summarize(avg_malaria_risk_post = mean(malaria_risk_post))

That’s 35.58 - 45.66, or -10.08, which means that the program caused an decrease in the risk of malaria of -10.08, on average. We can also simply run a regression with post-program malaria risk as the outcome variable and the program indicator variable as the explanatory variable. The result is the same.

model_rct1 <- lm(malaria_risk_post ~ net, data = rct_data)
summary(model_rct1)

Call:
lm(formula = malaria_risk_post ~ net, data = rct_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-30.5794  -6.4719   0.0798   6.1909  28.9747 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  45.6621     0.4146  110.14   <2e-16 ***
net         -10.0827     0.5738  -17.57   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.064 on 998 degrees of freedom
Multiple R-squared:  0.2363,    Adjusted R-squared:  0.2355 
F-statistic: 308.7 on 1 and 998 DF,  p-value: < 2.2e-16

Now, we can add more variables to the model. It is however important to comapare a simple model and a more complicated model like below.

model_rct1 <- lm(malaria_risk_post ~ net, data = rct_data)
model_rct2 <- lm(malaria_risk_post ~ net + income + age + sex_num, data = rct_data)


models<-list("Risk of Malaria" = model_rct1,
             "Risk of Malaria" = model_rct2)

cm <- c('net'='Received Mosquito Net (Treatment)',
        "income" = "income",
        "age" = "age",
        "sex_num" = "sex",
        "(Intercept)"="Intercept")

modelsummary(models, stars = TRUE, coef_map = cm, gof_map = c("nobs", "r.squared"))
Risk of Malaria Risk of Malaria
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
Received Mosquito Net (Treatment) -10.083*** -10.235***
(0.574) (0.527)
income -0.045***
(0.004)
age 0.083***
(0.016)
sex 0.518
(0.539)
Intercept 45.662*** 65.608***
(0.415) (1.866)
Num.Obs. 1000 1000
R2 0.236 0.359

Let us now also plot this difference.

ggplot(rct_data, aes(x = malaria_risk_post, fill = treat, group=treat)) +
  guides(color = "none") +
  geom_histogram(binwidth = 2, 
                 color = "white",
                 alpha = 0.5)+
  #Adding density and multiplying by a manually defined value
  #for visibility
  geom_density(aes(y = after_stat(density)* 1500, color=treat), 
               #alpha = 0.5,
               fill=NA)+
  #Adding a secondary axis
  scale_y_continuous(name = "count",
                     sec.axis = 
                       sec_axis(~.x/1500, 
                                name = "density"))+
  labs(fill = NULL)+
  xlab(NULL)

ggplot(rct_data, aes(x = treat, y = malaria_risk_post, color = treat)) +
  stat_summary(geom = "pointrange", fun.data = "mean_se", fun.args = list(mult = 1.96)) +
  guides(color = "none") +
  labs(x = NULL, y = "Post Malaria Risk")

5 Conclusion

We can now conclude that indeed the provision of insecticide-treated nets reduces the risk of malaria by -10.08 units. Thus, if the we want to reduce such a health risk to the wider region, it might be effective to provide such nets to all vulnerable populations.

Let us recap the steps in the analysis of RCTs

  • Step 1: Check balance
  • Step 2: Calculate the difference
  • Step 3: Show the difference