Statistical Discrimination - use of statistical information or group-based averages to make decisions or judgments about individuals.
It can serve as a decision-making tool, particularly when individual-level information is limited or expensive to obtain.
This approach assumes that certain statistical patterns or averages observed within a group can provide insights into the likely characteristics of an individual belonging to that group.
However, statistical discrimination can also perpetuate biases and inequalities.
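To make the definition concrete, here is a minimal sketch in R with invented data: a decision-maker lacking individual-level information judges every applicant by their group's average outcome.

```r
# Hypothetical data: individual outcomes observed for two groups
defaults <- data.frame(
  group = c("A", "A", "A", "B", "B", "B"),
  defaulted = c(0, 1, 0, 1, 1, 0)
)

# Compute the group-level averages
group_rates <- aggregate(defaulted ~ group, data = defaults, FUN = mean)

# Assign every individual the average of their group -- this is the core
# of statistical discrimination: the group mean stands in for the person
predicted_risk <- group_rates$defaulted[match(defaults$group, group_rates$group)]
print(predicted_risk)
```

Note how the two members of group B who never defaulted are still assigned the 2/3 group rate: the group average overrides individual standing.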
Hot Spot Analysis and Statistical Discrimination
Hot spot analysis involves identifying geographic areas or regions where certain characteristics or behaviors are concentrated.
Hot spot analysis aims to uncover spatial patterns and concentrations of specific variables or attributes within a given area.
By employing statistical methods, researchers can identify areas where certain phenomena are more prevalent than in surrounding regions
This approach is valuable for understanding spatial patterns and making informed decisions based on the distribution of specific factors.
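Before any formal spatial statistics, the intuition can be sketched in base R with simulated counts: flag grid cells whose values stand far above the overall mean (the data and the two-standard-deviation rule here are invented for illustration).

```r
# Simulate a 10x10 grid of event counts with a planted concentration
set.seed(42)
counts <- matrix(rpois(100, lambda = 3), nrow = 10)
counts[4:5, 4:5] <- counts[4:5, 4:5] + 15  # plant a "hot" patch

# Crude hot spot rule: cells more than two SDs above the global mean
threshold <- mean(counts) + 2 * sd(counts)
hot <- which(counts > threshold, arr.ind = TRUE)
print(hot)
```

The planted patch is flagged while ordinary cells are not. Formal methods such as the Getis-Ord statistics refine this idea by comparing each location to its spatial neighbors rather than to the global mean alone.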
Applications of Hot Spot Analysis
Criminal Justice - Identifying hot spots of criminal activity to allocate resources effectively
Public Health - Hot spot analysis helps identify regions with elevated risks of disease outbreaks or other health issues.
Education - Hot spot analysis can be applied to educational data to identify areas with lower academic performance, high dropout rates, or other education-related challenges.
How Does It Work?
Hot spot analysis in statistical discrimination relies on geospatial data, statistical techniques, and mapping tools to visualize and interpret patterns within a given area.
It enables decision-makers to tailor interventions and policies to the specific needs of certain geographic regions, addressing disparities and optimizing resource allocation
As with statistical discrimination, it is essential to consider ethical implications and potential biases in data collection and analysis to avoid reinforcing stereotypes or discriminating against certain communities.
Intro
In 2023, the world experienced a significant increase in the frequency of wildfires.
Wildfires pose challenges to ecosystems, communities, and firefighting efforts.
Objectives
This exercise aims to map all wildfires that happened in 2023 on a month-by-month basis
It also performs statistical analyses to identify spatial clusters.
Data
The data come from the Terra and Aqua combined MCD64A1 Version 6.1 Burned Area data product.
This is a monthly, global gridded 500m product containing per-pixel burned-area and quality information
Maps for the Whole World: January 2023
Maps for the Whole World: July 2023
Maps for the Whole World: August 2023
Maps for the Whole World: September 2023
Maps for Europe: June 2023
Maps for Europe: July 2023
Maps for Europe: August 2023
Maps for Europe: September 2023
Analyzing Spatial Clusters
To analyze clusters, I use the Getis-Ord Gi* statistic.
The Getis-Ord Gi* statistic produces z-scores and p-values, showing where features with either high or low values cluster spatially.
\[G_i^* = \frac{\sum_{j} w_{ij} x_j - \bar{X} \sum_{j} w_{ij}}{S \sqrt{\frac{n \sum_{j} w_{ij}^2 - \left(\sum_{j} w_{ij}\right)^2}{n-1}}}\]
where \(x_j\) is the attribute value at location j, \(w_{ij}\) is the spatial weight between locations i and j, \(n\) is the number of features, and \(\bar{X}\) and \(S\) are the global mean and standard deviation.
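The global version of the statistic used in the tests below can be computed by hand on a toy example. This sketch uses invented values and an invented binary weight matrix; the global G statistic is the weighted sum of cross-products divided by the sum over all distinct pairs.

```r
# Hypothetical attribute values at five locations
x <- c(2, 8, 9, 1, 7)

# Invented symmetric binary weight matrix (1 = neighbors, 0 = not)
W <- matrix(c(0, 1, 0, 0, 1,
              1, 0, 1, 0, 0,
              0, 1, 0, 1, 0,
              0, 0, 1, 0, 1,
              1, 0, 0, 1, 0), nrow = 5, byrow = TRUE)

# Global G: sum of w_ij * x_i * x_j over the sum of x_i * x_j for i != j
num <- sum(W * outer(x, x))
den <- sum(outer(x, x)) - sum(x^2)  # exclude i == j pairs
G <- num / den
print(G)
```

A G larger than its expectation under spatial randomness indicates that high values tend to sit next to high values; the `globalG.test()` output in the next slides reports exactly this comparison as a standard deviate.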
The Global G Test: Data in July
Getis-Ord global G statistic
data: tes_subset$burn
weights: tes_w_binary
standard deviate = -2.1995, p-value = 0.9861
alternative hypothesis: greater
sample estimates:
Global G statistic Expectation Variance
9.031613e-04 9.054901e-04 1.121059e-12
The output shows a standard deviate of -2.1995: the observed clustering of wildfires is about 2.2 standard deviations below what would be expected under the null hypothesis of no clustering.
This value is associated with a p-value of 0.9861, so observed clustering is not statistically significant at the 0.05 level.
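The reported p-value follows directly from the standard deviate: because the alternative hypothesis is "greater", the p-value is the upper-tail probability \(P(Z > z)\) of a standard normal.

```r
# Recover the reported p-value from the standard deviate
z <- -2.1995
p <- pnorm(z, lower.tail = FALSE)
round(p, 4)  # matches the reported 0.9861
```

A large negative deviate puts nearly all of the normal distribution above it, which is why the one-sided p-value is close to 1 here.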
The Global G Test: Data in August
Getis-Ord global G statistic
data: tes_subset$burn
weights: tes_w_binary
standard deviate = 4.1566, p-value = 1.615e-05
alternative hypothesis: greater
sample estimates:
Global G statistic Expectation Variance
8.942119e-04 8.907281e-04 7.024831e-13
The output shows a standard deviate of 4.1566: the observed clustering of wildfires is about 4.16 standard deviations above what would be expected under the null hypothesis of no clustering.
This value is associated with a p-value of 1.615e-05, so observed clustering is statistically significant at the 0.05 level.
The Hotspots in August
Conclusion
Spatial statistics can help identify areas where wildfires are clustered (hotspots).
This information is crucial for policymakers seeking to prioritize resources and interventions.
Mapping the spatial distribution of wildfires with Gi* statistics can show policymakers where to allocate resources most effectively.
Technical Appendix
The full code to replicate this study is in the appendix of the memo.
Part 1: Reading the Data
library(sf)
library(stars)
library(dplyr)
library(sfdep)
library(tidyr)
library(ggplot2)

# Step 1: Turning spherical geometry off
sf::sf_use_s2(FALSE)

# Step 2: Reading world borders
borders <- st_read(dsn = "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/week9/data/boundaries/all-country-boundaries.shp", quiet = TRUE)

# Step 3: Simplifying borders
borders2 <- st_simplify(borders, dTolerance = 0.1)

# Step 4: Reading rasters with wildfires (July)
left_hem <- read_stars("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/example/data/july/july-0000000000-0000000000.tif")
right_hem <- read_stars("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/example/data/july/july-0000000000-0000023296.tif")

# Step 5: Mosaicking the wildfire rasters (July)
july <- st_mosaic(left_hem, right_hem)

# Step 6: Reading rasters with wildfires (August)
left_hem <- read_stars("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/example/data/august/august-0000000000-0000000000.tif")
right_hem <- read_stars("/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/big_data/example/data/august/august-0000000000-0000023296.tif")

# Step 7: Mosaicking the wildfire rasters (August)
august <- st_mosaic(left_hem, right_hem)

# Step 8: Reducing resolution. Higher dx/dy values mean lower resolution
burn2_july <- st_warp(july, st_as_stars(st_bbox(borders), dx = 0.1, dy = 0.1))

# Step 9: Turning the raster cells into polygons
burn3_july <- st_as_sf(burn2_july, as_points = FALSE, merge = FALSE)
names(burn3_july)[1] <- "burn"

# Step 10: Repeating Steps 8-9 for August
burn2_august <- st_warp(august, st_as_stars(st_bbox(borders), dx = 0.1, dy = 0.1))
burn3_august <- st_as_sf(burn2_august, as_points = FALSE, merge = FALSE)
names(burn3_august)[1] <- "burn"
Part 2: Doing the neighbor analysis: Data July
library(spdep)

# Step 1: Create a neighbor list based on queen contiguity
list_nb <- spdep::poly2nb(burn3_july, queen = TRUE)

# Step 2: Checking for empty neighbor sets (polygons without neighbors)
empty_nb <- which(card(list_nb) == 0)

# Step 3: Remove polygons with empty neighbor sets from the data
tes_subset <- burn3_july[-empty_nb, ]

# Step 4: Identify neighbors with queen contiguity (edge/vertex touching)
tes_nb <- poly2nb(tes_subset, queen = TRUE)

# Step 5: Binary weighting assigns a weight of 1 to all neighboring features
# and a weight of 0 to all other features
tes_w_binary <- nb2listw(tes_nb, style = "B")

# Step 6: Calculate spatial lag for burn
tes_lag <- lag.listw(tes_w_binary, tes_subset$burn)

# Step 7: Test for global G statistic of burned areas
globalG.test(tes_subset$burn, tes_w_binary)
Getis-Ord global G statistic
data: tes_subset$burn
weights: tes_w_binary
standard deviate = -2.1995, p-value = 0.9861
alternative hypothesis: greater
sample estimates:
Global G statistic Expectation Variance
9.031613e-04 9.054901e-04 1.121059e-12
Part 2: Doing the neighbor analysis: Data August
library(spdep)

# Step 1: Create a neighbor list based on queen contiguity
list_nb <- spdep::poly2nb(burn3_august, queen = TRUE)

# Step 2: Checking for empty neighbor sets (polygons without neighbors)
empty_nb <- which(card(list_nb) == 0)

# Step 3: Remove polygons with empty neighbor sets from the data
tes_subset <- burn3_august[-empty_nb, ]

# Step 4: Identify neighbors with queen contiguity (edge/vertex touching)
tes_nb <- poly2nb(tes_subset, queen = TRUE)

# Step 5: Binary weighting assigns a weight of 1 to all neighboring features
# and a weight of 0 to all other features
tes_w_binary <- nb2listw(tes_nb, style = "B")

# Step 6: Calculate spatial lag for burn
tes_lag <- lag.listw(tes_w_binary, tes_subset$burn)

# Step 7: Test for global G statistic of burned areas
globalG.test(tes_subset$burn, tes_w_binary)
Getis-Ord global G statistic
data: tes_subset$burn
weights: tes_w_binary
standard deviate = 4.1566, p-value = 1.615e-05
alternative hypothesis: greater
sample estimates:
Global G statistic Expectation Variance
8.942119e-04 8.907281e-04 7.024831e-13
Calculating the Local Gi
We can calculate the local Gi statistic to get a sense of the local hotspots in wildfires.
# The object tes_nbs is not created elsewhere in this deck; this assumed
# step builds it with sfdep's neighbor and weight constructors
tes_nbs <- tes_subset %>%
  mutate(
    nb = st_contiguity(geometry),
    wt = st_weights(nb)
  )

# Calculate the Gi using local_g_perm
tes_hot_spots <- tes_nbs %>%
  mutate(
    Gi = local_g_perm(burn, nb, wt, nsim = 999)
    # nsim = number of Monte Carlo simulations (999 is the default)
  ) %>%
  # The new 'Gi' column itself contains a dataframe
  # We can't work with that, so we need to 'unnest' it
  unnest(Gi)
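The permutation logic behind `local_g_perm` can be sketched in base R. This toy example (invented values; the focal cell's neighbor set is assumed) compares one cell's observed neighborhood sum against sums obtained by randomly reassigning the remaining values.

```r
set.seed(1)
x <- c(10, 12, 11, 1, 2, 1, 2, 1)  # hypothetical burn values
# Suppose cell 1's two neighbors are cells 2 and 3
obs <- x[2] + x[3]  # observed neighborhood sum for cell 1

# Reassign the other cells' values at random and recompute the sum 999 times
perm <- replicate(999, sum(sample(x[-1], 2)))

# Pseudo p-value: how often a random assignment looks at least as "hot",
# with the conventional +1 correction for the observed arrangement
p_sim <- (sum(perm >= obs) + 1) / (999 + 1)
print(p_sim)
```

Because cell 1 sits next to the two largest values, random reassignments rarely match its neighborhood sum, yielding a small simulated p-value: the same conditional-permutation idea `local_g_perm` applies to every cell at once.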
Let’s do a cursory visualization of Gi values
Calculating the Local Gi
We can zoom in even further
Final Hotspot in August
tes_hot_spots2 <- subset(tes_hot_spots, select = c(gi, p_folded_sim))

tes_hot_spots3 <- tes_hot_spots2 %>%
  mutate(
    # Add a new column called "classification"
    classification = case_when(
      # Classify based on the following criteria:
      gi > 0 & p_folded_sim <= 0.01 ~ "Very hot",
      gi > 0 & p_folded_sim <= 0.05 ~ "Hot",
      gi > 0 & p_folded_sim <= 0.1 ~ "Somewhat hot",
      gi < 0 & p_folded_sim <= 0.01 ~ "Very cold",
      gi < 0 & p_folded_sim <= 0.05 ~ "Cold",
      gi < 0 & p_folded_sim <= 0.1 ~ "Somewhat cold",
      TRUE ~ "Insignificant"
    ),
    # Convert 'classification' into a factor for easier plotting
    classification = factor(
      classification,
      levels = c("Very hot", "Hot", "Somewhat hot",
                 "Insignificant",
                 "Somewhat cold", "Cold", "Very cold")
    )
  )

ggplot() +
  geom_sf(data = tes_hot_spots3, color = NA, aes(fill = classification)) +
  scale_fill_brewer(type = "div", palette = 5) +
  geom_sf(data = borders, fill = NA) +
  coord_sf(xlim = c(min_lon, max_lon), ylim = c(-15, -5), expand = FALSE)