Statistical Analysis

Lab 7: T-Tests, Choropleth Maps, and Correlations

Bogdan G. Popescu

John Cabot University

Intro

Setup

Open a new Quarto document and use this preamble:

---
title: "Lab 7"
author: "Your Name"
date: today
format:
  html:
    toc: true
    number-sections: true
    colorlinks: true
    smooth-scroll: true
    embed-resources: true
---

Render and save under “week7” in your “stats” folder.

Background

Loading Libraries

rm(list = ls())
library(ggplot2)
library(gridExtra)
library(dplyr)
library(purrr)
library(sf)
library(rnaturalearth)
library(rnaturalearthdata)
library(ggrepel)

From Z-Tests to T-Tests

Z-tests require known population SD (\(\sigma\))
In practice, \(\sigma\) is often unknown
T-tests use the sample SD (\(s\)) instead
More commonly used in real research
We compare Latin America to Europe using t-tests

Loading the Data

Loading the Datasets

life_expectancy_df <- read.csv(
  file = './data/life-expectancy.csv'
)
urbanization_df <- read.csv(
  file = './data/share-of-population-urban.csv'
)

Download links if needed:

Cleaning the Data

Data Cleaning Pipeline

Focus on post-1900, average by country, remove non-countries:

life_expectancy_df <- subset(
  life_expectancy_df, Year > 1900
)

life_expectancy_df2 <- life_expectancy_df %>%
  dplyr::group_by(Entity, Code) %>%
  dplyr::summarize(
    life_exp_mean = mean(
      Life.expectancy.at.birth..historical.
    )
  )

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_life_expectancy_df <- subset(
  life_expectancy_df2, !(Code %in% weird_labels)
)

Inspecting with `head`

head(clean_life_expectancy_df, n = 5)

# A tibble: 5 × 3
# Groups:   Entity [5]
  Entity         Code  life_exp_mean
  <chr>          <chr>         <dbl>
1 Afghanistan    AFG            45.4
2 Albania        ALB            68.3
3 Algeria        DZA            57.5
4 American Samoa ASM            68.6
5 Andorra        AND            77.0

Inspecting with `glimpse`

A more powerful way to inspect data:

glimpse(clean_life_expectancy_df)

Rows: 235
Columns: 3
Groups: Entity [235]
$ Entity        <chr> "Afghanistan", "Albania", "Algeria", "American Samoa", "…
$ Code          <chr> "AFG", "ALB", "DZA", "ASM", "AND", "AGO", "AIA", "ATG", …
$ life_exp_mean <dbl> 45.38333, 68.28611, 57.53013, 68.63750, 77.04861, 45.084…

glimpse tells us: 235 observations, 3 variables, variable types, and first values.

Mapping the Data

Data vs Spatial Objects

Our cleaned data is a tibble:

class(clean_life_expectancy_df)

[1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

Tibble: the central data structure for tidyverse (dplyr, ggplot2, tidyr, readr).

Loading Geographic Data

The rnaturalearth package provides country boundaries:

world <- ne_countries(scale = "medium", returnclass = "sf")

class(world)

[1] "sf"         "data.frame"

This is an sf object — a data frame with rows of features, columns of attributes, and a geometry column for spatial aspects.

Inspecting the Spatial Data

glimpse(world)

Rows: 242
Columns: 169
$ featurecla <chr> "Admin-0 country", "Admin-0 country", "Admin-0 country", "A…
$ scalerank  <int> 1, 1, 1, 3, 5, 6, 1, 1, 1, 3, 5, 3, 3, 3, 3, 1, 5, 3, 3, 3,…
$ labelrank  <int> 3, 3, 3, 2, 3, 6, 4, 3, 4, 6, 6, 6, 6, 6, 4, 5, 2, 4, 5, 6,…
$ sovereignt <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "Vat…
$ sov_a3     <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_dif   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ level      <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ type       <chr> "Sovereign country", "Sovereign country", "Sovereign countr…
$ tlc        <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1",…
$ admin      <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "Vat…
$ adm0_a3    <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ geou_dif   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ geounit    <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "Vat…
$ gu_a3      <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ su_dif     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ subunit    <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "Vat…
$ su_a3      <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ brk_diff   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
$ name       <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "Vat…
$ name_long  <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "Vat…
$ brk_a3     <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ brk_name   <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "Vat…
$ brk_group  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ abbrev     <chr> "Zimb.", "Zambia", "Yem.", "Viet.", "Ven.", "Vat.", "Van.",…
$ postal     <chr> "ZW", "ZM", "YE", "VN", "VE", "V", "VU", "UZ", "UY", "FSM",…
$ formal_en  <chr> "Republic of Zimbabwe", "Republic of Zambia", "Republic of …
$ formal_fr  <chr> NA, NA, NA, NA, "República Bolivariana de Venezuela", NA, N…
$ name_ciawf <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "Hol…
$ note_adm0  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "U.S.A.", "U.S.…
$ note_brk   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ name_sort  <chr> "Zimbabwe", "Zambia", "Yemen, Rep.", "Vietnam", "Venezuela,…
$ name_alt   <chr> NA, NA, NA, NA, NA, "Holy See", NA, NA, NA, NA, NA, NA, NA,…
$ mapcolor7  <int> 1, 5, 5, 5, 1, 1, 6, 2, 1, 5, 2, 4, 4, 4, 4, 4, 4, 6, 6, 6,…
$ mapcolor8  <int> 5, 8, 3, 6, 3, 3, 3, 3, 2, 2, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6,…
$ mapcolor9  <int> 3, 5, 3, 5, 1, 4, 7, 5, 2, 4, 5, 1, 1, 1, 1, 1, 1, 6, 6, 6,…
$ mapcolor13 <int> 9, 13, 11, 4, 4, 2, 3, 4, 10, 13, 3, 1, 1, 1, 1, 1, 1, 3, 3…
$ pop_est    <dbl> 14645468, 17861030, 29161922, 96462106, 28515829, 825, 2998…
$ pop_rank   <int> 14, 14, 15, 16, 15, 2, 10, 15, 12, 9, 8, 8, 9, 9, 8, 12, 17…
$ pop_year   <int> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019,…
$ gdp_md     <int> 21440, 23309, 22581, 261921, 482359, -99, 934, 57921, 56045…
$ gdp_year   <int> 2019, 2019, 2019, 2019, 2014, 2019, 2019, 2019, 2019, 2018,…
$ economy    <chr> "5. Emerging region: G20", "7. Least developed region", "7.…
$ income_grp <chr> "5. Low income", "4. Lower middle income", "4. Lower middle…
$ fips_10    <chr> "ZI", "ZA", "YM", "VM", "VE", "VT", "NH", "UZ", "UY", "FM",…
$ iso_a2     <chr> "ZW", "ZM", "YE", "VN", "VE", "VA", "VU", "UZ", "UY", "FM",…
$ iso_a2_eh  <chr> "ZW", "ZM", "YE", "VN", "VE", "VA", "VU", "UZ", "UY", "FM",…
$ iso_a3     <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ iso_a3_eh  <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ iso_n3     <chr> "716", "894", "887", "704", "862", "336", "548", "860", "85…
$ iso_n3_eh  <chr> "716", "894", "887", "704", "862", "336", "548", "860", "85…
$ un_a3      <chr> "716", "894", "887", "704", "862", "336", "548", "860", "85…
$ wb_a2      <chr> "ZW", "ZM", "RY", "VN", "VE", "-99", "VU", "UZ", "UY", "FM"…
$ wb_a3      <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "-99", "VUT", "UZB", "UR…
$ woe_id     <int> 23425004, 23425003, 23425002, 23424984, 23424982, 23424986,…
$ woe_id_eh  <int> 23425004, 23425003, 23425002, 23424984, 23424982, 23424986,…
$ woe_note   <chr> "Exact WOE match as country", "Exact WOE match as country",…
$ adm0_iso   <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_diff  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ adm0_tlc   <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_us <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_fr <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_ru <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_es <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_cn <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_tw <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_in <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_np <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_pk <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_de <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_gb <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_br <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_il <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_ps <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_sa <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_eg <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_ma <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_pt <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_ar <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_jp <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_ko <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_vn <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_tr <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_id <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_pl <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_gr <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_it <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_nl <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_se <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_bd <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_ua <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ adm0_a3_un <int> -99, -99, -99, -99, -99, -99, -99, -99, -99, -99, -99, -99,…
$ adm0_a3_wb <int> -99, -99, -99, -99, -99, -99, -99, -99, -99, -99, -99, -99,…
$ continent  <chr> "Africa", "Africa", "Asia", "Asia", "South America", "Europ…
$ region_un  <chr> "Africa", "Africa", "Asia", "Asia", "Americas", "Europe", "…
$ subregion  <chr> "Eastern Africa", "Eastern Africa", "Western Asia", "South-…
$ region_wb  <chr> "Sub-Saharan Africa", "Sub-Saharan Africa", "Middle East & …
$ name_len   <int> 8, 6, 5, 7, 9, 7, 7, 10, 7, 10, 12, 14, 15, 4, 14, 11, 24, …
$ long_len   <int> 8, 6, 5, 7, 9, 7, 7, 10, 7, 30, 16, 24, 28, 4, 14, 11, 13, …
$ abbrev_len <int> 5, 6, 4, 5, 4, 4, 4, 4, 4, 6, 6, 6, 11, 4, 9, 4, 6, 10, 6, …
$ tiny       <int> -99, -99, -99, 2, -99, 4, 2, 5, -99, -99, 2, 3, 3, 2, 3, -9…
$ homepart   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -99, -99, -99, -99, -99, 1…
$ min_zoom   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ min_label  <dbl> 2.5, 3.0, 3.0, 2.0, 2.5, 5.0, 4.0, 3.0, 3.0, 5.0, 5.0, 5.0,…
$ max_label  <dbl> 8.0, 8.0, 8.0, 7.0, 7.5, 10.0, 9.0, 8.0, 8.0, 10.0, 10.0, 1…
$ label_x    <dbl> 29.92544, 26.39530, 45.87438, 105.38729, -64.59938, 12.4534…
$ label_y    <dbl> -18.911640, -14.660804, 15.328226, 21.715416, 7.182476, 41.…
$ ne_id      <dbl> 1159321441, 1159321439, 1159321425, 1159321417, 1159321411,…
$ wikidataid <chr> "Q954", "Q953", "Q805", "Q881", "Q717", "Q237", "Q686", "Q2…
$ name_ar    <chr> "زيمبابوي", "زامبيا", "اليمن", "فيتنام", "فنزويلا", "الفاتي…
$ name_bn    <chr> "জিম্বাবুয়ে", "জাম্বিয়া", "ইয়েমেন", "ভিয়েতনাম", "ভেনেজুয়েলা", "ভ্যাটিকান সিটি", "ভানুয়াতু",…
$ name_de    <chr> "Simbabwe", "Sambia", "Jemen", "Vietnam", "Venezuela", "Vat…
$ name_en    <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "Vat…
$ name_es    <chr> "Zimbabue", "Zambia", "Yemen", "Vietnam", "Venezuela", "Ciu…
$ name_fa    <chr> "زیمبابوه", "زامبیا", "یمن", "ویتنام", "ونزوئلا", "واتیکان"…
$ name_fr    <chr> "Zimbabwe", "Zambie", "Yémen", "Viêt Nam", "Venezuela", "Ci…
$ name_el    <chr> "Ζιμπάμπουε", "Ζάμπια", "Υεμένη", "Βιετνάμ", "Βενεζουέλα", …
$ name_he    <chr> "זימבבואה", "זמביה", "תימן", "וייטנאם", "ונצואלה", "קריית ה…
$ name_hi    <chr> "ज़िम्बाब्वे", "ज़ाम्बिया", "यमन", "वियतनाम", "वेनेज़ुएला", "वैटिकन नगर", "वानूआटू", …
$ name_hu    <chr> "Zimbabwe", "Zambia", "Jemen", "Vietnám", "Venezuela", "Vat…
$ name_id    <chr> "Zimbabwe", "Zambia", "Yaman", "Vietnam", "Venezuela", "Vat…
$ name_it    <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "Cit…
$ name_ja    <chr> "ジンバブエ", "ザンビア", "イエメン", "ベトナム", "ベネズエラ", "バチカン", "バヌアツ", "…
$ name_ko    <chr> "짐바브웨", "잠비아", "예멘", "베트남", "베네수엘라", "바티칸 시국", "바누아투", "우즈베…
$ name_nl    <chr> "Zimbabwe", "Zambia", "Jemen", "Vietnam", "Venezuela", "Vat…
$ name_pl    <chr> "Zimbabwe", "Zambia", "Jemen", "Wietnam", "Wenezuela", "Wat…
$ name_pt    <chr> "Zimbábue", "Zâmbia", "Iémen", "Vietname", "Venezuela", "Va…
$ name_ru    <chr> "Зимбабве", "Замбия", "Йемен", "Вьетнам", "Венесуэла", "Ват…
$ name_sv    <chr> "Zimbabwe", "Zambia", "Jemen", "Vietnam", "Venezuela", "Vat…
$ name_tr    <chr> "Zimbabve", "Zambiya", "Yemen", "Vietnam", "Venezuela", "Va…
$ name_uk    <chr> "Зімбабве", "Замбія", "Ємен", "В'єтнам", "Венесуела", "Вати…
$ name_ur    <chr> "زمبابوے", "زیمبیا", "یمن", "ویتنام", "وینیزویلا", "ویٹیکن …
$ name_vi    <chr> "Zimbabwe", "Zambia", "Yemen", "Việt Nam", "Venezuela", "Th…
$ name_zh    <chr> "津巴布韦", "赞比亚", "也门", "越南", "委内瑞拉", "梵蒂冈", "瓦努阿图", "乌兹别克斯坦",…
$ name_zht   <chr> "辛巴威", "尚比亞", "葉門", "越南", "委內瑞拉", "梵蒂岡", "萬那杜", "烏茲別克", "烏拉…
$ fclass_iso <chr> "Admin-0 country", "Admin-0 country", "Admin-0 country", "A…
$ tlc_diff   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_tlc <chr> "Admin-0 country", "Admin-0 country", "Admin-0 country", "A…
$ fclass_us  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_fr  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_ru  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_es  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_cn  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_tw  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_in  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_np  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_pk  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_de  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_gb  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_br  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_il  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_ps  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_sa  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_eg  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_ma  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_pt  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_ar  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_jp  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_ko  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_vn  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_tr  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_id  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_pl  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_gr  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_it  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_nl  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_se  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_bd  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fclass_ua  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ geometry   <MULTIPOLYGON [°]> MULTIPOLYGON (((31.28789 -2..., MULTIPOLYGON (…

170 variables is a lot. For our purposes, the key columns are adm0_a3 (ISO country code for merging), continent, and geometry (the polygon shapes).

A Basic World Map

Show code

ggplot() +
  geom_sf(data = world) +
  coord_sf(xlim = c(-180, 180), ylim = c(-60, 84), expand = FALSE) +
  theme_meridian()

Subsetting the Spatial Data

Keep only the variables we need for merging:

world2 <- subset(world, select = c(
  admin, adm0_a3, sovereignt, continent
))
glimpse(world2)

Rows: 242
Columns: 5
$ admin      <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "Vat…
$ adm0_a3    <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", "UR…
$ sovereignt <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "Vat…
$ continent  <chr> "Africa", "Africa", "Asia", "Asia", "South America", "Europ…
$ geometry   <MULTIPOLYGON [°]> MULTIPOLYGON (((31.28789 -2..., MULTIPOLYGON (…

Checking Country Codes

We need to compare codes between the two datasets:

ctries_sp <- world2$adm0_a3
glimpse(ctries_sp)

 chr [1:242] "ZWE" "ZMB" "YEM" "VNM" "VEN" "VAT" "VUT" "UZB" "URY" "FSM" ...

ctries_df <- clean_life_expectancy_df$Code
glimpse(ctries_df)

 chr [1:235] "AFG" "ALB" "DZA" "ASM" "AND" "AGO" "AIA" "ATG" "ARG" "ARM" ...

Finding Mismatches with `setdiff`

Before merging, always check: do the keys actually match? setdiff(X, Y) returns elements in X not in Y:

countries_dif <- setdiff(ctries_sp, ctries_df)
world3 <- subset(world2, adm0_a3 %in% countries_dif)
cat(length(countries_dif), "countries in the map have no life expectancy data:\n")

18 countries in the map have no life expectancy data:

head(world3$admin, 10)

 [1] "South Georgia and the Islands"       "British Indian Ocean Territory"     
 [3] "Pitcairn Islands"                    "South Sudan"                        
 [5] "Somaliland"                          "Western Sahara"                     
 [7] "Kosovo"                              "Palestine"                          
 [9] "Saint Barthelemy"                    "French Southern and Antarctic Lands"

These will appear as NA (grey) on the choropleth. This is a general lesson: merging always loses observations — check what you’re losing before interpreting the result.

Merging Data to the Map

merged <- left_join(
  world2, clean_life_expectancy_df,
  by = c("adm0_a3" = "Code")
)
glimpse(merged)

Rows: 242
Columns: 7
$ admin         <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "…
$ adm0_a3       <chr> "ZWE", "ZMB", "YEM", "VNM", "VEN", "VAT", "VUT", "UZB", …
$ sovereignt    <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "…
$ continent     <chr> "Africa", "Africa", "Asia", "Asia", "South America", "Eu…
$ Entity        <chr> "Zimbabwe", "Zambia", "Yemen", "Vietnam", "Venezuela", "…
$ life_exp_mean <dbl> 54.24722, 51.56389, 51.64861, 65.73194, 65.33421, 75.331…
$ geometry      <MULTIPOLYGON [°]> MULTIPOLYGON (((31.28789 -2..., MULTIPOLYGO…

Choropleth Map: Life Expectancy

Show code

ggplot() +
  geom_sf(data = merged, aes(fill = life_exp_mean)) +
  coord_sf(xlim = c(-180, 180), ylim = c(-60, 84), expand = FALSE) +
  labs(
    title = "Avg. Life Expectancy 1901-2021",
    fill = "Life Expectancy",
    caption = "Source: Our World in Data"
  ) +
  theme_meridian()

Improved Choropleth

Show code

ggplot() +
  geom_sf(data = merged, aes(fill = life_exp_mean)) +
  labs(
    title = "Avg. Life Expectancy 1901-2021",
    x = "Longitude", y = "Latitude",
    fill = "Life Expectancy",
    caption = "Source: Our World in Data"
  ) +
  coord_sf(xlim = c(-180, 180), ylim = c(-60, 84), expand = FALSE) +
  theme_meridian()

Performing a T-Test

Continent Labels

We can use the continent variable from the spatial data:

unique(merged$continent)

[1] "Africa"                  "Asia"                   
[3] "South America"           "Europe"                 
[5] "Oceania"                 "North America"          
[7] "Seven seas (open ocean)" "Antarctica"

Why South America, Not Latin America?

Our data has continent, not linguistic/cultural regions. “South America” excludes Mexico, Central America, and the Caribbean — all part of Latin America.

Latin America = linguistic concept (Spanish/Portuguese-speaking Americas)
South America = geographic concept (the continent)
Our spatial data only gives us the geographic version
This is a limitation: our results describe South America, not all of Latin America

Keep this in mind when interpreting results — we are comparing continents, not cultural regions.

Subsetting by Continent

sample_latam <- subset(merged, continent == "South America")
sample_eu <- subset(merged, continent == "Europe")

South America: 13 countries
Europe: 50 countries

Map: South America

Show code

ggplot() +
  geom_sf(data = sample_latam, aes(fill = life_exp_mean)) +
  labs(
    title = "Avg. Life Expectancy: South America",
    x = "Longitude", y = "Latitude",
    fill = "Life Exp.",
    caption = "Source: Our World in Data"
  ) +
  theme_meridian()

Map: Europe

Show code

ggplot() +
  geom_sf(data = sample_eu, aes(fill = life_exp_mean)) +
  coord_sf(xlim = c(-10, 40), ylim = c(35, 70), expand = FALSE) +
  labs(
    title = "Avg. Life Expectancy: Europe",
    x = "Longitude", y = "Latitude",
    fill = "Life Exp.",
    caption = "Source: Our World in Data"
  ) +
  theme_meridian()

Before We Test: Checking Assumptions

The t-test assumes:

Independence — the countries in each sample are independent of each other (plausible: different sovereign states)
Normality — the data within each group are approximately normally distributed
- South America has only 13 countries — with \(n < 30\), the normality assumption matters more
- In practice, the t-test is fairly robust to non-normality, but we should be cautious with such a small sample
Equal variances? — R’s t.test() uses Welch’s t-test by default, which does not assume equal variances

Bottom line: the test is defensible here, but the small South American sample (\(n = 13\)) means our power to detect differences is limited.

T-Test: Is Latam Greater?

test_greater <- t.test(
  sample_latam$life_exp_mean,
  sample_eu$life_exp_mean,
  alternative = "greater"
)
test_greater


    Welch Two Sample t-test

data:  sample_latam$life_exp_mean and sample_eu$life_exp_mean
t = -5.5367, df = 17.108, p-value = 1
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -9.228658       Inf
sample estimates:
mean of x mean of y 
 63.25522  70.27813

Interpretation: Greater

\(H_1\): Latam life expectancy \(>\) EU life expectancy
\(H_0\): Latam life expectancy \(\leq\) EU life expectancy

A p-value of 1 means the data are completely inconsistent with this direction. We do not reject \(H_0\).

T-Test: Is Latam Less?

The map suggests EU has higher life expectancy. Let’s test that:

test_less <- t.test(
  sample_latam$life_exp_mean,
  sample_eu$life_exp_mean,
  alternative = "less"
)
test_less


    Welch Two Sample t-test

data:  sample_latam$life_exp_mean and sample_eu$life_exp_mean
t = -5.5367, df = 17.108, p-value = 1.771e-05
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -4.817149
sample estimates:
mean of x mean of y 
 63.25522  70.27813

Interpretation: Less

\(H_1\): Latam life expectancy \(<\) EU life expectancy
\(H_0\): Latam life expectancy \(\geq\) EU life expectancy

We reject \(H_0\) because \(p < 0.05\). Latam life expectancy is significantly lower than Europe’s.

T-Test: Two-Sided

Are they simply different (either direction)?

test_twosided <- t.test(
  sample_latam$life_exp_mean,
  sample_eu$life_exp_mean,
  alternative = "two.sided"
)
test_twosided


    Welch Two Sample t-test

data:  sample_latam$life_exp_mean and sample_eu$life_exp_mean
t = -5.5367, df = 17.108, p-value = 3.541e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -9.697752 -4.348055
sample estimates:
mean of x mean of y 
 63.25522  70.27813

Interpretation: Two-Sided

\(H_1\): Latam mean \(\neq\) EU mean
\(H_0\): Latam mean \(=\) EU mean

\(p < 0.05\) → reject \(H_0\). The means are significantly different.

Boxplot: Latam vs EU

Show code

box_dat <- rbind(sample_latam, sample_eu)

ggplot(box_dat, aes(
  x = continent, y = life_exp_mean, color = continent
)) +
  geom_boxplot(linewidth = 0.8) +
  coord_cartesian(xlim = c(0, 3), ylim = c(40, 80)) +
  scale_color_manual(values = c(
    "Europe" = sage, "South America" = gold
  )) +
  labs(
    title = "Life Expectancy: South America vs Europe",
    x = "", y = "Life Expectancy",
    caption = "Source: Our World in Data"
  ) +
  theme_meridian() +
  theme(legend.position = "none")

Interpreting a Boxplot

Show code

merged_latam_bp <- subset(box_dat, continent == "South America")
hwy_50 <- quantile(merged_latam_bp$life_exp_mean, 0.5)
hwy_25 <- quantile(merged_latam_bp$life_exp_mean, 0.25)
hwy_75 <- quantile(merged_latam_bp$life_exp_mean, 0.75)
sd_bp <- sd(merged_latam_bp$life_exp_mean, na.rm = TRUE) / 4
hwy_iqr <- hwy_75 - hwy_25

merged_eu_bp <- subset(box_dat, continent == "Europe")
hwy2_25 <- quantile(merged_eu_bp$life_exp_mean, 0.25, na.rm = TRUE)
hwy2_75 <- quantile(merged_eu_bp$life_exp_mean, 0.75, na.rm = TRUE)
hwy2_min <- boxplot.stats(merged_eu_bp$life_exp_mean)$stats[1]
hwy2_max <- boxplot.stats(merged_eu_bp$life_exp_mean)$stats[5]

outliermax <- max(merged_latam_bp$life_exp_mean, na.rm = TRUE)
outliermin <- min(merged_latam_bp$life_exp_mean, na.rm = TRUE)

ggplot(box_dat, aes(
  x = continent, y = life_exp_mean, color = continent
)) +
  geom_boxplot() +
  coord_cartesian(xlim = c(0, 3), ylim = c(40, 80)) +
  scale_color_manual(values = c(
    "Europe" = sage, "South America" = gold
  )) +
  annotate("text", y = hwy_50 - sd_bp, x = 2,
           label = "Median", color = terracotta, fontface = 2) +
  annotate("text", y = hwy_25 - 1, x = 3,
           label = "25th percentile", color = terracotta, fontface = 3) +
  annotate("segment", y = hwy_25, yend = hwy_25,
           x = 2.6, xend = 2.4, color = terracotta,
           arrow = arrow(length = unit(0.3, "lines"))) +
  annotate("text", y = hwy_75 + 1, x = 3,
           label = "75th percentile", color = terracotta, fontface = 3) +
  annotate("segment", y = hwy_75, yend = hwy_75,
           x = 2.6, xend = 2.4, color = terracotta,
           arrow = arrow(length = unit(0.3, "lines"))) +
  annotate("text", y = hwy_25 + 2 * hwy_iqr, x = -0.1,
           label = "Interquartile\n range (IQR)", color = sage) +
  annotate("segment", y = c(hwy2_25, hwy2_75),
           yend = c(hwy2_25, hwy2_75), color = sage,
           x = 0.2, xend = 0.6,
           arrow = arrow(length = unit(0.3, "lines"))) +
  annotate("segment", y = hwy2_25, yend = hwy2_75,
           x = 0.2, xend = 0.2, color = sage) +
  annotate("text", y = hwy2_max, x = 0.8,
           label = "Maximum\n75% + (1.5 \u00d7 IQR)",
           hjust = 1, lineheight = 1, color = sage) +
  annotate("segment", y = hwy2_max, yend = hwy2_max,
           x = 1, xend = 0.8, color = sage,
           arrow = arrow(length = unit(0.3, "lines"))) +
  annotate("text", y = hwy2_min, x = 0.8,
           label = "Minimum\n25% - (1.5 \u00d7 IQR)",
           hjust = 1, lineheight = 1, color = sage) +
  annotate("segment", y = hwy2_min, yend = hwy2_min,
           x = 1, xend = 0.8, color = sage,
           arrow = arrow(length = unit(0.3, "lines"))) +
  annotate("text", y = outliermax, x = 2.3,
           label = "Outlier", color = terracotta, fontface = 2) +
  annotate("text", y = outliermin, x = 2.3,
           label = "Outlier", color = terracotta, fontface = 2) +
  labs(
    title = "Anatomy of a Boxplot",
    x = "", y = "Life Expectancy",
    caption = "Source: Our World in Data"
  ) +
  theme_meridian() +
  theme(legend.position = "none")

Boxplot with Jitter

Adding individual observations with geom_jitter:

Show code

ggplot(box_dat, aes(
  x = continent, y = life_exp_mean, color = continent
)) +
  geom_boxplot(linewidth = 0.8) +
  geom_jitter(width = 0.15, alpha = 0.6) +
  scale_color_manual(values = c(
    "Europe" = sage, "South America" = gold
  )) +
  labs(
    title = "Life Expectancy with Individual Countries",
    x = "", y = "Life Expectancy",
    caption = "Source: Our World in Data"
  ) +
  theme_meridian() +
  theme(legend.position = "none")

Mapping Europe Over Time

Setting Map Boundaries

Use extreme countries to define coordinates:

norway <- subset(box_dat, Entity == "Norway")
max_lat_y <- st_bbox(norway)["ymax"]

greece <- subset(box_dat, Entity == "Greece")
min_lat_y <- st_bbox(greece)["ymin"]

ukraine <- subset(box_dat, Entity == "Ukraine")
max_lon_x <- st_bbox(ukraine)["xmax"]

portugal <- subset(box_dat, Entity == "Portugal")
min_lon_x <- st_bbox(portugal)["xmin"]

Boundary	Country	Value
North (ymax)	Norway	80.5
South (ymin)	Greece	34.9
East (xmax)	Ukraine	40.1
West (xmin)	Portugal	-31.3

Europe: Average Life Expectancy

Show code

ggplot() +
  geom_sf(data = box_dat, aes(fill = life_exp_mean)) +
  coord_sf(
    xlim = c(min_lon_x, max_lon_x),
    ylim = c(min_lat_y, max_lat_y),
    expand = FALSE
  ) +
  labs(
    title = "Avg. Life Expectancy 1901-2021",
    x = "Longitude", y = "Latitude",
    fill = "Life Exp.",
    caption = "Source: Our World in Data"
  ) +
  theme_meridian()

Visualizing the Boundary Points

Show code

euro_extreme <- data.frame(
  x_lon = c(min_lon_x, max_lon_x, min_lon_x, max_lon_x),
  y_lat = c(min_lat_y, max_lat_y, max_lat_y, min_lat_y),
  name = c("South-West", "North-East", "North-West", "South-East")
)

ggplot() +
  geom_sf(data = box_dat, aes(fill = life_exp_mean)) +
  geom_point(
    data = euro_extreme, aes(x = x_lon, y = y_lat),
    fill = "blue", color = "darkred", size = 3
  ) +
  geom_label_repel(
    data = euro_extreme,
    aes(x = x_lon, y = y_lat, label = name),
    fill = alpha("white", 0.8)
  ) +
  coord_sf(
    xlim = c(min_lon_x - 3, max_lon_x + 3),
    ylim = c(min_lat_y - 3, max_lat_y + 3),
    expand = FALSE
  ) +
  labs(
    title = "Avg. Life Expectancy with Boundary Points",
    x = "Longitude", y = "Latitude",
    fill = "Life Exp.",
    caption = "Source: Our World in Data"
  ) +
  theme_meridian()

Merging by Year: 1950

names(life_expectancy_df)[4] <- "life_exp_mean"

life_expectancy_1950 <- subset(life_expectancy_df, Year == 1950)
world_1950 <- left_join(
  world2, life_expectancy_1950,
  by = c("adm0_a3" = "Code")
)

Map: Europe 1950

Show code

figure2 <- ggplot() +
  geom_sf(data = world_1950, aes(fill = life_exp_mean)) +
  coord_sf(
    xlim = c(min_lon_x, max_lon_x),
    ylim = c(min_lat_y, max_lat_y),
    expand = FALSE
  ) +
  labs(title = "Life Expectancy in 1950",
       x = "Longitude", y = "Latitude", fill = "Life Exp.") +
  theme_meridian()

figure2

Merging by Year: 2020

life_expectancy_2020 <- subset(life_expectancy_df, Year == 2020)
world_2020 <- left_join(
  world2, life_expectancy_2020,
  by = c("adm0_a3" = "Code")
)

Map: Europe 2020

Show code

figure3 <- ggplot() +
  geom_sf(data = world_2020, aes(fill = life_exp_mean)) +
  coord_sf(
    xlim = c(min_lon_x, max_lon_x),
    ylim = c(min_lat_y, max_lat_y),
    expand = FALSE
  ) +
  labs(title = "Life Expectancy in 2020",
       x = "Longitude", y = "Latitude", fill = "Life Exp.") +
  theme_meridian()

figure3

Side by Side (Without Fixed Scale)

Show code

grid.arrange(figure2, figure3, ncol = 2)

Hard to compare — the color scales differ between maps!

Fixing the Color Scale

We need the same min/max for both maps:

vmax <- max(life_expectancy_df$life_exp_mean, na.rm = TRUE)
vmin <- min(life_expectancy_df$life_exp_mean, na.rm = TRUE)
cat("Min:", round(vmin, 1), " Max:", round(vmax, 1))

Min: 12  Max: 86.5

1950 vs 2020: Common Scale

Show code

figure2 <- ggplot() +
  geom_sf(data = world_1950, aes(fill = life_exp_mean)) +
  coord_sf(
    xlim = c(min_lon_x, max_lon_x),
    ylim = c(min_lat_y, max_lat_y),
    expand = FALSE
  ) +
  scale_fill_gradient(
    limits = c(vmin, vmax), name = "Life Expectancy"
  ) +
  labs(title = "1950", x = "Longitude", y = "Latitude") +
  theme_meridian()

figure3 <- ggplot() +
  geom_sf(data = world_2020, aes(fill = life_exp_mean)) +
  coord_sf(
    xlim = c(min_lon_x, max_lon_x),
    ylim = c(min_lat_y, max_lat_y),
    expand = FALSE
  ) +
  scale_fill_gradient(
    limits = c(vmin, vmax), name = "Life Expectancy"
  ) +
  labs(title = "2020", x = "Longitude", y = "Latitude") +
  theme_meridian()

ggarrange(figure2, figure3, ncol = 2, common.legend = TRUE)

Computing the Change: 1950 to 2020

life_expectancy_1950_2020 <- subset(
  life_expectancy_df,
  Year %in% c(1950, 2020) & Code != ""
)

life_expectancy_1950_2020b <- life_expectancy_1950_2020 %>%
  dplyr::arrange(Code, Year) %>%
  dplyr::group_by(Code) %>%
  dplyr::filter(n() == 2) %>%
  dplyr::summarise(
    differ = last(life_exp_mean) - first(life_exp_mean)
  )

life_expectancy_1950_2020c <- subset(
  life_expectancy_1950_2020b, !is.na(differ)
)

glimpse(life_expectancy_1950_2020c)

Rows: 237
Columns: 2
$ Code   <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "AND", "ARE", "ARG", "ARM", …
$ differ <dbl> 18.5, 34.9, 26.0, 21.6, 32.3, 14.4, 37.8, 14.7, 12.9, 11.4, 21.…

Map: Change in Life Expectancy

Show code

world_change <- left_join(
  world2, life_expectancy_1950_2020c,
  by = c("adm0_a3" = "Code")
)

ggplot() +
  geom_sf(data = world_change, aes(fill = differ)) +
  coord_sf(
    xlim = c(min_lon_x, max_lon_x),
    ylim = c(min_lat_y, max_lat_y),
    expand = FALSE
  ) +
  labs(
    title = "Life Expectancy Change: 2020 - 1950",
    x = "Longitude", y = "Latitude",
    fill = "Change",
    caption = "Source: Our World in Data"
  ) +
  theme_meridian()

Highest and Lowest Change

world_change_eu <- subset(world_change, continent == "Europe")

highest <- subset(
  world_change_eu,
  differ == max(world_change_eu$differ, na.rm = TRUE)
)
cat("Highest change:", highest$admin, "\n")

Highest change: Albania

lowest <- subset(
  world_change_eu,
  differ == min(world_change_eu$differ, na.rm = TRUE)
)
cat("Lowest change:", lowest$admin)

Lowest change: Latvia

Correlation

From Maps to Relationships

We have mapped life expectancy across space and time. A natural question: is life expectancy related to urbanization?

Both variables are available in our data — we can merge them and find out.

Preparing the Data

names(urbanization_df)[4] <- "urb_mean"

urbanization_df2 <- subset(
  urbanization_df, select = c(Code, urb_mean, Year)
)

merged_df <- left_join(
  life_expectancy_df, urbanization_df2,
  by = c("Code" = "Code", "Year" = "Year")
)

merged_df2 <- na.omit(merged_df)

Computing the Correlation

cor(merged_df2$life_exp_mean, merged_df2$urb_mean)

[1] 0.4579985

A strong positive correlation (\(r \approx 0.80\)): higher urbanization is associated with higher life expectancy. But what does this actually look like?

Shared Axis Limits

Scatterplot: Earliest Year (1960)

Show code

merged_earliest <- subset(merged_df2, Year == earliest_year)

cor_earliest <- round(
  cor(merged_earliest$life_exp_mean, merged_earliest$urb_mean), 2
)

ggplot(merged_earliest, aes(x = urb_mean, y = life_exp_mean)) +
  geom_point(alpha = 0.6, color = sage, size = 2) +
  geom_smooth(method = "lm", color = terracotta, se = FALSE) +
  coord_cartesian(xlim = x_lim, ylim = y_lim) +
  labs(
    title = paste0("Life Expectancy vs Urbanization (", earliest_year, ")"),
    subtitle = paste0("r = ", cor_earliest,
                      " · Each point = one country"),
    x = "Urban Population (%)",
    y = "Life Expectancy at Birth",
    caption = "Source: Our World in Data"
  ) +
  theme_meridian()

Scatterplot: Latest Year (2020)

Show code

merged_latest <- subset(merged_df2, Year == latest_year)

cor_latest <- round(
  cor(merged_latest$life_exp_mean, merged_latest$urb_mean), 2
)

ggplot(merged_latest, aes(x = urb_mean, y = life_exp_mean)) +
  geom_point(alpha = 0.6, color = sage, size = 2) +
  geom_smooth(method = "lm", color = terracotta, se = FALSE) +
  coord_cartesian(xlim = x_lim, ylim = y_lim) +
  labs(
    title = paste0("Life Expectancy vs Urbanization (", latest_year, ")"),
    subtitle = paste0("r = ", cor_latest,
                      " · Each point = one country"),
    x = "Urban Population (%)",
    y = "Life Expectancy at Birth",
    caption = "Source: Our World in Data"
  ) +
  theme_meridian()

Correlation ≠ Causation

The correlation is real, but think about what’s in the data:

Each point is a country-year — we’re pooling observations across both countries and time
Over time, both urbanization and life expectancy have increased worldwide
Much of this correlation may be driven by the shared time trend, not a direct causal link
This is a textbook example of a confound: a third variable (time/development) drives both

A correlation of \(r = 0.80\) tells us there is a strong association — it does not tell us that urbanization causes longer lives.

Takeaways

T-tests let us compare group means when \(\sigma\) is unknown — but check your assumptions (especially with small \(n\))
Choropleths reveal spatial patterns that summary statistics hide
Correlation measures association — but always ask: could a third variable explain this?
The real skill is not running t.test() or cor() — it’s knowing what the output means and what it doesn’t

Statistical Analysis

Intro

Setup

Background

Loading Libraries

From Z-Tests to T-Tests

Loading the Data

Loading the Datasets

Cleaning the Data

Data Cleaning Pipeline

Inspecting with head

Inspecting with glimpse

Mapping the Data

Data vs Spatial Objects

Loading Geographic Data

Inspecting the Spatial Data

A Basic World Map

Subsetting the Spatial Data

Checking Country Codes

Finding Mismatches with setdiff

Merging Data to the Map

Choropleth Map: Life Expectancy

Improved Choropleth

Performing a T-Test

Continent Labels

Why South America, Not Latin America?

Subsetting by Continent

Map: South America

Map: Europe

Before We Test: Checking Assumptions

T-Test: Is Latam Greater?

Interpretation: Greater

T-Test: Is Latam Less?

Interpretation: Less

T-Test: Two-Sided

Interpretation: Two-Sided

Boxplot: Latam vs EU

Interpreting a Boxplot

Boxplot with Jitter

Mapping Europe Over Time

Setting Map Boundaries

Europe: Average Life Expectancy

Visualizing the Boundary Points

Merging by Year: 1950

Map: Europe 1950

Merging by Year: 2020

Map: Europe 2020

Side by Side (Without Fixed Scale)

Fixing the Color Scale

1950 vs 2020: Common Scale

Computing the Change: 1950 to 2020

Map: Change in Life Expectancy

Highest and Lowest Change

Correlation

From Maps to Relationships

Preparing the Data

Computing the Correlation

Shared Axis Limits

Scatterplot: Earliest Year (1960)

Scatterplot: Latest Year (2020)

Correlation ≠ Causation

Takeaways

Inspecting with `head`

Inspecting with `glimpse`

Finding Mismatches with `setdiff`