Lab 6: Statistics

Z-Tests, Z-Scores, and the Standard Normal Distribution

Author

Bogdan G. Popescu

#Removing previous datasets in memory
rm(list = ls())
#Loading the relevant libraries
library(ggplot2)
library(gridExtra)
library(dplyr)
library(BSDA)

1 Intro

In this lab, we learn about Z-tests, standardization, and z-scores.

2 Z-Tests

We learned in the lecture that a Z-test evaluates whether the averages of two datasets differ from each other when the population standard deviation or variance is known. The sample size should typically be larger than 30. To illustrate how Z-tests work, let us go back to the sample of Latin American countries and compare it to the world.
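Before turning to the real data, the arithmetic behind a one-sample Z-test can be sketched in a few lines of base R. All numbers below are hypothetical and chosen only so that the result is easy to check by hand; they do not come from the life expectancy dataset.

```r
# Hypothetical numbers, chosen only for illustration (not from the lab data)
sample_mean     <- 74  # mean of the sample
population_mean <- 72  # mean under the null hypothesis
population_sd   <- 6   # known population standard deviation
n               <- 36  # sample size (larger than 30)

# Standard error of the mean when the population standard deviation is known
standard_error <- population_sd / sqrt(n)

# Z statistic: distance between the two means in units of the standard error
z_stat <- (sample_mean - population_mean) / standard_error

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z_stat))

z_stat   # 2
p_value  # about 0.0455
```

Here the sample mean lies two standard errors above the population mean, so at the 5% level we would reject the hypothesis that the two means are equal.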

2.1 Loading the Data

If you no longer have the Life Expectancy dataset, you can download it from the following link:

Let us re-examine the distribution of our life expectancy dataset by making a histogram. We need to first load the data:

#Setting path (replace it with the folder on your own computer)
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week6/lab/")
#Step1: Loading the data
life_expectancy_df <- read.csv(file = './data/life-expectancy.csv')

2.2 Cleaning the Data

In the next few lines, we average life expectancy by country (the original dataset is a panel, with countries and years). This means that we are getting rid of the time component.

#Step1: Calculating the mean
life_expectancy_df2<-life_expectancy_df%>%
  dplyr::group_by(Entity, Code)%>%
  dplyr::summarize(life_exp_mean=mean(Life.expectancy.at.birth..historical.))

#Step2: Cleaning the Data
weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_life_expectancy_df<-subset(life_expectancy_df2, !(Code %in% weird_labels))
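To see what these two steps do, here is a minimal sketch on a made-up panel. The data frame `toy_panel`, its column `life_exp`, and all values are hypothetical; only the label "OWID_WRL" is taken from the code above. The `.groups = "drop"` argument is an addition that returns an ungrouped result.

```r
library(dplyr)

# Hypothetical toy panel: two made-up countries and a world aggregate,
# each observed in two years
toy_panel <- data.frame(
  Entity   = c("A", "A", "B", "B", "World", "World"),
  Code     = c("AAA", "AAA", "BBB", "BBB", "OWID_WRL", "OWID_WRL"),
  Year     = c(2000, 2001, 2000, 2001, 2000, 2001),
  life_exp = c(70, 72, 60, 64, 66, 68),
  stringsAsFactors = FALSE
)

# Step 1: one row per entity, averaging over the years
toy_means <- toy_panel %>%
  group_by(Entity, Code) %>%
  summarize(life_exp_mean = mean(life_exp), .groups = "drop")

# Step 2: dropping the rows whose code is not a country code
toy_clean <- subset(toy_means, !(Code %in% c("OWID_WRL", "")))

toy_clean
```

The six country-year rows collapse to three averages (71 for A, 62 for B, 67 for World), and the filter then removes the world aggregate, leaving only the two countries.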

2.3 Inspecting the data

Let us inspect the first 10 entries.

head(clean_life_expectancy_df, n=10)

2.4 Sorting the dataframe by life expectancy

Let us now order our dataset based on life expectancy.

clean_life_expectancy_sorted_df<-clean_life_expectancy_df[order(-clean_life_expectancy_df$life_exp_mean),]
head(clean_life_expectancy_sorted_df, n=10)
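The minus sign inside `order()` is what makes the sort run from highest to lowest. A small sketch on a hypothetical data frame (`toy_df` and its values are made up for illustration) shows the effect:

```r
# Hypothetical toy data: three made-up countries
toy_df <- data.frame(
  country       = c("A", "B", "C"),
  life_exp_mean = c(62, 75, 71),
  stringsAsFactors = FALSE
)

# order() returns row positions; negating the values sorts them descending
toy_sorted <- toy_df[order(-toy_df$life_exp_mean), ]

# An equivalent way to write the same sort
toy_sorted2 <- toy_df[order(toy_df$life_exp_mean, decreasing = TRUE), ]

toy_sorted$country  # "B" "C" "A"
```

Both versions put the country with the highest value first. Negating the values only works for numeric columns, while `decreasing = TRUE` also works for text.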

2.5 Make a barplot for the first 10 countries

Let us now make a barplot where life expectancy is ordered from highest to lowest. The first step entails selecting the top 10 countries.

life_exp_top10<-head(clean_life_expectancy_sorted_df, n=10)
life_exp_top10

Let us now create the barplot for the first 10 entries:

#Ordering the bars from highest to lowest life expectancy
figure_1<-ggplot(life_exp_top10, aes(x = reorder(Entity, -life_exp_mean), 
                                     y = life_exp_mean)) + 
  geom_bar(stat="identity")+
  #Zooming in on the 60-80 range without dropping any data
  coord_cartesian(ylim = c(60,80))+
  #Writing the rounded value inside each bar
  geom_text(aes(label = round(life_exp_mean,2)),
            vjust = 2, colour = "white", size=2)+
  xlab("Country") + ylab("Life Expectancy")+
  theme_bw()+
  #Rotating the country names so that they do not overlap
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

figure_1
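The bars appear from highest to lowest because of the `reorder()` call on the x axis. The sketch below uses made-up names and values (not the lab data) to show how `reorder()` sets the order of the factor levels, which ggplot2 then uses to place the bars from left to right.

```r
# Hypothetical toy data: three made-up countries and their values
country  <- c("A", "B", "C")
life_exp <- c(62, 75, 71)

# reorder() sorts the levels by the second argument in increasing order,
# so negating the values puts the largest value first
country_ordered <- reorder(country, -life_exp)

levels(country_ordered)  # "B" "C" "A"
```

Without the minus sign, the levels would run from the lowest value to the highest, and the bars would appear in increasing order.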