Lab 5: Statistics

Normal Distribution, Histograms, R Functions

In this lab, we explore the concepts of normal distributions and their related functions in R, such as rnorm, pnorm, qnorm, and dnorm. We generate samples of different sizes from a standard normal distribution and visualize them with histograms to understand how sample size affects distribution shapes.
Normal Distribution
Cumulative Distribution Function
Probability Density Function
Author

Bogdan G. Popescu

1 Intro

Open a new Quarto document. Select all text by pressing Ctrl + A (Windows) or Cmd + A (Mac), then press Delete. Next, copy and paste the following preamble.

---
title: "Lab 5"
author: "Your Name"
date: "September 13, 2024"
format:
  html:
    toc: true
    number-sections: true
    colorlinks: true
    smooth-scroll: true
    embed-resources: true
---

Render the document and save it under “week5” in your “stats” folder.

2 Background

#Removing previous datasets in memory
rm(list = ls())
#Loading the relevant libraries
#install.packages("gghighlight")
library(ggplot2)
library(gridExtra)
library(gghighlight)

Before we get into coding, let us go back to a short explanation of how frequencies relate to probability mass functions (pmf) and cumulative distribution functions (cdf).

There are four functions in R related to distributions: rnorm, pnorm, qnorm, dnorm.

3 Generating Random heights in a Normal Distribution

The rnorm function generates a vector of normally distributed random numbers. We can specify the mean \(\mu\) and the standard deviation \(\sigma\). Let’s simulate women’s heights in centimeters and demonstrate how heights follow a normal distribution.

Let us see how it works by creating three samples with 20, 100, and 1000 observations to demonstrate the impact of sample size on the normal distribution.

3.1 Creating a sample of 20 observations: heights

#Set seed allows us to make sure everyone gets the same results
set.seed(123)
#Step1: Creating 20 obs. with mean 165 and sd of 10
generated_heights_20obs<-rnorm(20, mean = 165, sd = 10)
#Step2: Turning the created data into a data frame
generated_heights_20obs<-data.frame(generated_heights_20obs)
#Step3: Inspecting the first 10 obs.
head(generated_heights_20obs, n=10)
#Step4: Renaming variable inside the datafarme
names(generated_heights_20obs)<-"height"
#Step5: Plotting the data
figure_1<-ggplot(data = generated_heights_20obs, aes(x=height))+
  geom_histogram()+
  theme_bw()+
  ggtitle("20 Obs.")
figure_1

3.2 Creating a sample of 100 observations: heights

set.seed(123)
#Step1: Creating 100 obs. with mean 165 and sd of 10
generated_heights_100obs<-rnorm(100, mean=165, sd=10)
#Step2: Turning the created data into a data frame
generated_heights_100obs<-data.frame(generated_heights_100obs)
#Step3: Inspecting the first 10 obs.
head(generated_heights_100obs, n=10)
#Step4: Renaming variable inside the datafarme
names(generated_heights_100obs)<-"height"
#Step5: Plotting the data
figure_2<-ggplot(data = generated_heights_100obs, aes(x=height))+
  geom_histogram()+
  theme_bw()+
  ggtitle("100 Obs.")
figure_2

3.3 Creating a sample of 1000 observations: heights

set.seed(123)
#Step1: Creating 1000 obs. with mean 165 and sd of 10
generated_heights_1000obs<-rnorm(1000, mean=165, sd=10)
#Step2: Turning the created data into a data frame
generated_heights_1000obs<-data.frame(generated_heights_1000obs)
#Step3: Inspecting the first 10 obs.
head(generated_heights_1000obs, n=10)
#Step4: Renaming variable inside the datafarme
names(generated_heights_1000obs)<-"height"
#Step5: Plotting the data
figure_3<-ggplot(data = generated_heights_1000obs, aes(x=height))+
  geom_histogram()+
  theme_bw()+
  ggtitle("1000 Obs.")
figure_3

3.4 Comparing the distribution of the three different samples

new_fig<-grid.arrange(figure_1, figure_2, figure_3, ncol=3)

  • When Small samples (n = 20) → More variability, less smooth
  • Large samples (n = 1000) → Converging to the theoretical normal distribution
  • Law of Large Numbers: As n increases, the sample distribution approximates the population distribution.

4 Generating the cumulative distribution function (CDF) for a normal distribution of heights

The pnorm function generates a vector for a cumulative distribution function of the normal distribution. We can specify the mean \(\mu\) and the standard deviation \(\sigma\). The pnorm will return the probability that the randomly generated variable X takes on a a value less than or equal to x:

\(F_X(x) = P (X\leq x)\)

4.1 Heights of Women from 135 to 195cm

# 99% of a normal distribution falls between mean - 3*sd and mean + 3*sd
list_heights <- seq(135, 195, by = 0.1)  # Heights from 135 to 195 cm
list_heights_df<-data.frame(list_heights)
dummy_df<-pnorm(list_heights, mean = 165, sd = 10)
dummy_df<-data.frame(dummy_df)
names(dummy_df)<-"probability"
dummy_df2<-cbind(dummy_df, list_heights)
names(dummy_df2)[names(dummy_df2)=="list_heights"]<-"heights"
figure_4<-ggplot()+
  geom_step(data=dummy_df2, mapping=aes(x=heights, y=probability)) +
  geom_bar(stat="identity")+
  theme_bw()+
  ggtitle("Sequence: 135 to 195")
figure_4

4.2 Highlighting probabilities

This will return the probability that the randomly generated variable X takes on a height less than or equal to 150cm.

figure_5<-ggplot(data=dummy_df2, aes(x=heights, y=probability)) + 
      geom_area(fill = "green") + 
      gghighlight(list_heights <= 150)+
      theme_bw()
figure_5

This will return the probability that the randomly generated variable X takes on a height less than or equal to 170cm.

figure_6<-ggplot(data=dummy_df2, aes(x=heights, y=probability)) + 
       geom_area(fill = "green") + 
       gghighlight(list_heights <= 170)+
       theme_bw()
figure_6

4.3 Changing size of the randomly generated variable - heights of women: increments of 1

list_heights <- seq(135, 195, by = 0.1)
dummy_df<-pnorm(list_heights, mean = 165, sd = 10)
dummy_df<-data.frame(dummy_df)
names(dummy_df)<-"probability"
dummy_df2<-cbind(dummy_df, list_heights)
names(dummy_df2)[names(dummy_df2)=="list_heights"]<-"heights"
figure_7<-ggplot()+
  geom_step(data=dummy_df2, mapping=aes(x=heights, y=probability)) +
  theme_bw()+
  ggtitle("Heights of Women:\n135 - 195cm by 0.1")
figure_7

4.4 Changing size of the randomly generated variable - heights of women: increments of 0.1

list_heights <- seq(135, 195, by = 1)
dummy_df<-pnorm(list_heights, mean = 165, sd = 10)
dummy_df<-data.frame(dummy_df)
names(dummy_df)<-"probability"
dummy_df2<-cbind(dummy_df, list_heights)
names(dummy_df2)[names(dummy_df2)=="list_heights"]<-"heights"
figure_8<-ggplot()+
  geom_step(data=dummy_df2, mapping=aes(x=heights, y=probability)) +
  theme_bw()+
  ggtitle("Heights of Women:\n135 - 195cm by 1")
figure_8

4.5 Changing size of the randomly generated variable - heights of women: increments of 0.01

list_heights <- seq(135, 195, by = 10)
dummy_df<-pnorm(list_heights, mean = 165, sd = 10)
dummy_df<-data.frame(dummy_df)
names(dummy_df)<-"probability"
dummy_df2<-cbind(dummy_df, list_heights)
names(dummy_df2)[names(dummy_df2)=="list_heights"]<-"heights"
figure_9<-ggplot()+
  geom_step(data=dummy_df2, mapping=aes(x=heights, y=probability)) +
  theme_bw()+
  ggtitle("Heights of Women:\n135 - 195cm by 10")
figure_9

4.6 Visualizing the results

grid.arrange(figure_7, figure_8, figure_9, ncol=3)

Smaller increments generate more data points for the CDF. This creates a smoother, more continuous-looking CDF. With by = 10, the CDF is “jumpy” because there are fewer points.

5 Generating the inverse of cumulative distribution function (CDF) for a normal distribution of heights

The qnorm function generates a vector for the inverse cumulative distribution function of the normal distribution with a mean \(\mu\) and a standard deviation \(\sigma\). It returns the height x such that pnorm(x, \(\mu\), \(\sigma\)) = p

\(P(X\leq x) = F_X(x)\)

Let’s use the same data. First let us examine what we created previously.

head(dummy_df2, n = 10)

Because qnorm works in the opposite direction, we will use the probabilities (that the randomly generated variable X takes on a value less than or equal to x) to recreate the heights in our sequence

probability2<-dummy_df2$probability
dummy_df_q<-qnorm(probability2, mean = 165, sd = 10)
dummy_df_q<-data.frame(dummy_df_q)
names(dummy_df_q)<-"q"
dummy_df_q2<-cbind(dummy_df_q, probability2)

Let us now inspect the output

head(dummy_df_q2, n=10)

This looks very familiar. Let us now compare it to the previous dataset

head(dummy_df2, n=10)
head(dummy_df_q2, n=10)

6 Generating the probability density function (PDF) for a normal distribution of heights

The dnorm function generates a vector for a probability density function of the normal distribution. We can specify the mean \(\mu\) and the standard deviation \(\sigma\).

dnorm(x, \(\mu\), \(\sigma\)) - probability density function of the normal distribution with mean \(\mu\) and standard deviation \(\sigma\)

list_heights<-seq(135, 195, by =0.1)
dnorm_df<-dnorm(list_heights, mean = 165, sd = 10)
dnorm_df<-data.frame(dnorm_df)
names(dnorm_df)<-"density"
dnorm_df2<-cbind(dnorm_df, list_heights)
names(dnorm_df2)[names(dnorm_df2)=="list_heights"]<-"heights"
figure_10<-ggplot()+
  geom_step(data=dnorm_df2, mapping=aes(y=density, x=heights)) +
  theme_bw()+
  ggtitle("Heights from 135 to 195:\n Increments of 0.01")
figure_10

Let us see how changing some of these parameters may affect the appearance of our bell curve

6.1 Sequence of Heights from 135 to 195: Increments of 1

list_heights<-seq(135, 195, by =1)
dnorm_df_3<-dnorm(list_heights, mean = 165, sd = 10)
dnorm_df_3<-data.frame(dnorm_df_3)
names(dnorm_df_3)<-"density"
dnorm_df_3<-cbind(dnorm_df_3, list_heights)
names(dnorm_df_3)[names(dnorm_df_3)=="list_heights"]<-"heights"
figure_11<-ggplot()+
  geom_step(data=dnorm_df_3, mapping=aes(y=density, x=heights)) +
  theme_bw()+
  ggtitle("Heights from 135 to 195:\nIncrements of 1")
figure_11

6.2 Sequence of Heights from 135 to 195: Increments of 10

list_heights<-seq(135, 195, by =10)
dnorm_df_4<-dnorm(list_heights, mean = 165, sd = 10)
dnorm_df_4<-data.frame(dnorm_df_4)
names(dnorm_df_4)<-"density"
dnorm_df_4<-cbind(dnorm_df_4, list_heights)
names(dnorm_df_4)[names(dnorm_df_4)=="list_heights"]<-"heights"
figure_12<-ggplot()+
  geom_step(data=dnorm_df_4, mapping=aes(y=density, x=heights)) +
  geom_bar(stat="identity")+
  theme_bw()+
  ggtitle("Heights from 135 to 195:\n Increments of 10")
figure_12

Comparing them

grid.arrange(figure_8, figure_9, ncol=2)

Here too, smaller increments generate more data points for the CDF. This creates a smoother, more continuous-looking CDF. With by = 10, the CDF is “jumpy” because there are fewer points.

Let’s do some problems now.

7 Normal vs. Non-Normal Distributions

As you might suspect, life expectancy in our dataset is not normally distributed.

Show the code
#Setting path
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week5/lab/data/")
#Step1: Loading the data
life_expectancy_df <- read.csv(file = './life-expectancy.csv')
library(dplyr)

#Step2: Calculating the mean
life_expectancy_df2<-life_expectancy_df%>%
  dplyr::group_by(Code)%>%
  dplyr::summarize(life_exp_mean=mean(Life.expectancy.at.birth..historical.))

#Step3: Getting rid of countries
weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_life_expectancy_df<-subset(life_expectancy_df2, !(Code %in% weird_labels))

#Step4: Calculating the mean and SDin the dataset
mean_empirical<-mean(clean_life_expectancy_df$life_exp_mean, na.rm = T)
mean_empirical_val<-as.character(round(mean_empirical, 0))
sd_empirical<-sd(clean_life_expectancy_df$life_exp_mean, na.rm = T)
sd_empirical_val<-as.character(round(sd(clean_life_expectancy_df$life_exp_mean, na.rm = T), 2))

#Step5: Creating annotation text
lines<- paste("Mean (\u03BC)", ": ", mean_empirical_val, "\n", "SD(\u03C3)", ": ", sd_empirical_val)

#Step6: Making the histogram for the dataset
y_coord<-18
fig_hist1<-ggplot(data = clean_life_expectancy_df, aes(x=life_exp_mean))+
  geom_histogram()+
  #Multiplying by 1.5. Otherwise, density will not be visible
  geom_density(aes(y=..count..*1.5))+
  theme_bw()+
  geom_vline(xintercept=mean_empirical, linetype='dashed', col = 'red')+
  ggtitle("Life Expectacy as is\n in the Dataset")+
  annotate(geom="text", 
           x=mean_empirical-12, 
           y=y_coord, 
           label=lines,
           color="red")+
  ylab("count")

#Step7: Generating 235 observations following a normal distribution with same mean and standard deviation
set.seed(123)
generated_data<-rnorm(235, mean=mean_empirical, sd=sd_empirical)
generated_data<-data.frame(generated_data)
names(generated_data)<-"life_exp_mean"

#Step8: Making the histogram for the generated dataset
fig_hist2<-ggplot(data = generated_data, aes(x=life_exp_mean))+
  geom_histogram()+
  #Multiplying by 1.5. Otherwise, density will not be visible
  geom_density(aes(y=..count..*1.5))+
  theme_bw()+
  geom_vline(xintercept=mean_empirical, linetype='dashed', col = 'red')+
  ggtitle("Life Expectacy generated\n on a Normal Distribution")+
  annotate(geom="text", 
           x=mean_empirical-12, 
           y=y_coord+3, 
           label=lines,
           color="red")+
  ylab("count")

#Step9: Comparing the two histograms
grid.arrange(fig_hist1, fig_hist2, ncol=2)

How do we do we obtain these graphs? Let us work on the first graph.

7.1 Working with the empirical (actual) data

Let us go back to the life expectancy dataset

Copy and paste the life expectancy dataset from Lab 4 or download it again. You can download the life expectancy dataset below:

7.2 Load the dataset

#Setting path
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/stats/week5/lab/data/")
life_exp_mean <- read.csv(file = './life-expectancy.csv')

7.3 Calculating the mean

life_exp_mean2<-life_exp_mean%>%
  dplyr::group_by(Code, Entity)%>%
  dplyr::summarize(life_exp_mean=mean(Life.expectancy.at.birth..historical.))

7.4 Clearning the Data

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_life_exp_mean_df<-subset(life_exp_mean2, !(Code %in% weird_labels))

7.5 Calculating the mean and SD in the dataset

mean_empirical<-mean(clean_life_exp_mean_df$life_exp_mean, na.rm = T)
mean_empirical_val<-as.character(round(mean_empirical, 0))
sd_empirical<-sd(clean_life_exp_mean_df$life_exp_mean, na.rm = T)
sd_empirical_val<-as.character(round(sd(clean_life_exp_mean_df$life_exp_mean, na.rm = T), 2))

7.6 Creating annotation text

lines<- paste("Mean (\u03BC)", ": ", mean_empirical_val, "\n", "SD(\u03C3)", ": ", sd_empirical_val)

7.7 Making the histogram for the dataset

y_coord<-17
fig_hist1<-ggplot(data = clean_life_exp_mean_df, aes(x=life_exp_mean))+
  geom_histogram()+
  #Multiplying by 1.5. Otherwise, density will not be visible
  geom_density(aes(y=..count..*1.5))+
  theme_bw()+
  geom_vline(xintercept=mean_empirical, linetype='dashed', col = 'red')+
  ggtitle("Life Expectacy as is\n in the Dataset")+
  annotate(geom="text", 
           x=mean_empirical-10, 
           y=y_coord, 
           label=lines,
           color="red")
fig_hist1

8 Working with the made-up (generated) data

8.1 Generating 235 observations following a normal distribution with same mean and standard deviation

#Step7: Generating 235 observations following a normal distribution with same mean and standard deviation
set.seed(123)
generated_data<-rnorm(235, mean=mean_empirical, sd=sd_empirical)
generated_data<-data.frame(generated_data)
names(generated_data)<-"life_exp_mean"

8.2 Making the histogram for the generated dataset

fig_hist2<-ggplot(data = generated_data, aes(x=life_exp_mean))+
  geom_histogram()+
  #Multiplying by 1.5. Otherwise, density will not be visible
  geom_density(aes(y=..count..*1.5))+
  theme_bw()+
  geom_vline(xintercept=mean_empirical, linetype='dashed', col = 'red')+
  ggtitle("Life Expectacy generated\n on a Normal Distribution")+
  annotate(geom="text", 
           x=mean_empirical-12, 
           y=y_coord+6, 
           label=lines,
           color="red")+
  ylab("count")

9 Comparing the two histograms

grid.arrange(fig_hist1, fig_hist2, ncol=2)

9.1 Generating more observations following a normal distribution with same mean and standard deviation

generated_data_1000obs<-rnorm(1000, mean=mean_empirical, sd=sd_empirical)
generated_data_1000obs<-data.frame(generated_data_1000obs)
names(generated_data_1000obs)<-"life_exp_mean"

9.2 Making the histogram for the generated dataset (1000 obs.)

fig_hist3<-ggplot(data = generated_data_1000obs, aes(x=life_exp_mean))+
  geom_histogram()+
  #Multiplying by 1.5. Otherwise, density will not be visible
  geom_density(aes(y=..count..*1.5))+
  theme_bw()+
  geom_vline(xintercept=mean_empirical, linetype='dashed', col = 'red')+
  ggtitle("Life Expectacy generated\n on a Normal Distribution\n 1000 Obs.")+
  annotate(geom="text", 
           x=mean_empirical-10, 
           y=y_coord, 
           label=lines,
           color="red")
fig_hist3

So, we can see that by having more observations, our dataset approaches something that looks like a normal distribution. Note that our mean \(\mu\) and standard deviation \(\sigma\) are the same as in the original data.

10 Problem

Let’s assume that the data in our original empirical dataset is normally distributed with the mean of 61.93 years and a standard deviation of 8.69 years.

10.1 Life expectancy of less than or equal 50 years

What is the probability that a randomly selected country will have a life expectancy of less than or equal 50 years?

ctry_50<-subset(life_exp_mean2, life_exp_mean<=50)
ctry_50
nrow(ctry_50)/nrow(life_exp_mean2)
[1] 0.1171875

The probability that a randomly selected country will have a life expectancy of less than or equal to 50 years is 0.12.

Let us also visualize this probability. We will proceed by calculating the frequency table.

#Step1: Rounding the heights
new_data2<-life_exp_mean2
new_data2$life_exp_mean_rounded<-round(new_data2$life_exp_mean, 0)
head(new_data2, n=5)
#Step2: Creating a frequency table
freq_table<-table(new_data2$life_exp_mean_rounded)
freq_table

38 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 
 1  2  3  4  4  5  5  3  4  5 12  6  5  6  5  6  7  8 11 10  7  8 14 13  5 15 
68 69 70 71 72 73 74 75 77 
10 15 14 14  8  6  3  8  4 
#Step3: Turning the table into a dataframe
freq_table<-data.frame(freq_table)
#Step4: Providing more intuitive names
names(freq_table)[1]<-"life_exp_mean_rounded"
names(freq_table)[2]<-"count"
names(freq_table)
[1] "life_exp_mean_rounded" "count"                
#Step5: Turning factor variables into numeric
freq_table$life_exp_mean_rounded<-as.numeric(as.character(freq_table$life_exp_mean_rounded))
str(freq_table)
'data.frame':   35 obs. of  2 variables:
 $ life_exp_mean_rounded: num  38 43 44 45 46 47 48 49 50 51 ...
 $ count                : int  1 2 3 4 4 5 5 3 4 5 ...
#Step6: Calculaing the PMFs
freq_table$sum_frequency<-sum(freq_table$count)
freq_table$pmf<-freq_table$count/freq_table$sum_frequency
head(freq_table, n=5)

The next step is to create the CDF. A CDF tells us probability that the randomly generated variable X takes on a height less than or equal to x. We will be using the function that we created previously.

#Step7: Calculating the CDF
new_data2$cdf <- ecdf(new_data2$life_exp_mean_rounded)(new_data2$life_exp_mean_rounded)

Plotting the CDF

#Step8: Plotting the CDF
figure_13<-ggplot()+
  geom_step(data=new_data2, mapping=aes(x=life_exp_mean_rounded, y=cdf)) +
  geom_bar(stat="identity")+
  theme_bw()
figure_13

Merging the PMFs to the dataset

#Step9: Selecting only the relevant variables
freq_table_x<-subset(freq_table, select = c(life_exp_mean_rounded, pmf))
head(freq_table_x, n=10)
#Step10: Performing the left join
new_data3<-left_join(new_data2, freq_table_x, by=c("life_exp_mean_rounded" = "life_exp_mean_rounded"))
#Probability Density
figure_14<-ggplot(data=new_data3, mapping=aes(x=life_exp_mean_rounded, y=pmf)) +
  geom_area(fill = "green") +
  gghighlight(new_data2$life_exp_mean_rounded <= 50)+
  theme_bw()+
  ggtitle("Probability\nDensity")
figure_14

#Cumulative Probability
figure_15<-ggplot()+
  geom_step(data=new_data2, mapping=aes(x=life_exp_mean_rounded, y=cdf)) +
  geom_area(fill = "green") +
  geom_bar(stat="identity")+
  theme_bw()+
  ggtitle("Cumulative\nProbability")

#Cumulative Probability < 50
figure_16<-ggplot(data=new_data2, aes(x=life_exp_mean_rounded, y=cdf)) + 
  geom_area(fill = "green") + 
  gghighlight(new_data2$life_exp_mean_rounded <= 50)+
  theme_bw()+
  ggtitle("Cumulative\nProbability")


grid.arrange(figure_14, figure_15, figure_16, ncol=3)

If our distribution was normal, it would look different:

It would look something like the following:

If we assume a normal distribution, we can simply get this answer this way:

res<-pnorm(50, mean = mean_empirical, sd = sd_empirical)
res
[1] 0.08475828

The probability of getting a country with a life expectancy of less than 50 is 0.12, if we assume a normal distribution.

10.2 Life expectancy of more than 70 years

What is the probability that a randomly selected country will have a life expectancy of more than 70 years?

So remember that the CDF is the probability that the randomly generated variable X takes on a height LESS than or equal to x. We will first subset the data to less than 70 and take the complement.

prob_more_70<-subset(new_data2, life_exp_mean_rounded<70)
1-max(prob_more_70$cdf)
[1] 0.2226562

Therefore, probability of getting a country with a life expectancy of greater than 70 is 0.22.

If we assume a normal distribution, we can simply get a similar answer this way:

res<-pnorm(70, mean = mean_empirical, sd = sd_empirical)
1-res
[1] 0.1765817

Here is what it looks like

figure_20<-ggplot()+
  geom_step(data=new_data2, mapping=aes(x=life_exp_mean_rounded, y=cdf)) +
  geom_area(fill = "green") +
  geom_bar(stat="identity")+
  theme_bw()


figure_21<-ggplot(data=new_data2, aes(x=life_exp_mean_rounded, y=cdf)) + 
  geom_area(fill = "green") + 
  gghighlight(new_data2$life_exp_mean_rounded > 70)+
  theme_bw()

grid.arrange(figure_20, figure_21, ncol=2)

10.3 Percentile

What is the 90th percentile of life expectancy?

We can obtain this by - ordering the ages and calculate the percentile - subset based on CDF and get the minimum

list_ages<-new_data2$life_exp_mean_rounded
list_ages<-list_ages[order(list_ages)]
quantile(new_data2$life_exp_mean_rounded, c(0.90))
90% 
 72 
pct_90<-subset(new_data2, cdf>0.9)
#pct_90
min(pct_90$life_exp_mean_rounded)
[1] 72

The 90th percentile of life expectancy is 72.

If we assume a normal distribution, we can also use qnorm. This generates a vector for the inverse cumulative distribution function of the normal distribution with a mean \(\mu\) and a standard deviation \(\sigma\). It is not exactly the same, but it is close.

res<-qnorm(0.9, mean = mean_empirical, sd = sd_empirical)
res
[1] 73.06729