Statistical Analysis

Lab 5: Normal Distribution & R Functions

Bogdan G. Popescu

bogdan.popescu@johncabot.edu

John Cabot University

Intro

Overview

Today, we’ll learn how to:

Use rnorm to generate random samples from a normal distribution
Use pnorm to compute cumulative probabilities (CDF)
Use qnorm to find quantiles (inverse CDF)
Use dnorm to compute probability densities (PDF)
Compare normal vs. non-normal distributions using real data

Creating a New Quarto Document

Open a new Quarto document. Select all text (Ctrl+A / Cmd+A), delete, then paste:

---
title: "Lab 5"
author: "Your Name"
date: today
format:
  html:
    toc: true
    number-sections: true
    colorlinks: true
    smooth-scroll: true
    embed-resources: true
---

Render the document and save it under “week5” in your “stats” folder.

Background

Loading Libraries

rm(list = ls())
library(ggplot2)
library(gridExtra)
library(gghighlight)
library(dplyr)

There are four functions in R related to normal distributions: rnorm, pnorm, qnorm, dnorm.

`rnorm`: Generating Random Samples

`rnorm`

What Does It Do?

rnorm(n, mean, sd) generates n random numbers from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\).

We will simulate women’s heights (cm) using three sample sizes.

Sample of 20 Observations

Generating the Data

set.seed(123)
generated_heights_20obs <- rnorm(20, mean = 165, sd = 10)
generated_heights_20obs <- data.frame(height = generated_heights_20obs)
head(generated_heights_20obs, n = 10)

Sample of 20 Observations

Plotting the Histogram

figure_1 <- ggplot(data = generated_heights_20obs, aes(x = height)) +
  geom_histogram(bins = 10) +
  theme_bw() +
  ggtitle("20 Obs.")
figure_1

Sample of 100 Observations

set.seed(123)
generated_heights_100obs <- rnorm(100, mean = 165, sd = 10)
generated_heights_100obs <- data.frame(height = generated_heights_100obs)

figure_2 <- ggplot(data = generated_heights_100obs, aes(x = height)) +
  geom_histogram() +
  theme_bw() +
  ggtitle("100 Obs.")
figure_2

Sample of 1000 Observations

set.seed(123)
generated_heights_1000obs <- rnorm(1000, mean = 165, sd = 10)
generated_heights_1000obs <- data.frame(height = generated_heights_1000obs)

figure_3 <- ggplot(data = generated_heights_1000obs, aes(x = height)) +
  geom_histogram() +
  theme_bw() +
  ggtitle("1000 Obs.")
figure_3

Comparing the Three Samples

grid.arrange(figure_1, figure_2, figure_3, ncol = 3)

Small samples (n = 20): More variability, less smooth
Large samples (n = 1000): Converges to theoretical normal distribution
Law of Large Numbers: As n increases, the sample distribution approximates the population distribution

`pnorm`: Cumulative Distribution Function

`pnorm`

What Does It Do?

pnorm(x, mean, sd) returns the probability that a normally distributed variable \(X\) takes a value \(\leq x\):

\[F_X(x) = P(X \leq x)\]

We will use heights of women from 135 to 195 cm.

Heights of Women: CDF

Setting Up the Data

list_heights <- seq(135, 195, by = 0.1)
dummy_df <- pnorm(list_heights, mean = 165, sd = 10)
dummy_df2 <- data.frame(heights = list_heights, probability = dummy_df)

figure_4 <- ggplot(data = dummy_df2, aes(x = heights, y = probability)) +
  geom_step() +
  theme_bw() +
  ggtitle("CDF: Heights 135 to 195 cm")
figure_4

Highlighting Probabilities

P(Height \(\leq\) 150)

figure_5 <- ggplot(data = dummy_df2, aes(x = heights, y = probability)) +
  geom_area(fill = "green") +
  gghighlight(heights <= 150) +
  theme_bw() +
  ggtitle("P(Height <= 150)")
figure_5

Highlighting Probabilities

P(Height \(\leq\) 170)

figure_6 <- ggplot(data = dummy_df2, aes(x = heights, y = probability)) +
  geom_area(fill = "green") +
  gghighlight(heights <= 170) +
  theme_bw() +
  ggtitle("P(Height <= 170)")
figure_6

Effect of Increment Size

How Granularity Affects the CDF Shape

Show Code

df_01 <- data.frame(heights = seq(135, 195, by = 0.1),
                    probability = pnorm(seq(135, 195, by = 0.1), 165, 10))
df_1 <- data.frame(heights = seq(135, 195, by = 1),
                   probability = pnorm(seq(135, 195, by = 1), 165, 10))
df_10 <- data.frame(heights = seq(135, 195, by = 10),
                    probability = pnorm(seq(135, 195, by = 10), 165, 10))
fig_inc1 <- ggplot(df_01, aes(x = heights, y = probability)) +
  geom_step() + theme_bw() + ggtitle("by = 0.1")
fig_inc2 <- ggplot(df_1, aes(x = heights, y = probability)) +
  geom_step() + theme_bw() + ggtitle("by = 1")
fig_inc3 <- ggplot(df_10, aes(x = heights, y = probability)) +
  geom_step() + theme_bw() + ggtitle("by = 10")
grid.arrange(fig_inc1, fig_inc2, fig_inc3, ncol = 3)

Smaller increments produce more data points, creating a smoother curve. With by = 10, the CDF is “jumpy.”

`qnorm`: Inverse CDF

`qnorm`

What Does It Do?

qnorm(p, mean, sd) returns the value \(x\) such that \(P(X \leq x) = p\).

It is the inverse of pnorm: given a probability, it returns the corresponding value.

Why is this useful? Whenever you need to find a cutoff value — “What height is at the 90th percentile?”, “What score do you need to be in the top 5%?” — you use qnorm.

`qnorm` in Action

Recovering Heights from Probabilities

Let us examine the CDF data we created previously:

head(df_1, n = 10)

`qnorm` in Action

Applying the Inverse

probability2 <- df_1$probability
dummy_df_q <- qnorm(probability2, mean = 165, sd = 10)
dummy_df_q2 <- data.frame(q = dummy_df_q, probability = probability2)
head(dummy_df_q2, n = 10)

The q column recovers the original heights — qnorm reverses what pnorm did.

`dnorm`: Probability Density Function

`dnorm`

What Does It Do?

dnorm(x, mean, sd) returns the height of the bell curve at value \(x\).

This is the probability density — not a probability itself, but it tells you how concentrated observations are around a given value. Higher density = more likely to observe values nearby.

PDF for Heights

Show Code

list_heights <- seq(135, 195, by = 0.1)
dnorm_df2 <- data.frame(
  heights = list_heights,
  density = dnorm(list_heights, mean = 165, sd = 10)
)
ggplot(data = dnorm_df2, aes(x = heights, y = density)) +
  geom_line() +
  theme_bw() +
  ggtitle("PDF: Women's Heights (mean = 165, sd = 10)")

Overlaying the PDF on a Histogram

Connecting `rnorm` and `dnorm`

Show Code

ggplot(data = generated_heights_1000obs, aes(x = height)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "grey80", color = "white") +
  geom_line(data = dnorm_df2, aes(x = heights, y = density), color = "red", linewidth = 1) +
  theme_bw() +
  ggtitle("Histogram (1000 obs.) with Theoretical PDF Overlay")

The red curve is the theoretical normal distribution. The histogram of our rnorm sample approximates it.

Normal vs. Non-Normal Distributions

Loading Life Expectancy Data

Working with Empirical Data

life_expectancy_df <- read.csv("./data/life-expectancy.csv")

life_expectancy_df2 <- life_expectancy_df %>%
  dplyr::group_by(Code, Entity) %>%
  dplyr::summarize(life_exp_mean = mean(Life.expectancy.at.birth..historical.))

Cleaning the Data

weird_labels <- c("OWID_KOS", "OWID_WRL", "")
clean_life_exp_mean_df <- subset(life_expectancy_df2, !(Code %in% weird_labels))

mean_empirical <- mean(clean_life_exp_mean_df$life_exp_mean, na.rm = TRUE)
sd_empirical <- sd(clean_life_exp_mean_df$life_exp_mean, na.rm = TRUE)

Mean: 61.93 years | SD: 8.69 years

Histogram of Empirical Data

Show Code

lines <- paste("Mean:", round(mean_empirical, 0),
               "\nSD:", round(sd_empirical, 2))

fig_hist1 <- ggplot(data = clean_life_exp_mean_df, aes(x = life_exp_mean)) +
  geom_histogram() +
  geom_density(aes(y = after_stat(count) * 1.5)) +
  theme_bw() +
  geom_vline(xintercept = mean_empirical, linetype = "dashed", col = "red") +
  ggtitle("Life Expectancy\n(Empirical Data)") +
  annotate("text", x = mean_empirical - 10, y = 17, label = lines, color = "red")
fig_hist1

Generating Normal Data

Same Mean & SD, 235 Observations

set.seed(123)
generated_data <- rnorm(235, mean = mean_empirical, sd = sd_empirical)
generated_data <- data.frame(life_exp_mean = generated_data)

fig_hist2 <- ggplot(data = generated_data, aes(x = life_exp_mean)) +
  geom_histogram() +
  geom_density(aes(y = after_stat(count) * 1.5)) +
  theme_bw() +
  geom_vline(xintercept = mean_empirical, linetype = "dashed", col = "red") +
  ggtitle("Life Expectancy\n(Generated Normal)") +
  annotate("text", x = mean_empirical - 12, y = 23, label = lines, color = "red") +
  ylab("count")

Comparing Empirical vs. Generated

grid.arrange(fig_hist1, fig_hist2, ncol = 2)

The empirical data is not normally distributed — it is left-skewed.

Generating More Observations

1000 Observations

set.seed(123)
generated_data_1000obs <- rnorm(1000, mean = mean_empirical, sd = sd_empirical)
generated_data_1000obs <- data.frame(life_exp_mean = generated_data_1000obs)

Show Code

fig_hist3 <- ggplot(data = generated_data_1000obs, aes(x = life_exp_mean)) +
  geom_histogram() +
  geom_density(aes(y = after_stat(count) * 1.5)) +
  theme_bw() +
  geom_vline(xintercept = mean_empirical, linetype = "dashed", col = "red") +
  ggtitle("Generated Normal\n(1000 Obs.)") +
  annotate("text", x = mean_empirical - 10, y = 17, label = lines, color = "red")
fig_hist3

With more observations, the generated data looks increasingly like a bell curve.

Problems

Problem Setup

Let’s assume that life expectancy is normally distributed with:

Mean (\(\mu\)): 61.93 years
Standard deviation (\(\sigma\)): 8.69 years

Problem 1: Using `pnorm`

P(Life Expectancy \(\leq\) 50)

What is the probability that a randomly selected country has life expectancy \(\leq\) 50?

ctry_50 <- subset(life_expectancy_df2, life_exp_mean <= 50)
nrow(ctry_50) / nrow(life_expectancy_df2)

[1] 0.1171875

Problem 1 (cont.)

Using `pnorm`

If we assume a normal distribution, we can compute this directly:

res <- pnorm(50, mean = mean_empirical, sd = sd_empirical)
res

[1] 0.08475828

Problem 1 (cont.)

Visualizing with CDF

Show Code

new_data2 <- life_expectancy_df2
new_data2$life_exp_mean_rounded <- round(new_data2$life_exp_mean, 0)
freq_table <- table(new_data2$life_exp_mean_rounded)
freq_table <- data.frame(freq_table)
names(freq_table) <- c("life_exp_mean_rounded", "count")
freq_table$life_exp_mean_rounded <- as.numeric(as.character(freq_table$life_exp_mean_rounded))
freq_table$sum_frequency <- sum(freq_table$count)
freq_table$pmf <- freq_table$count / freq_table$sum_frequency
new_data2$cdf <- ecdf(new_data2$life_exp_mean_rounded)(new_data2$life_exp_mean_rounded)

# Deduplicated datasets for clean plots
cdf_unique <- new_data2 %>%
  distinct(life_exp_mean_rounded, .keep_all = TRUE) %>%
  arrange(life_exp_mean_rounded)
pmf_unique <- freq_table %>%
  select(life_exp_mean_rounded, pmf) %>%
  arrange(life_exp_mean_rounded)

# PMF with highlighted region
fig_pmf <- ggplot(data = pmf_unique, aes(x = life_exp_mean_rounded, y = pmf)) +
  geom_area(fill = "green") +
  gghighlight(life_exp_mean_rounded <= 50) +
  theme_bw() +
  ggtitle("PMF: P(X <= 50)")

# CDF with highlighted region using step-function rectangles
ecdf_fn <- ecdf(new_data2$life_exp_mean_rounded)
cdf_plot <- data.frame(x = sort(unique(new_data2$life_exp_mean_rounded)))
cdf_plot$cdf <- ecdf_fn(cdf_plot$x)
cdf_sub <- subset(cdf_plot, x <= 50)
cdf_rects <- data.frame(
  xmin = cdf_sub$x,
  xmax = c(cdf_sub$x[-1], 50),
  ymin = 0,
  ymax = cdf_sub$cdf
)
fig_cdf <- ggplot() +
  geom_step(data = cdf_plot, aes(x = x, y = cdf)) +
  geom_rect(data = cdf_rects,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            fill = "green", alpha = 0.5) +
  theme_bw() +
  ggtitle("CDF: P(X <= 50)")

grid.arrange(fig_pmf, fig_cdf, ncol = 2)

Problem 2: Using `pnorm` (complement)

P(Life Expectancy > 70)

What is the probability that a randomly selected country has life expectancy > 70?

prob_more_70 <- subset(new_data2, life_exp_mean_rounded < 70)
1 - max(prob_more_70$cdf)

[1] 0.2226562

Problem 2 (cont.)

Using `pnorm`

If we assume a normal distribution:

res <- pnorm(70, mean = mean_empirical, sd = sd_empirical)
1 - res

[1] 0.1765817

Problem 2 (cont.)

Visualizing P(X > 70)

Show Code

fig_cdf_full <- ggplot(data = cdf_plot, aes(x = x, y = cdf)) +
  geom_step() +
  theme_bw() +
  ggtitle("Full CDF")

cdf_sub_70 <- subset(cdf_plot, x > 70)
cdf_rects_70 <- data.frame(
  xmin = cdf_sub_70$x,
  xmax = c(cdf_sub_70$x[-1], max(cdf_sub_70$x)),
  ymin = 0,
  ymax = cdf_sub_70$cdf
)
fig_cdf_70 <- ggplot() +
  geom_step(data = cdf_plot, aes(x = x, y = cdf)) +
  geom_rect(data = cdf_rects_70,
            aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
            fill = "green", alpha = 0.5) +
  theme_bw() +
  ggtitle("CDF: P(X > 70)")

grid.arrange(fig_cdf_full, fig_cdf_70, ncol = 2)

Problem 3: Using `dnorm`

Where Is Life Expectancy Most Concentrated?

Which value has a higher density — a life expectancy of 55 or 65?

d_55 <- dnorm(55, mean = mean_empirical, sd = sd_empirical)
d_65 <- dnorm(65, mean = mean_empirical, sd = sd_empirical)
data.frame(life_exp = c(55, 65), density = c(d_55, d_65))

. . .

Problem 4: Using `qnorm`

90th Percentile of Life Expectancy

What life expectancy marks the top 10% of countries?

quantile(new_data2$life_exp_mean_rounded, c(0.90))

90% 
 72

If we assume a normal distribution, we can use qnorm:

qnorm(0.9, mean = mean_empirical, sd = sd_empirical)

[1] 73.06729

Conclusion

The normal distribution can be fully described by its mean and standard deviation
pnorm and qnorm are inverses: probability \(\leftrightarrow\) value
dnorm tells you where observations are most concentrated — higher density near the mean, lower in the tails
Real-world data (life expectancy) is often skewed, but the normal approximation still gives useful answers

Statistical Analysis

Intro

Overview

Creating a New Quarto Document

Background

Loading Libraries

rnorm: Generating Random Samples

rnorm

What Does It Do?

Sample of 20 Observations

Generating the Data

Sample of 20 Observations

Plotting the Histogram

Sample of 100 Observations

Sample of 1000 Observations

Comparing the Three Samples

pnorm: Cumulative Distribution Function

pnorm

What Does It Do?

Heights of Women: CDF

Setting Up the Data

Highlighting Probabilities

P(Height \(\leq\) 150)

Highlighting Probabilities

P(Height \(\leq\) 170)

Effect of Increment Size

How Granularity Affects the CDF Shape

qnorm: Inverse CDF

qnorm

What Does It Do?

qnorm in Action

Recovering Heights from Probabilities

qnorm in Action

Applying the Inverse

dnorm: Probability Density Function

dnorm

What Does It Do?

PDF for Heights

Overlaying the PDF on a Histogram

Connecting rnorm and dnorm

Normal vs. Non-Normal Distributions

Loading Life Expectancy Data

Working with Empirical Data

Cleaning the Data

Histogram of Empirical Data

Generating Normal Data

Same Mean & SD, 235 Observations

Comparing Empirical vs. Generated

Generating More Observations

1000 Observations

Problems

Problem Setup

Problem 1: Using pnorm

P(Life Expectancy \(\leq\) 50)

Problem 1 (cont.)

Using pnorm

Problem 1 (cont.)

Visualizing with CDF

Problem 2: Using pnorm (complement)

P(Life Expectancy > 70)

Problem 2 (cont.)

Using pnorm

Problem 2 (cont.)

Visualizing P(X > 70)

Problem 3: Using dnorm

Where Is Life Expectancy Most Concentrated?

Problem 4: Using qnorm

90th Percentile of Life Expectancy

Conclusion

`rnorm`: Generating Random Samples

`rnorm`

`pnorm`: Cumulative Distribution Function

`pnorm`

`qnorm`: Inverse CDF

`qnorm`

`qnorm` in Action

`qnorm` in Action

`dnorm`: Probability Density Function

`dnorm`

Connecting `rnorm` and `dnorm`

Problem 1: Using `pnorm`

Using `pnorm`

Problem 2: Using `pnorm` (complement)

Using `pnorm`

Problem 3: Using `dnorm`

Problem 4: Using `qnorm`