Statistical Analysis

Lecture 8: Revision for Midterm

Bogdan G. Popescu

bogdan.popescu@johncabot.edu

John Cabot University

Learning Outcomes & Skills

Learning Outcomes (Reminder)

Use statistical terminology accurately and interpret descriptive statistics, hypothesis tests, and regressions
Organize, clean, and merge datasets using R, and summarize data with numerical and graphical methods
Conduct hypothesis tests (z-tests, t-tests) and estimate bivariate and multivariate regression models
Distinguish correlation from causation using causal diagrams (DAGs) and apply causal inference methods (RCTs, matching, differences-in-differences)
Create effective data visualizations using ggplot2 and produce reproducible reports in Quarto

Skills Acquired

Ability to interpret and analyze data using descriptive statistics and probability distributions
Experience with hypothesis testing and correlation analysis
Good understanding of data visualization techniques
Ability to generate insights from data and present findings effectively

Jobs

Types of jobs where these skills are valuable:

Data Analyst / Statistician
Policy Analyst
Market Research Analyst
Program Evaluator / Impact Analyst
Academic Researcher

Week 1: Variables & Causality

Independent & Dependent Variables

An independent variable (\(X\)) influences or causes variation in another variable
A dependent variable (\(Y\)) depends upon or is caused by the independent variable
Example: Education \(\rightarrow\) Income

Causal Relationships

A causal relationship requires three elements:

\(X\) and \(Y\) covary
The change in \(X\) precedes the change in \(Y\)
The covariation is not spurious

Populations & Samples

Population: complete enumeration of some set of interest
Sampling: the selection of a subset of individuals from within a population
A representative sample reflects the population’s characteristics

Potential Questions from Week 1

What is a hypothesis? Provide two examples
What is a sample?
What does it take for a statistical relationship to be causal?
What is an independent/dependent variable?

Week 2: Measurement & Variable Types

Reliability & Validity

Reliability — the extent to which a measurement procedure yields the same result in repeated trials.

Validity — the degree of correspondence between the measure and the concept it is supposed to measure.

Variable Types

Categorical

Binary: two categories (e.g. 0 = unemployed, 1 = employed)
Nominal: order does not matter (e.g. 0 = Green; 1 = Red; 2 = Blue)
Ordinal: order is meaningful (e.g. 0 = Poor; 1 = Fair; 2 = Good)

Numerical

Discrete: countable values (e.g. no. of individuals in a household)
Continuous: any value in a range (e.g. height, weight, wages)

Interval vs. Ratio

Interval: meaningful distances but no true zero (e.g. Celsius, SAT scores)
- Zero does not represent the absence of the characteristic being measured
Ratio: meaningful distances with a true zero (e.g. weight, years of education)
- A weight of 4 grams is twice a weight of 2 grams
Key difference: ratio variables have a true zero point; interval variables do not

Week 3: Central Tendency

Measures of Central Tendency

Mode: most frequently occurring value
Median: middle value when data is ordered
Mean: arithmetic average — \(\bar{X} = \frac{\sum X_i}{n}\)

Measures of Dispersion

Range: maximum \(-\) minimum
Variance: \(s^2 = \frac{\sum(X_i - \bar{X})^2}{n-1}\)
Standard Deviation: \(s = \sqrt{s^2}\)
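
A minimal R sketch of the measures above, assuming the set (2, 3, 4, 5) from the example questions. Base R has no function for the statistical mode, so `stat_mode()` is a hypothetical helper written for this illustration, not a built-in.

```r
# Example data taken from the variance question in this revision
x <- c(2, 3, 4, 5)

x_mean   <- mean(x)          # arithmetic average
x_median <- median(x)        # middle value of the ordered data
x_range  <- max(x) - min(x)  # maximum minus minimum
x_var    <- var(x)           # sample variance, divides by n - 1
x_sd     <- sd(x)            # square root of the sample variance

# Hypothetical helper: returns the most frequent value(s)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
x_mode <- stat_mode(c(4, 5, 6, 10, 6))

print(c(mean = x_mean, median = x_median, range = x_range,
        variance = x_var, sd = x_sd, mode = x_mode))
```

Note that `var()` and `sd()` divide by \(n - 1\), which matches the formula for \(s^2\) above.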

Sigma Notation

\(\displaystyle\sum_{i=1}^{n} X_i\) means sum all values from \(i=1\) to \(n\)

Example: \(\displaystyle\sum_{n=2}^{5} n^2 = 2^2 + 3^2 + 4^2 + 5^2 = 4 + 9 + 16 + 25 = 54\)
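
The same sums can be checked in R. This is a small sketch using only base R; the variable names are chosen for illustration.

```r
# Sum of n^2 for n = 2, ..., 5. The parentheses matter because
# ^ binds more tightly than : in R (2:5^2 would mean 2:25).
total_squares <- sum((2:5)^2)

# Sum of n^3 for n = 1, ..., 3
total_cubes <- sum((1:3)^3)

print(total_squares)  # 54
print(total_cubes)    # 36
```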

Potential Questions from Week 3

1. You have the following set: \([3, 4, 7]\). Calculate: Mode, Median, Mean, Standard Deviation, Variance

2. Calculate: \(\displaystyle\sum_{n=2}^{5} n^2\)
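
As a worked sketch for question 1, take the set \([3, 4, 7]\). Each value occurs once, so there is no single mode, and the median is the middle ordered value, 4. The remaining quantities follow from the formulas above:

\[\bar{X} = \frac{3 + 4 + 7}{3} = \frac{14}{3} \approx 4.67\]

\[s^2 = \frac{\left(3 - \frac{14}{3}\right)^2 + \left(4 - \frac{14}{3}\right)^2 + \left(7 - \frac{14}{3}\right)^2}{3 - 1} = \frac{\frac{25}{9} + \frac{4}{9} + \frac{49}{9}}{2} = \frac{13}{3} \approx 4.33\]

\[s = \sqrt{\frac{13}{3}} \approx 2.08\]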

Week 4: Probabilities

Probability Rules

Addition rule (mutually exclusive): \(P(A \text{ or } B) = P(A) + P(B)\)
Multiplication rule (independent events): \(P(A \text{ and } B) = P(A) \times P(B)\)
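
A short R sketch of both rules, using the dice problems from the example questions. The enumeration with `expand.grid()` is only a cross-check and assumes two fair six-sided dice.

```r
# Addition rule: 4, 5 and 6 are mutually exclusive outcomes of one roll
p_4_5_or_6 <- 1/6 + 1/6 + 1/6

# Multiplication rule: the two dice are independent, so multiply
p_both_three <- (1/6) * (1/6)

# Cross-check by listing all 36 equally likely outcomes of two dice
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)
p_both_three_enum <- mean(rolls$die1 == 3 & rolls$die2 == 3)

print(c(p_4_5_or_6, p_both_three, p_both_three_enum))
```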

Discrete vs. Continuous Variables

Discrete: countable values (e.g. no. of people in a household)
Continuous: any value in a range (e.g. height, weight, wages)

Central Limit Theorem (CLT)

With sufficiently large samples, the means of repeated samples form an approximately normal distribution, regardless of the population’s shape
Larger samples \(\rightarrow\) narrower distribution of sample means
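
The two CLT statements can be illustrated with a small simulation. This is a sketch under assumptions not in the slides: the population is exponential (clearly skewed), and the sample sizes, number of repetitions, and seed are arbitrary choices.

```r
set.seed(42)  # arbitrary seed so the simulation is repeatable

# Draw `reps` samples of size n from a skewed (exponential) population
# and return the mean of each sample
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

means_small <- sample_means(n = 5)
means_large <- sample_means(n = 50)

# The spread of the sample means shrinks roughly like 1 / sqrt(n)
print(c(sd_n5 = sd(means_small), sd_n50 = sd(means_large)))

# hist(means_large) would show a roughly bell-shaped histogram
```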

Potential Questions from Week 4

What is a discrete variable?
What is a continuous variable?
Provide examples of discrete and continuous variables
Problems with probabilities (addition and multiplication rules)

Week 5: Distributions

Probability Mass Functions (PMF)

Shows the probability of each discrete outcome: \(P(X = x)\)
All probabilities must sum to 1
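
A minimal R sketch of an empirical PMF, using the six grades from the example questions (60 three times, 80 once, 90 twice).

```r
# Grades of the six students from the example question
grades <- c(60, 60, 60, 80, 90, 90)

# Relative frequency of each distinct grade = empirical PMF
pmf <- prop.table(table(grades))
print(pmf)

p_80 <- as.numeric(pmf["80"])  # P(X = 80)
print(p_80)
print(sum(pmf))                # the probabilities sum to 1
```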

Normal Distribution

Symmetric, bell-shaped; characterized by \(\mu\) (mean) and \(\sigma\) (standard deviation)
68-95-99.7 rule: 68.3% within \(\mu \pm 1\sigma\), 95.4% within \(\mu \pm 2\sigma\), 99.7% within \(\mu \pm 3\sigma\)
Standard normal: \(\mu = 0\), \(\sigma = 1\)
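
The percentages in the 68-95-99.7 rule can be recovered from the standard normal CDF. A short sketch using base R's `pnorm()`; `within_k_sd()` is a helper name introduced here for illustration.

```r
# Share of a normal distribution within k standard deviations of the mean,
# computed from the standard normal CDF pnorm()
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

shares <- sapply(1:3, within_k_sd)
print(round(100 * shares, 1))  # 68.3 95.4 99.7
```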

Cumulative Distribution Functions (CDF)

Gives \(P(X \leq x)\) — the probability of falling at or below a threshold
Never decreases: rises from 0 to 1 as \(x\) increases
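
A CDF is the running total of the PMF. The sketch below assumes the grade distribution from the example questions and adds one continuous case with `pnorm()`.

```r
# PMF of the grades example: P(X = 60), P(X = 80), P(X = 90)
pmf <- c("60" = 3/6, "80" = 1/6, "90" = 2/6)

# The CDF accumulates the PMF: P(X <= x)
cdf <- cumsum(pmf)
print(cdf)

# For a continuous variable, pnorm() is the standard normal CDF
print(pnorm(0))  # 0.5: half of the distribution lies at or below the mean
```

For instance, \(P(X \leq 80) = 3/6 + 1/6 = 2/3\).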

Potential Questions from Week 5

Problems with probability mass functions
Questions about standard normal distributions: \(\mu\) and \(\sigma\)

Week 6: Hypothesis Testing

Sample Statistics vs. Population Parameters

\(\bar{X}\) is a sample statistic (estimate)
\(\mu\) is a population parameter (true value)
We use sample statistics to make inferences about population parameters

Z-Tests

One-tailed: tests a specific direction (\(H_1\): \(\mu > k\) or \(\mu < k\))
Two-tailed: tests whether a value differs in either direction (\(H_1\): \(\mu \neq k\))
Identify \(H_0\) and \(H_1\); state whether you reject or don’t reject \(H_0\)

P-Values and Decision Rule

P-value: probability of observing results at least as extreme as ours, assuming \(H_0\) is true
If \(p < 0.05\): reject \(H_0\)
If \(p \geq 0.05\): do not reject \(H_0\)
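
A sketch of the decision rule in R, assuming the test statistic \(z = -1.9176\) from the example output in this revision and a two-tailed alternative. The variable names are illustrative.

```r
z <- -1.9176    # test statistic reported in the example output
alpha <- 0.05   # significance level used in this course

# Two-tailed p-value: probability of a z at least this far from 0
# in either direction, assuming H0 is true
p_two_tailed <- 2 * pnorm(-abs(z))

decision <- if (p_two_tailed < alpha) "reject H0" else "do not reject H0"
print(round(p_two_tailed, 4))
print(decision)
```

The p-value comes out at about 0.055, just above the 0.05 threshold, so \(H_0\) is not rejected.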

Potential Questions from Week 6

Specific lines of code for which you must state \(H_0\) and \(H_1\) and decide whether or not to reject \(H_0\)
Theoretical questions on when to use a two-tailed or a one-tailed test

Week 7: T-tests & Confidence Intervals

Z-test vs. T-test

Z-test: population \(\sigma\) is known and \(n > 30\)
T-test: population \(\sigma\) is unknown; uses sample \(s\) instead
T-distribution has heavier tails for small samples; converges to normal as \(n\) grows
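
The heavier tails show up as larger critical values. A brief sketch comparing the two-tailed 95% cutoffs from `qnorm()` and `qt()`; the degrees of freedom listed are arbitrary illustrative choices.

```r
# Two-tailed 95% critical value of the standard normal distribution
z_crit <- qnorm(0.975)

# Matching t critical values for increasing degrees of freedom (n - 1)
df <- c(3, 10, 30, 100, 1000)
t_crit <- qt(0.975, df)

print(round(z_crit, 3))
print(round(setNames(t_crit, paste0("df=", df)), 3))
```

The t cutoffs fall toward the normal value of about 1.96 as the degrees of freedom grow.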

Confidence Intervals

A plausible range for the true population parameter:

\[CI = \bar{X} \pm t \times \frac{s}{\sqrt{n}}\]

95% CI: if we repeated the sampling many times, about 95% of the resulting intervals would contain the true mean
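
A sketch of the formula in R, assuming the values from the example questions (\(\bar{X} = 100\), \(s = 10\), \(n = 4\)) and a 95% confidence level. It shows both the exact critical value from `qt()` and the rounded value \(t = 2\), which is a no-calculator shortcut rather than the exact cutoff for such a small sample.

```r
x_bar <- 100  # sample mean
s <- 10       # sample standard deviation
n <- 4        # sample size

se <- s / sqrt(n)                # standard error of the mean
t_crit <- qt(0.975, df = n - 1)  # exact 95% critical value for df = 3

ci_exact <- x_bar + c(-1, 1) * t_crit * se

# Assumption: t rounded to 2 as a no-calculator shortcut
ci_rounded <- x_bar + c(-1, 1) * 2 * se

print(round(ci_exact, 2))
print(ci_rounded)  # 90 110
```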

Type I & Type II Errors

Type I error (false positive): rejecting \(H_0\) when it is actually true
Type II error (false negative): retaining \(H_0\) when it is actually false

Potential Questions from Week 7

Specific lines of code for which you must state \(H_0\) and \(H_1\) and decide whether or not to reject \(H_0\)
Theoretical questions on when to use a z-test or a t-test

Exam

The Exam — Grading

The Exam in relation to Grading:

Five problem sets: 35% of the final grade (7% each)
Mid-term: 30%
Final exam: 30%

Problem Sets (Reminder)

Average of:

Initial Submission
Second Submission (if necessary)

Peer grading: evaluating your colleagues’ participation in the group (5%)

The Exam — Format

This is a short exam: 1 hour and 15 minutes

You will have twenty short questions. You have to answer all the questions.

You can get partial credit if you attempt to answer the question.

No calculators or books are allowed.

Example Questions

Example Questions (1–7)

1. What are causal relationships?
2. What is a population?
2a. What is reliability in an experimental setting?
3. Provide an example of a continuous variable
4. Calculate the mode for the following set: (4, 5, 6, 10, 6)
5a. Calculate the median for the following set: (7, 3, 1, 2, 5, 6, 8)
5b. Calculate the median for the following set: (7, 3, 1, 2, 5, 6)
6a. Calculate the mean for the following set: (1, 2, 3, 4, 5)
6b. Calculate the mean for the following set: (3, 3, 4, 4, 5, 5)
7. Calculate the range of the following set: (6, 7, 8, 9, 10)

Example Questions (8–15)

8. Calculate the following: \(\displaystyle\sum_{i=2}^{3} X_i\), where \(X_i = i\)
9. Calculate the following: \(\displaystyle\sum_{n=1}^{3} n^3\)
10. Calculate the variance for the following set: (2, 3, 4, 5)
11. What is the probability of rolling a die and obtaining an outcome of 4, 5 or 6?
12. What is the probability that, if you throw two dice, both will show a 3?
13. Suppose you have 6 people in the class with the following grades: Grade 60 occurs 3 times, Grade 80 occurs 1 time, Grade 90 occurs 2 times. What is \(P(X=80)\)?
14. What is a standard normal distribution?
15. When do we use a two-tailed z-test?

Example Questions (16–17)

16–17. See the following output:

Two-sample z-Test
data: final_latam$urbanization and final_eu$urbanization
z = -1.9176, p-value = 0.05516
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -16.4191204   0.1792105
sample estimates:
  mean of x   mean of y
  59.01841     67.13836

16. What is the \(H_0\)? What is the \(H_1\)?
17. Do you reject the \(H_0\)? Why?

Example Questions (18–20)

18. What is a Type II error?
19. When is a researcher supposed to use a z-test vs. a t-test?
20. Calculate the upper and lower bounds of the confidence interval if you are given the following parameters: \(\bar{X} = 100\), \(s_x = 10\), \(n = 4\)

Example Questions with Answers

Answers: Questions 1–3

1. What are causal relationships?

Causal relationships are relationships that entail three elements:

The independent (\(X\)) and dependent variables (\(Y\)) covary
The change in \(X\) precedes the change in \(Y\)
The covariation between \(X\) and \(Y\) is not a coincidence or is not spurious

2. What is a population?

Population — complete enumeration of some set of interest.

2a. What is reliability in an experimental setting?

Reliability — the extent to which an experiment or measurement procedure yields the same result on repeated trials.

3. Provide an example of a continuous variable

Height, weight, wage

Answers: Questions 4–6a

4. Calculate the mode for the following set: (4, 5, 6, 10, 6)

Mode is 6

5a. Calculate the median for the following set: (7, 3, 1, 2, 5, 6, 8)

We first order the set: 1, 2, 3, 5, 6, 7, 8. We have an odd number of values, so the median is the middle one: 5

5b. Calculate the median for the following set: (7, 3, 1, 2, 5, 6)

We first order the set: 1, 2, 3, 5, 6, 7. We have an even number of values, so the median is the average of the two middle ones: \((3+5)/2 = \textbf{4}\)

6a. Calculate the mean for the following set: (1, 2, 3, 4, 5)

\((1+2+3+4+5)/5 = \textbf{3}\)

Answers: Questions 6b–10

6b. Calculate the mean for the following set: (3, 3, 4, 4, 5, 5)

\((3+3+4+4+5+5)/6 = \textbf{4}\)

7. Calculate the range of the following set: (6, 7, 8, 9, 10)

Range: \(10 - 6 = \textbf{4}\)

8. Calculate the following: \(\sum_{i=2}^{3} X_i\), where \(X_i = i\): \(X_2 + X_3 = 2 + 3 = \textbf{5}\)

9. Calculate the following: \(\sum_{n=1}^{3} n^3 = 1^3 + 2^3 + 3^3 = 1 + 8 + 27 = \textbf{36}\)

10. Calculate the variance for the following set: (2, 3, 4, 5)

Mean \(= (2+3+4+5)/4 = 3.5\)

\(s_x^2 = \frac{(2-3.5)^2 + (3-3.5)^2 + (4-3.5)^2 + (5-3.5)^2}{4-1} = \frac{2.25 + 0.25 + 0.25 + 2.25}{3} = \textbf{5/3}\)

Answers: Questions 11–14

11. What is the probability of rolling a die and obtaining an outcome of either a 4, 5, or 6?

\(1/6 + 1/6 + 1/6 = 3/6 = \textbf{1/2}\)

12. What is the probability that, if you throw two dice, both will show a 3?

\(1/6 \times 1/6 = \textbf{1/36}\)

13. Suppose you have 6 people in the class with the following grades: Grade 60 occurs 3 times, Grade 80 occurs 1 time, Grade 90 occurs 2 times. What is \(P(X=80)\)?

\(P(X=80) = \textbf{1/6}\)

14. What is a standard normal distribution?

It is a normal distribution with mean \(\mu = 0\) and standard deviation \(\sigma = 1\)

Answers: Questions 15–17

15. When do we use a two-tailed z-test?

We use a two-tailed test when we want to see whether something differs from a hypothesized value in either direction (\(H_1\): \(\mu \neq k\)).

16. What is the \(H_0\)? What is the \(H_1\)?

The \(H_0\) is that the true difference in means is equal to 0. The \(H_1\) is that the true difference in means is not equal to 0.

17. Do you reject the \(H_0\)? Why?

No, because the p-value (0.05516) is larger than 0.05.
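
The reported confidence interval leads to the same decision. The sketch below rebuilds it from the printed means and z statistic, assuming the usual relation \(z = \text{difference} / SE\) for a test of a zero difference; small discrepancies from the output come from rounding in the printed z.

```r
mean_latam <- 59.01841  # mean of x in the output
mean_eu <- 67.13836     # mean of y in the output
z <- -1.9176            # reported test statistic

diff_means <- mean_latam - mean_eu  # estimated difference in means
se_diff <- diff_means / z           # standard error implied by z = diff / SE

# 95% confidence interval for the difference in means
ci <- diff_means + c(-1, 1) * qnorm(0.975) * se_diff
print(round(ci, 2))

# The interval contains 0, which matches not rejecting H0 at the 5% level
contains_zero <- ci[1] < 0 && ci[2] > 0
print(contains_zero)
```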

Answers: Questions 18–20

18. What is a Type II error?

When researchers retain a null hypothesis that is actually untrue.

19. When is a researcher supposed to use a z-test vs. a t-test?

A z-test is used when the population variance and SD are known and when the sample size is larger than 30. A t-test is used when the population SD is unknown and the sample SD is used in its place.

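20. Calculate the upper and lower bounds of the confidence interval:

Given: \(\bar{X} = 100\), \(s_x = 10\), \(n = 4\)

We apply the formula for the calculation of confidence intervals, taking \(t \approx 2\) as a rounded 95% critical value (the exact value for \(n = 4\), with 3 degrees of freedom, is about 3.18):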

\[CI = \bar{X} \pm t \times \left(\frac{s_x}{\sqrt{n}}\right) = 100 \pm 2 \times \left(\frac{10}{\sqrt{4}}\right) = 100 \pm 2 \times 5 = 100 \pm 10\]

Lower bound: 90, Upper bound: 110