Lecture 8: Revision for Midterm
Types of jobs where these skills are valuable:
An independent variable (\(X\)) influences or causes variation in another variable
A dependent variable (\(Y\)) depends upon or is caused by the independent variable
Example: Education \(\rightarrow\) Income
A causal relationship requires three elements:
Population: complete enumeration of some set of interest
Sampling: the selection of a subset of individuals from within a population
A representative sample reflects the population’s characteristics
Reliability — the extent to which a measurement procedure yields the same result in repeated trials.
Validity — the degree of correspondence between the measure and the concept it is supposed to measure.
Categorical
Numerical
Mode: most frequently occurring value
Median: middle value when data is ordered
Mean: arithmetic average — \(\bar{X} = \frac{\sum X_i}{n}\)
Range: maximum \(-\) minimum
Variance: \(s^2 = \frac{\sum(X_i - \bar{X})^2}{n-1}\)
Standard Deviation: \(s = \sqrt{s^2}\)
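As a quick check of the definitions above, here is a minimal Python sketch using only the standard-library `statistics` module. The data set is borrowed from the mode question later in these notes, and the variable names are illustrative.

```python
import statistics

data = [4, 5, 6, 10, 6]  # small illustrative sample

mode = statistics.mode(data)          # most frequently occurring value
median = statistics.median(data)      # middle value of the ordered data
mean = statistics.mean(data)          # sum of the values divided by n
value_range = max(data) - min(data)   # maximum minus minimum
variance = statistics.variance(data)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)      # square root of the sample variance

print(mode, median, mean, value_range, variance, round(std_dev, 3))
```

Note that `statistics.variance` and `statistics.stdev` use the \(n-1\) denominator shown in the formulas above; `statistics.pvariance` would divide by \(n\) instead.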
\(\displaystyle\sum_{i=1}^{n} X_i\) means sum all values from \(i=1\) to \(n\)
Example: \(\displaystyle\sum_{n=2}^{5} n^2 = 2^2 + 3^2 + 4^2 + 5^2 = 4 + 9 + 16 + 25 = 54\)
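The summation example can be reproduced in one line of Python. This is only a sketch of how the notation maps to a loop: the index runs over every integer from the lower to the upper limit, and each term is added to a running total.

```python
# Sum of n^2 for n = 2, 3, 4, 5 (range's upper bound is exclusive, hence 6)
total = sum(n**2 for n in range(2, 6))
print(total)  # 54
```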
1. You have the following set: \([3, 4, 7]\). Calculate: Mode, Median, Mean, Standard Deviation, Variance
2. Calculate: \(\displaystyle\sum_{n=2}^{5} n^2\)
Addition rule (mutually exclusive): \(P(A \text{ or } B) = P(A) + P(B)\)
Multiplication rule (independent events): \(P(A \text{ and } B) = P(A) \times P(B)\)
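Both rules can be checked by counting outcomes directly. The sketch below enumerates the faces of a fair die and compares the counted probability with the value given by each rule; the helper `prob` is a hypothetical name introduced only for this illustration.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)  # a fair six-sided die

def prob(event, outcomes):
    """Share of equally likely outcomes for which `event` is true."""
    outcomes = list(outcomes)
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

# Addition rule: rolling a 4, a 5, or a 6 are mutually exclusive events
p_4_5_6 = prob(lambda face: face in (4, 5, 6), faces)
assert p_4_5_6 == Fraction(1, 6) + Fraction(1, 6) + Fraction(1, 6)

# Multiplication rule: two dice are independent, so P(3 and 3) = 1/6 * 1/6
p_double_three = prob(lambda roll: roll == (3, 3), product(faces, repeat=2))
assert p_double_three == Fraction(1, 6) * Fraction(1, 6)

print(p_4_5_6, p_double_three)  # 1/2 1/36
```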
Discrete: countable values (e.g. number of people in a household)
Continuous: any value in a range (e.g. height, weight, wages)
Sample means from repeated samples form an approximately normal distribution when the sample size is large enough, regardless of the population’s shape
Larger samples \(\rightarrow\) narrower distribution of sample means
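A small simulation can illustrate the second point. The sketch below assumes a skewed (exponential) population, chosen only for illustration, draws many samples of two different sizes, and compares how spread out the sample means are. The seed, sample sizes, and number of replicates are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample_means(n, replicates=2000):
    """Means of `replicates` samples of size n from a skewed population."""
    return [
        statistics.mean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(replicates)
    ]

spread_small = statistics.stdev(sample_means(5))   # samples of size 5
spread_large = statistics.stdev(sample_means(50))  # samples of size 50

# The larger samples give a narrower distribution of sample means
print(round(spread_small, 3), round(spread_large, 3))
```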
Shows the probability of each discrete outcome: \(P(X = x)\)
All probabilities must sum to 1
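As an illustration, a probability mass function can be built by dividing each outcome's count by the total number of observations. The sketch below uses the six grades from the practice question later in these notes.

```python
from collections import Counter
from fractions import Fraction

grades = [60, 60, 60, 80, 90, 90]  # six students

counts = Counter(grades)
# P(X = x) is the count of x divided by the number of observations
pmf = {value: Fraction(count, len(grades)) for value, count in counts.items()}

assert sum(pmf.values()) == 1  # all probabilities sum to 1
print(pmf[80])  # 1/6
```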
Symmetric, bell-shaped; characterized by \(\mu\) (mean) and \(\sigma\) (standard deviation)
68-95-99.7 rule: 68.3% within \(\mu \pm 1\sigma\), 95.4% within \(\mu \pm 2\sigma\), 99.7% within \(\mu \pm 3\sigma\)
Standard normal: \(\mu = 0\), \(\sigma = 1\)
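The 68-95-99.7 rule can be checked numerically with the standard library's `NormalDist` class. This is a minimal sketch: it takes the standard normal distribution and measures the share of probability within 1, 2, and 3 standard deviations of the mean.

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)  # standard normal distribution

# Probability of falling within k standard deviations of the mean
shares = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} sd: {share:.1%}")
```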
Gives \(P(X \leq x)\) — the probability of falling at or below a threshold
Never decreases; rises from 0 to 1
\(\bar{X}\) is a sample statistic (estimate)
\(\mu\) is a population parameter (true value)
We use sample statistics to make inferences about population parameters
One-tailed: tests a specific direction (\(H_1\): \(\mu > k\) or \(\mu < k\))
Two-tailed: tests whether a value differs in either direction (\(H_1\): \(\mu \neq k\))
Identify \(H_0\) and \(H_1\); state whether you reject or don’t reject \(H_0\)
P-value: probability of observing results at least as extreme as ours, assuming \(H_0\) is true
If \(p < 0.05\): reject \(H_0\)
If \(p \geq 0.05\): do not reject \(H_0\)
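To make the decision rule concrete, the sketch below converts a z statistic into a p-value for one-tailed and two-tailed alternatives and then applies the 0.05 threshold. The function names and the value \(z = 1.8\) are hypothetical, chosen to show that the same statistic can lead to different decisions depending on the alternative hypothesis.

```python
from statistics import NormalDist

def p_value(z, alternative="two-sided"):
    """P-value of a z statistic under the standard normal distribution."""
    cdf = NormalDist().cdf
    if alternative == "greater":   # H1: mu > k
        return 1 - cdf(z)
    if alternative == "less":      # H1: mu < k
        return cdf(z)
    return 2 * (1 - cdf(abs(z)))   # H1: mu != k

def decide(p, alpha=0.05):
    """Apply the decision rule: reject H0 only when p < alpha."""
    return "reject H0" if p < alpha else "do not reject H0"

z = 1.8  # hypothetical test statistic
print(decide(p_value(z, "two-sided")))  # do not reject H0
print(decide(p_value(z, "greater")))    # reject H0
```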
Z-test: population \(\sigma\) is known and \(n > 30\)
T-test: population \(\sigma\) is unknown; uses sample \(s\) instead
T-distribution has heavier tails for small samples; converges to normal as \(n\) grows
\[CI = \bar{X} \pm t \times \frac{s}{\sqrt{n}}\]
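The formula translates directly into a short function. In this sketch the multiplier `t` is supplied by the caller (in practice it is read from a t-table with \(n-1\) degrees of freedom); the function name is illustrative, and the numbers are those of the worked confidence-interval question later in these notes, which uses \(t = 2\).

```python
import math

def confidence_interval(mean, s, n, t):
    """Return (lower, upper) for mean +/- t * s / sqrt(n)."""
    margin = t * s / math.sqrt(n)  # t times the standard error
    return mean - margin, mean + margin

lower, upper = confidence_interval(mean=100, s=10, n=4, t=2)
print(lower, upper)  # 90.0 110.0
```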
Type I error (false positive): rejecting \(H_0\) when it is actually true
Type II error (false negative): retaining \(H_0\) when it is actually false
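A simulation can show what a Type I error rate means in practice. The sketch below assumes a population where \(H_0\) is true (\(\mu = 0\), known \(\sigma = 1\)), runs many two-tailed z-tests at the 0.05 level, and counts how often \(H_0\) is wrongly rejected. The seed, sample size, and number of trials are arbitrary choices for illustration.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(1)  # fixed seed so the run is repeatable
ALPHA = 0.05
N = 40         # sample size per test
TRIALS = 4000  # number of simulated studies
SIGMA = 1.0    # known population SD, so a z-test applies

def two_sided_p(sample, mu0=0.0):
    """Two-tailed p-value of a one-sample z-test of H0: mu = mu0."""
    sample_mean = sum(sample) / len(sample)
    z = (sample_mean - mu0) / (SIGMA / math.sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))

# H0 is true in every trial, so each rejection is a false positive
false_positives = sum(
    two_sided_p([rng.gauss(0.0, SIGMA) for _ in range(N)]) < ALPHA
    for _ in range(TRIALS)
)
type_i_rate = false_positives / TRIALS
print(type_i_rate)  # close to ALPHA
```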
The Exam in relation to Grading:
Problem Sets (Reminder)
Average of:
Peer-grading: evaluating your colleagues’ participation in the group. (5%)
This is a short exam: one hour and 15 minutes
You will have twenty short questions. You have to answer all the questions.
You can get partial credit if you attempt to answer the question.
No calculators or books are allowed.
16–17. See the following output:
Two-sample z-Test
data: final_latam$urbanization and final_eu$urbanization
z = -1.9176, p-value = 0.05516
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-16.4191204 0.1792105
sample estimates:
mean of x mean of y
59.01841 67.13836
1. What are causal relationships?
Causal relationships are relationships that entail three elements:
2. What is a population?
Population — complete enumeration of some set of interest.
2a. What is reliability in an experimental setting?
Reliability — the extent to which an experiment or measurement procedure yields the same result on repeated trials.
3. Provide an example of a continuous variable
Height, weight, wage
4. Calculate the mode for the following set: (4, 5, 6, 10, 6)
Mode is 6
5a. Calculate the median for the following set: (7, 3, 1, 2, 5, 6, 8)
We first order the set: 1, 2, 3, 5, 6, 7, 8. The set has an odd number of values, so the median is the middle value: \(\textbf{5}\)
5b. Calculate the median for the following set: (7, 3, 1, 2, 5, 6)
We first order the set: 1, 2, 3, 5, 6, 7. The set has an even number of values, so the median is the average of the two middle values: \((3+5)/2 = \textbf{4}\)
6a. Calculate the mean for the following set: (1, 2, 3, 4, 5)
\((1+2+3+4+5)/5 = \textbf{3}\)
6b. Calculate the mean for the following set: (3, 3, 4, 4, 5, 5)
\((3+3+4+4+5+5)/6 = \textbf{4}\)
7. Calculate the range of the following set: (6, 7, 8, 9, 10)
Range: \(10 - 6 = \textbf{4}\)
8. Calculate the following: \(\sum_{i=2}^{3} i = 2 + 3 = \textbf{5}\)
9. Calculate the following: \(\sum_{n=1}^{3} n^3 = 1^3 + 2^3 + 3^3 = 1 + 8 + 27 = \textbf{36}\)
10. Calculate the variance for the following set: (2, 3, 4, 5)
Mean \(= (2+3+4+5)/4 = 3.5\)
\(s_x^2 = \frac{(2-3.5)^2 + (3-3.5)^2 + (4-3.5)^2 + (5-3.5)^2}{4-1} = \frac{2.25 + 0.25 + 0.25 + 2.25}{3} = \textbf{5/3}\)
11. What is the probability of rolling a die and obtaining an outcome of either a 4, 5, or 6?
\(1/6 + 1/6 + 1/6 = 3/6 = \textbf{1/2}\)
12. What is the probability that if you throw two dice, both dice will show a 3?
\(1/6 \times 1/6 = \textbf{1/36}\)
13. Suppose you have 6 people in the class with the following grades: Grade 60 occurs 3 times, Grade 80 occurs 1 time, Grade 90 occurs 2 times. What is \(P(X=80)\)?
\(P(X=80) = \textbf{1/6}\)
14. What is a standard normal distribution?
It is a normal distribution with mean \(\mu = 0\) and standard deviation \(\sigma = 1\)
15. When do we use a two-tailed z-test?
When we want to test whether something differs from a value in either direction (\(H_1\): \(\mu \neq k\)), we use a two-tailed test.
16. What is the \(H_0\)? What is the \(H_1\)?
The \(H_0\) is that the true difference in means is equal to 0. The \(H_1\) is that the true difference in means is not equal to 0.
17. Do you reject the \(H_0\)? Why?
No, because the p-value (0.05516) is larger than 0.05.
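The numbers in the printed output can be cross-checked against each other. The sketch below copies the values from the output and confirms three things: the two-tailed p-value follows from the z statistic, the confidence interval is centred on the difference in sample means, and the interval contains 0, which agrees with not rejecting \(H_0\) at the 5% level. Variable names are illustrative.

```python
from statistics import NormalDist

# Values copied from the test output
mean_x, mean_y = 59.01841, 67.13836
z_stat = -1.9176
ci_low, ci_high = -16.4191204, 0.1792105

diff = mean_x - mean_y                        # estimated difference in means
p_value = 2 * NormalDist().cdf(-abs(z_stat))  # two-tailed p-value from z
ci_midpoint = (ci_low + ci_high) / 2          # centre of the 95% interval
zero_in_ci = ci_low <= 0 <= ci_high           # does the interval cover 0?

print(round(diff, 5), round(p_value, 4), round(ci_midpoint, 5), zero_in_ci)
```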
18. What is a Type II error?
When researchers retain a null hypothesis that is actually untrue.
19. When is a researcher supposed to use a z-test vs. a t-test?
A z-test is used when the population variance and SD are known and the sample size is larger than 30. A t-test is used when the population SD is unknown and the sample SD \(s\) is used instead.
20. Calculate the lower and upper bounds of the confidence interval:
Given: \(\bar{X} = 100\), \(s_x = 10\), \(n = 4\)
We apply the confidence interval formula, taking \(t = 2\):
\[CI = \bar{X} \pm t \times \left(\frac{s_x}{\sqrt{n}}\right) = 100 \pm 2 \times \left(\frac{10}{\sqrt{4}}\right) = 100 \pm 2 \times 5 = 100 \pm 10\]
Lower bound: 90, Upper bound: 110
Popescu (JCU) Statistical Analysis Lecture 8: Revision for Midterm