Intro to Statistics
Variables, Samples, Population, Data
Bogdan G. Popescu
John Cabot University
Variables
How are two or more variables related?
- An independent variable is thought to influence or cause variation in another variable
- A dependent variable depends upon or is caused by variation in the independent variable
Examples:
Independent Variable → Dependent Variable
Associations
Two variables are associated if knowing the value of one of them will help to predict the value of the other.
Example:
Life Expectancy and Urbanization
Life Expectancy
- Indicator of countries’ overall health, physical well-being
- Of interest to many, including health researchers, economists, sociologists, anthropologists…
We will examine UN data on average life expectancy for 214 countries.
We want to know whether urbanization has a positive or negative relationship with life expectancy.
The Data
[Image: Life Expectancy vs Urbanization]
- Countries are subjects, cases, units, or elements in the data set
- Two columns for life expectancy and urbanization
- These are variables or characteristics varying among units
- What explains variation in life expectancy?
- What characteristics do countries with longer life expectancy have in common?
- What characteristics do countries with shorter life expectancy have in common?
- Normally, we would examine the relationship between life expectancy and several kinds of variables:
  - income
  - urbanization
  - education
Correlates of Life Expectancy
A scatterplot of life expectancy by urbanization
[Image: Scatterplot 1]
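As a rough illustration of what such a scatterplot summarizes, the sketch below computes a Pearson correlation coefficient by hand. The six data points and the helper name `pearson_r` are made up for illustration; they are not the UN data used in these slides.

```python
import math

# Hypothetical values for six made-up countries (NOT the UN data).
urbanization = [20.0, 35.0, 50.0, 65.0, 80.0, 90.0]     # % of population living in urban areas
life_expectancy = [58.0, 63.0, 68.0, 72.0, 77.0, 80.0]  # years

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(urbanization, life_expectancy)
print(f"r = {r:.3f}")  # a value near +1 indicates a strong positive association
```

A positive value means that, in these made-up data, more urbanized countries tend to have longer life expectancy. It says nothing about whether urbanization causes longer lives.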
Associations vs. Causal Relationships
- If two variables are associated, knowing the value of one helps predict the value of the other
- In this example, we would predict:
  - A middle-income country would have a longer life expectancy than a low-income country
  - A country with more urbanization would have longer life expectancy than one with less
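One simple way to see how an association helps prediction is to compare a single overall guess with guesses made separately for less and more urbanized countries. The sketch below assumes made-up data and an arbitrary `cutoff`; neither comes from the UN data set.

```python
from statistics import mean

# Hypothetical (urbanization %, life expectancy in years) pairs; not the UN data.
countries = [
    (20.0, 58.0), (35.0, 63.0), (50.0, 68.0),
    (65.0, 72.0), (80.0, 77.0), (90.0, 80.0),
]

cutoff = 57.5  # arbitrary illustrative threshold between "less" and "more" urbanized

less_urban = [life for urban, life in countries if urban < cutoff]
more_urban = [life for urban, life in countries if urban >= cutoff]

# Best single guess if we know nothing about urbanization
overall_guess = mean(life for _, life in countries)
# Guesses that use the urbanization group
less_guess = mean(less_urban)
more_guess = mean(more_urban)

print(f"overall: {overall_guess:.1f}, less urban: {less_guess:.1f}, more urban: {more_guess:.1f}")
```

Because the group guesses differ from the overall guess, knowing a country's urbanization level improves the prediction in this toy example.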
Causal Relationships
A causal relationship entails three elements:
- The independent variable (X) and the dependent variable (Y) covary
- The change in X precedes the change in Y
- The covariation between X and Y is not coincidental or spurious
Causal relationships can be stipulated in hypotheses.
Hypotheses
Relationships between variables can be stated in hypotheses.
A hypothesis is an explicit statement about the relationship between phenomena that formalizes the researcher’s informed guess.
Characteristics of Good Hypotheses
- Empirical statements that formulate educated guesses
- Logical reason to think data can confirm hypotheses
- Indicate direction of the relationship
- Terms must be consistent with the methods used to test them
- Data should be feasible to obtain
- Must specify unit of analysis (individuals, organizations, states, etc.)
Examples of Hypotheses
People tend to adopt political viewpoints similar to their parents.
Democracies are more likely to engage in trade with one another.
Authoritarian regimes are more likely to violate human rights.
Countries where property rights are protected tend to have higher levels of development.
Concepts
Definitions of concepts should be:
- clear
- accurate
- precise
- informative
Concepts should strike a balance between the specific and the abstract.
Populations vs. Samples
Population – complete enumeration of some set of interest
To learn about the population, a sample is often studied
Sampling is the process of selecting a subset from the population
Sampling is used to estimate characteristics of the full population
Aim: Ensure sample is representative
Requirement: Know your population
Dominant approach: probability sampling
Representative sample – if the sampling were repeated many times, the samples’ features would match those of the population on average
Probability sampling reduces sample selection bias and makes samples representative on average
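The idea of matching the population "on average" can be sketched with a small simulation. The population below is simulated, and the sample size, number of repetitions, and seed are arbitrary choices made for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated values (arbitrary units).
population = [rng.gauss(50, 10) for _ in range(10_000)]
population_mean = mean(population)

# Probability sampling: 500 simple random samples of size 100, drawn without replacement.
sample_means = [mean(rng.sample(population, 100)) for _ in range(500)]
average_of_sample_means = mean(sample_means)

# A non-probability sample for contrast: only the 100 largest values.
biased_sample_mean = mean(sorted(population)[-100:])

print(f"population mean:           {population_mean:.2f}")
print(f"average of random samples: {average_of_sample_means:.2f}")
print(f"biased sample mean:        {biased_sample_mean:.2f}")
```

Individual random samples still differ from the population, but their means center on the population mean, while the selectively chosen sample is far off.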
Data and Variables – Basics
Categorical
- Binary: e.g., 0 = unemployed, 1 = employed
- Nominal: Order does not matter (e.g., 0 = Green, 1 = Red, 2 = Blue)
- Ordinal: Order is meaningful (e.g., 0 = Poor, 1 = Fair, 2 = Good)
Numerical
- Discrete: e.g., number of individuals in a household
- Continuous: e.g., height, weight, wages
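The variable types above call for different summaries. The sketch below uses five made-up records; the field names and values are hypothetical and only illustrate which summary suits which type.

```python
from collections import Counter
from statistics import mean, median

# Five hypothetical people; every value is invented for illustration.
people = [
    {"employed": 1, "color": "Green", "health": 0, "household_size": 3, "height_cm": 171.5},
    {"employed": 0, "color": "Red",   "health": 2, "household_size": 1, "height_cm": 160.2},
    {"employed": 1, "color": "Green", "health": 1, "household_size": 4, "height_cm": 182.0},
    {"employed": 1, "color": "Blue",  "health": 2, "household_size": 2, "height_cm": 175.3},
    {"employed": 0, "color": "Red",   "health": 1, "household_size": 5, "height_cm": 168.9},
]
HEALTH_LABELS = {0: "Poor", 1: "Fair", 2: "Good"}  # ordinal codes: the order is meaningful

# Binary: the mean of a 0/1 variable is the share of 1s.
share_employed = mean(p["employed"] for p in people)

# Nominal: only counts per category make sense; there is no order.
color_counts = Counter(p["color"] for p in people)

# Ordinal: the median is meaningful because the categories are ordered.
median_health = median(p["health"] for p in people)

# Discrete and continuous: arithmetic summaries such as the mean apply.
mean_household = mean(p["household_size"] for p in people)
mean_height = mean(p["height_cm"] for p in people)

print(share_employed, dict(color_counts), HEALTH_LABELS[median_health])
print(mean_household, round(mean_height, 1))
```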
Cross-Sectional Data
- Cross-sectional datasets have one observation per unit
- Data for one variable (attribute) measured in N countries is written as:
\[
\{X_1, X_2, X_3, \dots, X_N\} = \{X_i\}_{i=1,\dots,N}
\]
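The notation above can be mirrored in code as one value per unit. In this sketch the unit names, the values, and the helper `x` are hypothetical; note that the notation indexes from 1 while Python lists index from 0.

```python
# Hypothetical life-expectancy values for N = 4 made-up countries.
units = ["Country A", "Country B", "Country C", "Country D"]
X = [71.2, 64.8, 80.1, 75.5]
N = len(X)

# Cross-sectional data: each unit appears exactly once.
assert len(set(units)) == len(units) == N

def x(i):
    """Return X_i using the 1-based index of the notation above."""
    if not 1 <= i <= N:
        raise IndexError(f"i must be between 1 and {N}")
    return X[i - 1]

for i in range(1, N + 1):
    print(f"X_{i} ({units[i - 1]}) = {x(i)}")
```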