Intro to Statistics
Variables, Samples, Population, Data
Bogdan G. Popescu
John Cabot University
Variables
How are two or more variables related?
- An independent variable is thought to influence or cause variation in another variable
- A dependent variable depends upon or is caused by variation in the independent variable
Examples:
Independent Variable → Dependent Variable
Associations
Two variables are associated if knowing the value of one of them will help to predict the value of the other.
Example:
Life Expectancy and Urbanization
Life Expectancy and Urbanization
Life Expectancy
- Indicator of countries’ overall health, physical well-being
- Of interest to many, including health researchers, economists, sociologists, anthropologists…
We will examine UN data on average life expectancy for 214 countries.
We want to know whether urbanization has a positive or negative relationship with life expectancy.
Life Expectancy and Urbanization
The Data
Life Expectancy vs Urbanization
- Countries are subjects, cases, units, or elements in the data set
- Two columns for life expectancy and urbanization
- These are variables or characteristics varying among units
Life Expectancy and Urbanization
- What explains variation in life expectancy?
- What characteristics do countries with longer life expectancy have in common?
- What characteristics do countries with shorter life expectancy have in common?
- Normally, we would be examining the relationship between different types of variables and life expectancy:
- income
- urbanization
- education
Correlates of Life Expectancy
Scatterplot 1: life expectancy by urbanization
Associations vs. Causal Relationships
- If two variables are associated, knowing the value of one helps predict the value of the other
- In this example, we would predict:
- A middle-income country would have a longer life expectancy than a low-income country
- A country with more urbanization would have longer life expectancy than one with less
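One common way to quantify an association between two numerical variables is the Pearson correlation coefficient. The sketch below is illustrative only: the helper name `pearson_r` and the five (urbanization, life expectancy) pairs are invented for the example and are not taken from the UN data.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values, invented for illustration (not the UN data).
urbanization = [20.0, 35.0, 50.0, 65.0, 80.0]      # percent urban
life_expectancy = [55.0, 60.0, 66.0, 71.0, 78.0]   # years

r = pearson_r(urbanization, life_expectancy)
print(f"r = {r:.3f}")  # a value near +1 indicates a strong positive association
```

A coefficient near +1 or -1 means that knowing one variable helps a lot in predicting the other; a coefficient near 0 means it helps little. It says nothing by itself about causation.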
Causal Relationships
A causal relationship entails three elements:
- The independent (X) and dependent variables (Y) covary
- The change in X precedes the change in Y
- The covariation between X and Y is not coincidental or spurious
Causal relationships can be stipulated in hypotheses.
Hypotheses
Relationships between variables can be stated in hypotheses.
A hypothesis is an explicit statement about the relationship between phenomena that formalizes the researcher’s informed guess.
Characteristics of Good Hypotheses
- Empirical statements that formulate educated guesses
- Logical reason to think data can confirm hypotheses
- Indicate direction of the relationship
- Terms must match testing methods
- Data should be feasible to obtain
- Must specify unit of analysis (individuals, orgs, states, etc.)
Examples of Hypotheses
- People have political viewpoints similar to their parents.
- Democracies are more likely to engage in trade with one another.
- Authoritarian regimes are more likely to violate human rights.
- Countries where property rights are protected tend to have higher levels of development.
Concepts
Definitions of concepts should be:
- clear
- accurate
- precise
- informative
Concepts should strike a balance between the specific and the abstract.
Populations vs. Samples
Population – complete enumeration of some set of interest
To learn about the population, a sample is often studied
Sampling is the process of selecting a subset from the population
Sampling is used to estimate characteristics of the full population
Aim: Ensure sample is representative
Requirement: Know your population
Dominant approach: probability sampling
Populations vs. Samples
Representative sample – If sampling were repeated many times, the samples’ features would match those of the population on average
Probability sampling reduces sample selection bias and makes the sample representative in this on-average sense
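The "on average" idea can be checked with a small simulation. This is a sketch under stated assumptions: the population is a made-up list of the integers 1 to 1000, the sampling scheme is simple random sampling without replacement, and the helper `sample_mean` is a hypothetical name introduced here.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 units with values 1..1000.
population = list(range(1, 1001))
population_mean = sum(population) / len(population)

def sample_mean(pop, n):
    """Mean of a simple random sample of size n drawn without replacement."""
    sample = random.sample(pop, n)
    return sum(sample) / n

# Repeat the sampling many times and average the sample means.
sample_means = [sample_mean(population, 50) for _ in range(2000)]
average_of_means = sum(sample_means) / len(sample_means)

print(f"population mean:         {population_mean:.1f}")
print(f"average of sample means: {average_of_means:.1f}")
```

Any single sample mean can miss the population mean by a fair margin, but the average across repeated samples should land close to it.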
Data and Variables – Basics
Categorical
- Binary: e.g., 0 = unemployed, 1 = employed
- Nominal: Order does not matter (e.g., 0 = Green, 1 = Red, 2 = Blue)
- Ordinal: Order is meaningful (e.g., 0 = Poor, 1 = Fair, 2 = Good)
Data and Variables – Basics
Numerical
- Discrete: e.g., number of individuals in a household
- Continuous: e.g., height, weight, wages
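The variable type determines which operations are meaningful. The sketch below uses hypothetical codings in the spirit of the examples above; all dictionary names and data values are invented for illustration.

```python
# Hypothetical codings, in the spirit of the examples above.
EMPLOYMENT = {0: "unemployed", 1: "employed"}   # binary
COLOR = {0: "Green", 1: "Red", 2: "Blue"}        # nominal: codes are labels only
RATING = {0: "Poor", 1: "Fair", 2: "Good"}       # ordinal: codes carry an order

ratings = [2, 0, 1, 2, 1]            # ordinal responses (coded)
household_sizes = [1, 4, 2, 3, 2]    # discrete numerical
heights_cm = [162.5, 180.1, 171.3]   # continuous numerical

# Ordinal: comparing categories ("highest rating") is meaningful.
highest_rating = RATING[max(ratings)]

# Numerical: arithmetic such as a mean is meaningful.
mean_household = sum(household_sizes) / len(household_sizes)

# Nominal: counting categories makes sense; ordering or averaging codes does not.
colors = [0, 2, 2, 1, 2]
counts = {COLOR[c]: colors.count(c) for c in COLOR}

print(highest_rating, mean_household, counts)
```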
Cross-Sectional Data
- Cross-sectional datasets have one observation per unit
- Data for one variable (attribute) measured in N countries is written as:
\[
\{X_1, X_2, X_3, \dots, X_N\} = \{X_i\}_{i=1,\dots,N}
\]
- Example values for one variable (e.g., life expectancy):
\[
\{45.38333, 68.28611, 57.53013, \dots, 77.04861\} = \{X_i\}_{i=1,\dots,N}
\]
Cross-Sectional Data
- If we measure two attributes, we can represent them as a point in 2D space
- A single data point is a vector in two dimensions
Example:
- Life expectancy = 59.75
- Level of urbanization = 66.4
Then the data point, written as (urbanization, life expectancy), is:
\[
X = [66.4, 59.75]
\]
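In code, a one-variable cross-section is simply a sequence indexed by unit, and a unit measured on two attributes is a pair. The sketch below uses only the values shown in the examples above; the values elided in the source are left out rather than guessed, and the variable names are assumptions made for illustration.

```python
# Life expectancy values shown in the example above; the values the
# source elides are not filled in here.
life_expectancy = [45.38333, 68.28611, 57.53013, 77.04861]

# X_i is the observation for unit i (1-based in the notation, 0-based in Python).
for i, x in enumerate(life_expectancy, start=1):
    print(f"X_{i} = {x}")

# One unit measured on two attributes is a point in 2D:
# (urbanization, life expectancy), using the example values above.
point = (66.4, 59.75)
urbanization, life_exp = point
print(f"urbanization = {urbanization}, life expectancy = {life_exp}")
```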
Evaluation of Empirical Propositions
Social scientists use statistical analyses to test theories through carefully thought-out hypotheses.
Hypotheses are falsifiable claims about the world.
Hypotheses connect dependent variables to independent variables.
- Dependent variables: outcomes or things we want to explain
- Independent variables: factors that help explain the dependent variable
Example
Hypothesis:
An increase in X (independent variable) leads to an increase in Y (dependent variable).
Democratization Hypothesis:
More economic development is associated with higher levels of democracy.
To test this, we collect data on X and Y.
Units of analysis are the entities to which our theory applies (e.g., countries, individuals, firms).
Datasets
When we collect the data, we enter it into a spreadsheet, that is, a tabular format with one row per unit and one column per variable.
This becomes a dataset.
A Dataset
In this example, there appears to be a positive relationship between X and Y.
- Not all high-X observations have high Y
- Not all low-X observations have low Y
To evaluate the relationship, we fit a line that best approximates the pattern in the data.
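The slides do not name the fitting method; a common choice is ordinary least squares (OLS), which is assumed here. The sketch below computes the OLS intercept and slope by hand. The function name `fit_line` and the data are invented for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data, invented for illustration: a positive but imperfect pattern.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.5, 3.5, 3.0, 5.0]

a, b = fit_line(x, y)
print(f"fitted line: y = {a:.2f} + {b:.2f} * x")
```

A positive slope summarizes the overall upward pattern even though individual observations deviate from the line.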
Cross-Sectional Data
Each country’s data is a point in a scatter plot.
If we measure three variables (e.g., life expectancy, urbanization, education), each country is a point in 3D space, and together the countries form a 3D point cloud.
Time-Series Data
- A time series of length T is written as:
\[
\{X_1, X_2, X_3, \dots, X_T\} = \{X_t\}_{t=1,\dots,T}
\]
- A time series is a sequence of data points indexed in time order
- It has a natural temporal ordering
- Time acts as a second attribute alongside the measured value
Time-Series Data vs. Cross-Section
Time-Series Data
A time series can be depicted as a 2D scatter plot.
Time is one variable, and the value of interest is another.
So, each point in the time series is a pair: (time, value).
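The (time, value) pairing can be sketched as follows. The series values are hypothetical and invented for illustration; the point is only that each observation carries a time index and that the ordering matters.

```python
# Hypothetical series, invented for illustration: one value per time period.
values = [61.2, 61.8, 62.5, 63.1, 63.4]  # X_1, ..., X_T

T = len(values)

# Pair each value with its time index t = 1, ..., T.
series = list(enumerate(values, start=1))  # [(1, 61.2), (2, 61.8), ...]

# The order matters: compute the change between consecutive periods.
changes = [round(later - earlier, 1) for earlier, later in zip(values, values[1:])]

for t, x in series:
    print(f"t = {t}, X_t = {x}")
print("period-to-period changes:", changes)
```

Shuffling a cross-section loses nothing, but shuffling a time series destroys the period-to-period changes computed here.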
Time-Series and Cross-Section Data
Combining the two gives a cross-section of time series, often called panel data: several units, each observed over several time periods.
Time-Series and Cross-Section Data
Balanced Panel = Every unit is observed in every time period. No missing time points for any unit.
Time-Series and Cross-Section Data
Unbalanced Panel = Some units are missing observations for some time periods.
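Whether a panel is balanced can be checked mechanically. In this sketch a panel is represented as a list of (unit, period) pairs, one per observation, and the set of time periods is taken to be every period in which at least one unit is observed. Both the representation and the function name `is_balanced` are assumptions made for illustration.

```python
def is_balanced(panel):
    """Return True if every unit is observed in every time period.

    `panel` is a list of (unit, period) pairs, one per observation.
    """
    units = {unit for unit, _ in panel}
    periods = {period for _, period in panel}
    observed = set(panel)
    return all((u, t) in observed for u in units for t in periods)

# Hypothetical panels, invented for illustration.
balanced = [("A", 2000), ("A", 2001), ("B", 2000), ("B", 2001)]
unbalanced = [("A", 2000), ("A", 2001), ("B", 2000)]  # B is missing 2001

print(is_balanced(balanced))    # True
print(is_balanced(unbalanced))  # False
```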