Intro to Statistics

Variables, Samples, Population, Data

Bogdan G. Popescu

John Cabot University

Variables

How are two or more variables related?

  • An independent variable is thought to influence or cause variation in another variable
  • A dependent variable depends upon or is caused by variation in the independent variable

Examples:
Independent Variable → Dependent Variable

Education → Income

Associations

Two variables are associated if knowing the value of one of them will help to predict the value of the other.

Example:
Life Expectancy and Urbanization

Life Expectancy and Urbanization

Life Expectancy

  • Indicator of countries’ overall health, physical well-being
  • Of interest to many, including health researchers, economists, sociologists, anthropologists…

We will examine UN data on average life expectancy for 214 countries.

We want to know whether urbanization has a positive or negative relationship with life expectancy.

Life Expectancy and Urbanization

The Data

Life Expectancy vs Urbanization

  • A row for each country
  • Countries are subjects, cases, units, or elements in the data set
  • Two columns for life expectancy and urbanization
  • These are variables or characteristics varying among units
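The layout described above can be sketched in Python. This is a minimal illustration only: the country labels and all values are invented placeholders, not the UN figures, and the names `countries`, `life_expectancy`, and `urbanization` are assumptions made for the example.

```python
# Hypothetical rows; the values are invented for illustration, not UN data.
countries = [
    {"country": "A", "life_expectancy": 62.0, "urbanization": 35.0},
    {"country": "B", "life_expectancy": 71.5, "urbanization": 58.0},
    {"country": "C", "life_expectancy": 80.1, "urbanization": 82.0},
]

# Each row is one unit (a country).
n_units = len(countries)

# Each column is one variable: collect its values across all units.
life_expectancy = [row["life_expectancy"] for row in countries]
urbanization = [row["urbanization"] for row in countries]

print(n_units)
print(life_expectancy)
```

Reading across a row gives everything recorded for one unit; reading down a column gives one variable as it varies among units.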

Life Expectancy and Urbanization

  • What explains variation in life expectancy?
  • What characteristics do countries with longer life expectancy have in common?
  • What characteristics do countries with shorter life expectancy have in common?
  • Normally, we would be examining the relationship between different types of variables and life expectancy:
    • income
    • urbanization
    • education

Correlates of Life Expectancy

Scatterplot 1: life expectancy by urbanization

Associations vs. Causal Relationships

  • If two variables are associated, knowing the value of one helps predict the value of the other
  • In this example, we would predict:
    • A middle-income country would have a longer life expectancy than a low-income country
    • A country with more urbanization would have longer life expectancy than one with less
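One common way to quantify an association between two numerical variables is the Pearson correlation coefficient. The sketch below is an assumption-laden illustration: the five data points are invented (not the UN data), and Pearson's r is just one possible measure, capturing only linear association.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Invented values for five hypothetical countries (not UN data).
urbanization = [20.0, 35.0, 50.0, 65.0, 80.0]      # percent urban
life_expectancy = [58.0, 63.0, 70.0, 72.0, 79.0]   # years

r = pearson_r(urbanization, life_expectancy)
print(round(r, 3))
```

A value of r close to +1 means that, in these made-up data, knowing a country's urbanization helps predict its life expectancy. It says nothing yet about whether one causes the other.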

Causal Relationships

A causal relationship entails three elements:

  • The independent variable (X) and the dependent variable (Y) covary
  • The change in X precedes the change in Y
  • The covariation between X and Y is not coincidental or spurious
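The third condition can be illustrated with a small simulation. This is a sketch under stated assumptions, not a general test for causality: a hypothetical third variable `z` is constructed to drive both `x` and `y`, while `x` has no effect on `y` at all. All data are simulated.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

random.seed(0)  # fixed seed so the run is reproducible

n = 5000
z = [random.gauss(0, 1) for _ in range(n)]   # hypothetical third variable
x = [zi + random.gauss(0, 1) for zi in z]    # X depends on Z only
y = [zi + random.gauss(0, 1) for zi in z]    # Y depends on Z only, not on X

# X and Y covary even though neither causes the other.
r_xy = pearson_r(x, y)

# Subtract the part driven by Z (its effect is 1 by construction here),
# then correlate what is left over.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
r_resid = pearson_r(x_resid, y_resid)

print(round(r_xy, 2), round(r_resid, 2))
```

In this setup X and Y are clearly correlated, but the correlation all but vanishes once the shared influence of Z is removed. The covariation is spurious, so the third condition fails.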

Causal relationships can be stipulated in hypotheses.

Hypotheses

Relationships between variables can be stated in hypotheses.

A hypothesis is an explicit statement about the relationship between phenomena that formalizes the researcher’s informed guess.

Characteristics of Good Hypotheses

  • Empirical statements that formulate educated guesses
  • Logical reason to think data can confirm hypotheses
  • Indicate direction of the relationship
  • Terms must match testing methods
  • Data should be feasible to obtain
  • Must specify unit of analysis (individuals, organizations, states, etc.)

Examples of Hypotheses

People tend to adopt political viewpoints similar to their parents.
Democracies are more likely to engage in trade with one another.
Authoritarian regimes are more likely to violate human rights.
Countries where property rights are protected tend to have higher levels of development.

Concepts

Definitions of concepts should be:

  • clear
  • accurate
  • precise
  • informative

Concepts should strike a balance between the specific and the abstract.

Populations vs. Samples

Population – the complete set of units of interest

To learn about the population, a sample is often studied

Sampling is the process of selecting a subset from the population

Sampling is used to estimate characteristics of the full population

Aim: Ensure sample is representative

Requirement: Know your population

Dominant approach: probability sampling

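Representative sample – if the sampling were repeated many times, the samples’ features would, on average, match those of the population

Probability sampling reduces sample selection bias and makes samples representative on average; any single sample can still differ from the population by chance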
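The "on average" idea can be sketched with repeated simple random sampling, one form of probability sampling in which every unit has the same chance of selection. The population below is simulated, and its size, mean, and spread are arbitrary assumptions chosen for illustration.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is reproducible

# Hypothetical population of 10,000 values; simulated, not real data.
population = [random.gauss(50, 10) for _ in range(10_000)]
population_mean = mean(population)

# Draw 1,000 simple random samples of 100 units each (without replacement)
# and record the mean of each sample.
sample_means = [mean(random.sample(population, 100)) for _ in range(1_000)]

# Individual sample means vary, but their average sits close to the
# population mean.
average_of_sample_means = mean(sample_means)
print(round(population_mean, 2), round(average_of_sample_means, 2))
```

Any single sample mean may miss the population mean by chance; it is the average over repeated samples that lines up with the population value.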

Data and Variables – Basics

Categorical

  • Binary: e.g., 0 = unemployed, 1 = employed
  • Nominal: Order does not matter (e.g., 0 = Green, 1 = Red, 2 = Blue)
  • Ordinal: Order is meaningful (e.g., 0 = Poor, 1 = Fair, 2 = Good)

Numerical

  • Discrete: e.g., number of individuals in a household
  • Continuous: e.g., height, weight, wages
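The variable types above can be sketched in Python. The codebooks and values are hypothetical and chosen only to mirror the slide's examples; for the nominal variable, any distinct codes would do.

```python
# Hypothetical codebooks mapping numeric codes to category labels.
employment = {0: "unemployed", 1: "employed"}   # binary
colour = {0: "Green", 1: "Red", 2: "Blue"}      # nominal: codes are arbitrary labels
rating = {0: "Poor", 1: "Fair", 2: "Good"}      # ordinal: codes carry an order

# Ordinal codes can be compared: a higher code means a better rating.
responses = [2, 0, 1, 2]
best = rating[max(responses)]

# Numerical variables hold quantities directly.
household_size = [1, 4, 2, 3]          # discrete: whole-number counts
height_cm = [162.5, 180.2, 171.0]      # continuous: any value in a range

# Arithmetic such as a mean is meaningful for numerical variables.
mean_household = sum(household_size) / len(household_size)
print(best, mean_household)
```

Averaging the `colour` codes, by contrast, would give a number with no interpretation, because the codes are only labels.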

Cross-Sectional Data

  • Cross-sectional datasets have one observation per unit
  • Data for one variable (attribute) measured in N countries is written as:

\[ \{X_1, X_2, X_3, \dots, X_N\} = \{X_i\}_{i=1,\dots,N} \]
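The notation maps directly onto a list. In this minimal sketch the values are invented and `x_i` is a hypothetical helper; the only subtlety is that the notation counts from 1 while Python lists count from 0.

```python
# Hypothetical cross-section: one value of the variable per country (invented numbers).
X = [62.0, 71.5, 80.1, 68.3]

N = len(X)  # number of units

def x_i(i):
    """Return X_i using the 1-based index from the notation (lists are 0-based)."""
    return X[i - 1]

first, last = x_i(1), x_i(N)
print(N, first, last)
```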
