Intro to Statistics

Variables, Samples, Population, Data

Bogdan G. Popescu

John Cabot University

Variables

How are two or more variables related?

  • An independent variable is thought to influence or cause variation in another variable
  • A dependent variable depends upon or is caused by variation in the independent variable

Examples:
Independent Variable → Dependent Variable

Education → Income

Associations

Two variables are associated if knowing the value of one of them will help to predict the value of the other.

Example:
Life Expectancy and Urbanization

Life Expectancy

  • Indicator of countries’ overall health, physical well-being
  • Of interest to many, including health researchers, economists, sociologists, anthropologists…

We will examine UN data on average life expectancy for 214 countries.

We want to know whether urbanization has a positive or negative relationship with life expectancy.

The Data

Life Expectancy vs Urbanization

  • A row for each country
  • Countries are subjects, cases, units, or elements in the data set
  • Two columns for life expectancy and urbanization
  • These are variables or characteristics varying among units

Life Expectancy and Urbanization

  • What explains variation in life expectancy?
  • What characteristics do countries with longer life expectancy have in common?
  • What characteristics do countries with shorter life expectancy have in common?
  • Normally, we would be examining the relationship between different types of variables and life expectancy:
    • income
    • urbanization
    • education

Correlates of Life Expectancy

Scatterplot 1: life expectancy by urbanization

Associations vs. Causal Relationships

  • If two variables are associated, knowing the value of one helps predict the value of the other
  • In this example, we would predict:
    • A middle-income country would have a longer life expectancy than a low-income country
    • A country with more urbanization would have longer life expectancy than one with less
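One common way to quantify how strongly two numerical variables are associated is the Pearson correlation coefficient. The sketch below is only an illustration: the five data values are hypothetical (they are not the UN data), and the function name pearson_r is an assumption made for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for five countries (made up for illustration)
urbanization = [20.0, 35.0, 50.0, 65.0, 80.0]
life_expectancy = [55.0, 61.0, 64.0, 72.0, 76.0]

r = pearson_r(urbanization, life_expectancy)
print(f"r = {r:.3f}")
```

A value of r close to +1 means that knowing a country's urbanization helps predict its life expectancy. It says nothing, by itself, about whether urbanization causes longer lives.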

Causal Relationships

A causal relationship entails three elements:

  • The independent variable (X) and the dependent variable (Y) covary
  • The change in X precedes the change in Y
  • The covariation between X and Y is not coincidental or spurious

Causal relationships can be stipulated in hypotheses.

Hypotheses

Relationships between variables can be stated in hypotheses.

A hypothesis is an explicit statement about the relationship between phenomena that formalizes the researcher’s informed guess.

Characteristics of Good Hypotheses

  • Empirical statements that formulate educated guesses
  • Logical reason to think data can confirm hypotheses
  • Indicate direction of the relationship
  • Terms must match testing methods
  • Data should be feasible to obtain
  • Must specify the unit of analysis (individuals, organizations, states, etc.)

Examples of Hypotheses

  • People have political viewpoints similar to their parents.
  • Democracies are more likely to engage in trade with one another.
  • Authoritarian regimes are more likely to violate human rights.
  • Countries where property rights are protected tend to have higher levels of development.

Concepts

Definitions of concepts should be:

  • clear
  • accurate
  • precise
  • informative

Concepts should strike a balance between the specific and the abstract.

Populations vs. Samples

Population – complete enumeration of some set of interest

To learn about the population, a sample is often studied

Sampling is the process of selecting a subset from the population

Sampling is used to estimate characteristics of the full population

Aim: Ensure sample is representative

Requirement: Know your population

Dominant approach: probability sampling

Representative sample – If the sampling were repeated many times, the samples’ features would match those of the population on average

Probability sampling reduces sample selection bias and helps make the sample representative
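The idea that a probability sample matches the population "on average" can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the population is 1,000 made-up values, the sampling scheme is simple random sampling without replacement, and all names (population, sample_mean) are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 made-up "life expectancy" values
population = [random.uniform(45.0, 85.0) for _ in range(1000)]
population_mean = statistics.mean(population)

def sample_mean(pop, n):
    """Mean of a simple random sample of size n, drawn without replacement."""
    return statistics.mean(random.sample(pop, n))

# Repeat the sampling many times and average the resulting sample means
sample_means = [sample_mean(population, 50) for _ in range(2000)]
average_of_sample_means = statistics.mean(sample_means)

print(f"population mean:         {population_mean:.2f}")
print(f"average of sample means: {average_of_sample_means:.2f}")
```

Any single sample mean may be off by a point or two, but the average over many repeated samples lands very close to the population mean.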

Data and Variables – Basics

Categorical

  • Binary: e.g., 0 = unemployed, 1 = employed
  • Nominal: Order does not matter (e.g., 0 = Green, 1 = Red, 2 = Blue)
  • Ordinal: Order is meaningful (e.g., 0 = Poor, 1 = Fair, 2 = Good)

Numerical

  • Discrete: e.g., number of individuals in a household
  • Continuous: e.g., height, weight, wages
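The variable types above can be made concrete with a small record type. This is only a sketch: the class and field names (Respondent, favorite_color, and so on) are hypothetical, and the numeric codes are arbitrary labels.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum

class Color(Enum):        # nominal: codes are labels only, with no order
    GREEN = 0
    RED = 1
    BLUE = 2

class Health(IntEnum):    # ordinal: comparing categories is meaningful
    POOR = 0
    FAIR = 1
    GOOD = 2

@dataclass
class Respondent:
    employed: bool          # binary
    favorite_color: Color   # nominal
    health: Health          # ordinal
    household_size: int     # discrete numerical
    height_cm: float        # continuous numerical

person = Respondent(True, Color.RED, Health.FAIR, 4, 172.5)

print(Health.GOOD > Health.FAIR)   # ordering is defined for ordinal values
print(person)
```

Using a plain Enum for the nominal variable means that an ordering comparison such as Color.RED < Color.BLUE raises a TypeError, which mirrors the point that order does not matter for nominal data.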

Cross-Sectional Data

  • Cross-sectional datasets have one observation per unit
  • Data for one variable (attribute) measured in N countries is written as:

\[ \{X_1, X_2, X_3, \dots, X_N\} = \{X_i\}_{i=1,\dots,N} \]

  • Example values for one variable (e.g., life expectancy):

\[ \{45.38333, 68.28611, 57.53013, \dots, 77.04861\} = \{X_i\}_{i=1,\dots,N} \]

  • If we measure two attributes, we can represent them as a point in 2D space
  • A single data point is a vector in two dimensions

Example:
- Life expectancy = 59.75
- Level of urbanization = 66.4

Then the data point, ordered as [urbanization, life expectancy], is:

\[ X = [66.4, 59.75] \]
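In code, a cross-section with two attributes can be stored as one pair per unit. The sketch below is illustrative only: the first point is the one from the example, while the other two points are made up, and the variable names are assumptions.

```python
# One (urbanization, life expectancy) pair per country.
# Only the first pair comes from the example; the others are hypothetical.
points = [
    (66.4, 59.75),
    (30.0, 52.10),
    (85.0, 78.30),
]

N = len(points)                            # number of units
urbanization = [p[0] for p in points]      # first attribute for i = 1..N
life_expectancy = [p[1] for p in points]   # second attribute for i = 1..N

# X_1 in the slide notation corresponds to index 0 in Python
first_point = points[0]
print(N, first_point)
```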

Evaluation of Empirical Propositions

Social scientists use statistical analyses to test theories through carefully thought-out hypotheses.

Hypotheses are falsifiable claims about the world.

Hypotheses connect dependent variables to independent variables.

- Dependent variables: outcomes or things we want to explain
- Independent variables: factors that help explain the dependent variable

Example

Hypothesis:
An increase in X (independent variable) leads to an increase in Y (dependent variable).

Democratization Hypothesis:
More economic development is associated with higher levels of democracy.

To test this, we collect data on X and Y.

Units of analysis are the entities where our theory applies (e.g., countries, individuals, firms).

Datasets

When we collect the data, we input it into a spreadsheet, a tabular format.

This becomes a dataset.

A Dataset

In this example, there appears to be a positive relationship between X and Y.

- Not all high-X observations have high Y
- Not all low-X observations have low Y

To evaluate the relationship, we fit a line that best approximates the pattern in the data.
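One standard way to fit such a line is ordinary least squares, which picks the intercept and slope that minimize the sum of squared vertical distances to the points. The slides do not say which method is used, so treat this as one possible choice. The data values and the function name fit_line are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data with a positive but imperfect relationship:
# the observation at x = 2 has a lower y than the one at x = 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.5, 3.5, 3.0, 5.0]

intercept, slope = fit_line(x, y)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

A positive slope summarizes the overall upward pattern even though individual observations deviate from the line.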

Cross-Sectional Data

Each country’s data is a point in a scatter plot.

If we measure three variables (e.g., life expectancy, urbanization, education), each country becomes a point in 3D space, and the data form a 3D point cloud.

Time-Series Data

  • A time series of length T is written as:

\[ \{X_1, X_2, X_3, \dots, X_T\} = \{X_t\}_{t=1,\dots,T} \]

  • A time series is a sequence of data points indexed in time order
  • It has a natural temporal ordering
  • Time is the second attribute

Time-Series Data

A time series can be depicted as a 2D scatter plot.

Time is one variable, and the value of interest is another.

So, each point in the time series is a pair: (time, value).
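The (time, value) representation can be sketched directly as a list of pairs. The yearly values below are made up for illustration, and the variable names are assumptions.

```python
# Hypothetical yearly series for one country (values are made up)
series = [(2002, 70.9), (2000, 70.1), (2003, 71.3), (2001, 70.4)]

# Time gives the data a natural ordering, so sort by the time index first
series.sort(key=lambda pair: pair[0])

T = len(series)  # length of the time series

# Change between consecutive periods: X_t - X_{t-1}
changes = [round(series[t][1] - series[t - 1][1], 2) for t in range(1, T)]

print(series)
print(changes)
```

Unlike a cross-section, the order of the observations matters here: shuffling the pairs would change the period-to-period differences unless the series is re-sorted by time.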

Time-Series and Cross-Section Data

Time-series cross-section data combine the two: several units, each observed over several time periods. This is also called panel data.

Balanced Panel = Every unit is observed in every time period. No missing time points for any unit.

Unbalanced Panel = Some units are missing observations for some time periods. 
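Whether a panel is balanced can be checked by counting unit-period combinations. This is a minimal sketch: the function name is_balanced and the toy data are hypothetical, and the set of time periods is assumed to be every period in which at least one unit is observed.

```python
def is_balanced(observations):
    """Return True if every unit is observed in every time period.

    observations: iterable of (unit, period) pairs, one per row of the panel.
    """
    observed = set(observations)
    units = {unit for unit, _ in observed}
    periods = {period for _, period in observed}
    # Balanced means all unit-period combinations are present
    return len(observed) == len(units) * len(periods)

# Toy panels with two units (A, B) and two periods (2000, 2001)
balanced = [("A", 2000), ("A", 2001), ("B", 2000), ("B", 2001)]
unbalanced = [("A", 2000), ("A", 2001), ("B", 2000)]  # B is missing 2001

print(is_balanced(balanced))
print(is_balanced(unbalanced))
```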

Time-Series and Cross-Section Data

Balanced Panel = Every unit is observed in every time period. No missing time points for any unit.

Time-Series and Cross-Section Data

Unbalanced Panel = Some units are missing observations for some time periods.

Conclusion

  • Measurement quality depends on accuracy and precision
  • Reliability: can we replicate results?
  • Validity: does the measure reflect the concept?
  • Variables can be categorical or numerical
  • Data can be cross-sectional, time-series, or both (panel data)
  • Panel data can be balanced or unbalanced