Intro to Statistics

Variables, Samples, Population, Data

Bogdan G. Popescu

bogdan.popescu@johncabot.edu

John Cabot University

Variables

How are two or more variables related?

An independent variable is thought to influence or cause variation in another variable

A dependent variable depends upon or is caused by variation in the independent variable

Examples:
Independent Variable → Dependent Variable

Education → Income

Associations

Two variables are associated if knowing the value of one of them will help to predict the value of the other.

Example:
Life Expectancy and Urbanization

Life Expectancy and Urbanization

Life Expectancy

Indicator of countries’ overall health, physical well-being

Of interest to many, including health researchers, economists, sociologists, anthropologists…

We will examine UN data on average life expectancy for 214 countries.

We want to know whether urbanization has a positive or negative relationship with life expectancy.

Life Expectancy and Urbanization

The Data

Life Expectancy vs Urbanization

A row for each country

Countries are subjects, cases, units, or elements in the data set

Two columns for life expectancy and urbanization

These are variables or characteristics varying among units

Life Expectancy and Urbanization

What explains variation in life expectancy?

What characteristics do countries with longer life expectancy have in common?

What characteristics do countries with shorter life expectancy have in common?

Normally, we would be examining the relationship between different types of variables and life expectancy:
- income
- urbanization
- education

Correlates of Life Expectancy

A scatterplot of life expectancy by urbanization

Scatterplot 1

Associations vs. Causal Relationships

If two variables are associated, knowing the value of one helps predict the value of the other

In this example, we would predict:
- A middle-income country would have a longer life expectancy than a low-income country
- A country with more urbanization would have longer life expectancy than one with less
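One common way to put a number on an association is the Pearson correlation coefficient. The slides do not name a specific measure, so this choice is an assumption, and the values below are made up for illustration (they are not the UN data). A minimal sketch:

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for five made-up countries (NOT the UN data).
urbanization = [20.0, 35.0, 50.0, 65.0, 80.0]
life_expectancy = [55.0, 60.0, 66.0, 70.0, 78.0]

r = pearson_r(urbanization, life_expectancy)
print(f"r = {r:.3f}")  # a value close to +1 indicates a strong positive association
```

A coefficient near +1 or -1 means that knowing one variable helps a lot in predicting the other; a coefficient near 0 means it helps little. It says nothing on its own about causation.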

Causal Relationships

A causal relationship entails three elements:

The independent variable (X) and the dependent variable (Y) covary

The change in X precedes the change in Y

The covariation between X and Y is not coincidental or spurious

Causal relationships can be stipulated in hypotheses.

Hypotheses

Relationships between variables can be stated in hypotheses.

A hypothesis is an explicit statement about the relationship between phenomena that formalizes the researcher’s informed guess.

Characteristics of Good Hypotheses

Empirical statements that formulate educated guesses

Logical reason to think data can confirm hypotheses

Indicate direction of the relationship

Terms must match testing methods

Data should be feasible to obtain

Must specify unit of analysis (individuals, orgs, states, etc.)

Examples of Hypotheses

People have political viewpoints similar to their parents.
Democracies are more likely to engage in trade with one another.
Authoritarian regimes are more likely to violate human rights.
Countries where property rights are protected tend to have higher levels of development.

Concepts

Definitions of concepts should be:

clear
accurate
precise
informative

Concepts should strike a balance between the specific and the abstract.

Populations vs. Samples

Population – complete enumeration of some set of interest

To learn about the population, a sample is often studied

Sampling is the process of selecting a subset from the population

Sampling is used to estimate characteristics of the full population

Aim: Ensure sample is representative

Requirement: Know your population

Dominant approach: probability sampling

Populations vs. Samples

Representative sample – if the sampling were repeated many times, the samples’ features would match those of the population on average

Probability sampling reduces sample selection bias and makes the sample representative in expectation
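The "on average" idea can be checked with a small simulation. This is a sketch under stated assumptions: the population is synthetic (normally distributed values invented for illustration), and simple random sampling stands in for probability sampling in general.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 values (made up for illustration).
population = [random.gauss(70, 8) for _ in range(10_000)]
population_mean = statistics.mean(population)


def sample_mean(pop, n):
    """Mean of a simple random sample of size n drawn without replacement."""
    return statistics.mean(random.sample(pop, n))


# Repeat the sampling many times and average the resulting sample means.
sample_means = [sample_mean(population, 100) for _ in range(1_000)]
average_of_sample_means = statistics.mean(sample_means)

print(f"population mean:         {population_mean:.2f}")
print(f"average of sample means: {average_of_sample_means:.2f}")
```

Any single sample mean can miss the population mean, but the average across repeated samples lands very close to it.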

Data and Variables – Basics

Categorical

Binary: e.g., 0 = unemployed, 1 = employed
Nominal: Order does not matter (e.g., 0 = Green, 1 = Red, 2 = Blue)
Ordinal: Order is meaningful (e.g., 0 = Poor, 1 = Fair, 2 = Good)

Data and Variables – Basics

Numerical

Discrete: e.g., number of individuals in a household
Continuous: e.g., height, weight, wages
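The variable type determines which summaries make sense. The sketch below uses a few invented survey records (all field names and values are hypothetical) to show one sensible summary per type.

```python
from collections import Counter
from statistics import mean

# Ordinal scale: the position in this list encodes the rank.
HEALTH_LEVELS = ["Poor", "Fair", "Good"]

# Hypothetical records; every name and value here is made up.
records = [
    {"employed": 1, "color": "Green", "health": "Fair", "household_size": 3, "height_cm": 172.5},
    {"employed": 0, "color": "Red", "health": "Poor", "household_size": 1, "height_cm": 160.2},
    {"employed": 1, "color": "Green", "health": "Good", "household_size": 4, "height_cm": 181.0},
]

# Binary: the mean of a 0/1 variable is a proportion.
employment_rate = mean(r["employed"] for r in records)

# Nominal: categories can be counted but not ranked, so report the mode.
most_common_color = Counter(r["color"] for r in records).most_common(1)[0][0]

# Ordinal: categories can be ranked, so a maximum is meaningful.
best_health = max((r["health"] for r in records), key=HEALTH_LEVELS.index)

# Discrete and continuous: arithmetic such as the mean is meaningful.
mean_household_size = mean(r["household_size"] for r in records)
mean_height_cm = mean(r["height_cm"] for r in records)

print(employment_rate, most_common_color, best_health)
print(mean_household_size, mean_height_cm)
```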

Cross-Sectional Data

Cross-sectional datasets have one observation per unit
Data for one variable (attribute) measured in N countries is written as:

\[ \{X_1, X_2, X_3, \dots, X_N\} = \{X_i\}_{i=1,\dots,N} \]

Cross-Sectional Data

Example values for one variable (e.g., life expectancy):

\[ \{45.38333, 68.28611, 57.53013, \dots, 77.04861\} = \{X_i\}_{i=1,\dots,N} \]

Cross-Sectional Data

If we measure two attributes, we can represent them as a point in 2D space
A single data point is a vector in two dimensions

Example:
- Life expectancy = 59.75
- Level of urbanization = 66.4

Then the data point, written in the order (urbanization, life expectancy), is:

\[ X = [66.4, 59.75] \]
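In code, such a data point can be stored as a small named record so that the coordinate order stays explicit. The class name `CountryPoint` is hypothetical; the two values are the ones from the example above.

```python
from typing import NamedTuple


class CountryPoint(NamedTuple):
    """One country's observation: a point in 2D space."""
    urbanization: float      # first coordinate (x-axis)
    life_expectancy: float   # second coordinate (y-axis)


x = CountryPoint(urbanization=66.4, life_expectancy=59.75)

print(tuple(x))            # the point as a plain 2D vector
print(x.life_expectancy)   # access a coordinate by name rather than position
```

A cross-sectional dataset with two variables is then simply a list of such points, one per country.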

Evaluation of Empirical Propositions

Social scientists use statistical analyses to test theories by means of carefully constructed hypotheses.

Hypotheses are falsifiable claims about the world.

Hypotheses connect dependent variables to independent variables.

- Dependent variables: outcomes or things we want to explain
- Independent variables: factors that help explain the dependent variable

Example

Hypothesis:
An increase in X (independent variable) leads to an increase in Y (dependent variable).

Democratization Hypothesis:
More economic development is associated with higher levels of democracy.

To test this, we collect data on X and Y.

Units of analysis are the entities where our theory applies (e.g., countries, individuals, firms).

Datasets

When we collect the data, we input it into a spreadsheet, a tabular format.

This becomes a dataset.

A Dataset

In this example, there appears to be a positive relationship between X and Y.

- Not all high-X observations have high Y
- Not all low-X observations have low Y

To evaluate the relationship, we fit a line that best approximates the pattern in the data.
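A standard criterion for "best approximates" is ordinary least squares, which picks the line minimizing the sum of squared vertical distances to the points. The slides do not state which criterion is used, so least squares is an assumption here, and the data are invented for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: a positive but imperfect relationship.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.5, 3.5, 3.0, 5.0]

intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

A positive slope summarizes the overall upward pattern even though individual observations fall above or below the line.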

Cross-Sectional Data

Each country’s data is a point in a scatter plot.

If we measure three variables (e.g., life expectancy, urbanization, education),
we get a 3D point cloud.

Time-Series Data

A time series of length T is written as:

\[ \{X_1, X_2, X_3, \dots, X_T\} = \{X_t\}_{t=1,\dots,T} \]

A time series is a sequence of data points indexed in time order
It has a natural temporal ordering
Time acts as the second attribute, alongside the measured value

Time-Series Data

A time series can be depicted as a 2D scatterplot.

Time is one variable, and the value of interest is another.

So, each point in the time series is a pair: (time, value).
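The (time, value) representation translates directly into code. The yearly figures below are hypothetical and belong to no real country.

```python
# Hypothetical yearly values for one made-up unit.
years = [2000, 2001, 2002, 2003]
values = [61.2, 61.9, 62.5, 63.0]

# Each observation is a (time, value) pair.
series = list(zip(years, values))

# Unlike a cross-section, order matters: sort by time before comparing neighbours.
series.sort(key=lambda pair: pair[0])

# Period-to-period change, labelled with the later time point.
changes = [
    (t1, round(v1 - v0, 2))
    for (t0, v0), (t1, v1) in zip(series, series[1:])
]

print(series)
print(changes)
```

Shuffling a cross-section loses nothing, whereas shuffling a time series destroys the information that the changes are computed from.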

Time-Series and Cross-Section Data

Panel data are a cross-section of time series: several units, each observed over multiple time periods.

Time-Series and Cross-Section Data

Balanced Panel = Every unit is observed in every time period. No missing time points for any unit.

Time-Series and Cross-Section Data

Unbalanced Panel = Some units are missing observations for some time periods.
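Whether a panel is balanced can be checked by counting: a balanced panel has exactly (number of units) × (number of periods) observations. A minimal sketch, assuming each observation is recorded as a (unit, period) pair; the function name and the example data are hypothetical.

```python
def is_balanced(observations):
    """True if every unit appears in every time period.

    `observations` is an iterable of (unit, period) pairs, one per
    observation that is actually present in the dataset.
    """
    observed = set(observations)
    units = {unit for unit, _ in observed}
    periods = {period for _, period in observed}
    # Balanced means no (unit, period) combination is missing.
    return len(observed) == len(units) * len(periods)


# Hypothetical panels with made-up unit labels.
balanced = [("A", 2000), ("A", 2001), ("B", 2000), ("B", 2001)]
unbalanced = [("A", 2000), ("A", 2001), ("B", 2000)]  # B is missing 2001

print(is_balanced(balanced))
print(is_balanced(unbalanced))
```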

Conclusion

Measurement quality depends on accuracy and precision
Reliability: can we replicate results?
Validity: does the measure reflect the concept?
Variables can be categorical or numerical
Data can be cross-sectional, time-series, or both (panel data)
Panel data can be balanced or unbalanced