Statistical Analysis

Lecture 2: Levels of Data

Bogdan G. Popescu

bogdan.popescu@johncabot.edu

John Cabot University

Why Variables Matter

Political science research is built on data

Testing theories requires measuring concepts
- Democracy, inequality, conflict

Measurements become variables

Variable type determines statistical methods

From Concepts to Data

Start with a concept (e.g., democracy)

Translate into measurable attributes
- Electoral competition, press freedom

Collect data: scores, categories, numbers

Result: variables for systematic analysis

Data and Variables

Stevens’ Measurement Levels

Four levels of measurement (Stevens, 1946):

Nominal: unordered categories
- e.g., party affiliation, eye color

Ordinal: ordered, but distances unknown
- e.g., Poor < Fair < Good

Interval: equal distances, no true zero
- e.g., temperature in Celsius, SAT scores

Ratio: equal distances with a true zero
- e.g., weight in grams, years of education
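One way to see why the level matters is to check which summaries make sense at each level. The sketch below uses small made-up lists (all values and variable names are illustrative assumptions, not course data) and Python's standard `statistics` module.

```python
from statistics import mean, mode

# Hypothetical toy data, one list per measurement level.
party = ["Left", "Right", "Left", "Center", "Left"]   # nominal
rating = ["Poor", "Good", "Fair", "Good", "Good"]     # ordinal
temp_c = [10.0, 20.0, 30.0]                           # interval
weight_g = [200.0, 400.0, 600.0]                      # ratio

# Nominal: categories can only be counted, so report the mode.
party_mode = mode(party)

# Ordinal: categories can be ranked, so the median category is meaningful.
order = ["Poor", "Fair", "Good"]
ranks = sorted(order.index(r) for r in rating)
rating_median = order[ranks[len(ranks) // 2]]

# Interval: differences are meaningful, so the mean is meaningful too.
temp_mean = mean(temp_c)

# Ratio: a true zero makes "twice as heavy" a meaningful statement.
weight_ratio = weight_g[1] / weight_g[0]

print(party_mode, rating_median, temp_mean, weight_ratio)
```

Note that the ratio computed for weight would not be meaningful for Celsius temperatures: 20 degrees is not "twice as hot" as 10 degrees, because zero Celsius is not an absence of temperature.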

Discrete vs. Continuous

A separate distinction based on possible values:

Discrete: countable values, often whole numbers
- e.g., number of household members

Continuous: any value in a range
- e.g., height, weight, wages

Both discrete and continuous variables can appear at different Stevens’ levels

Variable Taxonomy

%%{init:{"flowchart":{"useMaxWidth":true,"nodeSpacing":40,"rankSpacing":40},"themeVariables":{"fontSize":"20px"},"width":1150,"height":650}}%%
flowchart TB
  A[Variables] --> B[By Measurement Level<br/>Stevens 1946]
  A --> C[By Value Type]
  B --> D[Nominal<br/>e.g. color]
  B --> E[Ordinal<br/>e.g. rank]
  B --> F[Interval<br/>e.g. SAT]
  B --> G[Ratio<br/>e.g. weight]
  C --> H[Discrete<br/>e.g. count]
  C --> I[Continuous<br/>e.g. height]

Cross-Sectional Data

One observation per unit at a single point in time

Data for variable \(X\) measured in \(N\) countries:

\[\{X_1, X_2, X_3, \ldots, X_N\} = \{X_i\}_{i=1,\ldots,N}\]

Cross-Sectional Data: Indexing

Cross-Sectional Data: Notation

\[\{X_1, X_2, X_3, \ldots, X_N\} = \{X_i\}_{i=1,\ldots,N}\]

\[\{45.38, 68.29, 57.53, \ldots, 77.05\} = \{X_i\}_{i=1,\ldots,N}\]
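The index notation maps directly onto a list. The sketch below is a minimal illustration that stores only the first three values shown above (the elided ones are left out); the helper name `x_at` is a hypothetical choice. The one thing to watch is that the notation counts from 1, while Python lists count from 0.

```python
# First three values from the notation above; the elided ones are omitted.
x = [45.38, 68.29, 57.53]
N = len(x)  # number of units in this shortened example


def x_at(i: int) -> float:
    """Return X_i using the 1-based index from the notation."""
    if not 1 <= i <= N:
        raise IndexError(f"i must be between 1 and {N}")
    return x[i - 1]  # shift to Python's 0-based indexing


first = x_at(1)  # X_1
last = x_at(N)   # X_N
print(first, last)
```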

Two Variables in 2D

Measuring two attributes per unit
- Each observation becomes a point in 2D

Example: urbanization and life expectancy
- Urbanization = 66.4, Life expectancy = 59.75

Single datum: \(\mathbf{X} = [66.4,\ 59.75]\)

Scatterplot: One Country

Figure 1

Scatterplot: Adding Countries

Figure 2

Scatterplot: All Countries

Figure 3

Evaluating Empirical Propositions

Scientists use statistics to test theories against data

Theories produce hypotheses
- Falsifiable claims about the world

Hypotheses link variables:
- Dependent variable (Y): the outcome to explain
- Independent variable (X): the explanatory factor

Example: Democratization Hypothesis

Hypothesis: More economic development \(\rightarrow\) higher democracy

X (independent): economic development
Y (dependent): level of democracy

To test: collect data on X and Y

Units of analysis: countries, individuals, firms

Datasets

Collected data stored in a spreadsheet

Tabular format = a dataset

Rows = observations (units)
Columns = variables (attributes)
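The rows-and-columns layout can be sketched in plain Python as a list of dictionaries. Only the first row reuses the urbanization and life expectancy values from the earlier example; the country labels and the other two rows are made up for illustration.

```python
# Each dict is one row (an observation); each key is one column (a variable).
rows = [
    {"country": "A", "urbanization": 66.4, "life_expectancy": 59.75},
    {"country": "B", "urbanization": 80.0, "life_expectancy": 75.0},  # hypothetical
    {"country": "C", "urbanization": 30.0, "life_expectancy": 55.0},  # hypothetical
]

# A column is the same key read across all rows.
urbanization = [r["urbanization"] for r in rows]

# A row holds every variable for a single unit.
first_obs = rows[0]

n_obs = len(rows)      # number of observations
n_vars = len(rows[0])  # number of variables, including the country label
print(n_obs, n_vars, urbanization)
```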

A Dataset

A Dataset: Variables Highlighted

Interpreting the Relationship

X and Y appear to be positively related

Not all high-X observations have high Y

Not all low-X observations have low Y

We fit a line of best fit to summarize the overall trend
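A line of best fit is commonly obtained by least squares. The sketch below assumes that method and uses four made-up points; the function name `fit_line` is a hypothetical choice. It computes the slope as the sum of cross-deviations divided by the sum of squared deviations in X.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: Y tends to rise with X, but not point by point.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]

intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

A positive slope summarizes the overall tendency even though individual observations sit above or below the line.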

A Dataset: Trend Line

Three Dimensions

Three variables \(\rightarrow\) 3D scatter cloud

e.g., life expectancy, urbanization, education

Time-Series Data

A series of data points indexed in time order

Time series of length \(T\):

\[\{X_1, X_2, X_3, \ldots, X_T\} = \{X_t\}_{t=1,\ldots,T}\]

Natural temporal ordering
Time index as an implicit second attribute
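Because the order carries information, a time series is usually sorted by its time index before anything else is computed. The sketch below uses made-up yearly values stored as (year, value) pairs and computes year-to-year changes, which only make sense once the observations are in time order.

```python
# Hypothetical yearly observations, deliberately stored out of order.
obs = [(2002, 73.0), (2000, 70.0), (2001, 71.0)]

# Sorting tuples orders them by their first element, the year.
series = sorted(obs)
years = [t for t, _ in series]
values = [x for _, x in series]

# Change between consecutive periods: X_t minus X_{t-1}.
changes = [curr - prev for prev, curr in zip(values, values[1:])]
print(years, changes)
```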

Time-Series: Indexing

Time-Series vs. Cross-Section

Time-Series as 2D Scatter

Time is a measured variable

Each data point has two variables:
- Variable of interest + date

Depicted as a 2D scatter plot

USA: Life Expectancy Over Time

Figure 4

Panel Data: USA and Italy

Figure 5

Balanced Panel: Three Countries

Figure 6

Unbalanced Panel: Four Countries

Figure 7

Balanced vs. Unbalanced Panels

Balanced Panel

All units observed in all periods
Complete rectangular dataset
e.g., USA + Italy + Afghanistan

Unbalanced Panel

Some units have missing periods
Gaps in the data
e.g., Iran missing 1960–1975
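Whether a panel is balanced can be checked by counting: a balanced panel has exactly one observation for every unit-period combination. The sketch below assumes the panel is given as (unit, period) pairs; the function name `is_balanced` and the three sample years are illustrative choices, not taken from the figures.

```python
def is_balanced(panel):
    """Return True if every unit is observed in every period.

    `panel` is an iterable of (unit, period) pairs.
    """
    pairs = set(panel)
    units = {u for u, _ in pairs}
    periods = {t for _, t in pairs}
    # A complete rectangle has units x periods distinct pairs.
    return len(pairs) == len(units) * len(periods)


# Three countries observed in the same three (hypothetical) years.
balanced = [
    (u, t)
    for u in ("USA", "Italy", "Afghanistan")
    for t in (1960, 1970, 1980)
]

# Adding Iran with its early years missing leaves gaps in the rectangle.
unbalanced = balanced + [("Iran", 1980)]

print(is_balanced(balanced), is_balanced(unbalanced))
```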

Conclusion

What We Learned

Variables represent concepts in measurable form

Stevens’ levels: nominal, ordinal, interval, ratio

Value types: discrete vs. continuous

Data structures: cross-sectional, time-series, panel

Panels can be balanced or unbalanced

Variable type and data structure determine which statistical tools are appropriate