Statistical Analysis

Lecture 2: Levels of Data

Bogdan G. Popescu

John Cabot University

Why Variables Matter

  • Political science research is built on data
  • Testing theories requires measuring concepts
    • Democracy, inequality, conflict
  • Measurements become variables
  • Variable type determines statistical methods

From Concepts to Data

  • Start with a concept (e.g., democracy)
  • Translate into measurable attributes
    • Electoral competition, press freedom
  • Collect data: scores, categories, numbers
  • Result: variables for systematic analysis

Data and Variables

Stevens’ Measurement Levels

Four levels of measurement (Stevens, 1946):

  • Nominal: unordered categories
    • e.g., party affiliation, eye color
  • Ordinal: ordered, but distances unknown
    • e.g., Poor < Fair < Good
  • Interval: equal distances, no true zero
    • e.g., temperature in Celsius, SAT scores
  • Ratio: equal distances with a true zero
    • e.g., weight in grams, years of education
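
A minimal Python sketch of which summaries are meaningful at each level. The response lists are hypothetical, and the mapping of levels to operations follows the usual textbook reading of Stevens rather than anything stated on the slides.

```python
from statistics import median_low, mode

# Nominal: categories can only be counted, so the mode is the natural summary.
party = ["Left", "Right", "Left", "Center", "Left"]   # hypothetical responses
most_common_party = mode(party)

# Ordinal: order is meaningful, so a median category exists,
# but distances between categories are unknown.
scale = ["Poor", "Fair", "Good"]                      # Poor < Fair < Good
ratings = ["Good", "Poor", "Fair", "Good", "Fair"]    # hypothetical responses
median_rating = scale[median_low(scale.index(r) for r in ratings)]

# Interval: differences are meaningful, ratios are not.
celsius = [10.0, 20.0]
difference = celsius[1] - celsius[0]                  # a 10-degree gap
kelvin = [c + 273.15 for c in celsius]
ratio_in_kelvin = kelvin[1] / kelvin[0]               # close to 1, not 2

# Ratio: a true zero makes statements like "twice as heavy" meaningful.
grams = [250.0, 500.0]
weight_ratio = grams[1] / grams[0]

print(most_common_party, median_rating, difference, weight_ratio)
```

The Celsius lines show why 20 degrees is not "twice as hot" as 10 degrees: the ratio changes once the zero point moves.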

Discrete vs. Continuous

A separate distinction based on possible values:

  • Discrete: countable, separate values (often counts)
    • e.g., number of household members
  • Continuous: any value in a range
    • e.g., height, weight, wages

Both discrete and continuous variables can appear at different Stevens’ levels
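
A small sketch that cross-classifies some of the examples above on both dimensions. The `Variable` class and the classification of each example are illustrative assumptions, not definitions from the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    name: str
    level: str        # Stevens level: nominal, ordinal, interval, ratio
    value_type: str   # "discrete" or "continuous"


# Classifications below are the editor's reading of the slide examples.
variables = [
    Variable("party affiliation", "nominal", "discrete"),
    Variable("number of household members", "ratio", "discrete"),
    Variable("temperature in Celsius", "interval", "continuous"),
    Variable("weight in grams", "ratio", "continuous"),
]


def levels_for(value_type):
    """Return the Stevens levels that occur for a given value type."""
    return sorted({v.level for v in variables if v.value_type == value_type})


discrete_levels = levels_for("discrete")
continuous_levels = levels_for("continuous")
print(discrete_levels, continuous_levels)
```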

Variable Taxonomy

%%{init:{"flowchart":{"useMaxWidth":true,"nodeSpacing":40,"rankSpacing":40},"themeVariables":{"fontSize":"20px"},"width":1150,"height":650}}%%
flowchart TB
  A[Variables] --> B[By Measurement Level<br/>Stevens 1946]
  A --> C[By Value Type]
  B --> D[Nominal<br/>e.g. color]
  B --> E[Ordinal<br/>e.g. rank]
  B --> F[Interval<br/>e.g. SAT]
  B --> G[Ratio<br/>e.g. weight]
  C --> H[Discrete<br/>e.g. count]
  C --> I[Continuous<br/>e.g. height]

Cross-Sectional Data

  • One observation per unit at a single point in time
  • Data for variable \(X\) measured in \(N\) countries:

\[\{X_1, X_2, X_3, \ldots, X_N\} = \{X_i\}_{i=1,\ldots,N}\]

Cross-Sectional Data: Indexing

Cross-Sectional Data: Notation

\[\{X_1, X_2, X_3, \ldots, X_N\} = \{X_i\}_{i=1,\ldots,N}\]

\[\{45.38, 68.29, 57.53, \ldots, 77.05\} = \{X_i\}_{i=1,\ldots,N}\]
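
The index notation can be sketched in Python as follows. The sketch uses only the four values printed above and treats them as a toy sample with N = 4; the elided values are not reconstructed. The helper `X` is a hypothetical name.

```python
# The four values shown on the slide, used here as a toy sample.
x = [45.38, 68.29, 57.53, 77.05]
N = len(x)


def X(i):
    """Return X_i using the 1-based index from the notation."""
    if not 1 <= i <= N:
        raise IndexError(f"i must be between 1 and {N}")
    return x[i - 1]   # Python lists are 0-based


first = X(1)
last = X(N)
print(first, last)
```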

Two Variables in 2D

  • Measuring two attributes per unit
    • Each observation becomes a point in 2D
  • Example: urbanization and life expectancy
    • Urbanization = 66.4, Life expectancy = 59.75
  • Single datum: \(\mathbf{X} = [66.4,\ 59.75]\)

Scatterplot: One Country

Figure 1

Scatterplot: Adding Countries

Figure 2

Scatterplot: All Countries

Figure 3

Evaluating Empirical Propositions

  • Scientists use statistics to test theories
  • Theories produce hypotheses
    • Falsifiable claims about the world
  • Hypotheses link variables:
    • Dependent variable (Y): the outcome to explain
    • Independent variable (X): the explanatory factor

Example: Democratization Hypothesis

  • Hypothesis: More economic development \(\rightarrow\) higher level of democracy
  • X (independent): economic development
  • Y (dependent): level of democracy
  • To test: collect data on X and Y
  • Units of analysis: countries, individuals, firms

Datasets

  • Collected data stored in a spreadsheet
  • Tabular format = a dataset
  • Rows = observations (units)
  • Columns = variables (attributes)
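
A minimal sketch of the rows-and-columns idea using plain Python. The first row reuses the urbanization and life expectancy values from the earlier example; the country labels and the other two rows are hypothetical.

```python
# Each dict is one row (observation); each key is one column (variable).
dataset = [
    {"country": "A", "urbanization": 66.4, "life_expectancy": 59.75},
    {"country": "B", "urbanization": 40.0, "life_expectancy": 52.0},   # hypothetical
    {"country": "C", "urbanization": 80.0, "life_expectancy": 71.0},   # hypothetical
]

n_observations = len(dataset)                  # number of rows
variable_names = list(dataset[0].keys())       # column names

first_observation = dataset[0]                               # one row
urbanization = [row["urbanization"] for row in dataset]      # one column

print(n_observations, variable_names)
```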

A Dataset

A Dataset: Variables Highlighted

Interpreting the Relationship

  • A positive relationship between X and Y appears in the data
  • Not all high-X observations have high Y
  • Not all low-X observations have low Y
  • We fit a line of best fit to summarize the overall trend
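
One common way to obtain a line of best fit is ordinary least squares. The sketch below assumes that is the method intended; `fit_line` is a hypothetical helper and the data are made up.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: the trend is positive, but not every
# higher-X observation has a higher Y (compare x = 1 and x = 2).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

A positive slope summarizes the overall pattern even though individual observations deviate from it.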

A Dataset: Trend Line

Three Dimensions

  • Three variables \(\rightarrow\) 3D scatter cloud
  • e.g., life expectancy, urbanization, education

Time-Series Data

  • A series of data points indexed in time order
  • Time series of length \(T\):

\[\{X_1, X_2, X_3, \ldots, X_T\} = \{X_t\}_{t=1,\ldots,T}\]

  • Natural temporal ordering
  • Time index as an implicit second attribute
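
A short sketch of why ordering matters for a time series. The years and values are hypothetical; the data are stored as (time, value) pairs so the time index is explicit.

```python
# Hypothetical (year, value) pairs, deliberately out of order.
observations = [(1962, 70.1), (1960, 69.8), (1961, 70.3)]

# Sorting by the time index restores the natural temporal ordering.
series = sorted(observations)
T = len(series)

years = [t for t, _ in series]
values = [x for _, x in series]

# Period-to-period changes only make sense once the series is ordered.
changes = [values[k] - values[k - 1] for k in range(1, T)]

print(years, changes)
```

Shuffling a cross-section loses nothing, whereas shuffling a time series changes what the differences mean.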

Time-Series: Indexing

Time-Series vs. Cross-Section

Time-Series as 2D Scatter

  • Time is a measured variable
  • Each data point has two variables:
    • Variable of interest + date
  • Depicted as a 2D scatter plot

USA: Life Expectancy Over Time

Figure 4

Panel Data: USA and Italy

Figure 5

Balanced Panel: Three Countries

Figure 6

Unbalanced Panel: Four Countries

Figure 7

Balanced vs. Unbalanced Panels

Balanced Panel

  • All units observed in all periods
  • Complete rectangular dataset
  • e.g., USA + Italy + Afghanistan

Unbalanced Panel

  • Some units have missing periods
  • Gaps in the data
  • e.g., Iran missing 1960–1975
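
A minimal sketch of a balance check, treating a panel as a set of (unit, period) pairs. The function names are hypothetical, and the end year 1980 is an assumption; only the Iran gap of 1960 to 1975 comes from the example above.

```python
def is_balanced(panel):
    """True if every unit is observed in every period."""
    observed = set(panel)
    units = {unit for unit, _ in observed}
    periods = {period for _, period in observed}
    return len(observed) == len(units) * len(periods)


def missing_cells(panel):
    """Return the sorted (unit, period) pairs absent from the panel."""
    observed = set(panel)
    units = {unit for unit, _ in observed}
    periods = {period for _, period in observed}
    return sorted((u, p) for u in units for p in periods
                  if (u, p) not in observed)


years = range(1960, 1981)   # assumed span, for illustration only
balanced = [(c, y) for c in ("USA", "Italy", "Afghanistan") for y in years]

# Iran enters only after 1975, so the panel is no longer rectangular.
unbalanced = balanced + [("Iran", y) for y in years if y > 1975]

gaps = missing_cells(unbalanced)
print(is_balanced(balanced), is_balanced(unbalanced), len(gaps))
```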

Conclusion

What We Learned

  • Variables represent concepts in measurable form
  • Stevens’ levels: nominal, ordinal, interval, ratio
  • Value types: discrete vs. continuous
  • Data structures: cross-sectional, time-series, panel
  • Panels can be balanced or unbalanced
  • Variable type and data structure determine statistical tools