Intro to Statistics
  Variables, Samples, Population, Data
Bogdan G. Popescu
John Cabot University
Variables
How are two or more variables related?
- An independent variable is thought to influence or cause variation in another variable
 
 
- A dependent variable depends upon or is caused by variation in the independent variable
 
 
Examples:
Independent Variable → Dependent Variable
 
Associations
Two variables are associated if knowing the value of one of them will help to predict the value of the other.
Example:
Life Expectancy and Urbanization
 
Life Expectancy and Urbanization
Life Expectancy
- Indicator of countries’ overall health, physical well-being
 
 
- Of interest to many, including health researchers, economists, sociologists, anthropologists…
 
 
We will examine UN data on average life expectancy for 214 countries.
 
We want to know whether urbanization has a positive or negative relationship with life expectancy.
 
Life Expectancy and Urbanization
The Data
![]()
Life Expectancy vs Urbanization
- Countries are subjects, cases, units, or elements in the data set
 
 
- Two columns for life expectancy and urbanization
 
 
- These are variables or characteristics varying among units
 
 
Life Expectancy and Urbanization
- What explains variation in life expectancy?
 
 
- What characteristics do countries with longer life expectancy have in common?
 
 
- What characteristics do countries with shorter life expectancy have in common?
 
 
- Normally, we would examine the relationship between life expectancy and several types of variables:
  - income
  - urbanization
  - education
 
 
Correlates of Life Expectancy
![]()
Correlates of Life Expectancy
A scatterplot of life expectancy by urbanization
![]()
Scatterplot 1
Associations vs. Causal Relationships
- If two variables are associated, knowing the value of one helps predict the value of the other
 
 
- In this example, we would predict:
  - A middle-income country would have a longer life expectancy than a low-income country
  - A country with more urbanization would have a longer life expectancy than one with less
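One common way to quantify how strongly two numerical variables are associated is the Pearson correlation coefficient. The sketch below is only an illustration: the five pairs of values are hypothetical and are not taken from the UN data, and the function name `pearson_r` is chosen here for convenience.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values for five made-up countries (NOT the UN data)
urbanization = [20.0, 35.0, 50.0, 65.0, 80.0]
life_expectancy = [55.0, 61.0, 66.0, 72.0, 78.0]

r = pearson_r(urbanization, life_expectancy)
print(f"r = {r:.3f}")
```

A value of r close to +1 indicates a strong positive association and a value close to -1 a strong negative one. On its own, neither tells us whether the relationship is causal.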
 
 
 
Causal Relationships
A causal relationship entails three elements:
- The independent (X) and dependent variables (Y) covary
 
 
- The change in X precedes the change in Y
 
 
- The covariation between X and Y is not coincidental or spurious
 
 
Causal relationships can be stipulated in hypotheses.
 
Hypotheses
Relationships between variables can be stated in hypotheses.
 
A hypothesis is an explicit statement about the relationship between phenomena that formalizes the researcher’s informed guess.
 
Characteristics of Good Hypotheses
- Empirical statements that formulate educated guesses
 
 
- There is a logical reason to expect that data can support or refute them
 
 
- Indicate direction of the relationship
 
 
- Terms must be defined in a way that matches the methods used to test them
 
 
- Data should be feasible to obtain
 
 
- Must specify unit of analysis (individuals, orgs, states, etc.)
 
 
Examples of Hypotheses
- People have political viewpoints similar to their parents.
 
- Democracies are more likely to engage in trade with one another.
 
- Authoritarian regimes are more likely to violate human rights.
 
- Countries where property rights are protected tend to have higher levels of development.
 
Concepts
Definitions of concepts should be:
- clear
 
- accurate
 
- precise
 
- informative
 
Concepts should strike a balance between the specific and the abstract.
 
Populations vs. Samples
Population – complete enumeration of some set of interest
To learn about the population, a sample is often studied
 
Sampling is the process of selecting a subset from the population
 
Sampling is used to estimate characteristics of the full population
 
Aim: Ensure sample is representative
 
Requirement: Know your population
 
Dominant approach: probability sampling
 
Populations vs. Samples
Representative sample – If the sampling were repeated many times, the samples’ features would, on average, match those of the population
Probability sampling reduces sample selection bias and makes representativeness achievable on average
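The idea of matching the population "on average" can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the population is simulated rather than real, simple random sampling stands in for probability sampling in general, and the sample size (100) and number of repetitions (500) are arbitrary choices.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 simulated values (not real data)
population = [random.gauss(70, 8) for _ in range(10_000)]
population_mean = mean(population)

def draw_sample_mean(pop, n):
    """Mean of one simple random sample of size n, drawn without replacement."""
    return mean(random.sample(pop, n))

# Repeat the sampling many times and average the resulting sample means
sample_means = [draw_sample_mean(population, 100) for _ in range(500)]
average_of_sample_means = mean(sample_means)

print(f"population mean:         {population_mean:.2f}")
print(f"average of sample means: {average_of_sample_means:.2f}")
```

Any single sample mean may miss the population mean, but the average over many repeated random samples should land close to it.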
 
Data and Variables – Basics
Categorical
- Binary: e.g., 0 = unemployed, 1 = employed
 
- Nominal: Order does not matter (e.g., 0 = Green, 1 = Red, 2 = Blue)
 
- Ordinal: Order is meaningful (e.g., 0 = Poor, 1 = Fair, 2 = Good)
 
Data and Variables – Basics
Numerical
- Discrete: e.g., number of individuals in a household
 
- Continuous: e.g., height, weight, wages
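The type of a variable determines which summaries are meaningful. The sketch below uses small hypothetical lists of coded responses (all values are made up) to show one sensible summary per type: the mode for a nominal variable, the median for an ordinal one, and the mean for numerical ones.

```python
from statistics import mean, median_low, mode

# Hypothetical codings and responses (illustrative only)
colour_labels = {0: "Green", 1: "Red", 2: "Blue"}   # nominal: codes are just labels
rating_labels = {0: "Poor", 1: "Fair", 2: "Good"}   # ordinal: codes carry an order

colours = [0, 2, 2, 1, 2]
ratings = [0, 1, 2, 2, 1]
household_sizes = [1, 4, 2, 3, 2]            # discrete numerical
heights_cm = [162.5, 171.0, 180.2, 158.9]    # continuous numerical

# Nominal: only counting makes sense, so report the most frequent category
most_common_colour = colour_labels[mode(colours)]

# Ordinal: order is meaningful, so a median category is well defined
median_rating = rating_labels[median_low(ratings)]

# Numerical: arithmetic such as the mean is meaningful
mean_household = mean(household_sizes)
mean_height = mean(heights_cm)

print(most_common_colour, median_rating, mean_household, round(mean_height, 2))
```

Averaging the codes of a nominal variable would run without error but would not mean anything, which is why the variable type should be decided before the analysis.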
 
Cross-Sectional Data
- Cross-sectional datasets have one observation per unit
 
- Data for one variable (attribute) measured in N countries is written as:
 
\[
\{X_1, X_2, X_3, \dots, X_N\} = \{X_i\}_{i=1,\dots,N}
\]
Cross-Sectional Data
- Cross-sectional datasets have one observation per unit
 
- Example values for one variable (e.g., life expectancy):
 
\[
\{X_1, X_2, X_3, \dots, X_N\} = \{X_i\}_{i=1,\dots,N}
\]
\[
\{45.38333, 68.28611, 57.53013, \dots, 77.04861\} = \{X_i\}_{i=1,\dots,N}
\]
Cross-Sectional Data
- If we measure two attributes, we can represent them as a point in 2D space
 
- A single data point is a vector in two dimensions
 
Example:
- Life expectancy = 59.75
- Level of urbanization = 66.4
Then the data point, written as (urbanization, life expectancy), is:
\[
X = [66.4, 59.75]
\]
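In code, a cross-section with two attributes can be stored as one record per unit. The sketch below is one possible representation, not a prescribed one: the `Observation` type and the country names are hypothetical, and only the first record uses the values from the example above.

```python
from typing import NamedTuple

class Observation(NamedTuple):
    """One unit measured on two attributes (a point in 2D space)."""
    urbanization: float
    life_expectancy: float

# The data point from the example: X = [66.4, 59.75]
x = Observation(urbanization=66.4, life_expectancy=59.75)

# A cross-section: one observation per unit.
# "Country B" and "Country C" hold made-up values for illustration.
cross_section = {
    "Country A": x,
    "Country B": Observation(41.0, 63.2),
    "Country C": Observation(82.5, 78.1),
}

# Extracting one variable across all units gives {X_i}, i = 1, ..., N
life_expectancies = [obs.life_expectancy for obs in cross_section.values()]
print(life_expectancies)
```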
Evaluation of Empirical Propositions
Social scientists use statistical analyses to test theories through carefully formulated hypotheses.
Hypotheses are falsifiable claims about the world.
Hypotheses connect dependent variables to independent variables.
- Dependent variables: outcomes or things we want to explain
- Independent variables: factors that help explain the dependent variable
Example
Hypothesis:
An increase in X (independent variable) leads to an increase in Y (dependent variable).
Democratization Hypothesis:
More economic development is associated with higher levels of democracy.
To test this, we collect data on X and Y.
Units of analysis are the entities to which our theory applies (e.g., countries, individuals, firms).
Datasets
When we collect the data, we enter it into a spreadsheet, that is, a tabular format.
This becomes a dataset.
A Dataset
In this example, there appears to be a positive relationship between X and Y.
- Not all high-X observations have high Y
- Not all low-X observations have low Y
To evaluate the relationship, we fit a line that best approximates the pattern in the data.
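A standard way to fit such a line is ordinary least squares, which picks the intercept and slope that minimize the sum of squared vertical distances between the points and the line. The slides do not say which fitting method is used, so treat least squares as an assumption here. The X and Y values below are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: a positive but imperfect relationship
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 3.0, 5.0, 4.5]

intercept, slope = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

A positive slope summarizes the overall upward pattern even though individual points, such as the second and third here, do not follow it exactly.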
Cross-Sectional Data
Each country’s data is a point in a scatter plot.
If we measure three variables (e.g., life expectancy, urbanization, education), each country becomes a point in a 3D point cloud.
Time-Series Data
- A time series of length T is written as:
 
\[
\{X_1, X_2, X_3, \dots, X_T\} = \{X_t\}_{t=1,\dots,T}
\]
- A time series is a sequence of data points indexed in time order
 
- It has a natural temporal ordering
 
- Time serves as the second attribute, alongside the measured value
 
Time-Series Data vs. Cross-Section
Time-Series Data
A time series can be depicted as a 2D scatter plot.
Time is one variable, and the value of interest is the other.
So, each point in the time series is a pair: (time, value).
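The (time, value) representation translates directly into code. The sketch below uses hypothetical yearly values for a single unit. Because the temporal ordering matters, the pairs are sorted by time before computing period-to-period changes.

```python
# Hypothetical yearly values for one unit (illustrative only)
years = [2000, 2001, 2002, 2003]
values = [70.1, 70.4, 70.9, 71.3]

# Each element of the series is a (time, value) pair
series = sorted(zip(years, values))  # sort by time to respect temporal order

# Change from one period to the next, which only makes sense for ordered data
changes = [
    (t_next, round(v_next - v_prev, 2))
    for (_, v_prev), (t_next, v_next) in zip(series, series[1:])
]

print(series)
print(changes)
```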
Time-Series and Cross-Section Data
Time-series cross-section data combine the two: several units, each observed over multiple time periods.
Time-Series and Cross-Section Data
Balanced Panel = Every unit is observed in every time period. No missing time points for any unit.
![]()
Time-Series and Cross-Section Data
Unbalanced Panel = Some units are missing observations for some time periods. 
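Whether a panel is balanced can be checked mechanically: a balanced panel contains exactly one (unit, period) cell for every combination of unit and period. The sketch below assumes the panel is given as a list of (unit, period) pairs. The function names and the example units are hypothetical.

```python
from itertools import product

def is_balanced(rows):
    """True if every unit is observed in every time period."""
    observed = set(rows)
    units = {u for u, _ in observed}
    periods = {t for _, t in observed}
    return len(observed) == len(units) * len(periods)

def missing_cells(rows):
    """List the (unit, period) combinations that are not observed."""
    observed = set(rows)
    units = sorted({u for u, _ in observed})
    periods = sorted({t for _, t in observed})
    return [cell for cell in product(units, periods) if cell not in observed]

# Hypothetical panels: two units, three years
balanced = [("A", 2001), ("A", 2002), ("A", 2003),
            ("B", 2001), ("B", 2002), ("B", 2003)]
unbalanced = [("A", 2001), ("A", 2002), ("A", 2003),
              ("B", 2001), ("B", 2003)]  # unit B lacks 2002

print(is_balanced(balanced), is_balanced(unbalanced))
print(missing_cells(unbalanced))
```

The set of periods is taken from the data itself, so a period in which no unit was observed would go undetected. Supplying the intended list of periods explicitly would close that gap.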