Previously, we covered quite a bit of ground:
Before we use ggplot, our data has to be tidy
This means:
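In the usual sense of the term, tidy data has one variable per column, one observation per row, and one value per cell. A minimal sketch of reshaping a wide table into that form with pandas follows; the table `wide` and its numbers are made up for illustration and are not taken from the lecture data.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
# The values are invented for illustration only.
wide = pd.DataFrame({
    "Entity": ["Italy", "Brazil"],
    "1990": [77.0, 66.0],
    "2000": [79.5, 70.0],
})

# Reshape to tidy form: one row per country-year observation.
tidy = wide.melt(id_vars="Entity", var_name="Year", value_name="life_exp_yearly")

# Year labels come from column names, so they start out as strings.
tidy["Year"] = tidy["Year"].astype(int)

print(tidy)
```

After the reshape, each plotting aesthetic (x, y, colour) can be mapped to a single column, which is what ggplot-style libraries expect.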
People are better at seeing height differences than angle and area differences
This is how to obtain the same plots.
For example, let’s say we want to compare life expectancy in Latin America with that in the EU.
Python
import pandas as pd

# Setting path
path = "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/text_analysis/week6/lecture6a/data/"
# Step 1: Loading the data
life_expectancy_df = pd.read_csv(f'{path}life-expectancy.csv')
urbanization_df = pd.read_csv(f'{path}share-of-population-urban.csv')
# Step 2: Removing countries with no country code
weird_labels = ["OWID_KOS", "OWID_WRL", ""]
clean_life_expectancy_df = life_expectancy_df[~life_expectancy_df['Code'].isin(weird_labels)]
# Step 3: Changing variable name
clean_life_expectancy_df = clean_life_expectancy_df.rename(columns={"Life expectancy at birth (historical)": "life_exp_yearly"})
# Step 4: Keeping only relevant vars
clean_life_expectancy_df2 = clean_life_expectancy_df[['Entity', 'Code', 'Year', 'life_exp_yearly']]
# Step 5: Removing countries with no country code
clean_urbanization_df = urbanization_df[~urbanization_df['Code'].isin(weird_labels)]
# Step 6: Changing variable name
clean_urbanization_df = clean_urbanization_df.rename(columns={"Urban population (% of total population)": "urb_yearly"})
# Step 7: Keeping only relevant vars
clean_urbanization_df2 = clean_urbanization_df[['Code', 'Year', 'urb_yearly']]
# Step 8: Performing a merge
merged_data_temp = pd.merge(clean_life_expectancy_df2, clean_urbanization_df2, on=['Code', 'Year'], how='left')
# Step 9: Removing NAs
merged_data_temp = merged_data_temp.dropna()
# Step 10: Defining continents
eu_countries = [
    "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia", "Denmark", "Estonia",
    "Finland", "France", "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia",
    "Lithuania", "Luxembourg", "Malta", "Netherlands", "Poland", "Portugal", "Romania",
    "Slovakia", "Slovenia", "Spain", "Sweden"]
latam_countries = [
    "Belize", "Costa Rica", "El Salvador", "Guatemala", "Honduras", "Mexico", "Nicaragua", "Panama",
    "Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador", "Guyana", "Paraguay", "Peru",
    "Suriname", "Uruguay", "Venezuela", "Cuba", "Dominican Republic", "Haiti"]
# Step 11: Labeling continents
merged_data_temp['continent'] = 'Everything Else' # Default value
merged_data_temp.loc[merged_data_temp['Entity'].isin(eu_countries), 'continent'] = 'EU'
merged_data_temp.loc[merged_data_temp['Entity'].isin(latam_countries), 'continent'] = 'Latin America'
         continent         Year  life_exp_mean   urb_mean
0               EU  1990.000000      74.259927  67.138362
1  Everything Else  1989.976588      62.945803  47.720601
2    Latin America  1990.000000      66.128083  59.018407
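The code that produced the summary above is not shown. One plausible way to get a table of this shape is a grouped mean with named aggregation. The sketch below runs on a small made-up frame called `toy` that stands in for `merged_data_temp`. The output column names `life_exp_mean` and `urb_mean` are copied from the table header, and everything else is an assumption.

```python
import pandas as pd

# Toy stand-in for merged_data_temp; the values are invented for illustration.
toy = pd.DataFrame({
    "continent": ["EU", "EU", "Latin America", "Latin America"],
    "Year": [1990, 1991, 1990, 1991],
    "life_exp_yearly": [74.0, 75.0, 66.0, 67.0],
    "urb_yearly": [67.0, 68.0, 59.0, 60.0],
})

# Average every numeric column within each group, renaming as we go.
summary = (
    toy.groupby("continent", as_index=False)
       .agg(Year=("Year", "mean"),
            life_exp_mean=("life_exp_yearly", "mean"),
            urb_mean=("urb_yearly", "mean"))
)

print(summary)
```

Replacing `toy` with `merged_data_temp` should give one row per value of `continent`, as in the table above.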
For example, let’s say we want to compare life expectancy in Latin America, the EU, and the rest of the world.
A more compelling way to show this is a boxplot with the individual points overlaid.
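A rough sketch of a boxplot with jittered points, using matplotlib, is given below. The lecture may well use a different plotting library, so the library choice is an assumption. The frame `toy` is simulated data standing in for `merged_data_temp`, and the output file name is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Simulated stand-in for merged_data_temp; the numbers are not real data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "continent": np.repeat(["EU", "Latin America", "Everything Else"], 30),
    "life_exp_yearly": np.concatenate([
        rng.normal(74, 3, 30),
        rng.normal(66, 5, 30),
        rng.normal(63, 8, 30),
    ]),
})

# One array of values per group, in order of first appearance.
groups = list(toy["continent"].unique())
values = [toy.loc[toy["continent"] == g, "life_exp_yearly"].to_numpy()
          for g in groups]

fig, ax = plt.subplots(figsize=(6, 4))

# Boxes sit at x = 1, 2, 3; outlier markers are hidden because the raw
# points are drawn on top anyway.
ax.boxplot(values, showfliers=False)

# Overlay the observations with a little horizontal jitter.
for i, vals in enumerate(values, start=1):
    x = rng.uniform(i - 0.15, i + 0.15, size=len(vals))
    ax.scatter(x, vals, s=10, alpha=0.5)

ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(groups)
ax.set_ylabel("Life expectancy (years)")

fig.savefig("boxplot_points.png", dpi=150)  # hypothetical output file name
```

The jitter only spreads points horizontally so they do not overlap; their vertical positions are the actual values, which keeps the comparison based on height.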