To get the file path, we navigate to the relevant folder and select the file:
# This opens a file dialog to select your file
file_path <- file.choose()
file_path
Opening the File
Once we have the path, we can now read the files:
# Defining Paths
path_data <- "/Users/bgpopescu/Library/CloudStorage/Dropbox/john_cabot/teaching/research_workshop/essential_stats/week3/lab/data/"
# Use file.path() to construct full path
life_expectancy_df <- read.csv(file.path(path_data, "life-expectancy.csv"))
urbanization_df <- read.csv(file.path(path_data, "share-of-population-urban.csv"))
Please change the path to your computer’s path.
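Because a hard-coded path only works on one machine, it can help to confirm that the file exists before reading it. The following is a minimal sketch and not part of the original lab: the temporary folder, the file name toy-example.csv, and the toy values are all assumptions made so that the example runs on its own. It uses the same file.path() pattern as above.

```r
# Hypothetical stand-in for the data folder; replace with your own path
path_data <- tempdir()

# Toy data, written only so that the example is self-contained
toy_df <- data.frame(Entity = c("A", "B"), Year = c(2000, 2000))
write.csv(toy_df, file.path(path_data, "toy-example.csv"), row.names = FALSE)

# Build the full path and stop early with a clear message if it is wrong
csv_path <- file.path(path_data, "toy-example.csv")
if (!file.exists(csv_path)) {
  stop("File not found: ", csv_path)
}
toy_read <- read.csv(csv_path)
```

If the path is wrong, the stop() message prints the full path that was tried, which is usually easier to debug than the default read.csv() error.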
What We Want to Do
Explanation
These are our datasets
Doing it
Calculating Average by Country Code
This is what this looks like in code for life_expectancy_df2:
# Step3: Turning the table into a dataframe
freq_table <- data.frame(freq_table)
# Step4: Identifying how to extract the variable names
names(freq_table)
[1] "Var1" "Freq"
Creating a Bar Plot from a Frequency Table
Step2: Creating a frequency table
# Step5: Providing more intuitive names
names(freq_table)[1] <- "life_exp_mean_rounded"
names(freq_table)[2] <- "frequency"
# Step6: Turning factor variables into numeric
freq_table$life_exp_mean_rounded <- as.numeric(as.character(freq_table$life_exp_mean_rounded))
str(freq_table)
'data.frame': 35 obs. of 2 variables:
$ life_exp_mean_rounded: num 38 43 44 45 46 47 48 49 50 51 ...
$ frequency : int 1 2 3 4 4 5 5 2 2 4 ...
names(freq_table)
[1] "life_exp_mean_rounded" "frequency"
Inspecting the new dataframe
freq_table
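The frequency-table steps can be sketched end to end on a toy data frame. This is an illustration under stated assumptions, not the lab's own code: the column names Code and life_exp_yearly and all values are invented, and aggregate() is used here as one possible way to compute the per-country average.

```r
# Toy data: the column names (Code, life_exp_yearly) are assumptions
toy_df <- data.frame(
  Code = c("AAA", "AAA", "BBB", "BBB", "CCC", "CCC"),
  life_exp_yearly = c(50.2, 51.0, 50.4, 51.2, 70.1, 71.3)
)

# Average life expectancy per country code
toy_means <- aggregate(life_exp_yearly ~ Code, data = toy_df, FUN = mean)

# Round the means so that countries fall into whole-year bins
toy_means$life_exp_mean_rounded <- round(toy_means$life_exp_yearly)

# Frequency table -> data frame -> intuitive names -> numeric values
toy_freq <- data.frame(table(toy_means$life_exp_mean_rounded))
names(toy_freq) <- c("life_exp_mean_rounded", "frequency")
toy_freq$life_exp_mean_rounded <- as.numeric(as.character(toy_freq$life_exp_mean_rounded))
toy_freq
```

The as.character() call matters: calling as.numeric() directly on a factor returns the internal level codes (1, 2, ...) rather than the values the labels represent.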
Step3: Creating the Barplot
library(ggplot2)
# Step7: Creating the barplot
fig5 <- ggplot(data = freq_table, aes(x = life_exp_mean_rounded, y = frequency)) +
  geom_bar(stat = "identity") +
  theme_bw()
fig5
What do you think caused the dip in life expectancy?
Exploring Data
What is the year with the lowest life expectancy for the US?
# Creating a new df with the US
df_us_after1900 <- subset(df_us_uk_after1900, Entity == "United States")
df_us_after1900$Year[df_us_after1900$life_exp_yearly == min(df_us_after1900$life_exp_yearly)]
[1] 1918
What is the year with the second lowest life expectancy for the US?
library(dplyr)
# Arranging by life expectancy (ascending) and creating a new dataframe
df <- df_us_after1900 %>%
  arrange(life_exp_yearly)
# Selecting the second lowest value
second_lowest_life_expectancy <- df$life_exp_yearly[2]
# Selecting the year with the second lowest value
df$Year[df$life_exp_yearly == second_lowest_life_expectancy]
[1] 1901
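The same question can also be answered in base R without sorting the whole data frame. The sketch below is an illustration on invented toy values (the data and the names second_lowest and second_lowest_years are assumptions); it takes the second smallest distinct value, so a tie at the minimum does not get counted as the second lowest.

```r
# Toy data; the values are invented for illustration only
toy_df <- data.frame(
  Year = c(1900, 1901, 1902, 1903),
  life_exp_yearly = c(48.0, 47.0, 45.5, 47.0)
)

# Second smallest distinct value: sort the unique values and take the second
second_lowest <- sort(unique(toy_df$life_exp_yearly))[2]

# All years that share that value (there can be more than one)
second_lowest_years <- toy_df$Year[toy_df$life_exp_yearly == second_lowest]
second_lowest_years
```

Taking the second row after sorting gives the same answer whenever the minimum occurs only once; if the minimum were tied across two years, the second row would hold the minimum again.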
What is the year with the lowest life expectancy for the UK?
# Creating a new df with the UK
df_uk_after1900 <- subset(df_us_uk_after1900, Entity == "United Kingdom")
df_uk_after1900$Year[df_uk_after1900$life_exp_yearly == min(df_uk_after1900$life_exp_yearly)]
[1] 1901
Conclusion
What Did We Learn Today?
How to import and clean real-world data
The difference between histograms, bar plots, and line plots
How to use ggplot2 for data visualization
How to explore patterns and trends in life expectancy and urbanization