The R environment always points to a certain directory on our computer, which is known as the working directory.
We can get the current working directory with getwd().
Type that in a new chunk. What is your current working directory?
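As a minimal sketch, the call looks like this. The path it prints is specific to each machine, so no output is shown; current_dir is just an illustrative variable name.

```r
# Ask R which directory it currently points to
current_dir <- getwd()

# Print the path (this will differ from one computer to another)
print(current_dir)
```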
Create a new folder called “week3”
Then set the working directory to that folder using setwd()
To find the folder's path on macOS, go to the week3 folder, right-click it, and choose “Get Info”
Select what is in the “Where” section and copy it using “Cmd + C”
In this case, the working directory will be:
[1] "/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data"
We now simply need to add week3 to this path
Our final path will be:
[1] "/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week3"
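Rather than typing the full path by hand, the week3 part can be appended with file.path(). The sketch below assumes a stand-in base folder (tempdir()) in place of the Dropbox path above, so that it runs on any machine; base_dir, week_dir, and old_dir are illustrative names.

```r
# Stand-in for the course folder; replace with your own path
base_dir <- tempdir()

# Append "week3" to the base path
week_dir <- file.path(base_dir, "week3")

# Create the folder if it does not exist yet
dir.create(week_dir, showWarnings = FALSE)

# Point R to the new folder; setwd() returns the previous directory
old_dir <- setwd(week_dir)
print(basename(getwd()))

# Go back to where we started
setwd(old_dir)
```

Saving the previous directory in old_dir makes it easy to undo the change once you are done.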
To do the same on Windows, go to the week3 folder, right-click it, and choose “Properties”
In this case, the working directory will be:
We first need to change all the backslashes to forward slashes:
[1] "/Mac/Dropbox-1/john_cabot/teaching/big_data"
We now simply need to add week3 to this path
Our final path will be:
[1] "/Mac/Dropbox-1/john_cabot/teaching/big_data/week3"
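The backslash replacement can also be done in R itself. The sketch below uses a hypothetical Windows path (not the one from the slide) and base R's gsub(); windows_path, r_path, and week_path are illustrative names.

```r
# Hypothetical Windows path copied from the "Properties" window
# (backslashes must be doubled inside an R string)
windows_path <- "C:\\Users\\student\\big_data"

# Replace every backslash with a forward slash
r_path <- gsub("\\", "/", windows_path, fixed = TRUE)

# Add the week3 folder to the path
week_path <- file.path(r_path, "week3")
print(week_path)
```

With fixed = TRUE the pattern is treated as a literal backslash rather than a regular expression, which avoids an extra layer of escaping.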
In your case, this will look something like:
Please download the data for this week and place it in the relevant week folder.
The easiest datasets to work with are CSV files.
R is also capable of reading a variety of other dataset formats, including .dta, .sas, .xlsx, .xls, .txt, etc.
The library used to read these files affects how quickly your computer can read the data.
Let us look at some examples
read_csv
library("readr")
#This is to set the directory
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week3/")
# Reading data
data_1851_obs10000<-read_csv("./data/data_examples/data_1851_obs10000.csv")
#Recording how long it takes
#Step1: Recording your system's time
start_time <- Sys.time()
#Step2: Loading the data
data_1851_obs10000<-read_csv("./data/data_examples/data_1851_obs10000.csv")
#Step3: Recording when it finishes
end_time <- Sys.time()
#Step4: Calculating the difference
time_taken_a <- end_time - start_time
#Step5: Printing the difference
time_taken_a
Time difference of 0.1327941 secs
read.csv
# Reading data
#Step1: Recording your system's time
start_time <- Sys.time()
#Step2: Loading the data
data_1851_obs10000_b<-read.csv("./data/data_examples/data_1851_obs10000.csv")
#Step3: Recording when it finishes
end_time <- Sys.time()
#Step4: Calculating the difference
time_taken_b <- end_time - start_time
#Step5: Printing the difference
time_taken_b
Time difference of 0.190902 secs
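The start/stop pattern can also be wrapped in system.time(). The sketch below is an alternative to the Sys.time() approach, not the method used in this lecture. It writes a small throwaway CSV to a temporary file so that it runs without the course data, and time_read is a hypothetical helper name. Timings vary by machine, so none are shown.

```r
# Throwaway data written to a temporary CSV file
tmp_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10000, value = rnorm(10000)),
          tmp_csv, row.names = FALSE)

# Hypothetical helper: elapsed seconds for one call to a reader function
time_read <- function(reader, path) {
  system.time(reader(path))[["elapsed"]]
}

# Time base R's read.csv
time_base <- time_read(read.csv, tmp_csv)
print(time_base)

# Time readr's read_csv, but only if the readr package is installed
if (requireNamespace("readr", quietly = TRUE)) {
  time_readr <- time_read(
    function(p) suppressMessages(readr::read_csv(p)),
    tmp_csv
  )
  print(time_readr)
}
```

A single run is noisy; repeating the measurement a few times gives a fairer comparison.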
Choosing read.csv or read_csv makes a difference to how quickly the data loads
#Reading CSV files
csv_file<-read.csv("./data/data_examples/data_1851_obs10000.csv")
#Library for reading Stata files
library("haven")
#Reading a Stata file
stata_file<-read_dta("./data/data_examples/data_1851_obs10000.dta")
#Library for reading SPSS files
library("foreign")
#Reading an SPSS file
spss_file<-read.spss("./data/data_examples/data_1851_obs10000.sav", to.data.frame=TRUE)
#Library for reading Excel files
library("readxl")
#Reading an Excel file
excel_file<-read_excel("./data/data_examples/data_1851_obs10000.xlsx")
You can easily write CSV files to your hard drive using the readr library.
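A minimal sketch of writing a CSV with readr's write_csv() follows. The data frame is made up for illustration, and a temporary file stands in for a path inside your week3 folder; scores, out_path, and scores_back are illustrative names.

```r
library("readr")

# Small illustrative data frame (made-up values)
scores <- data.frame(id = 1:3, score = c(90, 85, 77))

# Temporary file as a stand-in for a path inside your week3 folder
out_path <- tempfile(fileext = ".csv")

# Write the data frame to disk as a CSV file
write_csv(scores, out_path)

# Read it back to confirm the round trip worked
scores_back <- suppressMessages(read_csv(out_path))
print(scores_back)
```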
Tidy datasets are all alike
Messy data can be messy in their own way.
You can become friends with tidy data by using one of the following strategies:
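The column names 1999 and 2000 are values of a variable (year), and the cells hold the values of another variable (cases)
This is what that looks like in code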
Original
Fix
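A minimal sketch of pivoting year columns into rows follows. It assumes the data are tidyr's built-in table4a, which is an assumption because the original table is not shown here; tidy4a is an illustrative name.

```r
library("tidyr")

# table4a ships with tidyr: one row per country, one column per year
print(table4a)

# Move the year columns into two new variables: year and cases
tidy4a <- table4a %>%
  pivot_longer(
    cols = c(`1999`, `2000`),
    names_to = "year",
    values_to = "cases"
  )
print(tidy4a)
```

The backticks are needed because 1999 and 2000 are not syntactically valid R names.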
This is what that looks like in code
Original
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
Fix
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
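The tables above match tidyr's built-in table2, so the fix can be sketched with pivot_wider() as follows. That the slide used table2 is an assumption based on the printed output; wide2 is an illustrative name.

```r
library("tidyr")

# table2 ships with tidyr: each observation is spread over two rows
print(table2)

# Spread the type column into separate cases and population columns
wide2 <- table2 %>%
  pivot_wider(names_from = type, values_from = count)
print(wide2)
```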
We sometimes have columns that contain data which should be separated into multiple columns
This is where we use separate()
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583
This is what that looks like in code
Original
Fix
# A tibble: 6 × 4
  country      year cases  population
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
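The tables above match tidyr's built-in table3, so the split can be sketched as follows. That the slide used table3 is an assumption based on the printed output; sep3 and sep3_num are illustrative names.

```r
library("tidyr")

# Split rate at the "/" into two new columns; both remain character columns
sep3 <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/")
print(sep3)

# Optional: convert = TRUE asks separate() to guess better column types
sep3_num <- table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)
print(sep3_num)
```

Without convert = TRUE the new columns stay as text, which is why the output above shows them as <chr>.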
Uniting is the inverse of separating
It combines multiple columns into a single column
# A tibble: 6 × 4
  country     century year  rate             
  <chr>         <dbl> <chr> <chr>            
1 Afghanistan      19 99    745/19987071     
2 Afghanistan      20 00    2666/20595360    
3 Brazil           19 99    37737/172006362  
4 Brazil           20 00    80488/174504898  
5 China            19 99    212258/1272915272
6 China            20 00    213766/1280428583
This is what that looks like in code
Original
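A minimal sketch with unite() follows, assuming the table above is tidyr's built-in table5. The new column name, new, and the result name united5 are arbitrary choices for illustration.

```r
library("tidyr")

# Paste century and year together into one column called "new"
# sep = "" joins the pieces with nothing in between
united5 <- table5 %>%
  unite(new, century, year, sep = "")
print(united5)
```

If sep is left out, unite() places an underscore between the pieces (for example 19_99).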
Here is an example:
The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains NA
The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
One way to handle explicit missing values is to set values_drop_na = TRUE in pivot_longer(), which drops them so that they become implicit
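library("tibble")
library("tidyr")
#Data with one explicit NA (2015, quarter 4) and one implicit gap (2016, quarter 1)
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(
    cols = c(`2015`, `2016`), 
    names_to = "year", 
    values_to = "return", 
    values_drop_na = TRUE
  )
# A tibble: 6 × 3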
    qtr year  return
  <dbl> <chr>  <dbl>
1     1 2015    1.88
2     2 2015    0.59
3     2 2016    0.92
4     3 2015    0.35
5     3 2016    0.17
6     4 2016    2.66
Notice the use of values_drop_na = TRUE
Notice that the 4th quarter of 2015 is missing now
We can also use complete()
# A tibble: 8 × 3
   year   qtr return
  <dbl> <dbl>  <dbl>
1  2015     1   1.88
2  2015     2   0.59
3  2015     3   0.35
4  2015     4  NA   
5  2016     1  NA   
6  2016     2   0.92
7  2016     3   0.17
8  2016     4   2.66
complete() takes a set of columns, and finds all unique combinations
It then ensures the original dataset contains all those values, filling in explicit NAs where necessary.
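A minimal sketch of complete() on the stocks data follows. The values are taken from the output shown above; completed is an illustrative name.

```r
library("tibble")
library("tidyr")

# Stock returns: 2015 Q4 is an explicit NA, 2016 Q1 is absent altogether
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Build every year/quarter combination and fill the gaps with NA
completed <- stocks %>%
  complete(year, qtr)
print(completed)
```

The result has eight rows: the seven original ones plus a new row for the first quarter of 2016.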
Sometimes the NA is not random
Missing values could indicate that the previous value should be carried forward:
Example:
We can deal with this problem by using fill()
Original
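A minimal sketch of fill() follows. The data frame is hypothetical (made-up names and values): the person is recorded only on the first row of each block, and the NAs below it mean "same as above".

```r
library("tibble")
library("tidyr")

# Hypothetical data: person is only written once per block of rows
treatment <- tribble(
  ~person,    ~treatment, ~response,
  "Person A",          1,         7,
  NA,                  2,        10,
  NA,                  3,         9,
  "Person B",          1,         4
)

# Carry the last non-missing person value forward into the NAs
filled <- treatment %>%
  fill(person)
print(filled)
```

By default fill() carries values downward; its .direction argument can change this.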
Popescu (JCU): Lecture 5