The R environment always points to a certain directory on our computer, which is known as the working directory.
We can get the current working directory with getwd().
Type that in a new chunk. What is your current working directory?
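As a minimal sketch, the call looks like this. The path it prints differs on every computer, so no output is shown; the name current_dir is only an illustrative choice.

```r
# Ask R which folder it currently treats as the working directory
current_dir <- getwd()

# Print the path
current_dir
```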
Create a new folder called “week3”
Then set the working directory using setwd().
To find the directory path on macOS, go to the week3 folder and click “Get Info”.
Select what is in the “Where” section and copy it using “Cmd + C”.
In this case, the working directory will be:
[1] "/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data"
We now simply need to add week3 to this path.
Our final path will be:
[1] "/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week3"
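The steps above can be sketched in code. This is only an illustration: it uses tempdir() as a stand-in for the parent folder (an assumption, so that the example runs on any computer), and the names base_dir and week3_dir are hypothetical.

```r
# Stand-in for the parent folder (an assumption for illustration);
# replace it with your own path, e.g. the one copied from "Get Info"
base_dir <- tempdir()

# Add week3 to the path; file.path() inserts the separator for us
week3_dir <- file.path(base_dir, "week3")

# Create the folder if it does not exist yet
dir.create(week3_dir, showWarnings = FALSE)

# Point R to the new folder; setwd() returns the previous working directory
previous_dir <- setwd(week3_dir)

# Check where R points now
current_dir <- getwd()
current_dir

# Go back to where we started
setwd(previous_dir)
```

Building the path with file.path() avoids typing the slash by hand, which is an easy place to make a mistake.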
To do the same on Windows, go to the week3 folder, right-click it, and choose “Properties”.
In this case, the working directory will be:
We first need to change all the backslashes to forward slashes:
[1] "/Mac/Dropbox-1/john_cabot/teaching/big_data"
We now simply need to add week3 to this path.
Our final path will be:
[1] "/Mac/Dropbox-1/john_cabot/teaching/big_data/week3"
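The backslash replacement can also be done in R itself. The sketch below uses a made-up Windows path (C:\Users\student\big_data is hypothetical, not the path from the slides); note that inside an R string every backslash has to be typed twice.

```r
# Hypothetical Windows path; in an R string each backslash is typed twice
windows_path <- "C:\\Users\\student\\big_data"

# Replace every backslash with a forward slash
r_path <- gsub("\\", "/", windows_path, fixed = TRUE)
r_path

# Add week3 to the path
week3_path <- file.path(r_path, "week3")
week3_path
```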
In your case, this will look something like:
Please download the data for this week and place it in the relevant week folder.
The easiest datasets to work with are CSV files.
R is also capable of reading a variety of other dataset formats, including .dta, .sas, .xlsx, .xls, .txt, etc.
The type of library used to read these files will have implications for how quickly your computer can read the data
Let us look at some examples
read_csv
library("readr")
# Set the working directory
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week3/")
# Read the data
data_1851_obs10000 <- read_csv("./data/data_examples/data_1851_obs10000.csv")
# Record how long reading the data takes
# Step 1: Record the system time before reading
start_time <- Sys.time()
# Step 2: Load the data
data_1851_obs10000 <- read_csv("./data/data_examples/data_1851_obs10000.csv")
# Step 3: Record the system time when reading finishes
end_time <- Sys.time()
# Step 4: Calculate the difference
time_taken_a <- end_time - start_time
# Step 5: Print the difference
time_taken_a
Time difference of 0.1327941 secs
read.csv
# Read the same data with base R
# Step 1: Record the system time before reading
start_time <- Sys.time()
# Step 2: Load the data
data_1851_obs10000_b <- read.csv("./data/data_examples/data_1851_obs10000.csv")
# Step 3: Record the system time when reading finishes
end_time <- Sys.time()
# Step 4: Calculate the difference
time_taken_b <- end_time - start_time
# Step 5: Print the difference
time_taken_b
Time difference of 0.190902 secs
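The same comparison can be written more compactly with system.time(). The sketch below is an illustration only: it generates a made-up file in a temporary location instead of using the course data, and the names example_path, example_data, data_a and data_b are hypothetical. Timings vary between runs and computers, and on small files the gap may be negligible or even reversed.

```r
library("readr")

# Write a small example file to a temporary location (made-up data)
example_path <- tempfile(fileext = ".csv")
example_data <- data.frame(id = 1:10000, value = rnorm(10000))
write.csv(example_data, example_path, row.names = FALSE)

# system.time() reports how long an expression takes to run
time_readr <- system.time(data_a <- read_csv(example_path, show_col_types = FALSE))
time_base <- system.time(data_b <- read.csv(example_path))

# Elapsed seconds for each approach
time_readr["elapsed"]
time_base["elapsed"]
```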
The choice between read.csv and read_csv makes a difference.
#Reading CSV files
csv_file <- read.csv("./data/data_examples/data_1851_obs10000.csv")
# Reading a Stata file (read_dta comes from the haven library)
library("haven")
stata_file <- read_dta("./data/data_examples/data_1851_obs10000.dta")
# Reading an SPSS file (read.spss comes from the foreign library)
library("foreign")
spss_file <- read.spss("./data/data_examples/data_1851_obs10000.sav", to.data.frame = TRUE)
# Reading an Excel file (read_excel comes from the readxl library)
library("readxl")
excel_file <- read_excel("./data/data_examples/data_1851_obs10000.xlsx")
You can easily write CSV files to your hard drive using the readr library.
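A minimal sketch of writing a CSV file with write_csv() from readr. The data frame and the output location are made up for illustration; in practice you would point output_path to a file inside your week3 folder.

```r
library("readr")

# Made-up data frame used only for illustration
example_data <- data.frame(
  country = c("Brazil", "China"),
  cases = c(37737, 212258)
)

# Hypothetical output location; replace with a path inside your week3 folder
output_path <- file.path(tempdir(), "example_data.csv")

# Write the data frame to disk
write_csv(example_data, output_path)

# Read it back to check that the file was written correctly
check <- read_csv(output_path, show_col_types = FALSE)
check
```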
Tidy datasets are all alike
Messy data can be messy in their own way.
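You can become friends with tidy data by using one of the following strategies.
A common problem is that column names such as 1999 and 2000 are really values of a variable (year), while the cells hold the values of another variable (cases).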
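When the values of a variable such as year are stored as column names, one way to tidy the data is pivot_longer() from tidyr. The sketch below assumes the built-in example dataset table4a that ships with tidyr, which has one row per country and the columns 1999 and 2000; the name tidy4a is hypothetical.

```r
library("tidyr")

# table4a ships with tidyr: one row per country, with columns `1999` and `2000`
table4a

# Move the column names into a year column and the cell values into a cases column
tidy4a <- pivot_longer(
  table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)
tidy4a
```

Because the years start out as column names, the new year column is stored as text.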
This is what that looks like in code
Original
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Fix
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
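The fix above can be sketched with pivot_wider() from tidyr. This assumes the original table is the built-in dataset table2 that ships with tidyr, which has the same columns (country, year, type, count); the name wide2 is hypothetical.

```r
library("tidyr")

# Spread the values of type into their own columns, filled with the values of count
wide2 <- pivot_wider(table2, names_from = type, values_from = count)
wide2
```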
Sometimes a column contains data that should be separated into multiple columns.
This is where we use separate().
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
This is what that looks like in code
Original
Fix
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
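A minimal sketch of the separation, assuming the original table is the built-in dataset table3 that ships with tidyr (country, year, rate). By default separate() splits at any non-alphanumeric character, here the slash, and leaves the new columns as text, which is why cases and population appear as chr above. The names separated and separated_numeric are hypothetical.

```r
library("tidyr")

# Split rate into two new columns at the "/" character
separated <- separate(table3, rate, into = c("cases", "population"))
separated

# convert = TRUE asks separate() to turn the new columns into numbers
separated_numeric <- separate(
  table3,
  rate,
  into = c("cases", "population"),
  convert = TRUE
)
separated_numeric
```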
Uniting is the inverse of separating
It combines multiple columns into a single column
# A tibble: 6 × 4
country century year rate
<chr> <dbl> <chr> <chr>
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 00 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 00 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 00 213766/1280428583
This is what that looks like in code
Original
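A minimal sketch of uniting two columns, assuming the built-in dataset table5 that ships with tidyr (country, century, year, rate). The new column name full_year is a hypothetical choice; sep = "" joins the two pieces without the default underscore.

```r
library("tidyr")

# Paste century and year together into a single column
united <- unite(table5, "full_year", century, year, sep = "")
united
```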
Here is an example:
The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains NA
The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
To deal with them, we can use values_drop_na = TRUE
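library("tibble")
library("tidyr")
# stocks is rebuilt here from the values printed in this section's output
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(
    cols = c(`2015`, `2016`),
    names_to = "year",
    values_to = "return",
    values_drop_na = TRUE
  )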
# A tibble: 6 × 3
qtr year return
<dbl> <chr> <dbl>
1 1 2015 1.88
2 2 2015 0.59
3 2 2016 0.92
4 3 2015 0.35
5 3 2016 0.17
6 4 2016 2.66
Notice the use of values_drop_na = TRUE
Notice that the 4th quarter of 2015 is missing now
We can also use complete()
# A tibble: 8 × 3
year qtr return
<dbl> <dbl> <dbl>
1 2015 1 1.88
2 2015 2 0.59
3 2015 3 0.35
4 2015 4 NA
5 2016 1 NA
6 2016 2 0.92
7 2016 3 0.17
8 2016 4 2.66
complete() takes a set of columns and finds all unique combinations.
It then ensures the original dataset contains all those values, filling in explicit NAs where necessary.
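A minimal sketch of the call that produces a table like the one above. The stocks tibble is rebuilt from the values printed in this section, and the name completed is hypothetical.

```r
library("tibble")
library("tidyr")

# stocks rebuilt from the values printed in this section
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# Make every combination of year and qtr appear, with NA where no return exists
completed <- complete(stocks, year, qtr)
completed
```

The first quarter of 2016 now appears as a row with an explicit NA, so both kinds of missing values are visible in the same way.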
Sometimes the NA is not random.
Missing values could indicate that the previous value should be carried forward:
Example:
We can deal with this problem by using fill()
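A minimal sketch of fill(), using made-up records in which the person is only written on the first row of each group. The tibble treatment and all of its values are hypothetical and serve only as an illustration.

```r
library("tibble")
library("tidyr")

# Made-up records: the person is only recorded on the first row of each group
treatment <- tribble(
  ~person, ~visit, ~response,
  "Ana",   1,      7,
  NA,      2,      10,
  NA,      3,      9,
  "Ben",   1,      4
)

# Carry the last non-missing value of person forward into the NA rows
filled <- fill(treatment, person)
filled
```

By default fill() carries values downward; its .direction argument can change that.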
Popescu (JCU): Lecture 5