L5: Reading Data and Working Directories

Bogdan G. Popescu

John Cabot University

The Working Directory

The R environment always points to a certain directory on our computer, which is known as the working directory.

We can get the current working directory with getwd

Type that in a new chunk. What is your current working directory?

getwd()

The Working Directory

The R environment always points to a certain directory on our computer, which is known as the working directory.

We can get the current working directory with getwd

Type that in a new chunk. What is your current working directory?

getwd()
[1] "/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week3"

The Working Directory on MacOS

Create a new folder called “week3”

Then set the directory using setwd

To know the directory on a MacOS, you go to the week3 folder and click on “Get Info”

The Working Directory on MacOS

The Working Directory on MacOS

The Working Directory on MacOS

Select what is in the “Where” section

The Working Directory on MacOS

Select what is in the “Where” section and copy it using “Cmd + C”

The Working Directory on MacOS

In this case, the working directory will be:

[1] "/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data"

We now simply need to add week3 to this path

Our final path will be:

[1] "/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week3"

To set this directory, we type in:

setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week3")

The Working Directory on Windows

To do the same on Windows, you go to the week3 folder and right-click on “Properties”

The Working Directory on Windows

To do the same on Windows, you go to the week3 folder and right-click on “Properties”

The Working Directory on Windows

To do the same on Windows, you go to the week3 folder and right-click on “Properties”

The Working Directory on Windows

To do the same on Windows, you go to the week3 folder and right-click on “Properties”

The Working Directory on Windows

In this case, the working directory will be:

cat("\\\\Mac\\Dropbox-1\\john_cabot\\teaching\\big_data")
\\Mac\Dropbox-1\john_cabot\teaching\big_data

We first need to change all the backslashes to forward slashes:

[1] "/Mac/Dropbox-1/john_cabot/teaching/big_data"

We now simply need to add week3 to this path

Our final path will be:

[1] "/Mac/Dropbox-1/john_cabot/teaching/big_data/week3"

To set this directory, we type in:

setwd("/Mac/Dropbox-1/john_cabot/teaching/big_data/week3")

The Working Directory on Windows

In your case, this will look something like:

setwd("C:/Dropbox/john_cabot/teaching/big_data/week3")

The Data for this Week

Please download the data for this week and place it in the relevant week folder.

Importing Data into R

The easiest datasets to work with are csv files

R is also capabale of reading a variety of other dataset formats including: .dta, .sas, .xlsx, xls, txt, etc.

The type of library used to read these files will have implications for how quickly your computer can read the data

Let us look at some examples

  • read_csv
  • read.csv

Importing Data into R

  • read_csv
library("readr")
#This is to set the directory
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week3/")
# Reading data
data_1851_obs10000<-read_csv("./data/data_examples/data_1851_obs10000.csv")

#Recording how long it takes

#Step1:Recoding your system's time
start_time <- Sys.time()
#Step2:Loding the data
data_1851_obs10000<-read_csv("./data/data_examples/data_1851_obs10000.csv")
#Step3: Recording when it finishes
end_time <- Sys.time()
#Step4: Calculating the difference
time_taken_a <- end_time - start_time
#Step5: Printing the difference
time_taken_a
Time difference of 0.1327941 secs

Importing Data into R

  • read.csv
# Reading data
#Step1:Recoding your system's time
start_time <- Sys.time()
#Step2:Loding the data
data_1851_obs10000_b<-read.csv("./data/data_examples/data_1851_obs10000.csv")
#Step3: Recording when it finishes
end_time <- Sys.time()
#Step4: Calculating the difference
time_taken_b <- end_time - start_time
#Step5: Printing the difference
time_taken_b
Time difference of 0.190902 secs

Importing Data into R

  • read.csv
  • read_csv
  • read.csv or read_csv makes a difference
  • the difference is: time_taken_b - time_taken_a = 0.0581
  • This is hugely important for large datasets

Alernative File Formats

  • csv
  • xlsx and xls
  • spss
  • Stata

Examples of Reading Different Files

#Reading CSV files
csv_file<-read.csv("./data/data_examples/data_1851_obs10000.csv")
#Reading a Stata file
#Library for reading Stata files
library("haven")
#Library for reading SPSS files
library("foreign")
stata_file<-read_dta("./data/data_examples/data_1851_obs10000.dta")
spss_file<-read.spss("./data/data_examples/data_1851_obs10000.sav", to.data.frame=TRUE)
#Library for reading Excel file
library("readxl")
excel_file<-read_excel("./data/data_examples/data_1851_obs10000.xlsx")

Writing CSV files

You can easily write CSV files to your hard drive using the readr library

#load the tidyverse readr package
library(readr)
#writing data as csv
write_csv(excel_file, "./data/data_examples/data_1851_obs10000_written.csv")

Tidy Data

  • Tidy Data = “standard way of mapping the meaning of a dataset to its structure.” (Hadley Wickham)
  • Within tidy data:
    • each variable forms a column
    • each observation forms a row
    • each cell is a single measurement
  • Tidy data means that all datasets are alike

Tidy Data

Tidy Data

Tidy Data

Tidy Data

Tidy datasets are all alike

Tidy Data

Tidy datasets are all alike

Tidy Data

Tidy datasets are all alike

Untidy Data

Messy data can be messy in their own way.

Untidy Data

Messy data can be messy in their own way.

Untidy Data

Messy data can be messy in their own way.

Untidy Data

Messy data can be messy in their own way.

Untidy Data

Messy data can be messy in their own way.

Tidy Data

Tidy Data

You can become friends with tidy data by following one of the following strategies:

  • Pivoting
    • Longer
    • Wider
  • Separating
  • Uniting

Untidy Data 1: Pivoting Longer

  • A common problem is that column names are not names of variables, but values of a variable
table4a <- read_excel("./data/data_examples/table4a.xlsx")
table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
  • We have columns whose names are values, not variables: 1999 and 2000
  • We need a new column that reflects these two variable. Let’s call it year

Untidy Data 1: Pivoting Longer

  • We also need another column that reflects the numbers. Let’s call it cases

Untidy Data 1: Pivoting Longer

Untidy Data 1: Pivoting Longer

Untidy Data 1: Pivoting Longer

This is what that looks like in code

Original

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

Fix

library("dplyr")
library("tidyr")
table4a %>% 
  pivot_longer(c(`1999`, `2000`), 
  names_to = "year",
  values_to = "cases")
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Untidy Data 2: Pivoting Wider

  • This is where one observation is scattered across multiple rows.
  • Within the table below, an observation is a country in a year, but each observation is spread across two rows.
table4b <- read_excel("./data/data_examples/table4b.xlsx")
table4b
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Untidy Data 2: Pivoting Wider

Untidy Data 2: Pivoting Wider

Untidy Data 2: Pivoting Wider

This is what that looks like in code

Original

table4b
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Fix

table4b %>% 
   pivot_wider(names_from = type, 
   values_from = count)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Untidy Data 3: Separating

  • We sometimes may have columns that contain data which should separated in multiple columns

  • This where we use separate()

table3 <- read_excel("./data/data_examples/table3.xlsx")
table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Untidy Data 3: Separating

Untidy Data 3: Separating

Untidy Data 3: Separating

This is what that looks like in code

Original

table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Fix

table3 %>% 
  separate(rate, into = c("cases", "population"))
# A tibble: 6 × 4
  country      year cases  population
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Untidy Data 4: Uniting

  • Uniting is the inverse of separating

  • It combines multiple columns into a single column

table5 <- read_excel("./data/data_examples/table5.xlsx")
table5
# A tibble: 6 × 4
  country     century year  rate             
  <chr>         <dbl> <chr> <chr>            
1 Afghanistan      19 99    745/19987071     
2 Afghanistan      20 00    2666/20595360    
3 Brazil           19 99    37737/172006362  
4 Brazil           20 00    80488/174504898  
5 China            19 99    212258/1272915272
6 China            20 00    213766/1280428583

Untidy Data 4: Uniting

Untidy Data 4: Uniting

Untidy Data 4: Uniting

This is what that looks like in code

Original

table5
# A tibble: 6 × 4
  country     century year  rate             
  <chr>         <dbl> <chr> <chr>            
1 Afghanistan      19 99    745/19987071     
2 Afghanistan      20 00    2666/20595360    
3 Brazil           19 99    37737/172006362  
4 Brazil           20 00    80488/174504898  
5 China            19 99    212258/1272915272
6 China            20 00    213766/1280428583

Fix

table5 %>% 
  unite(new, century, year, sep = "")
# A tibble: 6 × 3
  country     new   rate             
  <chr>       <chr> <chr>            
1 Afghanistan 1999  745/19987071     
2 Afghanistan 2000  2666/20595360    
3 Brazil      1999  37737/172006362  
4 Brazil      2000  80488/174504898  
5 China       1999  212258/1272915272
6 China       2000  213766/1280428583

Missing Values 1

Here is an example:

library("tibble")
library("tidyr")
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
  • The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains NA

  • The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.

Missing Values 1

To deal with them, we can use values_drop_na = TRUE

library("tibble")
library("tidyr")

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(
    cols = c(`2015`, `2016`), 
    names_to = "year", 
    values_to = "return", 
    values_drop_na = TRUE
  )
# A tibble: 6 × 3
    qtr year  return
  <dbl> <chr>  <dbl>
1     1 2015    1.88
2     2 2015    0.59
3     2 2016    0.92
4     3 2015    0.35
5     3 2016    0.17
6     4 2016    2.66
  • Notice the use of values_drop_na = TRUE

  • Notice that the 4th quarter of 2015 is missing now

Missing Values 2

We can also use complete()

stocks %>% 
  complete(year, qtr)
# A tibble: 8 × 3
   year   qtr return
  <dbl> <dbl>  <dbl>
1  2015     1   1.88
2  2015     2   0.59
3  2015     3   0.35
4  2015     4  NA   
5  2016     1  NA   
6  2016     2   0.92
7  2016     3   0.17
8  2016     4   2.66
  • complete() takes a set of columns, and finds all unique combinations

  • It then ensures the original dataset contains all those values, filling in explicit NAs where necessary.

Missing Values 3

  • Sometimes the NA is not random

  • Missing values could indicate that the previous value should be carried forward:

Example:

library("dplyr")
treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)

Missing Values 3

We can deal with this problem by using fill()

Original

library("dplyr")
treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)

Fix

treatment %>% 
  fill(person)
# A tibble: 4 × 3
  person           treatment response
  <chr>                <dbl>    <dbl>
1 Derrick Whitmore         1        7
2 Derrick Whitmore         2       10
3 Derrick Whitmore         3        9
4 Katherine Burke          1        4