L5: Reading Data and Working Directories

Bogdan G. Popescu

John Cabot University

The Working Directory

The R environment always points to a certain directory on our computer, which is known as the working directory.

We can get the current working directory with getwd().

Type that in a new chunk. What is your current working directory?

getwd()

[1] "/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week3"
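As a small sketch of what can be done with this path, the result of getwd() can be stored in a variable and inspected. The variable name wd is only an illustration and is not part of the original slides.

```r
# Store the current working directory in a variable
wd <- getwd()

# Print the full path and the name of the last folder in it
print(wd)
print(basename(wd))

# List the first few files and folders inside the working directory
head(list.files(wd))
```

The output depends on your own computer, so it will differ from the path shown above.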

The Working Directory on macOS

Create a new folder called “week3”.

Then set the working directory to that folder using setwd().

To find the folder's path on macOS, right-click the week3 folder and choose “Get Info”.

Select what is in the “Where” section and copy it using “Cmd + C”

In this case, the copied path is:

[1] "/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data"

We now simply need to add week3 to this path.

Our final path will be:

[1] "/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week3"

To set this directory, we type in:

setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week3")
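The same steps can be sketched in code: build the path from the parent folder, make sure the folder exists, and then set it. This is only an illustration under stated assumptions: a temporary directory stands in for the path copied from “Get Info”, and the names parent_dir, week_dir, and old_wd are made up for the example.

```r
# Hypothetical parent folder: a temporary directory stands in for the
# "Where" path copied from the "Get Info" window
parent_dir <- tempdir()

# Build the full path by appending the folder name
week_dir <- file.path(parent_dir, "week3")

# Create the folder if it does not exist yet
if (!dir.exists(week_dir)) {
  dir.create(week_dir)
}

# setwd() returns the previous working directory invisibly, so we can keep it
old_wd <- setwd(week_dir)
getwd()

# Go back to where we started
setwd(old_wd)
```

Checking with dir.exists() first avoids the error that setwd() gives when the folder is missing or the path contains a typo.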

The Working Directory on Windows

To do the same on Windows, right-click the week3 folder and choose “Properties”.

In this case, the copied path is:

cat("\\\\Mac\\Dropbox-1\\john_cabot\\teaching\\big_data")

\\Mac\Dropbox-1\john_cabot\teaching\big_data

We first need to change all the backslashes to forward slashes. The path starts with a double backslash because it points to a network share, so it starts with a double forward slash after the change:

[1] "//Mac/Dropbox-1/john_cabot/teaching/big_data"

We now simply need to add week3 to this path.

Our final path will be:

[1] "//Mac/Dropbox-1/john_cabot/teaching/big_data/week3"

To set this directory, we type in:

setwd("//Mac/Dropbox-1/john_cabot/teaching/big_data/week3")

In your case, this will look something like:

setwd("C:/Dropbox/john_cabot/teaching/big_data/week3")
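Changing the slashes by hand is error-prone, so the replacement can also be done in R. The sketch below assumes a hypothetical path on the C: drive, chosen to match the example above. The names windows_path, r_path, and week_path are made up for the illustration.

```r
# Hypothetical Windows path as copied from the "Properties" window.
# Inside an R string, each backslash has to be written twice.
windows_path <- "C:\\Dropbox\\john_cabot\\teaching\\big_data"

# Replace every backslash with a forward slash
r_path <- gsub("\\", "/", windows_path, fixed = TRUE)

# Append the week3 folder
week_path <- paste0(r_path, "/week3")
week_path
```

The resulting string can then be passed to setwd(), provided the folder exists on your computer.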

The Data for this Week

Please download the data for this week and place it in the relevant week folder.

Importing Data into R

The easiest datasets to work with are CSV files.

R is also capable of reading a variety of other dataset formats, including .dta, .sas, .xlsx, .xls, and .txt.

The library used to read these files affects how quickly your computer can read the data.

Let us look at two examples: read_csv (from the readr library) and read.csv (built into R).

read_csv

library("readr")
# Setting the working directory
setwd("/Users/bgpopescu/Dropbox/john_cabot/teaching/big_data/week3/")
# Reading the data
data_1851_obs10000 <- read_csv("./data/data_examples/data_1851_obs10000.csv")

# Recording how long it takes

# Step 1: Recording your system's time
start_time <- Sys.time()
# Step 2: Loading the data
data_1851_obs10000 <- read_csv("./data/data_examples/data_1851_obs10000.csv")
# Step 3: Recording when it finishes
end_time <- Sys.time()
# Step 4: Calculating the difference
time_taken_a <- end_time - start_time
# Step 5: Printing the difference
time_taken_a

Time difference of 0.1327941 secs

read.csv

# Reading the data
# Step 1: Recording your system's time
start_time <- Sys.time()
# Step 2: Loading the data
data_1851_obs10000_b <- read.csv("./data/data_examples/data_1851_obs10000.csv")
# Step 3: Recording when it finishes
end_time <- Sys.time()
# Step 4: Calculating the difference
time_taken_b <- end_time - start_time
# Step 5: Printing the difference
time_taken_b

Time difference of 0.190902 secs

The choice between read.csv and read_csv makes a difference.
Here the difference is: time_taken_b - time_taken_a = 0.0581 seconds.

This becomes important for large datasets.
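A more compact way to time a single call is system.time(), which replaces the start and end bookkeeping used above. The sketch below is an illustration, not the course code: it uses made-up data written to a temporary file, and it times read_csv only when the readr package is installed. All variable names are assumptions.

```r
# Made-up data standing in for the course file: 10,000 rows, 3 columns
set.seed(1)
example_data <- data.frame(
  id = 1:10000,
  value = rnorm(10000),
  group = sample(c("a", "b", "c"), 10000, replace = TRUE)
)

# Write it to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(example_data, csv_path, row.names = FALSE)

# system.time() runs the expression and reports the elapsed seconds
time_base <- system.time(data_base <- read.csv(csv_path))[["elapsed"]]
time_base

# Time readr::read_csv only if the readr package is installed
if (requireNamespace("readr", quietly = TRUE)) {
  time_readr <- system.time(
    data_readr <- suppressMessages(readr::read_csv(csv_path))
  )[["elapsed"]]
  print(time_readr)
}
```

Timings vary between runs and machines, and on a file this small they can be too close to tell apart, so repeating the measurement a few times gives a fairer comparison.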

Alternative File Formats

CSV
Excel (xlsx and xls)
SPSS
Stata

Examples of Reading Different Files

# Reading a CSV file
csv_file <- read.csv("./data/data_examples/data_1851_obs10000.csv")
# Library for reading Stata files
library("haven")
# Reading a Stata file
stata_file <- read_dta("./data/data_examples/data_1851_obs10000.dta")
# Library for reading SPSS files
library("foreign")
# Reading an SPSS file
spss_file <- read.spss("./data/data_examples/data_1851_obs10000.sav", to.data.frame = TRUE)
# Library for reading Excel files
library("readxl")
# Reading an Excel file
excel_file <- read_excel("./data/data_examples/data_1851_obs10000.xlsx")
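When a project mixes several of these formats, the choice of reader can be tied to the file extension. The helper below, read_any, is hypothetical and not part of any library. It is a minimal sketch that reuses the same reading functions as above and calls each package only when a file of that type is actually read.

```r
# Hypothetical helper: picks a reading function based on the file extension
read_any <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv  = read.csv(path),
    dta  = haven::read_dta(path),
    sav  = foreign::read.spss(path, to.data.frame = TRUE),
    xlsx = ,
    xls  = readxl::read_excel(path),
    stop("Unsupported file extension: ", ext)
  )
}

# Small demonstration with a made-up CSV file in a temporary location
demo_path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), demo_path, row.names = FALSE)
demo_data <- read_any(demo_path)
demo_data
```

Note that the readers do not all return the same kind of object: read.csv and read.spss give a data.frame, while read_dta and read_excel give a tibble.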

Writing CSV files

You can easily write CSV files to your hard drive using the readr library.

# Loading the tidyverse readr package
library(readr)
# Writing the data as a CSV file
write_csv(excel_file, "./data/data_examples/data_1851_obs10000_written.csv")
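One way to check that a file was written correctly is to read it back and compare it with the original. The sketch below uses made-up data and a temporary file, so it does not touch the course folder. It falls back to the built-in write.csv when readr is not installed. The names example_data, out_path, and read_back are assumptions for the illustration.

```r
# Made-up data for the round trip
example_data <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Temporary location for the written file
out_path <- tempfile(fileext = ".csv")

# Use readr::write_csv when available, otherwise the built-in write.csv
if (requireNamespace("readr", quietly = TRUE)) {
  readr::write_csv(example_data, out_path)
} else {
  write.csv(example_data, out_path, row.names = FALSE)
}

# Check that the file exists, then read it back and compare dimensions
file.exists(out_path)
read_back <- read.csv(out_path)
identical(dim(read_back), dim(example_data))
```

The row.names = FALSE argument matters for write.csv: without it, the row numbers are written as an extra first column, which write_csv never does.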

Tidy Data

Tidy data is a “standard way of mapping the meaning of a dataset to its structure” (Hadley Wickham).

Within tidy data:
- each variable forms a column
- each observation forms a row
- each cell is a single measurement

Tidy datasets are all alike: they share the same structure.
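To make the three rules concrete, here is a small sketch with made-up numbers. The first table stores one variable (the year) in its column names, so it is not tidy. The second table holds the same values with one column per variable and one row per observation. The column names country, year, and population are assumptions for the illustration.

```r
# Untidy: the years are spread across column names (made-up values)
untidy <- data.frame(
  country = c("A", "B"),
  `1851` = c(10, 20),
  `1861` = c(12, 25),
  check.names = FALSE
)
untidy

# Tidy: one column per variable, one row per country-year observation
tidy <- data.frame(
  country = rep(untidy$country, times = 2),
  year = rep(c(1851, 1861), each = nrow(untidy)),
  population = c(untidy[["1851"]], untidy[["1861"]])
)
tidy
```

The reshaping is done by hand here to keep the example free of extra packages. For larger tables, a dedicated reshaping function is usually more practical.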

Untidy Data