In the real world, people typically have to work with many tables
Multiple tables of data that need to be combined together, are called relational data
We will first load two libraries
There are five databases (tibbles) inside nycflights13
:
airlines
airports
planes
weather
flights
airlines
tibbleCreate new objects for each tibble/data frame and print the first 5 entries
airports
tibble# A tibble: 5 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New…
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/Chi…
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chi…
4 06N Randall Airport 41.4 -74.4 523 -5 A America/New…
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New…
weather
tibble# A tibble: 5 × 15
origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
<chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
# ℹ 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
# time_hour <dttm>
planes
tibble# A tibble: 5 × 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
2 N102UW 1998 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
3 N103US 1999 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
4 N104UW 1999 Fixed wing multi … AIRBUS INDU… A320… 2 182 NA Turbo…
5 N10575 2002 Fixed wing multi … EMBRAER EMB-… 2 55 NA Turbo…
flights
tibble# A tibble: 5 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
This is how the five tibbles could be connected to one another
flights
connects to planes
via a single variable, tailnum
flights
connects to planes
via a single variable, tailnum
flights
connects to airlines
through the carrier
variableflights
connects to airports
through: the origin
and dest
variablesflights
connects to airports
through: the origin
and dest
variablesflights
connects to weather
via origin
(the location), and year
, month
, day
and hour
(the time)The variables used to connect the tibbles are keys which are used to uniquely identify the observations
Sometimes one variable uniquely identifies the observation - tailnum
for planes
Other times, more than one variable is necessary - year
, month
, day
, hour
, and origin
.
It is always good practice to make sure that the keys identified are in fact unique
This can be done this way for planes
:
# A tibble: 0 × 9
# ℹ 9 variables: tailnum <chr>, year <int>, type <chr>, manufacturer <chr>,
# model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
tailnum
uniquely identifies planesx
(left dataframe).y
(right dataframe)x
and y
The most common type of join that you will need in practive is a left join
This allows you to preserve the original observation even when there is not match
It is important to consider what happens when we have duplicate keys
Case 1: One Table has duplicated keys
It is important to consider what happens when we have duplicate keys
Case 2: Both Tables have duplicated keys
When you join duplicated keys, you get all possible combinations, the Cartesian product
flights
connects to planes
via a single variable, tailnum
Step1: Selecting only relevant variables
Step2: Performing the Left Join
# A tibble: 336,776 × 14
day hour origin dest tailnum carrier year type manufacturer model
<int> <dbl> <chr> <chr> <chr> <chr> <int> <chr> <chr> <chr>
1 1 5 EWR IAH N14228 UA 1999 Fixed wing… BOEING 737-…
2 1 5 LGA IAH N24211 UA 1998 Fixed wing… BOEING 737-…
3 1 5 JFK MIA N619AA AA 1990 Fixed wing… BOEING 757-…
4 1 5 JFK BQN N804JB B6 2012 Fixed wing… AIRBUS A320…
5 1 6 LGA ATL N668DN DL 1991 Fixed wing… BOEING 757-…
6 1 5 EWR ORD N39463 UA 2012 Fixed wing… BOEING 737-…
7 1 6 EWR FLL N516JB B6 2000 Fixed wing… AIRBUS INDU… A320…
8 1 6 LGA IAD N829AS EV 1998 Fixed wing… CANADAIR CL-6…
9 1 6 JFK MCO N593JB B6 2004 Fixed wing… AIRBUS A320…
10 1 6 LGA ORD N3ALAA AA NA <NA> <NA> <NA>
# ℹ 336,766 more rows
# ℹ 4 more variables: engines <int>, seats <int>, speed <int>, engine <chr>
Step2: Performing the Left Join
Because we have the same variable - tailnum
in both dataframes, we can also simplify by writing
# A tibble: 336,776 × 14
day hour origin dest tailnum carrier year type manufacturer model
<int> <dbl> <chr> <chr> <chr> <chr> <int> <chr> <chr> <chr>
1 1 5 EWR IAH N14228 UA 1999 Fixed wing… BOEING 737-…
2 1 5 LGA IAH N24211 UA 1998 Fixed wing… BOEING 737-…
3 1 5 JFK MIA N619AA AA 1990 Fixed wing… BOEING 757-…
4 1 5 JFK BQN N804JB B6 2012 Fixed wing… AIRBUS A320…
5 1 6 LGA ATL N668DN DL 1991 Fixed wing… BOEING 757-…
6 1 5 EWR ORD N39463 UA 2012 Fixed wing… BOEING 737-…
7 1 6 EWR FLL N516JB B6 2000 Fixed wing… AIRBUS INDU… A320…
8 1 6 LGA IAD N829AS EV 1998 Fixed wing… CANADAIR CL-6…
9 1 6 JFK MCO N593JB B6 2004 Fixed wing… AIRBUS A320…
10 1 6 LGA ORD N3ALAA AA NA <NA> <NA> <NA>
# ℹ 336,766 more rows
# ℹ 4 more variables: engines <int>, seats <int>, speed <int>, engine <chr>
In the previous example, we looked at join by tailum
But we can also perform join by multiple variables
# A tibble: 336,776 × 29
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 21 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour.x <dttm>, temp <dbl>, dewp <dbl>,
# humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
# precip <dbl>, pressure <dbl>, visib <dbl>, time_hour.y <dttm>
Filtering joins match observations in the same way as mutating joins
They affect observations, not variables
semi_join(x, y)
keeps all observations in x that have a match in yanti_join(x, y)
drops all observations in x that have a match in yOnly the existence of a match is important
It doesn’t matter which observation is matched
It is the inverse of the semi-join
The anti-join keeps the rows that don’t have a match
They are useful for diagnosing join mismatches
When working spatial data, it is often necessary to work with time dates and time series
In other words, spatial data may contain temporal information
Some common used classes to repsent time in R include:
Times:
Date
- To represent data at day-by-day levelPOSIXct
and POSIXlt
to represent data at a second-by-second levelTime Series:
ts
zoo
(with the zoo
package)xts
(with the xts
package)The simplest data structure to represent time is Date
For example, we can turn an object from character
to Date
by using as.Date
function
The character
values are in the standard format known as ISO 8601
(YYYY-MM-DD)
The as.Date
fuction works without additional arguments
When the character
values are in a non-standard format, we need to specify the format definition
The following table lists the most commonly used symbols for specifying date in R.
Symbol | Example | Meaning |
---|---|---|
%d |
“15” | Day |
%m |
“08” | Month, Numeric |
%b |
“Aug” | Month, 3-letter |
%B |
“August” | Month, full |
%y |
14 | Year, 2-digit |
%Y |
2014 | Year, 4-digit |
For example, we need to specify the format for certain date types.
To fix the error, we need to type:
For example, we need to specify the format for certain date types.
To fix the error, we need to type:
The following will not work
This is because the first two are Date
objects while the last two are character
objects
Date
objects act like numeric objects
We can thus perform:
Another example:
If we look at time_diff1
and time_diff2
, they are different classes
We can turn them to numeric classes and to specific units.
We can also create a sequence of dates using seq
.
For example, the following expression creates a sequence of dates.
This is how we can subset our date vector.
The diff
function can be used to create a vector of differences between consecutive elements
For example:
or
Note that the length of diff(x)
is one element less that x
We don’t have the difference for the first (or last) element, depending how you look at it
Create the following dataframe:
library("tibble")
names <- c("Alex", "Joan", "Helen", "Mark", "Lydia")
birthdates <- c("1986-02-21", "1990-07-21", "1995-05-01", "1989-03-10", "1991-12-12")
df <- tibble(names = names, birthdates = birthdates)
df
# A tibble: 5 × 2
names birthdates
<chr> <chr>
1 Alex 1986-02-21
2 Joan 1990-07-21
3 Helen 1995-05-01
4 Mark 1989-03-10
5 Lydia 1991-12-12
Instructions
calculate_age
that accepts one argument: birthdate
as.Date()
# Define the function to calculate age
calculate_age <- function(birthdates) {
# Convert birthdates to Date objects
birthdates <- as.Date(birthdates)
# Get current date
#current_date <- Sys.Date()
current_date <- as.Date("2024-01-31")
# Calculate age
ages <- as.numeric(difftime(current_date, birthdates, units = "days") / 365.25)
return(ages)
}
Instructions
The output should look like below
# A tibble: 5 × 4
names birthdates age sentence
<chr> <chr> <dbl> <chr>
1 Alex 1986-02-21 37.9 "Alex was born on 1986-02-21. Alex is 37 years and 11 …
2 Joan 1990-07-21 33.5 "Joan was born on 1990-07-21. Joan is 33 years and 6 m…
3 Helen 1995-05-01 28.8 "Helen was born on 1995-05-01. Helen is 28 years and 9…
4 Mark 1989-03-10 34.9 "Mark was born on 1989-03-10. Mark is 34 years and 11 …
5 Lydia 1991-12-12 32.1 "Lydia was born on 1991-12-12. Lydia is 32 years and 2…
Instructions
# Define the function to calculate age and create the sentence
calculate_age <- function(names, birthdates) {
# Convert birthdates to Date objects
birthdates <- as.Date(birthdates)
# Specify the desired current date
current_date <- as.Date("2024-01-31") # Make sure to use the correct date format
# Calculate age
ages <- as.numeric(difftime(current_date, birthdates, units = "days") / 365.25)
# Adjust for leap years
months<-round((ages-floor(ages))*12)
# Create sentences
sentences <- paste0(names, " was born on ", birthdates, ". ", names, " is ", floor(ages), " years and ", months," months old.", sep = " ")
return(sentences)
}
The output should look like below
# A tibble: 5 × 4
names birthdates age sentence
<chr> <chr> <dbl> <chr>
1 Alex 1986-02-21 37.9 "Alex was born on 1986-02-21. Alex is 37 years and 11 …
2 Joan 1990-07-21 33.5 "Joan was born on 1990-07-21. Joan is 33 years and 6 m…
3 Helen 1995-05-01 28.8 "Helen was born on 1995-05-01. Helen is 28 years and 9…
4 Mark 1989-03-10 34.9 "Mark was born on 1989-03-10. Mark is 34 years and 11 …
5 Lydia 1991-12-12 32.1 "Lydia was born on 1991-12-12. Lydia is 32 years and 2…
Suppose you have two datasets:
Instructions
merge_sales_info
that takes sales_data and product_info
as input arguments.left_join()
function to merge sales_data
and product_info
based on the common column product_id.
Your function use should look like below:
transaction_id product_id quantity price product_name category
1 1 101 2 10 Product A Category 1
2 2 102 1 20 Product B Category 2
3 3 103 3 15 Product C Category 1
4 4 101 2 10 Product A Category 1
5 5 104 1 25 Product D Category 3
Instructions
merge_sales_info
that takes sales_data and product_info
as input arguments.left_join()
function to merge sales_data
and product_info
based on the common column product_id.
Your function use should look like below:
transaction_id product_id quantity price product_name category
1 1 101 2 10 Product A Category 1
2 2 102 1 20 Product B Category 2
3 3 103 3 15 Product C Category 1
4 4 101 2 10 Product A Category 1
5 5 104 1 25 Product D Category 3
Suppose you have two datasets:
Instructions:
employee_performance_analysis
that takes the two datasets as inputs and performs the following tasks:employee_data
with performance_data
based on the employee_id.
performance_grade
based on the following criteria:rating
is greater than or equal to 4, assign “High”.rating
is between 3 and 4, assign “Medium”.rating
is less than 3, assign “Low”.performance_grade
column.This is what your output should look like:
# Test the function
employee_performance <- employee_performance_analysis(employee_data, performance_data)
print(employee_performance)
employee_id name department rating bonus performance_grade
1 101 Alice HR 4.5 1000 High
2 102 Bob Finance 3.2 800 Medium
3 103 Charlie Marketing 2.9 500 Low
4 104 David IT 4.8 1200 High
5 105 Eve Operations 3.7 900 Medium
Popescu (JCU): Lecture 6