L6: Relational Data, Dates and Joins

Bogdan G. Popescu

John Cabot University

Relational Data

  • In the real world, people typically have to work with many tables

  • Multiple tables of data that need to be combined together, are called relational data

Examples: NYC Flights

We will first load two libraries

library(tidyverse)
library(nycflights13)

There are five databases (tibbles) inside nycflights13:

  • airlines
  • airports
  • planes
  • weather
  • flights

The airlines tibble

Create new objects for each tibble/data frame and print the first 5 entries

airlines<-airlines
head(airlines, n=5)
# A tibble: 5 × 2
  carrier name                  
  <chr>   <chr>                 
1 9E      Endeavor Air Inc.     
2 AA      American Airlines Inc.
3 AS      Alaska Airlines Inc.  
4 B6      JetBlue Airways       
5 DL      Delta Air Lines Inc.  

The airports tibble

airports<-airports
head(airports, n=5)
# A tibble: 5 × 8
  faa   name                            lat   lon   alt    tz dst   tzone       
  <chr> <chr>                         <dbl> <dbl> <dbl> <dbl> <chr> <chr>       
1 04G   Lansdowne Airport              41.1 -80.6  1044    -5 A     America/New…
2 06A   Moton Field Municipal Airport  32.5 -85.7   264    -6 A     America/Chi…
3 06C   Schaumburg Regional            42.0 -88.1   801    -6 A     America/Chi…
4 06N   Randall Airport                41.4 -74.4   523    -5 A     America/New…
5 09J   Jekyll Island Airport          31.1 -81.4    11    -5 A     America/New…

The weather tibble

weather<-weather
head(weather, n=5)
# A tibble: 5 × 15
  origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
  <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA
2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA
3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA
4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7         NA
5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7         NA
# ℹ 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
#   time_hour <dttm>

The planes tibble

planes<-planes
head(planes, n=5)
# A tibble: 5 × 9
  tailnum  year type               manufacturer model engines seats speed engine
  <chr>   <int> <chr>              <chr>        <chr>   <int> <int> <int> <chr> 
1 N10156   2004 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
2 N102UW   1998 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
3 N103US   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
4 N104UW   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
5 N10575   2002 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…

The flights tibble

flights<-flights
head(flights, n=5)
# A tibble: 5 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

The tibbles together

This is how the five tibbles could be connected to one another

The tibbles together

The tibbles together

The tibbles together

The tibbles together

The tibbles together

The tibbles together

The tibbles together

  • flights connects to planes via a single variable, tailnum

The tibbles together

  • flights connects to planes via a single variable, tailnum

The tibbles together

  • flights connects to airlines through the carrier variable

The tibbles together

  • flights connects to airports through: the origin and dest variables

The tibbles together

  • flights connects to airports through: the origin and dest variables

The tibbles together

  • flights connects to weather via origin (the location), and year, month, day and hour (the time)

Keys

  • The variables used to connect the tibbles are keys which are used to uniquely identify the observations

  • Sometimes one variable uniquely identifies the observation - tailnum for planes

  • Other times, more than one variable is necessary - year, month, day, hour, and origin.

Keys

  • It is always good practice to make sure that the keys identified are in fact unique

  • This can be done this way for planes:

planes2<-subset(planes, duplicated(tailnum))
planes2
# A tibble: 0 × 9
# ℹ 9 variables: tailnum <chr>, year <int>, type <chr>, manufacturer <chr>,
#   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
  • This means that tailnum uniquely identifies planes

Types of Joins

  • A left join keeps all the observations in x (left dataframe).
  • A right join keeps all the observations in y (right dataframe)
  • A full join keeps all the observations in both x and y

Left Join

Right Join

Full Join

Advice

  • The most common type of join that you will need in practive is a left join

  • This allows you to preserve the original observation even when there is not match

Caveat: Duplicate Keys

  • It is important to consider what happens when we have duplicate keys

  • Case 1: One Table has duplicated keys

Caveat: Duplicate Keys

  • It is important to consider what happens when we have duplicate keys

  • Case 2: Both Tables have duplicated keys

  • When you join duplicated keys, you get all possible combinations, the Cartesian product

Example of a Left Join

Left Join

  • flights connects to planes via a single variable, tailnum

Left Join

Step1: Selecting only relevant variables

library(tidyverse)
library(nycflights13)
flights2 <- subset(flights, select = c(day, hour, origin, dest, tailnum, carrier))

Left Join

Step2: Performing the Left Join

fights_planes<-left_join(flights2, planes, by = c("tailnum" = "tailnum"))
fights_planes
# A tibble: 336,776 × 14
     day  hour origin dest  tailnum carrier  year type        manufacturer model
   <int> <dbl> <chr>  <chr> <chr>   <chr>   <int> <chr>       <chr>        <chr>
 1     1     5 EWR    IAH   N14228  UA       1999 Fixed wing… BOEING       737-…
 2     1     5 LGA    IAH   N24211  UA       1998 Fixed wing… BOEING       737-…
 3     1     5 JFK    MIA   N619AA  AA       1990 Fixed wing… BOEING       757-…
 4     1     5 JFK    BQN   N804JB  B6       2012 Fixed wing… AIRBUS       A320…
 5     1     6 LGA    ATL   N668DN  DL       1991 Fixed wing… BOEING       757-…
 6     1     5 EWR    ORD   N39463  UA       2012 Fixed wing… BOEING       737-…
 7     1     6 EWR    FLL   N516JB  B6       2000 Fixed wing… AIRBUS INDU… A320…
 8     1     6 LGA    IAD   N829AS  EV       1998 Fixed wing… CANADAIR     CL-6…
 9     1     6 JFK    MCO   N593JB  B6       2004 Fixed wing… AIRBUS       A320…
10     1     6 LGA    ORD   N3ALAA  AA         NA <NA>        <NA>         <NA> 
# ℹ 336,766 more rows
# ℹ 4 more variables: engines <int>, seats <int>, speed <int>, engine <chr>

Left Join

Step2: Performing the Left Join

Because we have the same variable - tailnum in both dataframes, we can also simplify by writing

fights_planes<-left_join(flights2, planes, by = c("tailnum"))
fights_planes
# A tibble: 336,776 × 14
     day  hour origin dest  tailnum carrier  year type        manufacturer model
   <int> <dbl> <chr>  <chr> <chr>   <chr>   <int> <chr>       <chr>        <chr>
 1     1     5 EWR    IAH   N14228  UA       1999 Fixed wing… BOEING       737-…
 2     1     5 LGA    IAH   N24211  UA       1998 Fixed wing… BOEING       737-…
 3     1     5 JFK    MIA   N619AA  AA       1990 Fixed wing… BOEING       757-…
 4     1     5 JFK    BQN   N804JB  B6       2012 Fixed wing… AIRBUS       A320…
 5     1     6 LGA    ATL   N668DN  DL       1991 Fixed wing… BOEING       757-…
 6     1     5 EWR    ORD   N39463  UA       2012 Fixed wing… BOEING       737-…
 7     1     6 EWR    FLL   N516JB  B6       2000 Fixed wing… AIRBUS INDU… A320…
 8     1     6 LGA    IAD   N829AS  EV       1998 Fixed wing… CANADAIR     CL-6…
 9     1     6 JFK    MCO   N593JB  B6       2004 Fixed wing… AIRBUS       A320…
10     1     6 LGA    ORD   N3ALAA  AA         NA <NA>        <NA>         <NA> 
# ℹ 336,766 more rows
# ℹ 4 more variables: engines <int>, seats <int>, speed <int>, engine <chr>

Left Join by Multiple Keys

  • In the previous example, we looked at join by tailum

  • But we can also perform join by multiple variables

flights3 <-left_join(flights, weather, by = c("year", "month", "day", "hour", "origin"))
flights3
# A tibble: 336,776 × 29
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 21 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour.x <dttm>, temp <dbl>, dewp <dbl>,
#   humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
#   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour.y <dttm>

Filtering Joins

  • Filtering joins match observations in the same way as mutating joins

  • They affect observations, not variables

    • semi_join(x, y) keeps all observations in x that have a match in y
    • anti_join(x, y) drops all observations in x that have a match in y

The Semi-Join

  • Only the existence of a match is important

  • It doesn’t matter which observation is matched

The Anti-Join

  • It is the inverse of the semi-join

  • The anti-join keeps the rows that don’t have a match

  • They are useful for diagnosing join mismatches

Dates and Times

When working spatial data, it is often necessary to work with time dates and time series

In other words, spatial data may contain temporal information

Dates and Times

Some common used classes to repsent time in R include:

Times:

  • Date - To represent data at day-by-day level
  • POSIXct and POSIXlt to represent data at a second-by-second level

Time Series:

  • ts
  • zoo (with the zoo package)
  • xts (with the xts package)

Working with Date objects

The simplest data structure to represent time is Date

For example, we can turn an object from character to Date by using as.Date function

x = "2014-10-20"
x
[1] "2014-10-20"
class(x)
[1] "character"
x2 = as.Date(x)
x2
[1] "2014-10-20"
class(x2)
[1] "Date"

Working with Date objects

The character values are in the standard format known as ISO 8601 (YYYY-MM-DD)

The as.Date fuction works without additional arguments

When the character values are in a non-standard format, we need to specify the format definition

Working with Date objects

The following table lists the most commonly used symbols for specifying date in R.

Symbol Example Meaning
%d “15” Day
%m “08” Month, Numeric
%b “Aug” Month, 3-letter
%B “August” Month, full
%y 14 Year, 2-digit
%Y 2014 Year, 4-digit

Working with Date objects: Examples

For example, we need to specify the format for certain date types.

as.Date("07/Aug/12")
## Error in charToDate(x) : 
##  character string is not in a standard unambiguous format

To fix the error, we need to type:

as.Date("07/Aug/12", format = "%d/%b/%y")

Working with Date objects: Examples

For example, we need to specify the format for certain date types.

as.Date("2012-August-07")
## Error in charToDate(x) : 
##  character string is not in a standard unambiguous format

To fix the error, we need to type:

as.Date("2012-August-07", format = "%Y-%B-%d")
[1] "2012-08-07"

Obtaining “Parts” of a Date

format(as.Date("1955-11-30"), format = "%m/%Y")
[1] "11/1955"
format(as.Date("1955-11-30"), format = "%d")
[1] "30"

The following will not work

format("1955-11-30", format = "%m/%Y")
[1] "1955-11-30"
format("1955-11-30", format = "%d")
[1] "1955-11-30"

This is because the first two are Date objects while the last two are character objects

x1<-"1955-11-30"
class(x1)
[1] "character"
x2<-as.Date("1955-11-30")
class(x2)
[1] "Date"

Arithmetic operations with dates

Date objects act like numeric objects

We can thus perform:

  • conditional operations
  • addition and subtraction of/between dates
  • creating date sequences

Arithmetic operations with dates: subtraction

x2<-as.Date("1955-11-30")
x3<-as.Date("1955-11-01")
time_diff1<-x2-x3
time_diff1
Time difference of 29 days

Another example:

x2<-as.Date("1955-11-30")
x4<-as.Date("2023-12-20")
time_diff2<-x4-x2
time_diff2
Time difference of 24857 days

Arithmetic operations with dates: subtraction

If we look at time_diff1 and time_diff2, they are different classes

class(time_diff1)
[1] "difftime"
class(time_diff2)
[1] "difftime"

We can turn them to numeric classes and to specific units.

as.numeric(time_diff1, unit = "days")
[1] 29
as.numeric(time_diff2, unit = "weeks")
[1] 3551

Arithmetic operations with dates: sequences

We can also create a sequence of dates using seq.

For example, the following expression creates a sequence of dates.

x<-seq(from = as.Date("2018-10-14"), to = as.Date("2019-01-11"), by = 7)
x
 [1] "2018-10-14" "2018-10-21" "2018-10-28" "2018-11-04" "2018-11-11"
 [6] "2018-11-18" "2018-11-25" "2018-12-02" "2018-12-09" "2018-12-16"
[11] "2018-12-23" "2018-12-30" "2019-01-06"

Time Series: Subsestting

This is how we can subset our date vector.

x<-seq(from = as.Date("2018-10-14"), to = as.Date("2019-01-11"), by = 7)
x[x<"2019-01-06"]
 [1] "2018-10-14" "2018-10-21" "2018-10-28" "2018-11-04" "2018-11-11"
 [6] "2018-11-18" "2018-11-25" "2018-12-02" "2018-12-09" "2018-12-16"
[11] "2018-12-23" "2018-12-30"

Time Series: Consecutive Differences

The diff function can be used to create a vector of differences between consecutive elements

For example:

x = c(15, 8, 23, 24)
diff(x)
[1] -7 15  1

or

x<-seq(from = as.Date("2018-10-14"), to = as.Date("2019-01-11"), by = 7)
diff(x)
Time differences in days
 [1] 7 7 7 7 7 7 7 7 7 7 7 7

Time Series: Consecutive Differences

Note that the length of diff(x) is one element less that x

We don’t have the difference for the first (or last) element, depending how you look at it

length(x)
[1] 13
length(diff(x))
[1] 12

Exercise using Dates 1

Create the following dataframe:

library("tibble")
names <- c("Alex", "Joan", "Helen", "Mark", "Lydia")
birthdates <- c("1986-02-21", "1990-07-21", "1995-05-01", "1989-03-10", "1991-12-12")
df <- tibble(names = names, birthdates = birthdates)
df
# A tibble: 5 × 2
  names birthdates
  <chr> <chr>     
1 Alex  1986-02-21
2 Joan  1990-07-21
3 Helen 1995-05-01
4 Mark  1989-03-10
5 Lydia 1991-12-12

Exercise using Dates 1

Instructions

  1. Define a function called calculate_age that accepts one argument: birthdate
  2. Convert the birthdate string into a Date object using as.Date()
  3. Calculate people’s age
# Add a new column 'age' to the dataframe using the function
df$age <- calculate_age(df$birthdates)
# Print the updated dataframe
print(df)
# A tibble: 5 × 3
  names birthdates   age
  <chr> <chr>      <dbl>
1 Alex  1986-02-21  37.9
2 Joan  1990-07-21  33.5
3 Helen 1995-05-01  28.8
4 Mark  1989-03-10  34.9
5 Lydia 1991-12-12  32.1

Exercise using Dates 1 ANSWER

# Define the function to calculate age
calculate_age <- function(birthdates) {
    # Convert birthdates to Date objects
    birthdates <- as.Date(birthdates)
    # Get current date
    #current_date <- Sys.Date()
    current_date <- as.Date("2024-01-31")
    # Calculate age
    ages <- as.numeric(difftime(current_date, birthdates, units = "days") / 365.25)
    return(ages)
}

Exercise using Dates 1 ANSWER

# Add a new column 'age' to the dataframe using the function
df$age <- calculate_age(df$birthdates)
# Print the updated dataframe
print(df)
# A tibble: 5 × 3
  names birthdates   age
  <chr> <chr>      <dbl>
1 Alex  1986-02-21  37.9
2 Joan  1990-07-21  33.5
3 Helen 1995-05-01  28.8
4 Mark  1989-03-10  34.9
5 Lydia 1991-12-12  32.1

Exercise using Dates 2

Instructions

  1. Now, add a string column in which you have a sentence. For example: “Alex was born 21-02-1986. Alex is 37 years and 11 months old.”

The output should look like below

# Add a new column 'sentence' to the dataframe using the function
df$sentence <- calculate_age(df$names, df$birthdates)
print(df)
# A tibble: 5 × 4
  names birthdates   age sentence                                               
  <chr> <chr>      <dbl> <chr>                                                  
1 Alex  1986-02-21  37.9 "Alex was born on 1986-02-21. Alex is 37 years and 11 …
2 Joan  1990-07-21  33.5 "Joan was born on 1990-07-21. Joan is 33 years and 6 m…
3 Helen 1995-05-01  28.8 "Helen was born on 1995-05-01. Helen is 28 years and 9…
4 Mark  1989-03-10  34.9 "Mark was born on 1989-03-10. Mark is 34 years and 11 …
5 Lydia 1991-12-12  32.1 "Lydia was born on 1991-12-12. Lydia is 32 years and 2…

Exercise using Dates 2 ANSWER

Instructions

  1. Now, add a string column in which you have a sentence. For example: “Alex was born 21-02-1986. Alex is 37 years and 11 months old.”
# Define the function to calculate age and create the sentence
calculate_age <- function(names, birthdates) {
    # Convert birthdates to Date objects
    birthdates <- as.Date(birthdates)
    # Specify the desired current date
    current_date <- as.Date("2024-01-31")  # Make sure to use the correct date format
    # Calculate age
    ages <- as.numeric(difftime(current_date, birthdates, units = "days") / 365.25)
    # Adjust for leap years
    months<-round((ages-floor(ages))*12)
    # Create sentences
    sentences <- paste0(names, " was born on ", birthdates, ". ", names, " is ", floor(ages), " years and ", months," months old.", sep = " ")
    return(sentences)
}

Exercise using Dates 2 ANSWER

The output should look like below

# Add a new column 'sentence' to the dataframe using the function
df$sentence <- calculate_age(df$names, df$birthdates)
print(df)
# A tibble: 5 × 4
  names birthdates   age sentence                                               
  <chr> <chr>      <dbl> <chr>                                                  
1 Alex  1986-02-21  37.9 "Alex was born on 1986-02-21. Alex is 37 years and 11 …
2 Joan  1990-07-21  33.5 "Joan was born on 1990-07-21. Joan is 33 years and 6 m…
3 Helen 1995-05-01  28.8 "Helen was born on 1995-05-01. Helen is 28 years and 9…
4 Mark  1989-03-10  34.9 "Mark was born on 1989-03-10. Mark is 34 years and 11 …
5 Lydia 1991-12-12  32.1 "Lydia was born on 1991-12-12. Lydia is 32 years and 2…

Exercise using Left Join 1

Suppose you have two datasets:

# Sample data for sales_data
sales_data <- data.frame(
  transaction_id = c(1, 2, 3, 4, 5),
  product_id = c(101, 102, 103, 101, 104),
  quantity = c(2, 1, 3, 2, 1),
  price = c(10, 20, 15, 10, 25)
)
# Sample data for product_info
product_info <- data.frame(
  product_id = c(101, 102, 103, 104, 105),
  product_name = c("Product A", "Product B", "Product C", "Product D", "Product E"),
  category = c("Category 1", "Category 2", "Category 1", "Category 3", "Category 2")
)

Exercise using Left Join 1

Instructions

  1. Define a function named merge_sales_info that takes sales_data and product_info as input arguments.
  2. Inside the function, use dplyr’s left_join() function to merge sales_data and product_info based on the common column product_id.
  3. Return the merged dataset.

Your function use should look like below:

merged_sales <- merge_sales_info(sales_data, product_info)
print(merged_sales)
  transaction_id product_id quantity price product_name   category
1              1        101        2    10    Product A Category 1
2              2        102        1    20    Product B Category 2
3              3        103        3    15    Product C Category 1
4              4        101        2    10    Product A Category 1
5              5        104        1    25    Product D Category 3

Exercise using Left Join 1 ANSWER

Instructions

  1. Define a function named merge_sales_info that takes sales_data and product_info as input arguments.
  2. Inside the function, use dplyr’s left_join() function to merge sales_data and product_info based on the common column product_id.
  3. Return the merged dataset.
merge_sales_info <- function(sales_data, product_info) {
  merged_data <- left_join(sales_data, product_info, by = "product_id")
  return(merged_data)}

Exercise using Left Join 1 ANSWER

Your function use should look like below:

merged_sales <- merge_sales_info(sales_data, product_info)
print(merged_sales)
  transaction_id product_id quantity price product_name   category
1              1        101        2    10    Product A Category 1
2              2        102        1    20    Product B Category 2
3              3        103        3    15    Product C Category 1
4              4        101        2    10    Product A Category 1
5              5        104        1    25    Product D Category 3

Exercise using Left Join 2

Suppose you have two datasets:

# Sample data for employee_data
employee_data <- data.frame(
  employee_id = c(101, 102, 103, 104, 105),
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  department = c("HR", "Finance", "Marketing", "IT", "Operations")
)
# Sample data for performance_data
performance_data <- data.frame(
  employee_id = c(101, 102, 103, 104, 105),
  rating = c(4.5, 3.2, 2.9, 4.8, 3.7),
  bonus = c(1000, 800, 500, 1200, 900)
)

Exercise using Left Join 2

Instructions:

  1. Create a function employee_performance_analysis that takes the two datasets as inputs and performs the following tasks:
  1. Merge employee_data with performance_data based on the employee_id.
  2. Calculate a new column performance_grade based on the following criteria:
  • If rating is greater than or equal to 4, assign “High”.
  • If rating is between 3 and 4, assign “Medium”.
  • If rating is less than 3, assign “Low”.
  1. Return the merged dataset with the added performance_grade column.

Exercise using Left Join 2

This is what your output should look like:

# Test the function
employee_performance <- employee_performance_analysis(employee_data, performance_data)
print(employee_performance)
  employee_id    name department rating bonus performance_grade
1         101   Alice         HR    4.5  1000              High
2         102     Bob    Finance    3.2   800            Medium
3         103 Charlie  Marketing    2.9   500               Low
4         104   David         IT    4.8  1200              High
5         105     Eve Operations    3.7   900            Medium