In the real world, people typically have to work with many tables
Multiple tables of data that need to be combined together, are called relational data
We will first load two libraries
There are five databases (tibbles) inside nycflights13:
airlinesairportsplanesweatherflightsairlines tibbleCreate new objects for each tibble/data frame and print the first 5 entries
airports tibble# A tibble: 5 × 8
  faa   name                            lat   lon   alt    tz dst   tzone       
  <chr> <chr>                         <dbl> <dbl> <dbl> <dbl> <chr> <chr>       
1 04G   Lansdowne Airport              41.1 -80.6  1044    -5 A     America/New…
2 06A   Moton Field Municipal Airport  32.5 -85.7   264    -6 A     America/Chi…
3 06C   Schaumburg Regional            42.0 -88.1   801    -6 A     America/Chi…
4 06N   Randall Airport                41.4 -74.4   523    -5 A     America/New…
5 09J   Jekyll Island Airport          31.1 -81.4    11    -5 A     America/New…
weather tibble# A tibble: 5 × 15
  origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
  <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA
2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA
3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA
4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7         NA
5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7         NA
# ℹ 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
#   time_hour <dttm>
planes tibble# A tibble: 5 × 9
  tailnum  year type               manufacturer model engines seats speed engine
  <chr>   <int> <chr>              <chr>        <chr>   <int> <int> <int> <chr> 
1 N10156   2004 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
2 N102UW   1998 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
3 N103US   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
4 N104UW   1999 Fixed wing multi … AIRBUS INDU… A320…       2   182    NA Turbo…
5 N10575   2002 Fixed wing multi … EMBRAER      EMB-…       2    55    NA Turbo…
flights tibble# A tibble: 5 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
This is how the five tibbles could be connected to one another
flights connects to planes via a single variable, tailnumflights connects to planes via a single variable, tailnumflights connects to airlines through the carrier variableflights connects to airports through: the origin and dest variablesflights connects to airports through: the origin and dest variablesflights connects to weather via origin (the location), and year, month, day and hour (the time)The variables used to connect the tibbles are keys which are used to uniquely identify the observations
Sometimes one variable uniquely identifies the observation - tailnum for planes
Other times, more than one variable is necessary - year, month, day, hour, and origin.
It is always good practice to make sure that the keys identified are in fact unique
This can be done this way for planes:
# A tibble: 0 × 9
# ℹ 9 variables: tailnum <chr>, year <int>, type <chr>, manufacturer <chr>,
#   model <chr>, engines <int>, seats <int>, speed <int>, engine <chr>
tailnum uniquely identifies planesx (left dataframe).y (right dataframe)x and yThe most common type of join that you will need in practive is a left join
This allows you to preserve the original observation even when there is not match
It is important to consider what happens when we have duplicate keys
Case 1: One Table has duplicated keys
It is important to consider what happens when we have duplicate keys
Case 2: Both Tables have duplicated keys
When you join duplicated keys, you get all possible combinations, the Cartesian product
flights connects to planes via a single variable, tailnumStep1: Selecting only relevant variables
Step2: Performing the Left Join
# A tibble: 336,776 × 14
     day  hour origin dest  tailnum carrier  year type        manufacturer model
   <int> <dbl> <chr>  <chr> <chr>   <chr>   <int> <chr>       <chr>        <chr>
 1     1     5 EWR    IAH   N14228  UA       1999 Fixed wing… BOEING       737-…
 2     1     5 LGA    IAH   N24211  UA       1998 Fixed wing… BOEING       737-…
 3     1     5 JFK    MIA   N619AA  AA       1990 Fixed wing… BOEING       757-…
 4     1     5 JFK    BQN   N804JB  B6       2012 Fixed wing… AIRBUS       A320…
 5     1     6 LGA    ATL   N668DN  DL       1991 Fixed wing… BOEING       757-…
 6     1     5 EWR    ORD   N39463  UA       2012 Fixed wing… BOEING       737-…
 7     1     6 EWR    FLL   N516JB  B6       2000 Fixed wing… AIRBUS INDU… A320…
 8     1     6 LGA    IAD   N829AS  EV       1998 Fixed wing… CANADAIR     CL-6…
 9     1     6 JFK    MCO   N593JB  B6       2004 Fixed wing… AIRBUS       A320…
10     1     6 LGA    ORD   N3ALAA  AA         NA <NA>        <NA>         <NA> 
# ℹ 336,766 more rows
# ℹ 4 more variables: engines <int>, seats <int>, speed <int>, engine <chr>
Step2: Performing the Left Join
Because we have the same variable - tailnum in both dataframes, we can also simplify by writing
# A tibble: 336,776 × 14
     day  hour origin dest  tailnum carrier  year type        manufacturer model
   <int> <dbl> <chr>  <chr> <chr>   <chr>   <int> <chr>       <chr>        <chr>
 1     1     5 EWR    IAH   N14228  UA       1999 Fixed wing… BOEING       737-…
 2     1     5 LGA    IAH   N24211  UA       1998 Fixed wing… BOEING       737-…
 3     1     5 JFK    MIA   N619AA  AA       1990 Fixed wing… BOEING       757-…
 4     1     5 JFK    BQN   N804JB  B6       2012 Fixed wing… AIRBUS       A320…
 5     1     6 LGA    ATL   N668DN  DL       1991 Fixed wing… BOEING       757-…
 6     1     5 EWR    ORD   N39463  UA       2012 Fixed wing… BOEING       737-…
 7     1     6 EWR    FLL   N516JB  B6       2000 Fixed wing… AIRBUS INDU… A320…
 8     1     6 LGA    IAD   N829AS  EV       1998 Fixed wing… CANADAIR     CL-6…
 9     1     6 JFK    MCO   N593JB  B6       2004 Fixed wing… AIRBUS       A320…
10     1     6 LGA    ORD   N3ALAA  AA         NA <NA>        <NA>         <NA> 
# ℹ 336,766 more rows
# ℹ 4 more variables: engines <int>, seats <int>, speed <int>, engine <chr>
In the previous example, we looked at join by tailum
But we can also perform join by multiple variables
# A tibble: 336,776 × 29
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 21 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour.x <dttm>, temp <dbl>, dewp <dbl>,
#   humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
#   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour.y <dttm>
Filtering joins match observations in the same way as mutating joins
They affect observations, not variables
semi_join(x, y) keeps all observations in x that have a match in yanti_join(x, y) drops all observations in x that have a match in yOnly the existence of a match is important
It doesn’t matter which observation is matched
It is the inverse of the semi-join
The anti-join keeps the rows that don’t have a match
They are useful for diagnosing join mismatches
When working spatial data, it is often necessary to work with time dates and time series
In other words, spatial data may contain temporal information
Some common used classes to repsent time in R include:
Times:
Date - To represent data at day-by-day levelPOSIXct and POSIXlt to represent data at a second-by-second levelTime Series:
tszoo (with the zoo package)xts (with the xts package)The simplest data structure to represent time is Date
For example, we can turn an object from character to Date by using as.Date function
The character values are in the standard format known as ISO 8601 (YYYY-MM-DD)
The as.Date fuction works without additional arguments
When the character values are in a non-standard format, we need to specify the format definition
The following table lists the most commonly used symbols for specifying date in R.
| Symbol | Example | Meaning | 
|---|---|---|
%d | 
“15” | Day | 
%m | 
“08” | Month, Numeric | 
%b | 
“Aug” | Month, 3-letter | 
%B | 
“August” | Month, full | 
%y | 
14 | Year, 2-digit | 
%Y | 
2014 | Year, 4-digit | 
For example, we need to specify the format for certain date types.
To fix the error, we need to type:
For example, we need to specify the format for certain date types.
To fix the error, we need to type:
The following will not work
This is because the first two are Date objects while the last two are character objects
Date objects act like numeric objects
We can thus perform:
Another example:
If we look at time_diff1 and time_diff2, they are different classes
We can turn them to numeric classes and to specific units.
We can also create a sequence of dates using seq.
For example, the following expression creates a sequence of dates.
This is how we can subset our date vector.
The diff function can be used to create a vector of differences between consecutive elements
For example:
or
Note that the length of diff(x) is one element less that x
We don’t have the difference for the first (or last) element, depending how you look at it
Create the following dataframe:
library("tibble")
names <- c("Alex", "Joan", "Helen", "Mark", "Lydia")
birthdates <- c("1986-02-21", "1990-07-21", "1995-05-01", "1989-03-10", "1991-12-12")
df <- tibble(names = names, birthdates = birthdates)
df# A tibble: 5 × 2
  names birthdates
  <chr> <chr>     
1 Alex  1986-02-21
2 Joan  1990-07-21
3 Helen 1995-05-01
4 Mark  1989-03-10
5 Lydia 1991-12-12
Instructions
calculate_age that accepts one argument: birthdateas.Date()# Define the function to calculate age
calculate_age <- function(birthdates) {
    # Convert birthdates to Date objects
    birthdates <- as.Date(birthdates)
    # Get current date
    #current_date <- Sys.Date()
    current_date <- as.Date("2024-01-31")
    # Calculate age
    ages <- as.numeric(difftime(current_date, birthdates, units = "days") / 365.25)
    return(ages)
}Instructions
The output should look like below
# A tibble: 5 × 4
  names birthdates   age sentence                                               
  <chr> <chr>      <dbl> <chr>                                                  
1 Alex  1986-02-21  37.9 "Alex was born on 1986-02-21. Alex is 37 years and 11 …
2 Joan  1990-07-21  33.5 "Joan was born on 1990-07-21. Joan is 33 years and 6 m…
3 Helen 1995-05-01  28.8 "Helen was born on 1995-05-01. Helen is 28 years and 9…
4 Mark  1989-03-10  34.9 "Mark was born on 1989-03-10. Mark is 34 years and 11 …
5 Lydia 1991-12-12  32.1 "Lydia was born on 1991-12-12. Lydia is 32 years and 2…
Instructions
# Define the function to calculate age and create the sentence
calculate_age <- function(names, birthdates) {
    # Convert birthdates to Date objects
    birthdates <- as.Date(birthdates)
    # Specify the desired current date
    current_date <- as.Date("2024-01-31")  # Make sure to use the correct date format
    # Calculate age
    ages <- as.numeric(difftime(current_date, birthdates, units = "days") / 365.25)
    # Adjust for leap years
    months<-round((ages-floor(ages))*12)
    # Create sentences
    sentences <- paste0(names, " was born on ", birthdates, ". ", names, " is ", floor(ages), " years and ", months," months old.", sep = " ")
    return(sentences)
}The output should look like below
# A tibble: 5 × 4
  names birthdates   age sentence                                               
  <chr> <chr>      <dbl> <chr>                                                  
1 Alex  1986-02-21  37.9 "Alex was born on 1986-02-21. Alex is 37 years and 11 …
2 Joan  1990-07-21  33.5 "Joan was born on 1990-07-21. Joan is 33 years and 6 m…
3 Helen 1995-05-01  28.8 "Helen was born on 1995-05-01. Helen is 28 years and 9…
4 Mark  1989-03-10  34.9 "Mark was born on 1989-03-10. Mark is 34 years and 11 …
5 Lydia 1991-12-12  32.1 "Lydia was born on 1991-12-12. Lydia is 32 years and 2…
Suppose you have two datasets:
Instructions
merge_sales_info that takes sales_data and product_info as input arguments.left_join() function to merge sales_data and product_info based on the common column product_id.Your function use should look like below:
  transaction_id product_id quantity price product_name   category
1              1        101        2    10    Product A Category 1
2              2        102        1    20    Product B Category 2
3              3        103        3    15    Product C Category 1
4              4        101        2    10    Product A Category 1
5              5        104        1    25    Product D Category 3
Instructions
merge_sales_info that takes sales_data and product_info as input arguments.left_join() function to merge sales_data and product_info based on the common column product_id.Your function use should look like below:
  transaction_id product_id quantity price product_name   category
1              1        101        2    10    Product A Category 1
2              2        102        1    20    Product B Category 2
3              3        103        3    15    Product C Category 1
4              4        101        2    10    Product A Category 1
5              5        104        1    25    Product D Category 3
Suppose you have two datasets:
Instructions:
employee_performance_analysis that takes the two datasets as inputs and performs the following tasks:employee_data with performance_data based on the employee_id.performance_grade based on the following criteria:rating is greater than or equal to 4, assign “High”.rating is between 3 and 4, assign “Medium”.rating is less than 3, assign “Low”.performance_grade column.This is what your output should look like:
# Test the function
employee_performance <- employee_performance_analysis(employee_data, performance_data)
print(employee_performance)  employee_id    name department rating bonus performance_grade
1         101   Alice         HR    4.5  1000              High
2         102     Bob    Finance    3.2   800            Medium
3         103 Charlie  Marketing    2.9   500               Low
4         104   David         IT    4.8  1200              High
5         105     Eve Operations    3.7   900            Medium
Popescu (JCU): Lecture 6