L11: Merging Data

Bogdan G. Popescu

John Cabot University

Introduction

We can use different functions to combine different dataframes including: concat, append, merge, and join.

import pandas as pd
import numpy as np

The following graph shows all the different types of joins available:

Inner Join

In this case, we have two dataframes: x and y.

For an inner join, only the shaded area will be returned.

Outer Join

In this case, we have two dataframes: x and y.

This takes all the data from x and y

Left Join

In this case, we have two dataframes: x and y.

This takes everything from the left as a point of reference and matches everything onto the left dataframe

Right Join

In this case, we have two dataframes: x and y.

This takes everything from the right as a point of reference and matches everything onto the right dataframe

Examples

Let us say that we have the following dataframes:

import pandas as pd
df_a = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})
df_a
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
df_b = pd.DataFrame({
    'id': [2, 3, 4],
    'city': ['New York', 'Los Angeles', 'Chicago'],
    'salary': [70000, 80000, 60000]
})
df_b
   id         city  salary
0   2     New York   70000
1   3  Los Angeles   80000
2   4      Chicago   60000

Inner Join Example

df_a
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
df_b
   id         city  salary
0   2     New York   70000
1   3  Los Angeles   80000
2   4      Chicago   60000
# Inner join on 'id'
inner_join = pd.merge(df_a, df_b, on='id', how='inner')
inner_join
   id     name  age         city  salary
0   2      Bob   30     New York   70000
1   3  Charlie   35  Los Angeles   80000

Outer Join Example


df_a
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
df_b
   id         city  salary
0   2     New York   70000
1   3  Los Angeles   80000
2   4      Chicago   60000
# Inner join on 'id'
outer_join = pd.merge(df_a, df_b, on='id', how='outer')
outer_join
   id     name   age         city   salary
0   1    Alice  25.0          NaN      NaN
1   2      Bob  30.0     New York  70000.0
2   3  Charlie  35.0  Los Angeles  80000.0
3   4      NaN   NaN      Chicago  60000.0

Left Join Example


df_a
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
df_b
   id         city  salary
0   2     New York   70000
1   3  Los Angeles   80000
2   4      Chicago   60000
# Inner join on 'id'
left_join = pd.merge(df_a, df_b, on='id', how='left')
left_join
   id     name  age         city   salary
0   1    Alice   25          NaN      NaN
1   2      Bob   30     New York  70000.0
2   3  Charlie   35  Los Angeles  80000.0

Right Join Example


df_a
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
df_b
   id         city  salary
0   2     New York   70000
1   3  Los Angeles   80000
2   4      Chicago   60000
# Inner join on 'id'
right_join = pd.merge(df_a, df_b, on='id', how='right')
right_join
   id     name   age         city  salary
0   2      Bob  30.0     New York   70000
1   3  Charlie  35.0  Los Angeles   80000
2   4      NaN   NaN      Chicago   60000

Other Methods

Other methods of merging two dataframes include join and concat.

To keep things simple, we will not do those.

Exercises

Let us imagine the following datasets:

raw_data_1 = { 'subject_id': ['1', '2', '3', '4', '5'], 
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = { 'subject_id': ['4', '5', '6', '7', '8'], 
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = { 'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'], 
'test_grade': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

Exercises

  1. Create a dataframe for each data
  2. Join the first 2 DFs so that it generates a list with all the users.
  3. Join the DF from the previous section with the data from the grades (raw_data_3) so that each user is assigned their grade.
  4. Join the data from raw_data_1 and raw_data_2 only from the records with the same subject_id

Exercises Solution 1

  1. Create a dataframe for each data
df1 = pd.DataFrame(raw_data_1, columns = ['subject_id', 'first_name', 'last_name'])
df2 = pd.DataFrame(raw_data_2, columns = ['subject_id', 'first_name', 'last_name'])
df3 = pd.DataFrame(raw_data_3, columns = ['subject_id','test_grade'])

Exercises Solution 2

  1. Join the first 2 DFs so that it generates a list with all the users.
# Merge the DataFrames
merged_df = pd.merge(df1, df2, on=['subject_id', 'first_name', 'last_name'], how='outer')
merged_df
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          4      Billy    Bonder
5          5     Ayoung   Atiches
6          5      Brian     Black
7          6       Bran   Balwner
8          7      Bryce     Brice
9          8      Betty    Btisan
df1
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
df2
  subject_id first_name last_name
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan

Exercises Solution 3

  1. Join the DF from the previous section with the data from the grades (raw_data_3) so that each user is assigned their grade.
left_join = pd.merge(merged_df, df3, on='subject_id', how='left')
left_join
  subject_id first_name last_name  test_grade
0          1       Alex  Anderson        51.0
1          2        Amy  Ackerman        15.0
2          3      Allen       Ali        15.0
3          4      Alice      Aoni        61.0
4          4      Billy    Bonder        61.0
5          5     Ayoung   Atiches        16.0
6          5      Brian     Black        16.0
7          6       Bran   Balwner         NaN
8          7      Bryce     Brice        14.0
9          8      Betty    Btisan        15.0

Exercises Solution 4

  1. Join the data from raw_data_1 and raw_data_2 only from the records with the same subject_id
merged_df = pd.merge(df1, df2, on='subject_id', how='inner')
merged_df
  subject_id first_name_x last_name_x first_name_y last_name_y
0          4        Alice        Aoni        Billy      Bonder
1          5       Ayoung     Atiches        Brian       Black