L10: Merging Data

Bogdan G. Popescu

John Cabot University

Aggregating by group

In some instances, we might be interested in performing operations by group.

For example, we might be interested in counting the number of observations by group.

Let’s say that we want to know how many beers there are per beer style.

The following will create a group object.

Aggregating by group

Let us have a look at a mock dataset.

Python
import pandas as pd
beer = pd.read_csv('beer_reviews.csv')
beer.head()
   brewery_id             brewery_name  ...               beer_name  beer_beerid
0       10325          Vecchio Birraio  ...            Sausa Weizen        47986
1       10325          Vecchio Birraio  ...                Red Moon        48213
2       10325          Vecchio Birraio  ...  Black Horse Black Beer        48215
3       10325          Vecchio Birraio  ...              Sausa Pils        47969
4        1075  Caldera Brewing Company  ...           Cauldron DIPA        64883

[5 rows x 9 columns]

Aggregating by group

Python
grouped_by_style = beer.groupby(by='beer_style')
grouped_by_style
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1474f1ac0>

If we actually want to see the number of beers by style, we would do the following

Python
grouped_by_style.size()
beer_style
Altbier                        26
American Adjunct Lager         28
American Amber / Red Ale      147
American Amber / Red Lager      7
American Barleywine             9
                             ... 
Tripel                          9
Vienna Lager                   17
Weizenbock                      1
Winter Warmer                 129
Witbier                        26
Length: 80, dtype: int64

Aggregating by group

We could also rearrange the number of beers by style in the following way:

Python
#Creating a dataframe
df = pd.DataFrame(grouped_by_style.size())
#Renaming the column
df = df.rename(columns={0: 'count'})
#Sorting
df_sorted = df.sort_values(by='count', ascending=False)
df_sorted
                                count
beer_style                           
Russian Imperial Stout           1561
American IPA                      791
American Double / Imperial IPA    493
Light Lager                       466
American Pale Ale (APA)           304
...                               ...
Japanese Rice Lager                 1
Flanders Red Ale                    1
California Common / Steam Beer      1
Bière de Garde                      1
Weizenbock                          1

[80 rows x 1 columns]

Aggregating by group

If we are interested in the average review score for every beer style, we would do:

Python
a = beer.groupby('beer_style')['review_overall'].mean().sort_values(ascending=False)
print(a)
beer_style
California Common / Steam Beer    5.000000
Flanders Red Ale                  4.500000
English Barleywine                4.309091
Chile Beer                        4.166667
Rauchbier                         4.159091
                                    ...   
Euro Pale Lager                   2.926357
English India Pale Ale (IPA)      2.750000
Light Lager                       2.719957
Low Alcohol Beer                  2.500000
Cream Ale                         1.500000
Name: review_overall, Length: 80, dtype: float64

Aggregating by group

We could also calculate the average review for more than one category

Python
a = beer.groupby(['beer_style', 'brewery_name'])['review_overall'].mean().sort_values(ascending=False)
print(a)
beer_style                      brewery_name                   
American Brown Ale              Moon River Brewing Company         5.0
Kölsch                          Broad Ripple Brew Pub              5.0
Schwarzbier                     Broad Ripple Brew Pub              5.0
Oatmeal Stout                   Third Base Sports Bar & Brewery    5.0
California Common / Steam Beer  Broad Ripple Brew Pub              5.0
                                                                  ... 
American Amber / Red Lager      Formosa Brewing Co.                1.5
American Brown Ale              Third Base Sports Bar & Brewery    1.5
Hefeweizen                      Mariposa Brewing Company           1.5
                                Vecchio Birraio                    1.5
Czech Pilsener                  Ohio Brewing Company               1.5
Name: review_overall, Length: 375, dtype: float64

Aggregating by group

Another option to work with grouped data is to use the agg function.

This allows us to apply multiple functions to one group.

Python
a = beer.groupby(['beer_style', 'brewery_name'])['review_overall'].agg(["mean", "count"])
print(a)
                                                              mean  count
beer_style    brewery_name                                               
Altbier       Bakushukan Clair (Kure Beer)                4.000000      1
              Bluegrass Brewing Co.                       3.766667     15
              Bluegrass Brewing Co. - East St. Matthew's  4.000000      1
              Broad Ripple Brew Pub                       4.000000      1
              Clockwork Beer Co, The                      4.000000      4
...                                                            ...    ...
Winter Warmer Hoppin' Frog Brewery                        3.444000    125
              NoDa Brewing Company                        4.000000      2
Witbier       Broad Ripple Brew Pub                       2.833333      6
              Frog at Bercy Village, The                  4.000000      1
              Moon River Brewing Company                  3.842105     19

[375 rows x 2 columns]

Exercises

  1. What are the top 5 beer styles?
  2. What are the top 5 worst breweries?
  3. Which brewery has the greatest variability in its rating?

Exercises

  1. What are the top 5 beer styles?
Python
#Calculate the overall review by beer style
a = beer.groupby(['beer_style'])['review_overall'].agg(["mean"])
#Creating a dataframe
df = pd.DataFrame(a)
#Renaming the column
df = df.rename(columns={"mean": 'avg_review'})
#Sorting
df_sorted = df.sort_values(by='avg_review', ascending=False)
df_sorted.head(5)
                                avg_review
beer_style                                
California Common / Steam Beer    5.000000
Flanders Red Ale                  4.500000
English Barleywine                4.309091
Chile Beer                        4.166667
Rauchbier                         4.159091

Exercises

  1. What are the top 5 worst breweries?
Python
#Calculate the overall review by beer style
b = beer.groupby(['brewery_name'])['review_overall'].agg(["mean"])
#Creating a dataframe
df = pd.DataFrame(b)
#Renaming the column
df = df.rename(columns={"mean": 'avg_review'})
#Sorting
df_sorted = df.sort_values(by='avg_review', ascending=True)
df_sorted.head(5)
                                     avg_review
brewery_name                                   
Eastwood The Brewer (John Eastwood)    1.500000
Trafalgar Ales & Meads                 2.071429
Bootleggers Restaurant & Brewery       2.250000
Piniavos Alutis                        2.500000
Vecchio Birraio                        2.625000

Exercises

  1. Which brewery has the greatest variability in its rating?
Python
#Calculate the overall review by beer style
c = beer.groupby(['brewery_name'])['review_overall'].std()
#Identify the highest variability
max_variability_brewery = c.sort_values(ascending=False).head(1)
max_variability_brewery
brewery_name
Otro Mundo Brewing Company    1.06066
Name: review_overall, dtype: float64

Merging Data Introduction

We can use different functions to combine different dataframes including: concat, append, merge, and join.

Python
import pandas as pd
import numpy as np

The following graph shows all the different types of joins available:

Inner Join

In this case, we have two dataframes: x and y.

For an inner join, only the shaded area will be returned.

Outer Join

In this case, we have two dataframes: x and y.

This takes all the data from x and y

Left Join

In this case, we have two dataframes: x and y.

This takes everything from the left as a point of reference and matches everything onto the left dataframe

Right Join

In this case, we have two dataframes: x and y.

This takes everything from the right as a point of reference and matches everything onto the right dataframe

Examples

Let us say that we have the following dataframes:

Python
import pandas as pd
df_a = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35]
})
df_a
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
Python
df_b = pd.DataFrame({
    'id': [2, 3, 4],
    'city': ['New York', 'Los Angeles', 'Chicago'],
    'salary': [70000, 80000, 60000]
})
df_b
   id         city  salary
0   2     New York   70000
1   3  Los Angeles   80000
2   4      Chicago   60000

Inner Join Example


Python
df_a
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
Python
df_b
   id         city  salary
0   2     New York   70000
1   3  Los Angeles   80000
2   4      Chicago   60000
Python
# Inner join on 'id'
inner_join = pd.merge(df_a, df_b, on='id', how='inner')
inner_join
   id     name  age         city  salary
0   2      Bob   30     New York   70000
1   3  Charlie   35  Los Angeles   80000

Outer Join Example


Python
df_a
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
Python
df_b
   id         city  salary
0   2     New York   70000
1   3  Los Angeles   80000
2   4      Chicago   60000
Python
# Inner join on 'id'
outer_join = pd.merge(df_a, df_b, on='id', how='outer')
outer_join
   id     name   age         city   salary
0   1    Alice  25.0          NaN      NaN
1   2      Bob  30.0     New York  70000.0
2   3  Charlie  35.0  Los Angeles  80000.0
3   4      NaN   NaN      Chicago  60000.0

Left Join Example


Python
df_a
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
Python
df_b
   id         city  salary
0   2     New York   70000
1   3  Los Angeles   80000
2   4      Chicago   60000
Python
# Inner join on 'id'
left_join = pd.merge(df_a, df_b, on='id', how='left')
left_join
   id     name  age         city   salary
0   1    Alice   25          NaN      NaN
1   2      Bob   30     New York  70000.0
2   3  Charlie   35  Los Angeles  80000.0

Right Join Example


Python
df_a
   id     name  age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   35
Python
df_b
   id         city  salary
0   2     New York   70000
1   3  Los Angeles   80000
2   4      Chicago   60000
Python
# Inner join on 'id'
right_join = pd.merge(df_a, df_b, on='id', how='right')
right_join
   id     name   age         city  salary
0   2      Bob  30.0     New York   70000
1   3  Charlie  35.0  Los Angeles   80000
2   4      NaN   NaN      Chicago   60000

Other Methods

Other methods of merging two dataframes include join and concat.

To keep things simple, we will not do those.

Exercises

Let us imagine the following datasets:

Python
raw_data_1 = { 'subject_id': ['1', '2', '3', '4', '5'], 
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = { 'subject_id': ['4', '5', '6', '7', '8'], 
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = { 'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'], 
'test_grade': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

Exercises

  1. Create a dataframe for each data
  2. Join the first 2 DFs so that it generates a list with all the users.
  3. Join the DF from the previous section with the data from the grades (raw_data_3) so that each user is assigned their grade.
  4. Join the data from raw_data_1 and raw_data_2 only from the records with the same subject_id

Exercises Solution 1

  1. Create a dataframe for each data
Python
df1 = pd.DataFrame(raw_data_1, columns = ['subject_id', 'first_name', 'last_name'])
df2 = pd.DataFrame(raw_data_2, columns = ['subject_id', 'first_name', 'last_name'])
df3 = pd.DataFrame(raw_data_3, columns = ['subject_id','test_grade'])

Exercises Solution 2

  1. Join the first 2 DFs so that it generates a list with all the users.
Python
# Merge the DataFrames
merged_df = pd.merge(df1, df2, on=['subject_id', 'first_name', 'last_name'], how='outer')
merged_df
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          4      Billy    Bonder
5          5     Ayoung   Atiches
6          5      Brian     Black
7          6       Bran   Balwner
8          7      Bryce     Brice
9          8      Betty    Btisan
Python
df1
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
Python
df2
  subject_id first_name last_name
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan

Exercises Solution 3

  1. Join the DF from the previous section with the data from the grades (raw_data_3) so that each user is assigned their grade.
Python
left_join = pd.merge(merged_df, df3, on='subject_id', how='left')
left_join
  subject_id first_name last_name  test_grade
0          1       Alex  Anderson        51.0
1          2        Amy  Ackerman        15.0
2          3      Allen       Ali        15.0
3          4      Alice      Aoni        61.0
4          4      Billy    Bonder        61.0
5          5     Ayoung   Atiches        16.0
6          5      Brian     Black        16.0
7          6       Bran   Balwner         NaN
8          7      Bryce     Brice        14.0
9          8      Betty    Btisan        15.0

Exercises Solution 4

  1. Join the data from raw_data_1 and raw_data_2 only from the records with the same subject_id
Python
merged_df = pd.merge(df1, df2, on='subject_id', how='inner')
merged_df
  subject_id first_name_x last_name_x first_name_y last_name_y
0          4        Alice        Aoni        Billy      Bonder
1          5       Ayoung     Atiches        Brian       Black

Conclusion

Aggregation: Grouping and aggregating data is vital for summarizing patterns, including counting, averaging, and applying multiple functions across groups.

Join Types: Pandas supports diverse join operations (inner, outer, left, right) to combine datasets based on shared keys.

Data Merging: Functions like merge, join, and concat enable seamless integration of datasets, whether aligning by index or column.

Hands-On Practice: Exercises illustrate real-world scenarios for combining datasets and performing operations like filtering and summarizing merged data.