L10: Pandas - Data Wrangling

Bogdan G. Popescu

John Cabot University

Introduction

In this lecture, we will cover steps related to data wrangling. Data wrangling could entail steps such as:

  • establishing indexes
  • renaming columns
  • eliminating duplicates
  • filtering data
  • adding or deleting records
  • adding or deleting columns
  • editing information
  • modifying the data structure

Introduction

The first thing to do is to import the necessary libraries:

import pandas as pd
import numpy as np

Let us have a look at a mock dataset.

beer = pd.read_csv('beer_reviews.csv')
beer.head()
   brewery_id             brewery_name  ...               beer_name  beer_beerid
0       10325          Vecchio Birraio  ...            Sausa Weizen        47986
1       10325          Vecchio Birraio  ...                Red Moon        48213
2       10325          Vecchio Birraio  ...  Black Horse Black Beer        48215
3       10325          Vecchio Birraio  ...              Sausa Pils        47969
4        1075  Caldera Brewing Company  ...           Cauldron DIPA        64883

[5 rows x 9 columns]

Setting index

In this example, we can use the beer names as the index.

To do this, we set beer_name as the index:

beer.set_index('beer_name',inplace=True)
beer.head(n=3)
                        brewery_id     brewery_name  ...  review_taste  beer_beerid
beer_name                                            ...                           
Sausa Weizen                 10325  Vecchio Birraio  ...           1.5        47986
Red Moon                     10325  Vecchio Birraio  ...           3.0        48213
Black Horse Black Beer       10325  Vecchio Birraio  ...           3.0        48215

[3 rows x 8 columns]

We can return to the original structure by resetting the index:

beer.reset_index(inplace=True)
beer.head(n=3)
                beer_name  brewery_id  ... review_taste  beer_beerid
0            Sausa Weizen       10325  ...          1.5        47986
1                Red Moon       10325  ...          3.0        48213
2  Black Horse Black Beer       10325  ...          3.0        48215

[3 rows x 9 columns]
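As a side note (not shown on the slides), reset_index also accepts a drop argument; with drop=True the old index is discarded instead of being re-inserted as a column. A minimal sketch on a throwaway copy:

# tmp is a hypothetical copy used only for illustration
tmp = beer.set_index('beer_name')
tmp.reset_index(drop=True).head(3)   # beer_name is dropped rather than restored as a column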

Changing column names

We can change the names of the columns in the following way:

beer.columns
Index(['beer_name', 'brewery_id', 'brewery_name', 'review_overall',
       'review_aroma', 'review_appearance', 'beer_style', 'review_taste',
       'beer_beerid'],
      dtype='object')
beer.columns = ["name_beer", "id_brewery", "name_brewery", "overall_review",
"aroma_review", "appearance_review", "beer_style", "review_taste", "beerid"]

We can now check that the column names have changed:

beer.columns
Index(['name_beer', 'id_brewery', 'name_brewery', 'overall_review',
       'aroma_review', 'appearance_review', 'beer_style', 'review_taste',
       'beerid'],
      dtype='object')

Changing column names

We can also rename individual columns by passing a dictionary to rename:

beer.rename(columns= {'name_beer' : 'beer_name'}, inplace=True)
beer.columns
Index(['beer_name', 'id_brewery', 'name_brewery', 'overall_review',
       'aroma_review', 'appearance_review', 'beer_style', 'review_taste',
       'beerid'],
      dtype='object')

Ordering the rows

Let us look at our data:

beer.head()
                beer_name  id_brewery  ... review_taste  beerid
0            Sausa Weizen       10325  ...          1.5   47986
1                Red Moon       10325  ...          3.0   48213
2  Black Horse Black Beer       10325  ...          3.0   48215
3              Sausa Pils       10325  ...          3.0   47969
4           Cauldron DIPA        1075  ...          4.5   64883

[5 rows x 9 columns]

We can order the names in a specific column alphabetically:

beer.sort_values(by = 'beer_name', ascending=True, inplace=True)
beer.head()
                   beer_name  id_brewery  ... review_taste  beerid
5143              1 A.M. Ale       12003  ...          3.0   25251
2365         10 Blocks South       26990  ...          4.0   76766
1761        15th Anniversary        1454  ...          4.5   27918
1762        15th Anniversary        1454  ...          3.5   27918
1742  2001 - A Stout Odyssey        1454  ...          3.0   55096

[5 rows x 9 columns]

Ordering the rows

We can also order values based on multiple columns.

The code first sorts the dataframe by beer_name, and then breaks ties by review_taste.

True means ascending order. False means descending order.

beer.sort_values(by = ['beer_name','review_taste'], ascending=[True,False])
                   beer_name  id_brewery  ... review_taste  beerid
5143              1 A.M. Ale       12003  ...          3.0   25251
2365         10 Blocks South       26990  ...          4.0   76766
1761        15th Anniversary        1454  ...          4.5   27918
1762        15th Anniversary        1454  ...          3.5   27918
1742  2001 - A Stout Odyssey        1454  ...          3.0   55096
...                      ...         ...  ...          ...     ...
2498          Yakima Monster       23246  ...          4.5   61417
2626      Yankee Clipper IPA       15636  ...          4.5   38212
926         Yellowjacket Ale        2724  ...          3.0   69087
4891         YouEnjoyMyStout          14  ...          4.0   21436
2652           Zhigulovskoye        1746  ...          3.0   22746

[5998 rows x 9 columns]

Identifying Duplicates

Many times we deal with duplicated values.

We can for example identify if there are any duplicated values.

beer.duplicated().any()
True

The code above suggests that there are indeed duplicated values.
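If we also want to know how many rows are duplicated, not just whether any are, we can sum the boolean mask returned by duplicated(); a minimal sketch (my addition, not part of the original slides):

# Summing a boolean Series counts the True values, i.e. the duplicated rows
beer.duplicated().sum()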

Identifying Duplicates

Are there any duplicated values within a column?

print('Are there beers with the same name?','\n')
print(beer.beer_name.duplicated().any(),'\n')
Are there beers with the same name? 
True 
print('Are there breweries with the same name?','\n')
print(beer.name_brewery.duplicated().any(),'\n')
Are there breweries with the same name? 
True 

Removing Duplicates

To remove duplicates, we can use drop_duplicates:

beer2 = beer.drop_duplicates()
beer2
                   beer_name  id_brewery  ... review_taste  beerid
5143              1 A.M. Ale       12003  ...          3.0   25251
2365         10 Blocks South       26990  ...          4.0   76766
1761        15th Anniversary        1454  ...          4.5   27918
1762        15th Anniversary        1454  ...          3.5   27918
1742  2001 - A Stout Odyssey        1454  ...          3.0   55096
...                      ...         ...  ...          ...     ...
2498          Yakima Monster       23246  ...          4.5   61417
2626      Yankee Clipper IPA       15636  ...          4.5   38212
926         Yellowjacket Ale        2724  ...          3.0   69087
4891         YouEnjoyMyStout          14  ...          4.0   21436
2652           Zhigulovskoye        1746  ...          3.0   22746

[3559 rows x 9 columns]

Removing Duplicates

We can see for example the number of rows for the original dataset.

beer.shape[0]
5998

And compare it to the dataset where we eliminated the duplicates.

beer2.shape[0]
3559

Thus, we got rid of a substantial number of observations.

beer.shape[0] - beer2.shape[0]
2439

Removing Duplicates

We can also remove duplicates based on a single column:

beer.drop_duplicates(subset=['beer_name'], keep='first')
                                              beer_name  ...  beerid
5143                                         1 A.M. Ale  ...   25251
2365                                    10 Blocks South  ...   76766
1761                                   15th Anniversary  ...   27918
1742                             2001 - A Stout Odyssey  ...   55096
1714  Ackerman's Imperial Double Stout (Indiana Repl...  ...   66889
...                                                 ...  ...     ...
2498                                     Yakima Monster  ...   61417
2626                                 Yankee Clipper IPA  ...   38212
926                                    Yellowjacket Ale  ...   69087
4891                                    YouEnjoyMyStout  ...   21436
2652                                      Zhigulovskoye  ...   22746

[486 rows x 9 columns]

The keep='first' argument keeps the first entry whenever duplicates are encountered.
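The keep argument also accepts 'last' and False; a short sketch of the alternatives, not run on the slides:

# Keep the last occurrence of each beer name instead of the first
beer.drop_duplicates(subset=['beer_name'], keep='last')

# Drop every row whose beer_name appears more than once
beer.drop_duplicates(subset=['beer_name'], keep=False)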

Filtering Data

Applying filters to the data is useful for understanding how it is distributed.

To apply a filter, we specify a condition on the data.

Let us say that we only want to look at beers that have an overall review score above 4.5.

beer[beer['overall_review'] > 4.5]
                                              beer_name  ...  beerid
1701  Ackerman's Imperial Double Stout (Winterfest R...  ...   66430
2757                                   Akron Dopple Alt  ...   44875
2171                                            Altbier  ...   45132
2348                                 Amalgamated Helles  ...   50575
1628                                     Amarillo Lager  ...   43516
...                                                 ...  ...     ...
3030                                Wild Frog Wheat Ale  ...   36862
1766                                     Wobbly Bob APA  ...   15660
1767                                     Wobbly Bob APA  ...   15660
1768                                     Wobbly Bob APA  ...   15660
1772                                     Wobbly Bob APA  ...   15660

[294 rows x 9 columns]
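As an aside, the same filter can also be written with the query method, which some find more readable; a minimal sketch (my addition):

# Equivalent filter expressed as a query string
beer.query('overall_review > 4.5')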

Filtering Data

We can also apply multiple filters.

beer[(beer['overall_review'] > 4.5) & (beer['beer_style'] == "Light Lager")]
         beer_name  id_brewery  ... review_taste  beerid
1262  Amstel Light         163  ...          4.0     436
1264  Amstel Light         163  ...          4.0     436
1075  Amstel Light         163  ...          5.0     436
1025  Amstel Light         163  ...          3.0     436
1032  Amstel Light         163  ...          2.5     436
1131  Amstel Light         163  ...          5.0     436
1145  Amstel Light         163  ...          4.0     436
1136  Amstel Light         163  ...          2.0     436

[8 rows x 9 columns]

Counting Data

We can count the items within categories in the following way:

beer.name_brewery.value_counts()
name_brewery
Hoppin' Frog Brewery                              3113
Caldera Brewing Company                            889
Amstel Brouwerij B. V.                             570
Moon River Brewing Company                         209
Broad Ripple Brew Pub                              152
                                                  ... 
Brasserie La Binchoise (Malterie des Remparts)       1
Pivovar Groll                                        1
Asmara Breweries                                     1
Stadtbrauerei Arnstadt                               1
Voronezhsky Brewery                                  1
Name: count, Length: 73, dtype: int64

Filtering Data

We could also use between as another way to filter the data:

beer[beer['overall_review'].between(3,4)]
                                              beer_name  ...  beerid
5143                                         1 A.M. Ale  ...   25251
2365                                    10 Blocks South  ...   76766
1762                                   15th Anniversary  ...   27918
1742                             2001 - A Stout Odyssey  ...   55096
1714  Ackerman's Imperial Double Stout (Indiana Repl...  ...   66889
...                                                 ...  ...     ...
1801                                   X-Extra Pale Ale  ...   32141
2498                                     Yakima Monster  ...   61417
926                                    Yellowjacket Ale  ...   69087
4891                                    YouEnjoyMyStout  ...   21436
2652                                      Zhigulovskoye  ...   22746

[3862 rows x 9 columns]
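Note that between includes both endpoints by default, so the call above is equivalent to the explicit comparison below (a sketch added for clarity):

# between(3, 4) keeps rows where 3 <= overall_review <= 4
beer[(beer['overall_review'] >= 3) & (beer['overall_review'] <= 4)]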

Filtering Data

So far, we have seen how to filter based on the values of the data.

If we want to select rows by position or by label, we use the .iloc[] and .loc[] methods.

  • .iloc[] selects records by their numeric position

  • .loc[] selects records by their index label (a short sketch contrasting the two follows)
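A minimal sketch of the contrast, assuming (as an illustration, not on the slides) that beer_name is set as the index as in the earlier example:

# Hypothetical contrast: beer_name used as the index, as in the earlier slide
beer_indexed = beer.set_index('beer_name')
beer_indexed.iloc[0]            # selection by numeric position (first row)
beer_indexed.loc['Red Moon']    # selection by index label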

Filtering Data

The following line selects the row at position 3 (the fourth row, since positions start at 0).

beer.iloc[3]
beer_name                 15th Anniversary
id_brewery                            1454
name_brewery         Broad Ripple Brew Pub
overall_review                         3.5
aroma_review                           3.0
appearance_review                      3.5
beer_style                English Pale Ale
review_taste                           3.5
beerid                               27918
Name: 1762, dtype: object

Filtering Data

We can also select the rows at positions 3 through 5 (the end of the slice is exclusive).

beer.iloc[3:6]
                                              beer_name  ...  beerid
1762                                   15th Anniversary  ...   27918
1742                             2001 - A Stout Odyssey  ...   55096
1714  Ackerman's Imperial Double Stout (Indiana Repl...  ...   66889

[3 rows x 9 columns]

Filtering Data

We can also be more specific and pass a list of the exact positions we want:

beer.iloc[[3,7,9]]
             beer_name  id_brewery  ... review_taste  beerid
1762  15th Anniversary        1454  ...          3.5   27918
2757  Akron Dopple Alt       16994  ...          5.0   44875
733         Alpha Beta        1075  ...          4.0   54723

[3 rows x 9 columns]

Selecting based on specific labels

With .loc[], we can select rows that satisfy a label-based condition:

beer.loc[beer.name_brewery == 'Broad Ripple Brew Pub']
                                              beer_name  ...  beerid
1761                                   15th Anniversary  ...   27918
1762                                   15th Anniversary  ...   27918
1742                             2001 - A Stout Odyssey  ...   55096
1714  Ackerman's Imperial Double Stout (Indiana Repl...  ...   66889
1701  Ackerman's Imperial Double Stout (Winterfest R...  ...   66430
...                                                 ...  ...     ...
1768                                     Wobbly Bob APA  ...   15660
1770                                     Wobbly Bob APA  ...   15660
1771                                     Wobbly Bob APA  ...   15660
1772                                     Wobbly Bob APA  ...   15660
1769                                     Wobbly Bob APA  ...   15660

[152 rows x 9 columns]

Eliminating Observations

We can easily drop observations in the following way:

beer2 = beer.loc[beer.name_brewery != 'Broad Ripple Brew Pub']
beer2
               beer_name  id_brewery  ... review_taste  beerid
5143          1 A.M. Ale       12003  ...          3.0   25251
2365     10 Blocks South       26990  ...          4.0   76766
2757    Akron Dopple Alt       16994  ...          5.0   44875
2756    Akron Dopple Alt       16994  ...          3.5   44875
733           Alpha Beta        1075  ...          4.0   54723
...                  ...         ...  ...          ...     ...
2498      Yakima Monster       23246  ...          4.5   61417
2626  Yankee Clipper IPA       15636  ...          4.5   38212
926     Yellowjacket Ale        2724  ...          3.0   69087
4891     YouEnjoyMyStout          14  ...          4.0   21436
2652       Zhigulovskoye        1746  ...          3.0   22746

[5846 rows x 9 columns]

We can compare the number of observations before and after:

beer.shape[0]
5998
beer2.shape[0]
5846
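As a side note (an alternative not shown on the slides), the same rows can be removed with drop by passing the index labels of the matching rows:

# Drop rows by index label instead of filtering with a boolean mask
labels = beer.index[beer.name_brewery == 'Broad Ripple Brew Pub']
beer2 = beer.drop(index=labels)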

Aggregating Columns

We can also create new columns by performing operations on existing columns, in the following way:

beer['review_avg'] = (beer['aroma_review'] + beer['appearance_review'])/2
beer.head()
                   beer_name  id_brewery  ... beerid  review_avg
5143              1 A.M. Ale       12003  ...  25251        3.75
2365         10 Blocks South       26990  ...  76766        3.50
1761        15th Anniversary        1454  ...  27918        4.00
1762        15th Anniversary        1454  ...  27918        3.25
1742  2001 - A Stout Odyssey        1454  ...  55096        3.75

[5 rows x 10 columns]
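A row-wise mean gives the same result and scales to any number of columns; a minimal sketch (my addition, equivalent to the line above):

# Equivalent row-wise average across the two review columns
beer['review_avg'] = beer[['aroma_review', 'appearance_review']].mean(axis=1)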

Adding Rows to our data

Let’s say that we have one new row

data = {'beer_name': ['Corona'],
        'id_brewery': None,
        'overall_review': ['5'],
        'aroma_review': ['5'],
        'beer_style': ["Old Ale"]}

new_beer = pd.DataFrame(data, index=["Corona"])
new_beer
       beer_name id_brewery overall_review aroma_review beer_style
Corona    Corona       None              5            5    Old Ale

Adding Rows to our data

We can now add this new row to our data

beer2 = beer._append(new_beer)
beer2
                     beer_name id_brewery  ...   beerid review_avg
5143                1 A.M. Ale      12003  ...  25251.0       3.75
2365           10 Blocks South      26990  ...  76766.0       3.50
1761          15th Anniversary       1454  ...  27918.0       4.00
1762          15th Anniversary       1454  ...  27918.0       3.25
1742    2001 - A Stout Odyssey       1454  ...  55096.0       3.75
...                        ...        ...  ...      ...        ...
2626        Yankee Clipper IPA      15636  ...  38212.0       4.25
926           Yellowjacket Ale       2724  ...  69087.0       3.25
4891           YouEnjoyMyStout         14  ...  21436.0       4.00
2652             Zhigulovskoye       1746  ...  22746.0       3.25
Corona                  Corona       None  ...      NaN        NaN

[5999 rows x 10 columns]
set(new_beer.columns)
{'beer_name', 'id_brewery', 'beer_style', 'overall_review', 'aroma_review'}
set(beer2.columns)
{'beer_name', 'beerid', 'id_brewery', 'review_taste', 'beer_style', 'appearance_review', 'name_brewery', 'overall_review', 'review_avg', 'aroma_review'}

Adding Rows to our data

We can see that the columns that did not exist in new_beer were filled with NaN.

To see the columns that are in beer2, but not in new_beer, we can easily perform:

# Get the columns that are in beer2 but not in new_beer
columns_in_beer2_not_in_new_beer = set(beer2.columns) - set(new_beer.columns)
columns_in_beer2_not_in_new_beer
{'beerid', 'review_taste', 'appearance_review', 'name_brewery', 'review_avg'}
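A note of caution: _append is a private pandas method, and the public append was removed in pandas 2.0, so the more idiomatic way to add rows is pd.concat; a minimal sketch producing the same kind of result:

# Idiomatic alternative to _append: concatenate the two dataframes;
# missing columns are again filled with NaN
beer2 = pd.concat([beer, new_beer])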

Exercises

  1. How many different beer styles are there?
  2. Figure out which are the 5 most common beer styles
  3. How many beers have a review score of 5 in all the categories?

Exercises Answers

  1. How many different beer styles are there?
unique_beer_styles = beer['beer_style'].nunique()
unique_beer_styles
80

Exercises Answers

  2. Figure out which are the 5 most common beer styles
# Get the counts of each beer style
beer_style_counts = beer['beer_style'].value_counts()

# Get the top 5 most common beer styles
top_5_beer_styles = beer_style_counts.head(5)
print(top_5_beer_styles)
beer_style
Russian Imperial Stout            1561
American IPA                       791
American Double / Imperial IPA     493
Light Lager                        466
American Pale Ale (APA)            304
Name: count, dtype: int64

Exercises Answers

  3. How many beers have a review score of 5 in all the categories?
# Select the relevant review columns
review_columns = ['overall_review', 'aroma_review', 'appearance_review', 'review_taste']

# Create a boolean mask where all selected review columns have a score of 5
mask = (beer[review_columns] == 5).all(axis=1)

# Count the number of rows that meet the condition
beers_with_perfect_scores = mask.sum()
beers_with_perfect_scores
19

Operations in Pandas

Pandas supports many of the same operations that exist in NumPy.

For example, we can calculate the average of reviews

beer.overall_review.mean()
3.775508502834278

Other similar operations are min and max

beer.overall_review.min()
1.0
beer.overall_review.max()
5.0
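The mean, minimum, and maximum (together with other summary statistics) can also be obtained in one call with describe; a short sketch:

# Count, mean, std, min, quartiles, and max for a single column
beer.overall_review.describe()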

Operations in Pandas

Let’s say that we are interested in the correlation among multiple variables

  • overall_review
  • aroma_review
  • appearance_review
  • review_taste

Operations in Pandas

The way we would do this is the following:

pd.set_option('display.max_columns', None)
beer[['overall_review', 'aroma_review', 'appearance_review', 'review_taste']].corr()
                   overall_review  aroma_review  appearance_review  review_taste
overall_review           1.000000      0.675650           0.570245      0.819431
aroma_review             0.675650      1.000000           0.632493      0.774227
appearance_review        0.570245      0.632493           1.000000      0.616572
review_taste             0.819431      0.774227           0.616572      1.000000

Modifying the Data Structure

The main operations that we can perform with data are:

  • pivoting
  • stacking
  • melting

Pivoting

Let us assume that we want to change the structure of the data

table = {
    "Item": ['Item0', 'Item0', 'Item1', 'Item1'],
    'CType': ['Gold', 'Bronze', 'Gold', 'Silver'],
    'USD': ['1$', '2$', '3$', '4$'],
    'EU': ['1€', '2€', '3€', '4€']
}
d = pd.DataFrame(table)
d
    Item   CType USD  EU
0  Item0    Gold  1$  1€
1  Item0  Bronze  2$  2€
2  Item1    Gold  3$  3€
3  Item1  Silver  4$  4€

Pivoting

d
    Item   CType USD  EU
0  Item0    Gold  1$  1€
1  Item0  Bronze  2$  2€
2  Item1    Gold  3$  3€
3  Item1  Silver  4$  4€

The way we pivot this table is by doing the following

d_pivot = d.pivot(index='Item', columns='CType', values=['USD', 'EU'])
d_pivot
         USD                 EU            
CType Bronze Gold Silver Bronze Gold Silver
Item                                       
Item0     2$   1$    NaN     2€   1€    NaN
Item1    NaN   3$     4$    NaN   3€     4€
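Stacking, which was also listed among the main operations, is not demonstrated on these slides; as a minimal sketch, stack moves the innermost column level of the pivoted table into the row index (the exact output format varies slightly across pandas versions):

# Move the innermost column level (CType) into the row index
d_pivot.stack()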

Melting

The opposite of pivoting is melting.

Let us examine the following dataframe:

data = {'weekday': ["Monday", "Tuesday", "Wednesday", 
         "Thursday", "Friday", "Saturday", "Sunday"],
        'Person 1': [12, 6, 5, 8, 11, 6, 4],
        'Person 2': [10, 6, 11, 5, 8, 9, 12],
        'Person 3': [8, 5, 7, 3, 7, 11, 15]}
df = pd.DataFrame(data, columns=['weekday',
        'Person 1', 'Person 2', 'Person 3'])
df
     weekday  Person 1  Person 2  Person 3
0     Monday        12        10         8
1    Tuesday         6         6         5
2  Wednesday         5        11         7
3   Thursday         8         5         3
4     Friday        11         8         7
5   Saturday         6         9        11
6     Sunday         4        12        15

Melting

In this case, what we want is to have the following columns: weekday, Person, Score.

The way we do this is the following:

melted = pd.melt(df, id_vars=["weekday"], 
                 var_name="Person", 
                 value_name="Score")
melted
      weekday    Person  Score
0      Monday  Person 1     12
1     Tuesday  Person 1      6
2   Wednesday  Person 1      5
3    Thursday  Person 1      8
4      Friday  Person 1     11
5    Saturday  Person 1      6
6      Sunday  Person 1      4
7      Monday  Person 2     10
8     Tuesday  Person 2      6
9   Wednesday  Person 2     11
10   Thursday  Person 2      5
11     Friday  Person 2      8
12   Saturday  Person 2      9
13     Sunday  Person 2     12
14     Monday  Person 3      8
15    Tuesday  Person 3      5
16  Wednesday  Person 3      7
17   Thursday  Person 3      3
18     Friday  Person 3      7
19   Saturday  Person 3     11
20     Sunday  Person 3     15

Exercise Pivoting

Look at the following dataframe:

data = {
    'Date': ['2024-08-10', '2024-08-10', '2024-08-10', '2024-08-11', '2024-08-11', '2024-08-11'],
    'City': ['New York', 'Chicago', 'Miami', 'New York', 'Chicago', 'Miami'],
    'Temp': [85, 78, 90, 88, 76, 92],
    'Hum': [70, 65, 80, 72, 68, 82]
}

df = pd.DataFrame(data)
df
         Date      City  Temp  Hum
0  2024-08-10  New York    85   70
1  2024-08-10   Chicago    78   65
2  2024-08-10     Miami    90   80
3  2024-08-11  New York    88   72
4  2024-08-11   Chicago    76   68
5  2024-08-11     Miami    92   82

Exercise Pivoting

  1. Pivot the table so that it looks like below:
            Chicago_Temp  Miami_Temp  New York_Temp  Chicago_Hum  Miami_Hum  \
Date                                                                          
2024-08-10            78          90             85           65         80   
2024-08-11            76          92             88           68         82   

            New York_Hum  
Date                      
2024-08-10            70  
2024-08-11            72  

Exercise Melting

Consider the following DataFrame representing survey responses from three individuals about their preferences for different fruits:

# Sample DataFrame
data = {
    'Person': ['Alice', 'Bob', 'Carol'],
    'Apple': [5, 2, 4],
    'Banana': [3, 5, 4],
    'Orange': [4, 3, 5]
}

df = pd.DataFrame(data)
df
  Person  Apple  Banana  Orange
0  Alice      5       3       4
1    Bob      2       5       3
2  Carol      4       4       5

Exercise Melting

  1. Use the melting operation to transform this wide-format DataFrame into a long-format DataFrame with columns: Person, Fruit, and Preference. Your dataframe should look like below:
  Person   Fruit  Preference
0  Alice   Apple           5
1    Bob   Apple           2
2  Carol   Apple           4
3  Alice  Banana           3
4    Bob  Banana           5
5  Carol  Banana           4
6  Alice  Orange           4
7    Bob  Orange           3
8  Carol  Orange           5

Aggregating by group

In some instances, we might be interested in performing operations by group.

For example, we might be interested in counting the number of observations by group.

Let’s say that we want to know how many beers there are per beer style.

The following will create a group object.

grouped_by_style = beer.groupby(by='beer_style')
grouped_by_style
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x163ed3890>

Aggregating by group

If we actually want to see the number of beers by style, we would do the following

grouped_by_style.size()
beer_style
Altbier                        26
American Adjunct Lager         28
American Amber / Red Ale      147
American Amber / Red Lager      7
American Barleywine             9
                             ... 
Tripel                          9
Vienna Lager                   17
Weizenbock                      1
Winter Warmer                 129
Witbier                        26
Length: 80, dtype: int64

Aggregating by group

We could also rearrange the number of beers by style in the following way:

#Creating a dataframe
df = pd.DataFrame(grouped_by_style.size())
#Renaming the column
df = df.rename(columns={0: 'count'})
#Sorting
df_sorted = df.sort_values(by='count', ascending=False)
df_sorted
                                count
beer_style                           
Russian Imperial Stout           1561
American IPA                      791
American Double / Imperial IPA    493
Light Lager                       466
American Pale Ale (APA)           304
...                               ...
Japanese Rice Lager                 1
Flanders Red Ale                    1
California Common / Steam Beer      1
Bière de Garde                      1
Weizenbock                          1

[80 rows x 1 columns]
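As a brief aside, the same sorted counts can be obtained in a single step with value_counts, which we used earlier:

# Equivalent one-liner: counts per beer style, sorted in descending order
beer['beer_style'].value_counts()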

Aggregating by group

If we are interested in the average review score for every beer style, we would do:

a = beer.groupby('beer_style')['overall_review'].mean()
print(a)
beer_style
Altbier                       3.846154
American Adjunct Lager        3.642857
American Amber / Red Ale      3.792517
American Amber / Red Lager    3.071429
American Barleywine           3.722222
                                ...   
Tripel                        3.777778
Vienna Lager                  3.500000
Weizenbock                    4.000000
Winter Warmer                 3.457364
Witbier                       3.615385
Name: overall_review, Length: 80, dtype: float64

Aggregating by group

We could also calculate the average review for more than one category

a = beer.groupby(['beer_style', 'name_brewery'])['overall_review'].mean()
print(a)
beer_style     name_brewery                              
Altbier        Bakushukan Clair (Kure Beer)                  4.000000
               Bluegrass Brewing Co.                         3.766667
               Bluegrass Brewing Co. - East St. Matthew's    4.000000
               Broad Ripple Brew Pub                         4.000000
               Clockwork Beer Co, The                        4.000000
                                                               ...   
Winter Warmer  Hoppin' Frog Brewery                          3.444000
               NoDa Brewing Company                          4.000000
Witbier        Broad Ripple Brew Pub                         2.833333
               Frog at Bercy Village, The                    4.000000
               Moon River Brewing Company                    3.842105
Name: overall_review, Length: 375, dtype: float64

Aggregating by group

Another option to work with grouped data is to use the agg function.

This allows us to apply multiple functions to one group.

a = beer.groupby(['beer_style', 'name_brewery'])['overall_review'].agg(["mean", "count"])
print(a)
                                                              mean  count
beer_style    name_brewery                                               
Altbier       Bakushukan Clair (Kure Beer)                4.000000      1
              Bluegrass Brewing Co.                       3.766667     15
              Bluegrass Brewing Co. - East St. Matthew's  4.000000      1
              Broad Ripple Brew Pub                       4.000000      1
              Clockwork Beer Co, The                      4.000000      4
...                                                            ...    ...
Winter Warmer Hoppin' Frog Brewery                        3.444000    125
              NoDa Brewing Company                        4.000000      2
Witbier       Broad Ripple Brew Pub                       2.833333      6
              Frog at Bercy Village, The                  4.000000      1
              Moon River Brewing Company                  3.842105     19

[375 rows x 2 columns]

Exercises

  1. What are the top 5 beer styles?
  2. What are the top 5 worst breweries?
  3. Which brewery has the greatest variability in its rating?

Exercises Answers

  1. What are the top 5 beer styles?
#Calculate the average overall review by beer style
a = beer.groupby(['beer_style'])['overall_review'].agg(["mean"])
#Creating a dataframe
df = pd.DataFrame(a)
#Renaming the column
df = df.rename(columns={"mean": 'avg_review'})
#Sorting
df_sorted = df.sort_values(by='avg_review', ascending=False)
df_sorted.head(5)
                                avg_review
beer_style                                
California Common / Steam Beer    5.000000
Flanders Red Ale                  4.500000
English Barleywine                4.309091
Chile Beer                        4.166667
Rauchbier                         4.159091

Exercises Answers

  2. What are the top 5 worst breweries?
#Calculate the average overall review by brewery
b = beer.groupby(['name_brewery'])['overall_review'].agg(["mean"])
#Creating a dataframe
df = pd.DataFrame(b)
#Renaming the column
df = df.rename(columns={"mean": 'avg_review'})
#Sorting
df_sorted = df.sort_values(by='avg_review', ascending=True)
df_sorted.head(5)
                                     avg_review
name_brewery                                   
Eastwood The Brewer (John Eastwood)    1.500000
Trafalgar Ales & Meads                 2.071429
Bootleggers Restaurant & Brewery       2.250000
Piniavos Alutis                        2.500000
Vecchio Birraio                        2.625000

Exercises Answers

  3. Which brewery has the greatest variability in its rating?
#Calculate the standard deviation of the overall review by brewery
c = beer.groupby(['name_brewery'])['overall_review'].std()
#Identify the highest variability
max_variability_brewery = c.sort_values(ascending=False).head(1)
max_variability_brewery
name_brewery
Otro Mundo Brewing Company    1.06066
Name: overall_review, dtype: float64