In this lecture, we will cover steps related to data wrangling. Data wrangling can entail steps such as renaming and reindexing variables, sorting values, handling duplicates, filtering rows, creating new columns, and reshaping data by pivoting and melting.
The first thing to do is to import the libraries we need.
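A minimal setup sketch; pandas is the only library the examples below strictly need:

```python
import pandas as pd
```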
Let us have a look at a mock dataset.
Download it and place it within the week’s folder.
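A loading sketch; the file name beer_reviews.csv and the variable name beer are assumptions for illustration:

```python
beer = pd.read_csv('beer_reviews.csv')
print(beer.head())
```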
   brewery_id             brewery_name  ...               beer_name  beer_beerid
0       10325          Vecchio Birraio  ...            Sausa Weizen        47986
1       10325          Vecchio Birraio  ...                Red Moon        48213
2       10325          Vecchio Birraio  ...  Black Horse Black Beer        48215
3       10325          Vecchio Birraio  ...              Sausa Pils        47969
4        1075  Caldera Brewing Company  ...           Cauldron DIPA        64883
[5 rows x 9 columns]
In this example, we can use the beer names as indices. The way we do this is by setting beer_name as the index:
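A sketch with .set_index(); we show only the first three rows to match the printout below:

```python
beer_by_name = beer.set_index('beer_name')
print(beer_by_name.head(3))
```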
                        brewery_id     brewery_name  ...  review_taste  beer_beerid
beer_name                                            ...                           
Sausa Weizen                 10325  Vecchio Birraio  ...           1.5        47986
Red Moon                     10325  Vecchio Birraio  ...           3.0        48213
Black Horse Black Beer       10325  Vecchio Birraio  ...           3.0        48215
[3 rows x 8 columns]
We can turn back to the original data by resetting the index
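For example:

```python
beer = beer_by_name.reset_index()
```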
We can change the names of the columns. Let us first inspect the current names:
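One way to do that:

```python
print(beer.columns)
```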
Index(['beer_name', 'brewery_id', 'brewery_name', 'review_overall',
       'review_aroma', 'review_appearance', 'beer_style', 'review_taste',
       'beer_beerid'],
      dtype='object')
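We can then assign a full list of new names directly. A sketch, with new names chosen to match the columns that appear later in the lecture (id_brewery, name_brewery, and so on):

```python
beer.columns = ['beer_name', 'id_brewery', 'name_brewery', 'overall_review',
                'aroma_review', 'appearance_review', 'beer_style',
                'review_taste', 'beerid']
```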
We can also change the names of our variables by using a dictionary.
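A sketch of the equivalent .rename() call, starting from the original names; only the columns named in the dictionary are changed:

```python
beer = beer.rename(columns={'brewery_id': 'id_brewery',
                            'brewery_name': 'name_brewery',
                            'review_overall': 'overall_review',
                            'review_aroma': 'aroma_review',
                            'review_appearance': 'appearance_review',
                            'beer_beerid': 'beerid'})
```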
Let us look at our data. We can order the values of a specific column alphabetically:
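A sketch with sort_values(); reassigning keeps the sorted order for the steps that follow:

```python
beer = beer.sort_values('beer_name')
print(beer.head())
```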
                   beer_name  id_brewery  ... review_taste  beerid
5143              1 A.M. Ale       12003  ...          3.0   25251
2365         10 Blocks South       26990  ...          4.0   76766
1761        15th Anniversary        1454  ...          4.5   27918
1762        15th Anniversary        1454  ...          3.5   27918
1742  2001 - A Stout Odyssey        1454  ...          3.0   55096
[5 rows x 9 columns]
We can also order values based on multiple columns. The code below sorts the dataframe by beer_name first and then breaks ties using review_taste. In the ascending argument, True means ascending order and False means descending order.
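A sketch consistent with the output below, where taste scores descend within each beer name:

```python
beer = beer.sort_values(['beer_name', 'review_taste'], ascending=[True, False])
print(beer)
```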
                   beer_name  id_brewery  ... review_taste  beerid
5143              1 A.M. Ale       12003  ...          3.0   25251
2365         10 Blocks South       26990  ...          4.0   76766
1761        15th Anniversary        1454  ...          4.5   27918
1762        15th Anniversary        1454  ...          3.5   27918
1742  2001 - A Stout Odyssey        1454  ...          3.0   55096
...                      ...         ...  ...          ...     ...
2498          Yakima Monster       23246  ...          4.5   61417
2626      Yankee Clipper IPA       15636  ...          4.5   38212
926         Yellowjacket Ale        2724  ...          3.0   69087
4891         YouEnjoyMyStout          14  ...          4.0   21436
2652           Zhigulovskoye        1746  ...          3.0   22746
[5998 rows x 9 columns]
Often we have to deal with duplicated values. We can, for example, check whether any values are duplicated.
Are there any duplicated values within a column?
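A sketch using .duplicated(), whose .any() tells us whether a column contains repeated values:

```python
print("Are there beers with the same name?")
print(beer['beer_name'].duplicated().any())
print("Are there breweries with the same name?")
print(beer['name_brewery'].duplicated().any())
```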
Are there beers with the same name?
True
Are there breweries with the same name?
True
To remove duplicates, we can use drop_duplicates
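For example:

```python
print(beer.drop_duplicates().head())
```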
                   beer_name  id_brewery  ... review_taste  beerid
5143              1 A.M. Ale       12003  ...          3.0   25251
2365         10 Blocks South       26990  ...          4.0   76766
1761        15th Anniversary        1454  ...          4.5   27918
1762        15th Anniversary        1454  ...          3.5   27918
1742  2001 - A Stout Odyssey        1454  ...          3.0   55096
[5 rows x 9 columns]
This keeps unique combinations of values across all columns. We can check, for example, how the number of rows compares with the original dataset.
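A sketch of the comparison:

```python
print(len(beer))                    # rows in the original data
print(len(beer.drop_duplicates())) # rows after removing duplicated rows
```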
We can also remove duplicates from a single column
The keep='first' argument keeps the first entry when we encounter duplicates.
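A sketch, keeping one row per beer name:

```python
unique_beers = beer.drop_duplicates(subset='beer_name', keep='first')
```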
Applying filters is useful for understanding how our data is distributed. To apply a filter, we add a condition on the data.
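A sketch; the 4.5 threshold is an arbitrary example:

```python
print(beer[beer['review_taste'] >= 4.5])
```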
We can also apply multiple filters.
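Conditions are combined with & (and) or | (or), each wrapped in parentheses; the thresholds below are again arbitrary examples:

```python
print(beer[(beer['review_taste'] >= 4.5) & (beer['overall_review'] >= 4.5)])
```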
We can count the items within categories in the following way:
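For example, counting reviews per brewery:

```python
print(beer['name_brewery'].value_counts())
```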
name_brewery
Hoppin' Frog Brewery                              3113
Caldera Brewing Company                            889
Amstel Brouwerij B. V.                             570
Moon River Brewing Company                         209
Broad Ripple Brew Pub                              152
                                                  ... 
Brasserie La Binchoise (Malterie des Remparts)       1
Pivovar Groll                                        1
Asmara Breweries                                     1
Stadtbrauerei Arnstadt                               1
Voronezhsky Brewery                                  1
Name: count, Length: 73, dtype: int64
We could also use .between() as another way to filter the data.
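A sketch; .between(3, 4) keeps rows whose taste review lies between 3 and 4, inclusive by default:

```python
print(beer[beer['review_taste'].between(3, 4)])
```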
So far we have seen how to perform filters on the content.
If we want to select rows by position or by label, we use the methods .iloc[] and .loc[].
.iloc[] selects records by their numeric position.
.loc[] selects records by their index label.
The following line selects the third entry.
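For example:

```python
beer.iloc[2]   # positions are zero-based, so 2 is the third row
```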
We can also select the third to the sixth observation
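The end of the slice is exclusive:

```python
beer.iloc[2:6]   # positions 2 through 5, i.e. the third to the sixth rows
```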
We can also be more specific about which rows we want to filter and which elements we want extracted.
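A sketch with .loc[] and a boolean condition; the brewery name is an assumption, chosen because the 152-row result below matches the count we saw earlier for Broad Ripple Brew Pub:

```python
print(beer.loc[beer['name_brewery'] == 'Broad Ripple Brew Pub'])
```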
                                              beer_name  ...  beerid
1761                                   15th Anniversary  ...   27918
1762                                   15th Anniversary  ...   27918
1742                             2001 - A Stout Odyssey  ...   55096
1714  Ackerman's Imperial Double Stout (Indiana Repl...  ...   66889
1701  Ackerman's Imperial Double Stout (Winterfest R...  ...   66430
...                                                 ...  ...     ...
1768                                     Wobbly Bob APA  ...   15660
1770                                     Wobbly Bob APA  ...   15660
1771                                     Wobbly Bob APA  ...   15660
1772                                     Wobbly Bob APA  ...   15660
1769                                     Wobbly Bob APA  ...   15660
[152 rows x 9 columns]
We can easily drop observations in the following way:
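A sketch that drops the rows selected above by their index labels:

```python
beer_dropped = beer.drop(beer[beer['name_brewery'] == 'Broad Ripple Brew Pub'].index)
print(beer_dropped)
```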
               beer_name  id_brewery  ... review_taste  beerid
5143          1 A.M. Ale       12003  ...          3.0   25251
2365     10 Blocks South       26990  ...          4.0   76766
2757    Akron Dopple Alt       16994  ...          5.0   44875
2756    Akron Dopple Alt       16994  ...          3.5   44875
733           Alpha Beta        1075  ...          4.0   54723
...                  ...         ...  ...          ...     ...
2498      Yakima Monster       23246  ...          4.5   61417
2626  Yankee Clipper IPA       15636  ...          4.5   38212
926     Yellowjacket Ale        2724  ...          3.0   69087
4891     YouEnjoyMyStout          14  ...          4.0   21436
2652       Zhigulovskoye        1746  ...          3.0   22746
[5846 rows x 9 columns]
We can look at the difference in the number of observations:
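For example:

```python
print(len(beer) - len(beer_dropped))   # 5998 - 5846 = 152
```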
We can also aggregate or perform operations on already-existing columns in the following way:
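A sketch that averages the four review columns into a new review_avg column:

```python
review_cols = ['overall_review', 'aroma_review', 'appearance_review', 'review_taste']
beer['review_avg'] = beer[review_cols].mean(axis=1)
print(beer.head())
```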
                   beer_name  id_brewery  ... beerid  review_avg
5143              1 A.M. Ale       12003  ...  25251        3.75
2365         10 Blocks South       26990  ...  76766        3.50
1761        15th Anniversary        1454  ...  27918        4.00
1762        15th Anniversary        1454  ...  27918        3.25
1742  2001 - A Stout Odyssey        1454  ...  55096        3.75
[5 rows x 10 columns]
Let’s say that we have one new row
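A sketch; apart from the beer name, the values below are assumptions for illustration:

```python
new_beer = pd.DataFrame([{'beer_name': 'Corona',
                          'id_brewery': None,
                          'beer_style': 'Lager',    # assumed value
                          'overall_review': 4.0,    # assumed value
                          'aroma_review': 3.0}])    # assumed value
```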
We can now add this new row to our data
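A sketch using pd.concat(); beer2 is the name the text below uses for the combined data:

```python
beer2 = pd.concat([beer, new_beer])
print(beer2)
```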
                   beer_name id_brewery  ...   beerid review_avg
5143              1 A.M. Ale      12003  ...  25251.0       3.75
2365         10 Blocks South      26990  ...  76766.0       3.50
1761        15th Anniversary       1454  ...  27918.0       4.00
1762        15th Anniversary       1454  ...  27918.0       3.25
1742  2001 - A Stout Odyssey       1454  ...  55096.0       3.75
...                      ...        ...  ...      ...        ...
2626      Yankee Clipper IPA      15636  ...  38212.0       4.25
926         Yellowjacket Ale       2724  ...  69087.0       3.25
4891         YouEnjoyMyStout         14  ...  21436.0       4.00
2652           Zhigulovskoye       1746  ...  22746.0       3.25
0                     Corona       None  ...      NaN        NaN
[5999 rows x 10 columns]
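We can compare the columns of the two frames by printing them as sets:

```python
print(set(new_beer.columns))
print(set(beer2.columns))
```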
{'beer_name', 'id_brewery', 'beer_style', 'overall_review', 'aroma_review'}
{'beerid', 'review_taste', 'beer_name', 'id_brewery', 'beer_style', 'appearance_review', 'review_avg', 'overall_review', 'aroma_review', 'name_brewery'}
We can see that the columns that did not exist in new_beer were filled with NaNs. To see the columns that are in beer2 but not in new_beer, we can take a set difference:
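```python
print(set(beer2.columns) - set(new_beer.columns))
```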
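As another counting example, we can look at the most common beer styles; a sketch consistent with the output below:

```python
print(beer['beer_style'].value_counts().head())
```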
beer_style
Russian Imperial Stout            1561
American IPA                       791
American Double / Imperial IPA     493
Light Lager                        466
American Pale Ale (APA)            304
Name: count, dtype: int64
We can also count how many reviews gave a perfect score of 5 across all four review columns:

```python
# Select the relevant review columns
review_columns = ['overall_review', 'aroma_review', 'appearance_review', 'review_taste']
# Create a boolean mask where all selected review columns have a score of 5
mask = (beer[review_columns] == 5).all(axis=1)
# print(mask)
# Count the number of rows that meet the condition
beers_with_perfect_scores = mask.sum()
print(beers_with_perfect_scores)
```
19
Here is some data:
```python
movies_data = {
    "title": [
        "Inception", "The Dark Knight", "Pulp Fiction", "The Matrix", "Forrest Gump",
        "The Godfather", "Fight Club", "Interstellar", "The Shawshank Redemption", "The Social Network",
        "Whiplash", "Parasite", "Gladiator", "Toy Story", "The Grand Budapest Hotel"
    ],
    "genre": [
        "Sci-Fi", "Action", "Crime", "Sci-Fi", "Drama",
        "Crime", "Drama", "Sci-Fi", "Drama", "Drama",
        "Drama", "Thriller", "Action", "Animation", "Comedy"
    ],
    "rating": [4.8, 4.9, 4.7, 4.8, 4.6, 4.9, 4.5, 4.7, 4.9, 4.3, 4.6, 4.8, 4.7, 4.4, 4.5],
    "votes": [2500000, 2300000, 1800000, 1900000, 2100000, 2000000, 1700000, 2200000, 2400000, 1500000, 1400000, 1600000, 1800000, 1300000, 1200000],
    "critic_score": [94, 98, 96, 87, 92, 100, 88, 85, 99, 93, 95, 100, 90, 89, 91],
    "user_score": [9.5, 9.7, 9.4, 9.0, 8.9, 10.0, 8.7, 8.5, 9.8, 8.6, 9.2, 10.0, 8.8, 8.3, 8.5],
    "year": [2010, 2008, 1994, 1999, 1994, 1972, 1999, 2014, 1994, 2010, 2014, 2019, 2000, 1995, 2014]
}
# build a DataFrame from the dictionary (the variable name movies is an assumption)
movies = pd.DataFrame(movies_data)
```

Pandas can be used for similar operations to those that exist in NumPy.
For example, we can calculate the average of reviews
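A minimal sketch, assuming we take the mean of one of the beer review columns:

```python
print(beer['overall_review'].mean())
```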
Let’s say that we are interested in the correlation among multiple variables
The way we would do this is the following:
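A sketch with .corr() over the review columns:

```python
review_cols = ['overall_review', 'aroma_review', 'appearance_review', 'review_taste']
print(beer[review_cols].corr())
```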
|  | overall_review | aroma_review | appearance_review | review_taste |
|---|---|---|---|---|
| overall_review | 1.000000 | 0.675650 | 0.570245 | 0.819431 |
| aroma_review | 0.675650 | 1.000000 | 0.632493 | 0.774227 |
| appearance_review | 0.570245 | 0.632493 | 1.000000 | 0.616572 |
| review_taste | 0.819431 | 0.774227 | 0.616572 | 1.000000 |
The main operations that we can perform to reshape data are pivoting and melting. Let us assume that we want to change the structure of the following data:
    Item   CType USD  EU
0  Item0    Gold  1$  1€
1  Item0  Bronze  2$  2€
2  Item1    Gold  3$  3€
3  Item1  Silver  4$  4€
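A sketch that builds this frame and pivots it so that each CType becomes a column; the variable name df is an assumption:

```python
df = pd.DataFrame({'Item':  ['Item0', 'Item0', 'Item1', 'Item1'],
                   'CType': ['Gold', 'Bronze', 'Gold', 'Silver'],
                   'USD':   ['1$', '2$', '3$', '4$'],
                   'EU':    ['1€', '2€', '3€', '4€']})

# one row per Item, one USD column per CType:
# CType Bronze Gold Silver
# Item0     2$   1$    NaN
# Item1    NaN   3$     4$
print(df.pivot(index='Item', columns='CType', values='USD'))
```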
The opposite of pivoting is melting.
Let us examine the following dataframe:
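A sketch that builds this frame, with the values copied from the printout below:

```python
df = pd.DataFrame({'weekday': ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                               'Friday', 'Saturday', 'Sunday'],
                   'Person 1': [12, 6, 5, 8, 11, 6, 4],
                   'Person 2': [10, 6, 11, 5, 8, 9, 12],
                   'Person 3': [8, 5, 7, 3, 7, 11, 15]})
print(df)
```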
     weekday  Person 1  Person 2  Person 3
0     Monday        12        10         8
1    Tuesday         6         6         5
2  Wednesday         5        11         7
3   Thursday         8         5         3
4     Friday        11         8         7
5   Saturday         6         9        11
6     Sunday         4        12        15
In this case, what we want is to have the following columns: weekday, Person, Score.
The way we do this is the following:
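A sketch using .melt(); weekday stays as an identifier, and the person columns are stacked into Person/Score pairs:

```python
melted = df.melt(id_vars='weekday', var_name='Person', value_name='Score')
print(melted)
```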
      weekday    Person  Score
0      Monday  Person 1     12
1     Tuesday  Person 1      6
2   Wednesday  Person 1      5
3    Thursday  Person 1      8
4      Friday  Person 1     11
5    Saturday  Person 1      6
6      Sunday  Person 1      4
7      Monday  Person 2     10
8     Tuesday  Person 2      6
9   Wednesday  Person 2     11
10   Thursday  Person 2      5
11     Friday  Person 2      8
12   Saturday  Person 2      9
13     Sunday  Person 2     12
14     Monday  Person 3      8
15    Tuesday  Person 3      5
16  Wednesday  Person 3      7
17   Thursday  Person 3      3
18     Friday  Person 3      7
19   Saturday  Person 3     11
20     Sunday  Person 3     15
Look at the following dataframe:
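A sketch that builds this frame, with the values copied from the printout below; the variable name weather is an assumption:

```python
weather = pd.DataFrame({'Date': ['2024-08-10'] * 3 + ['2024-08-11'] * 3,
                        'City': ['New York', 'Chicago', 'Miami'] * 2,
                        'Temp': [85, 78, 90, 88, 76, 92],
                        'Hum':  [70, 65, 80, 72, 68, 82]})
print(weather)
```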
         Date      City  Temp  Hum
0  2024-08-10  New York    85   70
1  2024-08-10   Chicago    78   65
2  2024-08-10     Miami    90   80
3  2024-08-11  New York    88   72
4  2024-08-11   Chicago    76   68
5  2024-08-11     Miami    92   82
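A sketch that would produce the wide layout below: pivot on Date and City, then flatten the resulting (variable, city) column MultiIndex into names like Chicago_Temp:

```python
wide = weather.pivot(index='Date', columns='City', values=['Temp', 'Hum'])
# flatten the (variable, city) MultiIndex into single column names
wide.columns = [f"{city}_{var}" for var, city in wide.columns]
print(wide)
```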
            Chicago_Temp  Miami_Temp  New York_Temp  Chicago_Hum  Miami_Hum  \
Date                                                                          
2024-08-10            78          90             85           65         80   
2024-08-11            76          92             88           68         82   
            New York_Hum  
Date                      
2024-08-10            70  
2024-08-11            72  
Consider the following DataFrame representing survey responses from three individuals about their preferences for different fruits:
  Person   Fruit  Preference
0  Alice   Apple           5
1    Bob   Apple           2
2  Carol   Apple           4
3  Alice  Banana           3
4    Bob  Banana           5
5  Carol  Banana           4
6  Alice  Orange           4
7    Bob  Orange           3
8  Carol  Orange           5
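One natural reshape (an assumed exercise) is a wide table with one row per person and one column per fruit; a sketch, rebuilding the frame first:

```python
survey = pd.DataFrame({'Person': ['Alice', 'Bob', 'Carol'] * 3,
                       'Fruit': ['Apple'] * 3 + ['Banana'] * 3 + ['Orange'] * 3,
                       'Preference': [5, 2, 4, 3, 5, 4, 4, 3, 5]})
print(survey.pivot(index='Person', columns='Fruit', values='Preference'))
```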
Data Wrangling Tools: Pandas simplifies renaming, reindexing, filtering, and modifying data structures for effective data wrangling.
Handling Duplicates: Provides efficient methods to detect and remove duplicates at the row or column level.
Reshaping Data: Enables pivoting and melting to transform data structures for specific analytical needs.
Popescu (JCU): Lecture 9