In this lecture, we will cover steps related to data wrangling. Data wrangling entails steps such as renaming and reindexing variables, handling duplicates, filtering, and reshaping data.
The first thing to do is to import the libraries we need (pandas, plus numpy for numerical operations):
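import pandas as pd
import numpy as np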
Let us have a look at a mock dataset.
Download it and place it within the week’s folder.
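A minimal sketch for loading it, assuming the file is saved as beer_reviews.csv in the week's folder:

# Load the mock dataset and inspect the first rows
beer = pd.read_csv('beer_reviews.csv')
print(beer.head())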
brewery_id brewery_name ... beer_name beer_beerid
0 10325 Vecchio Birraio ... Sausa Weizen 47986
1 10325 Vecchio Birraio ... Red Moon 48213
2 10325 Vecchio Birraio ... Black Horse Black Beer 48215
3 10325 Vecchio Birraio ... Sausa Pils 47969
4 1075 Caldera Brewing Company ... Cauldron DIPA 64883
[5 rows x 9 columns]
In this example, we can use the beer names as indices. The way we do this is by setting beer_name as the index:
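# Use the beer names as the row index
beer = beer.set_index('beer_name')
print(beer.head(3))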
brewery_id brewery_name ... review_taste beer_beerid
beer_name ...
Sausa Weizen 10325 Vecchio Birraio ... 1.5 47986
Red Moon 10325 Vecchio Birraio ... 3.0 48213
Black Horse Black Beer 10325 Vecchio Birraio ... 3.0 48215
[3 rows x 8 columns]
We can turn back to the original data by resetting the index:
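# Turn beer_name back into a regular column
beer = beer.reset_index()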
We can change the names of the columns in the following way:
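# Inspect the current column names
print(beer.columns)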
Index(['beer_name', 'brewery_id', 'brewery_name', 'review_overall',
'review_aroma', 'review_appearance', 'beer_style', 'review_taste',
'beer_beerid'],
dtype='object')
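We can then assign a new list of names in the same order (the exact new names below are an assumption, chosen to match the output shown later):

# Assign new names; the list must match the number and order of existing columns
beer.columns = ['beer_name', 'id_brewery', 'name_brewery', 'overall_review',
                'aroma_review', 'appearance_review', 'beer_style', 'review_taste',
                'beerid']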
We can still change the names of our variables by using a dictionary.
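For example, starting from the original names, a sketch of rename with a dictionary (the mapping is an assumption, consistent with the names used later):

# Map old column names to new ones; unlisted columns are left unchanged
beer = beer.rename(columns={'brewery_id': 'id_brewery',
                            'brewery_name': 'name_brewery',
                            'beer_beerid': 'beerid'})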
Let us look at our data:
We can order the names in a specific column alphabetically:
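# Sort rows alphabetically by beer name
beer = beer.sort_values('beer_name')
print(beer.head())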
beer_name id_brewery ... review_taste beerid
5143 1 A.M. Ale 12003 ... 3.0 25251
2365 10 Blocks South 26990 ... 4.0 76766
1761 15th Anniversary 1454 ... 4.5 27918
1762 15th Anniversary 1454 ... 3.5 27918
1742 2001 - A Stout Odyssey 1454 ... 3.0 55096
[5 rows x 9 columns]
We can also order values based on multiple columns. The code first sorts the dataframe by beer_name and then sorts rows by review_taste; True means ascending order, while False means descending order.
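A sketch:

# Sort by name (ascending), then by taste within each name (descending)
beer = beer.sort_values(['beer_name', 'review_taste'], ascending=[True, False])
print(beer)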
beer_name id_brewery ... review_taste beerid
5143 1 A.M. Ale 12003 ... 3.0 25251
2365 10 Blocks South 26990 ... 4.0 76766
1761 15th Anniversary 1454 ... 4.5 27918
1762 15th Anniversary 1454 ... 3.5 27918
1742 2001 - A Stout Odyssey 1454 ... 3.0 55096
... ... ... ... ... ...
2498 Yakima Monster 23246 ... 4.5 61417
2626 Yankee Clipper IPA 15636 ... 4.5 38212
926 Yellowjacket Ale 2724 ... 3.0 69087
4891 YouEnjoyMyStout 14 ... 4.0 21436
2652 Zhigulovskoye 1746 ... 3.0 22746
[5998 rows x 9 columns]
Many times we deal with duplicated values. We can, for example, check whether any values are duplicated.
Are there any duplicated values within a column?
Are there beers with the same name?
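# Does any beer name appear more than once?
print(beer['beer_name'].duplicated().any())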
True
Are there breweries with the same name?
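# Does any brewery name appear more than once?
print(beer['name_brewery'].duplicated().any())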
True
To remove duplicates, we can use drop_duplicates:
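# Drop rows that are exact duplicates across all columns
beer = beer.drop_duplicates()
print(beer.head())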
beer_name id_brewery ... review_taste beerid
5143 1 A.M. Ale 12003 ... 3.0 25251
2365 10 Blocks South 26990 ... 4.0 76766
1761 15th Anniversary 1454 ... 4.5 27918
1762 15th Anniversary 1454 ... 3.5 27918
1742 2001 - A Stout Odyssey 1454 ... 3.0 55096
[5 rows x 9 columns]
This keeps unique combinations of variable values.
We can, for example, compare the number of rows with the original dataset:
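# Row counts before and after dropping exact duplicates
print(len(beer), len(beer.drop_duplicates()))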
We can also remove duplicates based on a single column. The keep='first' argument keeps the first entry when duplicates are encountered.
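A sketch:

# Keep only the first review of each beer name
unique_beers = beer.drop_duplicates(subset='beer_name', keep='first')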
Applying filters is useful for understanding how our data are distributed. To apply a filter, we pass a boolean condition to the dataframe. We can also combine multiple filters.
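A sketch (the thresholds and the style used here are arbitrary examples):

# Single condition: keep beers with a taste review above 4
tasty = beer[beer['review_taste'] > 4]

# Multiple conditions, combined with & (and) or | (or); note the parentheses
tasty_stouts = beer[(beer['review_taste'] > 4) & (beer['beer_style'] == 'Russian Imperial Stout')]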
We can count the items within categories in the following way:
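# Number of reviews per brewery
print(beer['name_brewery'].value_counts())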
name_brewery
Hoppin' Frog Brewery 3113
Caldera Brewing Company 889
Amstel Brouwerij B. V. 570
Moon River Brewing Company 209
Broad Ripple Brew Pub 152
...
Brasserie La Binchoise (Malterie des Remparts) 1
Pivovar Groll 1
Asmara Breweries 1
Stadtbrauerei Arnstadt 1
Voronezhsky Brewery 1
Name: count, Length: 73, dtype: int64
We could also use between as another way to filter information:
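# Keep reviews whose taste score lies between 3 and 4 (inclusive)
mid_range = beer[beer['review_taste'].between(3, 4)]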
So far we have seen how to filter on content. If we want to select rows by position or by label, we use the methods .iloc[] and .loc[]: .iloc[] selects records by their numeric position, while .loc[] selects them through the index label.
The following line selects the third entry.
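# Select the third row by numeric position (positions start at 0)
print(beer.iloc[2])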
We can also select the third to the sixth observation:
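# Select rows at positions 2 through 5 (the endpoint is exclusive)
print(beer.iloc[2:6])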
We can also be more specific about which rows we want to filter, and about which elements we want extracted:
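For example, selecting every review of a single brewery (the brewery is taken from the counts above):

# Label-based selection with a boolean condition
broad_ripple = beer.loc[beer['name_brewery'] == 'Broad Ripple Brew Pub']
print(broad_ripple)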
beer_name ... beerid
1761 15th Anniversary ... 27918
1762 15th Anniversary ... 27918
1742 2001 - A Stout Odyssey ... 55096
1714 Ackerman's Imperial Double Stout (Indiana Repl... ... 66889
1701 Ackerman's Imperial Double Stout (Winterfest R... ... 66430
... ... ... ...
1768 Wobbly Bob APA ... 15660
1770 Wobbly Bob APA ... 15660
1771 Wobbly Bob APA ... 15660
1772 Wobbly Bob APA ... 15660
1769 Wobbly Bob APA ... 15660
[152 rows x 9 columns]
We can easily drop observations in the following way:
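A sketch, dropping the rows we just selected:

# Remove the selected observations by their index labels
beer = beer.drop(broad_ripple.index)
print(beer)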
beer_name id_brewery ... review_taste beerid
5143 1 A.M. Ale 12003 ... 3.0 25251
2365 10 Blocks South 26990 ... 4.0 76766
2757 Akron Dopple Alt 16994 ... 5.0 44875
2756 Akron Dopple Alt 16994 ... 3.5 44875
733 Alpha Beta 1075 ... 4.0 54723
... ... ... ... ... ...
2498 Yakima Monster 23246 ... 4.5 61417
2626 Yankee Clipper IPA 15636 ... 4.5 38212
926 Yellowjacket Ale 2724 ... 3.0 69087
4891 YouEnjoyMyStout 14 ... 4.0 21436
2652 Zhigulovskoye 1746 ... 3.0 22746
[5846 rows x 9 columns]
We can then look at the difference in the number of observations before and after dropping.
We can also aggregate or perform operations on already-existing columns in the following way:
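A sketch, averaging the four review columns into a new review_avg column:

# Row-wise mean across the review columns
review_cols = ['overall_review', 'aroma_review', 'appearance_review', 'review_taste']
beer['review_avg'] = beer[review_cols].mean(axis=1)
print(beer.head())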
beer_name id_brewery ... beerid review_avg
5143 1 A.M. Ale 12003 ... 25251 3.75
2365 10 Blocks South 26990 ... 76766 3.50
1761 15th Anniversary 1454 ... 27918 4.00
1762 15th Anniversary 1454 ... 27918 3.25
1742 2001 - A Stout Odyssey 1454 ... 55096 3.75
[5 rows x 10 columns]
Let’s say that we have one new row. We can now add this new row to our data:
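A sketch with pd.concat (the values for the new row are assumptions, chosen to match the output below):

# A new observation with only a subset of the columns
new_beer = pd.DataFrame({'beer_name': ['Corona'],
                         'id_brewery': [None],
                         'beer_style': ['Light Lager'],
                         'overall_review': [3.0],
                         'aroma_review': [2.5]})

# Stack it underneath the existing data; missing columns become NaN
beer2 = pd.concat([beer, new_beer])
print(beer2)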
beer_name id_brewery ... beerid review_avg
5143 1 A.M. Ale 12003 ... 25251.0 3.75
2365 10 Blocks South 26990 ... 76766.0 3.50
1761 15th Anniversary 1454 ... 27918.0 4.00
1762 15th Anniversary 1454 ... 27918.0 3.25
1742 2001 - A Stout Odyssey 1454 ... 55096.0 3.75
... ... ... ... ... ...
2626 Yankee Clipper IPA 15636 ... 38212.0 4.25
926 Yellowjacket Ale 2724 ... 69087.0 3.25
4891 YouEnjoyMyStout 14 ... 21436.0 4.00
2652 Zhigulovskoye 1746 ... 22746.0 3.25
0 Corona None ... NaN NaN
[5999 rows x 10 columns]
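To compare the columns of the two dataframes, we can print them as sets:

# Column sets of the new row and of the combined dataframe
print(set(new_beer.columns))
print(set(beer2.columns))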
{'beer_name', 'id_brewery', 'beer_style', 'overall_review', 'aroma_review'}
{'beerid', 'review_taste', 'beer_name', 'id_brewery', 'beer_style', 'appearance_review', 'review_avg', 'overall_review', 'aroma_review', 'name_brewery'}
We can see that the columns that did not exist in new_beer got filled with NAs. To see the columns that are in new_beer, but not in beer2, we can easily perform a set difference:
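# Columns present in new_beer but not in beer2 (an empty set here,
# since every column of new_beer also exists in beer2)
print(set(new_beer.columns) - set(beer2.columns))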
As a further example, we can count the most common beer styles:
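# The five most common beer styles
print(beer['beer_style'].value_counts().head())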
beer_style
Russian Imperial Stout 1561
American IPA 791
American Double / Imperial IPA 493
Light Lager 466
American Pale Ale (APA) 304
Name: count, dtype: int64
As another exercise, we can count how many reviews gave a perfect score of 5 in every review category:
# Select the relevant review columns
review_columns = ['overall_review', 'aroma_review', 'appearance_review', 'review_taste']
# Create a boolean mask where all selected review columns have a score of 5
mask = (beer[review_columns] == 5).all(axis=1)
# Count the number of rows that meet the condition
beers_with_perfect_scores = mask.sum()
beers_with_perfect_scores
19
Here is some data:
movies_data = {
"title": [
"Inception", "The Dark Knight", "Pulp Fiction", "The Matrix", "Forrest Gump",
"The Godfather", "Fight Club", "Interstellar", "The Shawshank Redemption", "The Social Network",
"Whiplash", "Parasite", "Gladiator", "Toy Story", "The Grand Budapest Hotel"
],
"genre": [
"Sci-Fi", "Action", "Crime", "Sci-Fi", "Drama",
"Crime", "Drama", "Sci-Fi", "Drama", "Drama",
"Drama", "Thriller", "Action", "Animation", "Comedy"
],
"rating": [4.8, 4.9, 4.7, 4.8, 4.6, 4.9, 4.5, 4.7, 4.9, 4.3, 4.6, 4.8, 4.7, 4.4, 4.5],
"votes": [2500000, 2300000, 1800000, 1900000, 2100000, 2000000, 1700000, 2200000, 2400000, 1500000, 1400000, 1600000, 1800000, 1300000, 1200000],
"critic_score": [94, 98, 96, 87, 92, 100, 88, 85, 99, 93, 95, 100, 90, 89, 91],
"user_score": [9.5, 9.7, 9.4, 9.0, 8.9, 10.0, 8.7, 8.5, 9.8, 8.6, 9.2, 10.0, 8.8, 8.3, 8.5],
"year": [2010, 2008, 1994, 1999, 1994, 1972, 1999, 2014, 1994, 2010, 2014, 2019, 2000, 1995, 2014]
}
We can convert this dictionary into a dataframe with pd.DataFrame:
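# Build a dataframe from the dictionary of columns
movies = pd.DataFrame(movies_data)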
Pandas can be used for similar operations to those that exist in numpy. For example, we can calculate the average of the reviews:
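# Average overall review score (the choice of column here is an example)
print(beer['overall_review'].mean())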
Let’s say that we are interested in the correlation among multiple variables. The way we would do this is the following:
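# Pairwise correlations among the review columns
print(beer[review_cols].corr())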
                   overall_review  aroma_review  appearance_review  review_taste
overall_review           1.000000      0.675650           0.570245      0.819431
aroma_review             0.675650      1.000000           0.632493      0.774227
appearance_review        0.570245      0.632493           1.000000      0.616572
review_taste             0.819431      0.774227           0.616572      1.000000
The main reshaping operations that we can perform on data are pivoting and melting.
Let us assume that we want to change the structure of the following data:
Item CType USD EU
0 Item0 Gold 1$ 1€
1 Item0 Bronze 2$ 2€
2 Item1 Gold 3$ 3€
3 Item1 Silver 4$ 4€
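Assuming this table is stored in a dataframe called df, a pivot might look like this (combinations that do not occur become NaN):

# One row per Item, one column per CType, filled with the USD values
print(df.pivot(index='Item', columns='CType', values='USD'))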
The opposite of pivoting is melting.
Let us examine the following dataframe:
weekday Person 1 Person 2 Person 3
0 Monday 12 10 8
1 Tuesday 6 6 5
2 Wednesday 5 11 7
3 Thursday 8 5 3
4 Friday 11 8 7
5 Saturday 6 9 11
6 Sunday 4 12 15
In this case, what we want is to have the following columns: weekday, Person, and Score. The way we do this is the following:
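Assuming the table above is stored in a dataframe called scores, a sketch:

# Melt the three person columns into a Person/Score pair of columns
long_scores = pd.melt(scores, id_vars='weekday', var_name='Person', value_name='Score')
print(long_scores)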
weekday Person Score
0 Monday Person 1 12
1 Tuesday Person 1 6
2 Wednesday Person 1 5
3 Thursday Person 1 8
4 Friday Person 1 11
5 Saturday Person 1 6
6 Sunday Person 1 4
7 Monday Person 2 10
8 Tuesday Person 2 6
9 Wednesday Person 2 11
10 Thursday Person 2 5
11 Friday Person 2 8
12 Saturday Person 2 9
13 Sunday Person 2 12
14 Monday Person 3 8
15 Tuesday Person 3 5
16 Wednesday Person 3 7
17 Thursday Person 3 3
18 Friday Person 3 7
19 Saturday Person 3 11
20 Sunday Person 3 15
Look at the following dataframe:
Date City Temp Hum
0 2024-08-10 New York 85 70
1 2024-08-10 Chicago 78 65
2 2024-08-10 Miami 90 80
3 2024-08-11 New York 88 72
4 2024-08-11 Chicago 76 68
5 2024-08-11 Miami 92 82
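Assuming this table is stored in a dataframe called weather, one way to obtain the output below is to pivot on both measures and then flatten the resulting MultiIndex columns:

# One row per date, one column per (measure, city) pair
wide = weather.pivot(index='Date', columns='City', values=['Temp', 'Hum'])

# Flatten the MultiIndex columns into 'City_Measure' labels
wide.columns = [f'{city}_{measure}' for measure, city in wide.columns]
print(wide)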
Chicago_Temp Miami_Temp New York_Temp Chicago_Hum Miami_Hum \
Date
2024-08-10 78 90 85 65 80
2024-08-11 76 92 88 68 82
New York_Hum
Date
2024-08-10 70
2024-08-11 72
Consider the following DataFrame representing survey responses from three individuals about their preferences for different fruits:
Person Fruit Preference
0 Alice Apple 5
1 Bob Apple 2
2 Carol Apple 4
3 Alice Banana 3
4 Bob Banana 5
5 Carol Banana 4
6 Alice Orange 4
7 Bob Orange 3
8 Carol Orange 5
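One way to reshape these responses, assuming the dataframe is called survey, is to pivot so that each person becomes a row and each fruit a column:

# One row per person, one column per fruit, filled with the preferences
print(survey.pivot(index='Person', columns='Fruit', values='Preference'))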
Data Wrangling Tools: Pandas simplifies renaming, reindexing, filtering, and modifying data structures for effective data wrangling.
Handling Duplicates: Provides efficient methods to detect and remove duplicates at the row or column level.
Reshaping Data: Enables pivoting and melting to transform data structures for specific analytical needs.
Popescu (JCU): Lecture 9