In this lecture, we will cover steps related to data wrangling. Data wrangling could entail steps such as setting and resetting indices, renaming columns, sorting, handling duplicates, filtering, selecting rows, adding and dropping observations, reshaping data by pivoting and melting, and grouping and aggregating.
The first thing to do is to import the libraries we will use.
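A minimal sketch of the import cell (pandas is the library used throughout; the original cell is not shown):

# pandas is the main library for this lecture
import pandas as pd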
Let us have a look at a mock dataset.
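The loading step is not shown in the original; assuming the data sit in a CSV file (the file name here is hypothetical), it could look like:

# Load the beer review data and inspect the first rows
beer = pd.read_csv('beer_reviews.csv')
beer.head()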
brewery_id brewery_name ... beer_name beer_beerid
0 10325 Vecchio Birraio ... Sausa Weizen 47986
1 10325 Vecchio Birraio ... Red Moon 48213
2 10325 Vecchio Birraio ... Black Horse Black Beer 48215
3 10325 Vecchio Birraio ... Sausa Pils 47969
4 1075 Caldera Brewing Company ... Cauldron DIPA 64883
[5 rows x 9 columns]
In this example, we can use the beer names as indices. The way we do this is by setting beer_name as the index.
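A sketch of the indexing step, assuming we overwrite the working dataframe:

# Use the beer names as the row index and show the first three rows
beer = beer.set_index('beer_name')
beer.head(3)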
brewery_id brewery_name ... review_taste beer_beerid
beer_name ...
Sausa Weizen 10325 Vecchio Birraio ... 1.5 47986
Red Moon 10325 Vecchio Birraio ... 3.0 48213
Black Horse Black Beer 10325 Vecchio Birraio ... 3.0 48215
[3 rows x 8 columns]
We can turn back to the original data by resetting the index
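A sketch of the reset:

# Move beer_name back from the index into a regular column
beer = beer.reset_index()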
We can change the names of the columns in the following way:
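Before renaming, we can first inspect the current names (a sketch; the original cell is not shown):

# Show the current column names
beer.columns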
Index(['beer_name', 'brewery_id', 'brewery_name', 'review_overall',
'review_aroma', 'review_appearance', 'beer_style', 'review_taste',
'beer_beerid'],
dtype='object')
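One way to rename is then to assign a full list of new names to beer.columns; the names below are a sketch inferred from the renamed data shown later in this lecture:

# Assign a new list of names, one per column and in order
beer.columns = ['beer_name', 'id_brewery', 'name_brewery', 'overall_review',
                'aroma_review', 'appearance_review', 'beer_style',
                'review_taste', 'beerid']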
We can now check whether the column names have changed.
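A quick check (sketch):

# Confirm that the new names are in place
beer.columns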
We can also change the names of our variables by using a dictionary.
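A sketch of the dictionary-based rename, with the old-to-new mapping inferred from the outputs in this lecture (starting from the original names; columns that keep their name, such as review_taste, are simply omitted):

# Rename columns via a dictionary mapping old names to new names;
# keys that are not present are silently ignored
beer = beer.rename(columns={'brewery_id': 'id_brewery',
                            'brewery_name': 'name_brewery',
                            'review_overall': 'overall_review',
                            'review_aroma': 'aroma_review',
                            'review_appearance': 'appearance_review',
                            'beer_beerid': 'beerid'})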
Let us look at our data:
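A sketch of the display call:

# Show the first five rows with the renamed columns
beer.head()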
beer_name id_brewery ... review_taste beerid
0 Sausa Weizen 10325 ... 1.5 47986
1 Red Moon 10325 ... 3.0 48213
2 Black Horse Black Beer 10325 ... 3.0 48215
3 Sausa Pils 10325 ... 3.0 47969
4 Cauldron DIPA 1075 ... 4.5 64883
[5 rows x 9 columns]
We can order the names in a specific column alphabetically:
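A sketch, assuming we sort on beer_name:

# Sort alphabetically by beer name and show the first rows
beer.sort_values(by='beer_name').head()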
beer_name id_brewery ... review_taste beerid
5143 1 A.M. Ale 12003 ... 3.0 25251
2365 10 Blocks South 26990 ... 4.0 76766
1761 15th Anniversary 1454 ... 4.5 27918
1762 15th Anniversary 1454 ... 3.5 27918
1742 2001 - A Stout Odyssey 1454 ... 3.0 55096
[5 rows x 9 columns]
We can also order values based on multiple columns. The code below first sorts the dataframe by beer_name and then, within each name, sorts rows by review_taste. In the ascending argument, True means ascending order and False means descending order.
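A sketch of the two-column sort (assigning back so later steps keep this order):

# Sort by beer_name (ascending) and, within each name, by review_taste (descending)
beer = beer.sort_values(by=['beer_name', 'review_taste'],
                        ascending=[True, False])
beer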
beer_name id_brewery ... review_taste beerid
5143 1 A.M. Ale 12003 ... 3.0 25251
2365 10 Blocks South 26990 ... 4.0 76766
1761 15th Anniversary 1454 ... 4.5 27918
1762 15th Anniversary 1454 ... 3.5 27918
1742 2001 - A Stout Odyssey 1454 ... 3.0 55096
... ... ... ... ... ...
2498 Yakima Monster 23246 ... 4.5 61417
2626 Yankee Clipper IPA 15636 ... 4.5 38212
926 Yellowjacket Ale 2724 ... 3.0 69087
4891 YouEnjoyMyStout 14 ... 4.0 21436
2652 Zhigulovskoye 1746 ... 3.0 22746
[5998 rows x 9 columns]
Many times we have to deal with duplicated values. We can, for example, check whether any observations are duplicated.
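A sketch of the check:

# True if at least one row is a complete duplicate of another
beer.duplicated().any()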
The code above suggests that there are indeed duplicated values.
Are there any duplicated values within a column?
Are there beers with the same name?
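A sketch:

# True if at least two rows share a beer name
beer['beer_name'].duplicated().any()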
True
Are there breweries with the same name?
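Similarly:

# True if at least two rows share a brewery name
beer['name_brewery'].duplicated().any()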
True
To remove duplicates, we can use drop_duplicates
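A sketch; the name beer_nodup for the deduplicated frame is an assumption:

# Drop fully duplicated rows
beer_nodup = beer.drop_duplicates()
beer_nodup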
beer_name id_brewery ... review_taste beerid
5143 1 A.M. Ale 12003 ... 3.0 25251
2365 10 Blocks South 26990 ... 4.0 76766
1761 15th Anniversary 1454 ... 4.5 27918
1762 15th Anniversary 1454 ... 3.5 27918
1742 2001 - A Stout Odyssey 1454 ... 3.0 55096
... ... ... ... ... ...
2498 Yakima Monster 23246 ... 4.5 61417
2626 Yankee Clipper IPA 15636 ... 4.5 38212
926 Yellowjacket Ale 2724 ... 3.0 69087
4891 YouEnjoyMyStout 14 ... 4.0 21436
2652 Zhigulovskoye 1746 ... 3.0 22746
[3559 rows x 9 columns]
We can, for example, look at the number of rows in the original dataset and compare it to the dataset where we eliminated the duplicates.
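A sketch of the comparison:

print(len(beer))        # 5998 rows in the original data
print(len(beer_nodup))  # 3559 rows after dropping duplicates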
Thus, we got rid of a substantial number of observations.
We can also remove duplicates from a single column
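A sketch, deduplicating on the beer name only:

# Keep only the first occurrence of each beer name
beer.drop_duplicates(subset='beer_name', keep='first')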
beer_name ... beerid
5143 1 A.M. Ale ... 25251
2365 10 Blocks South ... 76766
1761 15th Anniversary ... 27918
1742 2001 - A Stout Odyssey ... 55096
1714 Ackerman's Imperial Double Stout (Indiana Repl... ... 66889
... ... ... ...
2498 Yakima Monster ... 61417
2626 Yankee Clipper IPA ... 38212
926 Yellowjacket Ale ... 69087
4891 YouEnjoyMyStout ... 21436
2652 Zhigulovskoye ... 22746
[486 rows x 9 columns]
The keep='first' argument allows us to keep the first entry when we encounter duplicates.
Applying filters is useful for understanding how our data are distributed. To apply a filter, we add a boolean condition to the data.
Let us say that we want to only look at beers that have a review score above 4.5.
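A sketch, assuming the condition is on the overall_review column:

# Keep only rows whose overall review is above 4.5
beer[beer['overall_review'] > 4.5]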
beer_name ... beerid
1701 Ackerman's Imperial Double Stout (Winterfest R... ... 66430
2757 Akron Dopple Alt ... 44875
2171 Altbier ... 45132
2348 Amalgamated Helles ... 50575
1628 Amarillo Lager ... 43516
... ... ... ...
3030 Wild Frog Wheat Ale ... 36862
1766 Wobbly Bob APA ... 15660
1767 Wobbly Bob APA ... 15660
1768 Wobbly Bob APA ... 15660
1772 Wobbly Bob APA ... 15660
[294 rows x 9 columns]
We can also apply multiple filters.
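The exact conditions used in the lecture are not shown; as a sketch, two hypothetical conditions combined with & (each wrapped in parentheses) might look like:

# Hypothetical pair of conditions combined with a logical AND
beer[(beer['beer_name'] == 'Amstel Light') & (beer['overall_review'] >= 3)]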
beer_name id_brewery ... review_taste beerid
1262 Amstel Light 163 ... 4.0 436
1264 Amstel Light 163 ... 4.0 436
1075 Amstel Light 163 ... 5.0 436
1025 Amstel Light 163 ... 3.0 436
1032 Amstel Light 163 ... 2.5 436
1131 Amstel Light 163 ... 5.0 436
1145 Amstel Light 163 ... 4.0 436
1136 Amstel Light 163 ... 2.0 436
[8 rows x 9 columns]
We can count the items within categories in the following way:
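A sketch, counting rows per brewery:

# Count how many observations fall in each brewery
beer['name_brewery'].value_counts()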
name_brewery
Hoppin' Frog Brewery 3113
Caldera Brewing Company 889
Amstel Brouwerij B. V. 570
Moon River Brewing Company 209
Broad Ripple Brew Pub 152
...
Brasserie La Binchoise (Malterie des Remparts) 1
Pivovar Groll 1
Asmara Breweries 1
Stadtbrauerei Arnstadt 1
Voronezhsky Brewery 1
Name: count, Length: 73, dtype: int64
We could also use between() as another way to filter information:
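A sketch; the actual bounds used in the lecture are not shown:

# between() is inclusive on both ends by default
beer[beer['overall_review'].between(3, 4)]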
beer_name ... beerid
5143 1 A.M. Ale ... 25251
2365 10 Blocks South ... 76766
1762 15th Anniversary ... 27918
1742 2001 - A Stout Odyssey ... 55096
1714 Ackerman's Imperial Double Stout (Indiana Repl... ... 66889
... ... ... ...
1801 X-Extra Pale Ale ... 32141
2498 Yakima Monster ... 61417
926 Yellowjacket Ale ... 69087
4891 YouEnjoyMyStout ... 21436
2652 Zhigulovskoye ... 22746
[3862 rows x 9 columns]
So far we have seen how to filter on content. If we want to select rows directly, we use the methods .iloc[] and .loc[]. .iloc[] selects records by their numeric position, while .loc[] selects records by their index label.
The following line selects the third entry.
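A sketch:

# iloc positions are zero-based, so position 2 is the third entry
beer.iloc[2]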
We can also select the third to the sixth observation
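A sketch:

# Positions 2 through 5; the right end of the slice is excluded
beer.iloc[2:6]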
We can also be more specific about which rows we want to filter
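For instance, we can pass an explicit (here hypothetical) list of row positions:

# Select a specific set of rows by position
beer.iloc[[2, 4, 7]]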
We can be more specific about which elements we want extracted
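The exact selection is not shown in the original; a .loc selection consistent with the 152-row output below could be:

# Label/condition-based selection of all rows for one brewery
beer.loc[beer['name_brewery'] == 'Broad Ripple Brew Pub']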
beer_name ... beerid
1761 15th Anniversary ... 27918
1762 15th Anniversary ... 27918
1742 2001 - A Stout Odyssey ... 55096
1714 Ackerman's Imperial Double Stout (Indiana Repl... ... 66889
1701 Ackerman's Imperial Double Stout (Winterfest R... ... 66430
... ... ... ...
1768 Wobbly Bob APA ... 15660
1770 Wobbly Bob APA ... 15660
1771 Wobbly Bob APA ... 15660
1772 Wobbly Bob APA ... 15660
1769 Wobbly Bob APA ... 15660
[152 rows x 9 columns]
We can easily drop observations in the following way:
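A sketch, assuming we drop one brewery's rows (the 152 dropped rows match the Broad Ripple Brew Pub count) and store the result as beer3 (an assumed name):

# Drop all observations belonging to one brewery
beer3 = beer.drop(beer[beer['name_brewery'] == 'Broad Ripple Brew Pub'].index)
beer3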
beer_name id_brewery ... review_taste beerid
5143 1 A.M. Ale 12003 ... 3.0 25251
2365 10 Blocks South 26990 ... 4.0 76766
2757 Akron Dopple Alt 16994 ... 5.0 44875
2756 Akron Dopple Alt 16994 ... 3.5 44875
733 Alpha Beta 1075 ... 4.0 54723
... ... ... ... ... ...
2498 Yakima Monster 23246 ... 4.5 61417
2626 Yankee Clipper IPA 15636 ... 4.5 38212
926 Yellowjacket Ale 2724 ... 3.0 69087
4891 YouEnjoyMyStout 14 ... 4.0 21436
2652 Zhigulovskoye 1746 ... 3.0 22746
[5846 rows x 9 columns]
We can look at the difference between the number of observations before and after dropping.
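A sketch:

# 5998 - 5846 = 152 dropped observations
print(len(beer) - len(beer3))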
We can also aggregate or perform operations on already-existing columns in the following way:
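A sketch; review_avg appears to be the row-wise mean of the four review columns, and the name beer2 is assumed from the text further below:

beer2 = beer.copy()
# Average the four review columns across each row
beer2['review_avg'] = beer2[['overall_review', 'aroma_review',
                             'appearance_review', 'review_taste']].mean(axis=1)
beer2.head()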
beer_name id_brewery ... beerid review_avg
5143 1 A.M. Ale 12003 ... 25251 3.75
2365 10 Blocks South 26990 ... 76766 3.50
1761 15th Anniversary 1454 ... 27918 4.00
1762 15th Anniversary 1454 ... 27918 3.25
1742 2001 - A Stout Odyssey 1454 ... 55096 3.75
[5 rows x 10 columns]
Let’s say that we have one new row
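A sketch of the new row, reconstructed from the appended output below:

# A one-row dataframe; columns missing here will become NAs after appending
new_beer = pd.DataFrame({'beer_name': ['Corona'], 'id_brewery': [None]},
                        index=['Corona'])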
We can now add this new row to our data
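A sketch using pd.concat:

# Append the new row to the data
pd.concat([beer2, new_beer])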
beer_name id_brewery ... beerid review_avg
5143 1 A.M. Ale 12003 ... 25251.0 3.75
2365 10 Blocks South 26990 ... 76766.0 3.50
1761 15th Anniversary 1454 ... 27918.0 4.00
1762 15th Anniversary 1454 ... 27918.0 3.25
1742 2001 - A Stout Odyssey 1454 ... 55096.0 3.75
... ... ... ... ... ...
2626 Yankee Clipper IPA 15636 ... 38212.0 4.25
926 Yellowjacket Ale 2724 ... 69087.0 3.25
4891 YouEnjoyMyStout 14 ... 21436.0 4.00
2652 Zhigulovskoye 1746 ... 22746.0 3.25
Corona Corona None ... NaN NaN
[5999 rows x 10 columns]
We can see that the columns that did not exist in new_beer got filled with NAs. To see the columns that are in new_beer, but not in beer2, we can do the following:
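A sketch using Index.difference (with the new_beer above the result is empty, since new_beer only holds a subset of the columns; swapping the operands lists the columns that were filled with NAs):

# Columns present in new_beer but absent from beer2
new_beer.columns.difference(beer2.columns)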
# Get the counts of each beer style
beer_style_counts = beer['beer_style'].value_counts()
# Get the top 5 most common beer styles
top_5_beer_styles = beer_style_counts.head(5)
print(top_5_beer_styles)
beer_style
Russian Imperial Stout 1561
American IPA 791
American Double / Imperial IPA 493
Light Lager 466
American Pale Ale (APA) 304
Name: count, dtype: int64
# Select the relevant review columns
review_columns = ['overall_review', 'aroma_review', 'appearance_review', 'review_taste']
# Create a boolean mask where all selected review columns have a score of 5
mask = (beer[review_columns] == 5).all(axis=1)
# Count the number of rows that meet the condition
beers_with_perfect_scores = mask.sum()
beers_with_perfect_scores
19
Pandas can be used for many of the same operations that exist in numpy.
For example, we can calculate the average of reviews
Other similar operations are min and max
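A sketch, using the overall review column:

print(beer['overall_review'].mean())  # average review
print(beer['overall_review'].min())   # lowest review
print(beer['overall_review'].max())   # highest review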
Let’s say that we are interested in the correlation among multiple variables
The way we would do this is the following:
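A sketch, selecting the four review columns and computing their pairwise correlations:

review_cols = ['overall_review', 'aroma_review', 'appearance_review', 'review_taste']
beer[review_cols].corr()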
                   overall_review  aroma_review  appearance_review  review_taste
overall_review           1.000000      0.675650           0.570245      0.819431
aroma_review             0.675650      1.000000           0.632493      0.774227
appearance_review        0.570245      0.632493           1.000000      0.616572
review_taste             0.819431      0.774227           0.616572      1.000000
The main operations that we can perform to change the structure of our data are pivoting and melting.
Let us assume that we want to change the structure of the data. The way we pivot a table is the following:
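The table being pivoted is not shown in the original; as a hypothetical sketch, pivoting a small long-format frame could look like:

# Hypothetical long-format data
long_df = pd.DataFrame({'student': ['Ana', 'Ana', 'Ben', 'Ben'],
                        'subject': ['Math', 'History', 'Math', 'History'],
                        'grade': [90, 85, 78, 88]})
# One row per student, one column per subject
long_df.pivot(index='student', columns='subject', values='grade')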
The opposite of pivoting is melting.
Let us examine the following dataframe:
data = {'weekday': ["Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"],
'Person 1': [12, 6, 5, 8, 11, 6, 4],
'Person 2': [10, 6, 11, 5, 8, 9, 12],
'Person 3': [8, 5, 7, 3, 7, 11, 15]}
df = pd.DataFrame(data, columns=['weekday',
'Person 1', 'Person 2', 'Person 3'])
df
weekday Person 1 Person 2 Person 3
0 Monday 12 10 8
1 Tuesday 6 6 5
2 Wednesday 5 11 7
3 Thursday 8 5 3
4 Friday 11 8 7
5 Saturday 6 9 11
6 Sunday 4 12 15
In this case, what we want is to have the following columns: weekday, Person, and Score. The way we do this is the following:
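A sketch of the melt call, matching the output below:

# weekday stays as an identifier; the person columns are unpivoted
pd.melt(df, id_vars='weekday', var_name='Person', value_name='Score')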
weekday Person Score
0 Monday Person 1 12
1 Tuesday Person 1 6
2 Wednesday Person 1 5
3 Thursday Person 1 8
4 Friday Person 1 11
5 Saturday Person 1 6
6 Sunday Person 1 4
7 Monday Person 2 10
8 Tuesday Person 2 6
9 Wednesday Person 2 11
10 Thursday Person 2 5
11 Friday Person 2 8
12 Saturday Person 2 9
13 Sunday Person 2 12
14 Monday Person 3 8
15 Tuesday Person 3 5
16 Wednesday Person 3 7
17 Thursday Person 3 3
18 Friday Person 3 7
19 Saturday Person 3 11
20 Sunday Person 3 15
Look at the following dataframe:
data = {
'Date': ['2024-08-10', '2024-08-10', '2024-08-10', '2024-08-11', '2024-08-11', '2024-08-11'],
'City': ['New York', 'Chicago', 'Miami', 'New York', 'Chicago', 'Miami'],
'Temp': [85, 78, 90, 88, 76, 92],
'Hum': [70, 65, 80, 72, 68, 82]
}
df = pd.DataFrame(data)
df
Date City Temp Hum
0 2024-08-10 New York 85 70
1 2024-08-10 Chicago 78 65
2 2024-08-10 Miami 90 80
3 2024-08-11 New York 88 72
4 2024-08-11 Chicago 76 68
5 2024-08-11 Miami 92 82
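The pivoting cell is not shown in the original; a sketch that matches the output below pivots on Date and City and then flattens the resulting MultiIndex columns (the flattening step is an assumption):

# Pivot so each (City, variable) pair becomes its own column
df_pivot = df.pivot(index='Date', columns='City')
# Flatten the MultiIndex columns into 'City_Variable' labels
df_pivot.columns = [f'{city}_{var}' for var, city in df_pivot.columns]
df_pivot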
Chicago_Temp Miami_Temp New York_Temp Chicago_Hum Miami_Hum \
Date
2024-08-10 78 90 85 65 80
2024-08-11 76 92 88 68 82
New York_Hum
Date
2024-08-10 70
2024-08-11 72
Consider the following DataFrame representing survey responses from three individuals about their preferences for different fruits:
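The construction code is not shown; a sketch that reproduces the frame:

survey = pd.DataFrame({'Person': ['Alice', 'Bob', 'Carol'] * 3,
                       'Fruit': ['Apple'] * 3 + ['Banana'] * 3 + ['Orange'] * 3,
                       'Preference': [5, 2, 4, 3, 5, 4, 4, 3, 5]})
survey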
Person Fruit Preference
0 Alice Apple 5
1 Bob Apple 2
2 Carol Apple 4
3 Alice Banana 3
4 Bob Banana 5
5 Carol Banana 4
6 Alice Orange 4
7 Bob Orange 3
8 Carol Orange 5
In some instances, we might be interested in performing operations by group.
For example, we might be interested in counting the number of observations by group.
Let’s say that we want to know how many beers there are per beer style.
The following will create a group object.
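A sketch; the name grouped_by_style is taken from the code later in this section:

# Group the data by beer style (no computation happens yet)
grouped_by_style = beer.groupby('beer_style')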
If we actually want to see the number of beers by style, we would do the following
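A sketch:

# Number of observations in each beer style
grouped_by_style.size()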
We could also rearrange the number of beers by style in the following way:
#Creating a dataframe
df = pd.DataFrame(grouped_by_style.size())
#Renaming the column
df = df.rename(columns={0: 'count'})
#Sorting
df_sorted = df.sort_values(by='count', ascending=False)
df_sorted
count
beer_style
Russian Imperial Stout 1561
American IPA 791
American Double / Imperial IPA 493
Light Lager 466
American Pale Ale (APA) 304
... ...
Japanese Rice Lager 1
Flanders Red Ale 1
California Common / Steam Beer 1
Bière de Garde 1
Weizenbock 1
[80 rows x 1 columns]
If we are interested in the average review score for every beer style, we would do:
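A sketch, assuming the overall_review column:

# Average overall review per beer style
beer.groupby('beer_style')['overall_review'].mean()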
beer_style
Altbier 3.846154
American Adjunct Lager 3.642857
American Amber / Red Ale 3.792517
American Amber / Red Lager 3.071429
American Barleywine 3.722222
...
Tripel 3.777778
Vienna Lager 3.500000
Weizenbock 4.000000
Winter Warmer 3.457364
Witbier 3.615385
Name: overall_review, Length: 80, dtype: float64
We could also calculate the average review for more than one category
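A sketch grouping on two categories, matching the output below:

# Average overall review per (beer style, brewery) pair
beer.groupby(['beer_style', 'name_brewery'])['overall_review'].mean()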
beer_style name_brewery
Altbier Bakushukan Clair (Kure Beer) 4.000000
Bluegrass Brewing Co. 3.766667
Bluegrass Brewing Co. - East St. Matthew's 4.000000
Broad Ripple Brew Pub 4.000000
Clockwork Beer Co, The 4.000000
...
Winter Warmer Hoppin' Frog Brewery 3.444000
NoDa Brewing Company 4.000000
Witbier Broad Ripple Brew Pub 2.833333
Frog at Bercy Village, The 4.000000
Moon River Brewing Company 3.842105
Name: overall_review, Length: 375, dtype: float64
Another option for working with grouped data is to use the agg function. This allows us to apply multiple functions to one group.
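A sketch matching the output below:

# Apply several functions at once to each group
beer.groupby(['beer_style', 'name_brewery'])['overall_review'].agg(['mean', 'count'])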
mean count
beer_style name_brewery
Altbier Bakushukan Clair (Kure Beer) 4.000000 1
Bluegrass Brewing Co. 3.766667 15
Bluegrass Brewing Co. - East St. Matthew's 4.000000 1
Broad Ripple Brew Pub 4.000000 1
Clockwork Beer Co, The 4.000000 4
... ... ...
Winter Warmer Hoppin' Frog Brewery 3.444000 125
NoDa Brewing Company 4.000000 2
Witbier Broad Ripple Brew Pub 2.833333 6
Frog at Bercy Village, The 4.000000 1
Moon River Brewing Company 3.842105 19
[375 rows x 2 columns]
#Calculate the overall review by beer style
a = beer.groupby(['beer_style'])['overall_review'].agg(["mean"])
#Creating a dataframe
df = pd.DataFrame(a)
#Renaming the column
df = df.rename(columns={"mean": 'avg_review'})
#Sorting
df_sorted = df.sort_values(by='avg_review', ascending=False)
df_sorted.head(5)
avg_review
beer_style
California Common / Steam Beer 5.000000
Flanders Red Ale 4.500000
English Barleywine 4.309091
Chile Beer 4.166667
Rauchbier 4.159091
#Calculate the overall review by beer style
b = beer.groupby(['name_brewery'])['overall_review'].agg(["mean"])
#Creating a dataframe
df = pd.DataFrame(b)
#Renaming the column
df = df.rename(columns={"mean": 'avg_review'})
#Sorting
df_sorted = df.sort_values(by='avg_review', ascending=True)
df_sorted.head(5)
avg_review
name_brewery
Eastwood The Brewer (John Eastwood) 1.500000
Trafalgar Ales & Meads 2.071429
Bootleggers Restaurant & Brewery 2.250000
Piniavos Alutis 2.500000
Vecchio Birraio 2.625000
#Calculate the overall review by beer style
c = beer.groupby(['name_brewery'])['overall_review'].std()
#Identify the highest variability
max_variability_brewery = c.sort_values(ascending=False).head(1)
max_variability_brewery
name_brewery
Otro Mundo Brewing Company 1.06066
Name: overall_review, dtype: float64