In this lecture, we will cover steps related to data wrangling. Data wrangling can entail steps such as renaming and reindexing variables, sorting values, handling duplicates, filtering rows, creating new columns, and reshaping data by pivoting and melting.
The first thing to do is to import the libraries we need.
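A minimal setup sketch; pandas is the only library the examples below strictly need:

```python
import pandas as pd
```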
Let us have a look at a mock dataset.
Download it and place it within the week’s folder.
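A loading sketch; the file name beer_reviews.csv and the variable name beer are assumptions for illustration:

```python
beer = pd.read_csv('beer_reviews.csv')
print(beer.head())
```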
   brewery_id             brewery_name  ...               beer_name  beer_beerid
0       10325          Vecchio Birraio  ...            Sausa Weizen        47986
1       10325          Vecchio Birraio  ...                Red Moon        48213
2       10325          Vecchio Birraio  ...  Black Horse Black Beer        48215
3       10325          Vecchio Birraio  ...              Sausa Pils        47969
4        1075  Caldera Brewing Company  ...           Cauldron DIPA        64883
[5 rows x 9 columns]
In this example, we can use the beer names as indices. The way we do this is by setting beer_name as the index:
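A sketch with .set_index(); we show only the first three rows to match the printout below:

```python
beer_by_name = beer.set_index('beer_name')
print(beer_by_name.head(3))
```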
                        brewery_id     brewery_name  ...  review_taste  beer_beerid
beer_name                                            ...                           
Sausa Weizen                 10325  Vecchio Birraio  ...           1.5        47986
Red Moon                     10325  Vecchio Birraio  ...           3.0        48213
Black Horse Black Beer       10325  Vecchio Birraio  ...           3.0        48215
[3 rows x 8 columns]
We can turn back to the original data by resetting the index
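For example:

```python
beer = beer_by_name.reset_index()
```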
We can change the names of the columns. Let us first inspect the current names:
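One way to do that:

```python
print(beer.columns)
```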
Index(['beer_name', 'brewery_id', 'brewery_name', 'review_overall',
       'review_aroma', 'review_appearance', 'beer_style', 'review_taste',
       'beer_beerid'],
      dtype='object')
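We can then assign a full list of new names directly. A sketch, with new names chosen to match the columns that appear later in the lecture (id_brewery, name_brewery, and so on):

```python
beer.columns = ['beer_name', 'id_brewery', 'name_brewery', 'overall_review',
                'aroma_review', 'appearance_review', 'beer_style',
                'review_taste', 'beerid']
```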
We can also change the names of our variables by using a dictionary.
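A sketch of the equivalent .rename() call, starting from the original names; only the columns named in the dictionary are changed:

```python
beer = beer.rename(columns={'brewery_id': 'id_brewery',
                            'brewery_name': 'name_brewery',
                            'review_overall': 'overall_review',
                            'review_aroma': 'aroma_review',
                            'review_appearance': 'appearance_review',
                            'beer_beerid': 'beerid'})
```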
Let us look at our data. We can order the values of a specific column alphabetically:
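A sketch with sort_values(); reassigning keeps the sorted order for the steps that follow:

```python
beer = beer.sort_values('beer_name')
print(beer.head())
```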
                   beer_name  id_brewery  ... review_taste  beerid
5143              1 A.M. Ale       12003  ...          3.0   25251
2365         10 Blocks South       26990  ...          4.0   76766
1761        15th Anniversary        1454  ...          4.5   27918
1762        15th Anniversary        1454  ...          3.5   27918
1742  2001 - A Stout Odyssey        1454  ...          3.0   55096
[5 rows x 9 columns]
We can also order values based on multiple columns. The code below sorts the dataframe by beer_name first and then breaks ties using review_taste. In the ascending argument, True means ascending order and False means descending order.
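A sketch consistent with the output below, where taste scores descend within each beer name:

```python
beer = beer.sort_values(['beer_name', 'review_taste'], ascending=[True, False])
print(beer)
```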
                   beer_name  id_brewery  ... review_taste  beerid
5143              1 A.M. Ale       12003  ...          3.0   25251
2365         10 Blocks South       26990  ...          4.0   76766
1761        15th Anniversary        1454  ...          4.5   27918
1762        15th Anniversary        1454  ...          3.5   27918
1742  2001 - A Stout Odyssey        1454  ...          3.0   55096
...                      ...         ...  ...          ...     ...
2498          Yakima Monster       23246  ...          4.5   61417
2626      Yankee Clipper IPA       15636  ...          4.5   38212
926         Yellowjacket Ale        2724  ...          3.0   69087
4891         YouEnjoyMyStout          14  ...          4.0   21436
2652           Zhigulovskoye        1746  ...          3.0   22746
[5998 rows x 9 columns]
Often we have to deal with duplicated values. We can, for example, check whether any values are duplicated.
Are there any duplicated values within a column?
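A sketch using .duplicated(), whose .any() tells us whether a column contains repeated values:

```python
print("Are there beers with the same name?")
print(beer['beer_name'].duplicated().any())
print("Are there breweries with the same name?")
print(beer['name_brewery'].duplicated().any())
```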
Are there beers with the same name?
True
Are there breweries with the same name?
True
To remove duplicates, we can use drop_duplicates
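For example:

```python
print(beer.drop_duplicates().head())
```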
                   beer_name  id_brewery  ... review_taste  beerid
5143              1 A.M. Ale       12003  ...          3.0   25251
2365         10 Blocks South       26990  ...          4.0   76766
1761        15th Anniversary        1454  ...          4.5   27918
1762        15th Anniversary        1454  ...          3.5   27918
1742  2001 - A Stout Odyssey        1454  ...          3.0   55096
[5 rows x 9 columns]
This keeps unique combinations of values across all columns. We can check, for example, how the number of rows compares with the original dataset.
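A sketch of the comparison:

```python
print(len(beer))                    # rows in the original data
print(len(beer.drop_duplicates())) # rows after removing duplicated rows
```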
We can also remove duplicates from a single column
The keep='first' argument keeps the first entry when we encounter duplicates.
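A sketch, keeping one row per beer name:

```python
unique_beers = beer.drop_duplicates(subset='beer_name', keep='first')
```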
Applying filters is useful for understanding how our data is distributed. To apply a filter, we add a condition on the data.
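A sketch; the 4.5 threshold is an arbitrary example:

```python
print(beer[beer['review_taste'] >= 4.5])
```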
We can also apply multiple filters.
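Conditions are combined with & (and) or | (or), each wrapped in parentheses; the thresholds below are again arbitrary examples:

```python
print(beer[(beer['review_taste'] >= 4.5) & (beer['overall_review'] >= 4.5)])
```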
We can count the items within categories in the following way:
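For example, counting reviews per brewery:

```python
print(beer['name_brewery'].value_counts())
```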
name_brewery
Hoppin' Frog Brewery                              3113
Caldera Brewing Company                            889
Amstel Brouwerij B. V.                             570
Moon River Brewing Company                         209
Broad Ripple Brew Pub                              152
                                                  ... 
Brasserie La Binchoise (Malterie des Remparts)       1
Pivovar Groll                                        1
Asmara Breweries                                     1
Stadtbrauerei Arnstadt                               1
Voronezhsky Brewery                                  1
Name: count, Length: 73, dtype: int64
We could also use .between() as another way to filter the data.
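A sketch; .between(3, 4) keeps rows whose taste review lies between 3 and 4, inclusive by default:

```python
print(beer[beer['review_taste'].between(3, 4)])
```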
So far we have seen how to perform filters on the content.
If we want to select rows by position or by label, we use the methods .iloc[] and .loc[].
.iloc[] selects records by their numeric position.
.loc[] selects records by their index label.
The following line selects the third entry.
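For example:

```python
beer.iloc[2]   # positions are zero-based, so 2 is the third row
```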
We can also select the third to the sixth observation
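The end of the slice is exclusive:

```python
beer.iloc[2:6]   # positions 2 through 5, i.e. the third to the sixth rows
```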
We can also be more specific about which rows we want to filter and which elements we want extracted.
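A sketch with .loc[] and a boolean condition; the brewery name is an assumption, chosen because the 152-row result below matches the count we saw earlier for Broad Ripple Brew Pub:

```python
print(beer.loc[beer['name_brewery'] == 'Broad Ripple Brew Pub'])
```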
                                              beer_name  ...  beerid
1761                                   15th Anniversary  ...   27918
1762                                   15th Anniversary  ...   27918
1742                             2001 - A Stout Odyssey  ...   55096
1714  Ackerman's Imperial Double Stout (Indiana Repl...  ...   66889
1701  Ackerman's Imperial Double Stout (Winterfest R...  ...   66430
...                                                 ...  ...     ...
1768                                     Wobbly Bob APA  ...   15660
1770                                     Wobbly Bob APA  ...   15660
1771                                     Wobbly Bob APA  ...   15660
1772                                     Wobbly Bob APA  ...   15660
1769                                     Wobbly Bob APA  ...   15660
[152 rows x 9 columns]
We can easily drop observations in the following way:
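A sketch that drops the rows selected above by their index labels:

```python
beer_dropped = beer.drop(beer[beer['name_brewery'] == 'Broad Ripple Brew Pub'].index)
print(beer_dropped)
```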
               beer_name  id_brewery  ... review_taste  beerid
5143          1 A.M. Ale       12003  ...          3.0   25251
2365     10 Blocks South       26990  ...          4.0   76766
2757    Akron Dopple Alt       16994  ...          5.0   44875
2756    Akron Dopple Alt       16994  ...          3.5   44875
733           Alpha Beta        1075  ...          4.0   54723
...                  ...         ...  ...          ...     ...
2498      Yakima Monster       23246  ...          4.5   61417
2626  Yankee Clipper IPA       15636  ...          4.5   38212
926     Yellowjacket Ale        2724  ...          3.0   69087
4891     YouEnjoyMyStout          14  ...          4.0   21436
2652       Zhigulovskoye        1746  ...          3.0   22746
[5846 rows x 9 columns]
We can look at the difference in the number of observations:
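For example:

```python
print(len(beer) - len(beer_dropped))   # 5998 - 5846 = 152
```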
We can also aggregate or perform operations on already-existing columns in the following way:
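A sketch that averages the four review columns into a new review_avg column:

```python
review_cols = ['overall_review', 'aroma_review', 'appearance_review', 'review_taste']
beer['review_avg'] = beer[review_cols].mean(axis=1)
print(beer.head())
```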
                   beer_name  id_brewery  ... beerid  review_avg
5143              1 A.M. Ale       12003  ...  25251        3.75
2365         10 Blocks South       26990  ...  76766        3.50
1761        15th Anniversary        1454  ...  27918        4.00
1762        15th Anniversary        1454  ...  27918        3.25
1742  2001 - A Stout Odyssey        1454  ...  55096        3.75
[5 rows x 10 columns]
Let’s say that we have one new row
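A sketch; apart from the beer name, the values below are assumptions for illustration:

```python
new_beer = pd.DataFrame([{'beer_name': 'Corona',
                          'id_brewery': None,
                          'beer_style': 'Lager',    # assumed value
                          'overall_review': 4.0,    # assumed value
                          'aroma_review': 3.0}])    # assumed value
```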
We can now add this new row to our data
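A sketch using pd.concat(); beer2 is the name the text below uses for the combined data:

```python
beer2 = pd.concat([beer, new_beer])
print(beer2)
```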
                   beer_name id_brewery  ...   beerid review_avg
5143              1 A.M. Ale      12003  ...  25251.0       3.75
2365         10 Blocks South      26990  ...  76766.0       3.50
1761        15th Anniversary       1454  ...  27918.0       4.00
1762        15th Anniversary       1454  ...  27918.0       3.25
1742  2001 - A Stout Odyssey       1454  ...  55096.0       3.75
...                      ...        ...  ...      ...        ...
2626      Yankee Clipper IPA      15636  ...  38212.0       4.25
926         Yellowjacket Ale       2724  ...  69087.0       3.25
4891         YouEnjoyMyStout         14  ...  21436.0       4.00
2652           Zhigulovskoye       1746  ...  22746.0       3.25
0                     Corona       None  ...      NaN        NaN
[5999 rows x 10 columns]
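We can compare the columns of the two frames by printing them as sets:

```python
print(set(new_beer.columns))
print(set(beer2.columns))
```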
{'beer_name', 'id_brewery', 'beer_style', 'overall_review', 'aroma_review'}
{'beerid', 'review_taste', 'beer_name', 'id_brewery', 'beer_style', 'appearance_review', 'review_avg', 'overall_review', 'aroma_review', 'name_brewery'}
We can see that the columns that did not exist in new_beer were filled with NaNs. To see the columns that are in beer2 but not in new_beer, we can take a set difference:
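```python
print(set(beer2.columns) - set(new_beer.columns))
```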
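As another counting example, we can look at the most common beer styles; a sketch consistent with the output below:

```python
print(beer['beer_style'].value_counts().head())
```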
beer_style
Russian Imperial Stout            1561
American IPA                       791
American Double / Imperial IPA     493
Light Lager                        466
American Pale Ale (APA)            304
Name: count, dtype: int64
We can also count how many reviews gave a perfect score of 5 across all four review columns:

```python
# Select the relevant review columns
review_columns = ['overall_review', 'aroma_review', 'appearance_review', 'review_taste']
# Create a boolean mask where all selected review columns have a score of 5
mask = (beer[review_columns] == 5).all(axis=1)
# print(mask)
# Count the number of rows that meet the condition
beers_with_perfect_scores = mask.sum()
print(beers_with_perfect_scores)
```
19
Here is some data:
```python
movies_data = {
    "title": [
        "Inception", "The Dark Knight", "Pulp Fiction", "The Matrix", "Forrest Gump",
        "The Godfather", "Fight Club", "Interstellar", "The Shawshank Redemption", "The Social Network",
        "Whiplash", "Parasite", "Gladiator", "Toy Story", "The Grand Budapest Hotel"
    ],
    "genre": [
        "Sci-Fi", "Action", "Crime", "Sci-Fi", "Drama",
        "Crime", "Drama", "Sci-Fi", "Drama", "Drama",
        "Drama", "Thriller", "Action", "Animation", "Comedy"
    ],
    "rating": [4.8, 4.9, 4.7, 4.8, 4.6, 4.9, 4.5, 4.7, 4.9, 4.3, 4.6, 4.8, 4.7, 4.4, 4.5],
    "votes": [2500000, 2300000, 1800000, 1900000, 2100000, 2000000, 1700000, 2200000, 2400000, 1500000, 1400000, 1600000, 1800000, 1300000, 1200000],
    "critic_score": [94, 98, 96, 87, 92, 100, 88, 85, 99, 93, 95, 100, 90, 89, 91],
    "user_score": [9.5, 9.7, 9.4, 9.0, 8.9, 10.0, 8.7, 8.5, 9.8, 8.6, 9.2, 10.0, 8.8, 8.3, 8.5],
    "year": [2010, 2008, 1994, 1999, 1994, 1972, 1999, 2014, 1994, 2010, 2014, 2019, 2000, 1995, 2014]
}
# build a DataFrame from the dictionary (the variable name movies is an assumption)
movies = pd.DataFrame(movies_data)
```

Pandas can be used for similar operations to those that exist in NumPy.
For example, we can calculate the average of reviews
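A minimal sketch, assuming we take the mean of one of the beer review columns:

```python
print(beer['overall_review'].mean())
```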
Let’s say that we are interested in the correlation among multiple variables
The way we would do this is the following:
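A sketch with .corr() over the review columns:

```python
review_cols = ['overall_review', 'aroma_review', 'appearance_review', 'review_taste']
print(beer[review_cols].corr())
```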
|  | overall_review | aroma_review | appearance_review | review_taste |
|---|---|---|---|---|
| overall_review | 1.000000 | 0.675650 | 0.570245 | 0.819431 |
| aroma_review | 0.675650 | 1.000000 | 0.632493 | 0.774227 |
| appearance_review | 0.570245 | 0.632493 | 1.000000 | 0.616572 |
| review_taste | 0.819431 | 0.774227 | 0.616572 | 1.000000 |
The main operations that we can perform to reshape data are pivoting and melting. Let us assume that we want to change the structure of the following data:
    Item   CType USD  EU
0  Item0    Gold  1$  1€
1  Item0  Bronze  2$  2€
2  Item1    Gold  3$  3€
3  Item1  Silver  4$  4€
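A sketch that builds this frame and pivots it so that each CType becomes a column; the variable name df is an assumption:

```python
df = pd.DataFrame({'Item':  ['Item0', 'Item0', 'Item1', 'Item1'],
                   'CType': ['Gold', 'Bronze', 'Gold', 'Silver'],
                   'USD':   ['1$', '2$', '3$', '4$'],
                   'EU':    ['1€', '2€', '3€', '4€']})

# one row per Item, one USD column per CType:
# CType Bronze Gold Silver
# Item0     2$   1$    NaN
# Item1    NaN   3$     4$
print(df.pivot(index='Item', columns='CType', values='USD'))
```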
The opposite of pivoting is melting.
Let us examine the following dataframe:
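A sketch that builds this frame, with the values copied from the printout below:

```python
df = pd.DataFrame({'weekday': ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                               'Friday', 'Saturday', 'Sunday'],
                   'Person 1': [12, 6, 5, 8, 11, 6, 4],
                   'Person 2': [10, 6, 11, 5, 8, 9, 12],
                   'Person 3': [8, 5, 7, 3, 7, 11, 15]})
print(df)
```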
     weekday  Person 1  Person 2  Person 3
0     Monday        12        10         8
1    Tuesday         6         6         5
2  Wednesday         5        11         7
3   Thursday         8         5         3
4     Friday        11         8         7
5   Saturday         6         9        11
6     Sunday         4        12        15
In this case, what we want is to have the following columns: weekday, Person, Score.
The way we do this is the following:
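A sketch using .melt(); weekday stays as an identifier, and the person columns are stacked into Person/Score pairs:

```python
melted = df.melt(id_vars='weekday', var_name='Person', value_name='Score')
print(melted)
```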
      weekday    Person  Score
0      Monday  Person 1     12
1     Tuesday  Person 1      6
2   Wednesday  Person 1      5
3    Thursday  Person 1      8
4      Friday  Person 1     11
5    Saturday  Person 1      6
6      Sunday  Person 1      4
7      Monday  Person 2     10
8     Tuesday  Person 2      6
9   Wednesday  Person 2     11
10   Thursday  Person 2      5
11     Friday  Person 2      8
12   Saturday  Person 2      9
13     Sunday  Person 2     12
14     Monday  Person 3      8
15    Tuesday  Person 3      5
16  Wednesday  Person 3      7
17   Thursday  Person 3      3
18     Friday  Person 3      7
19   Saturday  Person 3     11
20     Sunday  Person 3     15
Look at the following dataframe:
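A sketch that builds this frame, with the values copied from the printout below; the variable name weather is an assumption:

```python
weather = pd.DataFrame({'Date': ['2024-08-10'] * 3 + ['2024-08-11'] * 3,
                        'City': ['New York', 'Chicago', 'Miami'] * 2,
                        'Temp': [85, 78, 90, 88, 76, 92],
                        'Hum':  [70, 65, 80, 72, 68, 82]})
print(weather)
```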
         Date      City  Temp  Hum
0  2024-08-10  New York    85   70
1  2024-08-10   Chicago    78   65
2  2024-08-10     Miami    90   80
3  2024-08-11  New York    88   72
4  2024-08-11   Chicago    76   68
5  2024-08-11     Miami    92   82
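A sketch that would produce the wide layout below: pivot on Date and City, then flatten the resulting (variable, city) column MultiIndex into names like Chicago_Temp:

```python
wide = weather.pivot(index='Date', columns='City', values=['Temp', 'Hum'])
# flatten the (variable, city) MultiIndex into single column names
wide.columns = [f"{city}_{var}" for var, city in wide.columns]
print(wide)
```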
            Chicago_Temp  Miami_Temp  New York_Temp  Chicago_Hum  Miami_Hum  \
Date                                                                          
2024-08-10            78          90             85           65         80   
2024-08-11            76          92             88           68         82   
            New York_Hum  
Date                      
2024-08-10            70  
2024-08-11            72  
Consider the following DataFrame representing survey responses from three individuals about their preferences for different fruits:
  Person   Fruit  Preference
0  Alice   Apple           5
1    Bob   Apple           2
2  Carol   Apple           4
3  Alice  Banana           3
4    Bob  Banana           5
5  Carol  Banana           4
6  Alice  Orange           4
7    Bob  Orange           3
8  Carol  Orange           5
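One natural reshape (an assumed exercise) is a wide table with one row per person and one column per fruit; a sketch, rebuilding the frame first:

```python
survey = pd.DataFrame({'Person': ['Alice', 'Bob', 'Carol'] * 3,
                       'Fruit': ['Apple'] * 3 + ['Banana'] * 3 + ['Orange'] * 3,
                       'Preference': [5, 2, 4, 3, 5, 4, 4, 3, 5]})
print(survey.pivot(index='Person', columns='Fruit', values='Preference'))
```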
Data Wrangling Tools: Pandas simplifies renaming, reindexing, filtering, and modifying data structures for effective data wrangling.
Handling Duplicates: Provides efficient methods to detect and remove duplicates at the row or column level.
Reshaping Data: Enables pivoting and melting to transform data structures for specific analytical needs.
Popescu (JCU): Lecture 9