Pandas is a library that provides analytical tools and data structures with high performance and ease of use.
Pandas is useful for manipulating tabular data such as spreadsheets and SQL tables.
Common file formats and data sources that work with pandas include CSV, XLS, SQL, and HDF5.
Common operations with pandas include filtering, adding, and deleting data.
Overall, it combines the high performance of NumPy arrays with flexible tools for manipulating data.
To import the library, we can simply type:
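By convention, pandas is imported under the alias pd:
Python
import pandas as pd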
A DataFrame is a data structure that stores data in tabular form, that is, ordered in rows and labeled columns.
Each row contains an observation and each column a variable.
A DataFrame accepts heterogeneous data, that is, variables can be of different types (numeric, string, boolean, etc.).
We can also create a DataFrame from a simple Python list.
The DataFrame then has both values and an index. For example, we can define our list in the following way:
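A minimal sketch (the list values and the variable name df_list are just illustrations):
Python
import pandas as pd

df_list = pd.DataFrame([1.5, 3.0, 4.5, 6.0])  # build a DataFrame from a plain Python list
print(df_list)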
We can extract the values of the Pandas Dataframe in the following way:
We can extract the index associated with this data in the following way:
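Continuing the sketch above:
Python
df_list = pd.DataFrame([1.5, 3.0, 4.5, 6.0])
print(df_list.values)  # the underlying NumPy array of values
print(df_list.index)   # the associated index (a RangeIndex by default)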
The index does not need to be a series of numbers:
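For example (a sketch with made-up labels):
Python
df_labeled = pd.DataFrame([1.5, 3.0, 4.5, 6.0], index=['a', 'b', 'c', 'd'])
print(df_labeled.index)  # Index(['a', 'b', 'c', 'd'], dtype='object')
print(df_labeled)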
We can also create Pandas dataframes from dictionaries
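A sketch with made-up data; each dictionary key becomes a column:
Python
df_people = pd.DataFrame({'name': ['Ana', 'Ben', 'Carla'], 'age': [25, 31, 28]})
print(df_people)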
There are multiple ways to access data in a dataframe
One way is by using the internal positional index with .iloc. Note that positions start at 0.
We can also select only some observations
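For example, with the small DataFrame built from the dictionary above:
Python
df_people = pd.DataFrame({'name': ['Ana', 'Ben', 'Carla'], 'age': [25, 31, 28]})
print(df_people.iloc[0])    # a single observation (row), selected by position
print(df_people.iloc[0:2])  # a subset of observations: the rows at positions 0 and 1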
When we want to select by label rather than by position, we use .loc instead of .iloc.
.loc is a label-based indexing method, while .iloc is a position-based indexing method.
DataFrames in pandas are designed to store heterogeneous multivariate data.
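A sketch contrasting the two methods (the row labels used here are just an illustration):
Python
df_people = pd.DataFrame({'name': ['Ana', 'Ben', 'Carla'], 'age': [25, 31, 28]},
                         index=['r1', 'r2', 'r3'])
print(df_people.loc['r1'])              # label-based: the row labeled 'r1'
print(df_people.iloc[0])                # position-based: the first row
print(df_people.loc['r1':'r2', 'age'])  # label slices are inclusive and can also pick columns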
For example, we can see below:
Notice how column names are derived from the dictionary keys.
We notice that the index appears automatically
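A sketch of the DataFrame used in this example (the data is taken from the table shown further below; the variable names are assumptions):
Python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
print(df)  # an integer index 0..4 appears automatically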
We can also specify our very own index in the following way:
Python
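# A sketch reconstructed from the output below; the variable names are assumptions.
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data, index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'])
df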
       state  year  pop
Row1    Ohio  2000  1.5
Row2    Ohio  2001  1.7
Row3    Ohio  2002  3.6
Row4  Nevada  2001  2.4
Row5  Nevada  2002  2.9
If we simply want to see the columns in a dataframe, we can easily use the columns attribute.
The columns attribute returns the column labels of the DataFrame as an Index object.
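Continuing with the state/year/pop DataFrame df defined above:
Python
df.columns  # Index(['state', 'year', 'pop'], dtype='object')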
Columns can be extracted from a dataframe using the column label
The result is a Series object
It is possible to extract more than one column.
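Continuing with df from above:
Python
print(df['state'])           # a single column: the result is a Series
print(df[['state', 'pop']])  # a list of column labels returns a DataFrame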
There are a variety of operations that can be used with dataframes
For example, the following tells us the shape of the dataframe:
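Assuming df is the DataFrame defined above:
Python
df.shape  # (number of rows, number of columns); here (5, 3)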
We can also inspect the head and the tail of our dataframes.
For example, we can have:
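Assuming df is the DataFrame defined above:
Python
print(df.head())   # the first 5 rows (by default)
print(df.head(2))  # the first 2 rows
print(df.tail(2))  # the last 2 rows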
We can have a summary of our variables in the following way:
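Assuming df is the state/year/pop DataFrame defined above:
Python
df.describe()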
             year       pop
count     5.00000  5.000000
mean   2001.20000  2.420000
std       0.83666  0.864292
min    2000.00000  1.500000
25%    2001.00000  1.700000
50%    2001.00000  2.400000
75%    2002.00000  2.900000
max    2002.00000  3.600000
.describe() works on numeric columns by default. For non-numeric columns, use include='all':
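Again with df from above:
Python
df.describe(include='all')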
       state        year       pop
count      5     5.00000  5.000000
unique     2         NaN       NaN
top     Ohio         NaN       NaN
freq       3         NaN       NaN
mean     NaN  2001.20000  2.420000
std      NaN     0.83666  0.864292
min      NaN  2000.00000  1.500000
25%      NaN  2001.00000  1.700000
50%      NaN  2001.00000  2.400000
75%      NaN  2002.00000  2.900000
max      NaN  2002.00000  3.600000
Look at the following dictionary
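The dictionary is not shown here; a sketch consistent with the summary below (a commonly used example dataset, so the exact rows are an assumption) is:
Python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
pets = pd.DataFrame(data)
pets.describe(include='all')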
       animal       age     visits priority
count      10  8.000000  10.000000       10
unique      3       NaN        NaN        2
top       cat       NaN        NaN       no
freq        4       NaN        NaN        6
mean      NaN  3.437500   1.900000      NaN
std       NaN  2.007797   0.875595      NaN
min       NaN  0.500000   1.000000      NaN
25%       NaN  2.375000   1.000000      NaN
50%       NaN  3.000000   2.000000      NaN
75%       NaN  4.625000   2.750000      NaN
max       NaN  7.000000   3.000000      NaN
Look at the following dictionary
Python
import numpy as np
data = {'city': ['Rome', 'Paris', 'London', 'Berlin', 'Madrid', 'Rome', 'Paris', 'London', 'Berlin', 'Madrid'],
        'temperature': [22.5, 18.3, 15.0, np.nan, 25.2, 21.0, 19.5, np.nan, 14.8, 26.1],
        'humidity': [55, 65, 70, 60, 50, 58, 68, 72, 63, 48],
        'pollution_level': ['low', 'moderate', 'high', 'moderate', 'low', 'low', 'moderate', 'high', 'moderate', 'low']}
Many datasets have fields where some observations are missing.
These can cause problems when trying to represent data and perform certain operations
Thus, it is necessary to identify and do something about these missing values.
There are two strategies for dealing with missing values: deleting the affected observations, or assigning (imputing) a value to them:
Python
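# A sketch reconstructed from the output below; the variable name df_na is an assumption.
import numpy as np
import pandas as pd

df_na = pd.DataFrame({'VarA': ['aa', None, 'cc', None],
                      'VarB': [20.0, 30.0, np.nan, np.nan],
                      'VarC': [1234, 3456, 6789, 765],
                      'VarD': [1234.0, 888.0, np.nan, np.nan]},
                     index=['Case1', 'Case2', 'Case3', 'Case4'])
df_na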
       VarA  VarB  VarC    VarD
Case1    aa  20.0  1234  1234.0
Case2  None  30.0  3456   888.0
Case3    cc   NaN  6789     NaN
Case4  None   NaN   765     NaN
We can check whether data is missing
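Assuming df_na is the DataFrame sketched above, isnull() returns True wherever a value is missing:
Python
df_na.isnull()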
        VarA   VarB   VarC   VarD
Case1  False  False  False  False
Case2   True  False  False  False
Case3  False   True  False   True
Case4   True   True  False   True
The simplest way to deal with NAs is to drop them.
In order to drop observations that contain missing values, we can do the following:
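A sketch with df_na from above; dropna() removes every row that contains at least one missing value:
Python
df_na.dropna()               # only Case1 survives
# df_na.dropna(inplace=True) # would keep the change in df_na itself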
We can indicate the columns from which we want to drop the NAs.
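For example (the column choice is just an illustration):
Python
df_na.dropna(subset=['VarB'])  # drop only the rows where VarB is missing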
The other way to deal with missing values is to assign values to NAs
We can do this with the fillna method.
Similar to dropna, fillna returns a new DataFrame by default; use the parameter inplace=True to keep the changes in the original DataFrame.
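A sketch with df_na from above (the fill values are just illustrations):
Python
df_na.fillna(0)                             # replace every missing value with 0
df_na['VarB'].fillna(df_na['VarB'].mean())  # or fill one column with its mean
# df_na.fillna(0, inplace=True) would keep the changes in df_na itself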
Here is a dataframe:
Python
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Jason', 'Mary', 'Tina', 'Jake', 'Amy','Anne'], 
        'last_name': ['Miller', 'Smith', 'Ali', 'Milner', 'Cooze','Lynn'], 
        'age': [42, np.nan, 36, 24, 73,'23'], 
        'sex': ['m', np.nan, 'f', 'm', 'f','f'], 
        'preTestScore': [4, np.nan, np.nan, 2, 3, np.nan],
        'postTestScore': [25, np.nan, np.nan, 62, 70, np.nan]}
df = pd.DataFrame(raw_data)
df
  first_name last_name  age  sex  preTestScore  postTestScore
0      Jason    Miller   42    m           4.0           25.0
1       Mary     Smith  NaN  NaN           NaN            NaN
2       Tina       Ali   36    f           NaN            NaN
3       Jake    Milner   24    m           2.0           62.0
4        Amy     Cooze   73    f           3.0           70.0
5       Anne      Lynn   23    f           NaN            NaN
We want to eliminate the entries with the largest number of NAs and then add a new column preTestScore_mean to contain the mean value for preTestScore.
Python
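# A sketch that reproduces the output below; df_clean is an assumed name.
df_clean = df.dropna()  # rows 1, 2 and 5 all contain missing values and are removed
print("DataFrame after eliminating entries with the largest number of NAs:")
print(df_clean)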
DataFrame after eliminating entries with the largest number of NAs:
  first_name last_name age sex  preTestScore  postTestScore
0      Jason    Miller  42   m           4.0           25.0
3       Jake    Milner  24   m           2.0           62.0
4        Amy     Cooze  73   f           3.0           70.0
Python
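# A sketch that reproduces the output below, continuing from df_clean above.
df_clean['preTestScore_mean'] = df_clean['preTestScore'].mean()  # mean of 4.0, 2.0, 3.0 = 3.0
print("DataFrame after adding 'preTestScore_mean':")
print(df_clean)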
DataFrame after adding 'preTestScore_mean':
  first_name last_name age sex  preTestScore  postTestScore  preTestScore_mean
0      Jason    Miller  42   m           4.0           25.0                3.0
3       Jake    Milner  24   m           2.0           62.0                3.0
4        Amy     Cooze  73   f           3.0           70.0                3.0
Here is a dataframe:
Python
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Liam', 'Olivia', 'Noah', 'Emma', 'Ava', 'Sophia'], 
            'last_name': ['Brown', 'Davis', 'Wilson', 'Moore', 'Taylor', 'Anderson'], 
            'age': [34, np.nan, 28, 22, 67, '29'], 
            'sex': ['m', 'f', np.nan, 'f', 'f', 'm'], 
            'exam1_score': [85, np.nan, 78, np.nan, 92, 88],
            'exam2_score': [91, 87, np.nan, np.nan, 95, np.nan]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'sex', 'exam1_score', 'exam2_score'])
df
  first_name last_name  age  sex  exam1_score  exam2_score
0       Liam     Brown   34    m         85.0         91.0
1     Olivia     Davis  NaN    f          NaN         87.0
2       Noah    Wilson   28  NaN         78.0          NaN
3       Emma     Moore   22    f          NaN          NaN
4        Ava    Taylor   67    f         92.0         95.0
5     Sophia  Anderson   29    m         88.0          NaN
Core Structures: Pandas relies on Series (1D arrays) and DataFrames (tabular data) for efficient data handling.
Data Operations: Supports filtering, indexing, appending, and modifying data seamlessly.
Missing Data: Manage missing values with dropna or fillna methods effectively.
Practical Use: Exercises highlight creating, accessing, and analyzing data with ease.
Popescu (JCU): Lecture 8