Pandas is a library that provides analytical tools and data structures with high performance and ease of use.
Pandas is useful for manipulating tabular data such as spreadsheets and SQL tables.
Common file formats that work with pandas include CSV, XLS, SQL databases, and HDF5.
Common operations with pandas include filtering, adding, and deleting data.
Overall, it combines the high performance of NumPy arrays with flexible tools for manipulating data.
To import the library, we can simply type:
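By convention, the library is imported under the alias pd:

```python
import pandas as pd
```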
A DataFrame is a data structure that stores data in tabular form, that is, ordered in rows and labeled columns.
Each row contains an observation and each column a variable.
A DataFrame accepts heterogeneous data, that is, variables can be of different types (numeric, string, boolean, etc.).
A DataFrame holds both values and an index. For example, we can build one from a list of values:
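A minimal sketch (the list values are illustrative):

```python
import pandas as pd

# Build a DataFrame from a plain Python list;
# pandas generates a default integer index automatically
df = pd.DataFrame([10, 20, 30, 40])
print(df)
```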
We can extract the values of the Pandas Dataframe in the following way:
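Using the same illustrative DataFrame, the values attribute returns the stored data as a NumPy array:

```python
import pandas as pd

df = pd.DataFrame([10, 20, 30, 40])
print(df.values)  # the underlying NumPy array
```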
We can extract the index associated with this data in the following way:
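Continuing the same sketch:

```python
import pandas as pd

df = pd.DataFrame([10, 20, 30, 40])
print(df.index)  # RangeIndex(start=0, stop=4, step=1) by default
```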
The index does not need to be series of numbers:
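For example, string labels work as an index (illustrative values):

```python
import pandas as pd

# The index can hold arbitrary labels, e.g. strings
df = pd.DataFrame([10, 20, 30], index=['a', 'b', 'c'])
print(df)
```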
We can also create Pandas dataframes from dictionaries
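A sketch with illustrative keys and values:

```python
import pandas as pd

# Dictionary keys become the column labels,
# dictionary values become the column data
df = pd.DataFrame({'name': ['Ann', 'Bob'], 'score': [90, 85]})
print(df)
```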
There are multiple ways to access data in a dataframe
One way is by using the internal positional index with iloc. Note that positions start with 0.
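For example:

```python
import pandas as pd

df = pd.DataFrame([10, 20, 30, 40])
print(df.iloc[0])  # the first row (position 0)
```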
We can also select only some observations
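For example, by slicing positions (the end position is excluded):

```python
import pandas as pd

df = pd.DataFrame([10, 20, 30, 40])
print(df.iloc[1:3])  # rows at positions 1 and 2
```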
This is where we use .loc instead of .iloc: .loc is a label-based indexing method, whereas .iloc is a position-based indexing method.
DataFrames in pandas are designed to store heterogeneous multivariate data.
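A minimal sketch contrasting the two (illustrative labels and values):

```python
import pandas as pd

df = pd.DataFrame({'val': [10, 20, 30]}, index=['a', 'b', 'c'])
print(df.loc['b'])   # label-based: the row labelled 'b'
print(df.iloc[1])    # position-based: the second row (the same row here)
```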
For example, we can see below:
Notice how column names are derived from the dictionary keys.
We notice that a default integer index appears automatically.
We can also specify our very own index in the following way:
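A sketch reproducing the table below (the data values are taken from the output shown):

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data, index=['Row1', 'Row2', 'Row3', 'Row4', 'Row5'])
print(df)
```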
state year pop
Row1 Ohio 2000 1.5
Row2 Ohio 2001 1.7
Row3 Ohio 2002 3.6
Row4 Nevada 2001 2.4
Row5 Nevada 2002 2.9
If we simply want to see the columns in a dataframe, we can use the columns attribute, which returns the column labels of the DataFrame as an Index object.
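With the state/year/pop dataframe (redefined here so the sketch runs on its own):

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'year': [2000, 2001, 2002, 2001, 2002],
                   'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})
print(df.columns)  # Index(['state', 'year', 'pop'], dtype='object')
```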
Columns can be extracted from a dataframe using the column label; the result is a Series object.
It is also possible to extract more than one column at once, which yields a DataFrame.
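For example, using the state/year/pop dataframe (redefined so the sketch runs on its own):

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'year': [2000, 2001, 2002, 2001, 2002],
                   'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})
print(df['state'])            # one column  -> Series
print(df[['state', 'year']])  # two columns -> DataFrame
```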
There are a variety of operations that can be used with dataframes
For example, the following tells us the shape of the dataframe:
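Again with the state/year/pop dataframe:

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'year': [2000, 2001, 2002, 2001, 2002],
                   'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})
print(df.shape)  # (5, 3): five rows, three columns
```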
We can also inspect the head and the tail of our dataframes.
For example, we can have:
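A sketch with the same dataframe:

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'year': [2000, 2001, 2002, 2001, 2002],
                   'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})
print(df.head(2))  # the first two rows
print(df.tail(2))  # the last two rows
```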
We can have a summary of our variables in the following way:
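A sketch producing the summary below, using the state/year/pop data from earlier (redefined so it runs on its own):

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'year': [2000, 2001, 2002, 2001, 2002],
                   'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})
print(df.describe())  # count, mean, std, min, quartiles, max
```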
year pop
count 5.00000 5.000000
mean 2001.20000 2.420000
std 0.83666 0.864292
min 2000.00000 1.500000
25% 2001.00000 1.700000
50% 2001.00000 2.400000
75% 2002.00000 2.900000
max 2002.00000 3.600000
.describe() works on numeric columns by default. For non-numeric columns, use include='all':
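A sketch producing the extended summary below (same data, redefined so it runs on its own):

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'year': [2000, 2001, 2002, 2001, 2002],
                   'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})
# include='all' adds unique/top/freq statistics for non-numeric columns
print(df.describe(include='all'))
```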
state year pop
count 5 5.00000 5.000000
unique 2 NaN NaN
top Ohio NaN NaN
freq 3 NaN NaN
mean NaN 2001.20000 2.420000
std NaN 0.83666 0.864292
min NaN 2000.00000 1.500000
25% NaN 2001.00000 1.700000
50% NaN 2001.00000 2.400000
75% NaN 2002.00000 2.900000
max NaN 2002.00000 3.600000
Look at the summary statistics for the following data:
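A dictionary consistent with every aggregate in the summary below (a reconstruction with assumed values, taken from the well-known "animals" pandas exercise dataset):

```python
import numpy as np
import pandas as pd

# Assumed values: chosen to match the count/mean/std/quantile
# figures shown in the summary table
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog',
                   'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no',
                     'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data)
print(df.describe(include='all'))
```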
animal age visits priority
count 10 8.000000 10.000000 10
unique 3 NaN NaN 2
top cat NaN NaN no
freq 4 NaN NaN 6
mean NaN 3.437500 1.900000 NaN
std NaN 2.007797 0.875595 NaN
min NaN 0.500000 1.000000 NaN
25% NaN 2.375000 1.000000 NaN
50% NaN 3.000000 2.000000 NaN
75% NaN 4.625000 2.750000 NaN
max NaN 7.000000 3.000000 NaN
Look at the following dictionary
Python
import pandas as pd
import numpy as np
data = {'city': ['Rome', 'Paris', 'London', 'Berlin', 'Madrid', 'Rome', 'Paris', 'London', 'Berlin', 'Madrid'],
'temperature': [22.5, 18.3, 15.0, np.nan, 25.2, 21.0, 19.5, np.nan, 14.8, 26.1],
'humidity': [55, 65, 70, 60, 50, 58, 68, 72, 63, 48],
'pollution_level': ['low', 'moderate', 'high', 'moderate', 'low', 'low', 'moderate', 'high', 'moderate', 'low']}
df = pd.DataFrame(data)
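For example, the city and temperature columns can be extracted together (a self-contained sketch repeating the data above):

```python
import numpy as np
import pandas as pd

data = {'city': ['Rome', 'Paris', 'London', 'Berlin', 'Madrid',
                 'Rome', 'Paris', 'London', 'Berlin', 'Madrid'],
        'temperature': [22.5, 18.3, 15.0, np.nan, 25.2,
                        21.0, 19.5, np.nan, 14.8, 26.1],
        'humidity': [55, 65, 70, 60, 50, 58, 68, 72, 63, 48],
        'pollution_level': ['low', 'moderate', 'high', 'moderate', 'low',
                            'low', 'moderate', 'high', 'moderate', 'low']}
df = pd.DataFrame(data)

# Select the two columns of interest; note the NaN entries in temperature
subset = df[['city', 'temperature']]
print(subset)
```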
Note the columns city and temperature; the temperature column contains missing observations (NaN).
Many datasets have fields where some observations are missing.
These can cause problems when trying to represent data and perform certain operations.
Thus, it is necessary to identify these missing values and do something about them.
There are two strategies for dealing with them: deletion, or assignment of a given value:
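A sketch reconstructing the DataFrame displayed below (values taken from the output shown):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'VarA': ['aa', None, 'cc', None],
                   'VarB': [20.0, 30.0, np.nan, np.nan],
                   'VarC': [1234, 3456, 6789, 765],
                   'VarD': [1234.0, 888.0, np.nan, np.nan]},
                  index=['Case1', 'Case2', 'Case3', 'Case4'])
print(df)
```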
VarA VarB VarC VarD
Case1 aa 20.0 1234 1234.0
Case2 None 30.0 3456 888.0
Case3 cc NaN 6789 NaN
Case4 None NaN 765 NaN
We can check whether data is missing
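With the isnull method (the Case1-Case4 dataframe is redefined so the sketch runs on its own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'VarA': ['aa', None, 'cc', None],
                   'VarB': [20.0, 30.0, np.nan, np.nan],
                   'VarC': [1234, 3456, 6789, 765],
                   'VarD': [1234.0, 888.0, np.nan, np.nan]},
                  index=['Case1', 'Case2', 'Case3', 'Case4'])
print(df.isnull())  # True wherever a value is missing
```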
VarA VarB VarC VarD
Case1 False False False False
Case2 True False False False
Case3 False True False True
Case4 True True False True
The simplest way to deal with NAs is to drop them.
In order to drop the missing values, we can perform the following:
We can also indicate the columns from which we want to drop the NAs.
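A sketch of both variants on the Case1-Case4 dataframe (redefined so it runs on its own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'VarA': ['aa', None, 'cc', None],
                   'VarB': [20.0, 30.0, np.nan, np.nan],
                   'VarC': [1234, 3456, 6789, 765],
                   'VarD': [1234.0, 888.0, np.nan, np.nan]},
                  index=['Case1', 'Case2', 'Case3', 'Case4'])

print(df.dropna())                 # keeps only rows without any NA (Case1)
print(df.dropna(subset=['VarB']))  # drops rows with NA in VarB only
```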
The other way to deal with missing values is to assign values to the NAs.
We can do this with the fillna method.
Similar to dropna, the parameter inplace=True is required to maintain the changes.
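For example (the Case1-Case4 dataframe is redefined so the sketch runs on its own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'VarA': ['aa', None, 'cc', None],
                   'VarB': [20.0, 30.0, np.nan, np.nan],
                   'VarC': [1234, 3456, 6789, 765],
                   'VarD': [1234.0, 888.0, np.nan, np.nan]},
                  index=['Case1', 'Case2', 'Case3', 'Case4'])

df.fillna(0, inplace=True)  # replace every NA with 0, modifying df itself
print(df)
```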
Here is a dataframe:
Python
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Jason', 'Mary', 'Tina', 'Jake', 'Amy', 'Anne'],
'last_name': ['Miller', 'Smith', 'Ali', 'Milner', 'Cooze', 'Lynn'],
'age': [42, np.nan, 36, 24, 73, '23'],
'sex': ['m', np.nan, 'f', 'm', 'f', 'f'],
'preTestScore': [4, np.nan, np.nan, 2, 3, np.nan],
'postTestScore': [25, np.nan, np.nan, 62, 70, np.nan]}
df = pd.DataFrame(raw_data)
df
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42 m 4.0 25.0
1 Mary Smith NaN NaN NaN NaN
2 Tina Ali 36 f NaN NaN
3 Jake Milner 24 m 2.0 62.0
4 Amy Cooze 73 f 3.0 70.0
5 Anne Lynn 23 f NaN NaN
Exercise: eliminate the entries with the largest number of NAs, then add a new column preTestScore_mean to contain the mean value for preTestScore.
DataFrame after eliminating entries with the largest number of NAs:
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42 m 4.0 25.0
3 Jake Milner 24 m 2.0 62.0
4 Amy Cooze 73 f 3.0 70.0
DataFrame after adding 'preTestScore_mean':
first_name last_name age sex preTestScore postTestScore preTestScore_mean
0 Jason Miller 42 m 4.0 25.0 3.0
3 Jake Milner 24 m 2.0 62.0 3.0
4 Amy Cooze 73 f 3.0 70.0 3.0
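One way to produce the two outputs above (a sketch: dropna() removes every row containing an NA, which here eliminates the entries carrying the most NAs):

```python
import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', 'Mary', 'Tina', 'Jake', 'Amy', 'Anne'],
            'last_name': ['Miller', 'Smith', 'Ali', 'Milner', 'Cooze', 'Lynn'],
            'age': [42, np.nan, 36, 24, 73, '23'],
            'sex': ['m', np.nan, 'f', 'm', 'f', 'f'],
            'preTestScore': [4, np.nan, np.nan, 2, 3, np.nan],
            'postTestScore': [25, np.nan, np.nan, 62, 70, np.nan]}
df = pd.DataFrame(raw_data)

df_clean = df.dropna()  # rows 0, 3 and 4 remain
# Add a column holding the mean of the remaining preTestScore values
df_clean = df_clean.assign(preTestScore_mean=df_clean['preTestScore'].mean())
print(df_clean)
```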
Here is a dataframe:
Python
import pandas as pd
import numpy as np
raw_data = {'first_name': ['Liam', 'Olivia', 'Noah', 'Emma', 'Ava', 'Sophia'],
'last_name': ['Brown', 'Davis', 'Wilson', 'Moore', 'Taylor', 'Anderson'],
'age': [34, np.nan, 28, 22, 67, '29'],
'sex': ['m', 'f', np.nan, 'f', 'f', 'm'],
'exam1_score': [85, np.nan, 78, np.nan, 92, 88],
'exam2_score': [91, 87, np.nan, np.nan, 95, np.nan]}
df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'sex', 'exam1_score', 'exam2_score'])
df
first_name last_name age sex exam1_score exam2_score
0 Liam Brown 34 m 85.0 91.0
1 Olivia Davis NaN f NaN 87.0
2 Noah Wilson 28 NaN 78.0 NaN
3 Emma Moore 22 f NaN NaN
4 Ava Taylor 67 f 92.0 95.0
5 Sophia Anderson 29 m 88.0 NaN
Exercise: eliminate the entries with the largest number of NAs, then add a new column exam1_score_mean to contain the mean value for exam1_score.
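Assuming the task analogous to the previous exercise (the column name exam1_score_mean is an assumption), a sketch:

```python
import numpy as np
import pandas as pd

raw_data = {'first_name': ['Liam', 'Olivia', 'Noah', 'Emma', 'Ava', 'Sophia'],
            'last_name': ['Brown', 'Davis', 'Wilson', 'Moore', 'Taylor', 'Anderson'],
            'age': [34, np.nan, 28, 22, 67, '29'],
            'sex': ['m', 'f', np.nan, 'f', 'f', 'm'],
            'exam1_score': [85, np.nan, 78, np.nan, 92, 88],
            'exam2_score': [91, 87, np.nan, np.nan, 95, np.nan]}
df = pd.DataFrame(raw_data)

df_clean = df.dropna()  # only rows 0 and 4 are free of NAs
# Add a column holding the mean of the remaining exam1_score values
df_clean = df_clean.assign(exam1_score_mean=df_clean['exam1_score'].mean())
print(df_clean)
```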
Core Structures: Pandas relies on Series (1D arrays) and DataFrames (tabular data) for efficient data handling.
Data Operations: Supports filtering, indexing, appending, and modifying data seamlessly.
Missing Data: Manage missing values with dropna or fillna methods effectively.
Practical Use: Exercises highlight creating, accessing, and analyzing data with ease.
Popescu (JCU): Lecture 8