Pandas is a library that provides high-performance, easy-to-use data structures and data-analysis tools.
Pandas is useful for manipulating tabular data such as spreadsheets and SQL tables.
Common file formats that work with pandas include CSV, XLS, SQL, and HDF5.
Common operations with pandas include filtering, adding, or deleting data.
Overall, it combines the high performance of NumPy arrays with convenient tools for manipulating data.
To import the library, we can simply type:
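Assuming the standard convention, the import looks like this; the pd alias is used throughout the rest of these notes:

```python
# Import pandas under its conventional alias
import pandas as pd
```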
Pandas' functionality is based on two fundamental structures: Series and DataFrames.
A Series is an object that contains a one-dimensional array of data and an array of labels, known as an index.
If no index or label is specified, it is generated internally as an ordered sequence of integers.
A DataFrame is a data structure that stores data in tabular form, that is, ordered in rows and labeled columns.
Each row contains an observation and each column a variable.
A DataFrame accepts heterogeneous data, that is, variables can be of different types (numeric, string, boolean, etc.).
It is easy to create a Series starting from a simple list.
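A minimal sketch of building a Series from a list (the values here are illustrative):

```python
import pandas as pd

# Build a Series from a plain Python list;
# the integer index 0..3 is generated automatically
s = pd.Series([4, 7, -5, 3])
print(s)
# 0    4
# 1    7
# 2   -5
# 3    3
# dtype: int64
```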
A Series can have both values and an index. For example, we can also define our own list:
We can extract the values of the Pandas Series in the following way:
We can extract the index associated with this data in the following way:
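Both extractions can be sketched as follows (reusing the list of years that appears later in this section):

```python
import pandas as pd

s = pd.Series([1979, 1980, 1981, 1982])

# .values returns the underlying NumPy array of data
print(s.values)   # [1979 1980 1981 1982]

# .index returns the automatically generated integer index
print(s.index)    # RangeIndex(start=0, stop=4, step=1)
```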
The index does not need to be a sequence of numbers:
We can also create Pandas series from dictionaries
{'0 squared, is:': 0, '1 squared, is:': 1, '2 squared, is:': 4, '3 squared, is:': 9}
This is what the series would look like:
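Building the Series from the dictionary above might look like this:

```python
import pandas as pd

squares = {'0 squared, is:': 0, '1 squared, is:': 1,
           '2 squared, is:': 4, '3 squared, is:': 9}

# The dictionary keys become the index and the values become the data
s = pd.Series(squares)
print(s)
# 0 squared, is:    0
# 1 squared, is:    1
# 2 squared, is:    4
# 3 squared, is:    9
# dtype: int64
```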
There are multiple ways to access data in a series
index = ['carolina', 'martha', 'nicky', 'theresa']
series = [1979, 1980, 1981, 1982]
series_w_index = pd.Series(series, index)
series_w_index
carolina 1979
martha 1980
nicky 1981
theresa 1982
dtype: int64
One way is by using the index
Another way is by using the series internal index. Note that indices start with 0.
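Both access styles can be sketched with the series defined above:

```python
import pandas as pd

index = ['carolina', 'martha', 'nicky', 'theresa']
series = [1979, 1980, 1981, 1982]
series_w_index = pd.Series(series, index)

# Access by the explicit label index
print(series_w_index['martha'])   # 1980

# Access by the internal positional index, which starts at 0
print(series_w_index.iloc[1])     # 1980
```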
We can also select only some observations
Or
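A sketch of both ways of selecting a subset, using the same series:

```python
import pandas as pd

index = ['carolina', 'martha', 'nicky', 'theresa']
series_w_index = pd.Series([1979, 1980, 1981, 1982], index)

# Select some observations by passing a list of labels...
print(series_w_index[['martha', 'theresa']])

# ...or by slicing on positions (the end position is excluded)
print(series_w_index.iloc[1:3])   # martha and nicky
```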
We can append two series together.
There are three scores in the first series and four scores in the second
We can append them without ignoring the index:
0 85
1 90
2 78
0 82
1 91
2 79
3 94
dtype: int64
Notice how the indices from test2_scores (0 to 3) are retained.
Note that when we append s2 to s1 using the _append() method, the indices of s2 are retained: they start again from 0 and are kept as they are.
We can use ignore_index=True, which will produce new indices:
The take-aways here are:
- Without ignore_index, the resulting series keeps the original indices of both series, leading to duplicate indices.
- Using ignore_index=True results in reindexing the combined series, ensuring that the index is continuous.
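Both behaviours can be sketched as follows; the names test1_scores and test2_scores follow the text, and pd.concat is used here as the stable public equivalent of the _append() method:

```python
import pandas as pd

test1_scores = pd.Series([85, 90, 78])
test2_scores = pd.Series([82, 91, 79, 94])

# Without ignoring the index: both original indices survive, so labels repeat
combined = pd.concat([test1_scores, test2_scores])
print(list(combined.index))    # [0, 1, 2, 0, 1, 2, 3]

# With ignore_index=True: the combined series is reindexed continuously
reindexed = pd.concat([test1_scores, test2_scores], ignore_index=True)
print(list(reindexed.index))   # [0, 1, 2, 3, 4, 5, 6]
```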
Another type of operation is pop. For example, let's say that we have a bunch of students and grades, and we need to remove a particular student's score from the series. To get rid of Charlie's score, we would use pop.
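A sketch of pop on a hypothetical grade series (the names and scores here are illustrative):

```python
import pandas as pd

grades = pd.Series({'Alice': 88, 'Bob': 75, 'Charlie': 91, 'Dana': 82})

# pop removes the entry from the series in place and returns its value
removed = grades.pop('Charlie')
print(removed)                      # 91
print('Charlie' in grades.index)    # False
```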
Unlike series, dataframes in pandas are designed to store heterogeneous multivariate data.
For example, we can see below:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
df
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
We notice that the index appears automatically
We can also specify our very own index in the following way:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data, index = ["100", "101", "102", "103", "104"] )
df
state year pop
100 Ohio 2000 1.5
101 Ohio 2001 1.7
102 Ohio 2002 3.6
103 Nevada 2001 2.4
104 Nevada 2002 2.9
If we simply want to see the columns in a dataframe, we can easily use the columns attribute.
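With the df defined above, this looks like:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data, index=["100", "101", "102", "103", "104"])

# The columns attribute lists the column labels
print(df.columns)   # Index(['state', 'year', 'pop'], dtype='object')
```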
Columns can be extracted from a dataframe using the column label
The result is a Series object
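A sketch of single-column extraction with the same df:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data, index=["100", "101", "102", "103", "104"])

# Indexing with a single column label returns a Series
states = df['state']
print(type(states))   # <class 'pandas.core.series.Series'>
print(states)
```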
It is possible to extract more than one column.
state year
100 Ohio 2000
101 Ohio 2001
102 Ohio 2002
103 Nevada 2001
104 Nevada 2002
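The table above is presumably produced by passing a list of column labels, which returns a new DataFrame:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data, index=["100", "101", "102", "103", "104"])

# A list of labels selects several columns at once
subset = df[['state', 'year']]
print(subset)
```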
To access a particular observation, we can use loc
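A sketch of loc with the same df:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data, index=["100", "101", "102", "103", "104"])

# loc selects a row by its index label; the result is a Series
row = df.loc["102"]
print(row)   # state: Ohio, year: 2002, pop: 3.6
```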
There are a variety of operations that can be used with dataframes
For example, the following tells us the shape of the dataframe:
The values attribute will show the data values within our dataframe:
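Both operations can be sketched as:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data, index=["100", "101", "102", "103", "104"])

# shape gives (number of rows, number of columns)
print(df.shape)    # (5, 3)

# values returns the underlying data as a NumPy array
print(df.values)
```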
We can also inspect the head and the tail of our dataframes.
For example, we can have:
<bound method NDFrame.head of      state  year  pop
100    Ohio  2000  1.5
101    Ohio  2001  1.7
102    Ohio  2002  3.6
103  Nevada  2001  2.4
104  Nevada  2002  2.9>
Note that df.head without parentheses returns the bound method itself; to actually display the first rows, call df.head().
or
We can have a summary of our variables in the following way:
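Sketches of head, tail, and a summary with describe (describe covers the numeric columns by default):

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data, index=["100", "101", "102", "103", "104"])

print(df.head())      # first five rows (all of them here); note the parentheses
print(df.tail(2))     # last two rows
print(df.describe())  # count, mean, std, min, quartiles, max for numeric columns
```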
Look at the following dictionary and index
import numpy as np
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
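The text stops after the dictionary and the labels; presumably they are meant to be combined into a labeled DataFrame along these lines:

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

# Use the labels as the row index
df = pd.DataFrame(data, index=labels)
print(df)
```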
Many datasets have fields where some observations are missing.
These can cause problems when trying to represent data and perform certain operations
Thus, it is necessary to identify and do something about these missing values.
There are two strategies for dealing with missing values: deleting them, or assigning them a given value:
df = pd.DataFrame({'VarA': ['aa', None, 'cc', None],
                   'VarB': [20, 30, None, None],
                   'VarC': [1234, 3456, 6789, 765],
                   'VarD': [1234, 888, None, None]},
                  index=['Case1', 'Case2', 'Case3', 'Case4'])
df
df
VarA VarB VarC VarD
Case1 aa 20.0 1234 1234.0
Case2 None 30.0 3456 888.0
Case3 cc NaN 6789 NaN
Case4 None NaN 765 NaN
We can check whether data is missing
VarA VarB VarC VarD
Case1 False False False False
Case2 True False False False
Case3 False True False True
Case4 True True False True
We can also check if there is any element in our data that is missing
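Both checks can be sketched with the df above:

```python
import pandas as pd

df = pd.DataFrame({'VarA': ['aa', None, 'cc', None],
                   'VarB': [20, 30, None, None],
                   'VarC': [1234, 3456, 6789, 765],
                   'VarD': [1234, 888, None, None]},
                  index=['Case1', 'Case2', 'Case3', 'Case4'])

# isnull() returns a boolean mask of the same shape as df
print(df.isnull())

# any() per column: is anything missing in each variable?
print(df.isnull().any())        # only VarC is complete

# any() twice: is anything missing anywhere in the data?
print(df.isnull().any().any())  # True
```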
The simplest way to deal with NAs is to drop them with the dropna method.
We can see below that the NA values have not been dropped, because dropna returns a new DataFrame by default:
VarA VarB VarC VarD
Case1 aa 20.0 1234 1234.0
Case2 None 30.0 3456 888.0
Case3 cc NaN 6789 NaN
Case4 None NaN 765 NaN
In order to make the deletion persist, we can use the parameter inplace=True:
We can indicate the columns from which we want to drop the NAs.
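The dropna variants described above can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({'VarA': ['aa', None, 'cc', None],
                   'VarB': [20, 30, None, None],
                   'VarC': [1234, 3456, 6789, 765],
                   'VarD': [1234, 888, None, None]},
                  index=['Case1', 'Case2', 'Case3', 'Case4'])

# dropna() alone returns a new DataFrame; df itself is unchanged
df.dropna()
print(len(df))                       # still 4 rows

# inplace=True makes the deletion persist
complete = df.copy()
complete.dropna(inplace=True)
print(list(complete.index))          # ['Case1'] -- the only complete row

# subset= restricts the missing-value check to particular columns
print(df.dropna(subset=['VarA']))    # keeps Case1 and Case3
```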
The other way to deal with missing values is to assign values to NAs. We can do this with the fillna method. Similar to dropna, it is required to use the parameter inplace=True to maintain the changes.
df = pd.DataFrame({'VarA': ['aa', None, 'cc', None],
                   'VarB': [20, 30, None, None],
                   'VarC': [1234, 3456, 6789, 765],
                   'VarD': [1234, 888, None, None]},
                  index=['Case1', 'Case2', 'Case3', 'Case4'])
df
df
VarA VarB VarC VarD
Case1 aa 20.0 1234 1234.0
Case2 None 30.0 3456 888.0
Case3 cc NaN 6789 NaN
Case4 None NaN 765 NaN
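A sketch of fillna on the df above; the fill values chosen here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'VarA': ['aa', None, 'cc', None],
                   'VarB': [20, 30, None, None],
                   'VarC': [1234, 3456, 6789, 765],
                   'VarD': [1234, 888, None, None]},
                  index=['Case1', 'Case2', 'Case3', 'Case4'])

# Fill each column's NAs with a chosen value; inplace=True maintains the changes
df.fillna({'VarA': 'missing', 'VarB': 0, 'VarD': 0}, inplace=True)
print(df)
```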
Here is a dataframe:
raw_data = {'first_name': ['Jason', 'Mary', 'Tina', 'Jake', 'Amy','Anne'],
'last_name': ['Miller', 'Smith', 'Ali', 'Milner', 'Cooze','Lynn'],
'age': [42, np.nan, 36, 24, 73,'23'],
'sex': ['m', np.nan, 'f', 'm', 'f','f'],
'preTestScore': [4, np.nan, np.nan, 2, 3, np.nan],
'postTestScore': [25, np.nan, np.nan, 62, 70, np.nan]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42 m 4.0 25.0
1 Mary Smith NaN NaN NaN NaN
2 Tina Ali 36 f NaN NaN
3 Jake Milner 24 m 2.0 62.0
4 Amy Cooze 73 f 3.0 70.0
5 Anne Lynn 23 f NaN NaN
# 1. Count the number of NAs in each column
na_counts = df.isnull().sum()

# 2. Eliminate the entries with the largest number of NAs
max_na_column = na_counts.idxmax()
df_cleaned = df.dropna(subset=[max_na_column])
print("\nDataFrame after eliminating entries with the largest number of NAs:")
DataFrame after eliminating entries with the largest number of NAs:
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42 m 4.0 25.0
3 Jake Milner 24 m 2.0 62.0
4 Amy Cooze 73 f 3.0 70.0
# 3. Add a new variable 'preTestScore' with the mean value
mean_preTestScore = df_cleaned['preTestScore'].mean()
df_cleaned['preTestScore_mean'] = mean_preTestScore
print("\nDataFrame after adding 'preTestScore_mean':")
DataFrame after adding 'preTestScore_mean':
first_name last_name age sex preTestScore postTestScore preTestScore_mean
0 Jason Miller 42 m 4.0 25.0 3.0
3 Jake Milner 24 m 2.0 62.0 3.0
4 Amy Cooze 73 f 3.0 70.0 3.0
Popescu (JCU): Lecture 9