Lecture 9: Pandas

Bogdan G. Popescu

John Cabot University

Introduction

Pandas is a library that provides high-performance, easy-to-use data structures and data-analysis tools.

Pandas is useful for manipulating tabular data such as spreadsheets and SQL tables.

Common file formats that work with pandas include CSV, XLS, SQL, and HDF5.

Common operations with pandas include filtering, adding, and deleting data.

Overall, it combines the high performance of NumPy arrays with flexible tools for manipulating data.

Introduction

To import the library, we can simply type:

import pandas as pd
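
As a quick preview of these formats and operations, here is a minimal sketch; the file cities.csv and its columns state, year, and pop are hypothetical:

df = pd.read_csv("cities.csv")                       # read a CSV file into a DataFrame
filtered = df[df["pop"] > 1.0].copy()                # filter rows by a condition
filtered["decade"] = (filtered["year"] // 10) * 10   # add a new column
filtered = filtered.drop(columns=["year"])           # delete a column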

Dataframes and Series

Pandas' functionality is built on two fundamental structures: Series and DataFrames.

A Series is an object that contains a one-dimensional array of data and an array of labels, known as an index.

If no index or label is specified, it is generated internally as an ordered sequence of integers.

s = pd.Series([1979, 1980, 1981, 1982], index=[1,2,3,4])
s
1    1979
2    1980
3    1981
4    1982
dtype: int64

Dataframes and Series

A DataFrame is a data structure that stores data in tabular form, that is, organized into rows and labeled columns.

Each row contains an observation and each column a variable.

A DataFrame accepts heterogeneous data, that is, variables can be of different types (numeric, string, boolean, etc.).

s = pd.DataFrame([1979, 1980, 1981, 1982])
s
      0
0  1979
1  1980
2  1981
3  1982
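
To illustrate the heterogeneity, here is a minimal sketch with made-up columns of three different types:

df_mixed = pd.DataFrame({'name': ['Ana', 'Ben'],      # strings
                         'year': [1979, 1980],        # integers
                         'active': [True, False]})    # booleans
df_mixed.dtypes
name      object
year       int64
active      bool
dtype: object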

Creating a Pandas Series

It is easy to create a Series from a simple list:

series = pd.Series([1979, 1980, 1981, 1982])
series
0    1979
1    1980
2    1981
3    1982
dtype: int64

A Series has both values and an index. For example, we can also define the index explicitly:

series_w_index = pd.Series([1979, 1980, 1981, 1982], index = [0,1,2,3])
series_w_index
0    1979
1    1980
2    1981
3    1982
dtype: int64

Extracting Elements of a Pandas Series

We can extract the values of the Pandas Series in the following way:

series_w_index.values
array([1979, 1980, 1981, 1982])

We can extract the index associated with this data in the following way:

series_w_index.index
Index([0, 1, 2, 3], dtype='int64')

The index does not need to be a sequence of numbers:

index = ['carolina', 'martha', 'nicky', 'theresa']
series = [1979, 1980, 1981, 1982]
series_w_index = pd.Series(series, index) 
series_w_index
carolina    1979
martha      1980
nicky       1981
theresa     1982
dtype: int64

Series from Dictionaries

We can also create Pandas series from dictionaries

dicc = {'{} squared, is:'.format(i): i * i for i in range(4)}
dicc
{'0 squared, is:': 0, '1 squared, is:': 1, '2 squared, is:': 4, '3 squared, is:': 9}

This is what the series would look like:

series_dicc = pd.Series(dicc)
series_dicc
0 squared, is:    0
1 squared, is:    1
2 squared, is:    4
3 squared, is:    9
dtype: int64

Accessing Data in a Series

There are multiple ways to access data in a series

index = ['carolina', 'martha', 'nicky', 'theresa']
series = [1979, 1980, 1981, 1982]
series_w_index = pd.Series(series, index) 
series_w_index
carolina    1979
martha      1980
nicky       1981
theresa     1982
dtype: int64

One way is by using the index

series_w_index['martha']
1980

Another way is by using the series' internal positional index with iloc. Note that positions start at 0.

series_w_index.iloc[1]
1980

Accessing Data in a Series

We can also select only some observations

series_w_index.iloc[1:]
martha     1980
nicky      1981
theresa    1982
dtype: int64

Or

series_w_index.iloc[1:3]
martha    1980
nicky     1981
dtype: int64
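
Label-based slicing with loc is also possible. Unlike positional slicing, it includes both endpoints:

series_w_index.loc['martha':'nicky']
martha    1980
nicky     1981
dtype: int64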

Operations with Series: appending

We can append two series together.

test1_scores = pd.Series([85, 90, 78])
test2_scores = pd.Series([82, 91, 79, 94])

There are three scores in the first series and four scores in the second

We can append them without ignoring the index:

combined_scores = test1_scores._append(test2_scores)
combined_scores
0    85
1    90
2    78
0    82
1    91
2    79
3    94
dtype: int64

Notice how the indices from test2_scores (0 to 3) are retained.

Operations with Series: appending

Note that when we append test2_scores to test1_scores using the _append() method, the indices of test2_scores are retained.

This means that the indices from test2_scores (which start again from 0) are kept as they are.

We can pass ignore_index=True, which will generate a fresh, continuous index:

combined_scores_reset = test1_scores._append(test2_scores, ignore_index=True)
print("Combined Scores with ignoring index:\n", combined_scores_reset)
Combined Scores with ignoring index:
 0    85
1    90
2    78
3    82
4    91
5    79
6    94
dtype: int64

Operations with Series: appending

The takeaways here are:

  • Without ignore_index, the resulting series keeps the original indices from both series, leading to duplicate index labels

  • Using ignore_index=True results in reindexing the combined series, ensuring that the index is continuous.
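
Note that _append() is a private method; the public Series.append() was deprecated and removed in pandas 2.0. The documented way to combine series is pd.concat, which produces the same result:

combined = pd.concat([test1_scores, test2_scores], ignore_index=True)
combined
0    85
1    90
2    78
3    82
4    91
5    79
6    94
dtype: int64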

Operations with Series: pop

Another useful operation is pop.

For example, let's say that we have a set of students and their grades.

We need to remove a particular student's score from the series.

scores = pd.Series([85, 90, 78, 92, 88], 
index=['Alice', 'Bob', 'Charlie', 'David', 'Eve'])
scores
Alice      85
Bob        90
Charlie    78
David      92
Eve        88
dtype: int64

Operations with Series: pop

To get rid of Charlie’s score, we would use pop.

scores.pop('Charlie')
78
scores
Alice    85
Bob      90
David    92
Eve      88
dtype: int64
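
Note that pop removes the entry in place and returns the removed value. If we want to keep the original series intact, drop returns a new series instead; a minimal sketch (rebuilding the series first, since Charlie's score was just removed):

scores = pd.Series([85, 90, 78, 92, 88],
                   index=['Alice', 'Bob', 'Charlie', 'David', 'Eve'])
without_charlie = scores.drop('Charlie')  # scores itself is unchanged
without_charlie
Alice    85
Bob      90
David    92
Eve      88
dtype: int64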

Dataframes in Pandas

Unlike series, dataframes in pandas are designed to store heterogeneous multivariate data.

For example, we can see below:

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
df
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9

Notice that the index is generated automatically.

Dataframes in Pandas

We can also specify our very own index in the following way:

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data, index = ["100", "101", "102", "103", "104"] )
df
      state  year  pop
100    Ohio  2000  1.5
101    Ohio  2001  1.7
102    Ohio  2002  3.6
103  Nevada  2001  2.4
104  Nevada  2002  2.9

If we simply want to see the columns in a dataframe, we can easily use the columns attribute.

df.columns
Index(['state', 'year', 'pop'], dtype='object')
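
In the same way, the index and dtypes attributes show the row labels and the type of each column:

df.index
Index(['100', '101', '102', '103', '104'], dtype='object')
df.dtypes
state     object
year       int64
pop      float64
dtype: object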

Accessing Data in a Dataframe

Columns can be extracted from a dataframe using the column label

The result is a Series object

df['pop']
100    1.5
101    1.7
102    3.6
103    2.4
104    2.9
Name: pop, dtype: float64

Accessing Data in a Dataframe

It is possible to extract more than one column.

df[["state", "year"]]
      state  year
100    Ohio  2000
101    Ohio  2001
102    Ohio  2002
103  Nevada  2001
104  Nevada  2002

To access a particular observation (row), we can use loc:

df.loc["100"]
state    Ohio
year     2000
pop       1.5
Name: 100, dtype: object
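
We can also access a row by position with iloc, or a single cell by giving loc both a row and a column label:

df.iloc[0]
state    Ohio
year     2000
pop       1.5
Name: 100, dtype: object
df.loc["100", "pop"]
1.5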

DataFrame Methods

There are a variety of operations that can be used with dataframes

For example, the following tells us the shape of the dataframe:

df.shape
(5, 3)

The values attribute returns the data in our dataframe as a NumPy array:

df.values
array([['Ohio', 2000, 1.5],
       ['Ohio', 2001, 1.7],
       ['Ohio', 2002, 3.6],
       ['Nevada', 2001, 2.4],
       ['Nevada', 2002, 2.9]], dtype=object)
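
In recent versions of pandas, the to_numpy() method is the recommended way to obtain the same array:

df.to_numpy()
array([['Ohio', 2000, 1.5],
       ['Ohio', 2001, 1.7],
       ['Ohio', 2002, 3.6],
       ['Nevada', 2001, 2.4],
       ['Nevada', 2002, 2.9]], dtype=object)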

DataFrame Methods

We can also inspect the head and the tail of our dataframes.

For example (since our dataframe has only five rows, head() returns all of them):

df.head()
      state  year  pop
100    Ohio  2000  1.5
101    Ohio  2001  1.7
102    Ohio  2002  3.6
103  Nevada  2001  2.4
104  Nevada  2002  2.9

or

df.tail()
      state  year  pop
100    Ohio  2000  1.5
101    Ohio  2001  1.7
102    Ohio  2002  3.6
103  Nevada  2001  2.4
104  Nevada  2002  2.9
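
Both methods take the number of rows to display as an optional argument:

df.head(2)
      state  year  pop
100    Ohio  2000  1.5
101    Ohio  2001  1.7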

DataFrame Methods

We can get a statistical summary of the numeric variables in the following way:

df.describe()
             year       pop
count     5.00000  5.000000
mean   2001.20000  2.420000
std       0.83666  0.864292
min    2000.00000  1.500000
25%    2001.00000  1.700000
50%    2001.00000  2.400000
75%    2002.00000  2.900000
max    2002.00000  3.600000
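
By default, describe() summarizes only the numeric columns. Passing include='all' extends the summary to the string column as well (unique, top, and freq apply only to state):

df.describe(include='all')
         state        year       pop
count        5     5.00000  5.000000
unique       2         NaN       NaN
top       Ohio         NaN       NaN
freq         3         NaN       NaN
mean       NaN  2001.20000  2.420000
std        NaN     0.83666  0.864292
min        NaN  2000.00000  1.500000
25%        NaN  2001.00000  1.700000
50%        NaN  2001.00000  2.400000
75%        NaN  2002.00000  2.900000
max        NaN  2002.00000  3.600000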

Exercises

Look at the following dictionary and index

import numpy as np
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

Exercises

  1. Create a DataFrame from the dictionary and indexes.
  2. Select the columns animal and age.
  3. Indicate how many different types of animals there are.
  4. Indicate how many animals there are of each type.
  5. Show a statistical summary of all variables.

Exercises

  1. Create a DataFrame from the dictionary and indexes.
new_data = pd.DataFrame(data, index=labels)
new_data
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no

Exercises

  2. Select the columns animal and age.
new_data[["animal", "age"]]
  animal  age
a    cat  2.5
b    cat  3.0
c  snake  0.5
d    dog  NaN
e    dog  5.0
f    cat  2.0
g  snake  4.5
h    cat  NaN
i    dog  7.0
j    dog  3.0

Exercises

  3. Indicate how many different types of animals there are.
new_data.animal.nunique()
3

Exercises

  4. Indicate how many animals there are of each type.
new_data.animal.value_counts()
animal
cat      4
dog      4
snake    2
Name: count, dtype: int64

Exercises

  5. Show a statistical summary of all variables.
new_data.describe()
            age     visits
count  8.000000  10.000000
mean   3.437500   1.900000
std    2.007797   0.875595
min    0.500000   1.000000
25%    2.375000   1.000000
50%    3.000000   2.000000
75%    4.625000   2.750000
max    7.000000   3.000000

Missing Data

Many datasets have fields where some observations are missing.

These can cause problems when trying to represent data and perform certain operations

Thus, it is necessary to identify and do something about these missing values.

There are two main strategies for dealing with these missing values:

  • deletion
  • assignment of a value

Missing Data

df = pd.DataFrame({'VarA': ['aa', None, 'cc',None],
                  'VarB': [20, 30, None,None],
                  'VarC': [1234, 3456, 6789,765],
                  'VarD': [1234, 888, None,None]
                 },
                 index=['Case1', 'Case2', 'Case3', 'Case4'])
df
       VarA  VarB  VarC    VarD
Case1    aa  20.0  1234  1234.0
Case2  None  30.0  3456   888.0
Case3    cc   NaN  6789     NaN
Case4  None   NaN   765     NaN

Checking Missing Data

We can check whether data is missing

df.isnull()
        VarA   VarB   VarC   VarD
Case1  False  False  False  False
Case2   True  False  False  False
Case3  False   True  False   True
Case4   True   True  False   True

We can also check if there is any element in our data that is missing

df.isnull().values.any()
True
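
We can also count the missing values in each column by chaining isnull() with sum():

df.isnull().sum()
VarA    2
VarB    2
VarC    0
VarD    2
dtype: int64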

Eliminating Missing Values

The simplest way to deal with NAs is to drop them:

df.dropna()
      VarA  VarB  VarC    VarD
Case1   aa  20.0  1234  1234.0

We can see below that the NA values have not actually been dropped from df; dropna returned a new dataframe:

df
       VarA  VarB  VarC    VarD
Case1    aa  20.0  1234  1234.0
Case2  None  30.0  3456   888.0
Case3    cc   NaN  6789     NaN
Case4  None   NaN   765     NaN

To make the deletion permanent, we can pass inplace=True:

df.dropna(inplace=True)
df
      VarA  VarB  VarC    VarD
Case1   aa  20.0  1234  1234.0

Eliminating Missing Values

We can indicate the columns from which we want to drop the NAs using the subset argument (note that df here contains only Case1, because of the inplace call above):

df.dropna(subset=['VarB'])
      VarA  VarB  VarC    VarD
Case1   aa  20.0  1234  1234.0
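
dropna also accepts a thresh argument, which keeps only the rows with at least a given number of non-missing values; a minimal sketch (rebuilding df first, since the inplace call above reduced it to a single row):

df = pd.DataFrame({'VarA': ['aa', None, 'cc', None],
                   'VarB': [20, 30, None, None],
                   'VarC': [1234, 3456, 6789, 765],
                   'VarD': [1234, 888, None, None]},
                  index=['Case1', 'Case2', 'Case3', 'Case4'])
df.dropna(thresh=3)  # keep rows with at least 3 non-NA values
       VarA  VarB  VarC    VarD
Case1    aa  20.0  1234  1234.0
Case2  None  30.0  3456   888.0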

Assigning Values

The other way to deal with missing values is to assign values to NAs

We can do this with the fillna method.

As with dropna, we need to pass the parameter inplace=True to keep the changes.

df = pd.DataFrame({'VarA': ['aa', None, 'cc',None],
                  'VarB': [20, 30, None,None],
                  'VarC': [1234, 3456, 6789,765],
                  'VarD': [1234, 888, None,None]
                 },
                 index=['Case1', 'Case2', 'Case3', 'Case4'])
df
       VarA  VarB  VarC    VarD
Case1    aa  20.0  1234  1234.0
Case2  None  30.0  3456   888.0
Case3    cc   NaN  6789     NaN
Case4  None   NaN   765     NaN

Assigning Values

Here we fill every missing value in the dataframe with the string 'new':

df.fillna('new', inplace=True)
df
      VarA  VarB  VarC    VarD
Case1   aa  20.0  1234  1234.0
Case2  new  30.0  3456   888.0
Case3   cc   new  6789     new
Case4  new   new   765     new
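
Instead of a single value for every column, fillna also accepts a dictionary that maps column names to fill values, for example the column mean; a minimal sketch (rebuilding df, since it was just filled in place):

df = pd.DataFrame({'VarA': ['aa', None, 'cc', None],
                   'VarB': [20, 30, None, None],
                   'VarC': [1234, 3456, 6789, 765],
                   'VarD': [1234, 888, None, None]},
                  index=['Case1', 'Case2', 'Case3', 'Case4'])
df.fillna({'VarA': 'missing', 'VarB': df['VarB'].mean()})
          VarA  VarB  VarC    VarD
Case1       aa  20.0  1234  1234.0
Case2  missing  30.0  3456   888.0
Case3       cc  25.0  6789     NaN
Case4  missing  25.0   765     NaN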

Exercises

Here is a dataframe:

raw_data = {'first_name': ['Jason', 'Mary', 'Tina', 'Jake', 'Amy','Anne'], 
        'last_name': ['Miller', 'Smith', 'Ali', 'Milner', 'Cooze','Lynn'], 
        'age': [42, np.nan, 36, 24, 73,'23'], 
        'sex': ['m', np.nan, 'f', 'm', 'f','f'], 
        'preTestScore': [4, np.nan, np.nan, 2, 3, np.nan],
        'postTestScore': [25, np.nan, np.nan, 62, 70, np.nan]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
df
  first_name last_name  age  sex  preTestScore  postTestScore
0      Jason    Miller   42    m           4.0           25.0
1       Mary     Smith  NaN  NaN           NaN            NaN
2       Tina       Ali   36    f           NaN            NaN
3       Jake    Milner   24    m           2.0           62.0
4        Amy     Cooze   73    f           3.0           70.0
5       Anne      Lynn   23    f           NaN            NaN

Exercises

  1. Determine which column(s) has the largest numbers of NAs
  2. Eliminate the entries with the largest number of NAs
  3. Add a new variable containing the mean preTestScore.

Exercises Solutions

# 1. Determine which column(s) have the largest number of NAs
na_counts = df.isna().sum()
print("Number of NAs in each column:")
Number of NAs in each column:
print(na_counts)
first_name       0
last_name        0
age              1
sex              1
preTestScore     3
postTestScore    3
dtype: int64

Exercises Solutions

# 2. Eliminate the entries with the largest number of NAs
max_na_column = na_counts.idxmax()
df_cleaned = df.dropna(subset=[max_na_column])
print("\nDataFrame after eliminating entries with the largest number of NAs:")

DataFrame after eliminating entries with the largest number of NAs:
print(df_cleaned)
  first_name last_name age sex  preTestScore  postTestScore
0      Jason    Miller  42   m           4.0           25.0
3       Jake    Milner  24   m           2.0           62.0
4        Amy     Cooze  73   f           3.0           70.0

Exercises Solutions

# 3. Add a new variable 'preTestScore_mean' with the mean value
mean_preTestScore = df_cleaned['preTestScore'].mean()
df_cleaned['preTestScore_mean'] = mean_preTestScore
print("\nDataFrame after adding 'preTestScore_mean':")

DataFrame after adding 'preTestScore_mean':
print(df_cleaned)
  first_name last_name age sex  preTestScore  postTestScore  preTestScore_mean
0      Jason    Miller  42   m           4.0           25.0                3.0
3       Jake    Milner  24   m           2.0           62.0                3.0
4        Amy     Cooze  73   f           3.0           70.0                3.0