Pandas

Bogdan G. Popescu

John Cabot University

Introduction

Pandas is a library that provides high-performance, easy-to-use data structures and analytical tools.

Pandas is useful for manipulating tabular data such as spreadsheets and SQL tables.

Common file formats that work with pandas include CSV, Excel, SQL databases, and HDF5.

Common operations with pandas include filtering, adding, and deleting data.

Overall, it combines the high performance of NumPy arrays with the ability to manipulate data.
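As a brief illustration of these operations, the sketch below filters rows, adds a derived column, and drops a column on a small hand-made table (the data here is invented purely for demonstration):

```python
import pandas as pd

# A small table to illustrate the typical operations
df = pd.DataFrame({"name": ["Ana", "Ben", "Cleo"], "score": [88, 72, 95]})

# Filtering: keep rows where score is above 80
high = df[df["score"] > 80]

# Adding: create a new column derived from an existing one
df["passed"] = df["score"] >= 75

# Deleting: drop a column (axis=1); this returns a new dataframe
trimmed = df.drop("passed", axis=1)
```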

Introduction

To import the library, we can simply type:

Python
import pandas as pd

A DataFrame is a data structure that stores data in tabular form, that is, ordered in rows and labeled columns.

Each row contains an observation and each column a variable.

A DataFrame accepts heterogeneous data, that is, variables can be of different types (numeric, string, boolean, etc.).

Python
s = pd.DataFrame([1979, 1980, 1981, 1982])
s
      0
0  1979
1  1980
2  1981
3  1982

Dataframes

We can also name the column:

Python
s = pd.DataFrame([1979, 1980, 1981, 1982], columns=["Year"])
s
   Year
0  1979
1  1980
2  1981
3  1982

A dataframe has both values and an index. For example, we can define the index explicitly:

Python
s_w_index = pd.DataFrame([1979, 1980, 1981, 1982], index = [0,1,2,3])
s_w_index
      0
0  1979
1  1980
2  1981
3  1982

Extracting Elements of a Pandas Dataframe

We can extract the values of the Pandas Dataframe in the following way:

Python
s_w_index.values
array([[1979],
       [1980],
       [1981],
       [1982]])

We can extract the index associated with this data in the following way:

Python
s_w_index.index
Index([0, 1, 2, 3], dtype='int64')

Extracting Elements of a Pandas Dataframe

The index does not need to be a series of numbers:

Python
index = ['carolina', 'martha', 'nicky', 'theresa']
series = [1979, 1980, 1981, 1982]
series_w_index = pd.DataFrame(series, index=index)
series_w_index
             0
carolina  1979
martha    1980
nicky     1981
theresa   1982
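With a label index, rows can be looked up by name using .loc. A minimal sketch reusing the same data:

```python
import pandas as pd

index = ['carolina', 'martha', 'nicky', 'theresa']
series_w_index = pd.DataFrame([1979, 1980, 1981, 1982], index=index)

# .loc retrieves rows by their label rather than their position
martha_value = series_w_index.loc['martha', 0]
```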

Dataframes from Dictionaries

We can also create Pandas dataframes from dictionaries

Python
dicc = {"Alice": 56, "John": 53, "Anna": 98}
dicc
{'Alice': 56, 'John': 53, 'Anna': 98}
Python
dicc_df = pd.DataFrame(dicc.items())
dicc_df
       0   1
0  Alice  56
1   John  53
2   Anna  98

Or

Python
dicc_df = pd.DataFrame(dicc.items(), columns=['Student', 'Grade'])
dicc_df
  Student  Grade
0   Alice     56
1    John     53
2    Anna     98

Dataframes from Dictionaries

There are multiple ways to access data in a dataframe.

One way is by using the internal integer position with iloc. Note that positions start at 0.

Python
dicc_df.iloc[1]
Student    John
Grade        53
Name: 1, dtype: object

Accessing Data in a Dataframe

We can also select only some observations

Python
dicc_df.iloc[1:]
  Student  Grade
1    John     53
2    Anna     98

Or

Python
dicc_df.iloc[1:3]
  Student  Grade
1    John     53
2    Anna     98

Accessing Data in a Dataframe

We can also select observations that meet a condition.

This is where we use .loc instead of .iloc:

  • .loc is label-based indexing
  • .iloc is position-based indexing
Python
dicc_df.loc[dicc_df["Student"] == "Alice", "Grade"]
0    56
Name: Grade, dtype: int64

or

Python
dicc_df.loc[dicc_df["Grade"] == 56, "Student"]
0    Alice
Name: Student, dtype: object
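Conditions can also be combined. A small sketch reusing the student dataframe from above (note the parentheses around each condition, which pandas requires when combining with &):

```python
import pandas as pd

dicc = {"Alice": 56, "John": 53, "Anna": 98}
dicc_df = pd.DataFrame(dicc.items(), columns=['Student', 'Grade'])

# Combine conditions with & (and) or | (or); each condition
# must be wrapped in its own parentheses
passing = dicc_df.loc[(dicc_df["Grade"] > 50) & (dicc_df["Grade"] < 90), "Student"]
```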

Dataframes in Pandas

Dataframes in pandas are designed to store heterogeneous multivariate data.

For example, we can see below:

Python
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
df
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9

Notice how column names are derived from the dictionary keys.

Notice that a default integer index appears automatically.

Dataframes in Pandas

We can also specify our very own index in the following way:

Python
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data, index=["Row1", "Row2", "Row3", "Row4", "Row5"])
df
       state  year  pop
Row1    Ohio  2000  1.5
Row2    Ohio  2001  1.7
Row3    Ohio  2002  3.6
Row4  Nevada  2001  2.4
Row5  Nevada  2002  2.9

Dataframes in Pandas

If we simply want to see the columns in a dataframe, we can easily use the columns attribute.

Python
df.columns
Index(['state', 'year', 'pop'], dtype='object')

The columns attribute returns the column labels of the DataFrame as an Index object.

Accessing Data in a Dataframe

Columns can be extracted from a dataframe using the column label.

The result is a Series object.

Python
df['pop']
Row1    1.5
Row2    1.7
Row3    3.6
Row4    2.4
Row5    2.9
Name: pop, dtype: float64
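A related detail worth knowing: single brackets return a Series, while double brackets return a one-column DataFrame. A minimal sketch with the same data:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)

# Single brackets give a 1-dimensional Series
as_series = df['pop']

# Double brackets give a 2-dimensional DataFrame with one column
as_frame = df[['pop']]
```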

Accessing Data in a Dataframe

It is possible to extract more than one column.

Python
df2 = df[["state", "year"]]

To access a particular observation, we can use loc

Python
df.loc["Row3"]
state    Ohio
year     2002
pop       3.6
Name: Row3, dtype: object
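loc also accepts a row label and a column label together, returning a single value; .iloc does the same with integer positions. A short sketch:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data, index=["Row1", "Row2", "Row3", "Row4", "Row5"])

# .loc with a row label and a column label returns a single value
value = df.loc["Row3", "pop"]

# .iloc does the same with integer positions (row 2, column 2)
same_value = df.iloc[2, 2]
```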

Dataframes Methods

There are a variety of operations that can be used with dataframes

For example, the following tells us the shape of the dataframe:

Python
df.shape
(5, 3)

The values attribute returns all the values in our dataframe as a NumPy array:

Python
df.values
array([['Ohio', 2000, 1.5],
       ['Ohio', 2001, 1.7],
       ['Ohio', 2002, 3.6],
       ['Nevada', 2001, 2.4],
       ['Nevada', 2002, 2.9]], dtype=object)

Dataframes Methods

We can also inspect the head and the tail of our dataframes.

For example:

Python
df.head()
       state  year  pop
Row1    Ohio  2000  1.5
Row2    Ohio  2001  1.7
Row3    Ohio  2002  3.6
Row4  Nevada  2001  2.4
Row5  Nevada  2002  2.9

or

Python
df.tail()
       state  year  pop
Row1    Ohio  2000  1.5
Row2    Ohio  2001  1.7
Row3    Ohio  2002  3.6
Row4  Nevada  2001  2.4
Row5  Nevada  2002  2.9

Note that head and tail are methods and must be called with parentheses; writing df.head without them returns the method object itself rather than the data. Since df has only five rows, both calls display the full dataframe here.
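Both methods accept the number of rows to display (five by default). A short sketch with the same data:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data, index=["Row1", "Row2", "Row3", "Row4", "Row5"])

# head(n) returns the first n rows, tail(n) the last n rows
first_two = df.head(2)
last_two = df.tail(2)
```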

Dataframes Methods

We can get a statistical summary of our variables in the following way:

Python
df.describe()
             year       pop
count     5.00000  5.000000
mean   2001.20000  2.420000
std       0.83666  0.864292
min    2000.00000  1.500000
25%    2001.00000  1.700000
50%    2001.00000  2.400000
75%    2002.00000  2.900000
max    2002.00000  3.600000

.describe() works on numeric columns by default. For non-numeric columns, use include='all':

Python
df.describe(include='all')
       state        year       pop
count      5     5.00000  5.000000
unique     2         NaN       NaN
top     Ohio         NaN       NaN
freq       3         NaN       NaN
mean     NaN  2001.20000  2.420000
std      NaN     0.83666  0.864292
min      NaN  2000.00000  1.500000
25%      NaN  2001.00000  1.700000
50%      NaN  2001.00000  2.400000
75%      NaN  2002.00000  2.900000
max      NaN  2002.00000  3.600000

Exercise 1

Look at the following dictionary

Python
import numpy as np
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

Exercise 1

  1. Create a DataFrame from the dictionary and indexes.
  2. Select the columns animal and age.
  3. Indicate how many different types of animals there are.
  4. Indicate how many animals there are of each type.
  5. Show a statistical summary of all variables.

Exercise 1

  1. Create a DataFrame from the dictionary and indexes.
Python
new_data = pd.DataFrame(data)
new_data
  animal  age  visits priority
0    cat  2.5       1      yes
1    cat  3.0       3      yes
2  snake  0.5       2       no
3    dog  NaN       3      yes
4    dog  5.0       2       no
5    cat  2.0       3       no
6  snake  4.5       1       no
7    cat  NaN       1      yes
8    dog  7.0       2       no
9    dog  3.0       1       no

Exercise 1

  2. Select the columns animal and age.
Python
new_data[["animal", 'age']]
  animal  age
0    cat  2.5
1    cat  3.0
2  snake  0.5
3    dog  NaN
4    dog  5.0
5    cat  2.0
6  snake  4.5
7    cat  NaN
8    dog  7.0
9    dog  3.0

Exercise 1

  3. Indicate how many different types of animals there are.
Python
new_data.animal.nunique()
3

Exercise 1

  4. Indicate how many animals there are of each type.
Python
new_data.animal.value_counts()
animal
cat      4
dog      4
snake    2
Name: count, dtype: int64

Exercise 1

  5. Show a statistical summary of all variables.
Python
new_data.describe(include='all')
       animal       age     visits priority
count      10  8.000000  10.000000       10
unique      3       NaN        NaN        2
top       cat       NaN        NaN       no
freq        4       NaN        NaN        6
mean      NaN  3.437500   1.900000      NaN
std       NaN  2.007797   0.875595      NaN
min       NaN  0.500000   1.000000      NaN
25%       NaN  2.375000   1.000000      NaN
50%       NaN  3.000000   2.000000      NaN
75%       NaN  4.625000   2.750000      NaN
max       NaN  7.000000   3.000000      NaN

Exercise 2

Look at the following dictionary

Python
import numpy as np
data = {'city': ['Rome', 'Paris', 'London', 'Berlin', 'Madrid', 'Rome', 'Paris', 'London', 'Berlin', 'Madrid'],
        'temperature': [22.5, 18.3, 15.0, np.nan, 25.2, 21.0, 19.5, np.nan, 14.8, 26.1],
        'humidity': [55, 65, 70, 60, 50, 58, 68, 72, 63, 48],
        'pollution_level': ['low', 'moderate', 'high', 'moderate', 'low', 'low', 'moderate', 'high', 'moderate', 'low']}

Exercise 2

  1. Create a DataFrame from the dictionary.
  2. Select the columns city and temperature.
  3. Indicate how many different cities are represented in the dataset.
  4. Indicate how many times each city appears in the dataset.
  5. Show a statistical summary of all numerical variables.

Missing Data

Many datasets have fields where some observations are missing.

These can cause problems when trying to represent data and perform certain operations

Thus, it is necessary to identify and do something about these missing values.

There are two strategies to deal with these missing values:

  • deletion
  • assignment of a value

Missing Data

Python
df = pd.DataFrame({'VarA': ['aa', None, 'cc',None],
                  'VarB': [20, 30, None,None],
                  'VarC': [1234, 3456, 6789,765],
                  'VarD': [1234, 888, None,None]
                 },
                 index=['Case1', 'Case2', 'Case3', 'Case4'])
df
       VarA  VarB  VarC    VarD
Case1    aa  20.0  1234  1234.0
Case2  None  30.0  3456   888.0
Case3    cc   NaN  6789     NaN
Case4  None   NaN   765     NaN

Checking Missing Data

We can check whether data is missing

Python
df.isnull()
        VarA   VarB   VarC   VarD
Case1  False  False  False  False
Case2   True  False  False  False
Case3  False   True  False   True
Case4   True   True  False   True

We can also check if there is any element in our data that is missing

Python
df.isnull().values.any()
True
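Chaining .sum() after isnull() counts the missing values per column, which is often more informative than a single True/False. A minimal sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'VarA': ['aa', None, 'cc', None],
                   'VarB': [20, 30, None, None],
                   'VarC': [1234, 3456, 6789, 765],
                   'VarD': [1234, 888, None, None]},
                  index=['Case1', 'Case2', 'Case3', 'Case4'])

# True counts as 1, so summing the boolean mask counts NAs per column
na_per_column = df.isnull().sum()
```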

Eliminating Missing Values

The simplest way to deal with NAs is to drop them:

Python
df.dropna()
      VarA  VarB  VarC    VarD
Case1   aa  20.0  1234  1234.0

We can see below that the NA values have not been dropped from df itself, because dropna returns a new dataframe by default:

Python
df
       VarA  VarB  VarC    VarD
Case1    aa  20.0  1234  1234.0
Case2  None  30.0  3456   888.0
Case3    cc   NaN  6789     NaN
Case4  None   NaN   765     NaN

Eliminating Missing Values

To make the deletion permanent, we can use the inplace parameter:

Python
df.dropna(inplace=True)
df
      VarA  VarB  VarC    VarD
Case1   aa  20.0  1234  1234.0
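An alternative to inplace=True is simply reassigning the result, which some style guides prefer. A sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'VarA': ['aa', None, 'cc', None],
                   'VarB': [20, 30, None, None],
                   'VarC': [1234, 3456, 6789, 765],
                   'VarD': [1234, 888, None, None]},
                  index=['Case1', 'Case2', 'Case3', 'Case4'])

# Reassigning the result has the same effect as inplace=True
df = df.dropna()
```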

Eliminating Missing Values

We can indicate the columns from which we want to drop the NAs.

Python
df.dropna(subset=['VarB'])
      VarA  VarB  VarC    VarD
Case1   aa  20.0  1234  1234.0

Assigning Values

The other way to deal with missing values is to assign values to NAs

We can do this with the fillna method.

As with dropna, we need the parameter inplace=True (or to reassign the result) to keep the changes.

Python
df = pd.DataFrame({'VarA': ['aa', None, 'cc',None],
                  'VarB': [20, 30, None,None],
                  'VarC': [1234, 3456, 6789,765],
                  'VarD': [1234, 888, None,None]
                 },
                 index=['Case1', 'Case2', 'Case3', 'Case4'])
df
       VarA  VarB  VarC    VarD
Case1    aa  20.0  1234  1234.0
Case2  None  30.0  3456   888.0
Case3    cc   NaN  6789     NaN
Case4  None   NaN   765     NaN

Assigning Values

Using fillna, we can replace every missing value with a given value:

Python
df.fillna('new', inplace=True)
df
      VarA  VarB  VarC    VarD
Case1   aa  20.0  1234  1234.0
Case2  new  30.0  3456   888.0
Case3   cc   new  6789     new
Case4  new   new   765     new
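fillna also accepts a dictionary mapping column names to fill values, so each column can get an appropriate replacement (for example, the column mean for a numeric variable). A sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'VarA': ['aa', None, 'cc', None],
                   'VarB': [20, 30, None, None],
                   'VarC': [1234, 3456, 6789, 765],
                   'VarD': [1234, 888, None, None]},
                  index=['Case1', 'Case2', 'Case3', 'Case4'])

# Fill VarB with its mean (25.0) and VarA with a placeholder string;
# columns not named in the dict (VarD) keep their NAs
df_filled = df.fillna({'VarB': df['VarB'].mean(), 'VarA': 'unknown'})
```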

Exercises 1

Here is a dataframe:

Python
import numpy as np
raw_data = {'first_name': ['Jason', 'Mary', 'Tina', 'Jake', 'Amy','Anne'], 
        'last_name': ['Miller', 'Smith', 'Ali', 'Milner', 'Cooze','Lynn'], 
        'age': [42, np.nan, 36, 24, 73,'23'], 
        'sex': ['m', np.nan, 'f', 'm', 'f','f'], 
        'preTestScore': [4, np.nan, np.nan, 2, 3, np.nan],
        'postTestScore': [25, np.nan, np.nan, 62, 70, np.nan]}
df = pd.DataFrame(raw_data)
df
  first_name last_name  age  sex  preTestScore  postTestScore
0      Jason    Miller   42    m           4.0           25.0
1       Mary     Smith  NaN  NaN           NaN            NaN
2       Tina       Ali   36    f           NaN            NaN
3       Jake    Milner   24    m           2.0           62.0
4        Amy     Cooze   73    f           3.0           70.0
5       Anne      Lynn   23    f           NaN            NaN

Exercises 1

  1. Determine which column(s) have the largest number of NAs
  2. Eliminate the entries with the largest number of NAs
  3. Add a new variable preTestScore_mean to the dataframe, containing the mean value of preTestScore

Exercises 1 Solutions

Python
# 1. Determine which column(s) has the largest numbers of NAs
na_counts = df.isna().sum()
print("Number of NAs in each column:")
Number of NAs in each column:
print(na_counts)
first_name       0
last_name        0
age              1
sex              1
preTestScore     3
postTestScore    3
dtype: int64

Exercises 1 Solutions

Python
# 2. Eliminate the entries with the largest number of NAs
max_na_column = na_counts.idxmax()
df_cleaned = df.dropna(subset=[max_na_column])
print("\nDataFrame after eliminating entries with the largest number of NAs:")

DataFrame after eliminating entries with the largest number of NAs:
print(df_cleaned)
  first_name last_name age sex  preTestScore  postTestScore
0      Jason    Miller  42   m           4.0           25.0
3       Jake    Milner  24   m           2.0           62.0
4        Amy     Cooze  73   f           3.0           70.0

Exercises 1 Solutions

Python
# 3. Add a new variable 'preTestScore_mean' with the mean value
mean_preTestScore = df_cleaned['preTestScore'].mean()
df_cleaned['preTestScore_mean'] = mean_preTestScore
print("\nDataFrame after adding 'preTestScore_mean':")

DataFrame after adding 'preTestScore_mean':
print(df_cleaned)
  first_name last_name age sex  preTestScore  postTestScore  preTestScore_mean
0      Jason    Miller  42   m           4.0           25.0                3.0
3       Jake    Milner  24   m           2.0           62.0                3.0
4        Amy     Cooze  73   f           3.0           70.0                3.0

Exercises 2

Here is a dataframe:

Python
import pandas as pd
import numpy as np

raw_data = {'first_name': ['Liam', 'Olivia', 'Noah', 'Emma', 'Ava', 'Sophia'], 
            'last_name': ['Brown', 'Davis', 'Wilson', 'Moore', 'Taylor', 'Anderson'], 
            'age': [34, np.nan, 28, 22, 67, '29'], 
            'sex': ['m', 'f', np.nan, 'f', 'f', 'm'], 
            'exam1_score': [85, np.nan, 78, np.nan, 92, 88],
            'exam2_score': [91, 87, np.nan, np.nan, 95, np.nan]}

df = pd.DataFrame(raw_data, columns=['first_name', 'last_name', 'age', 'sex', 'exam1_score', 'exam2_score'])
df
  first_name last_name  age  sex  exam1_score  exam2_score
0       Liam     Brown   34    m         85.0         91.0
1     Olivia     Davis  NaN    f          NaN         87.0
2       Noah    Wilson   28  NaN         78.0          NaN
3       Emma     Moore   22    f          NaN          NaN
4        Ava    Taylor   67    f         92.0         95.0
5     Sophia  Anderson   29    m         88.0          NaN

Exercises 2

  1. Determine which column(s) have the largest number of NAs.
  2. Eliminate the entries with the largest number of NAs.
  3. Add a new variable exam1_score_mean to the dataframe, containing the mean value of exam1_score

Conclusion

Core Structures: Pandas relies on Series (1D arrays) and DataFrames (tabular data) for efficient data handling.

Data Operations: Supports filtering, indexing, appending, and modifying data seamlessly.

Missing Data: Manage missing values with dropna or fillna methods effectively.

Practical Use: Exercises highlight creating, accessing, and analyzing data with ease.