Handling Missing data
Contents
Module information¶
Key questions:
“What is missing data?”
“How can I analyse my data with missing values using Pandas?”
Learning objectives:
Describe what missing values mean.
How to access missing values in dataset.
What NaN values are, how they might be represented, and what this means for your work
How to replace NaN values, if desired
How to use
to_csv
to write manipulated data to a file.
Handling Missing Data¶
Most of the times real-world data is rarely clean and homogeneous. In many cases, dataset of interest will have some amount of data missing. To make matters even more complicated, different data sources may indicate missing data in different ways.
In this module, we will discuss some general considerations for missing data, discuss how Pandas chooses to represent it, and demonstrate some built-in Pandas tools for handling missing data in Python. We refer the missing data as null, NaN, or NA values in general.
Before we start, lets make sure the Pandas and matplotlib packages are installed.
!pip install pandas matplotlib
Requirement already satisfied: pandas in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (1.4.0)
Requirement already satisfied: matplotlib in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (3.5.1)
Requirement already satisfied: pytz>=2020.1 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from pandas) (2021.3)
Requirement already satisfied: numpy>=1.18.5 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from pandas) (1.22.2)
Requirement already satisfied: python-dateutil>=2.8.1 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from pandas) (2.8.2)
Requirement already satisfied: packaging>=20.0 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from matplotlib) (21.3)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from matplotlib) (1.3.2)
Requirement already satisfied: cycler>=0.10 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: pillow>=6.2.0 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from matplotlib) (9.0.1)
Requirement already satisfied: pyparsing>=2.2.1 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from matplotlib) (3.0.7)
Requirement already satisfied: fonttools>=4.22.0 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from matplotlib) (4.29.1)
Requirement already satisfied: six>=1.5 in /home/danph/Repos/win_ssd/myprojects/python-workshop-base/.venv/lib/python3.8/site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
# Now import pandas into your notebook as pd
import pandas as pd
Now again import surveys.csv
dataset into our notebook as we did in previous lesson.
surveys_df = pd.read_csv("surveys.csv")
Using masks to identify a specific condition¶
A mask can be useful to locate where a particular subset of values exist or
don’t exist - for example, NaN, or “Not a Number” values. To understand masks,
we also need to understand BOOLEAN
objects in Python.
Boolean values include True
or False
. For example,
# set value of x to be 5
x = 5
x > 5
False
x == 5
True
Finding Missing Values¶
Let’s identify all locations in the survey data that have
null (missing or NaN) data values. We can use the isnull
method to do this.
The isnull
method will compare each cell with a null value. If an element
has a null value, it will be assigned a value of True
in the output object.
pd.isnull(surveys_df).head()
record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | True |
1 | False | False | False | False | False | False | False | False | True |
2 | False | False | False | False | False | False | False | False | True |
3 | False | False | False | False | False | False | False | False | True |
4 | False | False | False | False | False | False | False | False | True |
How to select rows with missing data¶
To select the rows where there are null values, we can use the mask as an index to subset our data as follows:
# To select only the rows with NaN values, we can use the 'any()' method
surveys_df[pd.isnull(surveys_df).any(axis=1)]
record_id | month | day | year | site_id | species_id | sex | hindfoot_length | weight | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 7 | 16 | 1977 | 2 | NL | M | 32.0 | NaN |
1 | 2 | 7 | 16 | 1977 | 3 | NL | M | 33.0 | NaN |
2 | 3 | 7 | 16 | 1977 | 2 | DM | F | 37.0 | NaN |
3 | 4 | 7 | 16 | 1977 | 7 | DM | M | 36.0 | NaN |
4 | 5 | 7 | 16 | 1977 | 3 | DM | M | 35.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
35530 | 35531 | 12 | 31 | 2002 | 13 | PB | F | 27.0 | NaN |
35543 | 35544 | 12 | 31 | 2002 | 15 | US | NaN | NaN | NaN |
35544 | 35545 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35545 | 35546 | 12 | 31 | 2002 | 15 | AH | NaN | NaN | NaN |
35548 | 35549 | 12 | 31 | 2002 | 5 | NaN | NaN | NaN | NaN |
4873 rows × 9 columns
Explaination¶
Notice that we have 4873
observations/rows that contain one or more missing values.
Thats roughly 14%
of data contains missing values.
We have used []
convension to select subset of data.
More information about slicing and indexing can be found out here.
(axis=1) is a numpy convention to specify columns.
Note that the weight
column of our DataFrame contains many null
or NaN
values. Next, we will explore ways of dealing with this.
If we look at the weight
column in the surveys
data we notice that there are NaN (Not a Number) values. NaN values are undefined
values that cannot be represented mathematically. Pandas, for example, will read
an empty cell in a CSV or Excel sheet as a NaN. NaNs have some desirable properties: if we
were to average the weight
column without replacing our NaNs, Python would know to skip
over those cells.
Dealing with missing values.¶
Dealing with missing data values is always a challenge. It’s sometimes hard to know why values are missing - was it because of a data entry error? Or data that someone was unable to collect? Should the value be 0? We need to know how missing values are represented in the dataset in order to make good decisions. If we’re lucky, we have some metadata that will tell us more about how null values were handled.
For instance, in some disciplines, like Remote Sensing, missing data values are often defined as -9999. Having a bunch of -9999 values in your data could really alter numeric calculations. Often in spreadsheets, cells are left empty where no data are available. Pandas will, by default, replace those missing values with NaN. However it is good practice to get in the habit of intentionally marking cells that have no data, with a no data value! That way there are no questions in the future when you (or someone else) explores your data.
Where Are the NaN’s?¶
Let’s explore the NaN values in our data a bit further. Using the tools we learned in lesson 02, we can figure out how many rows contain NaN values for weight. We can also create a new subset from our data that only contains rows with weight values > 0 (i.e., select meaningful weight values):
## How many missing values are there in weight column?
len(surveys_df[pd.isnull(surveys_df.weight)])
3266
# How many rows have weight values?
len(surveys_df[surveys_df.weight> 0])
32283
We can replace all NaN values with zeroes using the .fillna()
method (after
making a copy of the data so we don’t lose our work):
# Creat a new DataFrame using copy
df1 = surveys_df.copy()
# Fill all NaN values with 0
df1['weight'] = df1['weight'].fillna(0)
However NaN and 0 yield different analysis results. The mean value when NaN values are replaced with 0 is different from when NaN values are simply thrown out or ignored.
surveys_df['weight'].mean()
42.672428212991356
df1['weight'].mean()
38.751976145601844
Extra Information¶
We can fill NaN values with any value that we chose. The code below fills all NaN values with a mean for all weight values.
df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean())
df1['weight'].mean()
42.672428212991356
Writing Out Data to CSV¶
We’ve learned about using manipulating data to get desired outputs. But we’ve also discussed keeping data that has been manipulated separate from our raw data. Something we might be interested in doing is working with only the columns that have full data. First, let’s reload the data so we’re not mixing up all of our previous manipulations.
df_na = surveys_df.dropna()
If you now type df_na
, you should observe that the resulting DataFrame has 30676
rows
and 9 columns, much smaller than the 35549
row original.
We can now use the to_csv
command to do export a DataFrame in CSV format. Note that the code
below will by default save the data into the current working directory. We can
save it to a different folder by adding the foldername and a slash before the filename:
df1.to_csv('foldername/out.csv')
.
We use ‘index=False’ so that
pandas doesn’t include the index number for each line.
# Write DataFrame to CSV
df_na.to_csv('output/surveys_complete.csv', index=False)
Recap¶
What we’ve learned:
What NaN values are, how they might be represented, and what this means for your work
How to replace NaN values, if desired
How to use
to_csv
to write manipulated data to a file.
Extra¶
We can run isnull
on a particular column too. What does the code below do?
# What does this do?
empty_weights = surveys_df[pd.isnull(surveys_df['weight'])]['weight']
print(empty_weights)
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
35530 NaN
35543 NaN
35544 NaN
35545 NaN
35548 NaN
Name: weight, Length: 3266, dtype: float64
Let’s take a minute to look at the statement above. We are using the Boolean
object pd.isnull(surveys_df['weight'])
as an index to surveys_df
. We are
asking Python to select rows that have a NaN
value of weight.