Open In Colab


8. Missing Data

In this exercise we will learn how to handle missing data.

Outline:

  • How to find missing data

  • How to remove missing data (if appropriate!)

8.1. Load Data

First load in the pandas library and bring in the NYC flight data

#Import pandas library
??

#read in the NYC flight data
df_flights = ??

Give a thumbs up in Slack once your done!

8.2. Identifying Missing Data

Most data that you’ll find will have missing values, or values that cannot be true (i.e., errors). Here we will look at how to find missing values, and how to handle them.

Below we use the method isna() to identify cells in the dataframe where the values are NA. We use the sum function ontop of the isna method to count the number of NA values.

Note: when you combine methods in this way the outputs of the first (going left to right) act as inputs for the second. This is called method chaining, and we will use it with pandas objects!

#count Null values for each column of the dataframe
df_flights.isna().sum()

So we can see that there are not that many missing values in this dataframe. Let’s take a look at how many rows of data we have:

len(df_flights)

And at what proportion of data is missing:

df_flights.isnull().sum() / len(df_flights)

We can take a look at these missing values within the dataframe by opening the dataframe in colab (Files - dubble click on the file - then filter by NA)

So we can see in this dataset there is very little in the way of missing data! But what should we do with those data entries? We could:

  1. Understand how/why they are missing (data story!)

  2. Remove those data entries (missing at random?)

  3. Fill them in with estimates (impute the missing data)

8.3. Drop Missing Data

Below we will remove the rows with missing data in one column. So if there is no data in this column then it will remove that entire row from the data.

#drop rows if air time contains missing values
df_flights_airtime_na = df_flights[df_flights.air_time.isna()==False]

#take a look at the new length of the dataframe
len(df_flights_airtime_na)

It is also possible to remove any row that has missing values.

#drop rows if any column contain missing values
df_flights_sub_na = df_flights.dropna(how='any')

len(df_flights_sub_na)

Try removing all the rows with NAs in arr_time.

#drop rows if arrival time contains missing values
df_flights_arr_time_na = ?

#check the length

8.4. Add column of missingness

Here the idea is that missing data might be useful for making predictions. Let’s add another column to the DataFrame to identify missing data in air time.

Here we use method chaining to select a column ‘air_time’ and check to see if each row has an NA value. This gives us true and false values for each row.

#create new column with true/false if there is missing data in air time
df_flights['missingAirTime'] = df_flights.air_time.isna()

#take a look 
df_flights

Let’s add one more step to that method chain, and convert the true false into integers. To do so we’ll add the method asType(‘int’), which converts the true/false into 1/0.

#create new column with true/false if there is missing data in air time
df_flights['missingAirTime'] = df_flights.air_time.isna().astype('int')

#take a look 
df_flights

Try adding a column of missingness for plane tail number (i.e., tailnum).

#create new column with 1/0 if there is missing data in tailnum
df_flights[?] = ?

#take a look
?

8.5. Further reading

There are many ways to deal with missing data with pandas. Just remember that some ways are more justifyable than others, and a good understanding of how the data came to be is key in deciding which ways might work best.

If you would like the notebook without missing code check out the full code version.