12. Summary statistics and histograms

In this exercise we will look at how to describe a variable of interest using some statistical measures. We will also learn how to plot and interpret a histogram.

Python offers many options for plotting. For the most part, we’ll use seaborn in this class. It is a high-level plotting library that makes it easy to plot your data in many different ways!

#Import python packages
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np

We’ll explore different ways to summarize and visualize data using the flights data we used last class.

Load that in and let’s get started!

  
#load in the nyc data 
df_flights = pd.read_csv('?')

#take a look
?

What kinds of variables are we dealing with?

?

For numeric variables we can describe their distribution.

#get the mean
delay_mean = df_flights['arr_delay'].mean()

#get the mode (mode() returns a Series, since several values can tie)
delay_mode = df_flights['arr_delay'].mode()

#get the median
delay_median = df_flights['arr_delay'].median()

#get the standard deviation
delay_std = df_flights['arr_delay'].std()

print("the mean arrival delay is: ", delay_mean)
print("the most common arrival delay is: ", delay_mode[0])
print("the median (middle) arrival delay is: ", delay_median)
print("the standard deviation of arrival delays is: ", delay_std)
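
To see why it helps to report more than one measure, here is a minimal sketch on a small made-up set of delays (not the flights data). The values and the name `toy_delays` are assumptions chosen for illustration only. A single very late flight pulls the mean far away from the median and the mode.

```python
import pandas as pd

# Hypothetical delays in minutes, with one very late flight (an outlier)
toy_delays = pd.Series([-5, -5, 0, 10, 300])

toy_mean = toy_delays.mean()      # pulled upward by the outlier
toy_median = toy_delays.median()  # middle value, barely affected by the outlier
toy_mode = toy_delays.mode()      # a Series, because several values can tie

print("mean:", toy_mean)
print("median:", toy_median)
print("mode:", toy_mode[0])
```

When the mean sits well above the median like this, it is a hint that the distribution has a long right tail.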

Let’s make this a little more visual and plot a histogram.

Here we will use the histogram to visualize what delay times are most common and how spread out they are.

sns.displot(df_flights, x='arr_delay')

Try that again but this time try playing with the bin size. How does the width of the bins we break the data into impact the histogram? Try going very big and very small. And feel free to share your results on Slack!

sns.displot(df_flights, x='arr_delay', binwidth=?) #E.g., 100
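
As a rough sketch of what the bin width controls, the snippet below counts how many bins are needed to cover a made-up range of delays. The data, the helper `count_bins`, and the way the edges are built are assumptions for illustration; seaborn chooses its own bin edges, so this only mimics the idea.

```python
import numpy as np

# Hypothetical delays spread evenly from -20 to 179 minutes
toy_delays = np.arange(-20, 180)

def count_bins(values, binwidth):
    """Count the bins of a given width needed to cover the data."""
    # Build edges from the minimum up to (at least) the maximum
    edges = np.arange(values.min(), values.max() + binwidth, binwidth)
    counts, _ = np.histogram(values, bins=edges)
    return len(counts)

narrow = count_bins(toy_delays, 5)    # many thin bars, more detail and more noise
wide = count_bins(toy_delays, 100)    # a few fat bars, detail is smoothed away

print("bins with width 5:", narrow)
print("bins with width 100:", wide)
```

Very wide bins can hide the shape of the distribution, while very narrow bins can make it look jagged.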

Let’s now add the measures of central tendency and see where they fall on the histogram. Here we use the plt.axvline() function to add vertical lines.

sns.displot(df_flights,x='arr_delay', binwidth=10)
plt.axvline(delay_mean, color="red")
plt.axvline(delay_mode[0], color="purple")
plt.axvline(delay_median, color="pink")

It’s a little hard to see the difference, so let’s focus on a particular range of the x-axis using the plt.xlim() function.

sns.displot(df_flights,x='arr_delay', binwidth=10)
plt.xlim(-100,100)
plt.axvline(delay_mean, color="red")
plt.axvline(delay_mode[0], color="purple")
plt.axvline(delay_median, color="pink")

Try this again but this time look at departure time as the variable of interest.

#get the mean
dep_mean = df_flights['dep_time'].?()

#get the mode
dep_mode = df_flights['?'].?()

#get the median
dep_median = df_flights?

#get the standard deviation
dep_std = ?

Create a histogram

sns.displot(df_flights['dep_time'], binwidth=?)
plt.axvline(?, color="?")
plt.axvline(dep_mode[0], color="?")
plt.axvline(?, color="?")

12.1. Measures of dispersion

#get standard deviation for arrival delay
delay_sd = df_flights['?'].std()

#get standard deviation for departure times
dep_sd = df_flights?

print("standard deviation of arrival delays = ", delay_sd)
print("standard deviation of departure times = ", dep_sd)
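
One detail worth knowing when comparing results across libraries: pandas `.std()` computes the sample standard deviation (dividing by n - 1) by default, while NumPy's `np.std()` divides by n. The sketch below uses a small made-up Series (the name `toy` and its values are assumptions) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen so the arithmetic is easy to check by hand
toy = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_sd = toy.std()               # pandas default, ddof=1: divide by n - 1
population_sd = toy.std(ddof=0)     # divide by n instead
numpy_sd = np.std(toy.to_numpy())   # NumPy default, ddof=0: matches population_sd

print("sample sd (pandas default):", sample_sd)
print("population sd (ddof=0):", population_sd)
print("numpy sd (default):", numpy_sd)
```

For a dataset as large as the flights data the two versions are nearly identical, but on small samples the gap is noticeable.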

Similarly, we can find the min and max of each variable.

#get the min and max of delay times
delay_min = df_flights['arr_delay'].min()
delay_max = df_flights['arr_delay'].max()

#get the min and max of departure times
time_min = ?
time_max = ?

#print out the values
print("min arrival delays = ", delay_min)
print("max arrival delays = ", delay_max)
print("min departure time = ", time_min)
print("max departure time = ", time_max)

12.2. (BONUS QUESTION)

Find the variable(s) with the highest and lowest standard deviations and plot them in a histogram.

#estimate and sort standard deviation of all the columns
?
?
?

What is going on with the variables identified as having the lowest standard deviation? What does this tell you about the data?

Next, can you create a histogram of air speeds?

?

What are the units of the x-axis in this histogram?

Finally, can you identify the speediest airline carrier?

?

12.3. Further reading

If you would like the notebook without the missing code, check out the full code version.