{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "view-in-github"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4sudTlocuCVM"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Mzk8IlLNCNQk"
},
"source": [
"## Summary statistics and histograms"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jkqMc39RCSyO"
},
"source": [
"In this exercise we will look at how to describe a variable of interest using some statistical measures. We will also learn how to plot and interpret a histogram. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9Zxm9znvzsuV"
},
"source": [
"For plotting there are many options in Python. For the most part, we'll use **seaborn** in this class. It is a high-level plotting library that makes it easy to plot your data in many different ways!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hSya5UmWztHd"
},
"outputs": [],
"source": [
"#Import python packages\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6QqhB-2QzgBy"
},
"source": [
"We'll explore different ways to summarize and visualize data using the flights data we used last class.\n",
"\n",
"Load that in and let's get started!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "953AfwMvn1R1"
},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Lef4nEGz8pGr"
},
"outputs": [],
"source": [
" \n",
"#load in the nyc data \n",
"df_flights = pd.read_csv('?')\n",
"\n",
"#take a look\n",
"?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "p3QvQAoh536E"
},
"source": [
"What kinds of variables are we dealing with?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "EGOd3A7E59al"
},
"outputs": [],
"source": [
"?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aiBpHDzR6P80"
},
"source": [
"\n",
"For numeric variables we can describe their distribution.\n",
"\n",
"\n",
"\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "2IyIUk2L7fZb"
},
"outputs": [],
"source": [
"#get the mean\n",
"delay_mean = df_flights['arr_delay'].mean()\n",
"\n",
"#get the mode (returns a Series, since several values can tie)\n",
"delay_mode = df_flights['arr_delay'].mode()\n",
"\n",
"#get the median\n",
"delay_median = df_flights['arr_delay'].median()\n",
"\n",
"#get the standard deviation\n",
"delay_std = df_flights['arr_delay'].std()\n",
"\n",
"print(\"the mean arrival delay is: \", delay_mean)\n",
"print(\"the most common arrival delay is: \", delay_mode[0])\n",
"print(\"the median (middle) arrival delay is: \", delay_median)\n",
"print(\"the standard deviation of arrival delays is: \", delay_std)"
]
},
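{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make these measures concrete, here is a minimal sketch on a small made-up series (the values are hypothetical and not taken from the flights data). Note that `mode()` returns a Series, since several values can tie for most common, so we take its first element.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# hypothetical toy delays in minutes, for illustration only\n",
"toy_delays = pd.Series([1, 2, 2, 3, 12])\n",
"\n",
"toy_mean = toy_delays.mean()      # sum of values / number of values\n",
"toy_median = toy_delays.median()  # middle value once sorted\n",
"toy_mode = toy_delays.mode()[0]   # first of the most common value(s)\n",
"toy_std = toy_delays.std()        # sample standard deviation (ddof=1)\n",
"\n",
"print(toy_mean, toy_median, toy_mode, round(toy_std, 2))\n",
"```\n",
"\n",
"The single large value (12) pulls the mean above the median, which is the same pattern to look for in skewed delay data."
]
},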
{
"cell_type": "markdown",
"metadata": {
"id": "4pBgYCJw28GJ"
},
"source": [
"Let's make this a little more visual and plot a histogram. \n",
"> Here we will use the histogram to visualize what delay times are most common and how spread out they are."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "KfRbxhWr2S6D"
},
"outputs": [],
"source": [
"sns.displot(df_flights, x='arr_delay')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-8XTETjVqokE"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0bEjAb7NpNmN"
},
"source": [
" \n",
"Try that again but this time try playing with the bin size. How does the width of the bins we break the data into impact the histogram? Try going very big and very small. And feel free to share your results on Slack!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "n6LRfTiqpM7L"
},
"outputs": [],
"source": [
"sns.displot(df_flights, x='arr_delay', binwidth=?) #E.g., 100"
]
},
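{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to see why bin width matters is to count how many values land in each bin. The sketch below uses NumPy directly on made-up data (not the flights data); `bin_counts` is a hypothetical helper, and seaborn's own binning may place the bin edges differently.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# hypothetical toy data: the integers 0 to 99\n",
"toy_values = np.arange(100)\n",
"\n",
"def bin_counts(values, width):\n",
"    # bin edges every `width` units, starting at 0 (assumes width divides 100)\n",
"    edges = np.arange(0, 100 + width, width)\n",
"    counts, _ = np.histogram(values, bins=edges)\n",
"    return counts\n",
"\n",
"print(bin_counts(toy_values, 10))  # ten narrow bins\n",
"print(bin_counts(toy_values, 50))  # two wide bins\n",
"```\n",
"\n",
"Narrow bins show more detail but can look noisy, while wide bins smooth the shape and can hide features."
]
},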
{
"cell_type": "markdown",
"metadata": {
"id": "a9Mj_PLd5n0k"
},
"source": [
"Let's now add in the measures of central tendency and see how they match up on the histogram. Here we use the **plt.axvline()** function to add some vertical lines."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "hcALW35eAIyW"
},
"outputs": [],
"source": [
"sns.displot(df_flights,x='arr_delay', binwidth=10)\n",
"plt.axvline(delay_mean, color=\"red\")\n",
"plt.axvline(delay_mode[0], color=\"purple\")\n",
"plt.axvline(delay_median, color=\"pink\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yjbg6wLzqwMp"
},
"source": [
"It's a little hard to see the difference, so let's focus in on a particular range of the x-axis using the **plt.xlim()** function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "BgkXXQIPI9G3"
},
"outputs": [],
"source": [
"sns.displot(df_flights,x='arr_delay', binwidth=10)\n",
"plt.xlim(-100,100)\n",
"plt.axvline(delay_mean, color=\"red\")\n",
"plt.axvline(delay_mode[0], color=\"purple\")\n",
"plt.axvline(delay_median, color=\"pink\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QPxAz7HYnwuk"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7voAhzTSJLZv"
},
"source": [
" \n",
"Try this again but this time look at departure time as the variable of interest. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "CXBhOWB6JXcG"
},
"outputs": [],
"source": [
"#get the mean\n",
"dep_mean = df_flights['dep_time'].?()\n",
"\n",
"#get the mode\n",
"dep_mode = df_flights['?'].?()\n",
"\n",
"#get the median\n",
"dep_median = df_flights?\n",
"\n",
"#get the standard deviation\n",
"dep_std = ?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "M-zyc_NGthe3"
},
"source": [
"Create a histogram"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Sf142N09Jffu"
},
"outputs": [],
"source": [
"sns.displot(df_flights['dep_time'], binwidth=?)\n",
"plt.axvline(?, color=\"?\")\n",
"plt.axvline(dep_mode[0], color=\"?\")\n",
"plt.axvline(?, color=\"?\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "oowe8KJlxFwV"
},
"source": [
"### Measures of dispersion"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "tNZRbhT9KR4l"
},
"outputs": [],
"source": [
"#get standard deviation for arrival delay\n",
"delay_sd = df_flights['?'].std()\n",
"\n",
"#get standard deviation for departure times\n",
"dep_sd = df_flights?\n",
"\n",
"print(\"standard deviation of arrival delays = \", delay_sd)\n",
"print(\"standard deviation of departure times = \", dep_sd)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OZCbn6Rqsqbo"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WqxexEaBR3JF"
},
"source": [
" \n",
"Similarly, we can find the min and max of the variables.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "04DX7yuvR9gT"
},
"outputs": [],
"source": [
"#get the min and max of delay times\n",
"delay_min = df_flights['arr_delay'].min()\n",
"delay_max = df_flights['arr_delay'].max()\n",
"\n",
"#get the min and max of departure times\n",
"time_min = ?\n",
"time_max = ?\n",
"\n",
"#print out the values\n",
"print(\"min arrival delays = \", delay_min)\n",
"print(\"max arrival delays = \", delay_max)\n",
"print(\"min departure time = \", time_min)\n",
"print(\"max departure time = \", time_max)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rRrZqIfRJeFQ"
},
"source": [
"### (BONUS QUESTION)\n",
"\n",
"Find the variable(s) with the highest and lowest standard deviations and plot them in a histogram.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "yEpCp6PPJ4yr"
},
"outputs": [],
"source": [
"#estimate and sort standard deviation of all the columns\n",
"?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "W6x6LHIbKPyf"
},
"outputs": [],
"source": [
"?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UwifoK1HKoBT"
},
"outputs": [],
"source": [
"?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OFnkk5NnKrXy"
},
"source": [
"What is going on with the variables identified as having the lowest standard deviation? What does this tell you about the data?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "np6GPRVILDAj"
},
"source": [
"Next, can you create a histogram of air speeds?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XKLXHqiELbT2"
},
"outputs": [],
"source": [
"?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4ORY2hAQLmfN"
},
"source": [
"What are the units of the x-axis in this histogram? \n",
" \n",
" \n",
"Finally, can you identify the speediest airline carrier?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Bre4kagfLw-N"
},
"outputs": [],
"source": [
"?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BJqFrYyyYzMx"
},
"source": [
"### Further reading"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> If you would like the notebook without missing code, check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_introViz1_histograms.ipynb) version."
]
}
],
"metadata": {
"colab": {
"authorship_tag": "ABX9TyNumEMVt9qbngm8+/l5al7M",
"collapsed_sections": [],
"include_colab_link": true,
"mount_file_id": "1wmmR4t_k1tLgeLSdfIlxVTLfExf_HSqA",
"name": "introViz1_histograms.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}