{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "gsV5DGfCQ_Ce" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": { "id": "K6wrqFznuqCk" }, "source": [ "## Density plots and normal distributions" ] }, { "cell_type": "markdown", "metadata": { "id": "oH7GFT6DuyTp" }, "source": [ "In this exercise we will start to use density plots." ] }, { "cell_type": "markdown", "metadata": { "id": "kRaoG6jTzDO5" }, "source": [ "### Normal distribution" ] }, { "cell_type": "markdown", "metadata": { "id": "vLtpucVUvuRO" }, "source": [ "Let's start by simulating some data from a normal distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "a2deIuFav6HO" }, "outputs": [], "source": [ "# import the libraries we will use\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n" ] }, { "cell_type": "markdown", "metadata": { "id": "vyXJESl7Uz5x" }, "source": [ "Let's choose a mean and a standard deviation (std), then simulate some data from a normal distribution. Here we use the function `np.random.normal(mean, std, size)` to draw the sample." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "x_itjh6pTQaq" }, "outputs": [], "source": [ "#choose a mean and a standard deviation\n", "my_mean = 2\n", "my_std = 10\n", "\n", "#sample data from a normal distribution\n", "my_made_up_data = np.random.normal(my_mean, my_std, size=1000)\n", "\n", "#take a look at the data you made\n", "my_made_up_data[0:10]" ] }, { "cell_type": "markdown", "metadata": { "id": "X0UjYjwaz8_L" }, "source": [ "Let's take a look at the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bC7X8v4nz8_M" }, "outputs": [], "source": [ "#plot a histogram\n", "sns.displot(my_made_up_data)" ] }, { "cell_type": "markdown", "metadata": { "id": "JkzHLISKz8_M" }, "source": [ "Given that you know the 'real' mean and std, how well can you estimate them? Use the code below to get the mean and sd from your sample." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rJhPo0R3z8_N" }, "outputs": [], "source": [ "#get the mean\n", "estimated_mean = my_made_up_data.mean()\n", "\n", "#get the sd\n", "estimated_std = my_made_up_data.std()\n", "\n", "#print the estimates\n", "print(\"The estimated mean is\", estimated_mean)\n", "print(\"The estimated sd is\", estimated_std)" ] }, { "cell_type": "markdown", "metadata": { "id": "v00fHEPUSJUL" }, "source": [ " " ] }, { "cell_type": "markdown", "metadata": { "id": "6QlhylQOz8_O" }, "source": [ " \n", "How well can you estimate the true mean from the sample? Try playing around with the size of the sample, the mean, and the std. Can you generate samples where the estimated mean is way off? \n", "\n", "Share plots to Slack, let's see who can get the biggest difference!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "l5eQMbl_TkK7" }, "outputs": [], "source": [ "print(\"The true mean is {0} compared to the estimated mean {1}\".format(my_mean, estimated_mean.round(3)))\n", "print(\"The true sd is {0} compared to the estimated sd {1}\".format(my_std, estimated_std.round(3)))\n" ] }, { "cell_type": "markdown", "metadata": { "id": "oKwu53z8Tdsw" }, "source": [ "Use a histogram to see how close your estimated mean is to the true mean." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OH8AQWajTchf" }, "outputs": [], "source": [ "#plot the histogram again and mark the estimated mean (black) and the true mean (red)\n", "sns.displot(my_made_up_data)\n", "plt.axvline(estimated_mean,color=\"black\")\n", "plt.axvline(my_mean,color=\"red\")\n", "\n", "#save the plot\n", "plt.savefig('hist_compare.png')" ] }, { "cell_type": "markdown", "metadata": { "id": "_AxFmCsCy_tJ" }, "source": [ "### Density plots" ] }, { "cell_type": "markdown", "metadata": { "id": "crRu8jZsVJV-" }, "source": [ "Let's take a look at density plots next. These plots are like a histogram, but instead of counts on the y-axis we now have densities: a smoothed curve scaled so that the area under it is 1. Values that occur often have high densities, whereas rare values have low densities. Let's take a look! 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BEAPLe3Oul4x" }, "outputs": [], "source": [ "#bring in the nyc flight data\n", "df_flights = pd.read_csv('/content/nyc_flight_data.csv')\n", "\n", "#plot a histogram of arrival delay times, using 10-minute bins\n", "sns.displot(df_flights,x='arr_delay', binwidth=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "N9TjqmiLVlZR" }, "outputs": [], "source": [ "#plot one with both histogram and density \n", "sns.displot(df_flights, x='arr_delay', kde=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Yv0kPuGJUQ08" }, "outputs": [], "source": [ "#plot just density\n", "sns.displot(df_flights, x='arr_delay', kind='kde')" ] }, { "cell_type": "markdown", "metadata": { "id": "TjhoXN5DWD8e" }, "source": [ "How does this distribution compare to the normal distribution's 'bell-shaped' curve? Do you see any similarities or differences?" ] }, { "cell_type": "markdown", "metadata": { "id": "0ViU5oozWS2p" }, "source": [ "### Distributions within categories" ] }, { "cell_type": "markdown", "metadata": { "id": "6sNLNpvVX_0I" }, "source": [ "\n", "We can also look at these distributions for different groups. In this case we can look at the distribution for each carrier using the *hue* argument." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gEHpKNzJYJOZ" }, "outputs": [], "source": [ "#Create a density plot for each carrier\n", "sns.displot(df_flights, x='arr_delay', kind='kde', hue='carrier', fill=False).set(xlabel='Arrival delay')\n", "plt.savefig('overlapDens.png', dpi=600)" ] }, { "cell_type": "markdown", "metadata": { "id": "FIll09p3alC3" }, "source": [ "This helps us see that the distributions are mostly centered around a delay of 0 minutes. 
Are there better ways to visualize this to see the differences between carriers?\n", "\n", "**Let's try a violin plot**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BScOZdAYatqc" }, "outputs": [], "source": [ "#plot a violin plot\n", "sns.violinplot(data=df_flights, x='carrier',y='arr_delay')\n", "plt.axhline(0, ls='--') #add a dashed line at zero\n", "\n", "#save the figure\n", "plt.savefig('output_figure.png',dpi=600)" ] }, { "cell_type": "markdown", "metadata": { "id": "jyHOvgcK2Oj0" }, "source": [ "Can you use what you've learnt above to decide which airlines are better to take to avoid departure and arrival delays? \n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "jYwbraz-kFGS" }, "source": [ "**Alternatively, we could try out a bar plot**" ] }, { "cell_type": "markdown", "metadata": { "id": "mByoSSXPimtX" }, "source": [ "> There are often many ways to view the same data. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-q76qqviitEu" }, "outputs": [], "source": [ "sns.barplot(data=df_flights, x=\"carrier\", y=\"arr_delay\" ).set(ylabel='Arrival delay') #change the y-axis label\n", "plt.axhline(0, ls='--') #add a dashed line at zero\n", "\n", "#save the figure\n", "plt.savefig(\"bar_test.png\")" ] }, { "cell_type": "markdown", "metadata": { "id": "3LDaZKbvj3q4" }, "source": [ "Which version of the plot do you find more intuitive? Is one easier to *read* than the other?" ] }, { "cell_type": "markdown", "metadata": { "id": "0LG0RR7KTM45" }, "source": [ " " ] }, { "cell_type": "markdown", "metadata": { "id": "0PtzV31zTNhs" }, "source": [ " \n", "Can you reproduce the three figures above (density plot, violin plot, and bar plot) using origin instead of carrier? Which one do you find the most intuitive?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yBTb43-STN86" }, "outputs": [], "source": [ "#density plot\n", "sns.displot(?, x='?', kind='kde', hue='?', fill=?)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "s07C1_TkTuqc" }, "outputs": [], "source": [ "#plot a violin plot\n", "sns.violinplot(data=?, x='?',y='?')\n", "plt.axhline(0, ls='--') #add a dashed line at zero" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ssS14ViSTvBV" }, "outputs": [], "source": [ "#bar plot\n", "sns.barplot(data=?, x=\"?\", y=\"?\" )" ] }, { "cell_type": "markdown", "metadata": { "id": "kK-ljIm5Q71G" }, "source": [ "### Further reading" ] }, { "cell_type": "markdown", "metadata": { "id": "qwyufH-L2e52" }, "source": [ "* Plot using [categorical data](https://seaborn.pydata.org/tutorial/categorical.html) \n", "* Plotting [distributions](https://seaborn.pydata.org/tutorial/distributions.html) " ] }, { "cell_type": "markdown", "metadata": { "id": "44JZi9NqR0L-" }, "source": [ "### (Bonus Questions)" ] }, { "cell_type": "markdown", "metadata": { "id": "NWSA7uVBSBBT" }, "source": [ "Create a new variable that indicates whether it is daytime. (Hint: search for the pandas `between` function, then choose two time values that roughly delineate day/night.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6NOTfWjBR5uJ" }, "outputs": [], "source": [ "df_flights['day'] = ?.?.between(?,?) " ] }, { "cell_type": "markdown", "metadata": { "id": "vM6LkLE3WNys" }, "source": [ "Next, create plots to determine the difference in arrival delays between day and night." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c7wkGfJhWHBr" }, "outputs": [], "source": [ "#density plot\n", "?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "h6vQCA-QWdR9" }, "outputs": [], "source": [ "#plot a violin plot\n", "?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IqcBx4xWWhi4" }, "outputs": [], "source": [ "#bar plot\n", "?" ] }, { "cell_type": "markdown", "metadata": { "id": "7Soixa2PW9LI" }, "source": [ "Post to Slack and see whether your results roughly correspond to what others have found. \n", " \n", "How sensitive are your results to the choice of times for day/night?" ] }, { "cell_type": "markdown", "metadata": { "id": "BJqFrYyyYzMx" }, "source": [ "### Further reading" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> If you would like the notebook without missing code, check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_introViz2_densityPlots.ipynb) version." ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyPA6/TwVzZX9FitKrjiVLgW", "collapsed_sections": [ "44JZi9NqR0L-" ], "include_colab_link": true, "mount_file_id": "1HDcsWYRxfD-4GPwTmd-ECovCcls6m1iG", "name": "introViz2_densityPlots.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 2 }