{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "view-in-github"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/tbonne/peds/blob/main/docs/introViz/IntroViz3_scatterplots.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DqPzbMV5v_NQ"
      },
      "source": [
        "<img src='http://drive.google.com/uc?export=view&id=1C2o3BW9_N9LkeIi2lM4viUwvXK75G6Nc'>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "***"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jyHOvgcK2Oj0"
      },
      "source": [
        "## <font color='darkorange'>Scatter plots</font>\n",
        "\n",
        "In this exercise we will learn how to plot data in scatter plots. Unlike the previous examples with histograms and density plots, scatter plots will let us look at two variables at once (i.e., bivariate relationships).\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wBv_U2EwAWFS"
      },
      "source": [
        "Import the libraries\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "npo0C3reAbqn"
      },
      "outputs": [],
      "source": [
        "#import libraries\n",
        "import pandas as pd\n",
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "M2r3E549wJZL"
      },
      "source": [
        "Bring in the NYC flight data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "sobqsVEo4tyE"
      },
      "outputs": [],
      "source": [
        "df_flights = ?"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qETX2W0nApeh"
      },
      "source": [
        "In our previous exercises we looked at departure and arrival delays separately, but are they related to each other? Let's use a scatter plot to see whether higher departure delays go along with higher arrival delays.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "RIUmxFlrAmei"
      },
      "outputs": [],
      "source": [
        "#plot a scatterplot\n",
        "sns.scatterplot(data=df_flights, x='dep_delay',y='arr_delay')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Xqc_NPvp50vK"
      },
      "source": [
        "Let's add some nicer labels to the axes of the plot."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qHUCBjcl5YWW"
      },
      "outputs": [],
      "source": [
        "#scatterplot with labels\n",
        "sns.scatterplot(data=df_flights, x='dep_delay',y='arr_delay').set(xlabel='Departure delay (minutes)', ylabel='Arrival delay (minutes)')\n",
        "\n",
        "#save file\n",
        "plt.savefig(\"delay_scatterplot.png\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SSNminDs5_8F"
      },
      "source": [
        "This plot shows that when departure delays are high, arrival delays tend to be high too. In other words, flights that leave late are generally not able to make up the lost time and end up arriving late at their destination."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_nLixPTJwztL"
      },
      "source": [
        "### <font color='darkorange'>Estimate correlations</font>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yNJ4olmjw5OY"
      },
      "source": [
        "> We will use the `corr` function built into pandas to estimate the correlation between departure delay and arrival delay."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ElNwZCDyw6Kc"
      },
      "outputs": [],
      "source": [
        "#estimate correlation\n",
        "df_flights.arr_delay.corr(df_flights.dep_delay)"
      ]
    },
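    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "By default, `corr` returns the Pearson correlation coefficient, a number between -1 and 1 that measures how closely two variables follow a straight-line relationship. As a minimal sketch, the cell below checks this on a small made-up table: `df_toy` and its values are invented for illustration and are not part of the flights data. It compares the pandas result with the coefficient computed by hand (the covariance divided by the product of the standard deviations)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "\n",
        "#hypothetical toy data, invented for illustration only\n",
        "df_toy = pd.DataFrame({\n",
        "    'dep_delay': [0, 5, 10, 20, 40],\n",
        "    'arr_delay': [-3, 4, 8, 25, 38],\n",
        "})\n",
        "\n",
        "#pandas computes the Pearson correlation by default\n",
        "r_pandas = df_toy.arr_delay.corr(df_toy.dep_delay)\n",
        "\n",
        "#the same coefficient by hand, using deviations from the mean\n",
        "x = df_toy.dep_delay - df_toy.dep_delay.mean()\n",
        "y = df_toy.arr_delay - df_toy.arr_delay.mean()\n",
        "r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())\n",
        "\n",
        "#the two values should agree\n",
        "print(r_pandas, r_manual)"
      ]
    },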
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bVO2HyyqxUDH"
      },
      "source": [
        "<img src='http://drive.google.com/uc?export=view&id=1WC4tXGCEF-1_2LQ74gIxJAZ-GLXCwBdK' width=\"100\">  "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8QK4WYDdxXR3"
      },
      "source": [
        "Try to estimate correlations between a few other variables."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3soH-jZIxi0_"
      },
      "outputs": [],
      "source": [
        "#take a look at potential variables to compare (i.e., what columns/variables do we have)\n",
        "?"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "aQQtZb34xusk"
      },
      "outputs": [],
      "source": [
        "#estimate the correlation between two variables\n",
        "df_flights.?.corr(df_flights.?)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vT7JT-5ox3Pz"
      },
      "source": [
        "What is the largest correlation you can find? Can you also plot this relationship as a scatter plot?"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "nCRoyW5Bx9OC"
      },
      "outputs": [],
      "source": [
        "#Scatterplot\n",
        "sns.scatterplot(data=df_flights, x='?',y='?')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EXbJ-lUsHwlf"
      },
      "source": [
        "### <font color='darkorange'>Compare many variables using pair plots</font>"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "aZ-mq9qBHzG3"
      },
      "outputs": [],
      "source": [
        "#let's choose some variables to look at\n",
        "df_flights_pairs = df_flights[[\"arr_delay\",\"dep_delay\",\"distance\",\"carrier\"]] #notice the pair plot does not use carrier... why?\n",
        "\n",
        "#use the pairplot method to look at all combinations of these variables\n",
        "sns.pairplot(df_flights_pairs)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1KBOOBwt10tn"
      },
      "source": [
        "<img src='http://drive.google.com/uc?export=view&id=1WC4tXGCEF-1_2LQ74gIxJAZ-GLXCwBdK' width=\"100\">  "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Pb-L7wcH10to"
      },
      "source": [
        "Try to visualize the relationships between a few variables."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "RVU4IFF4b0M-"
      },
      "outputs": [],
      "source": [
        "#Choose some variables to look at\n",
        "df_flights_pairs = df_flights[[\"?\",\"?\",\"?\"]]\n",
        "\n",
        "#use the pairplot method to look at all combinations of these variables\n",
        "sns.pairplot(?)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7Th7ymBOEPSD"
      },
      "source": [
        "### <font color='darkorange'>Heat Maps</font>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NMMImYa-ESep"
      },
      "source": [
        "We will use our newfound correlation skills to search for patterns in our data more effectively using heat maps! These maps can quickly help us identify high and low correlations between our variables.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jn1piAE_HaDP"
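      },
      "outputs": [],
      "source": [
        "#run a correlation on all combinations of numeric variables in df_flights\n",
        "#(numeric_only=True skips text columns such as carrier, which pandas 2.x would otherwise reject)\n",
        "corrmat = df_flights.corr(numeric_only=True)\n",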
        "\n",
        "#plot the results as a heat map\n",
        "sns.heatmap(corrmat, square=False)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "RRg3YmXqGRq4"
      },
      "outputs": [],
      "source": [
        "#plot the results as a heat map (this time let's make the figure bigger)\n",
        "plt.subplots(figsize=(12,9))\n",
        "sns.heatmap(corrmat, square=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BenI8ldsGfuz"
      },
      "source": [
        "Here we can see from the colour bar on the right that lighter colours mark pairs of variables with high positive correlations, while darker colours mark pairs with large negative correlations.\n",
        "\n",
        "Things to think about:\n",
        "- do all these comparisons make sense? e.g., flight# and distance?\n",
        "- what variable types are there?\n",
        "- why are year and month not showing any values?"
      ]
    },
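    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Because correlations always fall between -1 and 1, it can help to fix the colour scale to that range, centre it on zero, and print the values inside the cells. The cell below is a sketch on invented data: `df_demo` and its columns are made up for illustration, and `numeric_only=True` assumes pandas 1.5 or newer. The `vmin`, `vmax`, `center`, `cmap` and `annot` arguments are options of `sns.heatmap`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "import matplotlib.pyplot as plt\n",
        "import seaborn as sns\n",
        "\n",
        "#hypothetical data, invented for illustration only\n",
        "rng = np.random.default_rng(0)\n",
        "a = rng.normal(size=200)\n",
        "df_demo = pd.DataFrame({\n",
        "    'a': a,\n",
        "    'b': a + rng.normal(scale=0.5, size=200),   #positively related to a\n",
        "    'c': -a + rng.normal(scale=0.5, size=200),  #negatively related to a\n",
        "    'label': ['x', 'y'] * 100,                  #text column, skipped by numeric_only\n",
        "})\n",
        "\n",
        "#correlations between the numeric columns only\n",
        "corr_demo = df_demo.corr(numeric_only=True)\n",
        "\n",
        "#fixed -1 to 1 colour scale centred on zero, with the values printed in each cell\n",
        "plt.subplots(figsize=(6,5))\n",
        "ax = sns.heatmap(corr_demo, vmin=-1, vmax=1, center=0, cmap='coolwarm', annot=True)"
      ]
    },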
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hUa0DRE_8Zag"
      },
      "source": [
        "### <font color='darkorange'>Further reading</font>\n",
        "\n",
        "Check out seaborn's very nice page on [plotting relationships using scatterplots](https://seaborn.pydata.org/tutorial/relational.html).\n",
        "\n",
        "> If you would like the notebook without missing code check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_IntroViz3_scatterplots.ipynb) version."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SSHJ5OFc6xbv"
      },
      "source": [
        "### <font color='darkorange'>Bonus material</font>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RioJDCRMUiPW"
      },
      "source": [
        "Visualizing your data is a very important step in any data science workflow. Let's take a look at the case below, where four separate datasets have the same mean and standard deviation but differ wildly in how their data is distributed.\n",
        "\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "h0Znowy0r1NU"
      },
      "source": [
        "<img src='http://drive.google.com/uc?export=view&id=1o1sS3SVNg7SFivTay0damWoN9hq9-0ap'>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "U95SXGWk7X5E"
      },
      "source": [
        "Let's load in the data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XLfGRZybUus9"
      },
      "outputs": [],
      "source": [
        "df_anscombe = pd.read_json(\"/content/sample_data/anscombe.json\")\n",
        "\n",
        "df_anscombe.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5xA21fWo7TBG"
      },
      "source": [
        "First let's show that each has the same summary statistics."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "39BioL_Q7kYK"
      },
      "outputs": [],
      "source": [
        "df_anscombe.groupby('Series').mean()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4Al_BziW71u1"
      },
      "outputs": [],
      "source": [
        "df_anscombe.groupby('Series').std()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BrmTHBUQ73Xv"
      },
      "source": [
        "Now let's take a look using scatter plots."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "pvtx3lIj7kle"
      },
      "outputs": [],
      "source": [
        "sns.scatterplot(data=df_anscombe[(df_anscombe.Series==\"I\")],x=\"X\",y=\"Y\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "2fp4cMoI9IPP"
      },
      "outputs": [],
      "source": [
        "sns.scatterplot(data=df_anscombe[(df_anscombe.Series==\"II\")],x=\"X\",y=\"Y\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "N4TYK4sI9KMQ"
      },
      "outputs": [],
      "source": [
        "sns.scatterplot(data=df_anscombe[(df_anscombe.Series==\"III\")],x=\"X\",y=\"Y\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SbUS-MPF9Ls0"
      },
      "outputs": [],
      "source": [
        "sns.scatterplot(data=df_anscombe[(df_anscombe.Series==\"IV\")],x=\"X\",y=\"Y\")"
      ]
    },
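    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The four plots above can also be drawn side by side in a single figure, which makes the comparison easier. The cell below is a sketch that uses `sns.relplot` with `col='Series'` to draw one panel per series. So that it runs without the Colab sample file, the quartet's values are typed in by hand; the `quartet` and `df_quartet` names are chosen here for illustration, and the column names are assumed to match those of `df_anscombe`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "import seaborn as sns\n",
        "\n",
        "#Anscombe's quartet typed in by hand (series I to III share the same X values)\n",
        "x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]\n",
        "quartet = {\n",
        "    'I': (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),\n",
        "    'II': (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),\n",
        "    'III': (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),\n",
        "    'IV': ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),\n",
        "}\n",
        "\n",
        "#stack the four series into one long table with Series, X and Y columns\n",
        "df_quartet = pd.concat(\n",
        "    [pd.DataFrame({'Series': name, 'X': xs, 'Y': ys}) for name, (xs, ys) in quartet.items()],\n",
        "    ignore_index=True,\n",
        ")\n",
        "\n",
        "#one panel per series, two panels per row\n",
        "grid = sns.relplot(data=df_quartet, x='X', y='Y', col='Series', col_wrap=2, height=3)"
      ]
    },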
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XUdQiBEd9TdI"
      },
      "source": [
        "Even though each of these series of points has the same descriptive statistics (mean and standard deviation), they are very different in how they are distributed. This is why it is important to visualize your data!"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "authorship_tag": "ABX9TyO01ucfzwad4L3xXsMDNiHY",
      "collapsed_sections": [
        "SSHJ5OFc6xbv"
      ],
      "include_colab_link": true,
      "mount_file_id": "1R49MCyzdNfH9lVkC2Fum1xi63d5aUXtg",
      "name": "IntroViz3_scatterplots.ipynb",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}