{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "9C2arduFECBl" }, "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": { "id": "BQpnBRMDe_mB" }, "source": [ "## Explainability vs Causality\n", "\n", "\n", "Here we will look at the difference between understanding how the ML model is making predictions (explainability) and what is causing the outcome (causality)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "yFmN1aLjfVej" }, "source": [ "To do so we will look at a silly example where we know that the patterns picked up by the model are not causal.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "H4-8qJBvfiL1" }, "source": [ "### Waffle houses and divorce rates\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "R6oL7k7uftq4" }, "outputs": [], "source": [ "import pandas as pd\n", "import sklearn as sk\n", "import seaborn as sns\n", "from matplotlib import pyplot as plt\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": { "id": "GZ7MxOMHflXV" }, "source": [ "Load the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5NHlsf8GeMWW" }, "outputs": [], "source": [ "#load data\n", "df_waffles = pd.read_csv(\"/content/waffles.csv\")\n", "\n", "#take a look\n", "df_waffles.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "FovWxCFQkiTn" }, "source": [ "Visualize the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Lm36a2uZmIW5" }, "outputs": [], "source": [ "#sort the dataframe\n", "pd_df = df_waffles.sort_values(['Divorce']).reset_index(drop=True)\n", "\n", "#plot by state\n", "sns.barplot(data=pd_df, x=\"Loc\",y=\"Divorce\")\n", "plt.xticks(rotation=90)" ] }, { "cell_type": "markdown", "metadata": { "id": "GMJVKC1ckysB" }, "source": [ "### Do whaffle houses cause divorce?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8ByGVB4fLmIG" }, "outputs": [], "source": [ "#correlation\n", "?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GUnoEdV7AZLP" }, "outputs": [], "source": [ "#scatter plot\n", "sns.?(data=?, x=\"WaffleHouses\", y=\"Divorce\" )\n" ] }, { "cell_type": "markdown", "metadata": { "id": "lKdW7X6B0epK" }, "source": [ "Data wrangling" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7uuaVICoRlMs" }, "outputs": [], "source": [ "#split these data into training and testing datasets\n", "df_train, df_test = train_test_split(df_waffles, test_size=0.20, random_state=14)" ] }, { "cell_type": "markdown", "metadata": { "id": "OAk0s-hKmH7i" }, "source": [ "### Build a model\n", "\n", "Can we predict divorce rates based on:\n", "1. Population\n", "2. Marriage rates (more marriage more divorce)\n", "3. Median age at marriage \n", "4. Number of waffle houses" ] }, { "cell_type": "markdown", "metadata": { "id": "GTUwaLV9HdvV" }, "source": [ " " ] }, { "cell_type": "markdown", "metadata": { "id": "eqXwuYoe8Va9" }, "source": [ "Build a linear regression predicting Divorce using wafflehouses." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eHrG5Z1lRJ5q" }, "outputs": [], "source": [ "import statsmodels.api as sm #for running regression!\n", "import statsmodels.formula.api as smf\n", "\n", "#1. Build the model\n", "?\n", "\n", "#2. 
{ "cell_type": "markdown", "metadata": { "id": "-l9Slvyc72dT" }, "source": [ "### Fit the model again, this time adding the South variable" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mweD6mtJSYPh" }, "outputs": [], "source": [ "#Build the model (one possible completion)\n", "model2 = smf.ols(\"Divorce ~ WaffleHouses + South\", data=df_train)\n", "\n", "#Use the data to fit the model (i.e., find the best intercept and slope parameters)\n", "results2 = model2.fit()\n", "\n", "#Look at the summary\n", "results2.summary()" ] }, { "cell_type": "markdown", "metadata": { "id": "NK_IenAS48i1" }, "source": [ "#### Bonus" ] }, { "cell_type": "markdown", "metadata": { "id": "9lEDh3SO4_4u" }, "source": [ "Try running the models with alternative combinations of variables. How does the model's estimate of the effect of WaffleHouses on Divorce change?" ] }, { "cell_type": "markdown", "metadata": { "id": "KgxVHlWrjkry" }, "source": [ "### Statistical confounds\n", "\n", "> Statistical confounds make it hard to determine whether the patterns an ML model picks up are causal. We need to be careful to distinguish between how a model makes its predictions and whether those patterns reflect cause and effect.\n", "\n", "> In the case of waffle houses and divorce rates, there are simply more waffle houses in southern states, and southern states also have higher divorce rates: South --> WaffleHouses, and South --> Divorce rates." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xsWU8gAgkGUo" }, "outputs": [], "source": [ "sns.boxplot(data=df_waffles, x=\"South\", y=\"WaffleHouses\")" ] }, { "cell_type": "markdown", "metadata": { "id": "ZZmp46mF7reu" }, "source": [ "### Let's see what feature importance suggests" ] }, { "cell_type": "markdown", "metadata": { "id": "aXmjmKWK4Bnh" }, "source": [ " " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "r9PhBzfXO26O" }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.inspection import permutation_importance\n", "\n", "#split data into predictors (X) and target (y)\n", "X = df_waffles.drop(['Divorce','Location','Loc'], axis=1)\n", "y = df_waffles['Divorce']\n", "\n", "#split these data into training and testing datasets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)\n", "\n", "#fit linear regression\n", "LR1 = LinearRegression()\n", "LR1.fit(X_train, y_train)\n", "\n", "#model interpretation: permutation importance on the held-out test set\n", "rel_impo = permutation_importance(LR1, X_test, y_test, n_repeats=30, random_state=0)\n", "pd.DataFrame({\"feature\":X_test.columns, \"importance\":rel_impo.importances_mean, \"sd\":rel_impo.importances_std})" ] }, { "cell_type": "markdown", "metadata": { "id": "md1RcfQv8iJp" }, "source": [ "### Let's see what feature selection suggests" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yp9Zr-3C80AC" }, "outputs": [], "source": [ "from sklearn.model_selection import KFold\n", "from sklearn.feature_selection import RFECV\n", "\n", "#split data into predictors (X) and target (y)\n", "X = df_waffles.drop(['Divorce','Location','Loc'], axis=1)\n", "y = df_waffles['Divorce']\n", "\n", "#split these data into training and testing datasets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)\n", "\n", "#build a linear regression (full model)\n", "LR1 = LinearRegression()\n", "\n", "#fit linear regression\n", "LR1.fit(X_train, y_train)" ] },
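{ "cell_type": "markdown", "metadata": {}, "source": [ "As an optional check (not part of the original exercise), you can record how well this full model does on the held-out test data before any features are removed; `score` for a `LinearRegression` returns R-squared." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#optional: R-squared of the full model on the test set, for comparison after feature selection\n", "LR1.score(X_test, y_test)" ] },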
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "qLvRgura9lSE" }, "outputs": [], "source": [ "#min number of variables/features to keep\n", "min_features_to_select = 1\n", "\n", "#build the feature selection algorithm\n", "rfecv = RFECV(estimator=LR1, step=1, cv=3, scoring='neg_mean_squared_error', min_features_to_select=min_features_to_select)\n", "\n", "#fit the algorithm to the data\n", "rfecv.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FSObYPwA80AR" }, "outputs": [], "source": [ "print(\"Optimal number of features: %d\" % rfecv.n_features_)\n", "\n", "#plot number of features vs. cross-validation scores\n", "#note: older versions of scikit-learn stored these scores in rfecv.grid_scores_; newer versions use rfecv.cv_results_\n", "mean_cv_scores = rfecv.cv_results_[\"mean_test_score\"]\n", "\n", "plt.figure()\n", "plt.xlabel(\"Number of features selected\")\n", "plt.ylabel(\"Cross-validation score (negative mean squared error)\")\n", "plt.plot(range(min_features_to_select,\n", "               len(mean_cv_scores) + min_features_to_select),\n", "         mean_cv_scores)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XfXCCK-q80AS" }, "outputs": [], "source": [ "#which features were kept?\n", "rfecv.support_" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3f-RxGiU80AS" }, "outputs": [], "source": [ "#keep only the selected features\n", "X_train_reduced = X_train.iloc[:, rfecv.support_]\n", "\n", "X_train_reduced.head(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5E1zJmWw-Oox" }, "outputs": [], "source": [ "#get the slopes!\n", "rfecv.estimator_.coef_" ] }, { "cell_type": "markdown", "metadata": { "id": "rhsmIwF-OoJU" }, "source": [ "### Bonus" ] }, { "cell_type": "markdown", "metadata": { "id": "rTOtIGZmOqsF" }, "source": [ "Redo the exercise above, this time using a more black-box approach, e.g., a Random Forest!" ] }, { "cell_type": "markdown", "metadata": { "id": "BJqFrYyyYzMx" }, "source": [ "### Further reading" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> If you would like the notebook without the missing code, check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_IntroCausalAnalysis.ipynb) version." ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyN3HQqgWw6dGvAp7C4PrcfC", "collapsed_sections": [], "include_colab_link": true, "mount_file_id": "1ZSejLEaSqtPw_BnKWwNvlKloiaTPp-zf", "name": "IntroCausalAnalysis.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }