{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "9C2arduFECBl" }, "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": { "id": "BQpnBRMDe_mB" }, "source": [ "## Explainability vs Causality\n", "\n", "\n", "Here we will look at the difference between understanding how the ML model is making predictions (explainability) and what is causing the outcome (causality)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "yFmN1aLjfVej" }, "source": [ "To do so we will look at a silly example where we know that the patterns picked up by the model are not causal.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "H4-8qJBvfiL1" }, "source": [ "### Waffle houses and divorce rates\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "R6oL7k7uftq4" }, "outputs": [], "source": [ "import pandas as pd\n", "import sklearn as sk\n", "import seaborn as sns\n", "from matplotlib import pyplot as plt\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": { "id": "GZ7MxOMHflXV" }, "source": [ "Load the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5NHlsf8GeMWW" }, "outputs": [], "source": [ "#load data\n", "df_waffles = pd.read_csv(\"/content/waffles.csv\")\n", "\n", "#take a look\n", "df_waffles.head()" ] }, { "cell_type": "markdown", "metadata": { "id": "FovWxCFQkiTn" }, "source": [ "Visualize the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Lm36a2uZmIW5" }, "outputs": [], "source": [ "#sort the dataframe\n", "pd_df = df_waffles.sort_values(['Divorce']).reset_index(drop=True)\n", "\n", "#plot by state\n", "sns.barplot(data=pd_df, x=\"Loc\",y=\"Divorce\")\n", "plt.xticks(rotation=90)" ] }, { "cell_type": "markdown", "metadata": { "id": "GMJVKC1ckysB" }, "source": [ "### Do whaffle houses cause divorce?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8ByGVB4fLmIG" }, "outputs": [], "source": [ "#correlation\n", "?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GUnoEdV7AZLP" }, "outputs": [], "source": [ "#scatter plot\n", "sns.?(data=?, x=\"WaffleHouses\", y=\"Divorce\" )\n" ] }, { "cell_type": "markdown", "metadata": { "id": "lKdW7X6B0epK" }, "source": [ "Data wrangling" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7uuaVICoRlMs" }, "outputs": [], "source": [ "#split these data into training and testing datasets\n", "df_train, df_test = train_test_split(df_waffles, test_size=0.20, random_state=14)" ] }, { "cell_type": "markdown", "metadata": { "id": "OAk0s-hKmH7i" }, "source": [ "### Build a model\n", "\n", "Can we predict divorce rates based on:\n", "1. Population\n", "2. Marriage rates (more marriage more divorce)\n", "3. Median age at marriage \n", "4. Number of waffle houses" ] }, { "cell_type": "markdown", "metadata": { "id": "GTUwaLV9HdvV" }, "source": [ " " ] }, { "cell_type": "markdown", "metadata": { "id": "eqXwuYoe8Va9" }, "source": [ "Build a linear regression predicting Divorce using wafflehouses." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eHrG5Z1lRJ5q" }, "outputs": [], "source": [ "import statsmodels.api as sm #for running regression!\n", "import statsmodels.formula.api as smf\n", "\n", "#1. Build the model\n", "?\n", "\n", "#2. 
{ "cell_type": "markdown", "metadata": { "id": "-l9Slvyc72dT" }, "source": [ "### Fit the model again, this time adding the South variable" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mweD6mtJSYPh" }, "outputs": [], "source": [ "#Build the model (one possible completion)\n", "model2 = smf.ols(\"Divorce ~ WaffleHouses + South\", data=df_train)\n", "\n", "#Use the data to fit the model (i.e., find the best intercept and slope parameters)\n", "results2 = model2.fit()\n", "\n", "#Look at the summary\n", "results2.summary()" ] }, { "cell_type": "markdown", "metadata": { "id": "NK_IenAS48i1" }, "source": [ "#### Bonus" ] }, { "cell_type": "markdown", "metadata": { "id": "9lEDh3SO4_4u" }, "source": [ "Try running the models with alternative combinations of variables. How does the model's estimate of the effect of WaffleHouses on Divorce change?" ] }, { "cell_type": "markdown", "metadata": { "id": "KgxVHlWrjkry" }, "source": [ "### Statistical confounds\n", "\n", "> Statistical confounds make it hard to determine whether the patterns an ML model picks up are causal. We need to be careful to distinguish between how a model makes its predictions and whether those patterns reflect cause and effect.\n", "\n", "> In the case of waffle houses and divorce rates, there are simply more waffle houses in southern states, and southern states also have higher divorce rates: South --> WaffleHouses, and South --> Divorce rates." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xsWU8gAgkGUo" }, "outputs": [], "source": [ "sns.boxplot(data=df_waffles, x=\"South\", y=\"WaffleHouses\")" ] }, { "cell_type": "markdown", "metadata": { "id": "ZZmp46mF7reu" }, "source": [ "### Let's see what feature importance suggests" ] }, { "cell_type": "markdown", "metadata": { "id": "aXmjmKWK4Bnh" }, "source": [ " " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "r9PhBzfXO26O" }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.inspection import permutation_importance\n", "\n", "#split data into predictors (X) and target (y)\n", "X = df_waffles.drop(['Divorce','Location','Loc'], axis=1)\n", "y = df_waffles['Divorce']\n", "\n", "#split these data into training and testing datasets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)\n", "\n", "#fit linear regression\n", "LR1 = LinearRegression()\n", "LR1.fit(X_train, y_train)\n", "\n", "#model interpretation: permutation importance on the held-out test set\n", "rel_impo = permutation_importance(LR1, X_test, y_test, n_repeats=30, random_state=0)\n", "pd.DataFrame({\"feature\":X_test.columns, \"importance\":rel_impo.importances_mean, \"sd\":rel_impo.importances_std})" ] }, { "cell_type": "markdown", "metadata": { "id": "md1RcfQv8iJp" }, "source": [ "### Let's see what feature selection suggests" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yp9Zr-3C80AC" }, "outputs": [], "source": [ "from sklearn.model_selection import KFold\n", "from sklearn.feature_selection import RFECV\n", "\n", "#split data into predictors (X) and target (y)\n", "X = df_waffles.drop(['Divorce','Location','Loc'], axis=1)\n", "y = df_waffles['Divorce']\n", "\n", "#split these data into training and testing datasets\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)\n", "\n", "#build a linear regression (full model)\n", "LR1 = LinearRegression()\n", "\n", "#fit linear regression\n", "LR1.fit(X_train, y_train)" ] },
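{ "cell_type": "markdown", "metadata": {}, "source": [ "As an optional check (not part of the original exercise), you can record how well this full model does on the held-out test data before any features are removed; `score` for a `LinearRegression` returns R-squared." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#optional: R-squared of the full model on the test set, for comparison after feature selection\n", "LR1.score(X_test, y_test)" ] },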
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "qLvRgura9lSE" }, "outputs": [], "source": [ "#min number of variables/features to keep\n", "min_features_to_select = 1\n", "\n", "#build the feature selection algorithm\n", "rfecv = RFECV(estimator=LR1, step=1, cv=3, scoring='neg_mean_squared_error', min_features_to_select=min_features_to_select)\n", "\n", "#fit the algorithm to the data\n", "rfecv.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FSObYPwA80AR" }, "outputs": [], "source": [ "print(\"Optimal number of features: %d\" % rfecv.n_features_)\n", "\n", "#plot number of features vs. cross-validation scores\n", "#note: older versions of scikit-learn stored these scores in rfecv.grid_scores_; newer versions use rfecv.cv_results_\n", "mean_cv_scores = rfecv.cv_results_[\"mean_test_score\"]\n", "\n", "plt.figure()\n", "plt.xlabel(\"Number of features selected\")\n", "plt.ylabel(\"Cross-validation score (negative mean squared error)\")\n", "plt.plot(range(min_features_to_select,\n", "               len(mean_cv_scores) + min_features_to_select),\n", "         mean_cv_scores)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XfXCCK-q80AS" }, "outputs": [], "source": [ "#which features were kept?\n", "rfecv.support_" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3f-RxGiU80AS" }, "outputs": [], "source": [ "#keep only the selected features\n", "X_train_reduced = X_train.iloc[:, rfecv.support_]\n", "\n", "X_train_reduced.head(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5E1zJmWw-Oox" }, "outputs": [], "source": [ "#get the slopes!\n", "rfecv.estimator_.coef_" ] }, { "cell_type": "markdown", "metadata": { "id": "rhsmIwF-OoJU" }, "source": [ "### Bonus" ] }, { "cell_type": "markdown", "metadata": { "id": "rTOtIGZmOqsF" }, "source": [ "Redo the exercise above, this time using a more black-box approach, e.g., a Random Forest!" ] }, { "cell_type": "markdown", "metadata": { "id": "BJqFrYyyYzMx" }, "source": [ "### Further reading" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> If you would like the notebook without the missing code, check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_IntroCausalAnalysis.ipynb) version." ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyN3HQqgWw6dGvAp7C4PrcfC", "collapsed_sections": [], "include_colab_link": true, "mount_file_id": "1ZSejLEaSqtPw_BnKWwNvlKloiaTPp-zf", "name": "IntroCausalAnalysis.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }