{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BQpnBRMDe_mB"
},
"source": [
"## Explainability vs Causality\n",
"\n",
"\n",
"Here we will look at the difference between understanding how the ML model is making predictions (explainability) and what is causing the outcome (causality)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yFmN1aLjfVej"
},
"source": [
"To do so we will look at a silly example where we know that the patterns picked up by the model are not causal.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "H4-8qJBvfiL1"
},
"source": [
"### Waffle houses and divorce rates\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "R6oL7k7uftq4"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import sklearn as sk\n",
"import seaborn as sns\n",
"from matplotlib import pyplot as plt\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GZ7MxOMHflXV"
},
"source": [
"Load the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5NHlsf8GeMWW"
},
"outputs": [],
"source": [
"#load data\n",
"df_waffles = pd.read_csv(\"/content/waffles.csv\")\n",
"\n",
"#take a look\n",
"df_waffles.head()"
]
},
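{
"cell_type": "markdown",
"metadata": {},
"source": [
"It can also help to take a quick look at the column names and data types before modelling. The check below is optional and just uses pandas' built-in info method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#check column names and data types\n",
"df_waffles.info()"
]
},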
{
"cell_type": "markdown",
"metadata": {
"id": "FovWxCFQkiTn"
},
"source": [
"Visualize the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Lm36a2uZmIW5"
},
"outputs": [],
"source": [
"#sort the dataframe\n",
"pd_df = df_waffles.sort_values(['Divorce']).reset_index(drop=True)\n",
"\n",
"#plot by state\n",
"sns.barplot(data=pd_df, x=\"Loc\",y=\"Divorce\")\n",
"plt.xticks(rotation=90)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GMJVKC1ckysB"
},
"source": [
"### Do whaffle houses cause divorce?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8ByGVB4fLmIG"
},
"outputs": [],
"source": [
"#correlation\n",
"?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "GUnoEdV7AZLP"
},
"outputs": [],
"source": [
"#scatter plot\n",
"sns.?(data=?, x=\"WaffleHouses\", y=\"Divorce\" )\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lKdW7X6B0epK"
},
"source": [
"Data wrangling"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7uuaVICoRlMs"
},
"outputs": [],
"source": [
"#split these data into training and testing datasets\n",
"df_train, df_test = train_test_split(df_waffles, test_size=0.20, random_state=14)"
]
},
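{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check on the split: how many rows ended up in the training and testing sets?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#number of rows and columns in each split\n",
"df_train.shape, df_test.shape"
]
},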
{
"cell_type": "markdown",
"metadata": {
"id": "OAk0s-hKmH7i"
},
"source": [
"### Build a model\n",
"\n",
"Can we predict divorce rates based on:\n",
"1. Population\n",
"2. Marriage rates (more marriage more divorce)\n",
"3. Median age at marriage \n",
"4. Number of waffle houses"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eqXwuYoe8Va9"
},
"source": [
"Build a linear regression predicting Divorce using wafflehouses."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eHrG5Z1lRJ5q"
},
"outputs": [],
"source": [
"import statsmodels.api as sm #for running regression!\n",
"import statsmodels.formula.api as smf\n",
"\n",
"#1. Build the model\n",
"?\n",
"\n",
"#2. Use the data to fit the model (i.e., find the best intercept and slope parameters)\n",
"?\n",
"\n",
"#Look summary\n",
"?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-l9Slvyc72dT"
},
"source": [
"### Fit the model again, this time add the South variable"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mweD6mtJSYPh"
},
"outputs": [],
"source": [
"#Build the model\n",
"?\n",
"\n",
"#Use the data to fit the model (i.e., find the best intercept and slope parameters)\n",
"?\n",
"\n",
"#summary\n",
"?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NK_IenAS48i1"
},
"source": [
"#### Bonus"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9lEDh3SO4_4u"
},
"source": [
"Try to run the models with alternative combinations of variables? How does the model estimate of the effect of wafflehouses on divorce change?"
]
},
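{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of one alternative combination. It assumes the waffles.csv file has a MedianAgeMarriage column (the median age at marriage listed above); check df_waffles.columns and swap in the correct name if it differs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#one possible alternative model: waffle houses plus median age at marriage\n",
"#note: MedianAgeMarriage is an assumed column name - check df_waffles.columns\n",
"lm_alt = smf.ols(formula=\"Divorce ~ WaffleHouses + MedianAgeMarriage\", data=df_train)\n",
"lm_alt_fit = lm_alt.fit()\n",
"print(lm_alt_fit.summary())"
]
},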
{
"cell_type": "markdown",
"metadata": {
"id": "KgxVHlWrjkry"
},
"source": [
"### Statistical confounds\n",
"\n",
"> Statistical confounds make it hard to determine the causal nature of the patterns we find in ML model results. We need to be careful about how we explain how a model makes predictions and the causal nature of those patterns.\n",
"\n",
"> In the case of the whaffle houses and divorce rates, there are just more waffle houses in southern states. South --> wafflehouses --> Divorce rates"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xsWU8gAgkGUo"
},
"outputs": [],
"source": [
"sns.boxplot(data=df_waffles, x=\"South\", y=\"WaffleHouses\")"
]
},
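{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can check the other arm of the confound in the same way: do divorce rates also differ between southern and non-southern states? This reuses the South and Divorce columns already in df_waffles."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#divorce rates in southern vs. non-southern states\n",
"sns.boxplot(data=df_waffles, x=\"South\", y=\"Divorce\")"
]
},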
{
"cell_type": "markdown",
"metadata": {
"id": "ZZmp46mF7reu"
},
"source": [
"### Let's see what feature importance suggests"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "r9PhBzfXO26O"
},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.inspection import permutation_importance\n",
"\n",
"#split data into predictors (X) and target (y)\n",
"X = df_waffles.drop(['Divorce','Location','Loc'],axis=1)\n",
"y = df_waffles['Divorce']\n",
"\n",
"#split these data into training and testing datasets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)\n",
"\n",
"#fit linear regression\n",
"LR1 = LinearRegression()\n",
"LR1.fit(X_train, y_train)\n",
"\n",
"#model interpretation\n",
"rel_impo = permutation_importance(LR1, X_test, y_test,n_repeats=30,random_state=0)\n",
"pd.DataFrame({\"feature\":X_test.columns,\"importance\":rel_impo.importances_mean, \"sd\":rel_impo.importances_std})"
]
},
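{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the permutation importances easier to compare, we can also plot them. This is a small optional sketch that reuses the rel_impo and X_test objects created above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#plot the permutation importances with their standard deviations\n",
"impo_df = pd.DataFrame({\"feature\":X_test.columns,\"importance\":rel_impo.importances_mean, \"sd\":rel_impo.importances_std}).sort_values(\"importance\")\n",
"plt.barh(impo_df[\"feature\"], impo_df[\"importance\"], xerr=impo_df[\"sd\"])\n",
"plt.xlabel(\"Mean drop in model score when the feature is permuted\")\n",
"plt.show()"
]
},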
{
"cell_type": "markdown",
"metadata": {
"id": "md1RcfQv8iJp"
},
"source": [
"### Let's see what feature selection suggests"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "yp9Zr-3C80AC"
},
"outputs": [],
"source": [
"from sklearn.model_selection import KFold\n",
"from sklearn.feature_selection import RFECV\n",
"\n",
"#split data into predictors (X) and target (y)\n",
"X = df_waffles.drop(['Divorce','Location','Loc'], axis=1)\n",
"y = df_waffles['Divorce']\n",
"\n",
"#split these data into training and testing datasets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)\n",
"\n",
"#build a linear regression (full model)\n",
"LR1 = LinearRegression()\n",
"\n",
"#fit linear regression\n",
"LR1.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qLvRgura9lSE"
},
"outputs": [],
"source": [
"#min number of variables/features\n",
"min_features_to_select = 1\n",
"\n",
"#build the feature selection algorithm\n",
"rfecv = RFECV(estimator=LR1, step=1, cv=3,scoring='neg_mean_squared_error', min_features_to_select=min_features_to_select)\n",
"\n",
"#fit the algorithm to the data\n",
"rfecv.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FSObYPwA80AR"
},
"outputs": [],
"source": [
"print(\"Optimal number of features : %d\" % rfecv.n_features_)\n",
"\n",
"# Plot number of features VS. cross-validation scores\n",
"plt.figure()\n",
"plt.xlabel(\"Number of features selected\")\n",
"plt.ylabel(\"Cross validation score (mean square error?)\")\n",
"plt.plot(range(min_features_to_select,\n",
" len(rfecv.grid_scores_) + min_features_to_select),\n",
" rfecv.grid_scores_)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XfXCCK-q80AS"
},
"outputs": [],
"source": [
"rfecv.support_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3f-RxGiU80AS"
},
"outputs": [],
"source": [
"X_train_reduced = X_train.iloc[:,rfecv.support_]\n",
"\n",
"X_train_reduced.head(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5E1zJmWw-Oox"
},
"outputs": [],
"source": [
"#get the slopes!\n",
"rfecv.estimator_.coef_"
]
},
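{
"cell_type": "markdown",
"metadata": {},
"source": [
"The slopes above are easier to read when lined up with the names of the selected features. The small sketch below reuses the rfecv and X_train objects from above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#pair each selected feature with its estimated slope\n",
"pd.DataFrame({\"feature\":X_train.columns[rfecv.support_], \"slope\":rfecv.estimator_.coef_})"
]
},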
{
"cell_type": "markdown",
"metadata": {
"id": "rhsmIwF-OoJU"
},
"source": [
"### Bonus"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rTOtIGZmOqsF"
},
"source": [
"Redo the exercise above this time using a more black box approach, e.g., Random Forest!"
]
},
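{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a minimal sketch of one way to start this bonus, using scikit-learn's RandomForestRegressor together with the same permutation importance approach as above. It reuses the X_train, X_test, y_train, and y_test splits created earlier, and the hyperparameters shown are just reasonable starting values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestRegressor\n",
"\n",
"#fit a random forest (n_estimators is an arbitrary starting value - tune as you like)\n",
"rf1 = RandomForestRegressor(n_estimators=500, random_state=0)\n",
"rf1.fit(X_train, y_train)\n",
"\n",
"#model interpretation via permutation importance\n",
"rf_impo = permutation_importance(rf1, X_test, y_test, n_repeats=30, random_state=0)\n",
"pd.DataFrame({\"feature\":X_test.columns,\"importance\":rf_impo.importances_mean, \"sd\":rf_impo.importances_std})"
]
},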
{
"cell_type": "markdown",
"metadata": {
"id": "BJqFrYyyYzMx"
},
"source": [
"### Further reading"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> If you would like the notebook without missing code check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_IntroCausalAnalysis.ipynb) version."
]
}
],
"metadata": {
"colab": {
"authorship_tag": "ABX9TyN3HQqgWw6dGvAp7C4PrcfC",
"collapsed_sections": [],
"include_colab_link": true,
"mount_file_id": "1ZSejLEaSqtPw_BnKWwNvlKloiaTPp-zf",
"name": "IntroCausalAnalysis.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}