{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "A1R0ALhiKSGE"
},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BQpnBRMDe_mB"
},
"source": [
"## Explainability vs Causality\n",
"\n",
"\n",
"Here we look at the difference between understanding how an ML model makes its predictions (explainability) and understanding what causes the outcome (causality).\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "yFmN1aLjfVej"
},
"source": [
"To do so we will look at a university admission example. You have been tasked with deciding whether there is gender bias in admissions, and whether there are grounds for legal action against the university.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "H4-8qJBvfiL1"
},
"source": [
"### Gender and university admissions\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "R6oL7k7uftq4"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import sklearn as sk\n",
"import seaborn as sns\n",
"from matplotlib import pyplot as plt\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GZ7MxOMHflXV"
},
"source": [
"Load the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5NHlsf8GeMWW"
},
"outputs": [],
"source": [
"#load data\n",
"df_admit = pd.read_csv(\"/content/UCBadmit_01.csv\")\n",
"\n",
"#take a look\n",
"df_admit.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HZfgB0RFcizl"
},
"source": [
"Check for missing data, and the types of data we are dealing with."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "BK8hGyCGVlkv"
},
"outputs": [],
"source": [
"df_admit.info()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FovWxCFQkiTn"
},
"source": [
"### Visualize the data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hbXrp1POYEQ2"
},
"source": [
"Let's do some exploratory data analysis before building a model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1soeP6gtWf_B"
},
"outputs": [],
"source": [
"#plot admissions by reported gender\n",
"sns.barplot(?)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Lm36a2uZmIW5"
},
"outputs": [],
"source": [
"#plot admissions by department\n",
"sns.barplot(?)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GMJVKC1ckysB"
},
"source": [
"### Preprocessing\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "r9usCTOzW1XL"
},
"source": [
"We have some categorical variables so let's do some preprocessing!\n",
"\n",
"Let's one-hot-encode 'dept'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "490tR09wVFib"
},
"outputs": [],
"source": [
"#convert the categorical variable into dummy variables\n",
"df_cat = pd.get_dummies(?)\n",
"\n",
"#concat the dummy variables back onto the dataframe\n",
"df_admit = ?([df_admit, df_cat], axis = 1)\n",
"\n",
"#drop the original categorical variable\n",
"df_admit = ?.drop(['dept'], axis=1)\n",
"\n",
"#take a look\n",
"df_admit"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZfOwnvv4W5Oe"
},
"source": [
"Let's encode the binary gender column as 0/1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "aFHf0bsNW9iC"
},
"outputs": [],
"source": [
"from sklearn.preprocessing import OrdinalEncoder\n",
"\n",
"#build the encoder\n",
"my_gen = ?()\n",
"\n",
"#fit and transform the gender column\n",
"df_admit['?'] = my_gen.?(df_admit['?'].values.reshape(-1,1))\n",
"\n",
"#take a look\n",
"df_admit\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "strsEHsxXZTd"
},
"outputs": [],
"source": [
"#take a look at the categories\n",
"my_gen.categories_"
]
},
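{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of what the two encodings above produce, here is a plain-Python sketch on made-up rows. The toy values and every `toy_` name are assumptions for illustration, not part of the admissions data. One-hot encoding creates one 0/1 column per category, which is the shape `pd.get_dummies` returns; ordinal encoding replaces each category with its position in the sorted category list, which is the order `OrdinalEncoder` stores in `categories_` by default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#toy rows, made up for illustration only\n",
"toy_depts = ['A', 'B', 'A', 'C']\n",
"toy_genders = ['male', 'female', 'female', 'male']\n",
"\n",
"#one-hot encoding: one 0/1 column per category, in sorted order\n",
"toy_dept_levels = sorted(set(toy_depts))\n",
"toy_one_hot = [{level: int(d == level) for level in toy_dept_levels} for d in toy_depts]\n",
"\n",
"#ordinal encoding: each category becomes its index in the sorted category list\n",
"toy_gender_levels = sorted(set(toy_genders))\n",
"toy_gender_codes = [toy_gender_levels.index(g) for g in toy_genders]\n",
"\n",
"#take a look\n",
"print(toy_one_hot[0])\n",
"print(toy_gender_levels, toy_gender_codes)"
]
},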
{
"cell_type": "markdown",
"metadata": {
"id": "lKdW7X6B0epK"
},
"source": [
"Finally, let's split the data into training and testing sets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7uuaVICoRlMs"
},
"outputs": [],
"source": [
"#split these data into training and testing datasets\n",
"df_train, df_test = train_test_split(df_admit, test_size=0.20)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OAk0s-hKmH7i"
},
"source": [
"### Build a model\n",
"\n",
"Can we predict admission based on reported gender?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "eqXwuYoe8Va9"
},
"source": [
"Build a linear regression predicting admission using gender."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "eHrG5Z1lRJ5q"
},
"outputs": [],
"source": [
"import statsmodels.api as sm #for running regression!\n",
"import statsmodels.formula.api as smf\n",
"\n",
"#1. Build the model\n",
"m1 = ?(\"admitted ~ gender\", data=?)\n",
"\n",
"#2. Use the data to fit the model (i.e., find the best intercept and slope parameters)\n",
"m1_results = ?()\n",
"\n",
"#Look at the summary\n",
"print(?)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QMRMzS7bdUcG"
},
"source": [
"On its own, gender looks like a useful predictor of whether someone will be admitted!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-l9Slvyc72dT"
},
"source": [
"Let's fit the model again, this time adding department as a predictor."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "mweD6mtJSYPh"
},
"outputs": [],
"source": [
"#1. Build the model\n",
"m2 = ?(\"admitted ~ gender + B + C + D + E + F\", data=?)\n",
"\n",
"#2. Use the data to fit the model (i.e., find the best intercept and slope parameters)\n",
"m2_results = ?()\n",
"\n",
"#Look at the summary\n",
"print(?)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9lEDh3SO4_4u"
},
"source": [
"We can see that when we account for department in the model, the slope for gender gets close to zero. What is going on here?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KgxVHlWrjkry"
},
"source": [
"### Statistical confounds\n",
"\n",
"> Statistical confounds make it hard to determine the causal nature of the patterns we find in ML model results. We need to be careful to separate how a model makes its predictions from what actually causes the outcome.\n",
"\n",
"> In the admissions example, applicants of different genders do not apply to each department in equal measure, and departments differ in how likely they are to admit an applicant.\n",
"\n",
"That is, the causal relationships that generated this data might look something like:\n",
"\n",
"Gender --> Department --> Admission"
]
},
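{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the Gender --> Department --> Admission story concrete, here is a small sketch using hypothetical counts. These are not the UCB data: the two departments, the numbers, and the names `toy_counts` and `toy_admit_rate` are all made up for illustration. Each department admits both genders at the same rate, but the genders apply to the departments in different proportions, so the overall admission rates still differ by gender."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#hypothetical counts (NOT the UCB data): (gender, dept) -> (applicants, admitted)\n",
"toy_counts = {\n",
"    ('male', 'easy'): (80, 64),\n",
"    ('male', 'hard'): (20, 4),\n",
"    ('female', 'easy'): (20, 16),\n",
"    ('female', 'hard'): (80, 16),\n",
"}\n",
"\n",
"def toy_admit_rate(gender, dept=None):\n",
"    #admission rate for one gender, optionally within a single department\n",
"    cells = [v for (g, d), v in toy_counts.items()\n",
"             if g == gender and (dept is None or d == dept)]\n",
"    applicants = sum(n for n, _ in cells)\n",
"    admitted = sum(a for _, a in cells)\n",
"    return admitted / applicants\n",
"\n",
"#gap in admission rate ignoring department\n",
"toy_overall_gap = toy_admit_rate('male') - toy_admit_rate('female')\n",
"\n",
"#gap in admission rate within each department\n",
"toy_gap_by_dept = {d: toy_admit_rate('male', d) - toy_admit_rate('female', d)\n",
"                   for d in ('easy', 'hard')}\n",
"\n",
"#take a look\n",
"print('overall gap:', toy_overall_gap)\n",
"print('gap within each department:', toy_gap_by_dept)"
]
},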
{
"cell_type": "markdown",
"metadata": {
"id": "ZZmp46mF7reu"
},
"source": [
"### Let's see what feature importance suggests"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "I7FtuGAjd2ig"
},
"source": [
"Let's use linear regression from sklearn!\n",
"\n",
"Because sklearn expects them as separate objects, let's define the inputs (X) and the output (y) explicitly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "r9PhBzfXO26O"
},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.inspection import permutation_importance\n",
"\n",
"#split data into predictors (X) and target (y)\n",
"X = df_admit.drop(['admitted', 'A'],axis=1)\n",
"y = df_admit['admitted']\n",
"\n",
"#split these data into training and testing datasets\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BV5HsWaKeFai"
},
"source": [
"Then we can use the sklearn format to build and fit the model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XLzi3ypUeEuM"
},
"outputs": [],
"source": [
"#build the linear regression\n",
"LR1 = LinearRegression()\n",
"\n",
"#fit the linear regression\n",
"LR1.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "N8rkrQdDeNQq"
},
"source": [
"Finally, let's use permutation feature importance to see which variables are most useful in making predictions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "atApQK02eQHZ"
},
"outputs": [],
"source": [
"#model interpretation\n",
"rel_impo = permutation_importance(LR1, X_test, y_test,n_repeats=30,random_state=0)\n",
"\n",
"#build a dataframe to store the results\n",
"df_rel_impo = pd.DataFrame({\"feature\":X_test.columns,\"importance\":rel_impo.importances_mean, \"sd\":rel_impo.importances_std})\n",
"\n",
"#take a look\n",
"df_rel_impo"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5NBB_iPqg_qj"
},
"outputs": [],
"source": [
"sns.barplot(data=df_rel_impo, x='feature', y='importance')"
]
},
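{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what permutation importance measures, here is a hand-rolled sketch in plain Python. It is not sklearn's implementation: the rows are built from made-up counts, `toy_predict` is a fixed stand-in model that only looks at department, and the score is negative mean squared error rather than the R^2 that `permutation_importance` uses by default for a regressor. The idea is the same, though: shuffle one column, re-score, and record how much the score drops. A feature the model never uses gets an importance of zero, which says something about the model and nothing about what causes admission."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"#hypothetical rows (gender, dept, admitted) built from made-up counts, NOT the UCB data\n",
"toy_counts = {('male', 'easy'): (80, 64), ('male', 'hard'): (20, 4),\n",
"              ('female', 'easy'): (20, 16), ('female', 'hard'): (80, 16)}\n",
"toy_rows = []\n",
"for (g, d), (n, a) in toy_counts.items():\n",
"    toy_rows += [(g, d, 1)] * a + [(g, d, 0)] * (n - a)\n",
"\n",
"#a fixed stand-in model that only looks at the department\n",
"def toy_predict(gender, dept):\n",
"    return 0.8 if dept == 'easy' else 0.2\n",
"\n",
"#score = negative mean squared error (higher is better)\n",
"def toy_score(rows):\n",
"    return -sum((toy_predict(g, d) - y) ** 2 for g, d, y in rows) / len(rows)\n",
"\n",
"def toy_permutation_importance(rows, column, n_repeats=10, seed=0):\n",
"    #mean drop in score after shuffling one column across the rows\n",
"    rng = random.Random(seed)\n",
"    baseline = toy_score(rows)\n",
"    drops = []\n",
"    for _ in range(n_repeats):\n",
"        shuffled = [r[column] for r in rows]\n",
"        rng.shuffle(shuffled)\n",
"        permuted = [r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)]\n",
"        drops.append(baseline - toy_score(permuted))\n",
"    return sum(drops) / n_repeats\n",
"\n",
"toy_importance = {'gender': toy_permutation_importance(toy_rows, 0),\n",
"                  'dept': toy_permutation_importance(toy_rows, 1)}\n",
"\n",
"#take a look\n",
"print(toy_importance)"
]
},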
{
"cell_type": "markdown",
"metadata": {
"id": "md1RcfQv8iJp"
},
"source": [
"### Let's see what feature selection suggests"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aKi2ln7mefxZ"
},
"source": [
"Let's take some time to introduce how to do feature selection. This is very useful when you want to include only the most useful variables to help you make predictions.\n",
"\n",
"Caution: as we've seen above, we need to think about how the data was generated if we want to talk about the results in a causal way. Feature selection is great for creating models for prediction, but a selected feature is not necessarily a cause of the outcome."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qLvRgura9lSE"
},
"outputs": [],
"source": [
"from sklearn.feature_selection import RFECV\n",
"\n",
"#min number of variables/features\n",
"min_features_to_select = 1\n",
"\n",
"#build the feature selection algorithm\n",
"rfecv = RFECV(estimator=LR1, step=1, cv=3, min_features_to_select=min_features_to_select)\n",
"\n",
"#fit the algorithm to the data\n",
"rfecv.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LqmvLrJnhUDj"
},
"source": [
"How many features did the feature selection find useful?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RMcuP0Bohze2"
},
"outputs": [],
"source": [
"print(\"Optimal number of features : %d\" % rfecv.n_features_)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WIGB60Nkh0QY"
},
"source": [
"Let's plot out what the feature selection found!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FSObYPwA80AR"
},
"outputs": [],
"source": [
"# Plot number of features VS. cross-validation scores\n",
"plt.figure()\n",
"plt.xlabel(\"Number of features selected\")\n",
"plt.ylabel(\"Cross validation score (R^2)\")\n",
"#grid_scores_ was removed in scikit-learn 1.2; cv_results_ (scikit-learn >= 1.0) holds the mean score per feature count\n",
"mean_scores = rfecv.cv_results_[\"mean_test_score\"]\n",
"plt.plot(range(min_features_to_select, len(mean_scores) + min_features_to_select),\n",
"         mean_scores)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XfXCCK-q80AS"
},
"outputs": [],
"source": [
"rfecv.support_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3f-RxGiU80AS"
},
"outputs": [],
"source": [
"X_train_reduced = X_train.iloc[:,rfecv.support_]\n",
"\n",
"X_train_reduced.head(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "5E1zJmWw-Oox"
},
"outputs": [],
"source": [
"#get the slopes!\n",
"rfecv.estimator_.coef_"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rhsmIwF-OoJU"
},
"source": [
"### Bonus"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rTOtIGZmOqsF"
},
"source": [
"Redo the exercise above this time using a more black box approach, e.g., Random Forest!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BJqFrYyyYzMx"
},
"source": [
"### Further reading"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vx-NYAjcKSGK"
},
"source": [
"> If you would like the notebook without missing code, check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_IntroCausalAnalysis_admissions.ipynb) version."
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"name": "IntroCausalAnalysis_admissions.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}