30. Explainability vs Causality#

Here we will look at the difference between understanding how an ML model makes its predictions (explainability) and understanding what actually causes the outcome (causality).

To do so, we will look at a university admissions example. You have been hired and asked to decide whether there is a gender bias in admissions, and whether there is reason for legal action against the university.

30.1. Gender and university admissions#

import pandas as pd
import sklearn as sk
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

Load the data

#load data
df_admit = pd.read_csv("/content/UCB_admission.csv")

#take a look
df_admit.head()

Check for missing data, and the types of data we are dealing with.

?
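One possible way to do this (a minimal sketch; other approaches work just as well):

#check column types and non-null counts
df_admit.info()

#count missing values in each column
print(df_admit.isna().sum())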

Visualize the data

Let’s do some exploratory data analysis before building a model.

#plot admissions by reported gender
?
#plot admissions by department
?
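One possible sketch using seaborn count plots (this assumes the dataframe still has the original 'applicant.gender', 'dept', and 'admitted' columns, i.e. before the preprocessing below):

#admissions broken down by reported gender
sns.countplot(data=df_admit, x='applicant.gender', hue='admitted')
plt.show()

#admissions broken down by department
sns.countplot(data=df_admit, x='dept', hue='admitted')
plt.show()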

What patterns do you see?

Do these visualizations help you answer if there is systematic bias in acceptance rates by gender?

30.2. Preprocessing#

We have some categorical predictor variables so let’s do some preprocessing!

Let’s one-hot-encode ‘dept’

#convert the categorical variable into dummy variables
df_cat = pd.get_dummies(df_admit['dept'])

#concat the dummy variables back onto the dataframe
df_admit = pd.concat([df_admit, df_cat], axis = 1)

#drop the original categorical variable
df_admit = df_admit.drop(['dept'], axis=1)

#take a look
df_admit

Let’s encode the binary predictor variable gender as 0/1

from sklearn.preprocessing import LabelEncoder

#build the encoder
my_gen = LabelEncoder()

#fit and transform the gender column (LabelEncoder expects a 1-D array, so no reshape is needed)
df_admit['applicant.gender'] = my_gen.fit_transform(df_admit['applicant.gender'])

#take a look
df_admit
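It is worth checking which label the encoder mapped to 0 and which to 1, since the counterfactual questions later on assume 0 is Female and 1 is Male. A quick way to check:

#show how the encoder mapped the original labels to integers
print(dict(zip(my_gen.classes_, my_gen.transform(my_gen.classes_))))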

Finally, let’s do a training testing split on the data.

#split data into predictors (X) and target (y)
X = df_admit.drop('admitted', axis=1)
y = df_admit['admitted']

#split these data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=234)
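Optionally, confirm that the stratified split preserved the overall admission rate (a quick sanity check, not required for the rest of the exercise):

#proportion of each class in the training and testing targets
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))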

30.3. Build a model#

Can we predict admission based on reported gender?

Build a random forest for predicting admission using gender.

from sklearn.ensemble import RandomForestClassifier

#1. Build the model
forest_classifier = RandomForestClassifier(n_estimators=100, bootstrap=True, max_features=0.8, max_samples=0.8, max_depth=5,random_state=243)

#2. Fit the model to the data
forest_classifier.fit(X_train, y_train)
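Before interpreting the model it can be useful to check that it predicts anything at all; one quick way is to score it on the held-out test set:

#accuracy on the held-out test set
print(forest_classifier.score(X_test, y_test))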

Let’s see which variables the model learned were important for predicting whether someone will be admitted.

from sklearn.inspection import permutation_importance

#model interpretation
rel_impo = permutation_importance(forest_classifier, X_test, y_test,n_repeats=30,random_state=243)

#build a dataframe to store the results
df_rel_impo = pd.DataFrame({"feature":X_test.columns,"importance":rel_impo.importances_mean, "sd":rel_impo.importances_std})

#take a look
df_rel_impo.sort_values(by='importance', ascending=False,inplace=True)

df_rel_impo

Let’s also plot the permutation feature importance

sns.barplot(data=df_rel_impo, y='feature', x='importance')
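If you also want to show how variable the importances are across the permutation repeats, one option (a sketch using matplotlib directly) is to draw the stored standard deviations as error bars:

#horizontal bars with the permutation standard deviations as error bars
plt.barh(df_rel_impo['feature'], df_rel_impo['importance'], xerr=df_rel_impo['sd'])
plt.gca().invert_yaxis() #put the most important feature at the top
plt.xlabel('importance')
plt.show()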

Finally, let’s ask our model some counterfactual questions. These are “what-if” questions we can use to see whether the model would predict a different admission outcome if an applicant had reported a different gender.

# 1. Create a dataframe for the counterfactual you want to test
df_question = pd.DataFrame({
    'applicant.gender': [0, 1],
    'school_score': [50, 50],
    'A': [0, 0],
    'B': [0, 0],
    'C': [1, 1],
    'D': [0, 0],
    'E': [0, 0],
    'F': [0, 0],
})

# 2. Use the model to get probabilities
question_pred_proba = forest_classifier.predict_proba(df_question)

# 3. Wrap into a DataFrame for readability
proba_df = pd.DataFrame(
    question_pred_proba,
    columns=forest_classifier.classes_,
    index=['Female', 'Male']
)

print(proba_df)
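One way to summarise the counterfactual is to look directly at the gap in predicted probabilities between the two hypothetical applicants (a small sketch using the dataframe just built):

#difference in predicted probabilities: Male minus Female
print(proba_df.loc['Male'] - proba_df.loc['Female'])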

What is the model telling us about the impact that gender has on whether someone will be admitted to UC Berkeley?

30.4. Fit the model again, this time with a reduced set of predictors#

To highlight how a model’s interpretation changes based on which variables we include, let’s fit the model again… this time without accounting for department.

Let’s remove department.

#fit a smaller model - remove departments
X_train_small = X_train.drop(["A","B","C","D","E","F"], axis=1)

X_test_small = X_test.drop(["A","B","C","D","E","F"], axis=1)

Let’s fit a new random forest model.

#1. Build the model
forest_classifier_small = RandomForestClassifier(n_estimators=100, bootstrap=True, max_features=0.8, max_samples=0.8, max_depth=5,random_state=243)

#2. Fit the model to the data
forest_classifier_small.fit(X_train_small, y_train)

Let’s calculate permutation feature importance.

#model interpretation
rel_impo_small = permutation_importance(forest_classifier_small, X_test_small, y_test,n_repeats=30,random_state=243)

#build a dataframe to store the results
df_rel_impo_small = pd.DataFrame({"feature":X_test_small.columns,"importance":rel_impo_small.importances_mean, "sd":rel_impo_small.importances_std})

#take a look
df_rel_impo_small.sort_values(by='importance', ascending=False,inplace=True)

df_rel_impo_small
sns.barplot(data=df_rel_impo_small, y='feature', x='importance')

Then let’s ask the model a counterfactual question about how gender might impact admissions.

# 1. dataframe for the scenarios you want to test
df_question = pd.DataFrame({
    'applicant.gender': [0, 1],
    'school_score': [0, 0]
})

# 2. get probabilities
question_pred_proba = forest_classifier_small.predict_proba(df_question)

# 3. take a look
proba_df = pd.DataFrame(
    question_pred_proba,
    columns=forest_classifier_small.classes_,
    index=['Female', 'Male']
)

print(proba_df)
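As before, you can look at the gap between the two hypothetical applicants directly:

#difference in predicted probabilities under the reduced model: Male minus Female
print(proba_df.loc['Male'] - proba_df.loc['Female'])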

You should see that this model now predicts quite a large difference between genders. Which model is right?

The answer is both!

It’s just that each model is telling us something different. Because they contain different predictor variables, the models are answering different questions.

30.5. Statistical confounds#

Statistical confounds make it hard to determine the causal nature of the patterns we find in ML model results. This is the case with traditional statistical models as well! We need to be careful to distinguish between explaining how a model makes its predictions and claiming that those patterns are causal.

In the case of admissions and gender, applicants of different genders do not apply to the departments in equal measure.

That is, the causal relationships that generated this data might look something like:

(Causal diagram: Gender → Department → Admission, plus a direct arrow Gender → Admission.)

The model with all the predictors included estimates the bottom arrow going directly from Gender to Admission. This is called the “direct effect” of Gender.

The model without Department estimates both paths together: the path from Gender to Department and then to Admission, plus the direct arrow from Gender to Admission. This is called the “total effect” of Gender.


Try going back and changing or removing the fixed random state. Do you get the same answer each time?

Also, school score is just a random number! I added it as a way for you to double-check whether your model is overfitting. See if you can build random forest models that correctly give zero weight to school score, yet still pull out the impacts of gender and department.
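One direction you could try (a sketch, not the only answer; the exact hyperparameter values here are assumptions and depend on how much data you have) is to constrain the trees more strongly, which leaves a pure-noise feature like school score less room to be used for splits:

#a more constrained forest that should be less able to exploit the random school_score column
forest_constrained = RandomForestClassifier(n_estimators=200, max_depth=3, min_samples_leaf=50, max_features=0.8, max_samples=0.8, bootstrap=True, random_state=243)
forest_constrained.fit(X_train, y_train)

#recompute permutation importance and check the weight given to school_score
rel_impo_constrained = permutation_importance(forest_constrained, X_test, y_test, n_repeats=30, random_state=243)
df_check = pd.DataFrame({"feature": X_test.columns, "importance": rel_impo_constrained.importances_mean})
print(df_check.sort_values("importance", ascending=False))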

30.6. Feature selection by performance#

What happens if we use recursive feature elimination to automatically choose features for us?

from sklearn.feature_selection import RFECV

#min number of variables/features
min_features_to_select = 1

#build the feature selection algorithm
rfecv = RFECV(estimator=forest_classifier, step=1, cv=5, min_features_to_select=min_features_to_select)

#fit the algorithm to the data
rfecv.fit(X_train, y_train)

print("Optimal number of features : %d" % rfecv.n_features_)

Let’s take a look at a plot


# Number of features tested
n_features = range(
    rfecv.min_features_to_select,
    len(rfecv.cv_results_["mean_test_score"]) + rfecv.min_features_to_select
)

# Plot the scores
plt.figure(figsize=(8,6))
plt.plot(n_features, rfecv.cv_results_["mean_test_score"], marker="o")
plt.xlabel("Number of features selected")
plt.ylabel("CV score (mean across folds)")
plt.title("Recursive Feature Elimination with Cross-Validation (RFECV)")

#which features were selected
selected_features = X_train.columns[rfecv.support_]
print("Selected features:", list(selected_features))
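If you want to see how a model restricted to those features behaves, you can refit on just the selected columns (a sketch reusing the same hyperparameters as the earlier forest):

#refit a forest on only the RFECV-selected features and score it on the test set
forest_selected = RandomForestClassifier(n_estimators=100, bootstrap=True, max_features=0.8, max_samples=0.8, max_depth=5, random_state=243)
forest_selected.fit(X_train[selected_features], y_train)
print(forest_selected.score(X_test[selected_features], y_test))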

Which features did the feature selector suggest are the “best” ones to include?

  • Does this model help you answer the question of whether one gender is more likely to be admitted to UC Berkeley?

30.6.1. Bonus#


Redo the exercise above, this time using decision trees or linear regression.

from sklearn.linear_model import LinearRegression

# build 
LR1 = LinearRegression()

# fit
LR1.fit(X_train, y_train)
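A matching sketch for the decision-tree option (the hyperparameters here are assumptions, not the only sensible choice):

from sklearn.tree import DecisionTreeClassifier

#build and fit a single decision tree as an alternative classifier
tree_classifier = DecisionTreeClassifier(max_depth=5, random_state=243)
tree_classifier.fit(X_train, y_train)

#permutation importance for the tree, for comparison with the forest
rel_impo_tree = permutation_importance(tree_classifier, X_test, y_test, n_repeats=30, random_state=243)
df_tree = pd.DataFrame({"feature": X_test.columns, "importance": rel_impo_tree.importances_mean})
print(df_tree.sort_values("importance", ascending=False))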

30.7. Further reading#

If you would like to know more about causal inference you might like:

The Book of Why (link)

Causal Inference and Discovery in Python (link)

If you would like the notebook without missing code, check out the full code version.