

23. Random forests

Let’s take a look at random forests as a way to avoid over- or underfitting our decision tree models. Here we will use this algorithm to predict who will have diabetes.

Load in the needed libraries

import sklearn as sk
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Load the data

Get the ‘diabetes.csv’ from the class’s shared data folder and load it into a dataframe.

#load the diabetes data into a dataframe
df_diab = pd.read_csv('')

#take a look
?

Q: what kinds of data are we dealing with?

?

Q: are there any missing values?

?

23.1. Descriptive statistics

Let’s take a little time to look at some summary statistics.

For example, how many of each outcome value are there?

#count how many of each value in a column using value_counts
df_diab.?.value_counts()

What would be our accuracy if we always predicted the most common value?

?
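
As a point of reference, here is a minimal sketch of the "always predict the most common value" baseline on a small made-up label series. The `outcome` values below are invented for illustration and are not taken from the diabetes data.

```python
import pandas as pd

# Toy labels (hypothetical): 0 = no diabetes, 1 = diabetes
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0])

# Share of each class; the largest share is the accuracy we would get
# by always predicting the most common value
class_shares = outcome.value_counts(normalize=True)
majority_class = class_shares.idxmax()
null_accuracy = class_shares.max()

print(f"Majority class: {majority_class}, null accuracy: {null_accuracy:.2f}")
```

Any model we build later should beat this number to be worth using.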

Q: Choose one feature (column) and get the mean, min, and max.

?

23.2. Visualizing the data

Let’s plot the relationships between outcome and some of the health measures.

Q: Choose one or more health measures and generate a plot that shows the relationship between that measure and the outcome.

sns.scatterplot(data=df_diab, x='?', y='?')
sns.violinplot(data=df_diab, y='?', x='?')

23.3. Data wrangling

Training testing split


?
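
If you need a reminder of the pattern, here is a small sketch of a stratified train/test split on a toy dataframe. The dataframe, its column names, and the variable names (`X_tr`, `X_te`, ...) are assumptions made for this example; adapt them to your own data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe (hypothetical values); 'Outcome' plays the role of the target
df_toy = pd.DataFrame({
    "Glucose": [85, 120, 140, 99, 160, 110, 130, 90, 150, 105],
    "BMI": [22.0, 30.5, 33.1, 25.4, 35.0, 28.2, 31.7, 23.9, 34.2, 27.0],
    "Outcome": [0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Separate the features from the target
X_toy = df_toy.drop(columns="Outcome")
y_toy = df_toy["Outcome"]

# Hold out 30% for testing; stratify keeps the class balance similar in
# both parts, and random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

print(X_tr.shape, X_te.shape)
```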

23.4. Model building

Here we will build our first random forest model!

from sklearn.ensemble import RandomForestClassifier

#1. Build the model
forest_classifier = RandomForestClassifier(n_estimators=1000, bootstrap=True, max_features=0.8,max_samples=0.8)

#2. Fit the model to the data
forest_classifier.fit(X_train, y_train)

Let’s also build a decision tree model for comparison.

from sklearn.tree import DecisionTreeClassifier

#1. Build the model
tree_classifier = ?()

#2. Fit the model to the data
?.fit(?, ?)

Make some predictions

#predictions from the forest model
y_forest_pred = forest_classifier.predict(X_test)
#predictions from the tree model
y_tree_pred = tree_classifier.predict(X_test)

Measure classification success

from sklearn.metrics import confusion_matrix
?
#more visual approach
?

More detailed metrics?

print('Accuracy (forest): {:.2f}'.format(sk.metrics.accuracy_score(y_test, y_forest_pred)))
print('Accuracy (tree): {:.2f}'.format(sk.metrics.accuracy_score(y_test, y_tree_pred)))
#null accuracy: share of class 0 in the training data (assumes 0 is the most common outcome)
print('Null Accuracy: {:.2f}'.format(1-(y_train.sum()/(y_train.count()))))
print('Precision (tree): {:.2f}'.format(sk.metrics.precision_score(y_test, y_tree_pred)))
print('Recall (tree): {:.2f}'.format(sk.metrics.recall_score(y_test, y_tree_pred)))
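
To see where accuracy, precision, and recall come from, the sketch below computes them by hand from the four cells of a confusion matrix. The `y_true` and `y_pred` lists are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy true labels and predictions (hypothetical)
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]

# Rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)
cm_table = pd.DataFrame(
    cm,
    index=["actual 0", "actual 1"],
    columns=["predicted 0", "predicted 1"],
)
print(cm_table)

# For a binary problem, ravel() gives the four cells in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # of the predicted 1s, how many are real 1s
recall = tp / (tp + fn)                     # of the real 1s, how many did we find

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```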

23.5. Hyperparameter tuning

Above we chose by hand how much randomness to include while building the trees for our random forest (max_features and max_samples). Let’s look at how we can use k-fold cross validation to help us choose these values!

One hyperparameter:

from sklearn.model_selection import cross_val_score

number_of_trees = [50, 100, 150, 200, 250, 300, 350]

for val in number_of_trees:
  scores = cross_val_score(RandomForestClassifier(n_estimators=val), X_train, y_train, cv=5, scoring='accuracy')
  print(scores.mean())
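
To make it clearer what cross_val_score is doing, here is a sketch that writes out the five folds by hand and compares the result with the one-line version. It uses synthetic data from make_classification rather than the diabetes data, and a fixed random_state so that both approaches build identical forests; all variable names here are made up for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the diabetes data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=25, random_state=0)

# Split the data into 5 folds; each fold takes one turn as the validation set
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

# The same procedure in one call
auto_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print(np.round(manual_scores, 3))
print(np.round(auto_scores, 3))
```

Note that the model is never scored on data it was fit on: every observation is used for validation exactly once.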

Many hyperparameters:

from sklearn.model_selection import GridSearchCV

#define what parameters and what values to vary
parameters = {'max_features': [0.5,0.7,0.9,1.0],
              'n_estimators':list(range(50,200,50)),
              'max_samples':[0.5,0.7,0.9,0.99] }

#build the grid search algorithm
grid_search = GridSearchCV(RandomForestClassifier(), parameters, cv=5, scoring='accuracy') #stratified cross validation is used when the target is binary or multiclass

#Use training data to perform the nfold cross validation
grid_search.fit(X_train, y_train)

#find the best hyperparameters
print(grid_search.best_params_)
grid_search.best_score_

We can see that this quickly becomes time consuming to run: the grid above has 4 × 3 × 4 = 48 combinations, and each is fit 5 times, for 240 model fits. There are many algorithms out there to help tune your models. They generally fall into exhaustive grid searches and random searches, but this is an active field that is growing all the time.
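
As one example of a random search, here is a sketch using scikit-learn's RandomizedSearchCV, which tries a fixed number of randomly sampled combinations instead of every point on a grid. The data is synthetic, and the ranges, n_iter, and cv values are arbitrary choices made for this illustration, not recommendations.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the diabetes data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Distributions to sample from, instead of fixed lists of values
param_distributions = {
    "n_estimators": randint(20, 60),     # integers from 20 to 59
    "max_features": uniform(0.5, 0.5),   # floats between 0.5 and 1.0
    "max_samples": uniform(0.5, 0.49),   # floats between 0.5 and 0.99
}

# n_iter sets how many random combinations are tried
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(round(random_search.best_score_, 3))
```

The run time is controlled by n_iter rather than by the size of the grid, which makes it easier to add more hyperparameters without the search blowing up.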

Now that we’ve tuned our random forest model let’s see if we can beat the first model that we fit!

#1. build an optimized model
forest_classifier_opt = RandomForestClassifier(n_estimators=?,max_samples=?, max_features=?)

#2. Fit the model to the data
?.fit(?, ?)

#3. make predictions
y_forest_pred_opt = ?.predict(?)

#Measure accuracy
print('Accuracy: {:.2f}'.format(sk.metrics.accuracy_score(y_test, y_forest_pred_opt)))
print('Precision : {:.2f}'.format(sk.metrics.precision_score(y_test, y_forest_pred_opt)))
print('Recall : {:.2f}'.format(sk.metrics.recall_score(y_test, y_forest_pred_opt)))
#calculate a confusion matrix
?

#Plot the confusion matrix
?

Did your optimized model beat the first one?

23.6. Bonus

#Try adding other hyperparameters. How much better can you make the model by tuning?
#e.g., min_samples_split (how many points in a node are required to allow a split)
#e.g., max_depth (max depth of each tree)

23.7. Model interpretation

Random forests are collections of many decision trees. This makes it harder to interpret how the predictions are being made, as there can be thousands of individual trees.
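
To make the "collection of trees" idea concrete, the sketch below fits a small forest on synthetic data, asks every individual tree for its predicted probabilities, and checks that their average matches what the forest itself reports. The data and variable names are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the diabetes data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# The fitted trees live in the estimators_ attribute
print(len(forest.estimators_))

# Ask each tree for class probabilities on the first 5 rows,
# then average across the trees
tree_probas = np.stack([tree.predict_proba(X_demo[:5]) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

# The forest's own probabilities are this average
forest_proba = forest.predict_proba(X_demo[:5])
print(np.allclose(mean_proba, forest_proba))
```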

Let’s look at how to use feature importance to evaluate what is being used by the model to make predictions.

from sklearn.inspection import permutation_importance

#use permutation importance
perm_result = permutation_importance(forest_classifier_opt, X=X_test, y=y_test, scoring='accuracy', n_repeats=30)

#place values into a dataframe
forest_importances = pd.DataFrame({'variable':X_test.columns,'impo':perm_result.importances_mean.round(4), "sd":perm_result.importances_std.round(4)})

#sort the dataframe
forest_importances = forest_importances.sort_values(by='impo', ascending=False)
#plot the importance
sns.barplot(data=forest_importances, x='variable',y='impo')
plt.xticks(rotation=70)
#plot the importance (switch axis to display labels better?)
sns.barplot(data=forest_importances, y='variable',x='impo')
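
To show the idea behind permutation importance, here is a hand-rolled sketch: shuffle one column at a time in the test data and record how much the accuracy drops. This is a simplified illustration on synthetic data, not scikit-learn's implementation, and the names used here are invented for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the first 2 columns carry the signal
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Accuracy on the untouched test data
baseline = forest.score(X_te, y_te)

rng = np.random.default_rng(0)
importances = []
for col in range(X_te.shape[1]):
    drops = []
    for _ in range(10):  # repeat the shuffle to average out luck
        X_shuffled = X_te.copy()
        # Break the link between this feature and the outcome
        X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
        drops.append(baseline - forest.score(X_shuffled, y_te))
    importances.append(np.mean(drops))

print(np.round(importances, 3))
```

A feature the model relies on should show a large drop in accuracy when shuffled, while a feature it ignores should stay close to zero.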

Asking your model questions?

Sometimes it can be very helpful to create a dataset that represents a question you have, and then use your model to make predictions to answer that question. For instance, what if someone had mean values for all measures?

#1. Create a dataframe
df_question = pd.DataFrame({'Pregnancies':X_train.Pregnancies.mean(),
                            'Glucose':X_train.Glucose.mean(),
                            'BloodPressure':X_train.BloodPressure.mean(),
                            'SkinThickness':X_train.SkinThickness.mean(),
                            'Insulin':X_train.Insulin.mean(),
                            'BMI':X_train.BMI.mean(),
                            'DiabetesPedigreeFunction':X_train.DiabetesPedigreeFunction.mean(),
                            'Age':X_train.Age.mean()},
                             index=[0])
                            

#2. Use the model to make predictions
question_pred =  forest_classifier_opt.predict(df_question)

#3. Take a look at the answer
question_pred

Now we can make our question a little more interesting by allowing one variable to vary. Let’s see how the predictions change as we vary the glucose level of the average person.

#1. Create a dataframe
df_question = pd.DataFrame({'Pregnancies':X_train.Pregnancies.mean(),
                            'Glucose':list(range(0,200,10)),
                            'BloodPressure':X_train.BloodPressure.mean(),
                            'SkinThickness':X_train.SkinThickness.mean(),
                            'Insulin':X_train.Insulin.mean(),
                            'BMI':X_train.BMI.mean(),
                            'DiabetesPedigreeFunction':X_train.DiabetesPedigreeFunction.mean(),
                            'Age':X_train.Age.mean()})
                            

#2. Use the model to make predictions
question_pred =  forest_classifier_opt.predict(df_question)

#3. Take a look at the answer
question_pred

Let’s plot the answer

#add a column to the df_question
df_question['predicted_diabetes'] = question_pred

#plot the predictions
sns.scatterplot(data=df_question, x='Glucose',y='predicted_diabetes')
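
The predictions above can only be 0 or 1. If you want a smoother picture, one option is to ask the forest for probabilities with predict_proba instead. The sketch below does this on a synthetic two-feature dataset; the data, the rule used to generate the outcome, and all names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: the outcome depends on glucose plus some noise
toy = pd.DataFrame({
    "Glucose": rng.uniform(60, 200, 300),
    "BMI": rng.uniform(18, 45, 300),
})
toy_outcome = (toy["Glucose"] + rng.normal(0, 20, 300) > 130).astype(int)

toy_forest = RandomForestClassifier(n_estimators=100, random_state=0)
toy_forest.fit(toy, toy_outcome)

# Vary glucose while holding BMI at its mean
sweep = pd.DataFrame({
    "Glucose": list(range(60, 200, 10)),
    "BMI": toy["BMI"].mean(),
})

# Column 1 of predict_proba is the estimated probability of class 1
sweep["prob_diabetes"] = toy_forest.predict_proba(sweep[["Glucose", "BMI"]])[:, 1]

print(sweep[["Glucose", "prob_diabetes"]])
```

Plotting prob_diabetes against Glucose shows how confident the forest is at each glucose level, rather than only which side of the decision it falls on.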

23.8. Model Application

Let’s apply what we learnt about random forests to other datasets.

#load dataset!

23.9. Further reading

If you would like the notebook without missing code, check out the full code version.