

22. Decision trees

Let’s take a look at decision trees as classifiers. Here we will use this algorithm to classify which grape plant was used to create a wine.

Load in the needed libraries

import sklearn as sk
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

22.1. Load the data

Get the ‘wine_labs.csv’ from the class’s shared data folder and load it into a dataframe.

#load the wine data into a dataframe
df_wine = pd.read_csv('/content/wine_labs.csv')

#take a look
df_wine.head(3)

Q: what kinds of data are we dealing with?

?

Q: are there any missing values?

?
#let's drop rows with missing data
?
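If you get stuck, here is one possible approach (a sketch; it only assumes the df_wine dataframe loaded above):

#check what kind of data is in each column
df_wine.dtypes

#count the missing values in each column
df_wine.isna().sum()

#drop the rows that contain missing data
df_wine = df_wine.dropna()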

22.2. Descriptive statistics

Let’s take a little time to look at some summary statistics.

E.g., how many of each plant type are there?

#count how many of each value appear in a column using value_counts
df_wine.plant.value_counts()

Choose one feature (column) and get the mean, min, and max.

?
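For example, something like this would work (a sketch; ‘alcohol’ is just a guess at one of the wine measure columns, so swap in any numeric column from df_wine):

#get the mean, min, and max of one feature
df_wine['alcohol'].agg(['mean', 'min', 'max'])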

22.3. Visualizing the data

Let’s plot the relationships between plant type and some of the wine measures.

Q: Choose one or more wine measures and generate a plot that shows the relationship between that measure and plant type.

?
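One possibility is a boxplot of a measure grouped by plant type (a sketch; again assuming an ‘alcohol’ column, use any measure you like):

#show the distribution of a wine measure for each plant type
sns.boxplot(data=df_wine, x='plant', y='alcohol')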

22.4. Data wrangling

Preprocessing (categorical input variables)

Convert the categorical ‘lab’ variable using one-hot encoding (i.e., create dummy columns).

?
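Here is a sketch of one way to do this with pandas:

#one-hot encode the 'lab' column (one dummy column per lab)
df_wine = pd.get_dummies(df_wine, columns=['lab'])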

Preprocessing (categorical target variable)

As the target variable is categorical, we will convert each category into a number; unlike one-hot encoding, we will keep these numbers within a single column.

?
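One possible approach uses pandas category codes (a sketch; sklearn’s LabelEncoder would work just as well):

#convert each plant category into an integer code, kept in one column
df_wine['plant'] = df_wine['plant'].astype('category').cat.codes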

Training testing split

Here we’ll change how we do the data splitting a little. Instead of splitting the dataframe into training and testing sets directly, we’ll first split it into input and target variables, i.e., the X we’d like to use to predict y. Then we’ll split each of these into training and testing sets. This makes it easier to work with sklearn algorithms.


#split data into predictors (X) and target (y)
X = df_wine.drop('plant', axis=1)
y = df_wine['plant']

#split these data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

Preprocessing (numeric variables)

?
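Decision trees are largely insensitive to feature scaling, but if you want the practice, here is a sketch. Note that the scaler is fit on the training data only, so no information leaks from the test set:

from sklearn.preprocessing import StandardScaler

#fit the scaler on the training inputs, then transform both sets
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train),
                       columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test),
                      columns=X_test.columns, index=X_test.index)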

22.5. Model building

Here we will build our first decision tree!

from sklearn.tree import DecisionTreeClassifier

#1. build the algorithm
classifier = DecisionTreeClassifier()

#2. fit the algorithm to the data
classifier_res = classifier.fit(X_train, y_train)

Predictions

Make some predictions on the testing data

y_pred = classifier.predict(X_test)

Measure classification success

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
#a more visual approach (fmt='d' shows the counts as integers)
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted label')
plt.ylabel('True label')

More detailed metrics?

from sklearn import metrics

print('Accuracy: {:.2f}'.format(metrics.accuracy_score(y_test, y_pred)))
print('Precision: {:.2f}'.format(metrics.precision_score(y_test, y_pred, average='micro')))
print('Recall: {:.2f}'.format(metrics.recall_score(y_test, y_pred, average='micro')))

22.6. Hyperparameters

Decision tree algorithms have a number of hyperparameters that can be tuned to achieve better predictions. Let’s take a look at one and how we can tune it!

First we need some way to test how model performance varies as we change the parameter. We can’t use the testing dataset: if we did, the testing data would be helping to build the model, so it would no longer be an independent test.

So let’s split the training dataset again! This will create training and validation datasets!

X_hyper_train, X_hyper_val, y_hyper_train, y_hyper_val = train_test_split(X_train, y_train, test_size=0.20)

Let’s next focus on the max depth parameter, and see if we can find a value that maximizes the performance of the model on the validation dataset.

We’ll first build a function that takes as input max depth, and outputs the accuracy score.

We’ll then use a loop to try out many max depth scores.

Finally, we’ll plot the accuracy scores for each max depth value.

Let’s first build the function that takes max depth as input and outputs an accuracy score.

def fit_decision_tree(maxDep):

  #1. build the algorithm
  ? = ?(max_depth=maxDep)

  #2. Fit the algorithm
  ?= ?.fit(X_hyper_train, y_hyper_train)

  #3. Make predictions
  y_pred = ?(X_hyper_val)

#4. Measure the accuracy
  accuracy_measured = ?(y_hyper_val, y_pred)

  return accuracy_measured
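If you want to check your work, here is one way to fill in the blanks (a sketch; the variable names are just suggestions):

from sklearn.metrics import accuracy_score

def fit_decision_tree(maxDep):

  #1. build the algorithm with the given max depth
  classifier_hyper = DecisionTreeClassifier(max_depth=maxDep)

  #2. fit the algorithm to the hyperparameter training data
  classifier_hyper_res = classifier_hyper.fit(X_hyper_train, y_hyper_train)

  #3. make predictions on the validation data
  y_pred = classifier_hyper_res.predict(X_hyper_val)

  #4. measure the accuracy
  accuracy_measured = accuracy_score(y_hyper_val, y_pred)

  return accuracy_measured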

Try out your new function!

fit_decision_tree(maxDep=4)

Next let’s build a loop and see what values of max depth give the best results!

acc_scores = []
for i in range(1,?):
  acc_s = ?(i)
  acc_scores.append(?)
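A possible completion (the upper bound of 10 matches the range used in the plotting cell below):

acc_scores = []
for i in range(1, 10):
  acc_s = fit_decision_tree(i)
  acc_scores.append(acc_s)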

Then let’s plot it!

#create a dataframe
df_plot_maxDep = pd.DataFrame({'accuracy':acc_scores, 'maxDep':range(1,10)})

#make a plot
sns.relplot(?, kind='line')
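One way to fill in the plotting call (a sketch):

#line plot of accuracy against max depth
sns.relplot(data=df_plot_maxDep, x='maxDep', y='accuracy', kind='line')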

22.7. Bonus

Try this exercise again, but this time use min_samples_split. This is the parameter that defines when splits are no longer considered (e.g., if a leaf has 10 points in it and min_samples_split is 11, then the algorithm will not try to split that leaf).
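For instance, the validation function from above could be adapted like this (a sketch; the function name is just a suggestion):

from sklearn.metrics import accuracy_score

#build a tree for a given min_samples_split and return the validation accuracy
def fit_decision_tree_mss(minSplit):
  classifier_mss = DecisionTreeClassifier(min_samples_split=minSplit)
  classifier_mss.fit(X_hyper_train, y_hyper_train)
  y_pred = classifier_mss.predict(X_hyper_val)
  return accuracy_score(y_hyper_val, y_pred)

Note that min_samples_split must be at least 2, so start your loop at 2 rather than 1.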

22.8. Model interpretation

Decision trees can offer some nice visuals that help interpret and communicate your results. Let’s take a look at a few.

#the calls below use the pre-2.0 dtreeviz API, so pin the version
!pip install "dtreeviz<2"
from dtreeviz.trees import dtreeviz
#build the figure
viz = dtreeviz(classifier_res, X_train, y_train,
                target_name="plant",
                feature_names=X_train.columns.to_list(),
                class_names={0:'plantA',1:'plantB',2:'plantC'},
                scale=1.0)

#take a look
viz

Let’s look at how the tree does with the test data.

#build the figure
viz_test = dtreeviz(classifier_res, X_test, y_test,
                target_name="plant",
                feature_names=X_train.columns.to_list(),
                class_names={0:'plantA',1:'plantB',2:'plantC'},
                scale=1)

#take a look
viz_test
#visualize one prediction
#pick one test point (row 12 here, but you can choose any other row!)
X_values_for_pred = X_test.iloc[12]

X_values_for_pred
#build the figure
viz_one_pred = dtreeviz(classifier_res, X_train, y_train,
                target_name="plant",
                feature_names=X_train.columns.to_list(),
                class_names={0:'plantA',1:'plantB',2:'plantC'},
                scale=0.75,
                X=X_values_for_pred)

#take a look
viz_one_pred
#compare with the true label for this row
y_test.iloc[12]

22.9. Bonus 2

Now that you know how to use decision tree models, try going back to one of the datasets we’ve already worked with and see if you can get better predictions. Can you still explain which features are helping you make those predictions?

#load a dataset

22.10. Further reading

If you would like the notebook without the missing code, check out the full code version.