Open In Colab



25. Decision trees#

Let’s take a look at decision trees as classifiers. Here we will use this algorithm to classify which grape plant was used to create a wine.

Load in the needed libraries

import sklearn as sk
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

25.1. Load the data#

Get the ‘wine_labs.csv’ from the class’s shared data folder and load it into a dataframe.

#get wine to a dataframe
df_wine = pd.read_csv('/content/wine_labs.csv')

#take a look
df_wine.head(3)


Q: what kinds of data are we dealing with?

?

Q: are there any missing values?

?
#let's drop rows with missing data
?
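
If you get stuck, here is one possible solution sketch using standard pandas methods (try your own approach first):

#check what kind of data is in each column
df_wine.dtypes

#count missing values in each column
df_wine.isna().sum()

#drop any rows with missing data
df_wine = df_wine.dropna()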

25.2. Descriptive statistics#

Let’s take a little time to look at some summary statistics.

E.g., how many wines come from each plant type?

#count how many times each value appears in a column using value_counts
df_wine.plant.value_counts()


Choose one feature (column) and get the mean, min, and max.

?
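
One possible solution, assuming the dataframe contains a numeric column named 'alcohol' (swap in any numeric column from your own data):

#summary statistics for one (assumed) numeric column
print(df_wine['alcohol'].mean())
print(df_wine['alcohol'].min())
print(df_wine['alcohol'].max())

#or get all the summary statistics at once
df_wine['alcohol'].describe()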

25.3. Visualizing the data#

Let’s plot the relationships between plant type and some of the wine measures.

Q: Choose one or more wine measures and generate a plot that shows the relationship between that measure and plant type.

?
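
A possible sketch, again assuming a numeric 'alcohol' column; any wine measure will do:

#boxplot of one wine measure split by plant type
sns.boxplot(data=df_wine, x='plant', y='alcohol')
plt.show()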

25.4. Data wrangling#

Before building our models we have to do some preprocessing steps with our data. We’ll cover some common steps here, but we’ll see more along the way.

Let’s first consider categorical predictor variables.

Preprocessing (categorical input variables)

Convert the categorical ‘lab’ variable using one-hot encoding (i.e., create several dummy columns to replace the single categorical column).

Note: we could also convert the categories to numbers (ordinal encoding), but this assumes there is an order to the categories, which isn’t always a good assumption.

#categorical variables
cat_names = ['lab']

#create dummy variables
df_cat = pd.get_dummies(df_wine[cat_names])

#add them back to the original dataframe
df_wine = pd.concat([df_wine,df_cat], axis=1)

#remove the old columns
df_wine = df_wine.drop(cat_names, axis=1)

#take a look
df_wine

Next let’s look at how we might convert a categorical outcome variable.

Preprocessing (categorical outcome variable)

As the outcome variable is categorical, we will convert each category into a number, and unlike the one-hot encoding we will keep these numbers within a single column.

from sklearn.preprocessing import LabelEncoder

#create the encoder
le_plants = LabelEncoder()

#create outcome variable
df_wine['plant'] = le_plants.fit_transform(df_wine['plant'])

#take a look
df_wine

We can always get back which plant category corresponds to which numeric label.

le_plants.classes_
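
For example, inverse_transform maps numeric labels back to the original plant names (a quick sketch):

#map numeric labels back to the original plant names
le_plants.inverse_transform([0, 1, 2])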

Training testing split

We will follow a general approach when building models. We will divide the dataset into training and testing datasets. This lets us fit the model to one part of the data and then use the withheld data to test the predictions of the model. This helps us detect and avoid overfitting our model!

To do this we’ll first split the dataframe into input features (X) and a target variable (y), i.e., we’d like to use X to predict y; keeping them separate makes it easier to work with sklearn. Then we split each into training and testing sets.


#split data into predictors (X) and target (y)
X = df_wine.drop('plant', axis=1)
y = df_wine['plant']

#split these data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
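
Note: by default train_test_split produces a different random split each run. If you want reproducible results with balanced classes in each split, one possible variation (not required here) is:

#optional: a reproducible, stratified split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)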

Now that we’ve split our data, we can do some more processing. Here let’s scale our numeric variables. Importantly, we’ll build our scaling function from the training data and apply it to the testing data. That way we are only using information from the training data and not contaminating our model with information from the testing data. I.e., no data leakage between the training and testing datasets.

Preprocessing (numeric variables)

#Feature Scaling (after splitting the data!)
from sklearn.preprocessing import StandardScaler 

#numeric variables
numb_names = X_train.drop(['lab_lab1','lab_lab2','lab_lab3'],axis=1).select_dtypes('number').columns.tolist()

#create the standard scaler object
sc = StandardScaler()

#use this object to fit (i.e., to calculate the mean and sd of each variable in the training data) and then to transform the training data
X_train[numb_names] = sc.fit_transform(X_train[numb_names])

#use the fit from the training data to transform the test data
X_test[numb_names] = sc.transform(X_test[numb_names])

#take a look
X_train


25.5. Model building#

Here we will build our first decision tree!

from sklearn.tree import DecisionTreeClassifier

#1. build the algorithm
classifier = DecisionTreeClassifier()

#2. fit the algorithm to the data
classifier_res = classifier.fit(X_train, y_train)

Predictions

Make some predictions on the testing data

y_pred = classifier.predict(X_test)

Measure classification success

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
#more visual approach
sns.heatmap(cm, annot=True)
plt.xlabel('Predicted label')
plt.ylabel('True label')

More detailed metrics?

print(f'Accuracy: {sk.metrics.accuracy_score(y_test, y_pred):.2f}')
print(f"Precision: {sk.metrics.precision_score(y_test, y_pred, average='micro'):.2f}")
print(f"Recall: {sk.metrics.recall_score(y_test, y_pred, average='micro'):.2f}")
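
For a per-class breakdown, sklearn also provides classification_report; a quick sketch (assuming le_plants is the fitted label encoder from above):

#per-class precision, recall, and f1 scores
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=le_plants.classes_))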


25.6. Hyperparameters#

Decision tree algorithms have a number of hyperparameters that can be tuned to achieve better predictions. Let’s take a look at one and how we can tune it!

First we need some way to test how model performance varies as we change the parameter. We can’t use the testing dataset… if we did, it really wouldn’t be a good test. I.e., the testing data would be used to help build the model and so would no longer be independent.

So let’s split the training dataset again! This will create training and validation datasets!

X_hyper_train, X_hyper_val, y_hyper_train, y_hyper_val = train_test_split(X_train, y_train, test_size=0.20)

Let’s next focus on the max depth parameter, and see if we can find a value that maximizes the performance of the model on the validation dataset.

We’ll first build a function that takes max depth as input and outputs an accuracy score.

We’ll then use a loop to try out many max depth values.

Finally we’ll plot the accuracy score for each max depth value.

Let’s start with the function.

def fit_decision_tree(maxDep):

  #1. build the algorithm
  ? = ?(max_depth=maxDep)

  #2. Fit the algorithm
  ?= ?.fit(X_hyper_train, y_hyper_train)

  #3. Make predictions
  y_pred = ?(X_hyper_val)

  #4. Measure the accuracy
  accuracy_measured = ?(y_hyper_val, y_pred)

  return accuracy_measured
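
If you get stuck, here is one possible way to fill in the blanks (try it yourself first):

from sklearn.metrics import accuracy_score

def fit_decision_tree(maxDep):

  #1. build the algorithm with the given max depth
  classifier_hyper = DecisionTreeClassifier(max_depth=maxDep)

  #2. fit the algorithm to the hyperparameter training data
  classifier_hyper_res = classifier_hyper.fit(X_hyper_train, y_hyper_train)

  #3. make predictions on the validation data
  y_pred = classifier_hyper.predict(X_hyper_val)

  #4. measure the accuracy on the validation data
  accuracy_measured = accuracy_score(y_hyper_val, y_pred)

  return accuracy_measured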

Try out your new function!

fit_decision_tree(maxDep=4)

Next let’s build a loop and see what values of max depth give the best results!

acc_scores = []
for i in range(1,?):
  acc_s = ?(i)
  acc_scores.append(?)
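
One possible version of the loop, trying depths 1 through 19 (the upper limit is an arbitrary choice):

acc_scores = []
for i in range(1, 20):
  acc_s = fit_decision_tree(i)
  acc_scores.append(acc_s)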

Then let’s plot it!

#create a dataframe
df_plot_maxDep = pd.DataFrame({'accuracy':acc_scores, 'maxDep':range(1,?)})

#make a plot
sns.relplot(?, kind='line')
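
And one possible version of the plot; the range must match the loop above:

#create a dataframe of accuracy vs. max depth
df_plot_maxDep = pd.DataFrame({'accuracy':acc_scores, 'maxDep':range(1, 20)})

#make a line plot of accuracy against max depth
sns.relplot(data=df_plot_maxDep, x='maxDep', y='accuracy', kind='line')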

25.7. Bonus#

Try this exercise again, but this time use min_samples_split: the parameter that sets the minimum number of samples a node must contain before the algorithm will consider splitting it (e.g., if a node has 10 points and min_samples_split is 11, that node will not be split).
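
A possible starting point, reusing the same pattern as the max_depth function above (the names here are just illustrative):

from sklearn.metrics import accuracy_score

def fit_decision_tree_minSplit(minSplit):

  #build and fit a tree with the given min_samples_split
  classifier_ms = DecisionTreeClassifier(min_samples_split=minSplit)
  classifier_ms.fit(X_hyper_train, y_hyper_train)

  #measure accuracy on the validation data
  y_pred = classifier_ms.predict(X_hyper_val)
  return accuracy_score(y_hyper_val, y_pred)

#min_samples_split must be at least 2
acc_scores_ms = [fit_decision_tree_minSplit(i) for i in range(2, 30)]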


25.8. Model interpretation#

!pip install dtreeviz
import dtreeviz

Decision trees can offer some nice visuals that help interpret and communicate your results. Let’s take a look at a few.

First let’s see what splits the model found in the training data, and how many data points fell within each leaf.

#build the figure
viz = dtreeviz.model(classifier_res, X_train, y_train,
                target_name="plant",
                feature_names=X_train.columns.to_list(),
                class_names={0:'plantA',1:'plantB',2:'plantC'}
                )

#take a look
viz.view(fontname="DejaVu Sans")

Now let’s look at how the tree does with the test data.

In this case, we can see how the test samples fall into the fixed structure of the tree. Do you see any big differences in leaf sizes and mixtures?

#build the figure
viz_test = dtreeviz.model(classifier_res, X_test, y_test,
                target_name="plant",
                feature_names=X_train.columns.to_list(),
                class_names={0:'plantA',1:'plantB',2:'plantC'}
                )

#take a look
viz_test.view(fontname="DejaVu Sans")

Finally, let’s see how one data point in the test data gets placed within the tree.

#Visualize one prediction
import numpy as np

# pick an X test point
X_values_for_pred = X_test.iloc[12] #you can choose any other row!

#take a look at the chosen point
X_values_for_pred
viz_obj = viz_test.view(
    x=X_values_for_pred,
    fontname="DejaVu Sans"
)

viz_obj

25.9. Bonus 2#

Now that you know how to use decision tree models, try going back to one of the datasets we’ve already worked on and see if you can get better predictions. Can you still explain which features are helping you make those predictions?

Note: decision trees work well for both categorical outcome variables (as shown above) and continuous/integer outcome variables. The difference is in how the splits are made:

  • classification trees split based on class impurity (e.g., Gini impurity or entropy).

  • regression trees split based on variation (e.g., mean squared error).

#load a dataset
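
As a minimal sketch of the regression case (assuming X_train, X_test, y_train, and y_test come from whichever dataset you chose, with a continuous target):

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

#1. build the algorithm (splits are chosen to reduce squared error)
regressor = DecisionTreeRegressor(max_depth=4)

#2. fit the algorithm to the training data
regressor_res = regressor.fit(X_train, y_train)

#3. make predictions and measure error on the testing data
y_pred = regressor.predict(X_test)
print(f'MSE: {mean_squared_error(y_test, y_pred):.2f}')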

25.10. Further reading#

If you would like the notebook without missing code, check out the full code version.