{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "view-in-github"
},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9C2arduFECBl"
},
"source": [
"
\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xDo_JqBMQqJM"
},
"source": [
"## Random forests"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0HwGNjO9Qt6G"
},
"source": [
"Let's take a look at random forests as a way to avoid over/under fitting our model decision tree models. Here we will use this algorithm to predict who will have diabetes. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "d0x5is0FE_j9"
},
"source": [
"Load in the needed libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "d-rPIThpIE3o"
},
"outputs": [],
"source": [
"import sklearn as sk\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.model_selection import train_test_split\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1z7obQJoGmr5"
},
"source": [
"**Load the data**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cwYGOotUQfqo"
},
"source": [
"Get the 'diabetes.csv' from the class's shared data folder and load it into a dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NCpMYU5fQiM-"
},
"outputs": [],
"source": [
"#get wine to a dataframe\n",
"df_diab = pd.read_csv('')\n",
"\n",
"#take a look\n",
"?\n"
]
},
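{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you get stuck, below is a minimal sketch of one way this could look. The file path is only a placeholder, so point it at wherever your copy of `diabetes.csv` lives."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#minimal sketch: the path below is a placeholder for wherever diabetes.csv is saved\n",
"df_diab = pd.read_csv('/content/diabetes.csv')\n",
"\n",
"#take a look at the first few rows\n",
"df_diab.head()"
]
},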
{
"cell_type": "markdown",
"metadata": {
"id": "GTUwaLV9HdvV"
},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DN6DonAlHYuX"
},
"source": [
"Q: what kinds of data are we dealing with?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8-4mDkjCHbsc"
},
"outputs": [],
"source": [
"?"
]
},
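{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch): `info` reports each column's data type along with its non-null count."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#sketch: check the data type (and non-null count) of each column\n",
"df_diab.info()"
]
},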
{
"cell_type": "markdown",
"metadata": {
"id": "hlYmLGWeTpri"
},
"source": [
"Q: are there any missing values?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LN9UdxsATt8D"
},
"outputs": [],
"source": [
"?"
]
},
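{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch): count the missing values in each column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#sketch: number of missing values in each column\n",
"df_diab.isna().sum()"
]
},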
{
"cell_type": "markdown",
"metadata": {
"id": "nfcRG5tmRhOS"
},
"source": [
"### Descriptive statistics "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kptMv845FjpI"
},
"source": [
"Let's take a little time to look at some summary statistics.\n",
" \n",
"E.g., how many values of outcome types there are? "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "nG4Am986FjTM"
},
"outputs": [],
"source": [
"#count how many of each value in a column using value_conunts\n",
"df_diab.?.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_Tr4i1cWv_Ez"
},
"source": [
"What would be our accuracy if we always predicted the most common value?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "KqdWFqh7wDPY"
},
"outputs": [],
"source": [
"?"
]
},
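{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch, assuming the target column is named `Outcome`): the accuracy of always guessing the most common value is just the proportion of rows with that value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#sketch: proportion of rows in each outcome class (assumes the target column is named 'Outcome')\n",
"#the largest proportion is the accuracy of always guessing the most common value\n",
"df_diab['Outcome'].value_counts(normalize=True)"
]
},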
{
"cell_type": "markdown",
"metadata": {
"id": "dLe7t6uCGiSb"
},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2-_xBgiiFyAO"
},
"source": [
"Q: Choose one feature (column) and get the mean, min, and max."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "yIYqTxojGAqj"
},
"outputs": [],
"source": [
"?\n"
]
},
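{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch), using `Glucose` as the example column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#sketch: mean, min, and max of one example column (here Glucose)\n",
"df_diab['Glucose'].agg(['mean', 'min', 'max'])"
]
},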
{
"cell_type": "markdown",
"metadata": {
"id": "WsIoTD89RnSH"
},
"source": [
"### Visualizing the data "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Yi5f6tbMHLjl"
},
"source": [
"Let's plot the relationships between outcome and some of the health measures.\n",
" \n",
"Q: Choose one or more wine measures and generate a plot that shows the relationship between that measure and plant type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "PP1hbowmH4zH"
},
"outputs": [],
"source": [
"sns.scatterplot(data=df_diab, x='?', y='?')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1TRRJdscthpG"
},
"outputs": [],
"source": [
"sns.violinplot(data=df_diab, y='?', x='?')"
]
},
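{
"cell_type": "markdown",
"metadata": {},
"source": [
"As one filled-in example (a sketch, assuming the target column is named `Outcome`): a violin plot of glucose for each outcome value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#sketch: distribution of Glucose for each outcome value (assumes the target column is named 'Outcome')\n",
"sns.violinplot(data=df_diab, x='Outcome', y='Glucose')"
]
},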
{
"cell_type": "markdown",
"metadata": {
"id": "ozz-07c9RwXe"
},
"source": [
"### Data wrangling \n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1wngk9w1YMLY"
},
"source": [
"Training testing split"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JcziOvoDYO6I"
},
"outputs": [],
"source": [
"\n",
"?\n"
]
},
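{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you get stuck, here is a minimal sketch (assuming the target column is named `Outcome`): separate the features (X) from the target (y), then hold out a portion of the rows for testing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#sketch: separate the features from the target (assumes the target column is named 'Outcome')\n",
"X = df_diab.drop(columns=['Outcome'])\n",
"y = df_diab['Outcome']\n",
"\n",
"#hold out 20% of the rows for testing\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)"
]
},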
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model building"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cekIelOqSG2B"
},
"source": [
"Here we will build our first random forest model!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "xWRmQuhdSH5h"
},
"outputs": [],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"#1. Build the model\n",
"forest_classifier = RandomForestClassifier(n_estimators=1000, bootstrap=True, max_features=0.8,max_samples=0.8)\n",
"\n",
"#2. Fit the model to the data\n",
"forest_classifier.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3qqm9rJiu-6u"
},
"source": [
"Let's also build a decision tree model for comparison."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Q-Tv-GsUvC1c"
},
"outputs": [],
"source": [
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"#1. Build the model\n",
"tree_classifier = ?()\n",
"\n",
"#2. Fit the model to the data\n",
"?.fit(?, ?)"
]
},
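{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how the decision tree cell could be filled in, using the default settings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"#sketch: build a decision tree with the default settings\n",
"tree_classifier = DecisionTreeClassifier()\n",
"\n",
"#fit it to the training data\n",
"tree_classifier.fit(X_train, y_train)"
]
},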
{
"cell_type": "markdown",
"metadata": {
"id": "x1Mwjlurbeux"
},
"source": [
"Make some predictions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JIKo-fbPbBG7"
},
"outputs": [],
"source": [
"#predictions from the forest model\n",
"y_forest_pred = forest_classifier.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_H6kguMZvSfF"
},
"outputs": [],
"source": [
"#predictions from the tree model\n",
"y_tree_pred = tree_classifier.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DEr16lBebd6X"
},
"source": [
"Measure classification success"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ew9BnKJDblBY"
},
"outputs": [],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "rVJQdODwcclq"
},
"outputs": [],
"source": [
"#more visual approach\n",
"?"
]
},
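{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch): compute the confusion matrix for the forest predictions, then plot the same matrix with `ConfusionMatrixDisplay`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n",
"\n",
"#sketch: confusion matrix for the forest predictions\n",
"print(confusion_matrix(y_test, y_forest_pred))\n",
"\n",
"#the same matrix as a plot\n",
"ConfusionMatrixDisplay.from_predictions(y_test, y_forest_pred)\n",
"plt.show()"
]
},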
{
"cell_type": "markdown",
"metadata": {
"id": "xXdo-AEHcnQf"
},
"source": [
"More detailed metrics?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bXbkeYN3cqgN"
},
"outputs": [],
"source": [
"print('Accuracy (forest): {:.2f}'.format(sk.metrics.accuracy_score(y_test, y_forest_pred)))\n",
"print('Accuracy (tree): {:.2f}'.format(sk.metrics.accuracy_score(y_test, y_tree_pred)))\n",
"print('Null Accuracy: {:.2f}'.format(1-(y_train.sum()/(y_train.count()))))\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Jca32Rq8wv8Q"
},
"outputs": [],
"source": [
"print('Precision (tree): {:.2f}'.format(sk.metrics.precision_score(y_test, y_tree_pred)))\n",
"print('Recall (tree): {:.2f}'.format(sk.metrics.recall_score(y_test, y_tree_pred)))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7hyZXZE4d0VM"
},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bKfJGA5V1scP"
},
"source": [
"### Hyperparameter tuning"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xyOD7ZsC12B7"
},
"source": [
"Above we used the default values for how much randomness to include while building our trees for the random forest. Let's look at how we can use k-fold cross validation to help us choose how much randomness to use when building these trees!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UHDRS-2eve5y"
},
"source": [
"One hyperparameter:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iVRGUvjH1r8u"
},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score\n",
"\n",
"number_of_trees = [50, 100, 150, 200, 250, 300, 350]\n",
"\n",
"for val in number_of_trees:\n",
" scores = cross_val_score(RandomForestClassifier(n_estimators=val), X_train, y_train, cv=5, scoring='accuracy')\n",
" print(scores.mean())"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JEVhNgo2veU4"
},
"source": [
"Many hyperparameters:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "D2RjbQGtufbv"
},
"outputs": [],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"#define what parameters and what values to vary\n",
"parameters = {'max_features': [0.5,0.7,0.9,1.0],\n",
" 'n_estimators':list(range(50,200,50)),\n",
" 'max_samples':[0.5,0.7,0.9,0.99] }\n",
"\n",
"#build the grid search algorithm\n",
"grid_search = GridSearchCV(RandomForestClassifier(), parameters, cv=5, scoring='accuracy') #strattified cross validation when traget is binary or multiclass\n",
"\n",
"#Use training data to perform the nfold cross validation\n",
"grid_search.fit(X_train, y_train)\n",
"\n",
"#find the best hyperparameters\n",
"print(grid_search.best_params_)\n",
"grid_search.best_score_"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4DsVBg1sdCRH"
},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-fEGDNTc83fQ"
},
"source": [
"We can see that this quickly becomes time consuming to run. There are many algorithms out there to help tune your models. They generally break down into exhaustive grid searches, vs random searches. But this is an active field that is growing all the time. "
]
},
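{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a taste of the random-search side, here is a hedged sketch using scikit-learn's `RandomizedSearchCV`, which samples a fixed number of hyperparameter combinations instead of trying them all. The parameter ranges below are just illustrative choices."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import RandomizedSearchCV\n",
"\n",
"#sketch: the ranges below are illustrative, not tuned recommendations\n",
"random_parameters = {'max_features': [0.5, 0.7, 0.9, 1.0],\n",
"                     'n_estimators': list(range(50, 300, 50)),\n",
"                     'max_samples': [0.5, 0.7, 0.9, 0.99]}\n",
"\n",
"#sample 10 random combinations rather than trying every one\n",
"random_search = RandomizedSearchCV(RandomForestClassifier(), random_parameters, n_iter=10, cv=5, scoring='accuracy', random_state=42)\n",
"\n",
"#fit on the training data only\n",
"random_search.fit(X_train, y_train)\n",
"\n",
"print(random_search.best_params_)\n",
"random_search.best_score_"
]
},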
{
"cell_type": "markdown",
"metadata": {
"id": "3giY46R8-xqy"
},
"source": [
"Now that we've tuned our random forest model let's see if we can beat the default model that we fit! "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "w-7pDJcy_g3a"
},
"outputs": [],
"source": [
"#1. build an optimized model\n",
"forest_classifier_opt = RandomForestClassifier(n_estimators=?,max_samples=?, max_features=?)\n",
"\n",
"#2. Fit the model to the data\n",
"?.fit(?, ?)\n",
"\n",
"#3. make predictions\n",
"y_forest_pred_opt = ?.predict(?)\n",
"\n",
"#Measure accuracy\n",
"print('Accuracy: {:.2f}'.format(sk.metrics.accuracy_score(y_test, y_forest_pred_opt)))\n",
"print('Precision : {:.2f}'.format(sk.metrics.precision_score(y_test, y_forest_pred_opt)))\n",
"print('Recall : {:.2f}'.format(sk.metrics.recall_score(y_test, y_forest_pred_opt)))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "f0AZ1ESewgvu"
},
"outputs": [],
"source": [
"#calculate a confusion matrix\n",
"?\n",
"\n",
"#Plot the confusion matrix\n",
"?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3XX3TBEGAV2z"
},
"source": [
"Did your optimized model beat the default one?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6bGe3ihU9Rzk"
},
"source": [
"### Bonus"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FUEujBl09hM3"
},
"outputs": [],
"source": [
"#Try adding other hyperparameters, how much better can you make the model by tuning?\n",
"#e.g., min_samples_split (how many points in a node are required to allow a split)\n",
"#e.g., max_depth (max depth of each tree)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jAtLBkSi8SlD"
},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0nsw2I5da-tH"
},
"source": [
"### Model interpretation"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wuVYJCI9uVNr"
},
"source": [
"Random forests are collections of many decision trees. This makes it a little more difficult to interpret how the predictions are being made, as there can be 1000s of individual trees.\n",
"> Let's look at how to use feature importance to evaluate what is being used by the model to make predictions.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qsSw9xV1uoxA"
},
"outputs": [],
"source": [
"from sklearn.inspection import permutation_importance\n",
"\n",
"#use permutation importance\n",
"perm_result = permutation_importance(forest_classifier_opt, X=X_test, y=y_test, scoring='accuracy', n_repeats=30)\n",
"\n",
"#place values into a dataframe\n",
"forest_importances = pd.DataFrame({'variable':X_test.columns,'impo':perm_result.importances_mean.round(4), \"sd\":perm_result.importances_std.round(4)})\n",
"\n",
"#sort the dataframe\n",
"forest_importances.sort_values(by='impo', ascending=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "S2Nt5sxD0nhW"
},
"outputs": [],
"source": [
"#plot the importance\n",
"sns.barplot(data=forest_importances, x='variable',y='impo')\n",
"plt.xticks(rotation=70)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "0rUKMEaMyxQP"
},
"outputs": [],
"source": [
"#plot the importance (switch axis to display labels better?)\n",
"sns.barplot(data=forest_importances, y='variable',x='impo')\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DKH1xdwsfqH3"
},
"source": [
"**Asking your model questions?**"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8YWPrMu5fx3L"
},
"source": [
"Sometimes it can be very helpful to create a dataset that represents a question you have, and then use your model to make predictions to answer that question. For instance, what if someone had mean values for all measures? "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "qJQEwL61gpO7"
},
"outputs": [],
"source": [
"#1. Create a dataframe\n",
"df_question = pd.DataFrame({'Pregnancies':X_train.Pregnancies.mean(),\n",
" 'Glucose':X_train.Glucose.mean(),\n",
" 'BloodPressure':X_train.BloodPressure.mean(),\n",
" 'SkinThickness':X_train.SkinThickness.mean(),\n",
" 'Insulin':X_train.Insulin.mean(),\n",
" 'BMI':X_train.BMI.mean(),\n",
" 'DiabetesPedigreeFunction':X_train.DiabetesPedigreeFunction.mean(),\n",
" 'Age':X_train.Age.mean()},\n",
" index=[0])\n",
" \n",
"\n",
"#2. Use the model to make predictions\n",
"question_pred = forest_classifier_opt.predict(df_question)\n",
"\n",
"#3. Take a look at the answer\n",
"question_pred"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TTpT78RGiZ12"
},
"source": [
"Now we can make our question a little more interesting by allowing one variable to vary. Let's see how the predictions change as we vary glucose of the average person."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "XdnqTmkNilCp"
},
"outputs": [],
"source": [
"#1. Create a dataframe\n",
"df_question = pd.DataFrame({'Pregnancies':X_train.Pregnancies.mean(),\n",
" 'Glucose':list(range(0,200,10)),\n",
" 'BloodPressure':X_train.BloodPressure.mean(),\n",
" 'SkinThickness':X_train.SkinThickness.mean(),\n",
" 'Insulin':X_train.Insulin.mean(),\n",
" 'BMI':X_train.BMI.mean(),\n",
" 'DiabetesPedigreeFunction':X_train.DiabetesPedigreeFunction.mean(),\n",
" 'Age':X_train.Age.mean()})\n",
" \n",
"\n",
"#2. Use the model to make predictions\n",
"question_pred = forest_classifier_opt.predict(df_question)\n",
"\n",
"#3. Take a look at the answer\n",
"question_pred"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Y24R8peWi7O6"
},
"source": [
"Let's plot the answer"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Qw5ArXCFi64K"
},
"outputs": [],
"source": [
"#add a column to the df_question\n",
"df_question['predicted_diabetes'] = question_pred\n",
"\n",
"#plot the predictions\n",
"sns.scatterplot(data=df_question, x='Glucose',y='predicted_diabetes')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5Z2I7aMSxtwy"
},
"source": [
"### Model Application"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_IpbBZLvxx8J"
},
"source": [
"Let's apply what we learnt about random forests to other datasets. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "u_Tzi4kAx9aF"
},
"outputs": [],
"source": [
"#load dataset!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BJqFrYyyYzMx"
},
"source": [
"### Further reading"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> If you would like the notebook without missing code check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_Intro_RandomForest.ipynb) version."
]
}
],
"metadata": {
"colab": {
"authorship_tag": "ABX9TyMsIltJCzN63S2NuI7BCzUI",
"collapsed_sections": [],
"include_colab_link": true,
"name": "Intro_RandomForest.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}