{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "0BFAQnfHmeaB" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": { "id": "4VTsxBzT7tAb" }, "source": [ "## Logistic regression" ] }, { "cell_type": "markdown", "metadata": { "id": "03s3L0-C8Zxr" }, "source": [ "> In previous classes we have used exploratory approaches to visualize and quantify relationships between variables. We used linear regression to make predictions about numeric values (e.g., boston house prices), now we will use logistic regression models for a classification problem. Here we will try and distinguish tissue samples as positive/negative for breast cancer." ] }, { "cell_type": "markdown", "metadata": { "id": "MAhpAGvgCH_5" }, "source": [ "Let's load in our growing list of python packages that we are getting used to using.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JyXkhDShnx8J" }, "outputs": [], "source": [ "import sklearn as sk\n", "from sklearn.model_selection import train_test_split\n", "\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "\n", "import statsmodels.api as sm\n", "import statsmodels.formula.api as smf\n" ] }, { "cell_type": "markdown", "metadata": { "id": "aZZS3rQvo7Xk" }, "source": [ "Then let's load in the breast cancer dataset, and get it into a format we can use." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Iv6LFQPupABa" }, "outputs": [], "source": [ "# The tissue sample dataset\n", "df_cancer = ?\n", "\n", "#take a look\n", "?" ] }, { "cell_type": "markdown", "metadata": { "id": "wJHjdshLGGyI" }, "source": [ "### Understand the data " ] }, { "cell_type": "markdown", "metadata": { "id": "E-eSiYP3HOlx" }, "source": [ "What kinds of data is the cancer data?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3YbSSBjNHWun" }, "outputs": [], "source": [ "?" ] }, { "cell_type": "markdown", "metadata": { "id": "4JTcOqxPHXZh" }, "source": [ "Are there missing values anywhere?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6hmyJub7HX6h" }, "outputs": [], "source": [ "?" ] }, { "cell_type": "markdown", "metadata": { "id": "IRu_wGLKG5qK" }, "source": [ "### Visualize and Explore " ] }, { "cell_type": "markdown", "metadata": { "id": "mEcsEoBeFpYa" }, "source": [ "Histogram of the target variable" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RNIgmjtjFrz5" }, "outputs": [], "source": [ "?" ] }, { "cell_type": "markdown", "metadata": { "id": "g87llbBEG-J4" }, "source": [ "Plot the target variable (benign) on the y-axis with another variable on the x-axis. Try out a few different variables on the x-axis." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "STIJY6mXGGh3" }, "outputs": [], "source": [ "?" ] }, { "cell_type": "markdown", "metadata": { "id": "kBlzvYeAHb_p" }, "source": [ "Create a heat map to help you explore" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gIxHKp43JMzO" }, "outputs": [], "source": [ "?" 
] }, { "cell_type": "markdown", "metadata": { "id": "GFsEVUnqquWz" }, "source": [ "### Data wrangling " ] }, { "cell_type": "markdown", "metadata": { "id": "HYoCKQeQsVBl" }, "source": [ "**Data preprocessing (binary variables)**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Mc_Ru7FRsdJ9" }, "outputs": [], "source": [ "from sklearn.preprocessing import OrdinalEncoder\n", "\n", "#get the columns names of features you'd like to turn into 0/1\n", "bin_names = ['benign']\n", "\n", "#create a dataframe of those features\n", "bin_features = df_cancer[bin_names]\n", "\n", "#fit the scaler to those data\n", "bin_scaler = OrdinalEncoder().fit(bin_features.values)\n", "\n", "#use the scaler to transform your data\n", "bin_features = bin_scaler.transform(bin_features.values)\n", "\n", "#put these scaled features back into your transformed features dataframe\n", "df_cancer[bin_names] = bin_features\n", "\n", "#take a look\n", "df_cancer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dl4BZFyoM6qq" }, "outputs": [], "source": [ "bin_scaler.categories_" ] }, { "cell_type": "markdown", "metadata": { "id": "RWiQn1gHs4i5" }, "source": [ "**Data preprocessing (categorical variables)**" ] }, { "cell_type": "markdown", "metadata": { "id": "O2wZLBstt_rV" }, "source": [ "Technician ID number is a categorical value, but it is being treated as a number (int64). Let's convert it to a category." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6hLs5unAN0kb" }, "outputs": [], "source": [ "df_cancer['technician'] = df_cancer.technician.astype('category')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "X2loCli-s0SF" }, "outputs": [], "source": [ "\n", "#categorical variables\n", "cat_names = ['technician']\n", "\n", "#create dummy variables\n", "df_cat = pd.get_dummies(df_cancer[cat_names])\n", "\n", "#add them back to the original dataframe\n", "df_cancer = pd.concat([df_cancer,df_cat], axis=1)\n", "\n", "#remove the old columns\n", "df_cancer = df_cancer.drop(cat_names, axis=1)\n", "\n", "#take a look\n", "df_cancer" ] }, { "cell_type": "markdown", "metadata": { "id": "n8C4RUB7RYCG" }, "source": [ "**Split our dataframe into training and testing datasets**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eZEdEuROIFNC" }, "outputs": [], "source": [ "#split the data into training and testing (80/20 split)\n", "df_train, df_test = sk.model_selection.train_test_split(df_cancer, test_size=0.20)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2CmwRtpxS7Vf" }, "outputs": [], "source": [ "#take a look training dataset\n", "df_train.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HdMXfjuFSNfq" }, "outputs": [], "source": [ "#take a look\n", "df_test.shape\n" ] }, { "cell_type": "markdown", "metadata": { "id": "gk9WNnywF7Cw" }, "source": [ "**Data pre-processing (numeric)**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7dXBfXA8F9UD" }, "outputs": [], "source": [ "#Feature Scaling (after spliting the data!)\n", "from sklearn.preprocessing import StandardScaler \n", "\n", "#numeric variables\n", "numb_names = df_train.drop(['benign','technician_1','technician_2','technician_3','technician_4'],axis=1).select_dtypes('number').columns.tolist()\n", "\n", "#create the standard scaler object\n", "sc = StandardScaler()\n", "\n", "#use this object to fit (i.e., to calculate the mean and sd of each variable in the 
{ "cell_type": "markdown", "metadata": { "id": "d5JdBUC5HlCl" }, "source": [ "\n", "### Modeling and Prediction" ] }, { "cell_type": "markdown", "metadata": { "id": "xlIVmLX7HqMm" }, "source": [ "> Let's look at building our second kind of model -- logistic regression! How well can we predict the benign cases? This is similar to cluster analysis, except here we have the labels! Can we train a model to make the right predictions?\n",
\n", "We will follow a general approach when building models. We will divide the dataset into *training* and *testing* datasets. \n", "
\n", "This lets us fit the model to one part of the data and then use the withheld data to test the predictions of the model. This helps us avoid *overfitting* our model!" ] }, { "cell_type": "markdown", "metadata": { "id": "moh5HFYwTKoz" }, "source": [ "**Fit a model**" ] }, { "cell_type": "markdown", "metadata": { "id": "_YLTO-6TN-wq" }, "source": [ " " ] }, { "cell_type": "markdown", "metadata": { "id": "dIvVVDYnN_uq" }, "source": [ "Below choose a variable to predict if the tissue sample is benign or not." ] }, { "cell_type": "markdown", "metadata": { "id": "vrThvCCaTbqR" }, "source": [ "In general when using sklearn to fit a model we will follow these steps:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jI65cVtTTbFd" }, "outputs": [], "source": [ "#define model parameters\n", "log_reg = smf.logit('benign ~ ?', data=df_train)\n", "\n", "#fit the model to the training data\n", "results = log_reg.fit()\n", "\n", "#Get a summary of the model parameters\n", "print(results.summary())\n" ] }, { "cell_type": "markdown", "metadata": { "id": "GmYUjT27dbBy" }, "source": [ "**Visualize and explore the model predictions**" ] }, { "cell_type": "markdown", "metadata": { "id": "_s_PldxndpS3" }, "source": [ "Let's look at where the model to a good/bad job of classifying images into benign or not!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OYucuvmqj0uN" }, "outputs": [], "source": [ "#let's first predict values in the testing dataset\n", "df_test['benign_prob'] = results.predict(df_test).round(2)\n", "\n", "df_test['benign_pred'] = (df_test['benign_prob']>0.5).astype(int) #here we've used 0.5 as the threshold of benign or not!\n", "\n", "df_test" ] }, { "cell_type": "markdown", "metadata": { "id": "V4YdwcpskQyC" }, "source": [ "Let's plot the predicted and observed points!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EZxBRFwqdwd5" }, "outputs": [], "source": [ "sns.scatterplot(data=df_test,x='mean_symmetry', y='benign')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-OgHBdczgYn1" }, "outputs": [], "source": [ "sns.scatterplot(data=df_test,x='mean_symmetry', y='benign')\n", "sns.scatterplot(data=df_test,x='mean_symmetry', y='benign_pred')" ] }, { "cell_type": "markdown", "metadata": { "id": "C0MmbEZ7dq5e" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "S7gvWHDrbuK1" }, "source": [ "How good is the model at classifying?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ksoip4-QFDYe" }, "outputs": [], "source": [ "#confusion table\n", "confusion_matrix = sk.metrics.confusion_matrix(df_test['benign'], df_test['benign_pred'])\n", "print(confusion_matrix)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AhYeBHZJMn2u" }, "outputs": [], "source": [ "#more visual approach\n", "sns.heatmap(confusion_matrix, annot=True)\n", "plt.xlabel('Predicted label')\n", "plt.ylabel('True label')" ] }, { "cell_type": "markdown", "metadata": { "id": "bEvZvpW3FFdB" }, "source": [ "Measuring classification success:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "T-3K34M6clh1" }, "outputs": [], "source": [ "print('Accuracy: {:.2f}'.format(sk.metrics.accuracy_score(df_test['benign'], df_test['benign_pred'])))\n", "print('precision: {:.2f}'.format(sk.metrics.precision_score(df_test['benign'], df_test['benign_pred'])))\n", "print('recal: {:.2f}'.format(sk.metrics.recall_score(df_test['benign'], df_test['benign_pred'])))\n" ] }, { "cell_type": "markdown", "metadata": { "id": "8wk1MP3yPt41" }, "source": [ "> **Accuracy** is the overall ability of the model to correctly identify positive and negative samples.\n", "\n", "> **Precision** is intuitively the ability of the classifier to not label a sample as positive if it is negative.\n", "\n", "> **Recall** is intuitively the ability of the classifier to find all the positive samples." ] }, { "cell_type": "markdown", "metadata": { "id": "imuhCYCRwAKW" }, "source": [ "Compare that accuracy if we just predicted the most common type (i.e., let's compute a baseline!)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nZtsrN4AwHZG" }, "outputs": [], "source": [ "df_cancer.benign.value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mBePam8lwMDT" }, "outputs": [], "source": [ "357/(212+357)" ] }, { "cell_type": "markdown", "metadata": { "id": "d2iuYitVkb3O" }, "source": [ "Is all that variation noise? Or maybe there are other variables that might explain why the predictions are off." ] }, { "cell_type": "markdown", "metadata": { "id": "SS_vxHN7vpBg" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "6L3gTa8Ik_6I" }, "source": [ "**Fit a more complex model**" ] }, { "cell_type": "markdown", "metadata": { "id": "S4pSaQXslLVk" }, "source": [ "This time we will try logistic regression with many predictors. How high can you get the accuracy?" ] }, { "cell_type": "markdown", "metadata": { "id": "uUm_ZfRBvhs5" }, "source": [ " " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LcG4MXuolVpB" }, "outputs": [], "source": [ "#define model parameters\n", "log_reg2 = smf.logit('benign ~ ?' 
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "LcG4MXuolVpB" }, "outputs": [], "source": [ "#define model parameters\n", "log_reg2 = smf.logit('benign ~ ?', data=df_train)\n", "\n", "#fit the model to the training data\n", "results2 = log_reg2.fit(method='bfgs')\n", "\n", "#get a summary of the model parameters\n", "print(results2.summary())\n" ] }, { "cell_type": "markdown", "metadata": { "id": "8yeNY-fNlVpo" }, "source": [ "Visualize and explore these predictions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ql6ferivj2v4" }, "outputs": [], "source": [ "#let's first predict values in the testing dataset\n", "df_test['benign_prob_multi'] = ?.predict(?).round(2)\n", "\n", "df_test['benign_pred_multi'] = (?>0.5).astype(int) #here we've used 0.5 as the threshold of benign or not!\n", "\n", "df_test" ] }, { "cell_type": "markdown", "metadata": { "id": "UkNy9y6XlVpp" }, "source": [ "First let's look at how well the model's predictions line up with the observed labels. With more than one predictor we'll have to look at them one at a time.\n", "\n", "For example, we can plot the observed and predicted labels in the test data against a single predictor, as sketched below:" ] },
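{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch (it assumes you filled in the cell above so that `benign_pred_multi` exists, and it uses `mean_symmetry` as the example x-axis -- swap in any predictor you chose):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#observed labels and the multi-predictor model's predicted labels, along one predictor\n", "sns.scatterplot(data=df_test, x='mean_symmetry', y='benign')\n", "sns.scatterplot(data=df_test, x='mean_symmetry', y='benign_pred_multi')" ] },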
\n", "Let's look at RM first:" ] }, { "cell_type": "markdown", "metadata": { "id": "NHsjIRSOlVpu" }, "source": [ "How good is the model at predicting?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "aFLuCoA1TFjY" }, "outputs": [], "source": [ "#confusion table\n", "confusion_matrix2 = sk.metrics.confusion_matrix(?, ?)\n", "print(?)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "IyradqCwTFj4" }, "outputs": [], "source": [ "#more visual approach\n", "sns.heatmap(?, annot=True)\n", "plt.xlabel('Predicted label')\n", "plt.ylabel('True label')" ] }, { "cell_type": "markdown", "metadata": { "id": "CxIVSJErTFj7" }, "source": [ "Measuring classification success:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LoOHb_z4TFj7" }, "outputs": [], "source": [ "print('Accuracy: {:.2f}'.format(sk.metrics.accuracy_score(?, ?)))\n", "print('Precision: {:.2f}'.format(sk.metrics.precision_score(?, ?)))\n", "print('Recall: {:.2f}'.format(sk.metrics.recall_score(?, ?)))" ] }, { "cell_type": "markdown", "metadata": { "id": "0ng0E-HeTgiB" }, "source": [ "> **Accuracy** is the fraction of predictions our model got right.\n", "\n", "> **Precision** is intuitively the ability of the classifier to not label a sample as positive if it is negative.\n", "\n", "> **Recall** is intuitively the ability of the classifier to find all the positive samples." ] }, { "cell_type": "markdown", "metadata": { "id": "03Ocqib_Yqqu" }, "source": [ "### Bonus \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Titanic survivors**" ] }, { "cell_type": "markdown", "metadata": { "id": "IDtR4pZx5Z6v" }, "source": [ "Let's see if we can use what we learnt today to predict who survived the titanic sinking, and what features help us to make these predictions.\n", "\n", "> I've taken a random 20% sample from the titanic data. Try and build a model on the data you have (titanic_subsample.csv - in the shared data folder) that can best predict who will survive.\n", "\n", "> When you think you've got a good model, let me know on slack and I'll give you the with-held sample. You can then estimate your models performance!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sZ0zdUOa5swV" }, "outputs": [], "source": [ "df_titanic = pd.read_csv('titanic_subset.csv')\n", "\n", "df_titanic" ] }, { "cell_type": "markdown", "metadata": { "id": "yILa1g5052Qj" }, "source": [ "**Data understanding**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CI7uRfHX54qE" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "id": "ecEdp5iY55ML" }, "source": [ "**Exploration and visualization**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xz9Cc2uy59QN" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "id": "GhMq1eKv6Bbm" }, "source": [ "**Data Preprocessing**" ] }, { "cell_type": "markdown", "metadata": { "id": "4dgLp6Qc6F7u" }, "source": [ "> Feel free here to work with a subset of features that you think will help make predictions in the with-held dataset! I.e., what relationships will generalize well?" 
] }, { "cell_type": "markdown", "metadata": { "id": "4XqfoM4I594j" }, "source": [ "**Model building**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dmF6gOEK81cb" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "id": "JmzHUa11-RVB" }, "source": [ "**Model predictions**" ] }, { "cell_type": "markdown", "metadata": { "id": "wdacn-sf-B7h" }, "source": [ "When you've got a good model and you are ready to test it out let me know and I'll send you the withheld data! When you measure the performance of the model does it differ in accuracy, precision, and recall?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LEp71grz-UC8" }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "id": "BJqFrYyyYzMx" }, "source": [ "### Further reading" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> If you would like the notebook without missing code check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_IntroModelling_LogisticReg.ipynb) version." ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyOeqLWnzBhQehWioLwgrotP", "collapsed_sections": [], "include_colab_link": true, "name": "IntroModelling_LogisticReg.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }