{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "view-in-github"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/tbonne/IntroDataScience/blob/main/InClassNotebooks/IntroClustering2_highDimensions.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ff3qPaLI390b"
      },
      "source": [
        "<img src='http://drive.google.com/uc?export=view&id=1eaZWfBLfmkzNndnA8pJvo6tO4Lm7Qz83' width=500>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "***"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "N6ULenhF39aV"
      },
      "source": [
        "## <font color='darkorange'>Clustering in higher dimensions</font>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "x8xd-YdGCN0h"
      },
      "source": [
        "In this exercise we will look at adding an extra dimension to our points. We'll look at how changing the problem from 2D to 3D can cause challenges. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "dZqwcX2934AE"
      },
      "outputs": [],
      "source": [
        "import numpy as np\n",
        "import seaborn as sns\n",
        "import pandas as pd\n",
        "from sklearn import cluster\n",
        "import sklearn as sk\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YlV3DICbZa_B"
      },
      "source": [
        "### <font color='darkorange'>Simulating data</font>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wJZnfzlmr01l"
      },
      "source": [
        "Let's simulate some data where we know how many clusters there are. This time we'll add an extra dimension. \n",
        "\n",
        "> i.e., let's create 1000 points and set them to class 1. Each point will get a random x, y, and z coordinate."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "cPRKRQ5BwR_i"
      },
      "outputs": [],
      "source": [
        "#simulate some random values\n",
        "array_class1 = {\"x\":np.random.normal(1,4, size=1000),\n",
        "                \"y\":np.random.normal(1,4, size=1000),\n",
        "                \"z\":np.random.normal(1,40, size=1000),\n",
        "                \"class\": 1}\n",
        "\n",
        "#put them in a dataframe\n",
        "df_class1 = pd.DataFrame(data=array_class1)\n",
        "\n",
        "#plot it\n",
        "sns.scatterplot(data=df_class1, x=\"x\",y=\"y\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sec46iri4BJN"
      },
      "source": [
        "Because the cluster is more than 2 dimensions it is hard to visualize with a scatterplot. Let's look at the cluster from the x and z dimension."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rq5spzB74Noa"
      },
      "outputs": [],
      "source": [
        "sns.scatterplot(data=df_class1, x=\"x\",y=\"z\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RY0aNadI2B5g"
      },
      "source": [
        "Create another set of 1000 points and assign them to class 2. Then we'll add the two sets of points together by using **concat**:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "OzXYdm-b2GgT"
      },
      "outputs": [],
      "source": [
        "#generate some random values\n",
        "array_class2 = {\"x\":np.random.normal(8,5, size=1000),\n",
        "                \"y\":np.random.normal(8,1, size=1000),\n",
        "                \"z\":np.random.normal(8,0.5, size=1000),\n",
        "                \"class\": 2}\n",
        "\n",
        "#put them in a dataframe\n",
        "df_class2 = pd.DataFrame(data=array_class2)\n",
        "\n",
        "#bind the two dataframes together by rows\n",
        "df_class = pd.concat([df_class1,df_class2], axis = 0) #axis=0 just says to bind by rows, axis=1 would be by columns \n",
        "\n",
        "#plot it\n",
        "sns.scatterplot(data=df_class, x=\"x\",y=\"y\", hue='class')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "29Vfwaa24eEN"
      },
      "outputs": [],
      "source": [
        "sns.scatterplot(data=df_class, x=\"x\",y=\"z\", hue='class')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1jqsbr0I4pVD"
      },
      "source": [
        "\n",
        "### <font color='darkorange'>Visualizing in 3D</font>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lOtwPbVtADO7"
      },
      "source": [
        "Let's learn how to visualize data in 3D! We'll use **plotly** as it is easy to use and gives us a great interactive plot to use!"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "M_0_GUkyGNTJ"
      },
      "outputs": [],
      "source": [
        "#import plotly\n",
        "import plotly.express as px\n",
        "\n",
        "#build a figure with three axis\n",
        "fig = px.scatter_3d(df_class, x='x', y='y', z='z', color='class')\n",
        "fig.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "a6bIK4CF4sqg"
      },
      "source": [
        "### <font color='darkorange'>Clustering with many dimensions</font>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7UtPQ_MO32ju"
      },
      "source": [
        "Let's try out the clustering in higher dimensions using k-means.\n",
        "> The approach we use will work the same for HDBScan and many of the other clustering algorithm."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "B9Bpf4vGcL0b"
      },
      "source": [
        "**First** let's build the machine learning algorithm that we will use (i.e., k-means)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vVy7GtvOcVkT"
      },
      "outputs": [],
      "source": [
        "#initialize the kmeans algorithm\n",
        "clus_kmeans = cluster.KMeans(n_clusters=2) #how many clusters are there?"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-KQTFISrcWIc"
      },
      "source": [
        "**Second** let's fit the model using data \n",
        "> This is the only difference: we need to make sure we pass all the dimensions when fitting the algorithm"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9ovnDjraccDV"
      },
      "outputs": [],
      "source": [
        "#fit the model\n",
        "clus_kmeans.fit(df_class[['x','y','z']] )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7mKx6_ihcdrs"
      },
      "source": [
        "**Third**, now that the model is built and fit to data we can use it to make predictions!"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JBeiP5E6cjoC"
      },
      "outputs": [],
      "source": [
        "#make some predictions\n",
        "df_class['pred_kmeans'] = clus_kmeans.fit_predict(df_class[['x','y','z']] )\n",
        "\n",
        "#take a look\n",
        "df_class"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wUJ2AG7k52cI"
      },
      "source": [
        "Similarly, when measuring the performance of the algorithm we need to include all dimension when making predictions."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "UHMdnrtP6DcP"
      },
      "outputs": [],
      "source": [
        "sk.metrics.silhouette_score(X=df_class.loc[:,['x','y','z']],labels=df_class['pred_kmeans'])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZgUO8FBYkU7g"
      },
      "source": [
        "Finally, let's plot those predictions in 3D"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NWrVKvxekXlR"
      },
      "outputs": [],
      "source": [
        "#build a figure with three axis\n",
        "fig_pred = px.scatter_3d(df_class, x='x', y='y', z='z', color='pred_kmeans')\n",
        "fig_pred.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kN1BOzx-k0Aq"
      },
      "source": [
        "### <font color='darkorange'>Bonus</font>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "c5fxlKMpkwqE"
      },
      "source": [
        "Try to again cluster the above dataset, this time using the HDBScan algorithm."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "1jrDNJj8rTTY"
      },
      "outputs": [],
      "source": [
        "!pip install hdbscan\n",
        "import hdbscan"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3MuVt9PPqkzL"
      },
      "outputs": [],
      "source": [
        "#initalize the kmeans algorithm (hyperparameter - choose minimum cluster size)\n",
        "clus_hdbscan = hdbscan.HDBSCAN(min_cluster_size = ?) "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tkm4aTkRqwJZ"
      },
      "source": [
        "Second fit the ml algorithm to the data\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "sVsRb1VOq3Qa"
      },
      "outputs": [],
      "source": [
        "#fit the model\n",
        "clus_hdbscan.fit(df_class[['?','?','?']] )"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "x-wtYK27q3xK"
      },
      "source": [
        "Third use the ml algorithm to make some predictions"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "e1ESWmAtq7_o"
      },
      "outputs": [],
      "source": [
        "#make some predictions\n",
        "df_class['pred_khdbscan'] = clus_hdbscan.fit_predict(?] )"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kHTV4MqtlOhn"
      },
      "outputs": [],
      "source": [
        "#build a figure with three axis\n",
        "fig_Hpred = px.scatter_3d(df_class, x='?', y='?', z='?', color='?')\n",
        "fig_Hpred.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6E-l7OCGlnVJ"
      },
      "source": [
        "How well does the HDBScan algorithm do, and does this change with your choice of min cluster size? How do the results differ from KMeans?"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "v6F-hjnc6zFB"
      },
      "source": [
        "### <font color='darkorange'>Further reading</font>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HwhprU0g61lt"
      },
      "source": [
        "[Plotly tutorials](https://plotly.com/python/plotly-fundamentals/)\n",
        "\n",
        "> If you would like the notebook without missing code check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_IntroClustering2_highDimensions.ipynb) version."
      ]
    }
  ],
  "metadata": {
    "colab": {
      "authorship_tag": "ABX9TyNfszrrPnEkyzwddLAwjT0e",
      "collapsed_sections": [],
      "include_colab_link": true,
      "name": "IntroClustering2_highDimensions.ipynb",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 2
}