{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "Ff3qPaLI390b" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***" ] }, { "cell_type": "markdown", "metadata": { "id": "N6ULenhF39aV" }, "source": [ "## Clustering in higher dimensions" ] }, { "cell_type": "markdown", "metadata": { "id": "x8xd-YdGCN0h" }, "source": [ "In this exercise we will look at adding an extra dimension to our points. We'll look at how changing the problem from 2D to 3D can cause challenges. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dZqwcX2934AE" }, "outputs": [], "source": [ "import numpy as np\n", "import seaborn as sns\n", "import pandas as pd\n", "from sklearn import cluster\n", "import sklearn as sk\n" ] }, { "cell_type": "markdown", "metadata": { "id": "YlV3DICbZa_B" }, "source": [ "### Simulating data" ] }, { "cell_type": "markdown", "metadata": { "id": "wJZnfzlmr01l" }, "source": [ "Let's simulate some data where we know how many clusters there are. This time we'll add an extra dimension. \n", "\n", "> i.e., let's create 1000 points and set them to class 1. Each point will get a random x, y, and z coordinate." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cPRKRQ5BwR_i" }, "outputs": [], "source": [ "#simulate some random values\n", "array_class1 = {\"x\":np.random.normal(1,4, size=1000),\n", " \"y\":np.random.normal(1,4, size=1000),\n", " \"z\":np.random.normal(1,40, size=1000),\n", " \"class\": 1}\n", "\n", "#put them in a dataframe\n", "df_class1 = pd.DataFrame(data=array_class1)\n", "\n", "#plot it\n", "sns.scatterplot(data=df_class1, x=\"x\",y=\"y\")\n" ] }, { "cell_type": "markdown", "metadata": { "id": "sec46iri4BJN" }, "source": [ "Because the cluster has more than two dimensions, it is hard to visualize with a single scatterplot. 
Let's look at the cluster in the x and z dimensions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rq5spzB74Noa" }, "outputs": [], "source": [ "sns.scatterplot(data=df_class1, x=\"x\",y=\"z\")" ] }, { "cell_type": "markdown", "metadata": { "id": "RY0aNadI2B5g" }, "source": [ "Create another set of 1000 points and assign them to class 2. Then we'll combine the two sets of points using **concat**:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OzXYdm-b2GgT" }, "outputs": [], "source": [ "#generate some random values\n", "array_class2 = {\"x\":np.random.normal(8,5, size=1000),\n", " \"y\":np.random.normal(8,1, size=1000),\n", " \"z\":np.random.normal(8,0.5, size=1000),\n", " \"class\": 2}\n", "\n", "#put them in a dataframe\n", "df_class2 = pd.DataFrame(data=array_class2)\n", "\n", "#bind the two dataframes together by rows\n", "df_class = pd.concat([df_class1,df_class2], axis = 0) #axis=0 binds by rows, axis=1 would bind by columns\n", "\n", "#plot it\n", "sns.scatterplot(data=df_class, x=\"x\",y=\"y\", hue='class')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "29Vfwaa24eEN" }, "outputs": [], "source": [ "sns.scatterplot(data=df_class, x=\"x\",y=\"z\", hue='class')" ] }, { "cell_type": "markdown", "metadata": { "id": "1jqsbr0I4pVD" }, "source": [ "\n", "### Visualizing in 3D" ] }, { "cell_type": "markdown", "metadata": { "id": "lOtwPbVtADO7" }, "source": [ "Let's learn how to visualize data in 3D! We'll use **plotly**, which is easy to use and gives us an interactive plot that we can rotate and zoom." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "M_0_GUkyGNTJ" }, "outputs": [], "source": [ "#import plotly\n", "import plotly.express as px\n", "\n", "#build a figure with three axes\n", "fig = px.scatter_3d(df_class, x='x', y='y', z='z', color='class')\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "a6bIK4CF4sqg" }, "source": [ "### Clustering with many dimensions" ] }, { "cell_type": "markdown", "metadata": { "id": "7UtPQ_MO32ju" }, "source": [ "Let's try out clustering in higher dimensions using k-means.\n", "> The approach we use will work the same for HDBSCAN and many other clustering algorithms." ] }, { "cell_type": "markdown", "metadata": { "id": "B9Bpf4vGcL0b" }, "source": [ "**First**, let's build the machine learning algorithm that we will use (i.e., k-means)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vVy7GtvOcVkT" }, "outputs": [], "source": [ "#initialize the kmeans algorithm\n", "clus_kmeans = cluster.KMeans(n_clusters=2) #how many clusters are there?" ] }, { "cell_type": "markdown", "metadata": { "id": "-KQTFISrcWIc" }, "source": [ "**Second**, let's fit the model to the data.\n", "> This is the only difference: we need to make sure we pass all the dimensions when fitting the algorithm." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9ovnDjraccDV" }, "outputs": [], "source": [ "#fit the model\n", "clus_kmeans.fit(df_class[['x','y','z']] )" ] }, { "cell_type": "markdown", "metadata": { "id": "7mKx6_ihcdrs" }, "source": [ "**Third**, now that the model is built and fitted to the data, we can use it to make predictions!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JBeiP5E6cjoC" }, "outputs": [], "source": [ "#make some predictions with the already-fitted model (predict does not refit)\n", "df_class['pred_kmeans'] = clus_kmeans.predict(df_class[['x','y','z']] )\n", "\n", "#take a look\n", "df_class" ] }, { "cell_type": "markdown", "metadata": { "id": "wUJ2AG7k52cI" }, "source": [ "Similarly, when measuring the performance of the algorithm we need to include all dimensions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UHMdnrtP6DcP" }, "outputs": [], "source": [ "sk.metrics.silhouette_score(X=df_class.loc[:,['x','y','z']],labels=df_class['pred_kmeans'])" ] }, { "cell_type": "markdown", "metadata": { "id": "ZgUO8FBYkU7g" }, "source": [ "Finally, let's plot those predictions in 3D." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NWrVKvxekXlR" }, "outputs": [], "source": [ "#build a figure with three axes\n", "fig_pred = px.scatter_3d(df_class, x='x', y='y', z='z', color='pred_kmeans')\n", "fig_pred.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "kN1BOzx-k0Aq" }, "source": [ "### Bonus" ] }, { "cell_type": "markdown", "metadata": { "id": "c5fxlKMpkwqE" }, "source": [ "Try clustering the above dataset again, this time using the HDBSCAN algorithm." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1jrDNJj8rTTY" }, "outputs": [], "source": [ "!pip install hdbscan\n", "import hdbscan" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3MuVt9PPqkzL" }, "outputs": [], "source": [ "#initialize the HDBSCAN algorithm (hyperparameter: choose the minimum cluster size)\n", "clus_hdbscan = hdbscan.HDBSCAN(min_cluster_size = ?) 
" ] }, { "cell_type": "markdown", "metadata": { "id": "tkm4aTkRqwJZ" }, "source": [ "Second, fit the ML algorithm to the data.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sVsRb1VOq3Qa" }, "outputs": [], "source": [ "#fit the model\n", "clus_hdbscan.fit(df_class[['?','?','?']] )" ] }, { "cell_type": "markdown", "metadata": { "id": "x-wtYK27q3xK" }, "source": [ "Third, use the ML algorithm to make some predictions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "e1ESWmAtq7_o" }, "outputs": [], "source": [ "#make some predictions\n", "df_class['pred_hdbscan'] = clus_hdbscan.fit_predict(df_class[['?','?','?']] )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kHTV4MqtlOhn" }, "outputs": [], "source": [ "#build a figure with three axes\n", "fig_Hpred = px.scatter_3d(df_class, x='?', y='?', z='?', color='?')\n", "fig_Hpred.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "6E-l7OCGlnVJ" }, "source": [ "How well does the HDBSCAN algorithm do, and does this change with your choice of `min_cluster_size`? How do the results differ from k-means?" ] }, { "cell_type": "markdown", "metadata": { "id": "v6F-hjnc6zFB" }, "source": [ "### Further reading" ] }, { "cell_type": "markdown", "metadata": { "id": "HwhprU0g61lt" }, "source": [ "[Plotly tutorials](https://plotly.com/python/plotly-fundamentals/)\n", "\n", "> If you would like the notebook with no missing code, check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_IntroClustering2_highDimensions.ipynb) version." ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyNfszrrPnEkyzwddLAwjT0e", "collapsed_sections": [], "include_colab_link": true, "name": "IntroClustering2_highDimensions.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }