{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "N6ULenhF39aV"
},
"source": [
"## Clustering in higher dimensions"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "x8xd-YdGCN0h"
},
"source": [
"In this exercise we add an extra dimension to our points and look at the challenges that arise when the problem changes from 2D to 3D."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dZqwcX2934AE"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import seaborn as sns\n",
"import pandas as pd\n",
"from sklearn import cluster, metrics\n",
"import sklearn as sk\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YlV3DICbZa_B"
},
"source": [
"### Simulating data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wJZnfzlmr01l"
},
"source": [
"Let's simulate some data where we know how many clusters there are. This time we'll add an extra dimension. \n",
"\n",
"> i.e., let's create 1000 points and set them to class 1. Each point will get a random x, y, and z coordinate."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cPRKRQ5BwR_i"
},
"outputs": [],
"source": [
"#simulate some random values\n",
"array_class1 = {\"x\":np.random.normal(1,4, size=1000),\n",
" \"y\":np.random.normal(1,4, size=1000),\n",
" \"z\":np.random.normal(1,40, size=1000),\n",
" \"class\": 1}\n",
"\n",
"#put them in a dataframe\n",
"df_class1 = pd.DataFrame(data=array_class1)\n",
"\n",
"#plot it\n",
"sns.scatterplot(data=df_class1, x=\"x\",y=\"y\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sec46iri4BJN"
},
"source": [
"Because the cluster has more than two dimensions, it is hard to visualize with a single scatterplot. Let's look at the cluster along the x and z dimensions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "rq5spzB74Noa"
},
"outputs": [],
"source": [
"sns.scatterplot(data=df_class1, x=\"x\",y=\"z\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RY0aNadI2B5g"
},
"source": [
"Create another set of 1000 points and assign them to class 2. Then we'll add the two sets of points together by using **concat**:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OzXYdm-b2GgT"
},
"outputs": [],
"source": [
"#generate some random values\n",
"array_class2 = {\"x\":np.random.normal(8,5, size=1000),\n",
" \"y\":np.random.normal(8,1, size=1000),\n",
" \"z\":np.random.normal(8,0.5, size=1000),\n",
" \"class\": 2}\n",
"\n",
"#put them in a dataframe\n",
"df_class2 = pd.DataFrame(data=array_class2)\n",
"\n",
"#bind the two dataframes together by rows\n",
"df_class = pd.concat([df_class1, df_class2], axis=0, ignore_index=True) #axis=0 binds by rows (axis=1 would bind by columns); ignore_index=True gives every row a unique index\n",
"\n",
"#plot it\n",
"sns.scatterplot(data=df_class, x=\"x\",y=\"y\", hue='class')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "29Vfwaa24eEN"
},
"outputs": [],
"source": [
"sns.scatterplot(data=df_class, x=\"x\",y=\"z\", hue='class')"
]
},
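{
"cell_type": "markdown",
"metadata": {},
"source": [
"One challenge is already visible in the plots above: the z values are spread far more widely than x or y. Distance-based methods such as k-means weigh every dimension equally, so the widest dimension can dominate where the clusters are split. The cell below is a minimal sketch of one way to check this, not part of the original exercise. It simulates its own copy of the data with a fixed seed, and the names `rng`, `demo`, `variances`, and `share` are assumptions made for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"#assumption: a fixed seed so the sketch is repeatable\n",
"rng = np.random.default_rng(0)\n",
"\n",
"#same means and spreads as class 1 and class 2 above\n",
"demo = pd.DataFrame({\n",
"    'x': np.concatenate([rng.normal(1, 4, 1000), rng.normal(8, 5, 1000)]),\n",
"    'y': np.concatenate([rng.normal(1, 4, 1000), rng.normal(8, 1, 1000)]),\n",
"    'z': np.concatenate([rng.normal(1, 40, 1000), rng.normal(8, 0.5, 1000)]),\n",
"})\n",
"\n",
"#the mean squared distance between two random points is twice the sum of the per-dimension variances,\n",
"#so each dimension's share of the variance is also its share of that distance\n",
"variances = demo.var()\n",
"share = variances / variances.sum()\n",
"print(share.round(3))"
]
},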
{
"cell_type": "markdown",
"metadata": {
"id": "1jqsbr0I4pVD"
},
"source": [
"\n",
"### Visualizing in 3D"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lOtwPbVtADO7"
},
"source": [
"Let's learn how to visualize data in 3D! We'll use **plotly**, which is easy to use and produces an interactive plot that you can rotate and zoom."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "M_0_GUkyGNTJ"
},
"outputs": [],
"source": [
"#import plotly\n",
"import plotly.express as px\n",
"\n",
"#build a figure with three axes\n",
"fig = px.scatter_3d(df_class, x='x', y='y', z='z', color='class')\n",
"fig.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "a6bIK4CF4sqg"
},
"source": [
"### Clustering with many dimensions"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7UtPQ_MO32ju"
},
"source": [
"Let's try out clustering in higher dimensions using k-means.\n",
"> The approach we use will work the same way for HDBSCAN and many other clustering algorithms."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "B9Bpf4vGcL0b"
},
"source": [
"**First**, let's build the machine learning algorithm that we will use (i.e., k-means)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vVy7GtvOcVkT"
},
"outputs": [],
"source": [
"#initialize the kmeans algorithm\n",
"clus_kmeans = cluster.KMeans(n_clusters=2) #how many clusters are there?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-KQTFISrcWIc"
},
"source": [
"**Second**, let's fit the model to the data.\n",
"> This is the only difference from the 2D case: we need to make sure we pass all the dimensions when fitting the algorithm."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9ovnDjraccDV"
},
"outputs": [],
"source": [
"#fit the model\n",
"clus_kmeans.fit(df_class[['x','y','z']] )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7mKx6_ihcdrs"
},
"source": [
"**Third**, now that the model is built and fit to data we can use it to make predictions!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JBeiP5E6cjoC"
},
"outputs": [],
"source": [
"#make some predictions (note: fit_predict refits the model, then returns a cluster label for each point)\n",
"df_class['pred_kmeans'] = clus_kmeans.fit_predict(df_class[['x','y','z']] )\n",
"\n",
"#take a look\n",
"df_class"
]
},
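{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the data are simulated, the true class of every point is known, so the predicted clusters can be tabulated against it. The cell below is a minimal sketch of that check on a self-contained copy of the simulation. The fixed seed, `random_state=0`, `n_init=10`, and the names `rng`, `demo`, `km`, and `table` are assumptions added here rather than part of the original exercise. Keep in mind that the cluster numbers k-means assigns are arbitrary, so cluster 0 need not correspond to class 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn import cluster\n",
"\n",
"#assumption: a fixed seed so the sketch is repeatable\n",
"rng = np.random.default_rng(0)\n",
"\n",
"#same means and spreads as class 1 and class 2 above\n",
"demo = pd.DataFrame({\n",
"    'x': np.concatenate([rng.normal(1, 4, 1000), rng.normal(8, 5, 1000)]),\n",
"    'y': np.concatenate([rng.normal(1, 4, 1000), rng.normal(8, 1, 1000)]),\n",
"    'z': np.concatenate([rng.normal(1, 40, 1000), rng.normal(8, 0.5, 1000)]),\n",
"    'class': np.repeat([1, 2], 1000),\n",
"})\n",
"\n",
"#cluster on all three dimensions\n",
"km = cluster.KMeans(n_clusters=2, n_init=10, random_state=0)\n",
"demo['pred_kmeans'] = km.fit_predict(demo[['x', 'y', 'z']])\n",
"\n",
"#rows are the true classes, columns are the predicted clusters\n",
"table = pd.crosstab(demo['class'], demo['pred_kmeans'])\n",
"print(table)"
]
},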
{
"cell_type": "markdown",
"metadata": {
"id": "wUJ2AG7k52cI"
},
"source": [
"Similarly, when measuring the performance of the algorithm we need to include all the dimensions that were used for clustering."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "UHMdnrtP6DcP"
},
"outputs": [],
"source": [
"sk.metrics.silhouette_score(X=df_class.loc[:,['x','y','z']],labels=df_class['pred_kmeans'])"
]
},
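{
"cell_type": "markdown",
"metadata": {},
"source": [
"The silhouette score ranges from -1 to 1, with higher values indicating tighter, better separated clusters. One common use is to compare several choices of the number of clusters. The cell below is a hedged sketch of that comparison on a self-contained copy of the simulation. The candidate values 2 to 4, the fixed seed, `random_state=0`, `n_init=10`, and the names `rng`, `demo`, `X`, and `scores` are assumptions made for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn import cluster, metrics\n",
"\n",
"#assumption: a fixed seed so the sketch is repeatable\n",
"rng = np.random.default_rng(0)\n",
"\n",
"#same means and spreads as class 1 and class 2 above\n",
"demo = pd.DataFrame({\n",
"    'x': np.concatenate([rng.normal(1, 4, 1000), rng.normal(8, 5, 1000)]),\n",
"    'y': np.concatenate([rng.normal(1, 4, 1000), rng.normal(8, 1, 1000)]),\n",
"    'z': np.concatenate([rng.normal(1, 40, 1000), rng.normal(8, 0.5, 1000)]),\n",
"})\n",
"X = demo[['x', 'y', 'z']]\n",
"\n",
"#fit k-means for each candidate number of clusters and score it on all three dimensions\n",
"scores = {}\n",
"for k in [2, 3, 4]:\n",
"    labels = cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)\n",
"    scores[k] = metrics.silhouette_score(X, labels)\n",
"\n",
"print(scores)"
]
},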
{
"cell_type": "markdown",
"metadata": {
"id": "ZgUO8FBYkU7g"
},
"source": [
"Finally, let's plot those predictions in 3D."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "NWrVKvxekXlR"
},
"outputs": [],
"source": [
"#build a figure with three axes\n",
"fig_pred = px.scatter_3d(df_class, x='x', y='y', z='z', color='pred_kmeans')\n",
"fig_pred.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kN1BOzx-k0Aq"
},
"source": [
"### Bonus"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "c5fxlKMpkwqE"
},
"source": [
"Try clustering the above dataset again, this time using the HDBSCAN algorithm."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1jrDNJj8rTTY"
},
"outputs": [],
"source": [
"!pip install hdbscan\n",
"import hdbscan"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3MuVt9PPqkzL"
},
"outputs": [],
"source": [
"#initialize the HDBSCAN algorithm (hyperparameter: choose the minimum cluster size)\n",
"clus_hdbscan = hdbscan.HDBSCAN(min_cluster_size = ?) "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tkm4aTkRqwJZ"
},
"source": [
"Second, fit the ML algorithm to the data.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "sVsRb1VOq3Qa"
},
"outputs": [],
"source": [
"#fit the model\n",
"clus_hdbscan.fit(df_class[['?','?','?']] )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "x-wtYK27q3xK"
},
"source": [
"Third, use the ML algorithm to make some predictions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "e1ESWmAtq7_o"
},
"outputs": [],
"source": [
"#make some predictions\n",
"df_class['pred_khdbscan'] = clus_hdbscan.fit_predict(?)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "kHTV4MqtlOhn"
},
"outputs": [],
"source": [
"#build a figure with three axes\n",
"fig_Hpred = px.scatter_3d(df_class, x='?', y='?', z='?', color='?')\n",
"fig_Hpred.show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6E-l7OCGlnVJ"
},
"source": [
"How well does the HDBSCAN algorithm do, and does this change with your choice of minimum cluster size? How do the results differ from k-means?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "v6F-hjnc6zFB"
},
"source": [
"### Further reading"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HwhprU0g61lt"
},
"source": [
"[Plotly tutorials](https://plotly.com/python/plotly-fundamentals/)\n",
"\n",
"> If you would like the notebook without missing code, check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_IntroClustering2_highDimensions.ipynb) version."
]
}
],
"metadata": {
"colab": {
"authorship_tag": "ABX9TyNfszrrPnEkyzwddLAwjT0e",
"collapsed_sections": [],
"include_colab_link": true,
"name": "IntroClustering2_highDimensions.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}