16. Clustering in higher dimensions

In this exercise we will look at adding an extra dimension to our points. We’ll look at how changing the problem from 2D to 3D can cause challenges.

import numpy as np
import seaborn as sns
import pandas as pd
from sklearn import cluster
import sklearn as sk
import sklearn.metrics #explicitly load the metrics submodule used later as sk.metrics

16.1. Simulating data

Let’s simulate some data where we know how many clusters there are. This time we’ll add an extra dimension.

That is, let’s create 1000 points and assign them to class 1. Each point gets a random x, y, and z coordinate. Note that z is drawn with a much larger standard deviation (40) than x and y (4).

#simulate some random values
array_class1 = {"x":np.random.normal(1,4, size=1000),
                "y":np.random.normal(1,4, size=1000),
                "z":np.random.normal(1,40, size=1000),
                "class": 1}

#put them in a dataframe
df_class1 = pd.DataFrame(data=array_class1)

#plot it
sns.scatterplot(data=df_class1, x="x",y="y")

Because the cluster has more than two dimensions, it is hard to visualize with a single scatterplot. Let’s look at the cluster along the x and z dimensions.

sns.scatterplot(data=df_class1, x="x",y="z")
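The x-z plot looks stretched mostly because z was simulated with a much larger standard deviation than x and y. A minimal sketch for checking the spread of each dimension numerically follows. The seeded generator `rng` and the name `df_demo` are assumptions added here for reproducibility; they are not part of the exercise.

```python
import numpy as np
import pandas as pd

# seeded generator (assumption: added only so the numbers are reproducible)
rng = np.random.default_rng(0)

# same distributions as class 1 above
df_demo = pd.DataFrame({"x": rng.normal(1, 4, size=1000),
                        "y": rng.normal(1, 4, size=1000),
                        "z": rng.normal(1, 40, size=1000)})

# sample standard deviation of each dimension
spread = df_demo.std()
print(spread)
```

The printed value for z should be roughly ten times the values for x and y, matching the parameters passed to `normal`.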

Create another set of 1000 points and assign them to class 2. Then we’ll add the two sets of points together by using concat:

#generate some random values
array_class2 = {"x":np.random.normal(8,5, size=1000),
                "y":np.random.normal(8,1, size=1000),
                "z":np.random.normal(8,0.5, size=1000),
                "class": 2}

#put them in a dataframe
df_class2 = pd.DataFrame(data=array_class2)

#bind the two dataframes together by rows
df_class = pd.concat([df_class1,df_class2], axis = 0) #axis=0 just says to bind by rows, axis=1 would be by columns 

#plot it
sns.scatterplot(data=df_class, x="x",y="y", hue='class')

sns.scatterplot(data=df_class, x="x",y="z", hue='class')
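One detail of `concat` worth knowing: by default it keeps each dataframe's own row labels, so the combined dataframe ends up with repeated index values. The sketch below shows this on two tiny stand-in dataframes; the names `df_a`, `df_b`, `df_kept`, and `df_fresh` and their values are hypothetical. This should not affect the steps in this exercise, since the predictions are assigned from arrays by position, but it can matter when selecting rows with `.loc`.

```python
import pandas as pd

# two tiny stand-in dataframes (hypothetical values, for illustration only)
df_a = pd.DataFrame({"x": [1.0, 2.0], "class": 1})
df_b = pd.DataFrame({"x": [8.0, 9.0], "class": 2})

# default concat keeps each frame's own row labels: 0, 1, 0, 1
df_kept = pd.concat([df_a, df_b], axis=0)
print(df_kept.index.tolist())

# ignore_index=True relabels the rows 0..n-1
df_fresh = pd.concat([df_a, df_b], axis=0, ignore_index=True)
print(df_fresh.index.tolist())
```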

16.2. Visualizing in 3D

Let’s learn how to visualize data in 3D! We’ll use plotly because it is easy to use and produces an interactive plot that can be rotated and zoomed.

#import plotly
import plotly.express as px

#build a figure with three axes
fig = px.scatter_3d(df_class, x='x', y='y', z='z', color='class')
fig.show()

16.3. Clustering with many dimensions

Let’s try out the clustering in higher dimensions using k-means.

The approach we use works the same way for HDBSCAN and many other clustering algorithms.

First, let’s build the machine learning algorithm that we will use (i.e., k-means).

#initialize the kmeans algorithm
clus_kmeans = cluster.KMeans(n_clusters=2) #how many clusters are there?

Second, let’s fit the model to the data.

This is the only difference from the 2D case: we need to make sure we pass all the dimensions when fitting the algorithm.

#fit the model
clus_kmeans.fit(df_class[['x','y','z']] )

Third, now that the model is built and fit to data we can use it to make predictions!

#make some predictions with the model fitted above (predict does not refit it)
df_class['pred_kmeans'] = clus_kmeans.predict(df_class[['x','y','z']] )

#take a look
df_class
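Since the true class of every simulated point is known, one way to read the predictions is a cross-tabulation of true class against predicted cluster. The sketch below uses a tiny made-up dataframe, `df_toy`, standing in for `df_class`; its values are hypothetical. Keep in mind that the cluster numbers k-means assigns are arbitrary, so cluster 0 need not correspond to class 1.

```python
import pandas as pd

# hypothetical labels standing in for df_class['class'] and df_class['pred_kmeans']
df_toy = pd.DataFrame({"class":       [1, 1, 1, 2, 2, 2],
                       "pred_kmeans": [1, 1, 0, 0, 0, 0]})

# rows are the true classes, columns are the predicted clusters
table = pd.crosstab(df_toy["class"], df_toy["pred_kmeans"])
print(table)
```

Each cell counts how many points of a given class landed in a given cluster; a good clustering puts most of each row into a single column.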

Similarly, when measuring the performance of the algorithm, we need to include all dimensions when computing the score.

sk.metrics.silhouette_score(X=df_class.loc[:,['x','y','z']],labels=df_class['pred_kmeans'])
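K-means relies on Euclidean distance, so a dimension with a very large spread (here z) can dominate the result. The sketch below is one way to explore this: it reruns k-means on raw and on standardized features and compares each result to the known classes with the adjusted Rand index. `StandardScaler`, `adjusted_rand_score`, the seed, `n_init`, and `random_state` are additions for this illustration and are not part of the exercise. Whether scaling helps depends on the data, so treat the output as something to inspect, not a guaranteed improvement.

```python
import numpy as np
import pandas as pd
from sklearn import cluster, metrics
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # assumption: seeded for reproducibility
n = 1000

# same two simulated classes as in section 16.1
df_sim = pd.concat([
    pd.DataFrame({"x": rng.normal(1, 4, n), "y": rng.normal(1, 4, n),
                  "z": rng.normal(1, 40, n), "class": 1}),
    pd.DataFrame({"x": rng.normal(8, 5, n), "y": rng.normal(8, 1, n),
                  "z": rng.normal(8, 0.5, n), "class": 2}),
], ignore_index=True)

features = df_sim[["x", "y", "z"]]
# rescale every dimension to mean 0 and standard deviation 1
scaled = StandardScaler().fit_transform(features)

scores = {}
for name, X in [("raw", features), ("scaled", scaled)]:
    km = cluster.KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = km.fit_predict(X)
    # agreement with the true classes: 1 is perfect, about 0 is chance level
    scores[name] = metrics.adjusted_rand_score(df_sim["class"], labels)

print(scores)
```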

Finally, let’s plot those predictions in 3D.

#build a figure with three axes
fig_pred = px.scatter_3d(df_class, x='x', y='y', z='z', color='pred_kmeans')
fig_pred.show()

16.4. Bonus

Try clustering the above dataset again, this time using the HDBSCAN algorithm.

!pip install hdbscan
import hdbscan

#initialize the hdbscan algorithm (hyperparameter - choose minimum cluster size)
clus_hdbscan = hdbscan.HDBSCAN(min_cluster_size = ?)

Second, fit the ML algorithm to the data.

#fit the model
clus_hdbscan.fit(df_class[['?','?','?']] )

Third, use the ML algorithm to make some predictions.

#make some predictions
df_class['pred_khdbscan'] = clus_hdbscan.fit_predict(?)

#build a figure with three axes
fig_Hpred = px.scatter_3d(df_class, x='?', y='?', z='?', color='?')
fig_Hpred.show()

How well does the HDBSCAN algorithm do, and does this change with your choice of minimum cluster size? How do the results differ from k-means?
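When comparing results, it helps to know that HDBSCAN chooses the number of clusters itself and labels points it considers noise as -1, whereas k-means assigns every point to one of the requested clusters. The helper below, `summarize_labels`, is a hypothetical function written for this illustration; it counts the clusters and the share of noise points in any label array, shown here on made-up labels.

```python
import numpy as np

def summarize_labels(labels):
    """Return (number of clusters, fraction of points labeled as noise)."""
    labels = np.asarray(labels)
    # every distinct label except -1 is a cluster
    n_clusters = len(set(labels.tolist()) - {-1})
    # -1 marks points that were not assigned to any cluster
    noise_fraction = float(np.mean(labels == -1))
    return n_clusters, noise_fraction

# made-up labels: two clusters (0 and 1) and two noise points
example_labels = [0, 0, 1, -1, 1, 0, -1, 1]
n_found, frac_noise = summarize_labels(example_labels)
print(n_found, frac_noise)
```

Applied to the predicted label column for several values of the minimum cluster size, a summary like this gives a quick view of how the clustering changes.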

16.5. Further reading

Plotly tutorials

If you would like the notebook without missing code, check out the full code version.

Practical exercises in data science - PEDS
