{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HkP-sbomvut5"
},
"source": [
"## Clustering algorithms"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NQSi-Pjtv09u"
},
"source": [
"We will look at developing skills to detect clusters in your data. Identifying clusters can be very helpful when exploring your data, and is a form of unsupervised machine learning.\n",
"> To perform clustering we will make use of the **sklearn** library. This library is a very useful machine learning library in python. We'll use this library a lot in this course!"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "rrntCePBvuDZ"
},
"source": [
"There is no single best clustering algorithm, as it will often depend on the data you have and the question you'd like to address. In this exercise we will look at two good clustering algorithms.\n",
"\n",
"* K-means\n",
"* Hdbscan"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zVdrSn8sZWfH"
},
"source": [
"Load python libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_gbNhHczrzC2"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import sklearn as sk\n",
"\n",
"from sklearn import cluster"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "YlV3DICbZa_B"
},
"source": [
"### Simulating data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2iZQMTmLZnWC"
},
"source": [
"Simulating data is a great way to test out different machine learning algorithms. Here we'll use the numpy library to create some random numbers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Akco80d4Z2MG"
},
"outputs": [],
"source": [
"#simulate 100 values from a normal distribution with a mean of 1 and an sd of 4\n",
"x = np.random.normal(1,4, size = 10)\n",
"\n",
"#take a look\n",
"x"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wJZnfzlmr01l"
},
"source": [
"Let's simulate some data where we know how many clusters there are. \n",
"\n",
"> First let's create 1000 points and set them to class 1. Each point will get a random x and y coordinate.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cPRKRQ5BwR_i"
},
"outputs": [],
"source": [
"#simulate some random values\n",
"array_class1 = {\"x\":np.random.normal(1,4, size=1000),\n",
" \"y\":np.random.normal(1,4, size=1000),\n",
" \"class\": 1}\n",
"\n",
"#put them in a dataframe\n",
"df_class1 = pd.DataFrame(data=array_class1)\n",
"\n",
"#plot it\n",
"sns.scatterplot(data=df_class1, x=\"x\",y=\"y\")\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RY0aNadI2B5g"
},
"source": [
"Create another set of 1000 points and assign them to class 2. Then we'll add the two sets of points together by using **concat**:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "OzXYdm-b2GgT"
},
"outputs": [],
"source": [
"#generate some random values\n",
"array_class2 = {\"x\":np.random.normal(8,5, size=1000),\n",
" \"y\":np.random.normal(8,1, size=1000),\n",
" \"class\": 2}\n",
"\n",
"#put them in a dataframe\n",
"df_class2 = pd.DataFrame(data=array_class2)\n",
"\n",
"#bind the two dataframes together by rows\n",
"df_class = pd.concat([df_class1,df_class2], axis = 0) #axis=0 just says to bind by rows, axis=1 would be by columns \n",
"\n",
"#plot it\n",
"sns.scatterplot(data=df_class, x=\"x\",y=\"y\", hue='class')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "J8rChnbytk-O"
},
"source": [
"### K-Means"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7UtPQ_MO32ju"
},
"source": [
"Let's try out the first classification algorithm: k-means. \n",
"> Sklearn uses a standard approach to machine learning models. We'll go through each step with k-means."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "B9Bpf4vGcL0b"
},
"source": [
"**First** let's build the machine learning algorithm that we will use (i.e., k-means)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "vVy7GtvOcVkT"
},
"outputs": [],
"source": [
"#initalize the kmeans algorithm\n",
"clus_kmeans = cluster.KMeans(n_clusters=2) "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "-KQTFISrcWIc"
},
"source": [
"**Second** let's fit the model using data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "9ovnDjraccDV"
},
"outputs": [],
"source": [
"#fit the model\n",
"clus_kmeans.fit(df_class[['x','y']] )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7mKx6_ihcdrs"
},
"source": [
"**Third**, now that the model is built and fit to data we can use it to make predictions!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "JBeiP5E6cjoC"
},
"outputs": [],
"source": [
"#make some predictions\n",
"df_class['pred_kmeans'] = clus_kmeans.fit_predict(df_class[['x','y']] )\n",
"\n",
"#take a look\n",
"df_class"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4jZwDckyclQx"
},
"source": [
"**Finally**, we can visualize the predictions it made by using seaborn"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "KzgGYWxT32IK"
},
"outputs": [],
"source": [
"#plot it!\n",
"sns.scatterplot(data=df_class,x=\"x\",y=\"y\",hue=\"pred_kmeans\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b4XDjNLIs6gW"
},
"source": [
"This is what is considered as a hard clustering algorithm - it assigns every point to a cluster. It also makes some assumptions about clusters:\n",
"\n",
"* You know in advance how many clusters there are\n",
"* Clusters are roughly oval shaped (normally distributed along each dimension)\n",
"\n",
"\n"
]
},
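{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see that second assumption in action, here is a quick sketch (the data and names below, like df_stretch, are made up for illustration) where one cluster is stretched out. Even given the true number of clusters, k-means can split the long cluster rather than recover the original groups."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#simulate one long thin cluster and one compact cluster (illustrative data)\n",
"df_stretch = pd.DataFrame({\"x\": np.concatenate([np.random.normal(0, 10, size=500), np.random.normal(25, 1, size=500)]),\n",
"                           \"y\": np.concatenate([np.random.normal(0, 1, size=500), np.random.normal(0, 1, size=500)])})\n",
"\n",
"#fit k-means with the true number of clusters (2) and predict\n",
"df_stretch['pred'] = cluster.KMeans(n_clusters=2).fit_predict(df_stretch[['x','y']])\n",
"\n",
"#plot it - watch where the boundary falls relative to the long cluster\n",
"sns.scatterplot(data=df_stretch, x=\"x\", y=\"y\", hue=\"pred\")"
]
},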
{
"cell_type": "markdown",
"metadata": {
"id": "uQk_QgTyRxsc"
},
"source": [
"Go back and try shifting the mean and standard deviation of the simulated clusters and see how well k-means does. Try changing the number of clusters k-means looks for. When does it break down?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When this book is used within a class setting there are often good points to break up long sections of coding exercises. We'll use this stop sign (below) to indicate when we will take a break from coding and discuss more theoretical issues."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7hyZXZE4d0VM"
},
"source": [
"
"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "msRNBxDx8Dcn"
},
"source": [
"### Tuning machine learning algorithms"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1Ljpkm-k8GLV"
},
"source": [
"The elbow method can be used to help choose how many clusters are likely in the data. It also introduces us to how tune machine learning algorithms by running the model using different parameters and monitoring how well the algorithm performs."
]
},
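{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch of the elbow idea, we can refit k-means with an increasing number of clusters and print the fitted model's **inertia_** attribute (the within-cluster sum of squared distances). The point where inertia stops dropping sharply (the 'elbow') suggests a reasonable number of clusters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#inertia (within-cluster sum of squared distances) for 1 to 4 clusters\n",
"print(cluster.KMeans(n_clusters=1).fit(df_class[['x','y']]).inertia_)\n",
"print(cluster.KMeans(n_clusters=2).fit(df_class[['x','y']]).inertia_)\n",
"print(cluster.KMeans(n_clusters=3).fit(df_class[['x','y']]).inertia_)\n",
"print(cluster.KMeans(n_clusters=4).fit(df_class[['x','y']]).inertia_)"
]
},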
{
"cell_type": "markdown",
"metadata": {
"id": "dVMH7tGuMIZo"
},
"source": [
"First let's measure how successfull the clustering algorithm is. There are many ways to measure *success* in clustering, here we will use *mean silhouette distance* which describes how far apart the points within clusters are from each other. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QnLfbUGHANYT"
},
"outputs": [],
"source": [
"#Mean silhouette distance\n",
"sk.metrics.silhouette_score(X=df_class.iloc[:,0:2],labels=df_class['pred_kmeans'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "O_KkpICpbLgK"
},
"source": [
"Try the model above with a different number of clusters and see how this mean silhouette value changes. At what number of clusters is the mean silhouette distance the largest (large means more seperation between clusters).\n",
"> e.g., is the true number of clusters the highest?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "vuAEma-4joKa"
},
"source": [
"### Make our own functions"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "He7j5J_gf9TA"
},
"source": [
"Note: here we are changing a parameter and re-running the code. In cases like these it can be sometimes useful to create our own functions. Let's take a look at how to create a function in python."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "pLdGcJC_gpIM"
},
"outputs": [],
"source": [
"#here is a simple function\n",
"def run_k_means(numb):\n",
" return numb"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zP2ZVSeSg_Pc"
},
"source": [
"Now that we've created it let's use it"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "l51Z6e1rhB2u"
},
"outputs": [],
"source": [
"run_k_means(2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f56PFqxGhG1w"
},
"source": [
"This function just takes one input (numb) and then prints that number.\n",
"\n",
"Let's add each of the model steps above inside the function, with the goal of printing the average silhouette distance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-te3FE1whFlY"
},
"outputs": [],
"source": [
"def run_k_means(numb):\n",
" \n",
" #1. initalize the ml algorithm\n",
" clus_kmeans = cluster.KMeans(n_clusters=numb)\n",
"\n",
" #2. fit the ml algorithm\n",
" clus_kmeans.fit(df_class[['x','y']] )\n",
"\n",
" #3. make some predictions\n",
" df_class['pred_kmeans'] = clus_kmeans.fit_predict(df_class[['x','y']] )\n",
"\n",
" #4. measure performance\n",
" avg_sil = sk.metrics.silhouette_score(X=df_class.iloc[:,0:2],labels=df_class['pred_kmeans'])\n",
"\n",
" return avg_sil"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "795CzG3qiAvH"
},
"source": [
"Notice here that the input (numb) is now used to build a kmeans model with n_clusters = numb\n",
" \n",
"Let's use this new function and see how it works!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "uV-H-W77iNcG"
},
"outputs": [],
"source": [
"run_k_means(4)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TiNgoz4DiVR3"
},
"source": [
"Is it much easier to figure out what number of clusters results in the highest average silhouette score?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "b8LVYw3Hj3k_"
},
"source": [
"### Using loops"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dzMureKUiloS"
},
"source": [
"One last coding trick we'll learn today, is how to use a loop to make this even easier. \n",
" \n",
"Let's first look at how loops work in python"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZrpW7LEpiyeT"
},
"outputs": [],
"source": [
"for i in range(2,10):\n",
" print(i)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fTD5DcEhi8k3"
},
"source": [
"Now that we know the basic structure of a loop, and what it can do, let's use it to help figure out what number of clusters is optimal!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "cDNFz6AijGUS"
},
"outputs": [],
"source": [
"for i in range(2,10):\n",
" print(run_k_means(i))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6dUMpF5Bj_Ny"
},
"source": [
"Let's plot this out! To store the silhouette values as we run the loop, we'll create an empty list (avg_sil) and each run in the loop we'll add the average silhouette value to the list using **append**. \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "bEgM7uvSkGxR"
},
"outputs": [],
"source": [
"avg_sil = []\n",
"for i in range(2,10):\n",
" avg_sil.append(run_k_means(i))"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0fIxrz8dm_oD"
},
"source": [
"Now that we have a list of average silhouette values, let's place these values in a **dataframe** and plot them using **seaborn**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-koLyLENj-et"
},
"outputs": [],
"source": [
"#create a dictionary\n",
"dict_sil = {'sil':avg_sil, 'clusters':list(range(2,10)) }\n",
"\n",
"#convert the dictionary into a dataframe\n",
"df_sil = pd.DataFrame(dict_sil)\n",
"\n",
"#plot\n",
"sns.relplot(x='clusters', y='sil', data=df_sil, kind=\"line\")"
]
},
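{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rather than reading the best value off the plot by eye, we can also pull it straight out of the dataframe. A one-line sketch using pandas' **idxmax**:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#the row with the highest average silhouette score\n",
"df_sil.loc[df_sil['sil'].idxmax()]"
]
},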
{
"cell_type": "markdown",
"metadata": {
"id": "HD1xfmrFtnuI"
},
"source": [
"### HDBscan"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "g4D1EMPcKYC9"
},
"source": [
"Ok let's see how another algorithm does: hdbscan. This method is newer than k-means and focuses on the finding high density regions surrounded by low density."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sE0Sp761rRqe"
},
"source": [
"To run this clustering algorithm we need to install another python library. Up until now we've been using the libraries on colab, but now we'd like to use one that is not already installed. To do this we install it manually using the !pip command. Then import it just like any other library."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1jrDNJj8rTTY"
},
"outputs": [],
"source": [
"!pip install hdbscan\n",
"import hdbscan"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SJOiRzzArUwy"
},
"source": [
"Now we can use the **hdbscan** clustering method, it will follow the same steps as sklearn."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MEv1JlRHqlTi"
},
"source": [
"First initalize the ml algorithm"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "3MuVt9PPqkzL"
},
"outputs": [],
"source": [
"#initialize the kmeans algorithm (hyperparameter - choose minimum cluster size)\n",
"clus_hdbscan = hdbscan.HDBSCAN(min_cluster_size = 30) "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tkm4aTkRqwJZ"
},
"source": [
"Second fit the ml algorithm to the data\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "sVsRb1VOq3Qa"
},
"outputs": [],
"source": [
"#fit the model\n",
"clus_hdbscan.fit(df_class[['x','y']] )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "x-wtYK27q3xK"
},
"source": [
"Third use the ml algorithm to make some predictions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "e1ESWmAtq7_o"
},
"outputs": [],
"source": [
"#make some predictions\n",
"df_class['pred_khdbscan'] = clus_hdbscan.fit_predict(df_class[['x','y']] )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ytAJNmDxrFnN"
},
"source": [
"Finally, visualize the predictions of the ml algorithm"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "uJhTAIZiKYwP"
},
"outputs": [],
"source": [
"#plot it!\n",
"sns.scatterplot(data=df_class, x=\"x\",y=\"y\", hue='pred_khdbscan')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AVc6JC13Ph2v"
},
"source": [
"Notice the negative class? This is HDBscans way of saying that it isn't sure about those points. It is a soft clustering algorithm in that it will give you the probability of each point belonging to a class."
]
},
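{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sketch of that soft side: after fitting, the hdbscan library exposes a **probabilities_** attribute, one value per point, where 0 means noise and values near 1 mean the point sits firmly inside its cluster."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#cluster membership strength for each point (0 = noise, near 1 = core of a cluster)\n",
"df_class['prob_hdbscan'] = clus_hdbscan.probabilities_\n",
"\n",
"#take a look\n",
"df_class.head()"
]
},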
{
"cell_type": "markdown",
"metadata": {
"id": "-V8u6b9FZs8h"
},
"source": [
"We can now measure the mean silhouette distance for HDBscan clustering."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "SAP3CEINZtTk"
},
"outputs": [],
"source": [
"#estimate silhouette score for k-means\n",
"sk.metrics.silhouette_score(X=df_class.loc[:,['x','y']], labels=df_class['pred_khdbscan'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "kTu1QMGgZyq-"
},
"outputs": [],
"source": [
"#get only those points that are given a cluster\n",
"df_class_sub = df_class[df_class.pred_khdbscan > -1]\n",
"\n",
"#estimate silhouette score for k-means\n",
"sk.metrics.silhouette_score(X=df_class_sub.loc[:,['x','y']],labels=df_class_sub['pred_khdbscan'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZA70g5pBcX-R"
},
"source": [
"How does it do compared to k-means? Does that compare with what you see visually?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OqbAx2B7th6x"
},
"source": [
"### Make our own functions"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Tr3MnrrQtzVm"
},
"source": [
"Let's try to make another function just like the one we did for the k-means algorithm. \n",
"> feel free to copy and paste the steps from above\n",
"> remember to think carefully about where the input **minNumb** will go in the code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_kD3cTSZt776"
},
"outputs": [],
"source": [
"def run_HDBScan(minNumb):\n",
" \n",
" #1. initalize the ml algorithm\n",
" clus_hdbscan = hdbscan.HDBSCAN(min_cluster_size = ?) \n",
"\n",
" #2. fit the ml algorithm\n",
" clus_hdbscan.?\n",
"\n",
" #3. make some predictions\n",
" df_class['pred_khdbscan'] = clus_hdbscan.?\n",
"\n",
" #4. measure performance\n",
" avg_sil = sk.metrics.silhouette_score(?)\n",
"\n",
" return avg_sil"
]
},
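{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you get stuck, here is one possible way to fill in the blanks (the full-code notebook linked at the end has the complete version):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#one possible completion of the function above\n",
"def run_HDBScan(minNumb):\n",
"\n",
"    #1. initialize the ml algorithm\n",
"    clus_hdbscan = hdbscan.HDBSCAN(min_cluster_size = minNumb)\n",
"\n",
"    #2. fit the ml algorithm\n",
"    clus_hdbscan.fit(df_class[['x','y']])\n",
"\n",
"    #3. make some predictions\n",
"    df_class['pred_khdbscan'] = clus_hdbscan.labels_\n",
"\n",
"    #4. measure performance\n",
"    avg_sil = sk.metrics.silhouette_score(X=df_class[['x','y']], labels=df_class['pred_khdbscan'])\n",
"\n",
"    return avg_sil"
]
},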
{
"cell_type": "markdown",
"metadata": {
"id": "PqCGtqsPuz2C"
},
"source": [
"Try out you new function and see if it works!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "uZ9mfQt0u3AW"
},
"outputs": [],
"source": [
"#run your function with 500 as the minimum number of points\n",
"run_HDBScan(40)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "3-xSVjhsvCkb"
},
"source": [
"### Running loops"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NZeXpxw7vJlY"
},
"source": [
"Let's try and use a loop to try out a bunch of different values for the minimum number of points parameter.\n",
"\n",
"> the **range()** function follows range(from,to,by) format. e.g., range(10,70,10) = 10, 20, 30, 40, 50, 60"
]
},
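{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a minimal check of what that range call produces:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#range(from, to, by) - note the 'to' value is not included\n",
"list(range(10,70,10))"
]
},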
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "pAmtU97fviYa"
},
"outputs": [],
"source": [
"for ? in range(?,?,?):\n",
" print(run_HDBScan(?))"
]
},
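{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to fill in this loop, trying minimum cluster sizes from 10 to 60 in steps of 10 (matching the range example above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#try a range of minimum cluster sizes\n",
"for i in range(10,70,10):\n",
"    print(run_HDBScan(i))"
]
},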
{
"cell_type": "markdown",
"metadata": {
"id": "x0h2nxpEdfJc"
},
"source": [
"### Bonus"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "zl4W4XfsxCvO"
},
"source": [
"If all is going well, you might want to try to plot the results! \n",
"> You can follow along with how we ploted the results from the k-means."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Sus7XOoDxOL9"
},
"outputs": [],
"source": [
"avg_sil_hd = []\n",
"for ? in range(?):\n",
" avg_sil_hd.?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Si853JxzxNix"
},
"outputs": [],
"source": [
"#create a dictionary\n",
"dict_sil = ?\n",
"\n",
"#convert the dictionary into a dataframe\n",
"df_sil_hd = ?\n",
"\n",
"#plot\n",
"sns.relplot(?)"
]
},
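{
"cell_type": "markdown",
"metadata": {},
"source": [
"And one possible way to fill in the bonus blanks, mirroring the k-means plotting code above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#store the silhouette score for each minimum cluster size\n",
"avg_sil_hd = []\n",
"for i in range(10,70,10):\n",
"    avg_sil_hd.append(run_HDBScan(i))\n",
"\n",
"#create a dictionary\n",
"dict_sil = {'sil':avg_sil_hd, 'min_cluster_size':list(range(10,70,10))}\n",
"\n",
"#convert the dictionary into a dataframe\n",
"df_sil_hd = pd.DataFrame(dict_sil)\n",
"\n",
"#plot\n",
"sns.relplot(x='min_cluster_size', y='sil', data=df_sil_hd, kind=\"line\")"
]
},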
{
"cell_type": "markdown",
"metadata": {
"id": "IPgJHTMQxb6h"
},
"source": [
"Can you find an optimum value for min_cluster_size? If you go back and run the HDBScan algorithm with the optimal min_cluster_size do you see large differences in how it clusters the data?\n",
" \n",
"Finally, what happens if you run your function for large min_cluster sizes? How large can you go before your code fails, and why do you think it failed?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "BJqFrYyyYzMx"
},
"source": [
"### Further reading"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QAdU1AJ0Y1CX"
},
"source": [
"A more detailed look at [HDBScan](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html).\n",
"\n",
"A comparison between a number of different [clustering algorithms](https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html).\n",
"\n",
"> If you would like the notebook without missing code check out the [full code](https://colab.research.google.com/github/tbonne/peds/blob/main/docs/fullNotebooks/full_IntroClustering1_kmeans_HDBScan.ipynb) version."
]
}
],
"metadata": {
"colab": {
"authorship_tag": "ABX9TyOo4Q7S1AJdM5P/BNnO3ICq",
"collapsed_sections": [],
"include_colab_link": true,
"name": "IntroClustering1_kmeans_HDBScan",
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}