

17. Class Data Challenge 1

Welcome to the first simulation challenge! We will use this competition to try out what we’ve learnt so far, and to learn some important lessons about clustering and visualizing data.

The challenge will follow five stages:

  1. Simulate some data where you know the answer

  2. Post your data to Slack

  3. Find data from another team (mark it with a thumbs up)

  4. Reply with your predictions of the number of clusters

  5. Start back at step 1, this time adding one more dimension.

This challenge is meant to develop our collaboration and data sharing skills, as well as our visualization and clustering skills!

As we will be in breakout groups of three, feel free to contact me on Slack and I’ll come by to help out!

Load the libraries we will need to generate some points

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

17.1. Simulate some data

#Simulate a cluster
cluster1_x = np.random.normal(0,10,size=1000)
cluster1_y = np.random.normal(0,10,size=1000)

#put it into a dataframe
d = {'x':cluster1_x,'y':cluster1_y,'cluster':'1'}
df_clusters = pd.DataFrame(data=d)

#take a look
df_clusters.head(5)

Visualize your cluster.

#scatter plot
sns.scatterplot(data=df_clusters, x="x", y="y",hue='cluster')

Add another cluster!

#Simulate a cluster
cluster2_x = np.random.uniform(5,10,size=1000)
cluster2_y = np.random.uniform(5,10,size=1000)

#put it into a dataframe
d2 = {'x':cluster2_x,'y':cluster2_y,'cluster':'2'}
df_cluster2 = pd.DataFrame(data=d2)

#add it to the previous cluster (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df_clusters = pd.concat([df_clusters, df_cluster2], ignore_index=True)

#take a look
df_clusters.head(5)
sns.scatterplot(data=df_clusters, x="x", y="y",hue='cluster')

Things to try

  • Change the mean of your clusters (the first argument): np.random.normal(50,10,size=1000)

  • Change the spread of your clusters (the second argument, the standard deviation): np.random.normal(50,10,size=1000)

  • Change the number of points in your clusters (the size argument): np.random.normal(50,10,size=1000)

  • Change the distribution of your cluster:

np.random.uniform(5,10,size=100)
np.random.normal(5,10,size=100)

  • Change the number of dimensions, e.g.,:

cluster_x = np.random.uniform(5,10,size=1000)
cluster_y = np.random.uniform(5,10,size=1000)
cluster_z = np.random.uniform(5,10,size=1000)
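The variations above can be bundled into one small helper so that each team member can generate clusters the same way. This is only a sketch: the function name make_cluster, its parameters, and the seed are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Seeded generator so the same data can be regenerated (the seed value is arbitrary)
rng = np.random.default_rng(42)

def make_cluster(mean, spread, n_points, label, n_dims=2):
    """Hypothetical helper: draw one normally distributed cluster."""
    # One column per dimension, named x, y, z, w (up to four dimensions)
    names = ["x", "y", "z", "w"][:n_dims]
    data = {name: rng.normal(mean, spread, size=n_points) for name in names}
    data["cluster"] = label  # scalar label is repeated for every row
    return pd.DataFrame(data)

# Two clusters that differ in mean, spread, and number of points
df_demo = pd.concat(
    [make_cluster(0, 10, 1000, "1"), make_cluster(50, 5, 200, "2")],
    ignore_index=True,
)
df_demo.head(5)
```

Passing n_dims=3 adds a z column, which is one way to handle step 5 of the challenge.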

Things to ask yourself

  • Do you think it is easy to find out how many clusters there are?

  • Is it impossible to figure it out?

  • How can you visualize something that is in 3D well enough to see the clusters you made?
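One possible answer to the 3D question is to plot every pair of dimensions side by side. The sketch below assumes two made-up normal clusters and uses plain matplotlib; the cluster centres, sizes, and seed are illustrative choices only.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)  # arbitrary seed

# Two hypothetical 3D clusters, 500 points each
centers = {"1": 0, "2": 30}
frames = []
for label, center in centers.items():
    frames.append(pd.DataFrame({
        "x": rng.normal(center, 5, size=500),
        "y": rng.normal(center, 5, size=500),
        "z": rng.normal(center, 5, size=500),
        "cluster": label,
    }))
df_3d = pd.concat(frames, ignore_index=True)

# One panel per pair of dimensions: (x, y), (x, z), (y, z)
pairs = [("x", "y"), ("x", "z"), ("y", "z")]
fig, axes = plt.subplots(1, len(pairs), figsize=(12, 4))
for ax, (a, b) in zip(axes, pairs):
    for label, group in df_3d.groupby("cluster"):
        ax.scatter(group[a], group[b], s=5, alpha=0.5, label=f"cluster {label}")
    ax.set_xlabel(a)
    ax.set_ylabel(b)
axes[0].legend()
```

Clusters that overlap in one panel may separate in another, which is why looking at all pairs can help. seaborn's pairplot produces a similar grid in one call.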

17.2. Post your data to Slack!

#randomly shuffle the rows around
df_clusters = df_clusters.sample(frac=1)

#remove the cluster column (i.e., no labels!)
df_clusters = df_clusters.drop('cluster', axis=1)

#save your dataframe to a csv file
df_clusters.to_csv("MyTeamName_clusters.csv", index=False)

The CSV file should now be saved in your Colab session's files (it only lands in Google Drive if you have mounted your Drive and saved there). Click on Files on the left to see and download your data. Once downloaded, post it in the simulation_competition1 channel on Slack.
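Note that shuffling and dropping labels in place means your team loses its own answer key. A minimal sketch of one alternative, assuming hypothetical names df_labelled, df_share, and df_check, keeps the labelled frame and previews exactly what other teams will load:

```python
import io
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)  # arbitrary seed

# Small stand-in for your labelled data
df_labelled = pd.DataFrame({
    "x": rng.normal(0, 10, size=100),
    "y": rng.normal(0, 10, size=100),
    "cluster": ["1"] * 50 + ["2"] * 50,
})

# Shuffle and drop the labels on a copy; df_labelled stays as the answer key
df_share = (
    df_labelled.sample(frac=1, random_state=7)
    .drop(columns="cluster")
    .reset_index(drop=True)
)

# Round-trip through CSV text in memory to preview the shared file
buffer = io.StringIO()
df_share.to_csv(buffer, index=False)
buffer.seek(0)
df_check = pd.read_csv(buffer)

print(df_check.columns.tolist())
print(df_labelled["cluster"].nunique())
```

The in-memory buffer is only there to avoid writing a file in this sketch; in the notebook you would call to_csv with your team's file name as shown above.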

17.3. Predict the number of clusters in other datasets!

Import the data into your colab:

  1. Download a dataset from Slack

  2. Drag and drop the file into the files tab (to the left)

  3. Load in the data using pandas

#use read_csv to load the data (Hint: right click on the file you want, then copy path)
df_other = pd.read_csv('/content/MyTeamName_clusters.csv') 

#take a look
df_other

Try visualizing the data

#scatter plots

Try out some of the clustering algorithms we learnt:

# k-means clustering (elbow method)
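As a starting point for the elbow method, the sketch below fits k-means for a range of k and records the inertia (within-cluster sum of squares). It assumes scikit-learn is available and uses a made-up stand-in dataset, df_standin, in place of the file you downloaded.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)  # arbitrary seed

# Stand-in for df_other: three well-separated blobs, no labels
centers = [(0, 0), (40, 40), (0, 40)]
points = np.vstack([rng.normal(c, 3, size=(200, 2)) for c in centers])
df_standin = pd.DataFrame(points, columns=["x", "y"])

# Fit k-means for k = 1..8 and store the inertia of each fit
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(df_standin)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia))
```

The "elbow" is the k after which the inertia stops dropping sharply. On real data from another team the bend may be much less obvious than on this stand-in.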

# HDBSCAN clustering (minimum number of points in a cluster)

Which method achieves the better mean silhouette score, and how many clusters does it suggest?
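One way to compare candidate answers is to compute the mean silhouette coefficient for each set of labels. The sketch below does this for k-means over several values of k, again on a made-up stand-in dataset and assuming scikit-learn is available; the same silhouette_score call can be applied to labels from any other algorithm.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)  # arbitrary seed

# Stand-in for df_other: three well-separated blobs, no labels
centers = [(0, 0), (40, 40), (0, 40)]
points = np.vstack([rng.normal(c, 3, size=(200, 2)) for c in centers])
df_standin = pd.DataFrame(points, columns=["x", "y"])

# The silhouette needs at least two clusters, so start at k = 2
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(df_standin)
    scores[k] = silhouette_score(df_standin, labels)

# Pick the k with the highest mean silhouette coefficient
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 2))
```

HDBSCAN marks noise points with the label -1. If you score its output the same way, consider dropping those points first, otherwise the noise is treated as one more cluster.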

Things to ask yourself:

  • Do the visualization and clustering algorithms agree on the number of clusters?

When you think you have an answer, post it to Slack and see if you are right! Feel free to communicate and collaborate with each other!