Class Data Challenge 1

17. Class Data Challenge 1

Welcome to the first simulation challenge! We will use this competition to try out what we’ve learnt so far, and to learn some important lessons about clustering and visualizing data.

The challenge will follow five stages:

1. Simulate some data where you know the answer.
2. Post your data to Slack.
3. Find data from another team (mark it with a thumbs-up).
4. Reply with your prediction of the number of clusters.
5. Return to step 1, this time adding one more dimension to your data.

This challenge is meant to develop our collaboration and data-sharing skills, as well as our visualization and clustering skills!

We will be in breakout groups of three, so feel free to message me on Slack and I’ll come by to help out!

Load the libraries we will need to generate and plot some points.

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

17.1. Simulate some data

#Simulate a cluster
cluster1_x = np.random.normal(0,10,size=1000)
cluster1_y = np.random.normal(0,10,size=1000)

#put it into a dataframe
d = {'x':cluster1_x,'y':cluster1_y,'cluster':'1'}
df_clusters = pd.DataFrame(data=d)

#take a look
df_clusters.head(5)

Visualize your cluster.

#scatter plot
sns.scatterplot(data=df_clusters, x="x", y="y",hue='cluster')

Add another cluster!

#Simulate a cluster
cluster2_x = np.random.uniform(5,10,size=1000)
cluster2_y = np.random.uniform(5,10,size=1000)

#put it into a dataframe
d2 = {'x':cluster2_x,'y':cluster2_y,'cluster':'2'}
df_cluster2 = pd.DataFrame(data=d2)

#add it to the previous cluster
#(DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
df_clusters = pd.concat([df_clusters, df_cluster2], ignore_index=True)

#take a look
df_clusters.head(5)

sns.scatterplot(data=df_clusters, x="x", y="y",hue='cluster')

Things to try

Change the mean of your clusters (the first argument): np.random.normal(50,10,size=1000)
Change the spread of your clusters (the second argument, the standard deviation): np.random.normal(50,10,size=1000)
Change the number of points in your clusters (the size argument): np.random.normal(50,10,size=1000)
Change the distribution of your cluster:

np.random.uniform(5,10,size=100)
np.random.normal(5,10,size=100)

Change the number of dimensions, e.g.:

cluster_x = np.random.uniform(5,10,size=1000)
cluster_y = np.random.uniform(5,10,size=1000)
cluster_z = np.random.uniform(5,10,size=1000)
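
The suggestions above all change one argument at a time. As a minimal sketch, they can be gathered into one helper so that the mean, spread, number of points, and number of dimensions are explicit parameters. The function name `simulate_cluster` and its parameter names are hypothetical (not part of the original notebook), and it only draws from a normal distribution.

```python
import numpy as np
import pandas as pd

def simulate_cluster(center, spread, n_points, label, rng):
    """Draw one Gaussian cluster; len(center) sets the number of dimensions."""
    center = np.asarray(center, dtype=float)
    if center.size > 3:
        raise ValueError("this sketch only names up to three dimensions (x, y, z)")
    # One row per point, one column per dimension
    points = rng.normal(loc=center, scale=spread, size=(n_points, center.size))
    df = pd.DataFrame(points, columns=["x", "y", "z"][:center.size])
    df["cluster"] = str(label)
    return df

rng = np.random.default_rng(seed=42)  # fixed seed so the example is repeatable

# Two 3D clusters with different means, spreads, and sizes
df_sim = pd.concat(
    [
        simulate_cluster([0, 0, 0], spread=10, n_points=1000, label=1, rng=rng),
        simulate_cluster([50, 50, 50], spread=5, n_points=200, label=2, rng=rng),
    ],
    ignore_index=True,
)
df_sim.head(5)
```

Passing a two-element `center` such as `[0, 0]` gives a 2D cluster with only `x` and `y` columns.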

Things to ask yourself

Do you think it is easy to find out how many clusters there are?
Is it impossible to figure it out?
How can you visualize something that is in 3D well enough to see the clusters you made?
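
Step 2 of the challenge is to post your data to Slack. One possible way to prepare the file is sketched below, assuming you want to hide the answer from the other team: it drops the `cluster` column, shuffles the rows, and writes a CSV. The file name follows the `MyTeamName_clusters.csv` pattern used when loading data later on, and `MyTeamName` is a placeholder. The small data frame here is only a stand-in for the `df_clusters` you simulated.

```python
import numpy as np
import pandas as pd

# Stand-in for the df_clusters you simulated earlier
rng = np.random.default_rng(seed=0)
df_clusters = pd.DataFrame({
    "x": rng.normal(0, 10, size=100),
    "y": rng.normal(0, 10, size=100),
    "cluster": "1",
})

# Drop the answer column and shuffle the rows so row order does not give the clusters away
df_share = df_clusters.drop(columns="cluster").sample(frac=1, random_state=0)

# Write without the index column; replace MyTeamName with your team name
df_share.to_csv("MyTeamName_clusters.csv", index=False)
```

In Colab the file should then appear in the files tab, from where it can be downloaded and posted.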

17.3. Predict the number of clusters in other datasets!

Import the data from another group into your Colab notebook:

Download a dataset from Slack
Drag and drop the file into the files tab (to the left)
Load in the data using pandas

#use read_csv to load the data (Hint: right click on the file you want, then copy path)
df_other = pd.read_csv('/content/MyTeamName_clusters.csv') 

#take a look
df_other

Try visualizing the data.

#scatter plots

Try out some of the clustering algorithms we learnt.

# k-means clustering (elbow method)
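
# A minimal sketch of the elbow method, assuming scikit-learn is available:
# fit k-means for a range of k and record the inertia (within-cluster sum of squares).
# The array X is simulated stand-in data; with real data you might use
# df_other.to_numpy() on the numeric columns instead.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data: three well-separated 2D clusters
rng = np.random.default_rng(seed=2)
centers = np.array([[0, 0], [20, 20], [40, 0]])
X = np.vstack([rng.normal(c, 1.0, size=(200, 2)) for c in centers])

ks = range(1, 9)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Look for the k after which the inertia stops dropping sharply (the "elbow")
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` (for example with `plt.plot`) usually makes the elbow easier to spot than the printed numbers.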

# HDBSCAN clustering (minimum number of points in a cluster)

Which method does better on the mean silhouette score, and how many clusters does it suggest?
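
One way to put a number on this is scikit-learn's `silhouette_score`, which returns the mean silhouette coefficient over all points (between -1 and 1, higher is better). The sketch below scores k-means for several values of k on simulated stand-in data; the same call works for any set of labels, as long as at least two distinct labels are present.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in data: three well-separated 2D clusters
rng = np.random.default_rng(seed=4)
centers = np.array([[0, 0], [20, 20], [40, 0]])
X = np.vstack([rng.normal(c, 1.0, size=(200, 2)) for c in centers])

# The silhouette score needs at least two clusters, so start at k=2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("best k by mean silhouette score:", best_k)
```

When scoring a density-based result, decide first how to treat noise points (label -1), since including them as a "cluster" changes the score.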

Things to ask yourself:

Do the visualization and clustering algorithms agree on the number of clusters?

When you think you have an answer, post it to Slack and see if you are right! Feel free to communicate and collaborate with each other!
