20. Applied clustering#
In this exercise we will apply our newfound clustering skills to real data. We’ll look for clusters of credit card users and see if we can identify different classes of users.
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn import cluster
import sklearn as sk
import matplotlib.pyplot as plt
20.1. Load data#
Let’s load a real dataset: a table of credit card usage statistics, with one row per customer. We’ll read it in with pandas and take a quick look at the first few rows.
df_cc = pd.read_csv("/content/CC GENERAL.csv")
df_cc.head()
Check for and handle missing values. Also remove the CUST_ID column.
??
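One possible approach (a sketch; whether you drop rows with missing values or fill them in, e.g. with the column median, is up to you):
# drop the identifier column, which carries no numeric information
df_cc = df_cc.drop(columns=["CUST_ID"])
# count missing values in each column
print(df_cc.isna().sum())
# one option: fill missing numeric values with the column median
df_cc = df_cc.fillna(df_cc.median(numeric_only=True))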
Describe the data.
??
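One way to do this (a minimal sketch):
# summary statistics for each numeric column
df_cc.describe()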
20.2. Visualizing in 3D#
Plot the credit card data in a few different ways. What can you learn about the data?
#import plotly
import plotly.express as px
#build a figure
fig = px.scatter(?)
fig.show()
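Since the section title calls for 3D, here is one possible sketch using plotly’s 3D scatter. The column names BALANCE, PURCHASES, and CREDIT_LIMIT are assumptions about this dataset, so check df_cc.columns and substitute whichever features interest you.
# a 3D scatter of three (assumed) spending-related columns
fig = px.scatter_3d(
    df_cc,
    x="BALANCE",        # assumed column name
    y="PURCHASES",      # assumed column name
    z="CREDIT_LIMIT",   # assumed column name
    opacity=0.4,
)
fig.show()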
Try a heat map. Do you see any strong correlations between features?
??
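A minimal sketch of a correlation heat map using seaborn:
# correlation matrix of all numeric features, drawn as a heat map
plt.figure(figsize=(10, 8))
sns.heatmap(df_cc.corr(numeric_only=True), cmap="coolwarm", center=0)
plt.show()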
20.3. PCA Dimension reduction#
Sometimes we want to work with a reduced-dimension dataset. In these cases we can use different methods; let’s learn how to use principal component analysis (PCA).
This approach creates orthogonal axes that capture the maximum variance in the data.
Before doing PCA (or any other dimensionality reduction), it’s generally a good idea to scale all numeric features so that distances are comparable across features.
from sklearn.preprocessing import StandardScaler
# build the scaler
scaler = StandardScaler()
# Fit and use the scaler to transform our data
X_scaled = scaler.fit_transform(df_cc)
#load in the PCA
from sklearn.decomposition import PCA
# Create the PCA algorithm
pca = PCA(n_components=2)
# Fit the PCA algorithm and use it to transform our data
X_pca = pca.fit_transform(X_scaled)
# Take a look
X_pca
Let’s see how much of the variance in the data can be explained by the first two principal components.
# explained variance of each component
explained_var = pca.explained_variance_ratio_
print("Explained variance ratio:", explained_var)
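The two numbers together tell us how much of the total variance survives the projection onto two components; summing them makes that explicit.
# total variance captured by the first two principal components
print("Total explained variance:", explained_var.sum())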
Let’s take a look at what the PCA did.
# Convert it to a dataframe
pca_df = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])
# Plot it
sns.scatterplot( data=pca_df, x='PC1', y='PC2', alpha=0.5)
plt.title('Credit Card Data Projected onto First 2 Principal Components')
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.1f}% var)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.1f}% var)")
plt.show()
20.4. Cluster on reduced dimensions#
Next let’s cluster on this reduced space.
from sklearn.cluster import KMeans
# Create the algorithm
kmeans = KMeans(n_clusters=4, random_state=42)
# Fit the algorithm to the data
clusters = kmeans.fit_predict(X_pca)
# Plot the clusters
pca_df['Cluster'] = clusters
sns.scatterplot(x='PC1', y='PC2', hue='Cluster', palette='tab10', data=pca_df)
plt.title('K-Means Clusters in PCA Space')
plt.show()
What do these clusters look like on the original data? Let’s do some summary statistics.
# merge the cluster data back to the original data
df_cc_k = pd.merge(df_cc, pca_df, left_index=True, right_index=True)
# let's look at the means within each cluster
df_cc_k.groupby('Cluster').mean()
20.5. PaCMAP Dimension reduction#
Let’s do the steps above again, this time with PaCMAP.
The two key parameters are:

- n_neighbors (size of the local neighborhood)
- n_components (how many dimensions to keep)
# Install the library
!pip install pacmap
import pacmap
# Create the PaCMAP algorithm
embedding = pacmap.PaCMAP(n_components=2, n_neighbors=30, random_state=192)
# Fit the PaCMAP algorithm to the data and use it to transform the data
X_pacmap = embedding.fit_transform(X_scaled)
Let’s plot the data in its reduced dimensions. What do you find?
plt.scatter(X_pacmap[:,0], X_pacmap[:,1], s=10, alpha=0.6)
plt.title("PaCMAP Embedding")
plt.xlabel("PaCMAP1"); plt.ylabel("PaCMAP2")
plt.show()
Now that we’ve reduced the dimensions and can visualize the data in 2D, let’s see how it clusters.
# Build the clustering algorithm
km = KMeans(n_clusters=5, n_init="auto", random_state=42)
# fit the clustering algorithm to the reduced data
labels_km = km.fit_predict(X_pacmap)
# create a scatterplot with points labeled by cluster
emb_km = pd.DataFrame(X_pacmap, columns=["PaCMAP1", "PaCMAP2"])
emb_km["Cluster"] = labels_km
plt.figure(figsize=(7,6))
sns.scatterplot(data=emb_km, x="PaCMAP1", y="PaCMAP2", hue="Cluster", palette="tab10")
plt.title("K-Means Clusters on PaCMAP Embedding")
plt.xlabel("PaCMAP1"); plt.ylabel("PaCMAP2")
plt.legend()
plt.show()
Now that we’ve identified clusters let’s summarize their properties back on the original scale.
df_cc_pacmap = pd.merge(df_cc, emb_km, left_index=True, right_index=True)
df_cc_pacmap.groupby('Cluster').mean()
Try clustering the PaCMAP embedding again, this time using the HDBSCAN algorithm.
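A minimal sketch to get you started, assuming scikit-learn ≥ 1.3 (which ships sklearn.cluster.HDBSCAN); on older versions you can pip install hdbscan and use hdbscan.HDBSCAN the same way. The min_cluster_size value below is just a starting guess.
from sklearn.cluster import HDBSCAN
# build the algorithm; min_cluster_size controls the smallest allowed cluster
hdb = HDBSCAN(min_cluster_size=50)
# fit on the PaCMAP embedding; a label of -1 marks points treated as noise
labels_hdb = hdb.fit_predict(X_pacmap)
# plot the clusters in the embedding space
sns.scatterplot(x=X_pacmap[:, 0], y=X_pacmap[:, 1], hue=labels_hdb, palette="tab10")
plt.title("HDBSCAN Clusters on PaCMAP Embedding")
plt.show()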
20.6. Bonus#
Try clustering again, but this time don’t reduce the dimensions before clustering, and see how the results differ!
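One way to approach this (a sketch): run K-Means directly on the scaled, full-dimensional data and compare the cluster summaries with what we found in the reduced spaces.
# cluster on all scaled features, with no dimensionality reduction
km_full = KMeans(n_clusters=5, n_init="auto", random_state=42)
labels_full = km_full.fit_predict(X_scaled)
# summarize the clusters on the original scale
df_cc_full = df_cc.copy()
df_cc_full["Cluster"] = labels_full
df_cc_full.groupby("Cluster").mean()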