K-Means Clustering

[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(font_scale=1.75)
sns.set_style("white")

import random
np.random.seed(10)

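K-Means clustering in graspologic is a wrapper around Sklearn's KMeans class. Our algorithm finds the best k-means clustering model by iterating over a range of cluster counts and keeping the model with the highest silhouette score, as defined by Sklearn's silhouette_score.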
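The selection idea described above can be sketched with plain Sklearn calls. This is a minimal illustration under stated assumptions, not graspologic's actual implementation: the helper name `select_kmeans`, the demo data from `make_blobs`, and the choice of `n_init=10` are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def select_kmeans(X, max_clusters, random_state=0):
    """Hypothetical helper: fit KMeans for k = 2..max_clusters, keep the best silhouette."""
    best_model, best_score = None, -np.inf
    # The silhouette score is only defined for at least 2 clusters
    for k in range(2, max_clusters + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score


# Assumed demo data: three well-separated blobs in two dimensions
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[(0, 0), (10, 10), (-10, 10)],
    cluster_std=1.0,
    random_state=0,
)
best_model, best_score = select_kmeans(X_demo, max_clusters=6)
print(best_model.n_clusters, round(best_score, 2))
```

A higher silhouette score means points sit closer to their own cluster than to the nearest other cluster, which is why the sketch keeps the maximum rather than the minimum.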

Let's use K-Means clustering on synthetic data and compare it with the existing Sklearn implementation.

Using K-Means on Synthetic Data

[2]:
# Synthetic data

# Dim 1
class_1 = np.random.randn(150, 1)
class_2 = 2 + np.random.randn(150, 1)
dim_1 = np.vstack((class_1, class_2))

# Dim 2
class_1 = np.random.randn(150, 1)
class_2 = 2 + np.random.randn(150, 1)
dim_2 = np.vstack((class_1, class_2))

X = np.hstack((dim_1, dim_2))

# Labels
label_1 = np.zeros((150, 1))
label_2 = 1 + label_1

c = np.vstack((label_1, label_2)).reshape(300,)

# Plotting Function for Clustering
def plot(title, c_hat, X):
    plt.figure(figsize=(10, 10))
    n_components = int(np.max(c_hat) + 1)
    palette = sns.color_palette("deep")[:n_components]
    fig = sns.scatterplot(x=X[:,0], y=X[:,1], hue=c_hat, legend=None, palette=palette)
    fig.set(xticks=[], yticks=[], title=title)
    plt.show()

plot('True Clustering', c, X)
[Figure: "True Clustering" scatter plot of the synthetic data, colored by true label]

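In the existing Sklearn implementation of KMeans clustering, the model's parameters, including the number of components, must be chosen in advance. If the supplied parameters do not match the data, clustering performance can suffer. Performance can be measured with the adjusted Rand index (ARI), a metric whose maximum value is 1. An ARI of 1 means the estimated clustering is identical to the true clustering, a value near 0 is what random labeling would produce, and the score can be negative for especially discordant clusterings.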
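As a small illustration of how ARI behaves, the sketch below scores two toy labelings with Sklearn's `adjusted_rand_score`. The label lists are made up for the example. It shows that ARI compares partitions rather than label names, and that the score is not restricted to non-negative values.

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 1, 1]

# Same grouping with the label names swapped: still a perfect match
same_partition = adjusted_rand_score(truth, [1, 1, 0, 0])

# Each true cluster is split evenly across the predicted clusters
discordant = adjusted_rand_score(truth, [0, 1, 0, 1])

print(same_partition, discordant)  # prints 1.0 -0.5
```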

[3]:
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score

from graspologic.utils import remap_labels

# Say user provides inaccurate estimate of number of components
kmeans_ = KMeans(3)
c_hat_kmeans = kmeans_.fit_predict(X)

# Remap Predicted labels
c_hat_kmeans = remap_labels(c, c_hat_kmeans)

plot('Sklearn Clustering', c_hat_kmeans, X)

# ARI Score
print("ARI Score for Model: %.2f" % adjusted_rand_score(c, c_hat_kmeans))
[Figure: "Sklearn Clustering" scatter plot of the synthetic data, colored by the remapped KMeans labels]
ARI Score for Model: 0.46
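The cell above calls remap_labels so that predicted cluster labels line up with the true labels before plotting. One common way to do such an alignment is to build a confusion matrix and solve the resulting assignment problem. The sketch below illustrates that general technique only. The helper `align_labels` is hypothetical, and nothing here is taken from graspologic's own code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix


def align_labels(y_true, y_pred):
    """Hypothetical helper: rename predicted labels to best agree with y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels = np.unique(np.concatenate((y_true, y_pred)))
    # Rows index true labels, columns index predicted labels
    counts = confusion_matrix(y_true, y_pred, labels=labels)
    # Choose the one-to-one pairing of labels with the largest total overlap
    true_idx, pred_idx = linear_sum_assignment(counts, maximize=True)
    mapping = {labels[p]: labels[t] for t, p in zip(true_idx, pred_idx)}
    return np.array([mapping[p] for p in y_pred])


y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 1, 0, 0, 1])  # label names are swapped, one point is wrong
aligned = align_labels(y_true, y_pred)
print(aligned)  # prints [0 0 0 1 1 0]
```

Renaming labels this way changes only the colors in the plot. The ARI score is unaffected, because ARI ignores label names.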

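Our method extends the existing Sklearn framework by letting users automatically estimate the best hyperparameters for a k-means clustering model. It finds the ideal number of clusters, n_clusters_, which is at most the maximum value supplied by the user.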

[4]:
from graspologic.cluster.kclust import KMeansCluster

# Fit model
kclust_ = KMeansCluster(max_clusters=10)
c_hat_kclust = kclust_.fit_predict(X)

c_hat_kclust = remap_labels(c, c_hat_kclust)

plot('KClust Clustering', c_hat_kclust, X)

print("ARI Score for Model: %.2f" % adjusted_rand_score(c, c_hat_kclust))
[Figure: "KClust Clustering" scatter plot of the synthetic data, colored by the remapped KMeansCluster labels]
ARI Score for Model: 0.66