k-means

This example uses the \(k\)-means clustering method for time series analysis. Three variants of the algorithm are available: standard Euclidean \(k\)-means, DBA-\(k\)-means (DBA stands for DTW Barycenter Averaging [1]) and Soft-DTW \(k\)-means [2].

In the figure below, each row corresponds to the result of a different clustering. In a row, each sub-figure corresponds to a cluster. It represents the set of time series from the training set that were assigned to the considered cluster (in black) as well as the barycenter of the cluster (in red).

A note on pre-processing

In this example, time series are pre-processed using TimeSeriesScalerMeanVariance. This scaler is such that each output time series has zero mean and unit variance. The assumption here is that the range of a given time series is uninformative and that one only wants to compare shapes in an amplitude-invariant manner (when time series are multivariate, this also rescales all modalities such that no single modality is responsible for a large part of the variance). This implies that one cannot scale barycenters back to the data range, because each time series is scaled independently, so there is no such thing as an overall data range.
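This per-series standardization can be sketched in plain NumPy (a minimal illustration of the zero-mean, unit-variance property; not the tslearn implementation itself):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two toy univariate series with very different offsets and amplitudes
X = np.stack([10.0 + 5.0 * rng.standard_normal(40),
              -3.0 + 0.2 * rng.standard_normal(40)])

# Per-series z-normalization: subtract each series' mean, divide by its std
X_scaled = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

print(X_scaled.mean(axis=1))  # both close to 0
print(X_scaled.std(axis=1))   # both close to 1
```

Because each series is normalized independently, the original offsets and amplitudes (and hence any "overall data range") are discarded; only the shapes remain comparable.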

[1] F. Petitjean, A. Ketterlin & P. Gancarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, Elsevier, 2011, Vol. 44, Num. 3, pp. 678-693.
[2] M. Cuturi, M. Blondel. "Soft-DTW: a Differentiable Loss Function for Time-Series," ICML 2017.

[Figure: one row per clustering variant, one sub-plot per cluster — Euclidean $k$-means, DBA $k$-means, Soft-DTW $k$-means]

Out:

Euclidean k-means
15.795 --> 7.716 --> 7.716 -->
DBA k-means
Init 1
0.637 --> 0.458 --> 0.458 -->
Init 2
0.826 --> 0.525 --> 0.477 --> 0.472 --> 0.472 -->
Soft-DTW k-means
0.472 --> 0.144 --> 0.142 --> 0.143 --> 0.142 --> 0.143 --> 0.142 --> 0.143 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 -->

# Author: Romain Tavenard
# License: BSD 3 clause

import numpy
import matplotlib.pyplot as plt

from tslearn.clustering import TimeSeriesKMeans
from tslearn.datasets import CachedDatasets
from tslearn.preprocessing import TimeSeriesScalerMeanVariance, \
    TimeSeriesResampler

seed = 0
numpy.random.seed(seed)
X_train, y_train, X_test, y_test = CachedDatasets().load_dataset("Trace")
X_train = X_train[y_train < 4]  # Keep first 3 classes
numpy.random.shuffle(X_train)
# Keep only 50 time series
X_train = TimeSeriesScalerMeanVariance().fit_transform(X_train[:50])
# Make time series shorter
X_train = TimeSeriesResampler(sz=40).fit_transform(X_train)
sz = X_train.shape[1]

# Euclidean k-means
print("Euclidean k-means")
km = TimeSeriesKMeans(n_clusters=3, verbose=True, random_state=seed)
y_pred = km.fit_predict(X_train)

plt.figure()
for yi in range(3):
    plt.subplot(3, 3, yi + 1)
    for xx in X_train[y_pred == yi]:
        plt.plot(xx.ravel(), "k-", alpha=.2)
    plt.plot(km.cluster_centers_[yi].ravel(), "r-")
    plt.xlim(0, sz)
    plt.ylim(-4, 4)
    plt.text(0.55, 0.85, 'Cluster %d' % (yi + 1),
             transform=plt.gca().transAxes)
    if yi == 1:
        plt.title("Euclidean $k$-means")

# DBA-k-means
print("DBA k-means")
dba_km = TimeSeriesKMeans(n_clusters=3,
                          n_init=2,
                          metric="dtw",
                          verbose=True,
                          max_iter_barycenter=10,
                          random_state=seed)
y_pred = dba_km.fit_predict(X_train)

for yi in range(3):
    plt.subplot(3, 3, 4 + yi)
    for xx in X_train[y_pred == yi]:
        plt.plot(xx.ravel(), "k-", alpha=.2)
    plt.plot(dba_km.cluster_centers_[yi].ravel(), "r-")
    plt.xlim(0, sz)
    plt.ylim(-4, 4)
    plt.text(0.55, 0.85, 'Cluster %d' % (yi + 1),
             transform=plt.gca().transAxes)
    if yi == 1:
        plt.title("DBA $k$-means")

# Soft-DTW-k-means
print("Soft-DTW k-means")
sdtw_km = TimeSeriesKMeans(n_clusters=3,
                           metric="softdtw",
                           metric_params={"gamma": .01},
                           verbose=True,
                           random_state=seed)
y_pred = sdtw_km.fit_predict(X_train)

for yi in range(3):
    plt.subplot(3, 3, 7 + yi)
    for xx in X_train[y_pred == yi]:
        plt.plot(xx.ravel(), "k-", alpha=.2)
    plt.plot(sdtw_km.cluster_centers_[yi].ravel(), "r-")
    plt.xlim(0, sz)
    plt.ylim(-4, 4)
    plt.text(0.55, 0.85, 'Cluster %d' % (yi + 1),
             transform=plt.gca().transAxes)
    if yi == 1:
        plt.title("Soft-DTW $k$-means")

plt.tight_layout()
plt.show()

Total running time of the script: (0 minutes 6.376 seconds)

Gallery generated by Sphinx-Gallery