k-means
This example uses \(k\)-means clustering for time series. Three variants of the algorithm are available: standard Euclidean \(k\)-means, DBA-\(k\)-means (DTW Barycenter Averaging [1]) and Soft-DTW \(k\)-means [2].
In the figure below, each row corresponds to the result of a different clustering. Within a row, each subfigure corresponds to one cluster. It shows the set of time series from the training set that were assigned to that cluster (in black), together with the cluster's barycenter (in red).
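The DTW variants rely on Dynamic Time Warping as their distance. As a point of reference, here is a minimal NumPy sketch of the classic DTW recurrence (not tslearn's implementation, which is optimized and supports constraints):

```python
import numpy as np

def dtw_distance(x, y):
    """Plain O(n*m) DTW between two univariate series."""
    n, m = len(x), len(y)
    # D[i, j] = cost of the best warping path aligning x[:i] with y[:j]
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # extend the cheapest of the three admissible predecessor paths
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

# a time-shifted copy of a pattern has zero DTW distance
print(dtw_distance(np.array([0.0, 1.0, 2.0]),
                   np.array([0.0, 0.0, 1.0, 2.0])))  # 0.0
```

Because DTW aligns series before comparing them, DBA-\(k\)-means can group series that share a shape even when that shape is shifted in time, which Euclidean \(k\)-means cannot.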
A note on pre-processing
In this example, time series are preprocessed using TimeSeriesScalerMeanVariance. This scaler is such that each output time series has zero mean and unit variance. The assumption here is that the range of a given time series is uninformative and one only wants to compare shapes in an amplitude-invariant manner (when time series are multivariate, this also rescales all modalities such that no single modality is responsible for a large part of the variance). This also means that barycenters cannot be scaled back to the data range, since each time series is scaled independently and there is hence no overall data range.
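The effect of this per-series standardization can be sketched in plain NumPy (an independent sketch of what TimeSeriesScalerMeanVariance computes, not its actual code):

```python
import numpy as np

# X has shape (n_series, sz, d): each series is standardized independently
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(4, 40, 1))

# per-series mean and standard deviation along the time axis
mean = X.mean(axis=1, keepdims=True)
std = X.std(axis=1, keepdims=True)
X_scaled = (X - mean) / std

# every series now has zero mean and unit variance
print(np.allclose(X_scaled.mean(axis=1), 0.0))  # True
print(np.allclose(X_scaled.std(axis=1), 1.0))   # True
```

Since each series gets its own mean and standard deviation, there is no single inverse transform that maps a barycenter back to the original amplitudes.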
[1] F. Petitjean, A. Ketterlin & P. Gancarski. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition, Elsevier, 2011, Vol. 44, Num. 3, pp. 678-693.
[2] M. Cuturi, M. Blondel. "Soft-DTW: a Differentiable Loss Function for Time-Series," ICML 2017.
Euclidean k-means
15.795 --> 7.716 --> 7.716 -->
DBA k-means
Init 1
0.637 --> 0.458 --> 0.458 -->
Init 2
0.826 --> 0.525 --> 0.477 --> 0.472 --> 0.472 -->
Soft-DTW k-means
0.472 --> 0.144 --> 0.142 --> 0.143 --> 0.142 --> 0.143 --> 0.142 --> 0.143 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 --> 0.142 -->
# Author: Romain Tavenard
# License: BSD 3 clause
import numpy
import matplotlib.pyplot as plt
from tslearn.clustering import TimeSeriesKMeans
from tslearn.datasets import CachedDatasets
from tslearn.preprocessing import TimeSeriesScalerMeanVariance, \
    TimeSeriesResampler
seed = 0
numpy.random.seed(seed)
X_train, y_train, X_test, y_test = CachedDatasets().load_dataset("Trace")
X_train = X_train[y_train < 4] # Keep first 3 classes
numpy.random.shuffle(X_train)
# Keep only 50 time series
X_train = TimeSeriesScalerMeanVariance().fit_transform(X_train[:50])
# Make time series shorter
X_train = TimeSeriesResampler(sz=40).fit_transform(X_train)
sz = X_train.shape[1]
# Euclidean k-means
print("Euclidean k-means")
km = TimeSeriesKMeans(n_clusters=3, verbose=True, random_state=seed)
y_pred = km.fit_predict(X_train)
plt.figure()
for yi in range(3):
    plt.subplot(3, 3, yi + 1)
    for xx in X_train[y_pred == yi]:
        plt.plot(xx.ravel(), "k-", alpha=.2)
    plt.plot(km.cluster_centers_[yi].ravel(), "r-")
    plt.xlim(0, sz)
    plt.ylim(-4, 4)
    plt.text(0.55, 0.85, 'Cluster %d' % (yi + 1),
             transform=plt.gca().transAxes)
    if yi == 1:
        plt.title("Euclidean $k$-means")
# DBA-k-means
print("DBA k-means")
dba_km = TimeSeriesKMeans(n_clusters=3,
                          n_init=2,
                          metric="dtw",
                          verbose=True,
                          max_iter_barycenter=10,
                          random_state=seed)
y_pred = dba_km.fit_predict(X_train)
for yi in range(3):
    plt.subplot(3, 3, 4 + yi)
    for xx in X_train[y_pred == yi]:
        plt.plot(xx.ravel(), "k-", alpha=.2)
    plt.plot(dba_km.cluster_centers_[yi].ravel(), "r-")
    plt.xlim(0, sz)
    plt.ylim(-4, 4)
    plt.text(0.55, 0.85, 'Cluster %d' % (yi + 1),
             transform=plt.gca().transAxes)
    if yi == 1:
        plt.title("DBA $k$-means")
# Soft-DTW-k-means
print("Soft-DTW k-means")
sdtw_km = TimeSeriesKMeans(n_clusters=3,
                           metric="softdtw",
                           metric_params={"gamma": .01},
                           verbose=True,
                           random_state=seed)
y_pred = sdtw_km.fit_predict(X_train)
for yi in range(3):
    plt.subplot(3, 3, 7 + yi)
    for xx in X_train[y_pred == yi]:
        plt.plot(xx.ravel(), "k-", alpha=.2)
    plt.plot(sdtw_km.cluster_centers_[yi].ravel(), "r-")
    plt.xlim(0, sz)
    plt.ylim(-4, 4)
    plt.text(0.55, 0.85, 'Cluster %d' % (yi + 1),
             transform=plt.gca().transAxes)
    if yi == 1:
        plt.title("Soft-DTW $k$-means")
plt.tight_layout()
plt.show()
Total running time of the script: (0 minutes 6.376 seconds)