`pyts.classification`.BOSSVS

class pyts.classification.BOSSVS(word_size=4, n_bins=4, window_size=10, window_step=1, anova=False, drop_sum=False, norm_mean=False, norm_std=False, strategy='quantile', alphabet=None, numerosity_reduction=True, use_idf=True, smooth_idf=False, sublinear_tf=True)

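Bag-of-SFA Symbols in Vector Space.

Each time series is transformed into a histogram using the Bag-of-SFA Symbols (BOSS) algorithm. Then, for each class, the histograms are summed and a tf-idf vector is computed. The predicted class for a new sample is the class whose tf-idf vector has the highest cosine similarity with the sample's tf vector.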
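The prediction rule described above can be sketched in plain Python. This is an illustration only, not the pyts implementation: the words, the class labels and the weights in `class_tfidf` are made up, and `cosine` and `predict_one` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def predict_one(sample_tf, class_tfidf):
    """Return the class whose tf-idf vector is most similar to the sample."""
    return max(class_tfidf, key=lambda label: cosine(sample_tf, class_tfidf[label]))

# Made-up tf-idf vectors, one per class (word -> weight).
class_tfidf = {
    "gun": {"abba": 2.1, "abca": 1.0, "ddca": 0.0},
    "point": {"abba": 0.0, "abca": 1.0, "ddca": 2.4},
}
# Made-up term frequencies of a new time series (word -> count).
sample_tf = {"abba": 3, "abca": 1}

predicted = predict_one(sample_tf, class_tfidf)
print(predicted)  # gun
```

The sample shares its most frequent word with the "gun" vector only, so that class gets the higher similarity.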

Parameters:

word_size : int (default = 4)

Size of each word.

n_bins : int (default = 4)

The number of bins to produce. It must be between 2 and 26.

window_size : int or float (default = 10)

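Size of the sliding window. If float, it is the fraction of the length of each time series and must be between 0 and 1. The window size is then computed as ceil(window_size * n_timestamps).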

window_step : int or float (default = 1)

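Step of the sliding window. If float, it is the fraction of the length of each time series and must be between 0 and 1. The window step is then computed as ceil(window_step * n_timestamps).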
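A small sketch of how float values of `window_size` and `window_step` translate into numbers of time points. The helper name `resolve`, the exact bounds check and the way window start indices are laid out are assumptions made for illustration, not pyts internals.

```python
import math

def resolve(value, n_timestamps):
    """Convert an int or float window parameter to a number of time points."""
    if isinstance(value, float):
        # Assumed bounds check; the documentation only says "between 0 and 1".
        if not 0.0 < value <= 1.0:
            raise ValueError("a float value must be between 0 and 1")
        return math.ceil(value * n_timestamps)
    return int(value)

n_timestamps = 150
window_size = resolve(0.25, n_timestamps)   # ceil(37.5) = 38
window_step = resolve(0.125, n_timestamps)  # ceil(18.75) = 19

# Assumed layout: start index of every window that fits inside the series.
starts = list(range(0, n_timestamps - window_size + 1, window_step))
print(window_size, window_step, starts)
```

With these values, six windows of 38 points are extracted, starting every 19 points.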

anova : bool (default = False)

If True, the Fourier coefficients are selected with a one-way ANOVA test. If False, the first Fourier coefficients are selected.

drop_sum : bool (default = False)

If True, the first Fourier coefficient (i.e. the sum of the subseries) is dropped. Otherwise, it is kept.
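To see why the first Fourier coefficient equals the sum of the subseries, here is a naive discrete Fourier transform in plain Python. The function `dft` and the sample values are illustrative and say nothing about how pyts computes or orders its coefficients.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

subseries = [1.0, 3.0, 2.0, 6.0]
coefficients = dft(subseries)

# For k = 0 every exponential factor is 1, so the term is just the sum.
first = coefficients[0]
print(first.real, sum(subseries))
```

Because this coefficient only carries the level of the subseries, dropping it makes the resulting words insensitive to vertical offsets.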

norm_mean : bool (default = False)

If True, center each subseries before scaling.

norm_std : bool (default = False)

If True, scale each subseries to unit variance.

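strategy : str (default = 'quantile')

Strategy used to define the widths of the bins:

'uniform': All bins in each sample have identical widths
'quantile': All bins in each sample have the same number of points
'normal': Bin edges are quantiles from a standard normal distribution
'entropy': Bin edges are computed using information gain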
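The first three strategies can be compared on a toy list of values with the standard library alone. The helper names below are hypothetical and the sketch does not reproduce pyts's discretizer; 'entropy' is left out because it needs class labels.

```python
from statistics import NormalDist, quantiles

def uniform_edges(values, n_bins):
    """Interior edges of n_bins equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    return [lo + (hi - lo) * k / n_bins for k in range(1, n_bins)]

def quantile_edges(values, n_bins):
    """Interior edges that put about the same number of points in each bin."""
    return quantiles(values, n=n_bins, method="inclusive")

def normal_edges(n_bins):
    """Interior edges taken from the quantiles of the standard normal."""
    return [NormalDist().inv_cdf(k / n_bins) for k in range(1, n_bins)]

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]
print(uniform_edges(values, 4))   # [5.75, 10.5, 15.25]
print(quantile_edges(values, 4))  # [2.75, 4.5, 6.25]
print(normal_edges(4))            # symmetric around 0.0
```

The single outlier (20.0) stretches the uniform bins so that most points fall in the first one, while the quantile edges stay close to the bulk of the data.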

alphabet : None, 'ordinal' or array-like, shape = (n_bins,) (default = None)

Alphabet to use. If None, the first n_bins letters of the Latin alphabet are used.

numerosity_reduction : bool (default = True)

If True, for each sample, every run of consecutive identical words is reduced to a single occurrence.
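Numerosity reduction amounts to collapsing runs of repeated words, as in this minimal sketch. The function name and the words are made up for illustration.

```python
from itertools import groupby

def numerosity_reduction(words):
    """Keep one word from every run of identical consecutive words."""
    return [word for word, _ in groupby(words)]

words = ["abba", "abba", "abca", "abca", "abca", "abba"]
reduced = numerosity_reduction(words)
print(reduced)  # ['abba', 'abca', 'abba']
```

Only back-to-back repeats are removed: the word "abba" still appears twice because its two runs are separated by "abca".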

use_idf : bool (default = True)

Enable inverse-document-frequency reweighting.

smooth_idf : bool (default = False)

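Smooth idf weights by adding one to document frequencies, as if an extra document containing every term in the collection exactly once had been seen. This prevents divisions by zero.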

sublinear_tf : bool (default = True)

Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
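The effect of `smooth_idf` and `sublinear_tf` on a single word weight can be sketched as follows. The idf formulas follow the scikit-learn convention, which is an assumption about what BOSSVS uses internally; `idf` and `scale_tf` are hypothetical helper names, and each class plays the role of one document.

```python
import math

def idf(n_documents, document_frequency, smooth_idf=False):
    """Inverse document frequency, scikit-learn style (assumed convention)."""
    if smooth_idf:
        return math.log((1 + n_documents) / (1 + document_frequency)) + 1.0
    return math.log(n_documents / document_frequency) + 1.0

def scale_tf(tf, sublinear_tf=True):
    """Term frequency, optionally damped to 1 + log(tf)."""
    if sublinear_tf and tf > 0:
        return 1.0 + math.log(tf)
    return float(tf)

# A word seen 8 times in one class and present in 1 of 2 classes.
weight = scale_tf(8) * idf(n_documents=2, document_frequency=1)
print(weight)
```

Under this convention a word that occurs in every class gets an idf of exactly 1, and the unsmoothed formula fails for a document frequency of 0, which is the case smoothing guards against.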

References

[1]	P. Schäfer, "Scalable time series classification". Data Mining and Knowledge Discovery, 30(5), 1273-1298 (2016).

Examples

>>> from pyts.classification import BOSSVS
>>> from pyts.datasets import load_gunpoint
>>> X_train, X_test, y_train, y_test = load_gunpoint(return_X_y=True)
>>> clf = BOSSVS(window_size=28)
>>> clf.fit(X_train, y_train)
BOSSVS(...)
>>> clf.score(X_test, y_test)
0.98

Attributes:

classes_ : array, shape = (n_classes,)

An array of class labels known to the classifier.

idf_ : array, shape = (n_features,) or None

The learned idf vector (global term weights) when `use_idf=True`, None otherwise.

tfidf_ : array, shape = (n_classes, n_words)

Term-document matrix.

vocabulary_ : dict

A mapping of feature indices to terms.

Methods

`__init__`([word_size, n_bins, window_size, …])	Initialize self.
`decision_function`(X)	Evaluate the cosine similarity between document-term matrix and X.
`fit`(X, y)	Compute the document-term matrix.
`get_params`([deep])	Get parameters for this estimator.
`predict`(X)	Predict the class labels for the provided data.
`score`(X, y[, sample_weight])	Return the mean accuracy on the given test data and labels.
`set_params`(**params)	Set the parameters of this estimator.

__init__(word_size=4, n_bins=4, window_size=10, window_step=1, anova=False, drop_sum=False, norm_mean=False, norm_std=False, strategy='quantile', alphabet=None, numerosity_reduction=True, use_idf=True, smooth_idf=False, sublinear_tf=True)

Initialize self. See help(type(self)) for accurate signature.

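decision_function(X)

Evaluate the cosine similarity between the document-term matrix and X.

Parameters:	X : array-like, shape = (n_samples, n_timestamps) Test samples.
Returns:	X : array, shape = (n_samples, n_classes) Cosine similarity between the document-term matrix and X.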

fit(X, y)

Compute the document-term matrix.

Parameters:	X : array-like, shape = (n_samples, n_timestamps) Training vector. y : array-like, shape = (n_samples,) Class labels for each data sample.
Returns:	self : object

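get_params(deep=True)

Get parameters for this estimator.

Parameters:	deep : bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:	params : dict Parameter names mapped to their values.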

predict(X)

Predict the class labels for the provided data.

Parameters:	X : array-like, shape = (n_samples, n_timestamps) Test samples.
Returns:	y_pred : array, shape = (n_samples,) Class labels for each data sample.

score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

Parameters:	X : array-like, shape = (n_samples, n_timestamps) Univariate time series. y : array-like, shape = (n_samples,) True labels for X. sample_weight : None or array-like, shape = (n_samples,) (default = None) Sample weights.
Returns:	score : float Mean accuracy of `self.predict(X)` with respect to y.
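
As a rough illustration of what the returned score means, the sketch below computes a mean accuracy with optional sample weights in plain Python. It assumes the usual scikit-learn convention (sum of the weights of correctly classified samples divided by the sum of all weights); `mean_accuracy` is a hypothetical helper, not part of pyts.

```python
def mean_accuracy(y_true, y_pred, sample_weight=None):
    """Fraction of correct predictions, optionally weighted per sample."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)

y_true = [1, 2, 2, 1]
y_pred = [1, 2, 1, 1]
print(mean_accuracy(y_true, y_pred))                        # 0.75
print(mean_accuracy(y_true, y_pred, [1.0, 1.0, 2.0, 1.0]))  # 0.6
```

Giving the misclassified third sample a weight of 2 lowers the score from 0.75 to 0.6.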