tsfresh.transformers package

Submodules

tsfresh.transformers.feature_augmenter module

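class tsfresh.transformers.feature_augmenter.FeatureAugmenter(default_fc_parameters=None, kind_to_fc_parameters=None, column_id=None, column_sort=None, column_kind=None, column_value=None, timeseries_container=None, chunksize=None, n_jobs=5, show_warnings=False, disable_progressbar=False, impute_function=None, profile=False, profiling_filename='profile.txt', profiling_sorting='cumulative')[source]

Bases: BaseEstimator, TransformerMixin

Sklearn-compatible estimator for calculating many features from a given time series and adding them to the data. It is basically a wrapper around extract_features().

The features include basic ones like min, max or median, and advanced features like Fourier transformations or statistical tests. For a list of all possible features, see the module feature_calculators. The column name of each added feature contains the name of the function of that module that was used for the calculation.

For this estimator, two datasets play a crucial role:

the time series container with the time series data. This container (for the format see Data Formats) contains the data used for calculating the features. It must be groupable by ids, which are used to identify which feature should be attached to which row in the second dataframe.
the input data X, to which the features will be added. Its rows are identified by the index, and each index in X must be present as an id in the time series container.

Imagine the following situation: you want to classify 10 different financial shares and you have their development over the last year as a time series. You would then start by creating features from the metainformation of the shares, e.g. how long they have been on the market, and filling up a table, with the features of one stock in one row. This is the input array X, in which each row is identified by an index, e.g. the stock name.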

>>> import pandas
>>> df = pandas.DataFrame(index=["AAA", "BBB", ...])
>>> # Fill in the information of the stocks
>>> df["started_since_days"] = ... # add a feature

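You can then extract all the features from the time development of the shares by using this estimator. The time series container must contain a column of ids, which are the same as the index of X.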

>>> time_series = read_in_timeseries() # get the development of the shares
>>> from tsfresh.transformers import FeatureAugmenter
>>> augmenter = FeatureAugmenter(column_id="id")
>>> augmenter.set_timeseries_container(time_series)
>>> df_with_time_series_features = augmenter.transform(df)
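
To make the roles of the two datasets concrete, the following is a minimal pure-Python sketch of what the augmentation step does conceptually. It is not tsfresh's implementation: the helper augment_rows, the sample data and the value__… column names are assumptions made for illustration, and only three basic features are computed.

```python
from statistics import median


def augment_rows(rows, container):
    """Attach min/max/median of each id's series to the row with that index.

    rows: dict mapping an index (e.g. a stock name) to a dict of existing features.
    container: list of (id, value) pairs, i.e. a flat time series container.
    """
    # Group the time series values by id.
    series_by_id = {}
    for series_id, value in container:
        series_by_id.setdefault(series_id, []).append(value)

    augmented = {}
    for index, features in rows.items():
        if index not in series_by_id:
            # Every index of X must be present as an id in the container.
            raise KeyError(f"no time series found for id {index!r}")
        values = series_by_id[index]
        new_features = dict(features)  # keep the features that were already there
        new_features["value__minimum"] = min(values)
        new_features["value__maximum"] = max(values)
        new_features["value__median"] = median(values)
        augmented[index] = new_features
    return augmented


# Hypothetical example data: two stocks with one manual feature each.
rows = {"AAA": {"started_since_days": 400}, "BBB": {"started_since_days": 120}}
container = [("AAA", 1.0), ("AAA", 3.0), ("AAA", 2.0), ("BBB", 10.0), ("BBB", 20.0)]
augmented = augment_rows(rows, container)
print(augmented["AAA"])
```

The existing columns of each row are preserved and the new feature columns are appended, which mirrors the relationship between X and the returned DataFrame described above.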

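The settings for the feature calculation can be controlled with the settings object. If you pass None, the default settings are used. Please refer to ComprehensiveFCParameters for more information.

This estimator does not select the relevant features, but calculates and adds all of them to the DataFrame. See RelevantFeatureAugmenter for calculating and selecting features.

For a description of what the parameters column_id, column_sort, column_kind and column_value mean, please see extraction.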

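fit(X=None, y=None)[source]

The fit function is not needed for this estimator. It does nothing and is only present for compatibility reasons.

Parameters:

X (Any) – Unneeded.
y (Any) – Unneeded.

Returns:

The estimator instance itself

Return type:

FeatureAugmenter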

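set_timeseries_container(timeseries_container)[source]

Set the time series with which the features will be calculated. For the format of the time series container, please refer to extraction. The time series must contain the same indices as the DataFrame to which the features will be added later (the one you will pass to transform()). You can call this function as often as you like to change the time series later (e.g. if you want to extract features for different ids).

Parameters: timeseries_container (pandas.DataFrame or dict) – The time series as a pandas.DataFrame or a dict. See extraction for the format.
Returns: None
Return type: None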

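transform(X)[source]

Add the features calculated using the timeseries_container to the corresponding rows in the input pandas.DataFrame X.

To save some computing time, you should only include those time series in the container that you need. You can set the time series container with the method set_timeseries_container().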
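
One way to follow this advice is to drop all records whose id does not occur in X before setting the container. The sketch below uses plain Python tuples instead of a pandas.DataFrame; the helper name restrict_to_ids and the sample data are hypothetical.

```python
def restrict_to_ids(container, wanted_ids):
    """Keep only the (id, time, value) records whose id is in wanted_ids."""
    wanted = set(wanted_ids)  # set lookup keeps the filter linear in the container size
    return [record for record in container if record[0] in wanted]


# Hypothetical container with three ids; only two of them appear in X.
container = [
    ("AAA", 0, 1.0), ("AAA", 1, 1.5),
    ("BBB", 0, 7.0), ("BBB", 1, 6.5),
    ("CCC", 0, 3.0),
]
x_index = ["AAA", "CCC"]  # stands in for the index of the data frame passed to transform()
reduced = restrict_to_ids(container, x_index)
print(reduced)
```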

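Parameters: X (pandas.DataFrame) – the DataFrame to which the calculated time series features will be added. This is not the dataframe with the time series itself.
Returns: The input DataFrame, but with added features.
Return type: pandas.DataFrame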

tsfresh.transformers.feature_selector module

class tsfresh.transformers.feature_selector.FeatureSelector(test_for_binary_target_binary_feature='fisher', test_for_binary_target_real_feature='mann', test_for_real_target_binary_feature='mann', test_for_real_target_real_feature='kendall', fdr_level=0.05, hypotheses_independent=False, n_jobs=5, chunksize=None, ml_task='auto', multiclass=False, n_significant=1, multiclass_p_values='min')[source]

Bases: BaseEstimator, TransformerMixin

Sklearn-compatible estimator for reducing the number of features in a dataset to only those that are relevant and significant for a given target. It is basically a wrapper around check_fs_sig_bh().

The check is done by testing the hypothesis

H_0 = the feature is not relevant and cannot be added

against

H_1 = the feature is relevant and should be kept

using several statistical tests (depending on whether the feature and/or the target is binary). Using the Benjamini-Hochberg procedure, only features in H_0 (that is, those not found relevant) are rejected.
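
For readers unfamiliar with the Benjamini-Hochberg procedure, here is a minimal pure-Python sketch of its step-up rule for independent hypotheses. It is illustrative only and not the code tsfresh runs; the function name benjamini_hochberg and the p-values are made up, and the default level simply mirrors fdr_level=0.05 from the class signature.

```python
def benjamini_hochberg(p_values, fdr_level=0.05):
    """Return the names of the features whose null hypothesis H_0 is rejected.

    p_values: dict mapping a feature name to its p-value.
    """
    # Sort the features by ascending p-value.
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)

    # Find the largest rank k with p_(k) <= k / m * fdr_level.
    cutoff = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * fdr_level:
            cutoff = rank

    # H_0 is rejected for the first `cutoff` features: they count as relevant.
    return [name for name, _ in ranked[:cutoff]]


# Hypothetical p-values for four features.
p_values = {"f1": 0.001, "f2": 0.04, "f3": 0.03, "f4": 0.5}
relevant = benjamini_hochberg(p_values)
print(relevant)
```

Because the rule looks for the largest rank that passes its threshold, a feature with a small rank can be kept even if its own p-value misses its individual threshold, as long as a later one passes.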

This estimator, like most sklearn estimators, works in a two-step procedure. First, it is fitted on training data, where the target is known:

>>> import pandas as pd
>>> X_train, y_train = pd.DataFrame(), pd.Series() # fill in with your features and target
>>> from tsfresh.transformers import FeatureSelector
>>> selector = FeatureSelector()
>>> selector.fit(X_train, y_train)

In this example the list of relevant features is empty:

>>> selector.relevant_features
[]

The same holds for the feature importance:

>>> selector.feature_importances_
array([], dtype=float64)

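The estimator keeps track of those features that were relevant in the training step. If you apply the estimator after the training, it will delete all other features in the testing data sample: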

>>> X_test = pd.DataFrame()
>>> X_selected = selector.transform(X_test)

After that, X_selected will only contain the features that were relevant during the training.
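
The bookkeeping described here can be pictured with a small stand-in class. This is a sketch under simplifying assumptions, not the FeatureSelector implementation: the class name ToySelector and the helper pearson are hypothetical, data samples are plain dicts of columns, and a feature is called relevant when its absolute Pearson correlation with the target reaches a fixed threshold instead of passing the statistical tests described above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


class ToySelector:
    """Remembers the relevant columns in fit() and drops the rest in transform()."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.relevant_features = None

    def fit(self, X, y):
        # Stand-in relevance rule; tsfresh uses hypothesis tests instead.
        self.relevant_features = [
            name for name, column in X.items()
            if abs(pearson(column, y)) >= self.threshold
        ]
        return self

    def transform(self, X):
        if self.relevant_features is None:
            raise RuntimeError("fit() must be called before transform()")
        return {name: X[name] for name in self.relevant_features}


# Hypothetical training data: "signal" follows the target, "noise" does not.
X_train = {"signal": [1.0, 2.0, 3.0, 4.0], "noise": [5.0, 1.0, 4.0, 2.0]}
y_train = [0.0, 1.0, 2.0, 3.0]
selector = ToySelector().fit(X_train, y_train)

X_test = {"signal": [9.0, 8.0], "noise": [0.0, 0.0]}
X_selected = selector.transform(X_test)
print(selector.relevant_features, X_selected)
```

The point of the design is that the decision about relevance is made once, on data with a known target, and is then applied unchanged to data without one.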

If you are interested in more information on the features, you can look into the member relevant_features after the fit.

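fit(X, y)[source]

Extract the information on which features are relevant, using the given target.

For more information, please see the check_fs_sig_bh() function. All columns in the input data sample are treated as features. The index of all rows in X must be present in y.

Parameters:

X (pandas.DataFrame or numpy.array) – data sample with the features, which will be classified as relevant or not
y (pandas.Series or numpy.array) – target vector to be used to classify the features

Returns:

the fitted estimator with the information on which features are relevant

Return type:

FeatureSelector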

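transform(X)[source]

Delete all features that were not relevant in the fit phase.

Parameters: X (pandas.DataFrame or numpy.array) – data sample with all features, which will be reduced to only those that are relevant
Returns: same data sample as X, but with only the relevant features
Return type: pandas.DataFrame or numpy.array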

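tsfresh.transformers.per_column_imputer module

class tsfresh.transformers.per_column_imputer.PerColumnImputer(col_to_NINF_repl_preset=None, col_to_PINF_repl_preset=None, col_to_NAN_repl_preset=None)[source]

Bases: BaseEstimator, TransformerMixin

Sklearn-compatible estimator for column-wise imputing of DataFrames by replacing all NaNs and infs with median/extreme values from the same columns. It is basically a wrapper around impute().

Each occurring inf or NaN in the DataFrame is replaced by

-inf -> min
+inf -> max
NaN -> median

This estimator, like most sklearn estimators, works in a two-step procedure. First, the .fit function is called, where the min, max and median are computed for each column. Second, the .transform function is called, which replaces the occurrences of NaNs and infs using the column-wise computed min, max and median values.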
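
The two-step procedure can be sketched in pure Python as follows. This is an illustration of the replacement rule, not the tsfresh implementation; the function names fit_ranges and impute_column are hypothetical, and the sketch assumes that the minimum, maximum and median are computed over the finite entries of each column.

```python
import math
from statistics import median


def fit_ranges(columns):
    """Compute (min, max, median) of the finite values of every column.

    Assumes each column contains at least one finite value.
    """
    ranges = {}
    for name, values in columns.items():
        finite = [v for v in values if math.isfinite(v)]
        ranges[name] = (min(finite), max(finite), median(finite))
    return ranges


def impute_column(values, col_min, col_max, col_median):
    """Replace -inf by the minimum, +inf by the maximum and NaN by the median."""
    result = []
    for v in values:
        if math.isnan(v):
            result.append(col_median)
        elif v == -math.inf:
            result.append(col_min)
        elif v == math.inf:
            result.append(col_max)
        else:
            result.append(v)
    return result


# The "fit" step: learn the replacement values on training data.
train = {"a": [1.0, 2.0, 6.0], "b": [0.5, math.inf, 1.5]}
ranges = fit_ranges(train)

# The "transform" step: apply them to new data.
test = {"a": [math.nan, -math.inf, 3.0], "b": [math.inf, 1.0, math.nan]}
imputed = {name: impute_column(values, *ranges[name]) for name, values in test.items()}
print(imputed)
```

Note that the replacement values come from the data seen during fitting, so the same values are used no matter which data is transformed later.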

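fit(X, y=None)[source]

Compute the min, max and median for all columns in the DataFrame. For more information, please see the get_range_values_per_column() function.

Parameters:

X (pandas.DataFrame) – DataFrame for which to calculate the min, max and median values
y (Any) – Unneeded.

Returns:

the estimator with the computed min, max and median values

Return type:

PerColumnImputer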

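transform(X)[source]

Column-wise replace all NaNs, -inf and +inf in the DataFrame X with the median/extreme values from the provided dictionaries.

Parameters: X (pandas.DataFrame) – DataFrame to impute
Returns: imputed DataFrame
Return type: pandas.DataFrame
Raises: RuntimeError – if the replacement dictionaries are still of None type. This can happen if the transformer was not fitted.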

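tsfresh.transformers.relevant_feature_augmenter module

class tsfresh.transformers.relevant_feature_augmenter.RelevantFeatureAugmenter(filter_only_tsfresh_features=True, default_fc_parameters=None, kind_to_fc_parameters=None, column_id=None, column_sort=None, column_kind=None, column_value=None, timeseries_container=None, chunksize=None, n_jobs=5, show_warnings=False, disable_progressbar=False, profile=False, profiling_filename='profile.txt', profiling_sorting='cumulative', test_for_binary_target_binary_feature='fisher', test_for_binary_target_real_feature='mann', test_for_real_target_binary_feature='mann', test_for_real_target_real_feature='kendall', fdr_level=0.05, hypotheses_independent=False, ml_task='auto', multiclass=False, n_significant=1, multiclass_p_values='min')[source]

Bases: BaseEstimator, TransformerMixin

Sklearn-compatible estimator for calculating relevant features from a time series and adding them to a data sample.

Like many other sklearn estimators, this estimator works in two steps:

In the fit phase, all possible time series features are calculated using the time series that is set by the set_timeseries_container function (unless the features were changed manually by handing in feature extraction settings, e.g. through default_fc_parameters). Then, their significance and relevance to the target is computed using statistical methods, and only the relevant ones are selected using the Benjamini-Hochberg procedure. These features are stored internally.

In the transform step, the information on which features are relevant from the fit step is used, and those features are extracted from the time series. These extracted features are then added to the input data sample.

This estimator is a wrapper around most of the functionality in the tsfresh package. For more information on the subtasks, please refer to the individual modules and functions, which are:

Settings for the feature extraction: ComprehensiveFCParameters
Feature extraction method: extract_features()
Extracted features: feature_calculators
Feature selection: check_fs_sig_bh()

This estimator works analogously to the FeatureAugmenter, with the difference that this estimator only calculates and outputs relevant features, whereas the other outputs all features.

For this estimator too, two datasets play a crucial role:

the time series container with the time series data. This container (for the format see extraction) contains the data used for calculating the features. It must be groupable by ids, which are used to identify which feature should be attached to which row in the second dataframe.
the input data, to which the features will be added.

Imagine the following situation: you want to classify 10 different financial shares and you have their development over the last year as a time series. You would then start by creating features from the metainformation of the shares, e.g. how long they have been on the market, and filling up a table, with the features of one stock in one row.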

>>> import pandas as pd
>>> # Fill in the information of the stocks and the target
>>> X_train, X_test, y_train = pd.DataFrame(), pd.DataFrame(), pd.Series()

You can then extract all the relevant features from the time development of the shares by using this estimator:

>>> train_time_series, test_time_series = read_in_timeseries() # get the development of the shares
>>> from tsfresh.transformers import RelevantFeatureAugmenter
>>> augmenter = RelevantFeatureAugmenter(column_id="id")  # assumes the container has an "id" column
>>> augmenter.set_timeseries_container(train_time_series)
>>> augmenter.fit(X_train, y_train)
>>> augmenter.set_timeseries_container(test_time_series)
>>> X_test_with_features = augmenter.transform(X_test)

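X_test_with_features will then contain the same information as X_test (with all the metainformation you have probably added) plus some relevant time series features calculated on the time series you handed in.

Please keep in mind that the time series you hand in before fit or transform must contain data for the rows that are present in X.

If you set filter_only_tsfresh_features to True, the features you created manually in X_train (or X_test) before using this estimator are not touched. Otherwise, those features are also evaluated and may be rejected from the data sample because they are irrelevant.

For a description of what the parameters column_id, column_sort, column_kind and column_value mean, please see extraction.

You can control the feature extraction in the fit step (the feature extraction in the transform step is done automatically) as well as the feature selection in the fit step by handing in settings. However, the default settings that are used if you pass no flags are often quite sensible.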

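fit(X, y)[source]

Use the time series from set_timeseries_container() and calculate features from it, then add them to the data sample X (which can contain other manually designed features).

Then determine which of the features of X are relevant for the given target y. Store those relevant features internally so that only they are extracted in the transform step.

If filter_only_tsfresh_features is True, only reject newly, automatically added features. If it is False, also look at the features that are already present in the DataFrame.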
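
The effect of filter_only_tsfresh_features on the resulting columns can be summarized in a few lines. The sketch below only manipulates column names; the function kept_columns and the example names are hypothetical, and the set of relevant features is taken as given rather than computed.

```python
def kept_columns(manual, extracted, relevant, filter_only_tsfresh_features=True):
    """Return the column names that survive the selection step.

    manual: columns already present in X before augmentation.
    extracted: columns computed from the time series.
    relevant: columns judged relevant for the target.
    """
    relevant = set(relevant)
    kept_extracted = [name for name in extracted if name in relevant]
    if filter_only_tsfresh_features:
        # Manually created features are not touched.
        return list(manual) + kept_extracted
    # Manually created features are evaluated like all the others.
    return [name for name in manual if name in relevant] + kept_extracted


# Hypothetical column names.
manual = ["started_since_days", "sector_code"]
extracted = ["value__mean", "value__maximum", "value__median"]
relevant = ["started_since_days", "value__mean"]

print(kept_columns(manual, extracted, relevant, True))
print(kept_columns(manual, extracted, relevant, False))
```

With the flag set to True, sector_code survives even though it is not in the relevant set; with False it is dropped together with the irrelevant extracted features.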

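Parameters:

X (pandas.DataFrame or numpy.array) – The data frame without the time series features. The index rows should be present in the time series and in the target vector.
y (pandas.Series or numpy.array) – The target vector that defines which features are relevant.

Returns:

the fitted estimator with the information on which features are relevant.

Return type:

RelevantFeatureAugmenter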

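fit_transform(X, y)[source]

Equivalent to fit() followed by transform(); however, this is faster than performing those steps separately, because it avoids re-extracting relevant features for the training data.

Parameters:

X (pandas.DataFrame or numpy.array) – The data frame without the time series features. The index rows should be present in the time series and in the target vector.
y (pandas.Series or numpy.array) – The target vector that defines which features are relevant.

Returns:

a data sample with the same information as X, but with added relevant time series features and with irrelevant information deleted (the latter only if filter_only_tsfresh_features is False).

Return type:

pandas.DataFrame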

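set_timeseries_container(timeseries_container)[source]

Set the time series with which the features will be calculated. For the format of the time series container, please refer to extraction. The time series must contain the same indices as the DataFrame to which the features will be added later (the one you will pass to transform() or fit()). You can call this function as often as you like to change the time series later (e.g. if you want to extract features for different ids).

Parameters: timeseries_container (pandas.DataFrame or dict) – The time series as a pandas.DataFrame or a dict. See extraction for the format.
Returns: None
Return type: None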

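transform(X)[source]

After the fit step, it is known which features are relevant. Only those are extracted from the time series handed in with the function set_timeseries_container().

If filter_only_tsfresh_features is False, the irrelevant features that were already present in the data frame are deleted as well.

Parameters: X (pandas.DataFrame or numpy.array) – the data sample to which the relevant features are added (and from which the irrelevant ones are deleted).
Returns: a data sample with the same information as X, but with added relevant time series features and with irrelevant information deleted (the latter only if filter_only_tsfresh_features is False).
Return type: pandas.DataFrame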

Module contents

The module transformers contains several transformers that can be used inside a sklearn pipeline.