Preprocessing Examples

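This tutorial covers selected functions from functime.preprocessing. We visualize some common time series preprocessing techniques and show each series before and after transformation. These transformations make a time series look more "well-behaved" and usually make it easier to forecast. The stationarity chapter of the textbook Forecasting: Principles and Practice (https://otexts.com/fpp3/stationarity.html) is an excellent introduction to the topic.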

import polars as pl

from functime.plotting import plot_forecasts, plot_panel
from functime.preprocessing import (
    boxcox,
    deseasonalize_fourier,
    detrend,
    diff,
    fractional_diff,
    scale,
    yeojohnson,
)

We start by loading the commodity prices dataset.

data = pl.read_parquet("../../data/commodities.parquet")
entity_col, time_col, target_col = data.columns
data.head(1)

There are 71 commodities in total.

data.get_column("commodity_type").n_unique()

Now let's visualize the four most volatile time series, ranked by coefficient of variation.

most_volatile_commodities = (
    data.group_by(entity_col)
    .agg((pl.col(target_col).std() / pl.col(target_col).mean()).alias("cv"))
    .top_k(k=4, by="cv")
)
most_volatile_commodities
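
The ranking above relies on the coefficient of variation, the standard deviation divided by the mean. As a minimal sketch in plain Python, with hypothetical toy prices rather than the commodities data, the same idea looks roughly like this:

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Hypothetical toy prices, not taken from the commodities dataset.
prices = {
    "steady": [100.0, 101.0, 99.0, 100.0],
    "volatile": [50.0, 150.0, 80.0, 120.0],
}
cv = {name: coefficient_of_variation(series) for name, series in prices.items()}

# Rank series from most to least volatile.
ranked = sorted(cv, key=cv.get, reverse=True)
print(ranked)
```

Because the ratio is unit-free, it lets us compare series whose price levels differ by orders of magnitude.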

selected = most_volatile_commodities.get_column(entity_col)
y = data.filter(pl.col(entity_col).is_in(selected))
figure = plot_panel(y=y, height=800, width=1000)
figure.show(renderer="svg")

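These time series look complex: they show trends, seasonal effects, and volatility that changes over time. Let's see whether preprocessing can make them easier to forecast.

Detrending

We can use the plot_forecasts function to compare a time series before and after a transformation.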

transformer = detrend(freq="1mo", method="linear")
y_detrended = y.pipe(transformer).collect()
figure = plot_forecasts(
    y_true=y, y_pred=y_detrended.group_by(entity_col).tail(64), height=800, width=1000
)
figure.show(renderer="svg")

Inverting the transformation is simple.

y_original = transformer.invert(y_detrended).group_by(entity_col).tail(64).collect()
subset = ["Natural gas, Europe", "Crude oil, Dubai"]
figure = plot_forecasts(
    y_true=y.filter(pl.col(entity_col).is_in(subset)),
    y_pred=y_original,
    height=400,
    width=1000,
)
figure.show(renderer="svg")
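
To make the round trip concrete, here is a minimal sketch of linear detrending and its inverse in plain Python. It is not functime's implementation: the helper names and the toy series are hypothetical, and the trend is assumed to be a straight line fitted by ordinary least squares against the time index.

```python
def fit_line(values):
    """Ordinary least squares fit of values against t = 0, 1, ..., n - 1."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    covariance = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    variance = sum((t - t_mean) ** 2 for t in range(n))
    slope = covariance / variance
    return slope, y_mean - slope * t_mean


def detrend_linear(values):
    """Return the residuals and the fitted (slope, intercept)."""
    slope, intercept = fit_line(values)
    residuals = [y - (intercept + slope * t) for t, y in enumerate(values)]
    return residuals, (slope, intercept)


def invert_detrend(residuals, params):
    """Add the stored trend back onto the residuals."""
    slope, intercept = params
    return [r + intercept + slope * t for t, r in enumerate(residuals)]


# Hypothetical upward-trending series.
series = [10.0, 12.5, 13.0, 16.5, 18.0, 19.5]
residuals, params = detrend_linear(series)
restored = invert_detrend(residuals, params)
```

The inverse only works because the fitted parameters are kept alongside the residuals, which is why a transformer needs to carry state between the forward and inverse steps.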

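Deseasonalization

Deseasonalization is supported through residualized regression on Fourier terms, which model the seasonality. For this example, we use the M4 weekly dataset, which has clear seasonal patterns.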

m4_data = pl.read_parquet("../../data/m4_1w_train.parquet")
m4_entity_col, m4_time_col, m4_target_col = m4_data.columns
y_m4 = m4_data.filter(pl.col(m4_entity_col).is_in(["W174", "W175", "W176", "W178"]))
figure = plot_panel(y=y_m4, height=800, width=1000)
figure.show(renderer="svg")

Let's plot the seasonal component of each series!

# Fourier Terms
transformer = deseasonalize_fourier(sp=12, K=3)
y_deseasonalized = y_m4.pipe(transformer).collect()
y_seasonal = transformer.state.artifacts["X_seasonal"].collect()
figure = plot_panel(
    y=y_seasonal.group_by(m4_entity_col).tail(64), height=800, width=1000
)
figure.show(renderer="svg")

y_deseasonalized = y_m4.pipe(transformer).collect()
y_original = transformer.invert(y_deseasonalized).collect()
figure = plot_panel(
    y=y_original.group_by(m4_entity_col).tail(64), height=800, width=1000
)
figure.show(renderer="svg")
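
The idea behind Fourier-based deseasonalization can be sketched in plain Python. This is a simplified illustration, not functime's implementation: it assumes the series spans a whole number of seasonal periods and that K < sp / 2, so each regression coefficient reduces to a projection. The helper names and the toy series are hypothetical.

```python
import math


def fourier_terms(n, sp, K):
    """K sine/cosine pairs with seasonal period sp, evaluated at t = 0..n-1."""
    terms = []
    for k in range(1, K + 1):
        terms.append([math.sin(2 * math.pi * k * t / sp) for t in range(n)])
        terms.append([math.cos(2 * math.pi * k * t / sp) for t in range(n)])
    return terms


def deseasonalize(values, sp, K):
    """Subtract the fitted seasonal component; returns (residual, seasonal)."""
    n = len(values)
    if n % sp != 0 or 2 * K >= sp:
        raise ValueError("projection shortcut needs whole periods and K < sp / 2")
    seasonal = [0.0] * n
    for term in fourier_terms(n, sp, K):
        # Over whole periods the terms are orthogonal with squared norm n / 2,
        # so each least-squares coefficient reduces to a simple projection.
        coef = 2 / n * sum(y * x for y, x in zip(values, term))
        seasonal = [s + coef * x for s, x in zip(seasonal, term)]
    return [y - s for y, s in zip(values, seasonal)], seasonal


# Hypothetical series: level 5 plus two seasonal harmonics of period 12.
series = [
    5 + 3 * math.sin(2 * math.pi * t / 12) + math.cos(2 * math.pi * 2 * t / 12)
    for t in range(48)
]
residual, seasonal = deseasonalize(series, sp=12, K=3)
```

Here sp sets the length of one seasonal cycle and K the number of harmonics. A larger K can capture sharper seasonal shapes at the cost of more fitted coefficients.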

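Differencing

First differencing is a technique from time series analysis that turns a non-stationary series into a stationary one by taking the difference between consecutive observations. It assumes the series is integrated of order 1, that is, it has a single unit root.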

transformer = diff(order=1)
y_diff = y.pipe(transformer).collect()
figure = plot_forecasts(
    y_true=y, y_pred=y_diff.group_by(entity_col).tail(64), height=800, width=1000
)
figure.show(renderer="svg")
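
For intuition, first differencing and its inverse fit in a few lines of plain Python. This is a sketch with a hypothetical toy series, independent of functime:

```python
from itertools import accumulate


def first_difference(values):
    """Differences between consecutive observations (one element shorter)."""
    return [b - a for a, b in zip(values, values[1:])]


def invert_first_difference(first_value, differences):
    """Rebuild the levels by cumulatively summing from the first observation."""
    return list(accumulate(differences, initial=first_value))


# Hypothetical random-walk-like series.
random_walk = [100, 102, 101, 105, 104]
differences = first_difference(random_walk)  # [2, -1, 4, -1]
restored = invert_first_difference(random_walk[0], differences)
```

Note that the inverse needs the first observation of the original series, since differencing discards the level.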

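Fractional Differencing

Sometimes you want to make a time series stationary without removing all of its memory. This is especially useful in forecasting tasks where the next value depends on a long history of past values (for example, forecasting stock prices). In such cases we can use fractional differencing. Notice how these plots differ from the first-differencing plots above. It is worth testing several values of d with a scoring function such as the augmented Dickey-Fuller test to find the smallest d that makes the series stationary.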

transformer = fractional_diff(d=0.3, min_weight=1e-3)
y_diff = y.pipe(transformer).collect()
figure = plot_forecasts(
    y_true=y, y_pred=y_diff.group_by(entity_col).tail(64), height=800, width=1000
)
figure.show(renderer="svg")
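
The memory-preserving behavior comes from the weights that fractional differencing applies to past values. The sketch below uses the standard binomial-expansion recursion with a fixed-width window. It assumes that min_weight truncates the window once weights fall below the threshold, which may differ from how functime interprets that parameter, and the helper names are hypothetical.

```python
def frac_diff_weights(d, min_weight=1e-3, max_lags=1000):
    """Weights of (1 - B)^d, truncated once they drop below min_weight."""
    weights = [1.0]
    for k in range(1, max_lags):
        w = -weights[-1] * (d - k + 1) / k
        if abs(w) < min_weight:
            break
        weights.append(w)
    return weights


def fractional_difference(values, d, min_weight=1e-3):
    """Apply the truncated weights wherever a full window of history exists."""
    weights = frac_diff_weights(d, min_weight)
    width = len(weights)
    return [
        sum(w * values[t - k] for k, w in enumerate(weights))
        for t in range(width - 1, len(values))
    ]


weights_first = frac_diff_weights(1.0)  # ordinary first difference
weights_frac = frac_diff_weights(0.3)   # slowly decaying weights
```

With d = 1 the weights collapse to [1, -1], an ordinary first difference. With d = 0.3 the weights decay slowly, so each output value still depends on many past observations.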

Seasonal Differencing

transformer = diff(order=1, sp=12)
y_seas_diff = y.pipe(transformer).collect()
figure = plot_forecasts(
    y_true=y, y_pred=y_seas_diff.group_by(entity_col).tail(64), height=800, width=1000
)
figure.show(renderer="svg")
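
Seasonal differencing subtracts the observation one seasonal period earlier instead of the immediately preceding one. A minimal sketch in plain Python, assuming sp denotes the seasonal lag and using a hypothetical toy series:

```python
def seasonal_difference(values, sp):
    """Difference each observation against the one sp steps earlier."""
    return [values[t] - values[t - sp] for t in range(sp, len(values))]


# Hypothetical series: a repeating pattern of period 4 that rises by 1 each cycle.
pattern = [10, 20, 15, 5]
series = [pattern[t % 4] + t // 4 for t in range(12)]
seasonal_diffs = seasonal_difference(series, sp=4)
```

The repeating pattern cancels out and only the change between cycles remains, at the cost of losing the first sp observations.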

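Local Scaling

A parallelized version of the scaling transformation applied across multiple time series (subtract the mean, divide by the standard deviation).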

transformer = scale(use_mean=True, use_std=True)
y_scaled = y_m4.pipe(transformer).collect()
figure = plot_panel(y=y_scaled.group_by(m4_entity_col).tail(64), height=800, width=1000)
figure.show(renderer="svg")
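
"Local" here means that each series is standardized with its own statistics. The plain-Python sketch below illustrates this on a hypothetical two-series panel. It assumes the sample standard deviation, which may not match the estimator functime uses.

```python
from statistics import mean, stdev


def scale_series(values):
    """Standardize one series; returns the scaled values and (mean, std)."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values], (mu, sigma)


def invert_scale(scaled, params):
    """Undo the standardization using the stored mean and std."""
    mu, sigma = params
    return [z * sigma + mu for z in scaled]


# Hypothetical panel with two series on very different scales.
panel = {"small": [1.0, 2.0, 3.0], "large": [1000.0, 2000.0, 3000.0]}
fitted = {name: scale_series(values) for name, values in panel.items()}
scaled = {name: result[0] for name, result in fitted.items()}
restored = {name: invert_scale(*result) for name, result in fitted.items()}
```

After scaling, both series share the same range, which helps when a single global model is trained on many series at once.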

Box-Cox

This transformation stabilizes the variance of a time series. It requires all values to be strictly positive.

transformer = boxcox(method="mle")
y_boxcox = y.pipe(transformer).collect()
figure = plot_panel(y=y_boxcox.group_by(entity_col).tail(64), height=800, width=1000)
figure.show(renderer="svg")
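
For reference, the Box-Cox transform and its inverse can be written directly from the textbook formula. In this sketch lmbda is supplied by hand, whereas the cell above estimates it from the data. The function names are hypothetical.

```python
import math


def boxcox_transform(values, lmbda):
    """Box-Cox: log(y) when lmbda == 0, otherwise (y**lmbda - 1) / lmbda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lmbda == 0:
        return [math.log(v) for v in values]
    return [(v**lmbda - 1) / lmbda for v in values]


def boxcox_inverse(transformed, lmbda):
    """Invert the transform for the same lmbda."""
    if lmbda == 0:
        return [math.exp(z) for z in transformed]
    return [(lmbda * z + 1) ** (1 / lmbda) for z in transformed]


values = [1.0, 4.0, 9.0]
sqrt_like = boxcox_transform(values, 0.5)  # compresses large values
restored = boxcox_inverse(sqrt_like, 0.5)
```

Values of lmbda below 1 compress large observations more than small ones, which is what evens out a variance that grows with the level of the series.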

Yeo-Johnson

This transformation is similar to Box-Cox but does not require strictly positive values.

transformer = yeojohnson()
y_yeojohnson = y.pipe(transformer).collect()
figure = plot_panel(
    y=y_yeojohnson.group_by(entity_col).tail(64), height=800, width=1000
)
figure.show(renderer="svg")
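
The Yeo-Johnson transform handles zero and negative values by using a separate branch for each sign. The sketch below follows the standard piecewise definition with lmbda supplied by hand. It is an illustration under that assumption, not functime's code, and the function name is hypothetical.

```python
import math


def yeojohnson_transform(values, lmbda):
    """Piecewise Yeo-Johnson transform for a fixed lmbda."""
    out = []
    for v in values:
        if v >= 0:
            # Non-negative branch: Box-Cox applied to (v + 1).
            if lmbda == 0:
                out.append(math.log1p(v))
            else:
                out.append(((v + 1) ** lmbda - 1) / lmbda)
        else:
            # Negative branch: mirrored Box-Cox with exponent (2 - lmbda).
            if lmbda == 2:
                out.append(-math.log1p(-v))
            else:
                out.append(-((1 - v) ** (2 - lmbda) - 1) / (2 - lmbda))
    return out


mixed = [-2.0, 0.0, 3.0]
identity = yeojohnson_transform(mixed, 1)   # lmbda = 1 leaves values unchanged
log_like = yeojohnson_transform(mixed, 0)   # log-like on the non-negative side
```

Setting lmbda = 1 returns the input unchanged, which makes a convenient sanity check for any implementation.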

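Let's put it all together!

Box-Cox transform to stabilize the variance
Deseasonalize to remove seasonality
First difference to stabilize the mean

The goal is to make the time series look more stationary, which is an important assumption of many time series forecasting models. An excellent introduction to the topic: https://otexts.com/fpp3/stationarity.html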

y_new = (
    y.pipe(boxcox())
    .pipe(deseasonalize_fourier(sp=12, K=3))
    .pipe(diff(order=1))
    .collect()
)
figure = plot_panel(y=y_new.group_by(entity_col).tail(64), height=800, width=1000)
figure.show(renderer="svg")
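
To see why the order of the steps matters, consider a stripped-down version of the chain in plain Python: a log transform (Box-Cox with lambda = 0) followed by a first difference, applied to a hypothetical series that grows by 10% per step. Deseasonalization is left out to keep the sketch short.

```python
import math

# Hypothetical series growing 10% per step: both level and step size increase.
series = [100 * 1.1**t for t in range(8)]

# Step 1: the log turns multiplicative growth into additive growth.
logged = [math.log(v) for v in series]

# Step 2: the first difference removes the remaining linear trend.
stationary = [b - a for a, b in zip(logged, logged[1:])]

# Invert in reverse order: undo the difference, then undo the log.
restored_log = [logged[0]]
for step in stationary:
    restored_log.append(restored_log[-1] + step)
restored = [math.exp(v) for v in restored_log]
```

After both steps the series is constant at log(1.1). Differencing the raw series first would instead leave steps that still grow over time.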