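Lag transformations
Compute features based on lags
mlforecast allows you to define transformations on the lags to use as features. These are provided through the lag_transforms argument, which is a dict whose keys are the lags and whose values are lists of transformations to apply to that lag.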
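To make the idea of "a transformation applied to a lag" concrete, here is a minimal NumPy sketch of a rolling mean computed on a lagged series. It is only an illustration of the semantics, not mlforecast's implementation, and the helper name lag_rolling_mean is hypothetical.

```python
import numpy as np


def lag_rolling_mean(y, lag, window_size):
    """Rolling mean of the series shifted by `lag` steps (NaN where undefined)."""
    y = np.asarray(y, dtype=float)
    out = np.full(y.shape, np.nan)
    # The feature at position t only uses values up to position t - lag.
    for t in range(lag + window_size - 1, len(y)):
        window = y[t - lag - window_size + 1 : t - lag + 1]
        out[t] = window.mean()
    return out


y = np.arange(1.0, 9.0)  # 1, 2, ..., 8
feature = lag_rolling_mean(y, lag=1, window_size=3)
```

In this sketch the first defined value sits at position lag + window_size - 1, because the window must be full and must end lag steps before the current row.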
Data setup
import numpy as np
from mlforecast import MLForecast
from mlforecast.utils import generate_daily_series

data = generate_daily_series(10)

Built-in transformations
The built-in lag transformations are in the mlforecast.lag_transforms module.

from mlforecast.lag_transforms import RollingMean, ExpandingStd

fcst = MLForecast(
    models=[],
    freq='D',
    lag_transforms={
        1: [ExpandingStd()],
        7: [RollingMean(window_size=7, min_samples=1), RollingMean(window_size=14)]
    },
)

Once you define your transformations you can see what they look like with MLForecast.preprocess.
fcst.preprocess(data).head(2)

| | unique_id | ds | y | expanding_std_lag1 | rolling_mean_lag7_window_size7_min_samples1 | rolling_mean_lag7_window_size14 |
|---|---|---|---|---|---|---|
| 20 | id_0 | 2000-01-21 | 6.319961 | 1.956363 | 3.234486 | 3.283064 |
| 21 | id_0 | 2000-01-22 | 0.071677 | 2.028545 | 3.256055 | 3.291068 |
Extending the built-in transformations
You can compose the built-in transformations by using the Combine class, which takes two transformations and an operator.

import operator
from mlforecast.lag_transforms import Combine

fcst = MLForecast(
    models=[],
    freq='D',
    lag_transforms={
        1: [
            RollingMean(window_size=7),
            RollingMean(window_size=14),
            Combine(
                RollingMean(window_size=7),
                RollingMean(window_size=14),
                operator.truediv,
            )
        ],
    },
)
prep = fcst.preprocess(data)
prep.head(2)

| | unique_id | ds | y | rolling_mean_lag1_window_size7 | rolling_mean_lag1_window_size14 | rolling_mean_lag1_window_size7_truediv_rolling_mean_lag1_window_size14 |
|---|---|---|---|---|---|---|
| 14 | id_0 | 2000-01-15 | 0.435006 | 3.234486 | 3.283064 | 0.985204 |
| 15 | id_0 | 2000-01-16 | 1.489309 | 3.256055 | 3.291068 | 0.989361 |

np.testing.assert_allclose(
    prep['rolling_mean_lag1_window_size7'] / prep['rolling_mean_lag1_window_size14'],
    prep['rolling_mean_lag1_window_size7_truediv_rolling_mean_lag1_window_size14']
)

If you want one of the transformations in Combine to be applied to a different lag, you can use the Offset class, which applies the offset first and then the transformation.

from mlforecast.lag_transforms import Offset

fcst = MLForecast(
    models=[],
    freq='D',
    lag_transforms={
        1: [
            RollingMean(window_size=7),
            Combine(
                RollingMean(window_size=7),
                Offset(RollingMean(window_size=7), n=1),
                operator.truediv,
            )
        ],
        2: [RollingMean(window_size=7)]
    },
)
prep = fcst.preprocess(data)
prep.head(2)

| | unique_id | ds | y | rolling_mean_lag1_window_size7 | rolling_mean_lag1_window_size7_truediv_rolling_mean_lag2_window_size7 | rolling_mean_lag2_window_size7 |
|---|---|---|---|---|---|---|
| 8 | id_0 | 2000-01-09 | 1.462798 | 3.326081 | 0.998331 | 3.331641 |
| 9 | id_0 | 2000-01-10 | 2.035518 | 3.360938 | 1.010480 | 3.326081 |
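
The effect of the offset can be sketched in plain NumPy: offsetting a lag-1 series by one more step gives the lag-2 series, so the combined feature is the ratio of the same rolling mean at lags 1 and 2. The helpers shift and rolling_mean below are hypothetical stand-ins written for this illustration. They are not mlforecast functions.

```python
import operator

import numpy as np


def shift(x, n):
    """Shift x forward by n steps, padding the start with NaN."""
    out = np.full(len(x), np.nan)
    out[n:] = x[: len(x) - n]
    return out


def rolling_mean(x, window_size):
    """Trailing mean over window_size values; NaN until the window is full."""
    out = np.full(len(x), np.nan)
    for t in range(window_size - 1, len(x)):
        out[t] = x[t - window_size + 1 : t + 1].mean()
    return out


y = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])

lag1 = shift(y, 1)
numerator = rolling_mean(lag1, 2)              # rolling mean on lag 1
denominator = rolling_mean(shift(lag1, 1), 2)  # one extra step of offset: lag 2
ratio = operator.truediv(numerator, denominator)
```

Here denominator matches rolling_mean(shift(y, 2), 2), which mirrors the way the offset transformation in the table above ends up named after lag 2.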

np.testing.assert_allclose(
    prep['rolling_mean_lag1_window_size7'] / prep['rolling_mean_lag2_window_size7'],
    prep['rolling_mean_lag1_window_size7_truediv_rolling_mean_lag2_window_size7']
)

from sklearn.linear_model import LinearRegression

fcst = MLForecast(
    models=[LinearRegression()],
    freq='D',
    lag_transforms={
        1: [
            RollingMean(window_size=7),
            RollingMean(window_size=14),
            Combine(
                RollingMean(window_size=7),
                RollingMean(window_size=14),
                operator.truediv,
            )
        ],
    },
)
fcst.fit(data)
fcst.predict(2);

Numba-based transformations
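The window-ops package provides transformations defined as numba JIT-compiled functions. We use numba because it makes them very fast and can bypass Python's GIL, which allows them to run concurrently with multithreading.
The main benefit of these transformations is that they are very easy to implement. However, they can be very slow when their values need to be updated in the predict step, because the function has to be called again on the complete history and only the last value is kept. If performance is a concern, try the built-in transformations, or set keep_last_n in MLForecast.preprocess or MLForecast.fit to the minimum number of samples that your transformations require.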
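
The cost difference described above can be illustrated with a small sketch that compares recomputing an expanding mean over the full history against keeping a running state. Both expanding_mean_last and RunningMean are hypothetical names made up for this example, and this is not a description of mlforecast's internals.

```python
import numpy as np


def expanding_mean_last(history):
    """Stateless update: recompute over the whole history, keep the last value."""
    expanding = np.cumsum(history) / np.arange(1, len(history) + 1)
    return expanding[-1]


class RunningMean:
    """Stateful update: keep the sum and count, constant work per new value."""

    def __init__(self, history):
        self.total = float(np.sum(history))
        self.count = len(history)

    def update(self, new_value):
        self.total += new_value
        self.count += 1
        return self.total / self.count


history = [1.0, 2.0, 3.0, 4.0]
new_values = [5.0, 6.0]

# Stateless: every step touches the complete (growing) history.
stateless = []
h = list(history)
for v in new_values:
    h.append(v)
    stateless.append(expanding_mean_last(np.array(h)))

# Stateful: every step only touches the stored sum and count.
state = RunningMean(history)
stateful = [state.update(v) for v in new_values]
```

Both approaches produce the same values, but the stateless one does work proportional to the history length at every step. Trimming the stored history shortens that work, provided the transformation only depends on the most recent samples.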

from numba import njit
from window_ops.expanding import expanding_mean
from window_ops.shift import shift_array

@njit
def ratio_over_previous(x, offset=1):
    """Computes the ratio between the current value and its `offset` lag"""
    return x / shift_array(x, offset=offset)

@njit
def diff_over_previous(x, offset=1):
    """Computes the difference between the current value and its `offset` lag"""
    return x - shift_array(x, offset=offset)

If your function takes more arguments than the input array, you can provide a tuple like: (func, arg1, arg2, ...)

fcst = MLForecast(
    models=[],
    freq='D',
    lags=[1, 2, 3],
    lag_transforms={
        1: [expanding_mean, ratio_over_previous, (ratio_over_previous, 2)],  # the second ratio sets offset=2
        2: [diff_over_previous],
    },
)
prep = fcst.preprocess(data)
prep.head(2)

| | unique_id | ds | y | lag1 | lag2 | lag3 | expanding_mean_lag1 | ratio_over_previous_lag1 | ratio_over_previous_lag1_offset2 | diff_over_previous_lag2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | id_0 | 2000-01-04 | 3.481831 | 2.445887 | 1.218794 | 0.322947 | 1.329209 | 2.006809 | 7.573645 | 0.895847 |
| 4 | id_0 | 2000-01-05 | 4.191721 | 3.481831 | 2.445887 | 1.218794 | 1.867365 | 1.423546 | 2.856785 | 1.227093 |

As you can see, the name of the function is used as the transformation name, followed by the _lag suffix and the lag number. If the function has other arguments and they are not set to their default values, they are included as well, as with offset=2 here.
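
The naming rule just described can be mimicked with a short sketch. The helper feature_name is hypothetical and only reproduces the rule as stated here (function name, lag suffix, then any non-default arguments). It is not taken from mlforecast, and it uses a plain Python function in place of the njit-compiled one.

```python
import inspect


def feature_name(tfm, lag):
    """Hypothetical helper that mimics the naming rule described above."""
    # A transformation is either a function or a tuple (func, arg1, arg2, ...).
    func, *args = tfm if isinstance(tfm, tuple) else (tfm,)
    name = f"{func.__name__}_lag{lag}"
    # Skip the first parameter (the input array) and pair the rest with args.
    params = list(inspect.signature(func).parameters.values())[1:]
    for param, value in zip(params, args):
        if value != param.default:
            name += f"_{param.name}{value}"
    return name


def ratio_over_previous(x, offset=1):
    """Stand-in with the same signature as the function defined earlier."""
    return x


default_name = feature_name(ratio_over_previous, 1)
offset_name = feature_name((ratio_over_previous, 2), 1)
```

Under this rule the first call yields ratio_over_previous_lag1 and the second yields ratio_over_previous_lag1_offset2, matching the column names in the table above.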
np.testing.assert_allclose(prep['lag1'] / prep['lag2'], prep['ratio_over_previous_lag1'])
np.testing.assert_allclose(prep['lag1'] / prep['lag3'], prep['ratio_over_previous_lag1_offset2'])
np.testing.assert_allclose(prep['lag2'] - prep['lag3'], prep['diff_over_previous_lag2'])