Predict Insample

Tutorial on how to produce insample predictions.

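This tutorial shows how to use the predict_insample function of the core class to produce forecasts for the training and validation sets. In this example we train the NHITS model on the AirPassengers data and show how to recover the insample predictions after the model is fitted.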

Predict insample: the process of producing forecasts for the training and validation sets.

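Use cases:
* Debugging: insample predictions are useful for debugging, for example to check whether the model is able to fit the training set.
* Training convergence: check whether the model has converged.
* Anomaly detection: insample predictions can be used to detect anomalous behavior in the training set (e.g. outliers). (Note: if a model is too flexible, it might be able to forecast the outliers perfectly.)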
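The anomaly-detection use case can be sketched as follows. This is a minimal illustration and not part of NeuralForecast: the frame named insample is hypothetical toy data with the columns ds, y, and NHITS, and the threshold of 3 on a robust z-score is an assumed rule of thumb.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: forecasts plus hand-picked residuals, one of them large.
forecast = np.linspace(100.0, 122.0, 12)
residuals = np.array([-2.0, 1.0, 3.0, -1.0, 2.0, 72.0, -3.0, 1.0, -2.0, 2.0, 0.0, -1.0])
insample = pd.DataFrame({
    "ds": pd.date_range("2000-01-01", periods=12, freq="MS"),
    "y": forecast + residuals,
    "NHITS": forecast,
})

# Residual between the observed value and the insample forecast.
residual = insample["y"] - insample["NHITS"]

# Robust z-score: center on the median, scale by the median absolute deviation (MAD).
median = residual.median()
mad = (residual - median).abs().median()
score = (residual - median) / (1.4826 * mad)

# Flag points whose robust z-score exceeds the assumed threshold.
insample["is_anomaly"] = score.abs() > 3.0
anomalies = insample[insample["is_anomaly"]]
print(anomalies[["ds", "y", "NHITS"]])
```

A residual-based check like this only works when the model does not fit the outliers themselves; a very flexible model may leave small residuals even at anomalous points.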

You can run these experiments using GPU with Google Colab.

1. Installing NeuralForecast

%%capture
!pip install neuralforecast

2. Loading AirPassengers Data

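The core.NeuralForecast class contains shared fit, predict, and other methods that take as input a pandas DataFrame with the columns ['unique_id', 'ds', 'y'], where unique_id identifies an individual time series in the dataset, ds is the date, and y is the target variable.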

This example dataset consists of a single series, but you can easily fit your model to larger datasets in long format.
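To illustrate the long format with more than one series, the sketch below stacks two hypothetical series into one frame with the required columns. The identifiers series_a and series_b and all values are made up for illustration.

```python
import pandas as pd

# Shared monthly timestamps for both hypothetical series.
dates = pd.date_range("2020-01-01", periods=4, freq="MS")

series_a = pd.DataFrame({"unique_id": "series_a", "ds": dates, "y": [10.0, 12.0, 11.0, 13.0]})
series_b = pd.DataFrame({"unique_id": "series_b", "ds": dates, "y": [200.0, 190.0, 210.0, 205.0]})

# Long format: one row per (series, timestamp) pair, series stacked vertically.
long_df = pd.concat([series_a, series_b], ignore_index=True)
long_df = long_df.sort_values(["unique_id", "ds"]).reset_index(drop=True)

print(long_df.groupby("unique_id")["y"].count())
```

Each additional series only adds rows, so the column layout stays the same no matter how many series the dataset holds.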

%%capture
from neuralforecast.utils import AirPassengersDF

Y_df = AirPassengersDF # Defined in neuralforecast.utils
Y_df.head()

	unique_id	ds	y
0	1.0	1949-01-31	112.0
1	1.0	1949-02-28	118.0
2	1.0	1949-03-31	132.0
3	1.0	1949-04-30	129.0
4	1.0	1949-05-31	121.0

3. Model Training

First, we train the NHITS model on the AirPassengers data. We will use the fit method of the core class to train the model.

import pandas as pd

from neuralforecast import NeuralForecast
from neuralforecast.models import NHITS

%%capture
horizon = 12

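# Try different hyperparameters to improve accuracy.
models = [NHITS(h=horizon,                      # Forecast horizon
                input_size=2 * horizon,         # Length of the input sequence
                max_steps=1000,                 # Number of training steps
                n_freq_downsample=[2, 1, 1],    # Downsampling factors for each stack output
                mlp_units = 3 * [[1024, 1024]]) # Number of units in each block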
          ]
nf = NeuralForecast(models=models, freq='M')
nf.fit(df=Y_df, val_size=horizon)

4. Predict Insample

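Using the NeuralForecast.predict_insample method you can obtain the forecasts for the training and validation sets after the models are fitted. The function always uses the last dataset that was used for training in either the fit or the cross_validation method.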

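With the step_size parameter you can specify the step between consecutive windows used to produce the forecasts. In this example we set step_size=horizon to produce non-overlapping forecasts.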

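The following diagram shows how the forecasts are produced based on the step_size parameter and h (the horizon) of the model. In the diagram we set step_size=2 and h=4.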
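The windowing idea can also be illustrated with concrete numbers. The helper rolling_windows below is hypothetical and only mimics the concept of sliding a window of length h forward by step_size; how the library treats the first and last windows of a series may differ.

```python
def rolling_windows(n_timestamps, h, step_size):
    """Return (first_index, last_index) for each forecast window.

    Window k starts at index k * step_size and covers h consecutive
    timestamps. Only windows that fit entirely in the series are kept.
    """
    windows = []
    start = 0
    while start + h <= n_timestamps:
        windows.append((start, start + h - 1))
        start += step_size
    return windows


# step_size < h: consecutive windows overlap by h - step_size timestamps.
overlapping = rolling_windows(n_timestamps=10, h=4, step_size=2)

# step_size == h: every timestamp is forecast exactly once.
non_overlapping = rolling_windows(n_timestamps=12, h=4, step_size=4)

for first, last in overlapping:
    print(f"window covers timestamps {first}..{last}")
```

With step_size=2 and h=4, each window shares two timestamps with the next one, which is why setting step_size equal to the horizon yields one forecast per timestamp.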

Y_hat_insample = nf.predict_insample(step_size=horizon)


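The predict_insample function returns a pandas DataFrame with the following columns:
* unique_id: the unique identifier of the time series.
* ds: the datestamp of the forecast in each row.
* cutoff: the datestamp at which the forecast was made.
* y: the actual value of the target variable.
* model_name: the forecasted values of the model, in a column named after the model. In this case, NHITS.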

Y_hat_insample.head()

	unique_id	ds	cutoff	NHITS	y
0	1.0	1949-01-31	1948-12-31	0.204289	112.0
1	1.0	1949-02-28	1948-12-31	0.302111	118.0
2	1.0	1949-03-31	1948-12-31	0.399522	132.0
3	1.0	1949-04-30	1948-12-31	0.429369	129.0
4	1.0	1949-05-31	1948-12-31	0.518200	121.0
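When step_size is smaller than h, the same ds can appear under several cutoff values. The sketch below shows one possible way to reduce such a frame to a single forecast per timestamp by keeping the forecast made from the latest cutoff. The data are made up, and this selection rule is an assumption for illustration rather than library behavior.

```python
import pandas as pd

# Toy frame with two overlapping windows (h=3, step of one month).
toy = pd.DataFrame({
    "unique_id": [1, 1, 1, 1, 1, 1],
    "ds": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01",
                          "2020-02-01", "2020-03-01", "2020-04-01"]),
    "cutoff": pd.to_datetime(["2019-12-01"] * 3 + ["2020-01-01"] * 3),
    "NHITS": [10.0, 11.0, 12.0, 10.5, 11.5, 12.5],
    "y": [10.2, 10.8, 12.1, 10.8, 12.1, 12.9],
})

# Sort by cutoff so that the last row of each (series, ds) group
# is the forecast produced from the most recent cutoff.
latest = (
    toy.sort_values("cutoff")
       .groupby(["unique_id", "ds"], as_index=False)
       .last()
)
print(latest)
```

Averaging the forecasts per timestamp would be an equally reasonable choice; which rule fits depends on what the insample predictions are used for.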

Important

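The function produces forecasts starting from the first timestamp of the time series. For these initial timestamps the forecasts might not be accurate, because the model has very limited input information when it produces them.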
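One way to act on this caveat is to leave out the earliest windows before computing diagnostics. The following sketch rests on assumptions: the frame is toy data with ds and cutoff columns, and requiring at least input_size observed timestamps up to the cutoff is an assumed heuristic, not a library rule.

```python
import pandas as pd

horizon, input_size = 12, 24

# Toy frame: 48 monthly timestamps split into non-overlapping windows of length horizon.
ds = pd.date_range("1949-01-01", periods=48, freq="MS")
toy = pd.DataFrame({"ds": ds, "window": pd.Series(range(len(ds))) // horizon})

# The cutoff of each window is the month just before its first timestamp.
toy["cutoff"] = toy.groupby("window")["ds"].transform("min") - pd.DateOffset(months=1)

# Number of observed timestamps available at or before each cutoff.
history = toy["cutoff"].apply(lambda c: int((ds <= c).sum()))

# Keep only windows whose forecasts were made with a full input window.
reliable = toy[history >= input_size]
print(reliable["ds"].min(), len(reliable))
```

Here the first two windows are dropped because fewer than 24 observations precede their cutoffs.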

5. Plot Predictions

Finally, we plot the forecasts for the training and validation sets.

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(Y_hat_insample['ds'], Y_hat_insample['y'], label='True')
plt.plot(Y_hat_insample['ds'], Y_hat_insample['NHITS'], label='Forecast')
plt.axvline(Y_hat_insample['ds'].iloc[-horizon], color='black', linestyle='--', label='Train-Validation Split')
plt.xlabel('Timestamp [t]')
plt.ylabel('Monthly Passengers')
plt.grid()
plt.legend()

Important

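Note that the forecasts for the training set are very accurate, while those for the validation set (the last 12 timestamps) are less precise. This is because the model was trained on the training set, and deep learning models such as NHITS can easily overfit the training data.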
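The gap between the two segments can be quantified by scoring them separately. The sketch below uses synthetic numbers in place of real model output; the column layout mirrors Y_hat_insample, mean absolute error is an assumed choice of metric, and taking the last horizon rows as the validation set assumes a single series with non-overlapping windows.

```python
import numpy as np
import pandas as pd

horizon = 12
n = 48

# Synthetic target and deterministic forecast errors:
# small errors on the training part, large errors on the last horizon rows.
y = np.linspace(100.0, 200.0, n)
sign = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
error = sign * np.where(np.arange(n) < n - horizon, 1.0, 10.0)

toy = pd.DataFrame({
    "ds": pd.date_range("1949-01-01", periods=n, freq="MS"),
    "y": y,
    "NHITS": y + error,
})

# Mean absolute error on each segment.
abs_err = (toy["y"] - toy["NHITS"]).abs()
mae_train = abs_err.iloc[:-horizon].mean()
mae_val = abs_err.iloc[-horizon:].mean()
print(f"train MAE: {mae_train:.2f}, validation MAE: {mae_val:.2f}")
```

A validation error far above the training error, as in this synthetic case, is the numerical counterpart of the pattern visible in the plot.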

References

Cristian Challu, Kin G. Olivares, Boris N. Oreshkin, Federico Garza, Max Mergenthaler-Canseco, Artur Dubrawski (2021). NHITS: Neural Hierarchical Interpolation for Time Series Forecasting. Accepted at AAAI 2023.
