Time Series Classification

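Logistic regression analyzes the relationship between a binary target variable and its predictors to estimate the probability that the dependent variable equals 1. With temporal data, observations over time are not independent, so the model's errors are correlated through time. Adding autoregressive features (lags) captures this temporal dependence and can improve the predictive power of logistic regression.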
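As a minimal sketch of this idea, the snippet below builds lagged copies of a toy binary series and fits a logistic regression on them with plain gradient descent. The helper names (`make_lag_features`, `fit_logistic`), the toy series, and the learning-rate settings are illustrative assumptions, not part of NeuralForecast.

```python
import numpy as np

def make_lag_features(y, n_lags):
    """Row i of X holds y[i : i + n_lags] (oldest first); target[i] is y[i + n_lags]."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([y[k : len(y) - n_lags + k] for k in range(n_lags)])
    return X, y[n_lags:]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, t, lr=0.5, n_steps=2000):
    """Gradient descent on the mean Bernoulli negative log-likelihood."""
    Xb = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - t) / len(t)
    return w

# Toy binary series (hypothetical data) made of runs of 0s and 1s.
y = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1])
X, t = make_lag_features(y, n_lags=2)
w = fit_logistic(X, t)

# Probability that the next value is 1, given the last two observations.
p_next = sigmoid(np.array([1.0, y[-2], y[-1]]) @ w)
print(f"P(next = 1) = {p_next:.3f}")
```

The neural models used later in this notebook play the same role, with a learned nonlinear function of the lags in place of the linear one.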

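The inputs of NHITS are static exogenous variables \(\mathbf{x}^{(s)}\), historical exogenous variables \(\mathbf{x}^{(h)}_{[:t]}\), exogenous variables available at prediction time \(\mathbf{x}^{(f)}_{[:t+H]}\), and autoregressive features \(\mathbf{y}_{[:t]}\); each input is further split into categorical and continuous variables. The network uses multi-quantile regression to model the following conditional probability: \[\mathbb{P}(\mathbf{y}_{[t+1:t+H]}|\;\mathbf{y}_{[:t]},\; \mathbf{x}^{(h)}_{[:t]},\; \mathbf{x}^{(f)}_{[:t+H]},\; \mathbf{x}^{(s)})\]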
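To make the indexing in this expression concrete, the sketch below slices one series into input windows (standing in for the autoregressive features, truncated to a fixed length) and target windows of length H. The function name `make_windows` and the chosen sizes are illustrative assumptions; the way NeuralForecast samples windows internally may differ.

```python
import numpy as np

def make_windows(y, input_size, horizon):
    """Split one series into (inputs, targets) pairs.

    inputs[i]  = y[i : i + input_size]
    targets[i] = y[i + input_size : i + input_size + horizon]
    """
    y = np.asarray(y)
    n_windows = len(y) - input_size - horizon + 1
    if n_windows < 1:
        raise ValueError("series is too short for this input_size and horizon")
    inputs = np.stack([y[i : i + input_size] for i in range(n_windows)])
    targets = np.stack(
        [y[i + input_size : i + input_size + horizon] for i in range(n_windows)]
    )
    return inputs, targets

# A series of length 64, like one flattened 8x8 image.
y = np.arange(64)
inputs, targets = make_windows(y, input_size=24, horizon=12)
print(inputs.shape, targets.shape)  # (29, 24) (29, 12)
```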

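In this notebook we show how to fit NeuralForecast methods for binary sequence regression. We will:
- Install NeuralForecast.
- Load binary sequence data.
- Fit and predict temporal classifiers.
- Plot and evaluate the predictions.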

You can run these experiments on a GPU with Google Colab.

1. Installing NeuralForecast

#%%capture
#!pip install neuralforecast

import numpy as np
import pandas as pd
from sklearn import datasets

import matplotlib.pyplot as plt
from neuralforecast import NeuralForecast
from neuralforecast.models import MLP, NHITS, LSTM
from neuralforecast.losses.pytorch import DistributionLoss, Accuracy

2. Loading Binary Sequence Data

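The core.NeuralForecast class contains shared fit, predict, and other methods that take as input a pandas DataFrame with columns ['unique_id', 'ds', 'y'], where unique_id identifies the individual time series in the dataset, ds is the date, and y is the binary target variable.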
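Before passing data to the library, it can help to confirm that a frame has this long-format layout. The check below is a minimal sketch: `check_long_format` is a hypothetical helper written for this example, not a NeuralForecast function, and the toy frame is made-up data.

```python
import pandas as pd

REQUIRED_COLUMNS = ["unique_id", "ds", "y"]

def check_long_format(df):
    """Raise ValueError if df does not look like a long-format binary panel."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if df.duplicated(subset=["unique_id", "ds"]).any():
        raise ValueError("each (unique_id, ds) pair must appear only once")
    if not df["y"].isin([0, 1]).all():
        raise ValueError("y must contain only 0 and 1")
    return True

# Two short series stacked on top of each other (long format).
toy_df = pd.DataFrame({
    "unique_id": [0, 0, 0, 1, 1, 1],
    "ds": [1910, 1911, 1912, 1910, 1911, 1912],
    "y": [0, 1, 1, 0, 0, 1],
})
check_long_format(toy_df)
```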

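In this motivating example we convert 8x8 digit images into sequences of length 64 and define a classification problem: identify when a pixel exceeds a given threshold. We declare a pandas DataFrame in long format to match NeuralForecast's inputs.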

digits = datasets.load_digits()
images = digits.images[:100]

plt.imshow(images[0,:,:], cmap=plt.cm.gray, 
           vmax=16, interpolation="nearest")

pixels = np.reshape(images, (len(images), 64))
ytarget = (pixels > 10) * 1

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot(pixels[10])
ax2.plot(ytarget[10], color='purple')
ax1.set_xlabel('Pixel index')
ax1.set_ylabel('Pixel value')
ax2.set_ylabel('Pixel threshold', color='purple')
plt.grid()
plt.show()

# We flatten the images and create an input dataframe
# with 'unique_id' series identifier and 'ds' time stamp identifier.
Y_df = pd.DataFrame.from_dict({
            'unique_id': np.repeat(np.arange(100), 64),
            'ds': np.tile(np.arange(64)+1910, 100),
            'y': ytarget.flatten(), 'pixels': pixels.flatten()})
Y_df

	unique_id	ds	y	pixels
0	0	1910	0	0.0
1	0	1911	0	0.0
2	0	1912	0	5.0
3	0	1913	1	13.0
4	0	1914	0	9.0
...	...	...	...	...
6395	99	1969	1	14.0
6396	99	1970	1	16.0
6397	99	1971	0	3.0
6398	99	1972	0	0.0
6399	99	1973	0	0.0

6400 rows × 4 columns

3. Fit and Predict Temporal Classifiers

Fit the models

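Using the NeuralForecast.fit method you can train a set of models on your dataset. You can define the forecasting horizon (12 in this example) and modify the models' hyperparameters. For example, the MLP below sets hidden_size explicitly, and NHITS sets n_freq_downsample.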

See the NHITS and MLP model documentation.

Warning

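Currently, the recurrent-based model family cannot be used with Bernoulli distribution outputs. This affects the following methods: LSTM, GRU, DilatedRNN, and TCN. This feature is a work in progress.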
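Conceptually, training with a Bernoulli output means the model emits a probability for each time step and is penalized by the negative log-likelihood of the observed 0/1 values. The sketch below shows that quantity in NumPy; `bernoulli_nll` is a hypothetical helper for illustration and is not the library's implementation of DistributionLoss.

```python
import numpy as np

def bernoulli_nll(y, p, eps=1e-12):
    """Mean negative log-likelihood of binary targets y under probabilities p."""
    y = np.asarray(y, dtype=float)
    # Clip to avoid log(0) when a probability is exactly 0 or 1.
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

y_true = np.array([0, 1, 1, 0])
confident = np.array([0.1, 0.9, 0.8, 0.2])  # probabilities close to the labels
uninformed = np.full(4, 0.5)                # always predicts 50%

print(bernoulli_nll(y_true, confident))
print(bernoulli_nll(y_true, uninformed))    # equals log(2)
```

Probabilities that sit close to the observed labels give a lower loss than the uninformed 50% guess, which is what the optimizer exploits.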

# %%capture
horizon = 12

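# Try different hyperparameters to improve accuracy.
models = [MLP(h=horizon,                           # Forecasting horizon
              input_size=2 * horizon,              # Length of the input sequence
              loss=DistributionLoss('Bernoulli'),  # Binary classification loss
              valid_loss=Accuracy(),               # Accuracy validation signal
              max_steps=500,                       # Number of training steps
              scaler_type='standard',              # Type of scaler used to normalize the data
              hidden_size=64,                      # Number of units in each hidden layer of the MLP
              #early_stop_patience_steps=2,        # Early stopping regularization patience
              val_check_steps=10,                  # Frequency of the validation signal (affects early stopping)
              ),
          NHITS(h=horizon,                          # Forecasting horizon
                input_size=2 * horizon,             # Length of the input sequence
                loss=DistributionLoss('Bernoulli'), # Binary classification loss
                valid_loss=Accuracy(),              # Accuracy validation signal
                max_steps=500,                      # Number of training steps
                n_freq_downsample=[2, 1, 1],        # Downsampling factors for each stack output
                #early_stop_patience_steps=2,       # Early stopping regularization patience
                val_check_steps=10,                 # Frequency of the validation signal (affects early stopping)
                )
          ]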
nf = NeuralForecast(models=models, freq='Y')
Y_hat_df = nf.cross_validation(df=Y_df, n_windows=1)

Global seed set to 1
Global seed set to 1

Epoch 124: 100%|██████████| 4/4 [00:00<00:00, 50.22it/s, v_num=35, train_loss_step=0.260, train_loss_epoch=0.331]
Predicting DataLoader 0: 100%|██████████| 4/4 [00:00<00:00, 37.07it/s]
Epoch 124: 100%|██████████| 4/4 [00:00<00:00,  5.34it/s, v_num=37, train_loss_step=0.179, train_loss_epoch=0.180]
Predicting DataLoader 0: 100%|██████████| 4/4 [00:00<00:00, 49.74it/s]
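With n_windows=1, the cross-validation call above holds out the final horizon steps of each series and trains on everything before them. The sketch below reproduces that split for a single series using plain Python; `last_window_split` is a hypothetical helper, and the library's own splitting logic is more general (multiple windows, step sizes).

```python
def last_window_split(timestamps, horizon):
    """Split sorted timestamps into (train, cutoff, test) for one final window."""
    if horizon >= len(timestamps):
        raise ValueError("horizon must be shorter than the series")
    train = timestamps[:-horizon]
    test = timestamps[-horizon:]
    cutoff = train[-1]  # last timestamp seen during training
    return train, cutoff, test

# Same time index as each series in Y_df: 64 steps starting at 1910.
ds = list(range(1910, 1974))
train, cutoff, test = last_window_split(ds, horizon=12)
print(len(train), cutoff, test[0], test[-1])  # 52 1961 1962 1973
```

Each of the 100 series contributes 12 held-out steps, which is consistent with the 1200 rows and the cutoff of 1961 in the output that follows.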

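# By default NeuralForecast produces forecast intervals.
# In this case the lo-x and hi-x levels represent the
# lower and upper bounds of the prediction accumulating x% probability.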
Y_hat_df = Y_hat_df.reset_index(drop=True)
Y_hat_df

	unique_id	ds	cutoff	MLP	MLP-median	MLP-lo-90	MLP-lo-80	MLP-hi-80	MLP-hi-90	NHITS	NHITS-median	NHITS-lo-90	NHITS-lo-80	NHITS-hi-80	NHITS-hi-90	y	pixels
0	0	1962	1961	0.190	0.0	0.0	0.0	1.0	1.0	0.422	0.0	0.0	0.0	1.0	1.0	0	10.0
1	0	1963	1961	0.754	1.0	0.0	0.0	1.0	1.0	0.955	1.0	1.0	1.0	1.0	1.0	1	12.0
2	0	1964	1961	0.035	0.0	0.0	0.0	0.0	0.0	0.000	0.0	0.0	0.0	0.0	0.0	0	0.0
3	0	1965	1961	0.049	0.0	0.0	0.0	0.0	0.0	0.015	0.0	0.0	0.0	0.0	0.0	0	0.0
4	0	1966	1961	0.042	0.0	0.0	0.0	0.0	0.0	0.000	0.0	0.0	0.0	0.0	0.0	0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1195	99	1969	1961	0.484	0.0	0.0	0.0	1.0	1.0	0.817	1.0	0.0	0.0	1.0	1.0	1	14.0
1196	99	1970	1961	0.587	1.0	0.0	0.0	1.0	1.0	0.495	0.0	0.0	0.0	1.0	1.0	1	16.0
1197	99	1971	1961	0.336	0.0	0.0	0.0	1.0	1.0	0.126	0.0	0.0	0.0	1.0	1.0	0	3.0
1198	99	1972	1961	0.046	0.0	0.0	0.0	0.0	0.0	0.000	0.0	0.0	0.0	0.0	0.0	0	0.0
1199	99	1973	1961	0.001	0.0	0.0	0.0	0.0	0.0	0.000	0.0	0.0	0.0	0.0	0.0	0	0.0

1200 rows × 17 columns
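Because the target is binary, the interval columns can only take the values 0 and 1. A minimal sketch of why: for a Bernoulli variable with success probability p, the quantile at level q is 0 whenever q <= 1 - p and 1 otherwise. The helpers below are hypothetical and compute exact quantiles; the assumption here is that the library approximates them (for example by sampling), so values very close to a boundary may differ.

```python
def bernoulli_quantile(p, q):
    """Smallest k in {0, 1} with P(Y <= k) >= q for Y ~ Bernoulli(p)."""
    return 0 if q <= 1.0 - p else 1

def bernoulli_interval(p, level):
    """Central interval covering `level` percent probability."""
    alpha = (100 - level) / 200
    return bernoulli_quantile(p, alpha), bernoulli_quantile(p, 1 - alpha)

# Probabilities taken from the first rows of the table above.
print(bernoulli_interval(0.190, 90))  # (0, 1)
print(bernoulli_interval(0.035, 90))  # (0, 0)
print(bernoulli_interval(0.955, 90))  # (1, 1)
```

These match the lo-90 and hi-90 values shown for the corresponding rows: an interval of (0, 1) signals an uncertain prediction, while (0, 0) or (1, 1) signals a confident one.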

# Define the classification threshold for the final predictions
# If (probability > threshold) -> 1
Y_hat_df['NHITS'] = (Y_hat_df['NHITS'] > 0.5) * 1
Y_hat_df['MLP'] = (Y_hat_df['MLP'] > 0.5) * 1
Y_hat_df

	unique_id	ds	cutoff	MLP	MLP-median	MLP-lo-90	MLP-lo-80	MLP-hi-80	MLP-hi-90	NHITS	NHITS-median	NHITS-lo-90	NHITS-lo-80	NHITS-hi-80	NHITS-hi-90	y	pixels
0	0	1962	1961	0	0.0	0.0	0.0	1.0	1.0	0	0.0	0.0	0.0	1.0	1.0	0	10.0
1	0	1963	1961	1	1.0	0.0	0.0	1.0	1.0	1	1.0	1.0	1.0	1.0	1.0	1	12.0
2	0	1964	1961	0	0.0	0.0	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0	0.0
3	0	1965	1961	0	0.0	0.0	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0	0.0
4	0	1966	1961	0	0.0	0.0	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1195	99	1969	1961	0	0.0	0.0	0.0	1.0	1.0	1	1.0	0.0	0.0	1.0	1.0	1	14.0
1196	99	1970	1961	1	1.0	0.0	0.0	1.0	1.0	0	0.0	0.0	0.0	1.0	1.0	1	16.0
1197	99	1971	1961	0	0.0	0.0	0.0	1.0	1.0	0	0.0	0.0	0.0	1.0	1.0	0	3.0
1198	99	1972	1961	0	0.0	0.0	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0	0.0
1199	99	1973	1961	0	0.0	0.0	0.0	0.0	0.0	0	0.0	0.0	0.0	0.0	0.0	0	0.0

1200 rows × 17 columns
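The 0.5 cutoff is one possible choice. As a rough sketch, the snippet below compares accuracy across a few thresholds using only the ten MLP rows displayed in the first table above, so the numbers are not representative of the full test set. `accuracy_by_threshold` is a hypothetical helper; in practice the threshold would be tuned on a validation split rather than on the data used for the final evaluation.

```python
import numpy as np

def accuracy_by_threshold(y_true, prob, thresholds):
    """Map each threshold to the accuracy of (prob > threshold) against y_true."""
    y_true = np.asarray(y_true)
    prob = np.asarray(prob)
    return {
        float(t): float(np.mean((prob > t).astype(int) == y_true))
        for t in thresholds
    }

# The ten displayed rows only: MLP probabilities and their true labels.
prob = np.array([0.190, 0.754, 0.035, 0.049, 0.042,
                 0.484, 0.587, 0.336, 0.046, 0.001])
y_true = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0])

scores = accuracy_by_threshold(y_true, prob, thresholds=[0.3, 0.4, 0.5])
for threshold, acc in scores.items():
    print(f"threshold={threshold:.1f} accuracy={acc:.0%}")
```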

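4. Plot and Evaluate Predictions

Finally, we plot both models' forecasts against the true values and evaluate the accuracy of the MLP and NHITS temporal classifiers.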

plot_df = Y_hat_df[Y_hat_df.unique_id==10]

fig, ax = plt.subplots(1, 1, figsize = (20, 7))
plt.plot(plot_df.ds, plot_df.y, label='target signal')
plt.plot(plot_df.ds, plot_df['MLP'] * 1.1, label='MLP prediction')
plt.plot(plot_df.ds, plot_df['NHITS'] * .9, label='NHITS prediction')
ax.set_title('Binary Sequence Forecast', fontsize=22)
ax.set_ylabel('Pixel Threshold and Prediction', fontsize=20)
ax.set_xlabel('Timestamp [t]', fontsize=20)
ax.legend(prop={'size': 15})
ax.grid()

def accuracy(y, y_hat):
    return np.mean(y==y_hat)

mlp_acc = accuracy(y=Y_hat_df['y'], y_hat=Y_hat_df['MLP'])
nhits_acc = accuracy(y=Y_hat_df['y'], y_hat=Y_hat_df['NHITS'])

print(f'MLP Accuracy: {mlp_acc:.1%}')
print(f'NHITS Accuracy: {nhits_acc:.1%}')

MLP Accuracy: 77.7%
NHITS Accuracy: 78.1%
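Accuracy alone can be hard to interpret when one class dominates, as is likely here since most pixels fall below the threshold. The sketch below adds precision, recall, and a majority-class baseline for comparison. `binary_report` is a hypothetical helper, and the inputs are only the ten MLP rows displayed earlier, so the figures are illustrative rather than results for the full test set.

```python
import numpy as np

def binary_report(y_true, y_pred):
    """Accuracy, precision, recall, and the accuracy of always predicting the majority class."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    n = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
        "majority_baseline": max(tp + fn, tn + fp) / n,
    }

# True labels and thresholded MLP predictions from the ten displayed rows.
y_true = [0, 1, 0, 0, 0, 1, 1, 0, 0, 0]
y_pred = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]

report = binary_report(y_true, y_pred)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

A classifier is only useful to the extent that its accuracy exceeds the majority baseline, so it is worth reporting both.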
