Missing Values

::: {#cell-1 .cell 0=‘隐’ 1=‘藏’}

!pip install -Uqq nixtla

:::

::: {#cell-2 .cell 0=‘隐’ 1=‘藏’}

from nixtla.utils import in_colab

:::

::: {#cell-3 .cell 0=‘隐’ 1=‘藏’}

IN_COLAB = in_colab()

:::

::: {#cell-4 .cell 0=‘隐’ 1=‘藏’}

if not IN_COLAB:
    from nixtla.utils import colab_badge
    from dotenv import load_dotenv

:::

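TimeGPT requires time series data without missing values. You can have multiple series that start and end on different dates, but each series must contain uninterrupted data over its own time range.

In this tutorial, we show how to handle missing values in TimeGPT.

Outline

1. Load data
2. Get started with TimeGPT
3. Visualize data
4. Fill missing values
5. Forecast with TimeGPT
6. Important considerations
7. References

This work is based on skforecast's tutorial on forecasting time series with missing values.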

if not IN_COLAB:
    load_dotenv()
    colab_badge('docs/tutorials/15_missing_values')

Load data

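We start by loading the data with pandas. The dataset contains the daily number of bike rentals in a city. The column names are in Spanish, so we rename them to ds for the date and y for the number of bike rentals.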

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-learning-python/master/data/usuarios_diarios_bicimad.csv')
df = df[['fecha', 'Usos bicis total día']]  # select the date and target columns
df = df.rename(columns={'fecha': 'ds', 'Usos bicis total día': 'y'})  # assign instead of inplace to avoid modifying a slice
df.head()

	ds	y
0	2014-06-23	99
1	2014-06-24	72
2	2014-06-25	119
3	2014-06-26	135
4	2014-06-27	149

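For convenience, we convert the dates to timestamps and assign a unique ID to the series. There is only one series in this example, but when working with multiple series, each one needs its own unique ID.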

df['ds'] = pd.to_datetime(df['ds'])
df['unique_id'] = 'id1'
df = df[['unique_id', 'ds', 'y']]

We now split the data into a training set and a test set. The last 93 days of data form the test set.

train_df = df[:-93]
test_df = df[-93:]
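The positional slice above works because there is a single series. With several series in one DataFrame, the last rows would have to be held out per unique_id. The following is a minimal sketch of that idea using pandas only; the helper name split_last_h and the toy data are illustrative assumptions, not part of the tutorial or of any Nixtla library.

```python
import pandas as pd

def split_last_h(df, h):
    """Hold out the last h rows of each series (hypothetical helper)."""
    # Sort so that "last" means most recent, and make the index unique.
    df = df.sort_values(["unique_id", "ds"]).reset_index(drop=True)
    test = df.groupby("unique_id").tail(h)
    train = df.drop(test.index)
    return train, test

# Toy data: two series of different lengths (values are made up).
toy = pd.DataFrame({
    "unique_id": ["a"] * 5 + ["b"] * 4,
    "ds": list(pd.date_range("2022-01-01", periods=5))
          + list(pd.date_range("2022-01-01", periods=4)),
    "y": range(9),
})

train, test = split_last_h(toy, h=2)
print(len(train), len(test))  # 5 4
```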

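Next, we introduce some missing values into the training set to demonstrate how to handle them. This follows the approach used in the skforecast tutorial.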

mask = (
    ~((train_df['ds'] >= '2020-09-01') & (train_df['ds'] <= '2020-10-10'))
    & ~((train_df['ds'] >= '2020-11-08') & (train_df['ds'] <= '2020-12-15'))
)
train_df_gaps = train_df[mask]
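As a sanity check on the mask logic, the sketch below applies the same two date windows to a synthetic daily series and counts how many rows are dropped. The DataFrame demo and its values are made up for illustration; only the two date ranges come from the tutorial.

```python
import pandas as pd

# Synthetic daily series spanning both gap windows (values are made up).
demo = pd.DataFrame({"ds": pd.date_range("2020-08-01", "2020-12-31", freq="D")})
demo["y"] = range(len(demo))

# Rows that fall inside either window (both endpoints inclusive).
in_gap_1 = demo["ds"].between("2020-09-01", "2020-10-10")
in_gap_2 = demo["ds"].between("2020-11-08", "2020-12-15")

# Keeping rows outside both windows is equivalent to ~gap_1 & ~gap_2.
demo_gaps = demo[~(in_gap_1 | in_gap_2)]

removed = len(demo) - len(demo_gaps)
print(removed)  # 78 (40 days in the first window, 38 in the second)
```

The 78 removed days agree with the difference between the row counts reported later by fill_gaps (2929 - 2851).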

Get started with TimeGPT

Before going further, we instantiate the NixtlaClient class, which gives access to all TimeGPT methods. To do this, you need a Nixtla API key.

from nixtla import NixtlaClient

nixtla_client = NixtlaClient(
    # defaults to os.environ.get("NIXTLA_API_KEY")
    api_key='my_api_key_provided_by_nixtla'
)
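Since the client falls back to the NIXTLA_API_KEY environment variable, it can help to fail early with a clear message when no key is available. The following is a small sketch of that check using only the standard library; resolve_api_key is a hypothetical helper and the key value is a placeholder, neither is part of the nixtla package.

```python
import os

def resolve_api_key(explicit_key=None, env_var="NIXTLA_API_KEY"):
    """Return the explicit key if given, else the environment variable (hypothetical helper)."""
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise ValueError(f"No API key found; pass one explicitly or set {env_var}.")
    return key

# Placeholder value for illustration only; never hard-code a real key.
os.environ["NIXTLA_API_KEY"] = "placeholder-key"
api_key = resolve_api_key()
print(api_key == "placeholder-key")  # True
```

The resolved value could then be passed as the api_key argument shown above.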

> 👍 Use an Azure AI endpoint
>
> To use an Azure AI endpoint, set the base_url argument:
>
> nixtla_client = NixtlaClient(base_url="your Azure AI endpoint", api_key="your api_key")

::: {#cell-23 .cell 0=‘隐’ 1=‘藏’}

if not IN_COLAB:
    nixtla_client = NixtlaClient()

:::

To learn how to set up your API key, see the Setting Up Your API Key tutorial.

Visualize data

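We can visualize the data with the plot method of the NixtlaClient class. This method has an engine argument that lets you choose between plotting libraries. The default is matplotlib, but you can also use plotly for interactive plots.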

nixtla_client.plot(train_df_gaps)

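Note that there are two gaps in the data: from September 1, 2020 to October 10, 2020, and from November 8, 2020 to December 15, 2020. To see the gaps more clearly, you can use the max_insample_length argument of the plot method, or simply zoom in on the plot.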

nixtla_client.plot(train_df_gaps, max_insample_length=800)

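Also note that the period from March 16, 2020 to April 21, 2020 shows zero rentals. These are not missing values but actual zeros, corresponding to the city's COVID-19 lockdown.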
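The difference between a gap (a timestamp that is absent) and a true zero (a timestamp that is present with y equal to 0) can also be checked programmatically. The sketch below is one way to do it with pandas alone; find_missing_dates is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def find_missing_dates(df, freq="D"):
    """Timestamps expected between the first and last date but absent from df['ds']."""
    expected = pd.date_range(df["ds"].min(), df["ds"].max(), freq=freq)
    return expected.difference(pd.DatetimeIndex(df["ds"]))

toy = pd.DataFrame({
    "ds": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05", "2020-01-06"]),
    "y": [10, 0, 0, 12],  # the zeros are real observations, not gaps
})

missing = find_missing_dates(toy)
zero_days = toy.loc[toy["y"] == 0, "ds"]

print(list(missing.strftime("%Y-%m-%d")))  # ['2020-01-03', '2020-01-04']
print(len(zero_days))                      # 2
```

Only the dates returned by find_missing_dates need to be filled; the zero-valued rows are already valid observations.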

Fill missing values

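Before using TimeGPT, we need to make sure that:

1. Every timestamp between the start date and the end date is present in the data.
2. The target column contains no missing values.

To address the first point, we use the fill_gaps function from utilsforecast, a Nixtla package of essential utilities for time series forecasting that covers data preprocessing, plotting, and evaluation.

The fill_gaps function fills in the missing dates in the data. It takes the following arguments:

- df: the DataFrame containing the time series data.
- freq (str or int): the frequency of the data.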

from utilsforecast.preprocessing import fill_gaps

print('Number of rows before filling gaps:', len(train_df_gaps))
train_df_complete = fill_gaps(train_df_gaps, freq='D')
print('Number of rows after filling gaps:', len(train_df_complete))

Number of rows before filling gaps: 2851
Number of rows after filling gaps: 2929
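To make the effect of this step concrete, here is a rough pandas-only sketch of the same idea: reindex each series onto a complete date range so that missing dates appear as rows with a NaN target. It is an illustration under simplifying assumptions (columns named unique_id, ds, and y; fixed frequency), not the utilsforecast implementation, and fill_date_gaps is a hypothetical name.

```python
import pandas as pd

def fill_date_gaps(df, freq="D"):
    """Insert NaN-target rows for every missing timestamp, per series (hypothetical helper)."""
    filled = []
    for uid, group in df.groupby("unique_id"):
        full_index = pd.date_range(group["ds"].min(), group["ds"].max(), freq=freq)
        g = group.set_index("ds").reindex(full_index)  # missing dates become NaN rows
        g["unique_id"] = uid                            # restore the ID on the new rows
        g = g.rename_axis("ds").reset_index()
        filled.append(g[["unique_id", "ds", "y"]])
    return pd.concat(filled, ignore_index=True)

toy = pd.DataFrame({
    "unique_id": "id1",
    "ds": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"]),
    "y": [5.0, 7.0, 11.0],
})

toy_complete = fill_date_gaps(toy)
print(len(toy), len(toy_complete))            # 3 5
print(int(toy_complete["y"].isna().sum()))    # 2
```

The new rows carry NaN in y, which is why a second step is still needed to fill the target column.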

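Now we need to decide how to fill the missing values in the target column. In this tutorial we use interpolation, but when choosing a filling strategy it is important to consider the specific context of your data. For example, with daily retail data, a missing value most likely means there were no sales that day, so you can fill it with zero. With hourly temperature data, on the other hand, a missing value probably means the sensor malfunctioned, so interpolation may be the better choice.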

train_df_complete['y'] = train_df_complete['y'].interpolate(method='linear', limit_direction='both')
train_df_complete.isna().sum()  # check for any remaining missing values

unique_id    0
ds           0
y            0
dtype: int64
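The choice of strategy changes the filled values considerably. The short sketch below compares zero filling, forward filling, and the linear interpolation used above on a made-up series, so the differences are easy to see.

```python
import numpy as np
import pandas as pd

# Made-up series with leading, interior, and trailing missing values.
y = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

zero_filled = y.fillna(0)       # treats every gap as "nothing happened"
forward_filled = y.ffill()      # repeats the last observed value
interpolated = y.interpolate(method="linear", limit_direction="both")

print(zero_filled.tolist())     # [0.0, 10.0, 0.0, 0.0, 40.0, 0.0]
print(interpolated.tolist())    # [10.0, 10.0, 20.0, 30.0, 40.0, 40.0]
print(int(forward_filled.isna().sum()))  # 1 (the leading value has nothing to copy)
```

Note that linear interpolation in pandas does not extrapolate: with limit_direction='both', leading and trailing gaps are filled with the nearest observed value.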

Forecast with TimeGPT

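We are now ready to use the forecast method of the NixtlaClient class. This method takes the following arguments:

- df: the DataFrame containing the time series data.
- h (int): the forecast horizon. Here it is 93 days.
- model (str): the model to use. The default is timegpt-1, but because a 93-day horizon is long for daily data, we use timegpt-1-long-horizon. To learn more, see the long-horizon forecasting tutorial.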

fcst = nixtla_client.forecast(train_df_complete, h=len(test_df), model='timegpt-1-long-horizon')

INFO:nixtla.nixtla_client:Validating inputs...
INFO:nixtla.nixtla_client:Preprocessing dataframes...
INFO:nixtla.nixtla_client:Inferred freq: D
WARNING:nixtla.nixtla_client:The specified horizon "h" exceeds the model horizon. This may lead to less accurate forecasts. Please consider using a smaller horizon.
INFO:nixtla.nixtla_client:Restricting input...
INFO:nixtla.nixtla_client:Calling Forecast Endpoint...

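> 📘 Available models in Azure AI
>
> If you are using an Azure AI endpoint, be sure to set model="azureai":
>
> nixtla_client.forecast(..., model="azureai")
>
> For the public API, we support two models: timegpt-1 and timegpt-1-long-horizon.
>
> By default, timegpt-1 is used. For how and when to use timegpt-1-long-horizon, see the long-horizon forecasting tutorial.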

We can use the plot method to visualize the TimeGPT forecast together with the test set.

nixtla_client.plot(test_df, fcst)

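Next, we use the evaluate function from utilsforecast to compute the mean absolute error (MAE) of the TimeGPT forecast. Before doing so, we convert the dates in the forecast to timestamps so that it can be merged with the test set.

The evaluate function takes the following arguments:

- df: the DataFrame containing the forecasts and the actual values (in the y column).
- metrics (list): the metrics to compute.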

from utilsforecast.evaluation import evaluate
from utilsforecast.losses import mae

fcst['ds'] = pd.to_datetime(fcst['ds'])
result = test_df.merge(fcst, on=['ds', 'unique_id'], how='left')
result.head()

	unique_id	ds	y	TimeGPT
0	id1	2022-06-30	13468	13357.357422
1	id1	2022-07-01	12932	12390.051758
2	id1	2022-07-02	9918	9778.649414
3	id1	2022-07-03	8967	8846.636719
4	id1	2022-07-04	12869	11589.071289

evaluate(result, metrics=[mae])

	unique_id	metric	TimeGPT
0	id1	mae	1824.693076
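For reference, MAE is the mean of the absolute differences between actual and predicted values. The sketch below computes it by hand with pandas on the five rows displayed in result.head() above; it covers only those five rows, so the value differs from the full 93-day MAE reported by evaluate.

```python
import pandas as pd

# The five rows shown in result.head() above.
sample = pd.DataFrame({
    "y": [13468, 12932, 9918, 8967, 12869],
    "TimeGPT": [13357.357422, 12390.051758, 9778.649414, 8846.636719, 11589.071289],
})

# Mean absolute error: average of |actual - forecast|.
mae_value = (sample["y"] - sample["TimeGPT"]).abs().mean()
print(round(mae_value, 2))  # 438.45
```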

Important considerations

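The key takeaway of this tutorial is that TimeGPT requires time series data without missing values. This means that:

1. Given the frequency of the data, the timestamps must be continuous, with no gaps between the start and end dates.
2. The data must not contain missing values (NaNs).

We also showed that utilsforecast provides a convenient function for filling in missing dates, and that you need to decide how to fill the missing values. That decision depends on the context of your data, so choose the filling strategy carefully and pick the one you think best reflects reality.

Finally, we demonstrated that utilsforecast provides a function for evaluating TimeGPT forecasts with common accuracy metrics.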
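Both requirements can be verified before calling the forecast method. The following is a minimal sketch of such a pre-flight check, assuming the unique_id, ds, and y column names used in this tutorial; check_ready_for_forecast is a hypothetical helper, not a function provided by nixtla or utilsforecast.

```python
import numpy as np
import pandas as pd

def check_ready_for_forecast(df, freq="D"):
    """Return a list of problems; an empty list means both requirements hold."""
    problems = []
    # Requirement 2: no NaNs in the target column.
    if df["y"].isna().any():
        problems.append("target column contains NaNs")
    # Requirement 1: every expected timestamp is present in each series.
    for uid, group in df.groupby("unique_id"):
        expected = pd.date_range(group["ds"].min(), group["ds"].max(), freq=freq)
        if len(expected) != group["ds"].nunique():
            problems.append(f"series {uid} has missing timestamps")
    return problems

good = pd.DataFrame({
    "unique_id": "id1",
    "ds": pd.date_range("2020-01-01", periods=4, freq="D"),
    "y": [1.0, 2.0, 3.0, 4.0],
})

bad = good.drop(index=2).copy()  # remove one date to create a gap
bad.loc[1, "y"] = np.nan         # and blank out one target value

print(check_ready_for_forecast(good))  # []
print(check_ready_for_forecast(bad))
```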

References

* Joaquín Amat Rodrigo and Javier Escobar Ortiz (2022). “Excluding the impact of the pandemic in time series forecasting”
