Jackknife+, Jackknife-minmax and CV+

In this notebook we compare the jackknife+, jackknife-minmax and CV+ conformal prediction methods from Barber et al., 2021.
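
As a brief recap (our summary of Barber et al., 2021, not part of the original notebook): let $\hat\mu_{-i}$ denote the model trained with the $i$-th training point left out, and $R_i = |y_i - \hat\mu_{-i}(x_i)|$ the leave-one-out residuals. For a test input $x$ and error level $\alpha$, jackknife+ returns the interval

$$
\hat C_\alpha(x) = \Big[\, \hat q^{-}_\alpha\{\hat\mu_{-i}(x) - R_i\},\; \hat q^{+}_\alpha\{\hat\mu_{-i}(x) + R_i\} \,\Big],
$$

where $\hat q^{+}_\alpha$ and $\hat q^{-}_\alpha$ denote the $\lceil (1-\alpha)(n+1)\rceil$-th and $\lfloor \alpha(n+1)\rfloor$-th smallest values over $i = 1, \dots, n$. Jackknife-minmax widens this to

$$
\Big[\, \min_i \hat\mu_{-i}(x) - \hat q^{+}_\alpha\{R_i\},\; \max_i \hat\mu_{-i}(x) + \hat q^{+}_\alpha\{R_i\} \,\Big],
$$

trading sharpness for a guaranteed $1-\alpha$ coverage, whereas jackknife+ guarantees $1-2\alpha$. CV+ replaces the $n$ leave-one-out models with $K$ cross-validation models, scoring each training point $i$ with the model trained without its fold.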

Generate regression data

We generate an arbitrary regression dataset with a scalar target variable, and split it into training and test sets.

[1]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=3, n_targets=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=250)

We arbitrarily choose gradient boosting as the regression method.

[2]:
from sklearn.ensemble import GradientBoostingRegressor

We arbitrarily set the desired coverage to 95%, i.e. an error rate of 5%.

[3]:
from fortuna.metric.regression import prediction_interval_coverage_probability

error = 0.05
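
The coverage metric imported above measures the fraction of test targets that fall inside their prediction intervals. A minimal self-contained sketch of what such a metric computes (our illustration in NumPy, not Fortuna's implementation):

```python
import numpy as np

def coverage_probability(lower, upper, targets):
    # Fraction of targets falling inside their prediction interval.
    return np.mean((targets >= lower) & (targets <= upper))

lower = np.array([0.0, 1.0, 2.0])
upper = np.array([2.0, 3.0, 4.0])
targets = np.array([1.0, 3.5, 2.5])  # the second target falls outside its interval
print(coverage_probability(lower, upper, targets))  # → 0.666...
```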

CV+

First, we train models with a K-fold cross-validation procedure.

[4]:
from sklearn.model_selection import KFold

cross_val_outputs, cross_val_targets, cross_test_outputs = [], [], []
n_splits = 5
for i, (train_idx, val_idx) in enumerate(KFold(n_splits=n_splits).split(X_train)):
    print(f"Split #{i + 1} out of {n_splits}.", end="\r")
    # Train on this fold's training indices, predict on its validation indices.
    model = GradientBoostingRegressor()
    model.fit(X_train[train_idx], y_train[train_idx])
    cross_val_outputs.append(model.predict(X_train[val_idx]))
    cross_val_targets.append(y_train[val_idx])
    cross_test_outputs.append(model.predict(X_test))
Split #5 out of 5.

Given the model outputs, we compute the conformal intervals obtained with CV+.

[5]:
from fortuna.conformal import CVPlusConformalRegressor

cvplus_interval = CVPlusConformalRegressor().conformal_interval(
    cross_val_outputs=cross_val_outputs,
    cross_val_targets=cross_val_targets,
    cross_test_outputs=cross_test_outputs,
    error=error,
)
cvplus_coverage = prediction_interval_coverage_probability(
    cvplus_interval[:, 0], cvplus_interval[:, 1], y_test
)

Jackknife+ and Jackknife-minmax

We now train models with a leave-one-out procedure. To keep the computation tractable, we stop after 30 splits.

[6]:
from sklearn.model_selection import LeaveOneOut
import jax.numpy as jnp

loo_val_outputs, loo_val_targets, loo_test_outputs = [], [], []
n_loo_splits = 30  # run leave-one-out on a subset of the training data only
for i, (train_idx, val_idx) in enumerate(LeaveOneOut().split(X_train)):
    if i >= n_loo_splits:
        break
    print(f"Split #{i + 1} out of {X_train.shape[0]}.", end="\r")
    # Train with the i-th point left out, predict on that held-out point.
    model = GradientBoostingRegressor()
    model.fit(X_train[train_idx], y_train[train_idx])
    loo_val_outputs.append(model.predict(X_train[val_idx]))
    loo_val_targets.append(y_train[val_idx])
    loo_test_outputs.append(model.predict(X_test))

loo_val_outputs = jnp.array(loo_val_outputs)
loo_val_targets = jnp.array(loo_val_targets)
loo_test_outputs = jnp.array(loo_test_outputs)
Split #30 out of 250.

Given the model outputs, we compute the conformal intervals obtained with jackknife+ and jackknife-minmax.

[7]:
from fortuna.conformal import (
    JackknifePlusConformalRegressor,
    JackknifeMinmaxConformalRegressor,
)

jkplus_interval = JackknifePlusConformalRegressor().conformal_interval(
    loo_val_outputs=loo_val_outputs,
    loo_val_targets=loo_val_targets,
    loo_test_outputs=loo_test_outputs,
    error=error,
)
jkplus_coverage = prediction_interval_coverage_probability(
    jkplus_interval[:, 0], jkplus_interval[:, 1], y_test
)

jkmm_interval = JackknifeMinmaxConformalRegressor().conformal_interval(
    loo_val_outputs=loo_val_outputs,
    loo_val_targets=loo_val_targets,
    loo_test_outputs=loo_test_outputs,
    error=error,
)
jkmm_coverage = prediction_interval_coverage_probability(
    jkmm_interval[:, 0], jkmm_interval[:, 1], y_test
)

Coverage results

[8]:
print(f"Desired coverage: {1 - error}.")
print(f"CV+ empirical coverage: {cvplus_coverage}.")
print(f"jackknife+ empirical coverage: {jkplus_coverage}.")
print(f"jackknife-minmax empirical coverage: {jkmm_coverage}.")
Desired coverage: 0.95.
CV+ empirical coverage: 0.9440000653266907.
jackknife+ empirical coverage: 0.984000027179718.
jackknife-minmax empirical coverage: 0.9920000433921814.

Compared to CV+, which here trains only 5 models via 5-fold cross validation, jackknife+ and jackknife-minmax require significantly more computation, since the full leave-one-out procedure trains one model per training point. As done above, this cost can be greatly reduced by running leave-one-out on only a subset of the training data.
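
Coverage alone does not tell the whole story: a method can over-cover simply by returning wide intervals. One way to compare sharpness is the average interval width, sketched below on a toy array (in the notebook one would pass `cvplus_interval`, `jkplus_interval` and `jkmm_interval` instead):

```python
import numpy as np

def mean_interval_width(interval):
    # interval: array of shape (n_test, 2) holding lower and upper bounds.
    return float(np.mean(interval[:, 1] - interval[:, 0]))

toy = np.array([[0.0, 1.0], [2.0, 5.0]])  # widths 1.0 and 3.0
print(mean_interval_width(toy))  # → 2.0
```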
