Jackknife+, jackknife-minmax and CV+¶
In this notebook we compare the jackknife+, jackknife-minmax and CV+ conformal prediction methods from Barber et al., 2021.
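As a brief reminder (our summary, in the notation of Barber et al., 2021): let $\hat{\mu}_{-i}$ denote the model trained without the $i$-th training point, and $R_i^{\mathrm{LOO}} = |Y_i - \hat{\mu}_{-i}(X_i)|$ the corresponding leave-one-out residual. For a miscoverage level $\alpha$, the jackknife+ and jackknife-minmax intervals at a test point $x$ are

$$\hat{C}^{\mathrm{jackknife+}}_{n,\alpha}(x) = \left[\hat{q}^-_{n,\alpha}\left\{\hat{\mu}_{-i}(x) - R^{\mathrm{LOO}}_i\right\},\ \hat{q}^+_{n,\alpha}\left\{\hat{\mu}_{-i}(x) + R^{\mathrm{LOO}}_i\right\}\right],$$

$$\hat{C}^{\mathrm{jackknife\text{-}minmax}}_{n,\alpha}(x) = \left[\min_{i}\hat{\mu}_{-i}(x) - \hat{q}^+_{n,\alpha}\left\{R^{\mathrm{LOO}}_i\right\},\ \max_{i}\hat{\mu}_{-i}(x) + \hat{q}^+_{n,\alpha}\left\{R^{\mathrm{LOO}}_i\right\}\right],$$

where $\hat{q}^-_{n,\alpha}$ and $\hat{q}^+_{n,\alpha}$ are the $\lfloor\alpha(n+1)\rfloor$-th smallest and $\lceil(1-\alpha)(n+1)\rceil$-th smallest of the $n$ values, respectively. CV+ has the same form as jackknife+, but replaces $\hat{\mu}_{-i}$ with $\hat{\mu}_{-k(i)}$, the model trained without the fold containing point $i$. Jackknife+ guarantees coverage of at least $1-2\alpha$ (CV+ a marginally weaker bound), while jackknife-minmax guarantees at least $1-\alpha$.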
Generate regression data¶
We generate an arbitrary regression dataset with a scalar target variable, and split it into training and test sets.
[1]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=500, n_features=3, n_targets=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=250)
We arbitrarily pick a gradient boosting method for the regression.
[2]:
from sklearn.ensemble import GradientBoostingRegressor
We arbitrarily set the desired coverage to 95%.
[3]:
from fortuna.metric.regression import prediction_interval_coverage_probability
error = 0.05
CV+¶
First, we train the models using a K-fold cross validation procedure.
[4]:
from sklearn.model_selection import KFold
cross_val_outputs, cross_val_targets, cross_test_outputs = [], [], []
n_splits = 5
for i, idx in enumerate(KFold(n_splits=n_splits).split(X_train)):
    print(f"Split #{i + 1} out of {n_splits}.", end="\r")
    # train a model on all folds except the held-out one
    model = GradientBoostingRegressor()
    model.fit(X_train[idx[0]], y_train[idx[0]])
    # collect held-out predictions and targets, plus predictions on the test set
    cross_val_outputs.append(model.predict(X_train[idx[1]]))
    cross_val_targets.append(y_train[idx[1]])
    cross_test_outputs.append(model.predict(X_test))
Split #5 out of 5.
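As a quick sanity check (our addition; it uses only the arrays built above), we can verify that the validation folds partition the training set and that each model predicted over the full test set.
[ ]:
# the 5 validation folds are disjoint and together cover all training points
assert sum(len(t) for t in cross_val_targets) == X_train.shape[0]
# each of the 5 models produced one prediction per test point
assert all(o.shape == (X_test.shape[0],) for o in cross_test_outputs)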
Given the model outputs, we compute the conformal intervals obtained with CV+.
[5]:
from fortuna.conformal import CVPlusConformalRegressor
cvplus_interval = CVPlusConformalRegressor().conformal_interval(
    cross_val_outputs=cross_val_outputs,
    cross_val_targets=cross_val_targets,
    cross_test_outputs=cross_test_outputs,
    error=error,
)
cvplus_coverage = prediction_interval_coverage_probability(
    cvplus_interval[:, 0], cvplus_interval[:, 1], y_test
)
Jackknife+ and jackknife-minmax¶
We now train the models using a leave-one-out procedure. To keep the training time manageable, we stop after 30 splits.
[6]:
from sklearn.model_selection import LeaveOneOut
import jax.numpy as jnp
loo_val_outputs, loo_val_targets, loo_test_outputs = [], [], []
c = 0
for i, idx in enumerate(LeaveOneOut().split(X_train)):
    # cap the leave-one-out procedure at 30 splits to limit the training time
    if c >= 30:
        break
    print(f"Split #{i + 1} out of {X_train.shape[0]}.", end="\r")
    # train a model on all training points except the held-out one
    model = GradientBoostingRegressor()
    model.fit(X_train[idx[0]], y_train[idx[0]])
    loo_val_outputs.append(model.predict(X_train[idx[1]]))
    loo_val_targets.append(y_train[idx[1]])
    loo_test_outputs.append(model.predict(X_test))
    c += 1
loo_val_outputs = jnp.array(loo_val_outputs)
loo_val_targets = jnp.array(loo_val_targets)
loo_test_outputs = jnp.array(loo_test_outputs)
Split #30 out of 250.
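Since the leave-one-out results are stacked into JAX arrays, a quick shape check (our addition; the shapes follow from the 30 splits above) makes the expected layout explicit.
[ ]:
# 30 models, each with a single held-out prediction and target
print(loo_val_outputs.shape)  # (30, 1)
print(loo_val_targets.shape)  # (30, 1)
# each of the 30 models predicts over the full test set
print(loo_test_outputs.shape)  # (30, 250)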
Given the model outputs, we compute the conformal intervals obtained with jackknife+ and jackknife-minmax.
[7]:
from fortuna.conformal import (
    JackknifePlusConformalRegressor,
    JackknifeMinmaxConformalRegressor,
)

jkplus_interval = JackknifePlusConformalRegressor().conformal_interval(
    loo_val_outputs=loo_val_outputs,
    loo_val_targets=loo_val_targets,
    loo_test_outputs=loo_test_outputs,
    error=error,
)
jkplus_coverage = prediction_interval_coverage_probability(
    jkplus_interval[:, 0], jkplus_interval[:, 1], y_test
)

jkmm_interval = JackknifeMinmaxConformalRegressor().conformal_interval(
    loo_val_outputs=loo_val_outputs,
    loo_val_targets=loo_val_targets,
    loo_test_outputs=loo_test_outputs,
    error=error,
)
jkmm_coverage = prediction_interval_coverage_probability(
    jkmm_interval[:, 0], jkmm_interval[:, 1], y_test
)
Coverage results¶
[8]:
print(f"Desired coverage: {1 - error}.")
print(f"CV+ empirical coverage: {cvplus_coverage}.")
print(f"jackknife+ empirical coverage: {jkplus_coverage}.")
print(f"jackknife-minmax empirical coverage: {jkmm_coverage}.")
Desired coverage: 0.95.
CV+ empirical coverage: 0.9440000653266907.
jackknife+ empirical coverage: 0.984000027179718.
jackknife-minmax empirical coverage: 0.9920000433921814.
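Coverage alone does not tell the whole story, since wider intervals trivially cover more. As a complementary diagnostic (our addition, reusing the intervals computed above), we can compare the average interval widths.
[ ]:
# average width of the conformal intervals produced by each method
for name, interval in [
    ("CV+", cvplus_interval),
    ("jackknife+", jkplus_interval),
    ("jackknife-minmax", jkmm_interval),
]:
    width = (interval[:, 1] - interval[:, 0]).mean()
    print(f"{name} average interval width: {width}.")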
Compared to CV+, for which we trained the models with a 5-fold cross validation, jackknife+ and jackknife-minmax require significantly more computation, since the leave-one-out procedure trains one model per training data point. This cost can be reduced considerably by running the leave-one-out procedure on only a subset of the training data, as we did above by stopping after 30 splits.
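To make the cost difference concrete, one can time a single model fit and extrapolate; the sketch below (our addition) only gives a rough upper bound, since each cross-validation or leave-one-out fit trains on somewhat less data than the full training set.
[ ]:
import time

start = time.perf_counter()
GradientBoostingRegressor().fit(X_train, y_train)
one_fit = time.perf_counter() - start

# CV+ trains n_splits models; the full jackknife trains one model per training point
print(f"Approximate CV+ training time: {n_splits * one_fit:.2f}s")
print(f"Approximate full jackknife training time: {X_train.shape[0] * one_fit:.2f}s")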