Smoothing EBMs
In this demonstration notebook, we will build an Explainable Boosting Machine (EBM) on a specially designed synthetic dataset. Our control over the data generation process lets us visually assess how well the EBM recovers the original functions used to create the data. To see how the synthetic dataset is generated, you can review the full code on GitHub, which provides insight into the underlying functions we are trying to recover. The complete dataset generation code can be found here: synthetic generation code
This notebook can be found in our examples folder on GitHub.
# install interpret if not already installed
try:
import interpret
except ModuleNotFoundError:
!pip install --quiet interpret scikit-learn
# boilerplate - generate the synthetic dataset and split into test/train
import numpy as np
from sklearn.model_selection import train_test_split
from interpret.utils import make_synthetic
from interpret import show
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
seed = 42
X, y, names, types = make_synthetic(classes=None, n_samples=50000, missing=False, seed=seed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=seed)
Train the Explainable Boosting Machine (EBM)
The synthetic dataset contains a significant number of smooth functions. To handle these smoothly varying relationships effectively, we introduce a parameter called 'smoothing_rounds' to the EBM fitting process. 'smoothing_rounds' starts the boosting in a non-greedy fashion by choosing random split points when constructing the internal decision trees. This avoids initial overfitting and establishes a baseline smooth partial response before switching to the greedy approach, which is better suited to fitting any remaining sharp transitions in the partial responses. We additionally use the reg_alpha regularization parameter to further smooth the results. EBMs also support reg_lambda and max_delta_step, which may be useful in some cases.
For some datasets with large outliers, increasing the validation set size and/or taking the median model from the outer bags can be helpful, as described here: interpretml/interpret#548
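The median-of-outer-bags idea can be sketched with plain NumPy: given predictions from several bagged models, the element-wise median damps the influence of any single bag that was skewed by outliers, whereas the mean is pulled toward it. The array values and the `bag_predictions` name below are illustrative, not part of the interpret API.

```python
import numpy as np

# Hypothetical predictions for 5 samples from 3 outer-bag models.
# One bag (row 2) was skewed by outliers during training.
bag_predictions = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 2.1, 2.9, 4.2, 5.1],
    [9.0, 8.0, 7.0, 9.5, 8.5],  # outlier-skewed bag
])

mean_pred = bag_predictions.mean(axis=0)        # pulled toward the outlier bag
median_pred = np.median(bag_predictions, axis=0)  # robust to the single skewed bag

print(mean_pred)
print(median_pred)  # [1.1 2.1 3.  4.2 5.1]
```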
from interpret.glassbox import ExplainableBoostingRegressor
ebm = ExplainableBoostingRegressor(names, types, interactions=3, smoothing_rounds=5000, reg_alpha=10.0)
ebm.fit(X_train, y_train)
ExplainableBoostingRegressor(feature_names=['feature_0', 'feature_1',
'feature_2', 'feature_3_integers',
'feature_4', 'feature_5',
'feature_6', 'feature_7_unused',
'feature_8_low_cardinality',
'feature_9_high_cardinality'],
feature_types=['continuous', 'continuous',
'continuous', 'continuous',
'continuous', 'continuous',
'continuous', 'continuous',
'nominal', 'nominal'],
interactions=3, reg_alpha=10.0,
smoothing_rounds=5000)
Global Explanations
The visualizations below confirm that the EBM was able to successfully recover the original data generation functions for this particular problem.
# Feature 0 - Cosine partial response generated on uniformly distributed data.
show(ebm.explain_global(), 0)
# Feature 1 - Sine partial response generated on normally distributed data.
show(ebm.explain_global(), 1)
# Feature 2 - Squared partial response generated on exponentially distributed data.
show(ebm.explain_global(), 2)
# Feature 3 - Linear partial response generated on poisson distributed integers.
show(ebm.explain_global(), 3)
# Feature 4 - Square wave partial response generated on a feature with correlations
# to features 0 and 1 with added normally distributed noise.
show(ebm.explain_global(), 4)
# Feature 5 - Sawtooth wave partial response generated on a feature with a conditional
# correlation to feature 2 with added normally distributed noise.
show(ebm.explain_global(), 5)
# Feature 6 - exp(x) partial response generated on a feature with interaction effects
# between features 2 and 3 with added normally distributed noise.
show(ebm.explain_global(), 6)
# Feature 7 - Unused in the generation function. Should have minimal importance.
show(ebm.explain_global(), 7)
# Feature 8 - Linear partial response generated on a low cardinality categorical feature.
# The category strings end in integers that indicate the increasing order.
show(ebm.explain_global(), 8)
# Feature 9 - Linear partial response generated on a high cardinality categorical feature.
# The category strings end in integers that indicate the increasing order.
show(ebm.explain_global(), 9)
# Interaction 0 - Pairwise interaction generated by XORing the sign of feature 0
# with the least significant bit of the integers from feature 3.
show(ebm.explain_global(), 10)
# Interaction 1 - Pairwise interaction generated by multiplying feature 1 and 2.
show(ebm.explain_global(), 11)
# Interaction 2 - Pairwise interaction generated by multiplying feature 3 and 8.
show(ebm.explain_global(), 12)
For RMSE regression, the EBM's intercept should be identical to the mean of the targets
print(np.average(y_train))
print(ebm.intercept_)
0.8689547089058739
0.8689547089058739
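The reason the intercept matches the mean: for a squared-error (RMSE) objective, the constant prediction that minimizes the loss is exactly the average of the targets. A quick NumPy check of this, using made-up targets unrelated to the dataset above:

```python
import numpy as np

y = np.array([0.5, 1.5, 2.0, 4.0])  # arbitrary targets; mean is 2.0

# Evaluate the total squared error for a grid of candidate constants.
candidates = np.linspace(0.0, 5.0, 5001)
losses = ((y[:, None] - candidates[None, :]) ** 2).sum(axis=0)

best = candidates[np.argmin(losses)]
print(best, y.mean())  # the grid minimizer coincides with the mean
```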
Importances of the features and pairwise terms
show(ebm.explain_global())
Evaluate EBM performance
from interpret.perf import RegressionPerf
ebm_perf = RegressionPerf(ebm).explain_perf(X_test, y_test, name='EBM')
print("RMSE: " + str(ebm_perf._internal_obj["overall"]["rmse"]))
show(ebm_perf)
RMSE: 0.26813750938453734
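The RMSE reported above is simply the square root of the mean squared residual. Computed by hand with stand-in arrays (in the notebook, these would be `y_test` and `ebm.predict(X_test)`):

```python
import numpy as np

# Stand-in values for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# Root mean squared error: sqrt of the average squared residual.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)
```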