EBM Internals - Binary classification

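This is part 2 of a 3-part series describing the internal structure of EBMs and how they make predictions; see also part 1 and part 3.

In part 2 we cover binary classification, interactions, missing values, ordinals, and the reduced discretization resolution used for interactions. Before reading this part, you should be familiar with part 1.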

# boilerplate
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier
import numpy as np

from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())

# make a dataset composed of an ordinal categorical, and a continuous feature
X = [["low", 8.0], ["medium", 7.0], ["high", 9.0], [None, None]]
y = ["apples", "apples", "oranges", "oranges"]

# Fit a classification EBM with 1 interaction
# Define an ordinal feature with specified ordering
# Limit the number of interaction bins to force a lower resolution
# Eliminate the validation set to handle the small dataset
ebm = ExplainableBoostingClassifier(
    interactions=1,
    feature_types=[["low", "medium", "high"], 'continuous'], 
    max_interaction_bins=4,
    validation_size=0, outer_bags=1, max_rounds=900, min_samples_leaf=1, min_hessian=1e-9)
ebm.fit(X, y)
show(ebm.explain_global())

/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/interpret/glassbox/_ebm/_ebm.py:813: UserWarning: Missing values detected. Our visualizations do not currently display missing values. To retain the glassbox nature of the model you need to either set the missing values to an extreme value like -1000 that will be visible on the graphs, or manually examine the missing value score in ebm.term_scores_[term_index][0]
  warn(

print(ebm.classes_)

['apples' 'oranges']

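Like all scikit-learn classifiers, we store the list of classes in the ebm.classes_ attribute as a sorted array. In this example our classes are strings, but we also accept integers, as we will see in part 3.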
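To make the "sorted array" convention concrete, here is a minimal sketch that assumes the labels are encoded the way np.unique encodes them. It illustrates the convention only and is not necessarily how the library does it internally.

```python
import numpy as np

# Labels from the example dataset above
y = ["apples", "apples", "oranges", "oranges"]

# np.unique returns the distinct labels in sorted order; return_inverse
# gives, for each sample, the index of its label in that sorted array
classes, y_encoded = np.unique(y, return_inverse=True)

print(classes)    # ['apples' 'oranges']
print(y_encoded)  # [0 0 1 1]
```

Under this convention the class at index 1, "oranges", is the positive class whose probability is computed later on this page.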

print(ebm.feature_types)

[['low', 'medium', 'high'], 'continuous']

In this example we passed feature_types into the __init__ function of the ExplainableBoostingClassifier. Per scikit-learn convention, it is recorded unmodified in the ebm object.

print(ebm.feature_types_in_)

['ordinal', 'continuous']

The feature_types passed into __init__ were actualized into the base feature types ['ordinal', 'continuous']. Following the spirit of scikit-learn's SLEP007 convention, we record this in ebm.feature_types_in_.

print(ebm.feature_names_in_)

['feature_0000', 'feature_0001']

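Since we did not specify feature names, the model created default names. If we had passed feature_names to the __init__ function of the ExplainableBoostingClassifier, or if we had used a Pandas dataframe with column names, then ebm.feature_names_in_ would contain those names.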

print(ebm.term_features_)

[(0,), (1,), (0, 1)]

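Our model contains 3 additive terms. The first two terms are the main-effect features, and the third term is the pairwise interaction between the two features. EBMs are not limited to main and pairwise effects; 3-way, 4-way, and higher-order interactions are also supported. If the model contains any higher-order interactions, they are listed in ebm.term_features_ as further tuples of indices.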

print(ebm.term_names_)

['feature_0000', 'feature_0001', 'feature_0000 & feature_0001']

ebm.term_names_ is a convenience attribute that combines ebm.term_features_ and ebm.feature_names_in_ to create a name for each additive term.

ebm.term_names_ is the result of:

term_names = [" & ".join(ebm.feature_names_in_[i] for i in grp) for grp in ebm.term_features_]

print(ebm.bins_)

[[{'low': 1, 'medium': 2, 'high': 3}], [array([7.5, 8.5]), array([8.5])]]

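ebm.bins_ is a per-feature attribute. As described in part 1, ebm.bins_ defines how to bin categorical ('nominal' and 'ordinal') and 'continuous' features.

For categorical features we use a dictionary that maps the category strings to bin indices.

As described in part 1, continuous feature binning is defined by a list of cut points that partition the continuous range into regions. In this example, our dataset has 3 unique values for the continuous feature: 7.0, 8.0, and 9.0. As in part 1, the main effect has 2 bin cuts that separate these values into 3 regions, and the cuts are again 7.5 and 8.5.

EBMs support reducing the binning resolution when features are binned for interactions. In the call to the __init__ of ExplainableBoostingClassifier we specified max_interaction_bins=4, which limits the EBM to 4 bins when binning for interactions. Two of those bins are reserved for "missing" and "unseen" values, which leaves the model only 2 bins for the remaining continuous feature values. Our dataset has 3 unique values, however, so the EBM is forced to decide which of them to group together and to choose a single cut point that separates them into 2 regions. In this example, the EBM could have chosen any cut point between 7.0 and 9.0. It chose 8.5, which puts the values 7.0 and 8.0 in the lower bin and 9.0 in the upper bin.

The binning definitions for main effects and interactions are stored in a per-feature list in the ebm.bins_ attribute. In this example, ebm.bins_[1] contains a list of arrays: [array([7.5, 8.5]), array([8.5])]. The first array, [7.5, 8.5], at ebm.bins_[1][0] is the binning resolution for main effects. The second array, [8.5], at ebm.bins_[1][1] is the binning resolution used when binning for interactions.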
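The effect of the two resolutions can be checked with np.digitize, which is also what the sample code later on this page uses. This sketch hardcodes the cut points printed above rather than reading them from a fitted model.

```python
import numpy as np

# The three unique values of the continuous feature in the dataset
values = np.array([7.0, 8.0, 9.0])

main_cuts = np.array([7.5, 8.5])  # copied from ebm.bins_[1][0]
pair_cuts = np.array([8.5])       # copied from ebm.bins_[1][1]

# add 1 because bin index 0 is reserved for missing values
main_bins = np.digitize(values, main_cuts) + 1
pair_bins = np.digitize(values, pair_cuts) + 1

print(main_bins)  # [1 2 3]
print(pair_bins)  # [1 1 2]
```

At main-effect resolution each value gets its own bin, while at interaction resolution 7.0 and 8.0 share the lower bin.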

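Bin resolutions are not limited to pairs. If a lower resolution were desired for triples, the list would contain a third array of bin cuts. The last item in the list is the bin resolution used for all interaction orders at or above that position. If the list contained only the [7.5, 8.5] bin resolution, that resolution would be used for main effects, pairs, triples, and higher-order interactions.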
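The "last item applies to all higher orders" rule can be written as a small helper. The function name select_bins is hypothetical and introduced only for illustration; the bin definitions are copied from the output above.

```python
import numpy as np

def select_bins(bin_levels, n_term_features):
    """Return the binning to use for one feature inside a term.

    bin_levels is the per-feature list from ebm.bins_, and
    n_term_features is the number of features in the term
    (1 for a main effect, 2 for a pair, and so on).
    """
    # clamp to the last available level for higher-order terms
    level = min(len(bin_levels), n_term_features) - 1
    return bin_levels[level]

bins_feature_1 = [np.array([7.5, 8.5]), np.array([8.5])]
print(select_bins(bins_feature_1, 1))  # main effect -> [7.5 8.5]
print(select_bins(bins_feature_1, 2))  # pair        -> [8.5]
print(select_bins(bins_feature_1, 3))  # triple      -> [8.5]

# A feature with a single level reuses it for every term order
bins_feature_0 = [{"low": 1, "medium": 2, "high": 3}]
print(select_bins(bins_feature_0, 2))
```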

print(ebm.term_scores_[0])

[ 13.00397068 -11.41817436 -11.44100247   9.85520615   0.        ]

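ebm.term_scores_[0] is the lookup table for the first feature in this example. Since the first feature is an ordinal categorical, we use the dictionary {'low': 1, 'medium': 2, 'high': 3} to look up which bin to use for each category string. If the feature value is missing (None or NaN), we use the score at index 0. If the feature value is "low", we use the score at index 1. If the feature value is "medium", we use the score at index 2. If the feature value is "high", we use the score at index 3. If the feature value is anything else, we use the score at index 4.

In this example the score in bin 0 is non-zero because we included a missing value for this feature in the dataset.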
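The lookup rules for the ordinal feature can be sketched as follows. The helper ordinal_score is hypothetical, and the dictionary and scores are copied from the printed output rather than read from a fitted model.

```python
import numpy as np

categories = {"low": 1, "medium": 2, "high": 3}  # copied from ebm.bins_[0][0]

# index 0: missing, indices 1-3: the categories, index 4: unseen values
scores = np.array([13.00397068, -11.41817436, -11.44100247, 9.85520615, 0.0])

def ordinal_score(value):
    """Return the term score for one value of the ordinal feature."""
    # None and NaN are missing (NaN is the only value not equal to itself)
    if value is None or value != value:
        return scores[0]
    # categories not seen during fitting fall into the last bin
    return scores[categories.get(value, len(scores) - 1)]

print(ordinal_score("low"))    # -11.41817436
print(ordinal_score(None))     # 13.00397068
print(ordinal_score("other"))  # 0.0
```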

print(ebm.term_scores_[1])

[ 12.70602723 -11.17326023 -11.21006963   9.67730263   0.        ]

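ebm.term_scores_[1] is the lookup table for the second feature in this example. Since the second feature is continuous, we use cut points for binning. The 0th bin index is again reserved for missing values, and the last bin index is again reserved for unseen values. In this example the score in bin 0 is non-zero because we included a missing value for this feature in the dataset.

The ebm.bins_[1] attribute contains a list with 2 arrays of cut points. Here we are binning a main-effect feature, so we use the bins at index 0, which is ebm.bins_[1][0].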

print(ebm.term_scores_[2])

[[ 1.1250000e+00 -4.6500000e-01 -1.2951066e-14  0.0000000e+00]
 [ 4.6500000e-01 -1.1250000e+00 -1.2951066e-14  0.0000000e+00]
 [-1.2951066e-14 -1.1250000e+00  2.7000000e-01  0.0000000e+00]
 [-1.2951066e-14 -2.7000000e-01  1.1250000e+00  0.0000000e+00]
 [ 0.0000000e+00  0.0000000e+00  0.0000000e+00  0.0000000e+00]]

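ebm.term_scores_[2] is the lookup table for the pair of features in this example. The features participating in the pair can be found in ebm.term_features_[2]. The lookup table for a pair is two-dimensional, so indexing it requires two indices: the first is the bin index of the first feature, and the second is the bin index of the second feature. Example: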

pair_scores = ebm.term_scores_[2]

local_score = pair_scores[(feature_0_index, feature_1_index)]
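As a worked instance of this two-index lookup, the sketch below hardcodes the pair table, the category dictionary, and the interaction cut point from the outputs above, then scores the sample ["high", 9.0].

```python
import numpy as np

# Copied from the printed ebm.term_scores_[2]:
# rows are feature 0 bins (missing, low, medium, high, unseen),
# columns are feature 1 bins (missing, lower, upper, unseen)
pair_scores = np.array([
    [1.125, -0.465, -1.2951066e-14, 0.0],
    [0.465, -1.125, -1.2951066e-14, 0.0],
    [-1.2951066e-14, -1.125, 0.27, 0.0],
    [-1.2951066e-14, -0.27, 1.125, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
categories = {"low": 1, "medium": 2, "high": 3}
pair_cuts = np.array([8.5])  # interaction resolution, ebm.bins_[1][1]

sample = ["high", 9.0]
feature_0_index = categories[sample[0]]                  # 3
feature_1_index = np.digitize(sample[1], pair_cuts) + 1  # 2

local_score = pair_scores[(feature_0_index, feature_1_index)]
print(local_score)  # 1.125
```

Note that the continuous feature is binned with the lower-resolution cut [8.5] here, not with the main-effect cuts.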

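Sample code

Finally, here is some code that puts the above considerations together into a function that can make predictions for simplified scenarios. This code does not handle things like regression, multiclass, unseen values, or interactions beyond pairs.

If you need a complete function that works in all EBM scenarios, see the multiclass example in part 3, which handles regression and binary classification in addition to multiclass, along with all the other nuances.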

sample_scores = []
for sample in X:
    # start from the intercept for each sample
    # intercept_ may be a one-element array, so extract the scalar explicitly
    score = float(np.ravel(ebm.intercept_)[0])
    
    # We have 3 terms: two main effects, and 1 pair interaction
    for term_idx, features in enumerate(ebm.term_features_):
        # indexing into a tensor requires a multi-dimensional index
        tensor_index = []

        # main effects will have 1 feature, and pairs will have 2 features
        for feature_idx in features:
            feature_val = sample[feature_idx]
            bin_idx = 0  # if missing value, use bin index 0

            # treat None and NaN as missing (NaN is the only value not equal to itself)
            if feature_val is not None and feature_val == feature_val:
                # we bin differently for main effects and pairs,
                # so determine which resolution is needed
                if len(features) == 1 or len(ebm.bins_[feature_idx]) == 1:
                    # this is a main effect or only one bin level
                    # is available, so use the highest resolution bins
                    bins = ebm.bins_[feature_idx][0]
                elif len(features) == 2 or len(ebm.bins_[feature_idx]) == 2:
                    # use the lower resolution bins
                    bins = ebm.bins_[feature_idx][1]
                else:
                    raise Exception("Unsupported bin resolution")

                if isinstance(bins, dict):
                    # categorical feature
                    bin_idx = bins[feature_val]
                else:
                    # continuous feature
                    # add 1 because the 0th bin is reserved for 'missing'
                    bin_idx = np.digitize(feature_val, bins) + 1

            tensor_index.append(bin_idx)
        # local_score is also the local feature importance
        local_score = ebm.term_scores_[term_idx][tuple(tensor_index)]
        score += local_score
    sample_scores.append(score)

logits = np.array(sample_scores)

# use the sigmoid function to convert the logits into probabilities
probabilities = 1 / (1 + np.exp(-logits))

print("probability of " + ebm.classes_[1])
print(ebm.predict_proba(X)[:, 1])
print(probabilities)

probability of oranges
[1.60228407e-10 1.62484339e-10 1.00000000e+00 1.00000000e+00]
[1.60228407e-10 1.62484339e-10 1.00000000e+00 1.00000000e+00]

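For regression, our default link function is the identity link function, so the scores are the actual predictions.

For classification, the scores are logits, and we need to apply the inverse link function to calculate the probabilities. For binary classification the inverse link function is the sigmoid function.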
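A minimal sketch of that last step, using illustrative logits rather than values from the fitted model. It assumes the usual scikit-learn layout in which predict_proba has one column per entry of classes_, and that the predicted class is the more probable one.

```python
import numpy as np

classes = np.array(["apples", "oranges"])  # as in ebm.classes_
logits = np.array([-22.5, 0.0, 22.5])      # illustrative values only

# inverse link for binary classification: the sigmoid function
p_positive = 1.0 / (1.0 + np.exp(-logits))

# column 0 is P(classes[0]) and column 1 is P(classes[1])
proba = np.column_stack([1.0 - p_positive, p_positive])

# a positive logit means the class at index 1 is more probable
predicted = classes[(logits > 0).astype(int)]

print(proba)
print(predicted)
```

A logit of 0 corresponds to a probability of exactly 0.5, and each row of proba sums to 1.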

As with the regression in part 1, the local_score variable contains the values shown in the local explanations.

show(ebm.explain_local(X, y), 0)