EBM Internals - Multiclass

This is part 3 of a 3-part series describing EBM internals and how to make predictions. To see part 1, click here. To see part 2, click here.

In part 3 we cover multiclass classification, specified bin cuts, term exclusion, and unseen values. Before reading this part you should be familiar with the information in part 1 and part 2.

# boilerplate
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier
import numpy as np

from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
# make a dataset composed of a nominal, an unused feature, and a continuous 
X = [["Peru", "", 7], ["Fiji", "", 8], ["Peru", "", 9], [None, "", None]]
y = [6000, 5000, 4000, 6000] # integer classes

# Fit a classification EBM without interactions
# Specify exact bin cuts for the continuous feature
# Exclude the middle feature during fitting
# Eliminate the validation set to handle the small dataset
ebm = ExplainableBoostingClassifier(
    interactions=0, 
    feature_types=['nominal', 'nominal', [7.25, 9.0]], 
    exclude=[(1,)],
    validation_size=0, outer_bags=1, min_samples_leaf=1, min_hessian=1e-9)
ebm.fit(X, y)
show(ebm.explain_global())
/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/interpret/glassbox/_ebm/_ebm.py:813: UserWarning: Missing values detected. Our visualizations do not currently display missing values. To retain the glassbox nature of the model you need to either set the missing values to an extreme value like -1000 that will be visible on the graphs, or manually examine the missing value score in ebm.term_scores_[term_index][0]
  warn(

print(ebm.classes_)
[4000 5000 6000]

Like all scikit-learn classifiers, we store the list of classes in the ebm.classes_ attribute as a sorted array. In this example our classes are integers, but we also accept strings, as seen in part 2.
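
As a small illustrative sketch (not part of the original example), the per-class probabilities returned by predict_proba are ordered to match ebm.classes_, so a predicted label can be recovered by taking the argmax across classes and indexing into ebm.classes_:

import numpy as np

# sketch: recover class labels from per-class probabilities via ebm.classes_
probabilities = ebm.predict_proba(X)                  # shape (n_samples, n_classes)
predicted_labels = ebm.classes_[np.argmax(probabilities, axis=1)]
print(predicted_labels)                               # should match ebm.predict(X)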

print(ebm.feature_types)
['nominal', 'nominal', [7.25, 9.0]]

In this example we passed feature_types to the __init__ function of the ExplainableBoostingClassifier. Following the scikit-learn convention, it was recorded unmodified in the ebm object.

print(ebm.feature_types_in_)
['nominal', 'nominal', 'continuous']

The feature_types passed to __init__ were actualized into the base feature types ['nominal', 'nominal', 'continuous'].

print(ebm.feature_names)
None

feature_names was not specified in the call to the __init__ function of the ExplainableBoostingClassifier, so it was set to None, following the scikit-learn convention of recording __init__ parameters unmodified.

print(ebm.feature_names_in_)
['feature_0000', 'feature_0001', 'feature_0002']

Since we did not specify feature names, the model created some default names. If we had passed feature_names to the __init__ function of the ExplainableBoostingClassifier, or if we had used a Pandas DataFrame with column names, then ebm.feature_names_in_ would contain those names. Following scikit-learn's SLEP007 convention, the names are recorded in ebm.feature_names_in_.
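
If we had wanted explicit names, one option (sketched below with hypothetical column names) is to pass feature_names to __init__; fitting with a Pandas DataFrame whose columns are named would have the same effect:

# sketch with hypothetical names; either approach populates ebm.feature_names_in_
ebm_named = ExplainableBoostingClassifier(
    feature_names=['country', 'unused', 'measurement'],
    interactions=0,
    feature_types=['nominal', 'nominal', [7.25, 9.0]],
    exclude=[(1,)],
    validation_size=0, outer_bags=1, min_samples_leaf=1, min_hessian=1e-9)
ebm_named.fit(X, y)
print(ebm_named.feature_names_in_)   # ['country', 'unused', 'measurement']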

print(ebm.term_features_)
[(0,), (2,)]

In the call to the __init__ function of the ExplainableBoostingClassifier we specified exclude=[(1,)], which excludes the middle feature from the model's list of terms. The middle feature is therefore missing from the list of terms in ebm.term_features_.

print(ebm.term_names_)
['feature_0000', 'feature_0002']

ebm.term_names_ is also missing the middle feature, since ebm.term_features_ does not include it.
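
As a rough sketch of that relationship (the " & " joiner for interaction terms is an assumption here, and is irrelevant for these main-effect terms), each term name can be reconstructed from the names of the features the term uses:

reconstructed = [" & ".join(ebm.feature_names_in_[i] for i in features)
                 for features in ebm.term_features_]
print(reconstructed)     # ['feature_0000', 'feature_0002']
print(ebm.term_names_)   # should match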

print(ebm.bins_)
[[{'Fiji': 1, 'Peru': 2}], [], [array([7.25, 9.  ])]]

ebm.bins_ is a per-feature attribute, so the middle feature is listed here. We see, however, that the middle feature has no binning definition, since it is not considered when making predictions with the model.
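
A quick sketch confirming the per-feature structure (continuing with the fitted ebm from above):

print(len(ebm.bins_))   # 3 -- one entry per feature
print(ebm.bins_[1])     # [] -- no binning definition for the excluded middle feature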

The binning structure here is as described in part 1 and part 2. It is worth noting that the bin cuts for the continuous feature are identical to the cuts [7.25, 9.0] specified in the feature_types parameter of the __init__ function of the ExplainableBoostingClassifier.

Notably, the last bin cut specified is exactly equal to the largest feature value, 9.0. In this situation, where a feature value is identical to a cut value, the feature is placed into the upper bin.
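
A minimal sketch of that boundary behavior, binning a few values the same way the sample code at the end of this page does (np.digitize, plus 1 so that bin 0 stays reserved for missing values):

import numpy as np

cuts = ebm.bins_[2][0]   # array([7.25, 9.  ])
for value in [7.0, 7.25, 8.0, 9.0]:
    print(value, "->", np.digitize(value, cuts) + 1)
# 7.25 and 9.0 sit exactly on cuts, so they land in the upper bins (2 and 3)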

print(ebm.intercept_)
[-28.31612845 -28.47045242   0.        ]

For multiclass, ebm.intercept_ is an array of logits, one for each predicted class in ebm.classes_. This matches how other scikit-learn multiclass classifiers generate one logit per class.
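
As an illustrative sketch, applying a softmax to the intercept logits alone gives the baseline class probabilities the model predicts before any term contributions are added:

from sklearn.utils.extmath import softmax

# baseline probabilities implied by the intercept (one logit per class in ebm.classes_)
print(ebm.classes_)
print(softmax(ebm.intercept_.reshape(1, -1)))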

print(ebm.term_scores_[0])
[[-0.62123484 -0.49799169  0.9073881 ]
 [-0.65546339  1.5720376  -0.99546262]
 [ 0.63834911 -0.53702295  0.04403726]
 [ 0.          0.          0.        ]]

ebm.term_scores_[0] is the lookup table for the nominal categorical feature containing country names. For multiclass, each bin consists of an array with one logit per predicted class. In this example each row corresponds to one bin: there are 4 bins in the outer index and 3 class logits in the inner index.

Missing values are once again placed at bin index 0, shown above as the first row of 3 logits. The unseen bin is the final row of zeros.

Since this feature is a nominal categorical, we use the dictionary {'Fiji': 1, 'Peru': 2} to look up which row of logits to use for each categorical string.
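
Putting that together, a minimal sketch of the lookup for a single categorical value ("Peru" in this case):

# sketch: look up the per-class logits for the string "Peru"
mapping = ebm.bins_[0][0]             # {'Fiji': 1, 'Peru': 2}
bin_idx = mapping.get("Peru", -1)     # unseen strings fall into the last bin
print(ebm.term_scores_[0][bin_idx])   # one logit per class in ebm.classes_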

print(ebm.term_scores_[1])
[[-60.96934772 -60.92160852  22.33281795]
 [ 13.33630299  13.97033549   5.65040018]
 [ 14.1539431   32.90285101 -13.67516121]
 [ 33.47910163  14.04842202 -14.30805692]
 [  0.           0.           0.        ]]

ebm.term_scores_[1] is the lookup table for the continuous feature. Once again, the 0th and last indexes are reserved for missing and unseen values, respectively. This particular example has 5 bins: the missing bin at index 0, the three partitions created by the 2 cut points, and the unseen bin. Each row is a single bin containing 3 class logits.
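
And the corresponding sketch for the continuous feature, binning the value 8 from the training data:

import numpy as np

# sketch: look up the per-class logits for the continuous value 8
cuts = ebm.bins_[2][0]              # array([7.25, 9.  ])
bin_idx = np.digitize(8, cuts) + 1  # +1 because bin 0 is reserved for 'missing'
print(ebm.term_scores_[1][bin_idx]) # one logit per class in ebm.classes_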

Sample code

This sample code incorporates everything discussed in all 3 sections. It could be used as a drop-in replacement for the existing EBM predict function of the ExplainableBoostingRegressor, or for the predict_proba function of the ExplainableBoostingClassifier.

from sklearn.utils.extmath import softmax

sample_scores = []
for sample in X:
    # start from the intercept for each sample
    score = ebm.intercept_.copy()
    if isinstance(score, float) or len(score) == 1:
        # regression or binary classification
        score = float(score)

    # we have 2 terms, so add their score contributions
    for term_idx, features in enumerate(ebm.term_features_):
        # indexing into a tensor requires a multi-dimensional index
        tensor_index = []

        # main effects will have 1 feature, and pairs will have 2 features
        for feature_idx in features:
            feature_val = sample[feature_idx]
            bin_idx = 0  # if missing value, use bin index 0

            if feature_val is not None and feature_val is not np.nan:
                # we bin differently for main effects and pairs, so first 
                # get the list containing the bins for different resolutions
                bin_levels = ebm.bins_[feature_idx]

                # what resolution do we need for this term (main resolution, pair
                # resolution, etc.), but limit to the last resolution available
                bins = bin_levels[min(len(bin_levels), len(features)) - 1]

                if isinstance(bins, dict):
                    # categorical feature
                    # 'unseen' category strings are in the last bin (-1)
                    bin_idx = bins.get(feature_val, -1)
                else:
                    # continuous feature
                    try:
                        # try converting to a float, if that fails it's 'unseen'
                        feature_val = float(feature_val)
                        # add 1 because the 0th bin is reserved for 'missing'
                        bin_idx = np.digitize(feature_val, bins) + 1
                    except ValueError:
                        # non-floats are 'unseen', which is in the last bin (-1)
                        bin_idx = -1
        
            tensor_index.append(bin_idx)
        # local_score is also the local feature importance
        local_score = ebm.term_scores_[term_idx][tuple(tensor_index)]
        score += local_score
    sample_scores.append(score)

predictions = np.array(sample_scores)

if hasattr(ebm, 'classes_'):
    # classification
    if len(ebm.classes_) == 2:
        # binary classification

        # softmax expects two logits for binary classification
        # the first logit is always equivalent to 0 for binary classification
        predictions = [[0, x] for x in predictions]
    predictions = softmax(predictions)

if hasattr(ebm, 'classes_'):
    print("probabilities for classes " + str(ebm.classes_))
    print("")
    print(ebm.predict_proba(X))
else:
    print(ebm.predict(X))
print("")
print(predictions)
probabilities for classes [4000 5000 6000]

[[1.98844326e-09 9.91722964e-10 9.99999997e-01]
 [9.05906581e-10 9.99999998e-01 1.04938662e-09]
 [9.99999997e-01 9.63570318e-10 1.93077962e-09]
 [7.25969327e-50 7.38164097e-50 1.00000000e+00]]

[[1.98844326e-09 9.91722964e-10 9.99999997e-01]
 [9.05906581e-10 9.99999998e-01 1.04938662e-09]
 [9.99999997e-01 9.63570318e-10 1.93077962e-09]
 [7.25969327e-50 7.38164097e-50 1.00000000e+00]]