Catboost tutorial

In this tutorial we use catboost for gradient boosting with trees.

You can install catboost with pip:

pip install catboost

or with conda:

conda install -c conda-forge catboost
[1]:
import catboost
from catboost import CatBoostClassifier, CatBoostRegressor

import shap

shap.initjs()

Let's first explore SHAP values for a dataset with numeric features.

[2]:
X, y = shap.datasets.california(n_points=500)
[3]:
model = CatBoostRegressor(iterations=300, learning_rate=0.1, random_seed=123)
model.fit(X, y, verbose=False, plot=False)
[3]:
<catboost.core.CatBoostRegressor at 0x13fe19db0>
[4]:
explainer = shap.TreeExplainer(model)
shap_values = explainer(X)

# visualize the first prediction's explanation
shap.plots.force(shap_values[0, ...])

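The explanation above shows how each feature contributes to pushing the model output from the base value (the average model output over the training dataset we passed) to the model output. Features pushing the prediction higher are shown in red; features pushing it lower are shown in blue.

If we take many explanations such as the one shown above, rotate them 90 degrees, and then stack them horizontally, we can see explanations for an entire dataset (in the notebook, this plot is interactive):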
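The additive property described above can be sketched without any library. The toy linear model, its weights, and the data below are hypothetical and are not part of the tutorial; the sketch relies on the known result that, for a linear model under the feature-independence assumption, the SHAP value of feature i is w_i * (x_i - mean(x_i)).

```python
# Hypothetical background data: three samples, two features.
data = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]
weights = [0.5, 0.25]  # hypothetical linear model coefficients
bias = 2.0


def predict(row):
    """Output of the toy linear model for one sample."""
    return bias + sum(w * x for w, x in zip(weights, row))


# Base value: the average model output over the background data.
base_value = sum(predict(row) for row in data) / len(data)

# Per-feature means over the background data.
means = [sum(row[j] for row in data) / len(data) for j in range(len(weights))]


def linear_shap(row):
    """SHAP values of a linear model, assuming independent features."""
    return [w * (x - m) for w, x, m in zip(weights, row, means)]


sample = data[2]
contributions = linear_shap(sample)

# The contributions move the output from the base value to the prediction.
reconstructed = base_value + sum(contributions)
print(f"base value    = {base_value}")
print(f"contributions = {contributions}")
print(f"prediction    = {predict(sample)} (reconstructed: {reconstructed})")
```

The same bookkeeping is what a force plot draws: it starts at the base value and stacks each feature's contribution until it reaches the model output.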

[5]:
# visualize the training set predictions
shap.plots.force(shap_values)

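To understand how a single feature affects the output of the model, we can plot the SHAP value of that feature against the value of the feature for all the examples in the dataset. Since SHAP values represent a feature's responsibility for a change in the model output, the plot below shows how the predicted house price changes as ``MedInc`` (median income in the block group) changes. Vertical dispersion at a single value of ``MedInc`` represents interaction effects with other features.

In this dependence plot, the points are colored by ``HouseAge`` (median house age in the block group), from which we can observe:

  • At lower ``MedInc`` (say, <= 4), ``HouseAge`` has little effect on the house price (or at least the dependence is not clear).

  • At higher ``MedInc``, older houses tend to have higher prices (for the same ``MedInc``, the red dots generally sit higher than the blue dots).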

[6]:
# create a SHAP dependence plot to show the effect of a single feature across the whole dataset
shap.dependence_plot("MedInc", shap_values.values, X, interaction_index="HouseAge")
../../../_images/example_notebooks_tabular_examples_tree_based_models_Catboost_tutorial_9_0.png

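To get an overview of which features are most important for the model, we can plot the SHAP values of every feature for every sample. The plot below sorts features by the sum of SHAP value magnitudes over all samples, and uses SHAP values to show the distribution of each feature's impact on the model output. The color represents the feature value (red high, blue low). This reveals, for example, that a high ``MedInc`` (median income in the block group) increases the predicted house price.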
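The ordering rule described above (sum of SHAP value magnitudes per feature) can be sketched in plain Python. The SHAP matrix below is made up for illustration and is not taken from the California housing model.

```python
# Hypothetical SHAP matrix: rows are samples, columns are features.
feature_names = ["MedInc", "HouseAge", "AveRooms"]
shap_matrix = [
    [0.8, -0.1, 0.05],
    [-0.6, 0.2, -0.05],
    [0.4, -0.3, 0.10],
]

# Sum the absolute SHAP values of each feature over all samples.
importance = {
    name: sum(abs(row[j]) for row in shap_matrix)
    for j, name in enumerate(feature_names)
}

# Features with the largest total magnitude come first.
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)
```

Taking the absolute value matters: a feature that pushes some predictions up and others down would otherwise cancel itself out and look unimportant.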

[7]:
# summarize the effects of all the features
shap.plots.beeswarm(shap_values)
../../../_images/example_notebooks_tabular_examples_tree_based_models_Catboost_tutorial_11_0.png

You can also use SHAP values to analyze the importance of categorical features.

[8]:
import catboost.datasets

train_df, test_df = catboost.datasets.amazon()
y = train_df.ACTION
X = train_df.drop("ACTION", axis=1)
cat_features = list(range(0, X.shape[1]))
[9]:
model = CatBoostClassifier(iterations=300, learning_rate=0.1, random_seed=12)
model.fit(X, y, cat_features=cat_features, verbose=False, plot=False)
[9]:
<catboost.core.CatBoostClassifier at 0x13ff33220>
[10]:
explainer = shap.TreeExplainer(model)
shap_values = explainer(X, y)

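Here is the visualization of feature importances for one positive and one negative example. Note that the raw output for binary classification is not in the range [0, 1]. To obtain the final probability, apply the sigmoid function to the raw value.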

[11]:
test_objects = [X.iloc[0:1], X.iloc[91:92]]

for obj in test_objects:
    print(f"Probability of class 1 = {model.predict_proba(obj)[0][1]:.4f}")
    raw = model.predict(obj, prediction_type="RawFormulaVal")[0]
    print(f"Formula raw prediction = {raw:.4f}")
    print("\n")
Probability of class 1 = 0.9970
Formula raw prediction = 5.8130


Probability of class 1 = 0.0229
Formula raw prediction = -3.7539
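
The link between the raw values and the probabilities printed above can be checked by hand. This is a minimal sketch that assumes the standard logistic sigmoid; the helper name `sigmoid` is introduced here for illustration, and the two raw values are copied from the output above.

```python
import math


def sigmoid(raw_value):
    """Map a raw (log-odds) score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw_value))


# Raw formula predictions copied from the cell output above.
for raw_value in (5.8130, -3.7539):
    print(f"raw = {raw_value:+.4f} -> probability = {sigmoid(raw_value):.4f}")
```

Rounded to four decimals, the results are 0.9970 and 0.0229, matching the values reported by `predict_proba`. The SHAP values in the force plots that follow are expressed on the raw scale, not the probability scale.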


[12]:
shap.plots.force(shap_values[0, ...])
[13]:
shap.plots.force(shap_values[91, ...])

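In the beeswarm plot below, the absolute values of the features carry no meaning, because they are hashes. The plot still shows clearly how the different features affect the output.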

[14]:
shap.plots.beeswarm(shap_values)
../../../_images/example_notebooks_tabular_examples_tree_based_models_Catboost_tutorial_21_0.png

You can also calculate SHAP values for multiclass classification.

[15]:
model = CatBoostClassifier(
    loss_function="MultiClass", iterations=300, learning_rate=0.1, random_seed=123
)
model.fit(X, y, cat_features=cat_features, verbose=False, plot=False)
[15]:
<catboost.core.CatBoostClassifier at 0x14037a080>
[16]:
explainer = shap.TreeExplainer(model)
shap_values = explainer(X, y)

Summary plot of SHAP values for the raw formula predictions for class 0.
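
As in the binary case, the raw multiclass predictions are per-class scores, not probabilities. The sketch below assumes the ``MultiClass`` objective turns them into probabilities with a softmax, which is how that loss is usually described; the scores are hypothetical and the helper `softmax` is not part of the tutorial.

```python
import math


def softmax(raw_scores):
    """Convert per-class raw scores into probabilities that sum to 1."""
    shift = max(raw_scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - shift) for score in raw_scores]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical raw scores for one sample with two classes (0 and 1).
raw_scores = [1.2, -0.7]
probabilities = softmax(raw_scores)
print(probabilities)
```

Each slice `shap_values[..., k]` explains the raw score of class k, so there is one set of SHAP values per class, each on the raw scale.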

[17]:
shap.plots.beeswarm(shap_values[..., 0])
../../../_images/example_notebooks_tabular_examples_tree_based_models_Catboost_tutorial_26_0.png

The same summary plot for class 1.

[18]:
shap.plots.beeswarm(shap_values[..., 1])
../../../_images/example_notebooks_tabular_examples_tree_based_models_Catboost_tutorial_28_0.png