CatBoost tutorial

In this tutorial we use catboost for gradient boosting with trees.

You can install catboost with pip:

pip install catboost

or with conda:

conda install -c conda-forge catboost

[1]:

import catboost
from catboost import CatBoostClassifier, CatBoostRegressor

import shap

shap.initjs()

First, let's explore SHAP values for a dataset with numeric features.

[2]:

X, y = shap.datasets.california(n_points=500)

[3]:

model = CatBoostRegressor(iterations=300, learning_rate=0.1, random_seed=123)
model.fit(X, y, verbose=False, plot=False)

[3]:

<catboost.core.CatBoostRegressor at 0x13fe19db0>

[4]:

explainer = shap.TreeExplainer(model)
shap_values = explainer(X)

# visualize the first prediction's explanation
shap.plots.force(shap_values[0, ...])

[4]:

(Interactive force plot for the first prediction; not rendered in this static export.)

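The explanation above shows how each feature contributes to pushing the model output from the base value (the average model output over the training dataset we passed) to the model output for this sample. Features that push the prediction higher are shown in red; those that push it lower are shown in blue.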
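The additive structure behind a force plot can be sketched with a toy example. The snippet below does not use the CatBoost model: it assumes a small linear model with made-up coefficients and background rows, for which the SHAP value of feature i (treating features as independent) is w_i * (x_i - mean_i). All names and numbers here are illustrative assumptions.

```python
# Toy illustration of SHAP additivity with made-up numbers (not the CatBoost model).
# For a linear model f(x) = bias + sum(w_i * x_i), with features treated as
# independent, the SHAP value of feature i is w_i * (x_i - mean_i).

weights = [2.0, -1.0, 0.5]  # hypothetical coefficients
bias = 1.0
background = [              # hypothetical background (training) rows
    [1.0, 2.0, 0.0],
    [3.0, 0.0, 4.0],
    [2.0, 4.0, 2.0],
]


def predict(row):
    """Evaluate the toy linear model on one row."""
    return bias + sum(w * x for w, x in zip(weights, row))


n_rows = len(background)
feature_means = [
    sum(row[i] for row in background) / n_rows for i in range(len(weights))
]

# Base value: the average model output over the background data.
base_value = sum(predict(row) for row in background) / n_rows

x = [3.0, 1.0, 4.0]  # the sample being explained
shap_vals = [w * (xi - m) for w, xi, m in zip(weights, x, feature_means)]

# Positive contributions push the prediction up (red in a force plot),
# negative contributions push it down (blue).
reconstructed = base_value + sum(shap_vals)
print(f"base value = {base_value}, model output = {predict(x)}")
print(f"base value + sum of SHAP values = {reconstructed}")
```

The base value plus the sum of the per-feature contributions recovers the model output for the sample, which is the same relationship the force plot draws.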

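If we take many explanations like the one above, rotate them 90 degrees, and stack them horizontally, we can see explanations for the entire dataset (in the notebook, this plot is interactive):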

[5]:

# visualize the training set predictions
shap.plots.force(shap_values)

[5]:

(Interactive stacked force plot for the whole dataset; not rendered in this static export.)

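To understand how a single feature affects the model's output, we can plot the SHAP value of that feature against the feature's value for every example in the dataset. Because SHAP values represent a feature's contribution to a change in the model output, the plot below shows how the predicted house price changes as ``MedInc`` (median income in the block group) changes. Vertical dispersion at a single value of ``MedInc`` reflects interaction effects with other features.

In this dependence plot we color the points by ``HouseAge`` (median house age in the block group), from which we can observe:

At lower ``MedInc`` (say, <= 4), ``HouseAge`` has little effect on the house price (or at least the dependence is not obvious).
At higher ``MedInc``, older houses tend to have higher prices (for the same ``MedInc``, red points generally sit above blue points).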

[6]:

# create a SHAP dependence plot to show the effect of a single feature across the whole dataset
shap.dependence_plot("MedInc", shap_values.values, X, interaction_index="HouseAge")

../../../_images/example_notebooks_tabular_examples_tree_based_models_Catboost_tutorial_9_0.png

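To get an overview of which features are most important for the model, we can plot the SHAP values of every feature for every sample. The plot below sorts features by the sum of SHAP value magnitudes over all samples and uses the SHAP values to show the distribution of each feature's impact on the model output. The color represents the feature value (red high, blue low). This reveals, for example, that a high ``MedInc`` (median income in the block group) increases the predicted house price.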

[7]:

# summarize the effects of all the features
shap.plots.beeswarm(shap_values)

../../../_images/example_notebooks_tabular_examples_tree_based_models_Catboost_tutorial_11_0.png
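The ordering described above can be sketched on a small matrix. The SHAP values below are made-up numbers, not output from the fitted model, and the sketch uses the mean of the magnitudes, which yields the same ranking as their sum.

```python
# Hypothetical SHAP matrix: rows are samples, columns are features (made-up numbers).
feature_names = ["MedInc", "HouseAge", "AveRooms"]
shap_matrix = [
    [0.80, -0.10, 0.05],
    [-0.60, 0.20, -0.02],
    [0.40, -0.30, 0.01],
]

n_samples = len(shap_matrix)

# Mean absolute SHAP value per feature: a global importance score.
importance = {
    name: sum(abs(row[j]) for row in shap_matrix) / n_samples
    for j, name in enumerate(feature_names)
}

# Most important feature first, matching the top-to-bottom order of the plot.
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)
```

Taking absolute values before averaging matters: a feature whose contributions are large but of mixed sign would otherwise cancel out and look unimportant.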

You can also use SHAP values to analyze the importance of categorical features.

[8]:

import catboost.datasets

train_df, test_df = catboost.datasets.amazon()
y = train_df.ACTION
X = train_df.drop("ACTION", axis=1)
cat_features = list(range(0, X.shape[1]))

[9]:

model = CatBoostClassifier(iterations=300, learning_rate=0.1, random_seed=12)
model.fit(X, y, cat_features=cat_features, verbose=False, plot=False)

[9]:

<catboost.core.CatBoostClassifier at 0x13ff33220>

[10]:

explainer = shap.TreeExplainer(model)
shap_values = explainer(X, y)

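Below are visualizations of the feature contributions for one positive and one negative example. Note that the raw output of a binary classifier is not in the range [0, 1]; it is a log-odds score. To obtain the final probability, apply the sigmoid function to it.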

[11]:

test_objects = [X.iloc[0:1], X.iloc[91:92]]

for obj in test_objects:
    print(f"Probability of class 1 = {model.predict_proba(obj)[0][1]:.4f}")
    print(
        "Formula raw prediction = {:.4f}".format(
            model.predict(obj, prediction_type="RawFormulaVal")[0]
        )
    )
    print("\n")

Probability of class 1 = 0.9970
Formula raw prediction = 5.8130


Probability of class 1 = 0.0229
Formula raw prediction = -3.7539
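The relationship between the two printed numbers can be checked by hand. A minimal sketch, assuming the raw score is a log-odds value and using only the rounded raw predictions printed above (the `sigmoid` helper is defined here for illustration and is not part of catboost or shap):

```python
import math


def sigmoid(raw):
    """Map a raw log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw))


# Raw scores printed above for the two test objects.
for raw in (5.8130, -3.7539):
    print(f"raw = {raw:+.4f} -> probability = {sigmoid(raw):.4f}")
```

Up to rounding, this reproduces the class 1 probabilities reported by `predict_proba` for the same two objects.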

[12]:

shap.plots.force(shap_values[0, ...])

[12]:

(Interactive force plot for the positive example; not rendered in this static export.)

[13]:

shap.plots.force(shap_values[91, ...])

[13]:

(Interactive force plot for the negative example; not rendered in this static export.)

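In the beeswarm plot below, the absolute values of the features carry no meaning, because they are hashes. The plot nevertheless makes the effect of each feature clear.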

[14]:

shap.plots.beeswarm(shap_values)

../../../_images/example_notebooks_tabular_examples_tree_based_models_Catboost_tutorial_21_0.png

You can also compute SHAP values for multiclass models.

[15]:

model = CatBoostClassifier(
    loss_function="MultiClass", iterations=300, learning_rate=0.1, random_seed=123
)
model.fit(X, y, cat_features=cat_features, verbose=False, plot=False)

[15]:

<catboost.core.CatBoostClassifier at 0x14037a080>

[16]:

explainer = shap.TreeExplainer(model)
shap_values = explainer(X, y)

Summary plot of the SHAP values for the raw formula predictions of class 0.

[17]:

shap.plots.beeswarm(shap_values[..., 0])

../../../_images/example_notebooks_tabular_examples_tree_based_models_Catboost_tutorial_26_0.png

And the same summary plot for class 1.

[18]:

shap.plots.beeswarm(shap_values[..., 1])

../../../_images/example_notebooks_tabular_examples_tree_based_models_Catboost_tutorial_28_0.png
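As in the binary case, the per-class raw scores explained by these plots are not probabilities. A minimal sketch, assuming the MultiClass loss maps raw scores to probabilities with a softmax; the raw scores below are made-up and the `softmax` helper is defined here for illustration only:

```python
import math


def softmax(raw_scores):
    """Convert per-class raw scores to probabilities that sum to 1."""
    shift = max(raw_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


raw = [-1.9, 1.9]  # hypothetical raw scores for class 0 and class 1
probs = softmax(raw)
print(probs)
```

Because softmax couples the classes, a SHAP value for one class's raw score describes a shift in that score, not directly a shift in that class's probability.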