Fitting model on imbalanced datasets and how to fight bias#

This example illustrates the problem induced by learning on datasets having imbalanced classes. Subsequently, we compare different approaches alleviating these negative effects.

# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
print(__doc__)

Problem definition#

We will drop the following features:

  • "fnlwgt": this feature was created while studying the "adult" dataset. Thus, we will not use this feature, which is not acquired during the survey.

  • "education-num": it encodes the same information as "education". Thus, we remove one of these two features.

from sklearn.datasets import fetch_openml

df, y = fetch_openml("adult", version=2, as_frame=True, return_X_y=True)
df = df.drop(columns=["fnlwgt", "education-num"])

The classes in the "adult" dataset have a ratio of about 3:1.

classes_count = y.value_counts()
classes_count

class
<=50K    37155
>50K     11687
Name: count, dtype: int64

This dataset is only slightly imbalanced. To better highlight the effect of learning from an imbalanced dataset, we will increase its ratio to 30:1.

from imblearn.datasets import make_imbalance

ratio = 30
df_res, y_res = make_imbalance(
    df,
    y,
    sampling_strategy={classes_count.idxmin(): classes_count.max() // ratio},
)
y_res.value_counts()
class
<=50K    37155
>50K      1238
Name: count, dtype: int64

We will perform a cross-validation evaluation to get an estimate of the test score.

As a baseline, we could use a classifier which will always predict the majority class, independently of the features provided.

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

dummy_clf = DummyClassifier(strategy="most_frequent")
scoring = ["accuracy", "balanced_accuracy"]
cv_result = cross_validate(dummy_clf, df_res, y_res, scoring=scoring)
print(f"Accuracy score of a dummy classifier: {cv_result['test_accuracy'].mean():.3f}")
Accuracy score of a dummy classifier: 0.968

Instead of the accuracy, we can use the balanced accuracy, which will take the balancing issue into account.

print(
    "Balanced accuracy score of a dummy classifier: "
    f"{cv_result['test_balanced_accuracy'].mean():.3f}"
)
Balanced accuracy score of a dummy classifier: 0.500
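In scikit-learn, the balanced accuracy corresponds to the macro-average of the recall obtained on each class. As a sanity check, here is a minimal sketch (our own addition, fitting the dummy model once on the full data outside of cross-validation, purely for illustration) that recovers the 0.5 score:

from sklearn.metrics import recall_score

# The dummy model always predicts the majority class: recall is 1.0 on the
# majority class and 0.0 on the minority class, hence a 0.5 macro-average.
dummy_clf.fit(df_res, y_res)
y_pred = dummy_clf.predict(df_res)
per_class_recall = recall_score(y_res, y_pred, average=None)
print(f"Per-class recall: {per_class_recall}")
print(f"Macro-average (balanced accuracy): {per_class_recall.mean():.3f}")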

Strategies to learn from an imbalanced dataset#

We will use a dictionary and a list to continuously store the results of our experiments and show them as a pandas dataframe.

index = []
scores = {"Accuracy": [], "Balanced accuracy": []}

Dummy baseline#

Before training a real machine learning model, we can store the results obtained with our DummyClassifier.

import pandas as pd

index += ["Dummy classifier"]
cv_result = cross_validate(dummy_clf, df_res, y_res, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores
                  Accuracy  Balanced accuracy
Dummy classifier  0.967755                0.5


Linear classifier baseline#

We will create a machine learning pipeline using a LogisticRegression classifier. In this regard, we need to one-hot encode the categorical columns and standardize the numerical columns before injecting the data into the LogisticRegression classifier.

First, we define our numerical and categorical pipelines.

from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_pipe = make_pipeline(
    StandardScaler(), SimpleImputer(strategy="mean", add_indicator=True)
)
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)

Then, we can create a preprocessor which will dispatch the categorical columns to the categorical pipeline and the numerical columns to the numerical pipeline.

from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer

preprocessor_linear = make_column_transformer(
    (num_pipe, selector(dtype_include="number")),
    (cat_pipe, selector(dtype_include="category")),
    n_jobs=2,
)

Finally, we connect our preprocessor with our LogisticRegression. We can then evaluate our model.

from sklearn.linear_model import LogisticRegression

lr_clf = make_pipeline(preprocessor_linear, LogisticRegression(max_iter=1000))
index += ["Logistic regression"]
cv_result = cross_validate(lr_clf, df_res, y_res, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores
                     Accuracy  Balanced accuracy
Dummy classifier     0.967755            0.50000
Logistic regression  0.971453            0.58389


We can see that our linear model is learning slightly better than our dummy baseline. However, it is impacted by the class imbalance.

We can verify that something similar happens with a tree-based model such as RandomForestClassifier. With this type of classifier, we do not need to scale the numerical data, and we only need to ordinal-encode the categorical data.

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

num_pipe = SimpleImputer(strategy="mean", add_indicator=True)
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
)

preprocessor_tree = make_column_transformer(
    (num_pipe, selector(dtype_include="number")),
    (cat_pipe, selector(dtype_include="category")),
    n_jobs=2,
)

rf_clf = make_pipeline(
    preprocessor_tree, RandomForestClassifier(random_state=42, n_jobs=2)
)
index += ["Random forest"]
cv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores
                     Accuracy  Balanced accuracy
Dummy classifier     0.967755           0.500000
Logistic regression  0.971453           0.583890
Random forest        0.972182           0.644381


The RandomForestClassifier is affected by the class imbalance as well, although slightly less than the linear model. Now, we will present different approaches to improve the performance of these two types of models.

Use class_weight#

Most of the models in scikit-learn have a parameter class_weight. This parameter affects the computation of the loss in linear models, or the criterion in tree-based models, to penalize differently misclassifications of the minority and majority classes. We can set class_weight="balanced" such that the weight applied is inversely proportional to the class frequency. We test this parametrization on both the linear model and the tree-based model.
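To make the weighting concrete, the following minimal sketch (our own illustration) uses sklearn.utils.class_weight.compute_class_weight, which applies the same formula as class_weight="balanced", namely n_samples / (n_classes * class_count):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rare classes receive proportionally larger weights on our 30:1 data.
balanced_weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(y_res), y=y_res
)
print(dict(zip(np.unique(y_res), balanced_weights)))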

lr_clf.set_params(logisticregression__class_weight="balanced")

index += ["Logistic regression with balanced class weights"]
cv_result = cross_validate(lr_clf, df_res, y_res, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores
                                                 Accuracy  Balanced accuracy
Dummy classifier                                 0.967755           0.500000
Logistic regression                              0.971453           0.583890
Random forest                                    0.972182           0.644381
Logistic regression with balanced class weights  0.810877           0.827702


rf_clf.set_params(randomforestclassifier__class_weight="balanced")

index += ["Random forest with balanced class weights"]
cv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores
                                                 Accuracy  Balanced accuracy
Dummy classifier                                 0.967755           0.500000
Logistic regression                              0.971453           0.583890
Random forest                                    0.972182           0.644381
Logistic regression with balanced class weights  0.810877           0.827702
Random forest with balanced class weights        0.965697           0.640643


We can see that using class_weight was really effective for the linear model, alleviating the issue of learning from imbalanced classes. However, the RandomForestClassifier is still biased toward the majority class, mainly because the criterion is not suited enough to fight the class imbalance.

Resample the training set during learning#

Another way is to resample the training set by under-sampling or over-sampling some of the samples. imbalanced-learn provides some samplers to do such processing.
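Before plugging a sampler into a pipeline, a standalone sketch (our own illustration) shows what RandomUnderSampler does: it only drops rows, so it can run on the raw dataframe, and by default it reduces the majority class down to the minority class count:

from imblearn.under_sampling import RandomUnderSampler

# fit_resample returns a new dataset in which both classes have the size of
# the original minority class.
df_sub, y_sub = RandomUnderSampler(random_state=42).fit_resample(df_res, y_res)
y_sub.value_counts()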

from imblearn.pipeline import make_pipeline as make_pipeline_with_sampler
from imblearn.under_sampling import RandomUnderSampler

lr_clf = make_pipeline_with_sampler(
    preprocessor_linear,
    RandomUnderSampler(random_state=42),
    LogisticRegression(max_iter=1000),
)
index += ["Under-sampling + Logistic regression"]
cv_result = cross_validate(lr_clf, df_res, y_res, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores
                                                 Accuracy  Balanced accuracy
Dummy classifier                                 0.967755           0.500000
Logistic regression                              0.971453           0.583890
Random forest                                    0.972182           0.644381
Logistic regression with balanced class weights  0.810877           0.827702
Random forest with balanced class weights        0.965697           0.640643
Under-sampling + Logistic regression             0.803897           0.823715


rf_clf = make_pipeline_with_sampler(
    preprocessor_tree,
    RandomUnderSampler(random_state=42),
    RandomForestClassifier(random_state=42, n_jobs=2),
)
index += ["Under-sampling + Random forest"]
cv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores
                                                 Accuracy  Balanced accuracy
Dummy classifier                                 0.967755           0.500000
Logistic regression                              0.971453           0.583890
Random forest                                    0.972182           0.644381
Logistic regression with balanced class weights  0.810877           0.827702
Random forest with balanced class weights        0.965697           0.640643
Under-sampling + Logistic regression             0.803897           0.823715
Under-sampling + Random forest                   0.798245           0.809470


Applying a random under-sampler before training the linear model or the random forest allows the model not to focus only on the majority class, at the cost of making more mistakes on samples from the majority class (i.e., decreased accuracy).

We could apply any type of sampler and find which sampler works best on the current dataset.

Instead, we will present another way by using classifiers which apply the sampling internally.

Use of specific balanced algorithms from imbalanced-learn#

We already showed that random under-sampling can be effective on decision trees. However, instead of under-sampling the dataset once, one could under-sample the original dataset before taking each bootstrap sample. This is the basis of imblearn.ensemble.BalancedRandomForestClassifier and BalancedBaggingClassifier, sketched conceptually below.
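As a conceptual sketch only (this mimics the spirit of these estimators; the real classifiers integrate the resampling with the bagging machinery internally), each ensemble member can draw its own balanced resample, so that together they cover different portions of the majority class:

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.RandomState(0)
for i in range(3):  # three hypothetical ensemble members
    sampler = RandomUnderSampler(random_state=rng.randint(2**31 - 1))
    _, y_bal = sampler.fit_resample(df_res, y_res)
    # sample_indices_ shows which rows were kept for this member.
    print(f"estimator {i}: kept rows {sampler.sample_indices_[:5]}...")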

from imblearn.ensemble import BalancedRandomForestClassifier

rf_clf = make_pipeline(
    preprocessor_tree,
    BalancedRandomForestClassifier(
        sampling_strategy="all",
        replacement=True,
        bootstrap=False,
        random_state=42,
        n_jobs=2,
    ),
)
index += ["Balanced random forest"]
cv_result = cross_validate(rf_clf, df_res, y_res, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores
                                                 Accuracy  Balanced accuracy
Dummy classifier                                 0.967755           0.500000
Logistic regression                              0.971453           0.583890
Random forest                                    0.972182           0.644381
Logistic regression with balanced class weights  0.810877           0.827702
Random forest with balanced class weights        0.965697           0.640643
Under-sampling + Logistic regression             0.803897           0.823715
Under-sampling + Random forest                   0.798245           0.809470
Balanced random forest                           0.850858           0.811267


The performance of the BalancedRandomForestClassifier is better than applying a single random under-sampling. We will now use a gradient-boosting classifier within a BalancedBaggingClassifier.

from sklearn.ensemble import HistGradientBoostingClassifier

from imblearn.ensemble import BalancedBaggingClassifier

bag_clf = make_pipeline(
    preprocessor_tree,
    BalancedBaggingClassifier(
        estimator=HistGradientBoostingClassifier(random_state=42),
        n_estimators=10,
        random_state=42,
        n_jobs=2,
    ),
)

index += ["Balanced bag of histogram gradient boosting"]
cv_result = cross_validate(bag_clf, df_res, y_res, scoring=scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores
                                                 Accuracy  Balanced accuracy
Dummy classifier                                 0.967755           0.500000
Logistic regression                              0.971453           0.583890
Random forest                                    0.972182           0.644381
Logistic regression with balanced class weights  0.810877           0.827702
Random forest with balanced class weights        0.965697           0.640643
Under-sampling + Logistic regression             0.803897           0.823715
Under-sampling + Random forest                   0.798245           0.809470
Balanced random forest                           0.850858           0.811267
Balanced bag of histogram gradient boosting      0.833459           0.821810


This last approach is the most effective. The different under-samplings bring some diversity for the different GBDTs to learn from, so that each does not focus on just one portion of the majority class.

Total running time of the script: (0 minutes 40.557 seconds)

Estimated memory usage: 275 MB
