Bagging classifiers using sampler#

In this example, we show how BalancedBaggingClassifier can be used to create a large variety of classifiers by giving different samplers.

We will give several examples that have been published in the literature over the years.

# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
print(__doc__)

Generate an imbalanced dataset#

For this example, we will create a synthetic dataset using the function make_classification. The problem will be a toy classification problem with a ratio of 1:9 between the two classes.

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10_000,
    n_features=10,
    weights=[0.1, 0.9],
    class_sep=0.5,
    random_state=0,
)
import pandas as pd

pd.Series(y).value_counts(normalize=True)
1    0.8977
0    0.1023
Name: proportion, dtype: float64

In the following sections, we will show a couple of algorithms that have been proposed over the years. We intend to illustrate how one can reuse the BalancedBaggingClassifier by passing different samplers. As a baseline, we first evaluate a plain BaggingClassifier without any resampling.

from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_validate

bagging = BaggingClassifier()
cv_results = cross_validate(bagging, X, y, scoring="balanced_accuracy")

print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.712 +/- 0.012
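
As a side note, the balanced_accuracy score used throughout this example is the macro-average of the recall obtained on each class. The short check below is an added illustration (not part of the original script) that reuses the X and y generated above: a trivial majority-class predictor reaches roughly 90% plain accuracy on this 1:9 problem but only 0.5 balanced accuracy, which is why balanced accuracy is the metric reported for every model here.

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# hold out a stratified test set and always predict the majority class
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = dummy.predict(X_test)

print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"balanced accuracy: {balanced_accuracy_score(y_test, y_pred):.3f}")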

Exactly Balanced Bagging and Over-Bagging#

The BalancedBaggingClassifier can be used in conjunction with a RandomUnderSampler or a RandomOverSampler. These methods are referred to as Exactly Balanced Bagging and Over-Bagging, respectively, and were first proposed in [1].

from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.under_sampling import RandomUnderSampler

# Exactly Balanced Bagging
ebb = BalancedBaggingClassifier(sampler=RandomUnderSampler())
cv_results = cross_validate(ebb, X, y, scoring="balanced_accuracy")

print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.762 +/- 0.025
from imblearn.over_sampling import RandomOverSampler

# Over-bagging
over_bagging = BalancedBaggingClassifier(sampler=RandomOverSampler())
cv_results = cross_validate(over_bagging, X, y, scoring="balanced_accuracy")

print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.705 +/- 0.010
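
To make explicit what the two samplers used above actually do, the standalone check below is an added illustration (not part of the original script): it applies each sampler once to the full dataset and prints the resulting class counts. Inside BalancedBaggingClassifier, the same kind of resampling is applied to each bootstrap sample rather than to the full dataset.

from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# under-sampling: the majority class is reduced to the size of the minority class
_, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_under))

# over-sampling: the minority class is duplicated up to the size of the majority class
_, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_over))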

SMOTE-Bagging#

Instead of using a RandomOverSampler to bootstrap the samples, an alternative is to use SMOTE as the over-sampler. This method is known as SMOTE-Bagging [2].

from imblearn.over_sampling import SMOTE

# SMOTE-Bagging
smote_bagging = BalancedBaggingClassifier(sampler=SMOTE())
cv_results = cross_validate(smote_bagging, X, y, scoring="balanced_accuracy")

print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.740 +/- 0.010
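
For comparison with RandomOverSampler, the snippet below (again an added illustration, not part of the original script) applies SMOTE once to the full dataset: the classes end up balanced as before, but the additional minority rows are new synthetic points interpolated between nearest neighbours rather than copies of existing samples.

from collections import Counter

from imblearn.over_sampling import SMOTE

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))
print(f"original shape: {X.shape}, resampled shape: {X_res.shape}")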

Roughly Balanced Bagging#

While a RandomUnderSampler or a RandomOverSampler creates an exact number of samples, it does not follow the statistical spirit wanted in the bagging framework. The authors in [3] proposed to use a negative binomial distribution to compute the number of majority-class samples to select, followed by a random under-sampling.

Here, we illustrate this method by implementing a function in charge of the resampling, and we use a FunctionSampler to integrate it within a Pipeline and cross_validate.

from collections import Counter

import numpy as np

from imblearn import FunctionSampler


def roughly_balanced_bagging(X, y, replace=False):
    """Implementation of Roughly Balanced Bagging for binary problem."""
    # find the minority and majority classes
    class_counts = Counter(y)
    majority_class = max(class_counts, key=class_counts.get)
    minority_class = min(class_counts, key=class_counts.get)

    # compute the number of samples to draw from the majority class using
    # a negative binomial distribution
    n_minority_class = class_counts[minority_class]
    n_majority_resampled = np.random.negative_binomial(n=n_minority_class, p=0.5)

    # draw randomly with or without replacement
    majority_indices = np.random.choice(
        np.flatnonzero(y == majority_class),
        size=n_majority_resampled,
        replace=replace,
    )
    minority_indices = np.random.choice(
        np.flatnonzero(y == minority_class),
        size=n_minority_class,
        replace=replace,
    )
    indices = np.hstack([majority_indices, minority_indices])

    return X[indices], y[indices]


# Roughly Balanced Bagging
rbb = BalancedBaggingClassifier(
    sampler=FunctionSampler(func=roughly_balanced_bagging, kw_args={"replace": True})
)
cv_results = cross_validate(rbb, X, y, scoring="balanced_accuracy")

print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.758 +/- 0.016
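
To see why this respects the statistical spirit of bagging, the small simulation below is an added illustration (not part of the original script) that repeats the per-bag draw many times: with p=0.5, the expected number of majority samples equals the minority class size, so each bag is balanced on average while its exact composition still varies from bag to bag.

from collections import Counter

import numpy as np

n_minority = min(Counter(y).values())
rng = np.random.default_rng(0)

# number of majority samples drawn for each of 10_000 simulated bags
draws = rng.negative_binomial(n=n_minority, p=0.5, size=10_000)
print(f"minority class size: {n_minority}")
print(f"mean majority draw: {draws.mean():.1f} +/- {draws.std():.1f}")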

Total running time of the script: (0 minutes 26.070 seconds)

Estimated memory usage: 199 MB
