Bagging classifiers using sampler#
In this example, we show how to use BalancedBaggingClassifier to create a variety of classifiers by providing different samplers.
We will give several examples of approaches that have been published over the years.
# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
print(__doc__)
Generate an imbalanced dataset#
For this example, we will create a synthetic dataset using the function make_classification.
The problem will be a toy classification problem with a ratio of 1:9 between the two classes.
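The code that generates this dataset is not shown on this page. Below is a minimal sketch consistent with the description above and with the class proportions printed afterwards; the exact make_classification parameters (n_samples, n_features, weights, class_sep, random_state) are assumptions, not the values used to produce the reported numbers.
from sklearn.datasets import make_classification
import pandas as pd
# NOTE: parameter values are illustrative assumptions chosen to give a ~1:9
# class ratio, with class 0 as the minority class.
X, y = make_classification(
    n_samples=10_000,
    n_features=10,
    weights=[0.1, 0.9],
    class_sep=0.5,
    random_state=0,
)
# Check the resulting class proportions.
pd.Series(y).value_counts(normalize=True)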
1 0.8977
0 0.1023
Name: proportion, dtype: float64
In the following sections, we will present a few of the algorithms that have been proposed over the years. We aim to illustrate how one can reuse BalancedBaggingClassifier by passing different samplers.
As a baseline, we first evaluate a plain BaggingClassifier without any resampling.
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_validate
ebb = BaggingClassifier()
cv_results = cross_validate(ebb, X, y, scoring="balanced_accuracy")
print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.712 +/- 0.012
Exactly Balanced Bagging and Over-bagging#
The BalancedBaggingClassifier can be used in conjunction with a RandomUnderSampler or a RandomOverSampler.
These methods are referred to as Exactly Balanced Bagging and Over-bagging, respectively, and were first proposed in [1].
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.under_sampling import RandomUnderSampler
# Exactly Balanced Bagging
ebb = BalancedBaggingClassifier(sampler=RandomUnderSampler())
cv_results = cross_validate(ebb, X, y, scoring="balanced_accuracy")
print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.762 +/- 0.025
from imblearn.over_sampling import RandomOverSampler
# Over-bagging
over_bagging = BalancedBaggingClassifier(sampler=RandomOverSampler())
cv_results = cross_validate(over_bagging, X, y, scoring="balanced_accuracy")
print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.705 +/- 0.010
SMOTE-Bagging#
Instead of using a RandomOverSampler to perform a bootstrap, an alternative is to use SMOTE as the over-sampler.
This method is known as SMOTE-Bagging [2].
from imblearn.over_sampling import SMOTE
# SMOTE-Bagging
smote_bagging = BalancedBaggingClassifier(sampler=SMOTE())
cv_results = cross_validate(smote_bagging, X, y, scoring="balanced_accuracy")
print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.740 +/- 0.010
Roughly Balanced Bagging#
While using a RandomUnderSampler or a RandomOverSampler creates exactly the desired number of samples, it does not follow the statistical spirit required by the bagging framework. The authors of [3] proposed instead to draw the number of majority-class samples to select from a negative binomial distribution, and then to perform a random under-sampling.
Here, we illustrate this method by implementing a function in charge of the resampling and using FunctionSampler to integrate it within a Pipeline and cross_validate.
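As a quick illustrative aside (not part of the original example), the choice p=0.5 makes each bootstrap balanced on average: a negative binomial draw with n successes and success probability p has mean n * (1 - p) / p, which equals n when p=0.5, so the expected number of majority-class samples matches the minority-class size while the exact number varies between bootstraps.
import numpy as np
# Sanity check of the negative binomial parameterization used below: the mean
# of the draws should be close to n (here 1_000) when p=0.5.
rng = np.random.default_rng(0)
draws = rng.negative_binomial(n=1_000, p=0.5, size=10_000)
print(draws.mean())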
from collections import Counter
import numpy as np
from imblearn import FunctionSampler
def roughly_balanced_bagging(X, y, replace=False):
    """Implementation of Roughly Balanced Bagging for binary problem."""
    # find the minority and majority classes
    class_counts = Counter(y)
    majority_class = max(class_counts, key=class_counts.get)
    minority_class = min(class_counts, key=class_counts.get)

    # compute the number of samples to draw from the majority class using
    # a negative binomial distribution
    n_minority_class = class_counts[minority_class]
    n_majority_resampled = np.random.negative_binomial(n=n_minority_class, p=0.5)

    # draw randomly with or without replacement
    majority_indices = np.random.choice(
        np.flatnonzero(y == majority_class),
        size=n_majority_resampled,
        replace=replace,
    )
    minority_indices = np.random.choice(
        np.flatnonzero(y == minority_class),
        size=n_minority_class,
        replace=replace,
    )
    indices = np.hstack([majority_indices, minority_indices])

    return X[indices], y[indices]

# Roughly Balanced Bagging
rbb = BalancedBaggingClassifier(
    sampler=FunctionSampler(func=roughly_balanced_bagging, kw_args={"replace": True})
)
cv_results = cross_validate(rbb, X, y, scoring="balanced_accuracy")
print(f"{cv_results['test_score'].mean():.3f} +/- {cv_results['test_score'].std():.3f}")
0.758 +/- 0.016
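As a final illustrative sketch (not part of the original example), the resulting ensemble behaves like any other scikit-learn classifier and can be fitted and used for prediction directly, for instance on a held-out split:
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
# Fit the roughly balanced bagging ensemble on a training split and evaluate
# its balanced accuracy on the held-out test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
rbb.fit(X_train, y_train)
print(balanced_accuracy_score(y_test, rbb.predict(X_test)))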
Total running time of the script: (0 minutes 26.070 seconds)
Estimated memory usage: 199 MB