5. Ensemble of samplers

5.1. Classifiers including inner balancing samplers

5.1.1. Bagging classifier

In ensemble classifiers, bagging methods build several estimators on different randomly selected subsets of data. In scikit-learn, this classifier is named BaggingClassifier. However, this classifier does not allow balancing each subset of data. Therefore, when trained on an imbalanced data set, this classifier will favour the majority classes:

>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
...                            n_redundant=0, n_repeated=0, n_classes=3,
...                            n_clusters_per_class=1,
...                            weights=[0.01, 0.05, 0.94], class_sep=0.8,
...                            random_state=0)
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import balanced_accuracy_score
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> bc = BaggingClassifier(DecisionTreeClassifier(), random_state=0)
>>> bc.fit(X_train, y_train)
BaggingClassifier(...)
>>> y_pred = bc.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.77...

In BalancedBaggingClassifier, each bootstrap sample will be further resampled to achieve the sampling_strategy desired. Therefore, BalancedBaggingClassifier takes the same parameters as the scikit-learn BaggingClassifier. In addition, the sampling is controlled by the parameter sampler, or by the two parameters sampling_strategy and replacement if one wants to use the RandomUnderSampler:

>>> from imblearn.ensemble import BalancedBaggingClassifier
>>> bbc = BalancedBaggingClassifier(DecisionTreeClassifier(),
...                                 sampling_strategy='auto',
...                                 replacement=False,
...                                 random_state=0)
>>> bbc.fit(X_train, y_train)
BalancedBaggingClassifier(...)
>>> y_pred = bbc.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.8...

Changing the sampler will give rise to different known implementations [MO97], [HKT09], [WY09]. You can refer to the following example showing these different methods in practice: Bagging classifiers using sampler

5.1.2. Forest of randomized trees

BalancedRandomForestClassifier is another ensemble method in which each tree of the forest will be provided a balanced bootstrap sample [CLB+04]. This class provides all the functionality of RandomForestClassifier:

>>> from imblearn.ensemble import BalancedRandomForestClassifier
>>> brf = BalancedRandomForestClassifier(
...     n_estimators=100, random_state=0, sampling_strategy="all", replacement=True,
...     bootstrap=False,
... )
>>> brf.fit(X_train, y_train)
BalancedRandomForestClassifier(...)
>>> y_pred = brf.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.8...

5.1.3. Boosting

Several methods taking advantage of boosting have been designed.

RUSBoostClassifier randomly under-samples the dataset before each boosting iteration [SKVHN09]:

>>> from imblearn.ensemble import RUSBoostClassifier
>>> rusboost = RUSBoostClassifier(n_estimators=200, algorithm='SAMME.R',
...                               random_state=0)
>>> rusboost.fit(X_train, y_train)
RUSBoostClassifier(...)
>>> y_pred = rusboost.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0...

A specific method which uses AdaBoostClassifier as the learner inside a bagging classifier is called "EasyEnsemble". The EasyEnsembleClassifier allows bagging AdaBoost learners which are trained on balanced bootstrap samples [LWZ08]. Similarly to the BalancedBaggingClassifier API, the ensemble can be constructed as follows:

>>> from imblearn.ensemble import EasyEnsembleClassifier
>>> eec = EasyEnsembleClassifier(random_state=0)
>>> eec.fit(X_train, y_train)
EasyEnsembleClassifier(...)
>>> y_pred = eec.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.6...