Metrics specific to imbalanced learning

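Specific metrics have been developed to evaluate classifiers trained on imbalanced data. imblearn mainly provides two additional metrics that are not implemented in sklearn: (i) the geometric mean and (ii) the index balanced accuracy.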

# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
print(__doc__)

RANDOM_STATE = 42

First, we will generate an imbalanced dataset.

from sklearn.datasets import make_classification

X, y = make_classification(
    n_classes=3,
    class_sep=2,
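    # Only two weights are given for three classes. Per the scikit-learn
    # documentation the last class weight is then inferred automatically,
    # here as 1 - (0.1 + 0.9) = 0, so the third class should receive no samples.
    weights=[0.1, 0.9],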
    n_informative=10,
    n_redundant=1,
    flip_y=0,
    n_features=20,
    n_clusters_per_class=4,
    n_samples=5000,
    random_state=RANDOM_STATE,
)

We split the data into a training set and a test set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=RANDOM_STATE
)

We will create a pipeline made of a StandardScaler, a SMOTE over-sampler and a LogisticRegression classifier.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

model = make_pipeline(
    StandardScaler(),
    SMOTE(random_state=RANDOM_STATE),
    LogisticRegression(max_iter=10_000, random_state=RANDOM_STATE),
)

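Now, we will train the model on the training set and get the predictions for the test set. Note that resampling only happens when calling fit: the number of samples in y_pred is the same as in y_test.

model.fit(X_train, y_train)
y_pred = model.predict(X_test)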

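The geometric mean is the square root of the product of the sensitivity and the specificity. Because it combines the two, it takes the class balance of the dataset into account: a classifier that ignores the minority class gets a low score even if its accuracy is high.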

from imblearn.metrics import geometric_mean_score

print(f"The geometric mean is {geometric_mean_score(y_test, y_pred):.3f}")
The geometric mean is 0.940
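
The definition above can be checked by hand on a small binary example. The sketch below is not part of the original example: the helper `sensitivity_specificity` and the toy labels are hypothetical, and it uses only the standard library rather than imblearn. It shows a case where accuracy looks good while the geometric mean exposes the poor recall on the minority class.

```python
import math


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for a binary problem (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)  # recall on the positive (minority) class
    specificity = tn / (tn + fp)  # recall on the negative (majority) class
    return sensitivity, specificity


# Hypothetical toy labels: 8 negatives and 2 positives, one positive missed.
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

sen, spe = sensitivity_specificity(y_true_toy, y_pred_toy)
g_mean = math.sqrt(sen * spe)
accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)

print(f"sensitivity={sen:.3f}, specificity={spe:.3f}")
print(f"accuracy={accuracy:.3f}, geometric mean={g_mean:.3f}")
```

Here accuracy is 0.9 while the geometric mean is about 0.707, because half of the minority class is misclassified.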

The index balanced accuracy (IBA) can turn any metric into one suited to imbalanced learning problems.

from imblearn.metrics import make_index_balanced_accuracy

alpha = 0.1
geo_mean = make_index_balanced_accuracy(alpha=alpha, squared=True)(geometric_mean_score)

print(
    f"The IBA using alpha={alpha} and the geometric mean: "
    f"{geo_mean(y_test, y_pred):.3f}"
)
The IBA using alpha=0.1 and the geometric mean: 0.884

alpha = 0.5
geo_mean = make_index_balanced_accuracy(alpha=alpha, squared=True)(geometric_mean_score)

print(
    f"The IBA using alpha={alpha} and the geometric mean: "
    f"{geo_mean(y_test, y_pred):.3f}"
)
The IBA using alpha=0.5 and the geometric mean: 0.884
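
To make the role of alpha concrete, here is a minimal sketch of the weighting, assuming the usual definition IBA = (1 + alpha * (sensitivity - specificity)) * M, with M optionally squared first. The function `index_balanced_accuracy` and the input values are hypothetical and only illustrate the formula; this is not imblearn's implementation.

```python
import math


def index_balanced_accuracy(score, sensitivity, specificity, alpha=0.1, squared=True):
    """Weight `score` by the dominance (hypothetical helper, assumed formula)."""
    if squared:
        score = score ** 2
    dominance = sensitivity - specificity  # in [-1, 1]; 0 when both rates match
    return (1.0 + alpha * dominance) * score


# Hypothetical rates: the minority class is recognised half of the time.
sen_toy, spe_toy = 0.5, 1.0
g_mean_toy = math.sqrt(sen_toy * spe_toy)

iba_values = {
    a: index_balanced_accuracy(g_mean_toy, sen_toy, spe_toy, alpha=a)
    for a in (0.1, 0.5)
}
for a, value in iba_values.items():
    print(f"alpha={a}: IBA={value:.3f}")

# With equal sensitivity and specificity the dominance is zero and alpha is irrelevant.
balanced = [index_balanced_accuracy(0.9, 0.9, 0.9, alpha=a) for a in (0.1, 0.5)]
print(balanced[0] == balanced[1])
```

Under this formula, a larger alpha penalises the score more when specificity exceeds sensitivity. When the dominance term is zero, alpha has no effect and the result is just the squared score.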
