6. Miscellaneous samplers

6.1. Custom samplers

A fully customized sampler, FunctionSampler, is available in imbalanced-learn so that you can quickly prototype your own sampler by defining a single function. Additional parameters can be passed to that function through the kw_args attribute, which accepts a dictionary. The following example illustrates how to retain the first 10 elements of the arrays X and y:

>>> import numpy as np
>>> from imblearn import FunctionSampler
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
...                            n_redundant=0, n_repeated=0, n_classes=3,
...                            n_clusters_per_class=1,
...                            weights=[0.01, 0.05, 0.94],
...                            class_sep=0.8, random_state=0)
>>> def func(X, y):
...     return X[:10], y[:10]
>>> sampler = FunctionSampler(func=func)
>>> X_res, y_res = sampler.fit_resample(X, y)
>>> np.all(X_res == X[:10])
True
>>> np.all(y_res == y[:10])
True
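
The kw_args dictionary mentioned above is not exercised by this example. A minimal sketch of forwarding an extra argument through it (the function keep_first_n and its parameter n are illustrative names, not part of the library):

>>> def keep_first_n(X, y, n=5):
...     return X[:n], y[:n]
>>> sampler = FunctionSampler(func=keep_first_n, kw_args={"n": 20})
>>> X_res, y_res = sampler.fit_resample(X, y)
>>> X_res.shape
(20, 2)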

In addition, the parameter validate controls input checking. For instance, setting validate=False allows passing any type of target y, e.g. to perform some sampling on a regression target:

>>> from sklearn.datasets import make_regression
>>> X_reg, y_reg = make_regression(n_samples=100, random_state=42)
>>> rng = np.random.RandomState(42)
>>> def dummy_sampler(X, y):
...     indices = rng.choice(np.arange(X.shape[0]), size=10)
...     return X[indices], y[indices]
>>> sampler = FunctionSampler(func=dummy_sampler, validate=False)
>>> X_res, y_res = sampler.fit_resample(X_reg, y_reg)
>>> y_res
array([  41.49112498, -142.78526195,   85.55095317,  141.43321419,
         75.46571114,  -67.49177372,  159.72700509, -169.80498923,
        211.95889757,  211.95889757])

We illustrate how to use such a sampler to implement an outlier rejection estimator, which can easily be used within a Pipeline: Customized sampler to implement an outlier rejections estimator.
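
As a taste of that example, a condensed sketch could pair FunctionSampler with scikit-learn's IsolationForest, which labels inliers with 1, and plug the result into an imblearn pipeline (the default IsolationForest settings here are illustrative):

>>> from sklearn.ensemble import IsolationForest
>>> from sklearn.linear_model import LogisticRegression
>>> from imblearn.pipeline import make_pipeline
>>> def outlier_rejection(X, y):
...     # keep only the samples flagged as inliers (+1) by IsolationForest
...     is_inlier = IsolationForest(random_state=0).fit_predict(X)
...     return X[is_inlier == 1], y[is_inlier == 1]
>>> pipe = make_pipeline(
...     FunctionSampler(func=outlier_rejection),
...     LogisticRegression(),
... )
>>> model = pipe.fit(X, y)  # samples are rejected during fit only, not predict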

6.2. Custom generators

Imbalanced-learn provides specific generators for TensorFlow and Keras that yield balanced mini-batches.

6.2.1. TensorFlow generator

balanced_batch_generator allows generating balanced mini-batches using an imbalanced-learn sampler that returns indices (that is, a sampler exposing a sample_indices_ attribute, such as RandomUnderSampler).

First, let's generate some data:

>>> n_features, n_classes = 10, 2
>>> X, y = make_classification(
...     n_samples=10_000, n_features=n_features, n_informative=2,
...     n_redundant=0, n_repeated=0, n_classes=n_classes,
...     n_clusters_per_class=1, weights=[0.1, 0.9],
...     class_sep=0.8, random_state=0
... )
>>> X = X.astype(np.float32)

Then, we can create the generator, which will yield balanced mini-batches:

>>> from imblearn.under_sampling import RandomUnderSampler
>>> from imblearn.tensorflow import balanced_batch_generator
>>> training_generator, steps_per_epoch = balanced_batch_generator(
...     X,
...     y,
...     sample_weight=None,
...     sampler=RandomUnderSampler(),
...     batch_size=32,
...     random_state=42,
... )
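
As a quick sanity check, each item yielded by the generator is a tuple of arrays of length batch_size; peeking at one is harmless here because the generator cycles indefinitely:

>>> X_batch, y_batch = next(training_generator)
>>> X_batch.shape, y_batch.shape
((32, 10), (32,))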

The generator and steps_per_epoch are used during the training of a TensorFlow model. We will illustrate how to use this generator. First, we define a logistic regression model to be optimized by gradient descent:

>>> import tensorflow as tf
>>> # initialize the weights and intercept
>>> normal_initializer = tf.random_normal_initializer(mean=0, stddev=0.01)
>>> coef = tf.Variable(normal_initializer(
...     shape=[n_features, n_classes]), dtype="float32"
... )
>>> intercept = tf.Variable(
...     normal_initializer(shape=[n_classes]), dtype="float32"
... )
>>> # define the model
>>> def logistic_regression(X):
...     return tf.nn.softmax(tf.matmul(X, coef) + intercept)
>>> # define the loss function
>>> def cross_entropy(y_true, y_pred):
...     y_true = tf.one_hot(y_true, depth=n_classes)
...     y_pred = tf.clip_by_value(y_pred, 1e-9, 1.)
...     return tf.reduce_mean(-tf.reduce_sum(y_true * tf.math.log(y_pred)))
>>> # define our metric
>>> def balanced_accuracy(y_true, y_pred):
...     cm = tf.math.confusion_matrix(tf.cast(y_true, tf.int64), tf.argmax(y_pred, 1))
...     per_class = np.diag(cm) / tf.math.reduce_sum(cm, axis=1)
...     return np.mean(per_class)
>>> # define the optimizer
>>> optimizer = tf.optimizers.SGD(learning_rate=0.01)
>>> # define the optimization step
>>> def run_optimization(X, y):
...     with tf.GradientTape() as g:
...         y_pred = logistic_regression(X)
...         loss = cross_entropy(y, y_pred)
...     gradients = g.gradient(loss, [coef, intercept])
...     optimizer.apply_gradients(zip(gradients, [coef, intercept]))

Once initialized, the model is trained by iterating over balanced mini-batches of data and minimizing the loss defined previously:

>>> epochs = 10
>>> for e in range(epochs):
...     y_pred = logistic_regression(X)
...     loss = cross_entropy(y, y_pred)
...     bal_acc = balanced_accuracy(y, y_pred)
...     print(f"epoch: {e}, loss: {loss:.3f}, balanced accuracy: {bal_acc:.3f}")
...     for i in range(steps_per_epoch):
...         X_batch, y_batch = next(training_generator)
...         run_optimization(X_batch, y_batch)
epoch: 0, ...

6.2.2. Keras generator

Keras provides a higher-level API in which a model can be defined and trained by calling the fit method. To illustrate, we will define a logistic regression model:

>>> from tensorflow import keras
>>> y = keras.utils.to_categorical(y, n_classes)
>>> model = keras.Sequential()
>>> model.add(
...     keras.layers.Dense(
...         y.shape[1], input_dim=X.shape[1], activation='softmax'
...     )
... )
>>> model.compile(
...     optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy']
... )

balanced_batch_generator creates a generator of balanced mini-batches, along with the number of mini-batches that will be generated per epoch:

>>> from imblearn.keras import balanced_batch_generator
>>> training_generator, steps_per_epoch = balanced_batch_generator(
...     X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42
... )

Then, fit can be called, passing the generator and the number of steps per epoch:

>>> callback_history = model.fit(
...     training_generator,
...     steps_per_epoch=steps_per_epoch,
...     epochs=10,
...     verbose=1,
... )
Epoch 1/10 ...

A second possibility is to use BalancedBatchGenerator. In that case, only an instance of this class needs to be passed to fit:

>>> from imblearn.keras import BalancedBatchGenerator
>>> training_generator = BalancedBatchGenerator(
...     X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42
... )
>>> callback_history = model.fit(
...     training_generator,
...     epochs=10,
...     verbose=1,
... )
Epoch 1/10 ...
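
In both cases, the trained model can then be assessed on the full imbalanced dataset, for instance with scikit-learn's balanced_accuracy_score; a short sketch (recall that y was one-hot encoded above, hence the argmax):

>>> from sklearn.metrics import balanced_accuracy_score
>>> y_prob = model.predict(X, batch_size=32, verbose=0)
>>> bal_acc = balanced_accuracy_score(y.argmax(axis=1), y_prob.argmax(axis=1))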