6. Miscellaneous samplers#
6.1. Custom samplers#
A fully customized sampler, FunctionSampler, is available in imbalanced-learn so that you can quickly prototype your own sampler by defining a single function. Additional parameters can be passed through the kw_args attribute, which accepts a dictionary. The following example illustrates how to retain the first 10 elements of the arrays X and y:
>>> import numpy as np
>>> from imblearn import FunctionSampler
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
... n_redundant=0, n_repeated=0, n_classes=3,
... n_clusters_per_class=1,
... weights=[0.01, 0.05, 0.94],
... class_sep=0.8, random_state=0)
>>> def func(X, y):
... return X[:10], y[:10]
>>> sampler = FunctionSampler(func=func)
>>> X_res, y_res = sampler.fit_resample(X, y)
>>> np.all(X_res == X[:10])
True
>>> np.all(y_res == y[:10])
True
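Since the function above takes no extra arguments, kw_args is not needed; when it is, the dictionary is forwarded to the function at resampling time. A minimal sketch (the keep_first helper and its n parameter are our own, for illustration):
>>> def keep_first(X, y, n=5):
...     # keep only the first n samples
...     return X[:n], y[:n]
>>> sampler = FunctionSampler(func=keep_first, kw_args={"n": 10})
>>> X_res, y_res = sampler.fit_resample(X, y)
>>> X_res.shape[0]
10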
In addition, the parameter validate controls input checking. For instance, setting validate=False allows passing any type of target y, for example to perform some sampling on a regression target:
>>> from sklearn.datasets import make_regression
>>> X_reg, y_reg = make_regression(n_samples=100, random_state=42)
>>> rng = np.random.RandomState(42)
>>> def dummy_sampler(X, y):
... indices = rng.choice(np.arange(X.shape[0]), size=10)
... return X[indices], y[indices]
>>> sampler = FunctionSampler(func=dummy_sampler, validate=False)
>>> X_res, y_res = sampler.fit_resample(X_reg, y_reg)
>>> y_res
array([ 41.49112498, -142.78526195, 85.55095317, 141.43321419,
75.46571114, -67.49177372, 159.72700509, -169.80498923,
211.95889757, 211.95889757])
We illustrate how to use such a sampler to implement an outlier rejection estimator which can be used easily within a Pipeline:
Customized sampler to implement an outlier rejection estimator
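As a minimal sketch of the idea (using IsolationForest as the outlier detector here; the linked example may differ in its details):
>>> from imblearn.pipeline import Pipeline
>>> from sklearn.ensemble import IsolationForest
>>> from sklearn.linear_model import LogisticRegression
>>> def outlier_rejection(X, y):
...     # fit the detector and keep only samples flagged as inliers (+1)
...     detector = IsolationForest(random_state=0)
...     is_inlier = detector.fit_predict(X)
...     return X[is_inlier == 1], y[is_inlier == 1]
>>> pipe = Pipeline([
...     ("outlier_rejection", FunctionSampler(func=outlier_rejection)),
...     ("classifier", LogisticRegression()),
... ])
Within an imbalanced-learn Pipeline, samplers are only applied during fit, so outliers are removed from the training data while predictions are still made on all samples.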
6.2. Custom generators#
Imbalanced-learn provides specific generators for TensorFlow and Keras which generate balanced mini-batches.
6.2.1. TensorFlow generator#
balanced_batch_generator allows generating balanced mini-batches using an imbalanced-learn sampler which returns indices.
First, we generate some data:
>>> n_features, n_classes = 10, 2
>>> X, y = make_classification(
... n_samples=10_000, n_features=n_features, n_informative=2,
... n_redundant=0, n_repeated=0, n_classes=n_classes,
... n_clusters_per_class=1, weights=[0.1, 0.9],
... class_sep=0.8, random_state=0
... )
>>> X = X.astype(np.float32)
Then, we can create the generator that will yield balanced mini-batches:
>>> from imblearn.under_sampling import RandomUnderSampler
>>> from imblearn.tensorflow import balanced_batch_generator
>>> training_generator, steps_per_epoch = balanced_batch_generator(
... X,
... y,
... sample_weight=None,
... sampler=RandomUnderSampler(),
... batch_size=32,
... random_state=42,
... )
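As a quick sanity check, a single mini-batch can be drawn from the (infinite) generator; the first batch has the requested size (a small sketch):
>>> X_batch, y_batch = next(training_generator)
>>> X_batch.shape
(32, 10)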
The generator and steps_per_epoch are used during the training of the TensorFlow model. We will illustrate how to use such a generator. First, we can define a logistic regression model which will be optimized by gradient descent:
>>> import tensorflow as tf
>>> # initialize the weights and intercept
>>> normal_initializer = tf.random_normal_initializer(mean=0, stddev=0.01)
>>> coef = tf.Variable(normal_initializer(
... shape=[n_features, n_classes]), dtype="float32"
... )
>>> intercept = tf.Variable(
... normal_initializer(shape=[n_classes]), dtype="float32"
... )
>>> # define the model
>>> def logistic_regression(X):
... return tf.nn.softmax(tf.matmul(X, coef) + intercept)
>>> # define the loss function
>>> def cross_entropy(y_true, y_pred):
... y_true = tf.one_hot(y_true, depth=n_classes)
... y_pred = tf.clip_by_value(y_pred, 1e-9, 1.)
... return tf.reduce_mean(-tf.reduce_sum(y_true * tf.math.log(y_pred)))
>>> # define our metric
>>> def balanced_accuracy(y_true, y_pred):
... cm = tf.math.confusion_matrix(tf.cast(y_true, tf.int64), tf.argmax(y_pred, 1))
... per_class = np.diag(cm) / tf.math.reduce_sum(cm, axis=1)
... return np.mean(per_class)
>>> # define the optimizer
>>> optimizer = tf.optimizers.SGD(learning_rate=0.01)
>>> # define the optimization step
>>> def run_optimization(X, y):
... with tf.GradientTape() as g:
... y_pred = logistic_regression(X)
... loss = cross_entropy(y, y_pred)
... gradients = g.gradient(loss, [coef, intercept])
... optimizer.apply_gradients(zip(gradients, [coef, intercept]))
Once initialized, the model is trained by iterating on balanced mini-batches of data and minimizing the loss previously defined:
>>> epochs = 10
>>> for e in range(epochs):
... y_pred = logistic_regression(X)
... loss = cross_entropy(y, y_pred)
... bal_acc = balanced_accuracy(y, y_pred)
... print(f"epoch: {e}, loss: {loss:.3f}, accuracy: {bal_acc}")
... for i in range(steps_per_epoch):
... X_batch, y_batch = next(training_generator)
... run_optimization(X_batch, y_batch)
epoch: 0, ...
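After training, the helpers defined above can be reused to check the final performance (the exact value depends on the run):
>>> y_pred = logistic_regression(X)
>>> print(f"final balanced accuracy: {balanced_accuracy(y, y_pred):.3f}")
final balanced accuracy: ...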
6.2.2. Keras generator#
Keras provides a higher-level API in which a model can be defined and trained by calling the fit method. To illustrate, we will define a logistic regression model:
>>> from tensorflow import keras
>>> y = keras.utils.to_categorical(y, n_classes)
>>> model = keras.Sequential()
>>> model.add(
... keras.layers.Dense(
... y.shape[1], input_dim=X.shape[1], activation='softmax'
... )
... )
>>> model.compile(
... optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy']
... )
balanced_batch_generator creates a balanced mini-batch generator along with the corresponding number of mini-batches:
>>> from imblearn.keras import balanced_batch_generator
>>> training_generator, steps_per_epoch = balanced_batch_generator(
... X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42
... )
Then, fit can be called, passing the generator and the number of steps:
>>> callback_history = model.fit(
... training_generator,
... steps_per_epoch=steps_per_epoch,
... epochs=10,
... verbose=1,
... )
Epoch 1/10 ...
A second possibility is to use BalancedBatchGenerator. In this case, only an instance of this class needs to be passed to fit:
>>> from imblearn.keras import BalancedBatchGenerator
>>> training_generator = BalancedBatchGenerator(
... X, y, sampler=RandomUnderSampler(), batch_size=10, random_state=42
... )
>>> callback_history = model.fit(
...     training_generator,
...     epochs=10,
...     verbose=1,
... )
Epoch 1/10 ...
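Note that BalancedBatchGenerator is a keras.utils.Sequence, so fit can infer the number of steps per epoch on its own; if needed, it is also available explicitly (a small sketch):
>>> steps = len(training_generator)  # number of balanced mini-batches per epoch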