RandomUnderSampler

class imblearn.under_sampling.RandomUnderSampler(*, sampling_strategy='auto', random_state=None, replacement=False)

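Class to perform random under-sampling.

Under-samples the majority class(es) by randomly picking samples, with or without replacement.

Read more in the User Guide.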

Parameters:

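sampling_strategy : float, str, dict, callable, default='auto'

Sampling information used to resample the data set.

When float, it corresponds to the desired ratio of the number of samples in the minority class to the number of samples in the majority class after resampling. The ratio is therefore expressed as \(\alpha_{us} = N_{m} / N_{rM}\), where \(N_{m}\) is the number of samples in the minority class and \(N_{rM}\) is the number of samples in the majority class after resampling.

Warning

float is only available for binary classification. An error is raised for multi-class classification.

When str, it specifies the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:

'majority': resample only the majority class;

'not minority': resample all classes but the minority class;

'not majority': resample all classes but the majority class;

'all': resample all classes;

'auto': equivalent to 'not minority'.

When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.

When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.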

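random_state : int, RandomState instance, default=None

Controls the randomization of the algorithm.

If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used by np.random.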

replacement : bool, default=False

Whether samples are drawn with or without replacement.

Attributes:

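sampling_strategy_ : dict

Dictionary containing the information used to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.

sample_indices_ : ndarray of shape (n_new_samples,)

Indices of the selected samples.

Added in version 0.4.

n_features_in_ : int

Number of features in the input dataset.

Added in version 0.9.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X has feature names that are all strings.

Added in version 0.10.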

See also

NearMiss: Under-sample using near-miss samples.

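Notes

Supports multi-class resampling by sampling each class independently. Supports heterogeneous data as object arrays containing strings and numerical data.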

Examples

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn.under_sampling import RandomUnderSampler
>>> X, y = make_classification(n_classes=2, class_sep=2,
...  weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({1: 900, 0: 100})
>>> rus = RandomUnderSampler(random_state=42)
>>> X_res, y_res = rus.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({0: 100, 1: 100})

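Methods

`fit`(X, y, **params)	Check inputs and statistics of the sampler.
`fit_resample`(X, y, **params)	Resample the dataset.
`get_feature_names_out`([input_features])	Get output feature names for transformation.
`get_metadata_routing`()	Get metadata routing of this object.
`get_params`([deep])	Get parameters for this estimator.
`set_params`(**params)	Set the parameters of this estimator.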

fit(X, y, **params)

Check inputs and statistics of the sampler.

You should use fit_resample in all cases.

Parameters:

X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

Data array.

y : array-like of shape (n_samples,)

Target array.

Returns:

self : object

Return the instance itself.

fit_resample(X, y, **params)

Resample the dataset.

Parameters:

X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

Matrix containing the data to be sampled.

y : array-like of shape (n_samples,)

Corresponding label for each sample in X.

Returns:

X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : array-like of shape (n_samples_new,)

The corresponding labels of X_resampled.

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters:

input_features : array-like of str or None, default=None

Input features.

If input_features is None, then feature_names_in_ is used as the input feature names. If feature_names_in_ is not defined, then the following input feature names are generated: ["x0", "x1", ..., "x(n_features_in_ - 1)"].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns: