CondensedNearestNeighbour

class imblearn.under_sampling.CondensedNearestNeighbour(*, sampling_strategy='auto', random_state=None, n_neighbors=None, n_seeds_S=1, n_jobs=None)

Undersample based on the condensed nearest neighbour method.

Read more in the User Guide.

Parameters:

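sampling_strategy : str, list or callable

Sampling information to sample the data set.

When str, specify the class targeted by the resampling. Note that the number of samples will not be equal in each class. Possible choices are:

'majority': resample only the majority class;

'not minority': resample all classes but the minority class;

'not majority': resample all classes but the majority class;

'all': resample all classes;

'auto': equivalent to 'not minority'.

When list, the list contains the classes targeted by the resampling.

When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.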

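random_state : int, RandomState instance, default=None

Control the randomization of the algorithm.

If int, random_state is the seed used by the random number generator;
If RandomState instance, random_state is the random number generator;
If None, the random number generator is the RandomState instance used by np.random.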

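n_neighbors : int or estimator object, default=None

If int, size of the neighbourhood to consider to compute the nearest neighbours. If object, an estimator that inherits from KNeighborsMixin that will be used to find the nearest neighbours. If None, a KNeighborsClassifier with a 1-NN rule will be used.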

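n_seeds_S : int, default=1

Number of samples to extract in order to build the set S.

n_jobs : int, default=None

Number of CPU cores used during the cross-validation loop. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.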

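Attributes:

sampling_strategy_ : dict

Dictionary containing the information to sample the dataset. The keys correspond to the class labels from which to sample and the values are the number of samples to sample.

estimator_ : estimator object

Last fitted k-NN estimator.

estimators_ : list of estimator objects of shape (n_resampled_classes - 1,)

Contains the k-nearest neighbour estimator used for each class.

Added in version 0.12.

sample_indices_ : ndarray of shape (n_new_samples,)

Indices of the samples selected.

Added in version 0.4.

n_features_in_ : int

Number of features in the input dataset.

Added in version 0.9.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Defined only when X has feature names that are all strings.

Added in version 0.10.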

See also

EditedNearestNeighbours: Undersample by editing samples.
RepeatedEditedNearestNeighbours: Undersample by repeating the ENN algorithm.
AllKNN: Undersample using ENN with various numbers of neighbours.

Notes

The method is based on [1].

Supports multi-class resampling: the minority class is paired against each of the other classes in turn.
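
To give an intuition for the condensed nearest neighbour idea, here is a simplified pure-Python sketch. It is not imbalanced-learn's implementation: the helper names (squared_distance, predict_1nn, condense), the toy data, and the choice to repeat passes until nothing changes are all assumptions made for illustration. The store starts with every minority sample plus one randomly chosen sample from the other class, and a sample is added only when the current store misclassifies it under the 1-NN rule.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def predict_1nn(store, X, y, point):
    """Label of the stored sample closest to `point`."""
    nearest = min(store, key=lambda i: squared_distance(X[i], point))
    return y[nearest]


def condense(X, y, minority_label, seed=0):
    """Return sorted indices of a condensed subset (hypothetical helper)."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    others = [i for i, label in enumerate(y) if label != minority_label]
    # Start with all minority samples and a single random seed from the rest.
    store = minority + [rng.choice(others)]
    added = True
    while added:  # repeat until a full pass adds nothing
        added = False
        for i in others:
            # Keep a sample only if the current store misclassifies it.
            if i not in store and predict_1nn(store, X, y, X[i]) != y[i]:
                store.append(i)
                added = True
    return sorted(store)


# Toy one-dimensional data, made up for this sketch.
X_toy = [[0.0], [0.4], [1.0], [2.0], [3.0], [4.0], [5.0]]
y_toy = ["minority", "minority", "majority", "majority", "majority", "majority", "majority"]
kept = condense(X_toy, y_toy, minority_label="minority")
print(kept)
```

On this toy data the retained subset still classifies every original sample correctly with 1-NN, while most of the majority samples far from the class boundary are dropped.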

References

[1] P. Hart, "The condensed nearest neighbor rule," IEEE Transactions on Information Theory, vol. 14(3), pp. 515-516, 1968.

Examples

>>> from collections import Counter
>>> from sklearn.datasets import fetch_openml
>>> from sklearn.preprocessing import scale
>>> from imblearn.under_sampling import CondensedNearestNeighbour
>>> X, y = fetch_openml('diabetes', version=1, return_X_y=True)
>>> X = scale(X)
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({'tested_negative': 500, 'tested_positive': 268})
>>> cnn = CondensedNearestNeighbour(random_state=42)
>>> X_res, y_res = cnn.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({'tested_positive': 268, 'tested_negative': 181})

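Methods

`fit`(X, y, **params)	Check inputs and statistics of the sampler.
`fit_resample`(X, y, **params)	Resample the dataset.
`get_feature_names_out`([input_features])	Get output feature names for transformation.
`get_metadata_routing`()	Get metadata routing of this object.
`get_params`([deep])	Get parameters for this estimator.
`set_params`(**params)	Set the parameters of this estimator.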

property estimator_: Last fitted k-NN estimator.

fit(X, y, **params)

Check inputs and statistics of the sampler.

You should use fit_resample in all cases.

Parameters:

X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

Data array.

y : array-like of shape (n_samples,)

Target array.

Returns:

self : object

Return the instance itself.

fit_resample(X, y, **params)

Resample the dataset.

Parameters:

X : {array-like, dataframe, sparse matrix} of shape (n_samples, n_features)

Matrix containing the data which have to be sampled.

y : array-like of shape (n_samples,)

Corresponding label for each sample in X.

Returns:

X_resampled : {array-like, dataframe, sparse matrix} of shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : array-like of shape (n_samples_new,)

The corresponding label of X_resampled.

get_feature_names_out(input_features=None)

Get output feature names for transformation.

Parameters:

input_features : array-like of str or None, default=None

Input features.

If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then the following input feature names are generated: ["x0", "x1", ..., "x(n_features_in_ - 1)"].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns:

feature_names_out : ndarray of str objects

Same as input features.

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing : MetadataRequest

A MetadataRequest encapsulating routing information.

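get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict

Parameter names mapped to their values.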

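set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.

Parameters:

**params : dict

Estimator parameters.

Returns:

self : estimator instance

Estimator instance.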

Examples using `imblearn.under_sampling.CondensedNearestNeighbour`

Compare under-sampling samplers
