ChiSqSelector

class pyspark.ml.feature.ChiSqSelector(*, numTopFeatures: int = 50, featuresCol: str = 'features', outputCol: Optional[str] = None, labelCol: str = 'label', selectorType: str = 'numTopFeatures', percentile: float = 0.1, fpr: float = 0.05, fdr: float = 0.05, fwe: float = 0.05)

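Chi-squared feature selection, which selects categorical features to use for predicting a categorical label. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.

numTopFeatures chooses a fixed number of top features according to a chi-squared test.

percentile is similar but chooses a fraction of all features instead of a fixed number.

fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.

fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.

fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.

By default, the selection method is numTopFeatures, with the default number of top features set to 50.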
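The five selection rules can be compared side by side on a plain list of p-values. The following is a minimal sketch in pure Python, not Spark's implementation: `select_features` and the sample p-values are hypothetical, and details such as strict versus non-strict comparison and how the percentile count is rounded are assumptions.

```python
def select_features(p_values, selector_type="numTopFeatures",
                    num_top_features=50, percentile=0.1,
                    fpr=0.05, fdr=0.05, fwe=0.05):
    """Return the sorted indices of the features kept by one selection rule.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    n = len(p_values)
    # Rank feature indices by ascending p-value (ties keep index order).
    ranked = sorted(range(n), key=lambda i: p_values[i])

    if selector_type == "numTopFeatures":
        # A fixed number of top features; keeps everything if n is smaller.
        kept = ranked[:num_top_features]
    elif selector_type == "percentile":
        # A fraction of all features (truncation is an assumption here).
        kept = ranked[:int(n * percentile)]
    elif selector_type == "fpr":
        # Every feature whose p-value is below the threshold.
        kept = [i for i in ranked if p_values[i] < fpr]
    elif selector_type == "fdr":
        # Benjamini-Hochberg: find the largest rank k with p_(k) <= fdr * k / n
        # and keep the k features with the smallest p-values.
        cutoff = 0
        for rank, i in enumerate(ranked, start=1):
            if p_values[i] <= fdr * rank / n:
                cutoff = rank
        kept = ranked[:cutoff]
    elif selector_type == "fwe":
        # Threshold scaled by 1/numFeatures.
        kept = [i for i in ranked if p_values[i] < fwe / n]
    else:
        raise ValueError(f"unknown selectorType: {selector_type}")
    return sorted(kept)


# Made-up p-values, one per feature.
p_values = [0.001, 0.045, 0.025, 0.2, 0.012]
for rule in ("numTopFeatures", "percentile", "fpr", "fdr", "fwe"):
    print(rule, select_features(p_values, rule, num_top_features=2, percentile=0.5))
```

With these made-up inputs the sketch keeps `[0, 4]` for numTopFeatures and percentile, `[0, 1, 2, 4]` for fpr, `[0, 2, 4]` for fdr, and `[0]` for fwe, which shows how fwe is the strictest of the three threshold-based rules.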

Deprecated since version 3.1.0: Use UnivariateFeatureSelector instead.

New in version 2.0.0.

Examples

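>>> # Assumes `spark` is an active SparkSession and `temp_path` is a writable directory.
>>> from pyspark.ml.feature import ChiSqSelector, ChiSqSelectorModel
>>> from pyspark.ml.linalg import Vectors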
>>> df = spark.createDataFrame(
...    [(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
...     (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
...     (Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],
...    ["features", "label"])
>>> selector = ChiSqSelector(numTopFeatures=1, outputCol="selectedFeatures")
>>> model = selector.fit(df)
>>> model.getFeaturesCol()
'features'
>>> model.setFeaturesCol("features")
ChiSqSelectorModel...
>>> model.transform(df).head().selectedFeatures
DenseVector([18.0])
>>> model.selectedFeatures
[2]
>>> chiSqSelectorPath = temp_path + "/chi-sq-selector"
>>> selector.save(chiSqSelectorPath)
>>> loadedSelector = ChiSqSelector.load(chiSqSelectorPath)
>>> loadedSelector.getNumTopFeatures() == selector.getNumTopFeatures()
True
>>> modelPath = temp_path + "/chi-sq-selector-model"
>>> model.save(modelPath)
>>> loadedModel = ChiSqSelectorModel.load(modelPath)
>>> loadedModel.selectedFeatures == model.selectedFeatures
True
>>> loadedModel.transform(df).take(1) == model.transform(df).take(1)
True
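The example above reports `[2]` as the selected feature index. As a rough illustration of the chi-squared test behind that ranking, the sketch below computes Pearson's chi-squared statistic and degrees of freedom for each feature column of the same three rows, treating every distinct feature value as a category. `chi_squared` is a hypothetical helper written in plain Python; Spark's internal computation, its conversion to p-values, and its handling of ties are not shown and may differ.

```python
from collections import Counter


def chi_squared(feature_values, labels):
    """Pearson chi-squared statistic and degrees of freedom for one feature."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))   # contingency table counts
    value_totals = Counter(feature_values)            # row totals
    label_totals = Counter(labels)                    # column totals

    statistic = 0.0
    for value, value_total in value_totals.items():
        for label, label_total in label_totals.items():
            expected = value_total * label_total / n
            statistic += (observed[(value, label)] - expected) ** 2 / expected

    dof = (len(value_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof


# Same rows as the doctest: (features, label).
rows = [
    ([0.0, 0.0, 18.0, 1.0], 1.0),
    ([0.0, 1.0, 12.0, 0.0], 0.0),
    ([1.0, 0.0, 15.0, 0.1], 0.0),
]
labels = [label for _, label in rows]
results = [
    chi_squared([features[j] for features, _ in rows], labels)
    for j in range(4)
]
for j, (statistic, dof) in enumerate(results):
    print(j, round(statistic, 3), dof)
```

In this sketch features 0 and 1 each give a statistic of about 0.75 with 1 degree of freedom, while features 2 and 3 each give about 3.0 with 2 degrees of freedom. The selector itself ranks features by p-value, so these statistics are only part of the picture.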

          

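Methods

`clear` (param)	Clears a param from the param map if it has been explicitly set.
`copy` ([extra])	Creates a copy of this instance with the same uid and some extra params.
`explainParam` (param)	Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
`explainParams` ()	Returns the documentation of all params with their optional default values and user-supplied values.
`extractParamMap` ([extra])	Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there are conflicts, i.e., with ordering: default param values < user-supplied values < extra.
`fit` (dataset[, params])	Fits a model to the input dataset with optional parameters.
`fitMultiple` (dataset, paramMaps)	Fits a model to the input dataset for each param map in paramMaps.
`getFdr` ()	Gets the value of fdr or its default value.
`getFeaturesCol` ()	Gets the value of featuresCol or its default value.
`getFpr` ()	Gets the value of fpr or its default value.
`getFwe` ()	Gets the value of fwe or its default value.
`getLabelCol` ()	Gets the value of labelCol or its default value.
`getNumTopFeatures` ()	Gets the value of numTopFeatures or its default value.
`getOrDefault` (param)	Gets the value of a param in the user-supplied param map or its default value.
`getOutputCol` ()	Gets the value of outputCol or its default value.
`getParam` (paramName)	Gets a param by its name.
`getPercentile` ()	Gets the value of percentile or its default value.
`getSelectorType` ()	Gets the value of selectorType or its default value.
`hasDefault` (param)	Checks whether a param has a default value.
`hasParam` (paramName)	Tests whether this instance contains a param with a given (string) name.
`isDefined` (param)	Checks whether a param is explicitly set by the user or has a default value.
`isSet` (param)	Checks whether a param is explicitly set by the user.
`load` (path)	Reads an ML instance from the input path, a shortcut of read().load(path).
`read` ()	Returns an MLReader instance for this class.
`save` (path)	Saves this ML instance to the given path, a shortcut of write().save(path).
`set` (param, value)	Sets a parameter in the embedded param map.
`setFdr` (value)	Sets the value of `fdr`.
`setFeaturesCol` (value)	Sets the value of `featuresCol`.
`setFpr` (value)	Sets the value of `fpr`.
`setFwe` (value)	Sets the value of `fwe`.
`setLabelCol` (value)	Sets the value of `labelCol`.
`setNumTopFeatures` (value)	Sets the value of `numTopFeatures`.
`setOutputCol` (value)	Sets the value of `outputCol`.
`setParams` (self, \*[, numTopFeatures, …])	Sets params for this ChiSqSelector.
`setPercentile` (value)	Sets the value of `percentile`.
`setSelectorType` (value)	Sets the value of `selectorType`.
`write` ()	Returns an MLWriter instance for this ML instance.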

Attributes

`fdr`
`featuresCol`
`fpr`
`fwe`
`labelCol`
`numTopFeatures`
`outputCol`
`params`	Returns all params ordered by name.
`percentile`
`selectorType`

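Methods Documentation

clear(param: pyspark.ml.param.Param) → None: Clears a param from the param map if it has been explicitly set.

copy(extra: Optional[ParamMap] = None) → JP

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with the extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra dict, optional: Extra parameters to copy to the new instance

Returns

JavaParams: Copy of this instance

explainParam(param: Union[str, pyspark.ml.param.Param]) → str: Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() → str: Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: Optional[ParamMap] = None) → ParamMap

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there are conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra dict, optional: extra param values

Returns

dict: merged param map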
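The merge ordering described for extractParamMap can be pictured with ordinary dictionaries. This is only a sketch of the precedence rule: `extract_param_map` is a hypothetical function, and it uses plain string keys where PySpark uses Param objects.

```python
def extract_param_map(defaults, user_supplied, extra=None):
    """Merge three layers so that later layers win on conflicts.

    Hypothetical illustration of: defaults < user-supplied < extra.
    """
    merged = dict(defaults)          # lowest priority
    merged.update(user_supplied)     # overrides defaults
    merged.update(extra or {})       # highest priority
    return merged


defaults = {"numTopFeatures": 50, "selectorType": "numTopFeatures", "fpr": 0.05}
user_supplied = {"numTopFeatures": 1}
extra = {"numTopFeatures": 2, "fpr": 0.01}

merged = extract_param_map(defaults, user_supplied, extra)
print(merged)
```

Here `numTopFeatures` ends up as 2 (from `extra`), `fpr` as 0.01 (from `extra`), and `selectorType` keeps its default because neither of the higher-priority layers mentions it.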

fit(dataset: pyspark.sql.dataframe.DataFrame, params: Union[ParamMap, List[ParamMap], Tuple[ParamMap], None] = None) → Union[M, List[M]]

Fits a model to the input dataset with optional parameters.

New in version 1.3.0.

Parameters

dataset pyspark.sql.DataFrame: input dataset.
params dict or list or tuple, optional: an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

Transformer or a list of Transformer: fitted model(s)

fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[ParamMap]) → Iterator[Tuple[int, M]]

Fits a model to the input dataset for each param map in paramMaps.

New in version 2.3.0.

Parameters

dataset pyspark.sql.DataFrame: input dataset.
paramMaps collections.abc.Sequence: A Sequence of param maps.

Returns

_FitMultipleIterator: A thread-safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model), where model was fit using paramMaps[index]. index values may not be sequential.
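Because the index values returned by fitMultiple may not be sequential, callers that need models in paramMaps order should place each model by its index rather than by arrival order. The sketch below shows that pattern in plain Python; `collect_models` and `out_of_order_results` are hypothetical names, and the generator merely stands in for the iterator that fitMultiple returns.

```python
def collect_models(model_iterator, num_models):
    """Place each fitted model at the position of the param map that produced it."""
    models = [None] * num_models
    for index, model in model_iterator:
        models[index] = model
    return models


def out_of_order_results():
    """Hypothetical stand-in for fitMultiple(): yields (index, model) pairs."""
    yield 2, "model fitted with paramMaps[2]"
    yield 0, "model fitted with paramMaps[0]"
    yield 1, "model fitted with paramMaps[1]"


models = collect_models(out_of_order_results(), num_models=3)
print(models)
```

With a real estimator, the same loop would read `for index, model in selector.fitMultiple(df, paramMaps)`, assuming `selector`, `df`, and `paramMaps` are defined as in the examples on this page.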

getFdr() → float: Gets the value of fdr or its default value.

New in version 2.2.0.

getFeaturesCol() → str: Gets the value of featuresCol or its default value.

getFpr() → float: Gets the value of fpr or its default value.

New in version 2.1.0.

getFwe() → float: Gets the value of fwe or its default value.

New in version 2.2.0.

getLabelCol() → str: Gets the value of labelCol or its default value.

getNumTopFeatures() → int: Gets the value of numTopFeatures or its default value.

New in version 2.0.0.

getOrDefault(param: Union[str, pyspark.ml.param.Param[T]]) → Union[Any, T]: Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol() → str: Gets the value of outputCol or its default value.

getParam(paramName: str) → pyspark.ml.param.Param: Gets a param by its name.

getPercentile() → float: Gets the value of percentile or its default value.

New in version 2.1.0.

getSelectorType() → str: Gets the value of selectorType or its default value.

New in version 2.1.0.

hasDefault(param: Union[str, pyspark.ml.param.Param[Any]]) → bool: Checks whether a param has a default value.

hasParam(paramName: str) → bool: Tests whether this instance contains a param with a given (string) name.

isDefined(param: Union[str, pyspark.ml.param.Param[Any]]) → bool: Checks whether a param is explicitly set by the user or has a default value.

isSet(param: Union[str, pyspark.ml.param.Param[Any]]) → bool: Checks whether a param is explicitly set by the user.
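The relationship between hasDefault, isSet, isDefined, and getOrDefault can be summarized with two dictionaries, one for defaults and one for user-supplied values. The class below is a hypothetical stand-in written for illustration, not PySpark's Params class; the string keys, the snake_case method names, and the `KeyError` are assumptions of the sketch.

```python
class ParamStoreSketch:
    """Hypothetical stand-in for the default and user-supplied param maps."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)
        self._user_supplied = {}

    def set(self, name, value):
        self._user_supplied[name] = value

    def has_default(self, name):
        return name in self._defaults

    def is_set(self, name):
        # True only for values the user supplied explicitly.
        return name in self._user_supplied

    def is_defined(self, name):
        # Explicitly set OR backed by a default.
        return self.is_set(name) or self.has_default(name)

    def get_or_default(self, name):
        # User-supplied value first, then the default, otherwise an error.
        if name in self._user_supplied:
            return self._user_supplied[name]
        if name in self._defaults:
            return self._defaults[name]
        raise KeyError(f"param {name!r} is neither set nor has a default")


store = ParamStoreSketch({"numTopFeatures": 50, "fpr": 0.05})
store.set("numTopFeatures", 1)

print(store.is_set("numTopFeatures"), store.get_or_default("numTopFeatures"))
print(store.is_set("fpr"), store.is_defined("fpr"), store.get_or_default("fpr"))
```

In the sketch, `numTopFeatures` is both set and defined and resolves to the user's value, while `fpr` is defined but not set and resolves to its default.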

classmethod load(path: str) → RL: Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read() → pyspark.ml.util.JavaMLReader[RL]: Returns an MLReader instance for this class.

save(path: str) → None: Saves this ML instance to the given path, a shortcut of write().save(path).

set(param: pyspark.ml.param.Param, value: Any) → None: Sets a parameter in the embedded param map.

setFdr(value: float) → P: Sets the value of fdr. Only applicable when selectorType = "fdr".

New in version 2.2.0.

setFeaturesCol(value: str) → P: Sets the value of featuresCol.

setFpr(value: float) → P: Sets the value of fpr. Only applicable when selectorType = "fpr".

New in version 2.1.0.

setFwe(value: float) → P: Sets the value of fwe. Only applicable when selectorType = "fwe".

New in version 2.2.0.

setLabelCol(value: str) → P: Sets the value of labelCol.

setNumTopFeatures(value: int) → P: Sets the value of numTopFeatures. Only applicable when selectorType = "numTopFeatures".

New in version 2.0.0.

setOutputCol(value: str) → P: Sets the value of outputCol.

setParams(self, \*, numTopFeatures=50, featuresCol="features", outputCol=None, labelCol="label", selectorType="numTopFeatures", percentile=0.1, fpr=0.05, fdr=0.05, fwe=0.05): Sets params for this ChiSqSelector.

New in version 2.0.0.

setPercentile(value: float) → P: Sets the value of percentile. Only applicable when selectorType = "percentile".

New in version 2.1.0.

setSelectorType(value: str) → P: Sets the value of selectorType.

New in version 2.1.0.

write() → pyspark.ml.util.JavaMLWriter: Returns an MLWriter instance for this ML instance.

Attributes Documentation

fdr = Param(parent='undefined', name='fdr', doc='The upper bound of the expected false discovery rate.')

featuresCol = Param(parent='undefined', name='featuresCol', doc='features column name.')

fpr = Param(parent='undefined', name='fpr', doc='The highest p-value for features to be kept.')

fwe = Param(parent='undefined', name='fwe', doc='The upper bound of the expected family-wise error rate.')

labelCol = Param(parent='undefined', name='labelCol', doc='label column name.')

numTopFeatures = Param(parent='undefined', name='numTopFeatures', doc='Number of features that selector will select, ordered by ascending p-value. If the number of features is < numTopFeatures, then this will select all features.')

outputCol = Param(parent='undefined', name='outputCol', doc='output column name.')

params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

percentile = Param(parent='undefined', name='percentile', doc='Percentile of features that selector will select, ordered by ascending p-value.')

selectorType = Param(parent='undefined', name='selectorType', doc='The selector type. Supported options: numTopFeatures (default), percentile, fpr, fdr, fwe.')
