pyspark.sql.functions.count_min_sketch

pyspark.sql.functions.count_min_sketch(col: ColumnOrName, eps: ColumnOrName, confidence: ColumnOrName, seed: ColumnOrName) → pyspark.sql.column.Column

Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before use. A count-min sketch is a probabilistic data structure that estimates how often each value occurs, using sub-linear space.
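
To make the data structure concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not Spark's implementation: the class name `TinyCountMinSketch`, the salted use of Python's built-in `hash`, and the sizing formulas (about `2 / eps` counters per row, and enough rows for the requested confidence) are assumptions made for this example.

```python
import math
import random


class TinyCountMinSketch:
    """Illustrative count-min sketch (hypothetical helper, not Spark's code)."""

    def __init__(self, eps: float, confidence: float, seed: int) -> None:
        # Assumed sizing: a smaller eps gives wider rows, a higher
        # confidence gives more rows.
        self.width = math.ceil(2 / eps)
        self.depth = math.ceil(-math.log(1 - confidence) / math.log(2))
        rng = random.Random(seed)
        # One random salt per row stands in for a family of hash functions.
        self.salts = [rng.getrandbits(32) for _ in range(self.depth)]
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0

    def _bucket(self, item: int, row: int) -> int:
        # Map the item to one counter in the given row.
        return hash((self.salts[row], item)) % self.width

    def add(self, item: int) -> None:
        self.total += 1
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += 1

    def estimate(self, item: int) -> int:
        # Collisions can only inflate a counter, so the minimum across
        # rows is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._bucket(item, row)]
            for row in range(self.depth)
        )


sketch = TinyCountMinSketch(eps=0.5, confidence=0.5, seed=1)
for value in [1, 2, 1]:
    sketch.add(value)
```

Because counters are only ever incremented, `sketch.estimate(x)` never falls below the true count of `x`; it can overestimate when different values share a counter.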

New in version 3.5.0.

Parameters
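col : Column or str

Target column to compute on.

eps : Column or str

Relative error, must be positive.

confidence : Column or str

Confidence, must be positive and less than 1.0.

seed : Column or str

Random seed.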

Returns

Column

Count-min sketch of the column.

Examples

>>> from pyspark.sql.functions import count_min_sketch, hex, lit
>>> df = spark.createDataFrame([[1], [2], [1]], ['data'])
>>> df = df.agg(count_min_sketch(df.data, lit(0.5), lit(0.5), lit(1)).alias('sketch'))
>>> df.select(hex(df.sketch).alias('r')).collect()
[Row(r='0000000100000000000000030000000100000004000000005D8D6AB90000000000000000000000000000000200000000000000010000000000000000')]
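
The hex string in the output above can be unpacked by hand to see what the sketch holds. The following is a minimal sketch using only the standard library. It assumes a big-endian layout of an int version, a long total count, an int depth, an int width, then `depth` long hash coefficients followed by `depth * width` long counters. That layout is an assumption for this example; for real use, deserialize the bytes with Spark's CountMinSketch class instead.

```python
import struct

# Hex string from the example output, split in two only for line length.
hex_sketch = (
    "0000000100000000000000030000000100000004000000005D8D6AB9"
    "0000000000000000000000000000000200000000000000010000000000000000"
)
raw = bytes.fromhex(hex_sketch)

# Assumed header: version (int), total count (long), depth (int), width (int).
header_format = ">iqii"
version, total, depth, width = struct.unpack_from(header_format, raw, 0)
offset = struct.calcsize(header_format)

# Assumed next: one long hash coefficient per row.
hash_coeffs = struct.unpack_from(f">{depth}q", raw, offset)
offset += 8 * depth

# Assumed last: depth * width long counters, stored row by row.
counters = struct.unpack_from(f">{depth * width}q", raw, offset)
rows = [list(counters[r * width:(r + 1) * width]) for r in range(depth)]

print(version, total, depth, width)  # 1 3 1 4
print(rows)                          # [[0, 2, 1, 0]]
```

Under that reading, the sketch has one row of four counters, and the counters sum to 3, the number of input rows: the value 1 appears twice and the value 2 once.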