pyspark.sql.functions.count_min_sketch

pyspark.sql.functions.count_min_sketch(col: ColumnOrName, eps: ColumnOrName, confidence: ColumnOrName, seed: ColumnOrName) → pyspark.sql.column.Column
Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before use. A count-min sketch is a probabilistic data structure that estimates how often items occur, using sub-linear space.
New in version 3.5.0.
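To make the roles of eps, confidence and seed in the description above concrete, here is a minimal pure-Python sketch of the data structure. It is an illustration only, not Spark's implementation: the class name `TinyCountMinSketch`, the sizing rule (width = ceil(2 / eps), depth = ceil(-ln(1 - confidence) / ln 2)) and the linear hash family are assumptions chosen for the example.

```python
import math
import random


class TinyCountMinSketch:
    """Illustrative count-min sketch; NOT Spark's implementation."""

    _PRIME = (1 << 61) - 1  # Mersenne prime used by the toy hash family

    def __init__(self, eps, confidence, seed):
        if eps <= 0:
            raise ValueError("eps must be positive")
        if not 0 < confidence < 1:
            raise ValueError("confidence must be in (0, 1)")
        # Assumed sizing rule: a smaller eps gives a wider table,
        # a higher confidence gives more rows.
        self.width = math.ceil(2 / eps)
        self.depth = math.ceil(-math.log(1 - confidence) / math.log(2))
        rng = random.Random(seed)  # the seed fixes the hash functions
        self._coeffs = [
            (rng.randrange(1, self._PRIME), rng.randrange(self._PRIME))
            for _ in range(self.depth)
        ]
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total_count = 0

    def _columns(self, item):
        # One column index per row, computed by a per-row linear hash.
        h = hash(item) % self._PRIME
        return [((a * h + b) % self._PRIME) % self.width for a, b in self._coeffs]

    def add(self, item, count=1):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count
        self.total_count += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over all
        # rows is the tightest upper bound the sketch can offer.
        return min(
            self.table[row][col] for row, col in enumerate(self._columns(item))
        )


sketch = TinyCountMinSketch(eps=0.5, confidence=0.5, seed=1)
for value in [1, 2, 1]:
    sketch.add(value)

print(sketch.width, sketch.total_count)
print(sketch.estimate(1), sketch.estimate(2))
```

The estimates never fall below the true counts (2 for the value 1 and 1 for the value 2 here), but hash collisions may push them higher. Memory use depends only on eps and confidence, not on the number of distinct values.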
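- Parameters
  - col : ColumnOrName. Target column to compute on.
  - eps : ColumnOrName. Relative error; must be positive.
  - confidence : ColumnOrName. Confidence; must be positive and less than 1.0.
  - seed : ColumnOrName. Random seed.
- Returns
  - Column. Count-min sketch of the column.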
Examples
>>> from pyspark.sql.functions import count_min_sketch, hex, lit
>>> df = spark.createDataFrame([[1], [2], [1]], ['data'])
>>> df = df.agg(count_min_sketch(df.data, lit(0.5), lit(0.5), lit(1)).alias('sketch'))
>>> df.select(hex(df.sketch).alias('r')).collect()
[Row(r='0000000100000000000000030000000100000004000000005D8D6AB90000000000000000000000000000000200000000000000010000000000000000')]
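
The hex string in the example output can be unpacked by hand to see what the sketch holds. The snippet below is a rough sketch that assumes the byte layout written by Spark's Java `CountMinSketch` serializer: big-endian, with a 4-byte version, an 8-byte total count, 4-byte depth and width, one 8-byte hash value per row, and then depth × width 8-byte counters. That layout is not stated on this page, so treat it as an assumption; the hex literal is the example output split at those assumed field boundaries.

```python
import struct

# Hex output from the example, split at the assumed field boundaries.
hex_sketch = (
    "00000001"                   # version
    "00000000" "00000003"        # total count
    "00000001"                   # depth (rows)
    "00000004"                   # width (columns)
    "00000000" "5D8D6AB9"        # per-row hash value (one row here)
    "00000000" "00000000"        # counter [0][0]
    "00000000" "00000002"        # counter [0][1]
    "00000000" "00000001"        # counter [0][2]
    "00000000" "00000000"        # counter [0][3]
)

raw = bytes.fromhex(hex_sketch)

# Fixed-size header: int, long, int, int (big-endian, no padding).
header_format = ">iqii"
version, total_count, depth, width = struct.unpack_from(header_format, raw, 0)
offset = struct.calcsize(header_format)

# One signed 64-bit hash value per row.
hash_values = struct.unpack_from(f">{depth}q", raw, offset)
offset += 8 * depth

# depth * width signed 64-bit counters, stored row by row.
cells = struct.unpack_from(f">{depth * width}q", raw, offset)
table = [list(cells[r * width:(r + 1) * width]) for r in range(depth)]

print(version, total_count, depth, width)
print(table)
```

Under that assumed layout, the sketch has one row of four counters, and the counters sum to the three input rows. In practice, deserialize with Spark's own `CountMinSketch` class rather than parsing the bytes by hand.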