pyspark.RDD.subtract

RDD.subtract(other: pyspark.rdd.RDD[T], numPartitions: Optional[int] = None) → pyspark.rdd.RDD[T]

Return each value in self that is not contained in other.

New in version 0.9.1.

Parameters

other RDD: another RDD
numPartitions int, optional: the number of partitions in the new RDD

Returns

RDD: an RDD containing the elements of this RDD that are not in other

See Also

RDD.subtractByKey()

Examples

>>> rdd1 = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> rdd2 = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(rdd1.subtract(rdd2).collect())
[('a', 1), ('b', 4), ('b', 5)]
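
To make the comparison rule concrete, the plain-Python sketch below models the semantics on ordinary lists. It is not Spark's implementation and needs no SparkContext. The helper names `subtract_model` and `subtract_by_key_model` are hypothetical and used only for illustration. It shows that subtract compares whole elements, whereas RDD.subtractByKey() compares only the key of each pair, and that duplicates in self are kept when they do not occur in other.

```python
def subtract_model(left, right):
    """Model of subtract: keep every element of `left` that does not occur in `right`.

    Elements are compared as whole values and must be hashable.
    """
    excluded = set(right)
    return [x for x in left if x not in excluded]


def subtract_by_key_model(left, right):
    """Model of subtractByKey: drop every (key, value) pair whose key occurs in `right`."""
    excluded_keys = {key for key, _ in right}
    return [(key, value) for key, value in left if key not in excluded_keys]


# Same data as the example above, held in plain lists instead of RDDs.
rdd1 = [("a", 1), ("b", 4), ("b", 5), ("a", 3)]
rdd2 = [("a", 3), ("c", None)]

# Whole-element comparison: only ("a", 3) is removed.
print(sorted(subtract_model(rdd1, rdd2)))         # [('a', 1), ('b', 4), ('b', 5)]

# Key-only comparison: every pair with key "a" is removed.
print(sorted(subtract_by_key_model(rdd1, rdd2)))  # [('b', 4), ('b', 5)]

# Duplicates in the left collection survive unless the value occurs on the right.
print(subtract_model([1, 1, 2, 2, 3], [2]))       # [1, 1, 3]
```

Unlike a set difference, the result is not deduplicated, so counts of surviving elements match their counts in self.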

          
