重新排序分区#

group reorder_partition

枚举

enum class hash_id#

标识用于哈希分区的哈希函数。

值：

enumerator HASH_IDENTITY#: 身份哈希函数，它简单地返回要哈希的键。

enumerator HASH_MURMUR3#: Murmur3 哈希函数。

函数

std::pair<std::unique_ptr<table>, std::vector<size_type>> partition(table_view const &t, column_view const &partition_map, size_type num_partitions, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

根据partition_map指定的映射对t的行进行分区。

对于t中的每一行i，partition_map[i]表示行i属于哪个分区。partition通过重新排列t的行来创建一个新表，使得同一分区中的行是连续的。返回的表按分区顺序从[0, num_partitions)升序排列。每个分区内的顺序未定义。

返回一个vector，包含num_partitions + 1个值，这些值指示返回表中每个分区的起始位置，即分区i从offsets[i]（包含）开始，到offset[i+1]（不包含）结束。因此，如果[0, num_partitions)中的值j没有出现在partition_map中，分区j将为空，即offsets[j+1] - offsets[j] == 0。

partition_map 中的值必须在 [0, num_partitions) 范围内，否则行为是未定义的。

Throws:

cudf::logic_error – 当 partition_map 是非整数类型时
cudf::logic_error – 当 partition_map.has_nulls() == true
cudf::logic_error – 当 partition_map.size() != t.num_rows()

Parameters:

t – 要分区的表
partition_map – 非空的整数值列，将t中的每一行映射到其分区。
num_partitions – 分区总数
stream – 用于设备内存操作和内核启动的CUDA流
mr – 用于分配返回表的设备内存的设备内存资源

Returns:

包含重新排序的表和num_partitions + 1个偏移量的向量的对，每个分区的偏移量使得分区i的大小由offset[i+1] - offset[i]确定。

std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition(table_view const &input, std::vector<size_type> const &columns_to_hash, int num_partitions, hash_id hash_function = hash_id::HASH_MURMUR3, uint32_t seed = DEFAULT_HASH_SEED, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

将输入表中的行分区到多个输出表中。

将input的行根据columns_to_hash指定的列的哈希值分区到num_partitions个分区中。分区到同一分区的行在输出表中连续分组。返回一个指向输出表中每个分区开始处的行偏移向量。

Throws:

std::out_of_range – 如果索引是 columns_to_hash 无效

Parameters:

input – 要分区的表
columns_to_hash – 输入列的索引以进行哈希处理
num_partitions – 使用的分区数量
hash_function – 可选的哈希ID，用于选择要使用的哈希函数
seed – 可选的哈希函数种子值
stream – 用于设备内存操作和内核启动的CUDA流
mr – 用于分配返回表的设备内存的设备内存资源

Returns:

一个输出表和每个分区的行偏移量向量

std::pair<std::unique_ptr<cudf::table>, std::vector<cudf::size_type>> round_robin_partition(table_view const &input, cudf::size_type num_partitions, cudf::size_type start_partition = 0, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

轮询分区。

返回一个新表，其中行被重新排列为分区组，并返回一个指向输出表中每个分区开始处的行偏移向量。行的分区分配基于它们在表中的行索引，以轮询方式进行。

这个算法的一个很好的类比是发牌：

牌组表示为表中的行。
分区的数量是发牌的玩家数量。
start_partition 表示哪个玩家首先开始拿牌。

该算法有两个结果：

另一副牌通过将每个玩家的牌重新堆叠回一副牌中形成，保留每个玩家发牌的顺序，从玩家0开始。
一个指向输出牌组的向量，指示玩家牌开始的位置。

玩家的牌堆（分区）是从相应的偏移量开始，到下一个玩家的起始偏移量结束的牌范围，如果是最后一个玩家，则到牌堆的最后一张牌结束。

当 num_partitions > nrows 时，玩家数量多于卡片数量。我们从指定的第一个玩家开始发牌，依次轮流发牌，直到卡片发完而玩家还未发完。没有获得任何卡片的玩家由 offset[i] == offset[i+1] or offset[i] == table.num_rows() if i == num_partitions-1 表示，他们的牌堆（分区）中没有卡片（行）。

Example 1:
input:
table => col 1 {0, ..., 12}
num_partitions = 3
start_partition = 0

output: pair<table, partition_offsets>
table => col 1 {0,3,6,9,12,1,4,7,10,2,5,8,11}
partition_offsets => {0,5,9}

Example 2:
input:
table => col 1 {0, ..., 12}
num_partitions = 3
start_partition = 1

output: pair<table, partition_offsets>
table => col 1 {2,5,8,11,0,3,6,9,12,1,4,7,10}
partition_offsets => {0,4,9}

Example 3:
input:
table => col 1 {0, ..., 10}
num_partitions = 3
start_partition = 0

output: pair<table, partition_offsets>
table => col 1 {0,3,6,9,1,4,7,10,2,5,8}
partition_offsets => {0,4,8}

Example 4:
input:
table => col 1 {0, ..., 10}
num_partitions = 3
start_partition = 1

output: pair<table, partition_offsets>
table => col 1 {2,5,8,0,3,6,9,1,4,7,10}
partition_offsets => {0,3,7}

Example 5:
input:
table => col 1 {0, ..., 10}
num_partitions = 3
start_partition = 2

output: pair<table, partition_offsets>
table => col 1 {1,4,7,10,2,5,8,0,3,6,9}
partition_offsets => {0,4,7}

Example 6:
input:
table => col 1 {0, ..., 10}
num_partitions = 15 > num_rows = 11
start_partition = 2

output: pair<table, partition_offsets>
table => col 1 {0,1,2,3,4,5,6,7,8,9,10}
partition_offsets => {0,0,0,1,2,3,4,5,6,7,8,9,10,11,11}

Example 7:
input:
table => col 1 {0, ..., 10}
num_partitions = 15 > num_rows = 11
start_partition = 10

output: pair<table, partition_offsets>
table => col 1 {5,6,7,8,9,10,0,1,2,3,4}
partition_offsets => {0,1,2,3,4,5,6,6,6,6,6,7,8,9,10}

Example 8:
input:
table => col 1 {0, ..., 10}
num_partitions = 15 > num_rows = 11
start_partition = 14

output: pair<table, partition_offsets>
table => col 1 {1,2,3,4,5,6,7,8,9,10,0}
partition_offsets => {0,1,2,3,4,5,6,7,8,9,10,10,10,10,10}

Example 9:
input:
table => col 1 {0, ..., 10}
num_partitions = 11 == num_rows = 11
start_partition = 2

output: pair<table, partition_offsets>
table => col 1 {9,10,0,1,2,3,4,5,6,7,8}
partition_offsets => {0,1,2,3,4,5,6,7,8,9,10}

Throws:

cudf::logic_error – 如果 num_partitions <= 1
cudf::logic_error – 如果 start_partition >= num_partitions

Parameters:

input – [in] 要轮询分区的输入表
num_partitions – [in] 表的分区数量
start_partition – [in] 第一个分区的索引
stream – [in] 用于设备内存操作和内核启动的CUDA流
mr – [in] 用于分配返回表的设备内存的设备内存资源

Returns:

一个由指向分区表的unique_ptr和表中每个分区的分区偏移量组成的std::pair。