聚合滚动#

group aggregation_rolling

函数

std::unique_ptr<column> rolling_window(column_view const &input, size_type preceding_window, size_type following_window, size_type min_periods, rolling_aggregation const &agg, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

对列中的值应用固定大小的滚动窗口函数。

此函数在输入列的每个元素i周围的窗口中聚合值，如果没有足够的观察值，则使元素i的位掩码无效。窗口大小是静态的（每个元素相同）。这与Pandas的DataFrame.rolling API匹配，但有一些显著差异：

它使用一个两部分窗口而不是中心标志，以允许更灵活的窗口。总窗口大小 = preceding_window + following_window。元素 i 使用元素 [i-preceding_window+1, i+following_window] 进行窗口计算。
对于不满足最小观察次数的输出行，此函数不会存储NA/NaN，而是更新列的有效位掩码以指示哪些元素是有效的。

返回列类型的注意事项：

count聚合返回的列始终具有INT32类型。
VARIANCE/STD 聚合返回的列始终具有 FLOAT64 类型。
所有其他运算符返回与输入相同类型的列。因此，建议在进行滚动MEAN之前，将整数列类型（尤其是低精度整数）转换为FLOAT32或FLOAT64。

Parameters:

input – [in] 输入列
preceding_window – [in] 向后方向的静态滚动窗口大小
following_window – [in] 向前方向的静态滚动窗口大小
min_periods – [in] 窗口中需要的最小观测值数量，否则元素 i 为 null。
agg – [in] 滚动窗口聚合类型（SUM, MAX, MIN 等）
stream – [in] 用于设备内存操作和内核启动的CUDA流
mr – [in] 用于分配返回列的设备内存的设备内存资源

Returns:

一个可为空的输出列，包含滚动窗口的结果

std::unique_ptr<column> rolling_window(column_view const &input, column_view const &default_outputs, size_type preceding_window, size_type following_window, size_type min_periods, rolling_aggregation const &agg, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

对列中的值应用固定大小的滚动窗口函数。

它使用一个两部分窗口而不是中心标志，以允许更灵活的窗口。总窗口大小 = preceding_window + following_window。元素 i 使用元素 [i-preceding_window+1, i+following_window] 进行窗口计算。
对于不满足最小观察次数的输出行，此函数不会存储NA/NaN，而是更新列的有效位掩码以指示哪些元素是有效的。

返回列类型的注意事项：

count聚合返回的列始终具有INT32类型。
VARIANCE/STD 聚合返回的列始终具有 FLOAT64 类型。
所有其他运算符返回与输入相同类型的列。因此，建议在进行滚动MEAN之前，将整数列类型（尤其是低精度整数）转换为FLOAT32或FLOAT64。

Parameters:

input – [in] 输入列
preceding_window – [in] 向后方向的静态滚动窗口大小
following_window – [in] 向前方向的静态滚动窗口大小
min_periods – [in] 窗口中需要的最小观测值数量，否则元素 i 为 null。
agg – [in] 滚动窗口聚合类型（SUM, MAX, MIN 等）
stream – [in] 用于设备内存操作和内核启动的CUDA流
mr – [in] 用于分配返回列的设备内存的设备内存资源
default_outputs – 每行的默认值列，用于返回而不是空值。用于LEAD()/LAG()，如果行偏移超出列的边界。

Returns:

一个可空的输出列，包含滚动窗口的结果

std::unique_ptr<column> grouped_rolling_window(table_view const &group_keys, column_view const &input, size_type preceding_window, size_type following_window, size_type min_periods, rolling_aggregation const &aggr, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

对列中的值应用一个分组感知的、固定大小的滚动窗口函数。

与 rolling_window() 类似，此函数在指定的 input 列的每个元素周围的窗口中聚合值。它与 rolling_window() 的不同之处在于，input 列的元素被分组到不同的组中（例如，groupby 的结果）。窗口聚合不能跨越组边界。对于 input 的第 i 行，组由 group_keys 下的列的对应（即第 i 个）值确定。

注意：此方法要求行已按group_key值预先排序。

Example: Consider a user-sales dataset, where the rows look as follows:
{ "user_id", sales_amt, day }

The `grouped_rolling_window()` method enables windowing queries such as grouping a dataset by
`user_id`, and summing up the `sales_amt` column over a window of 3 rows (2 preceding (including
current row), 1 row following).

In this example,
   1. `group_keys == [ user_id ]`
   2. `input == sales_amt`
The data are grouped by `user_id`, and ordered by `day`-string. The aggregation
(SUM) is then calculated for a window of 3 values around (and including) each row.

For the following input:

 [ // user,  sales_amt
   { "user1",   10      },
   { "user2",   20      },
   { "user1",   20      },
   { "user1",   10      },
   { "user2",   30      },
   { "user2",   80      },
   { "user1",   50      },
   { "user1",   60      },
   { "user2",   40      }
 ]

Partitioning (grouping) by `user_id` yields the following `sales_amt` vector
(with 2 groups, one for each distinct `user_id`):

   [ 10,  20,  10,  50,  60,  20,  30,  80,  40 ]
     <-------user1-------->|<------user2------->

The SUM aggregation is applied with 1 preceding and 1 following
row, with a minimum of 1 period. The aggregation window is thus 3 rows wide,
yielding the following column:

   [ 30, 40,  80, 120, 110,  50, 130, 150, 120 ]

Note: The SUMs calculated at the group boundaries (i.e. indices 0, 4, 5, and 8)
consider only 2 values each, in spite of the window-size being 3.
Each aggregation operation cannot cross group boundaries.

返回的列对于 op == COUNT 始终具有 INT32 类型。所有其他操作符返回的列与输入的类型相同。因此，建议在进行滚动 MEAN 之前将整数列类型（尤其是低精度整数）转换为 FLOAT32 或 FLOAT64。

注意：preceding_window 和 following_window 可能具有负值。这会产生当前行可能根本不包含在窗口中的情况。例如，考虑一个定义为 (preceding=3, following=-1) 的窗口。这将产生一个从当前行前 2 行（即 3-1）到当前行前 1 行的窗口。对于上面的例子，第 3 行的窗口是：

[ 10, 20, 10, 50, 60, 20, 30, 80, 40 ] <—窗口–> ^ | 当前行

同样地，preceding 可能有一个负值，表示窗口从当前行之后的位置开始。它与 following 的语义略有不同，因为 preceding 包括当前行。因此：

preceding=1 => 窗口从当前行开始。
preceding=0 => 窗口从当前行的下一行开始。
preceding=-1 => 窗口从当前行的前2行开始。等等。

Parameters:

group_keys – [in] 分组列（已预排序）
input – [in] 输入列（待聚合）
preceding_window – [in] 静态滚动窗口的大小，向后方向（对于正值）或向前方向（对于负值）
following_window – [in] 正向（对于正值）或反向（对于负值）的静态滚动窗口大小
min_periods – [in] 窗口中需要的最小观测值数量，否则元素 i 为 null。
aggr – [in] 滚动窗口聚合类型（SUM, MAX, MIN 等）
stream – [in] 用于设备内存操作和内核启动的CUDA流
mr – [in] 用于分配返回列的设备内存的设备内存资源

Returns:

一个可为空的输出列，包含滚动窗口的结果

std::unique_ptr<column> grouped_rolling_window(table_view const &group_keys, column_view const &input, window_bounds preceding_window, window_bounds following_window, size_type min_periods, rolling_aggregation const &aggr, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

对列中的值应用一个分组感知的、固定大小的滚动窗口函数。

注意：此方法要求行已按group_key值预先排序。

Example: Consider a user-sales dataset, where the rows look as follows:
{ "user_id", sales_amt, day }

The `grouped_rolling_window()` method enables windowing queries such as grouping a dataset by
`user_id`, and summing up the `sales_amt` column over a window of 3 rows (2 preceding (including
current row), 1 row following).

In this example,
   1. `group_keys == [ user_id ]`
   2. `input == sales_amt`
The data are grouped by `user_id`, and ordered by `day`-string. The aggregation
(SUM) is then calculated for a window of 3 values around (and including) each row.

For the following input:

 [ // user,  sales_amt
   { "user1",   10      },
   { "user2",   20      },
   { "user1",   20      },
   { "user1",   10      },
   { "user2",   30      },
   { "user2",   80      },
   { "user1",   50      },
   { "user1",   60      },
   { "user2",   40      }
 ]

Partitioning (grouping) by `user_id` yields the following `sales_amt` vector
(with 2 groups, one for each distinct `user_id`):

   [ 10,  20,  10,  50,  60,  20,  30,  80,  40 ]
     <-------user1-------->|<------user2------->

The SUM aggregation is applied with 1 preceding and 1 following
row, with a minimum of 1 period. The aggregation window is thus 3 rows wide,
yielding the following column:

   [ 30, 40,  80, 120, 110,  50, 130, 150, 120 ]

Note: The SUMs calculated at the group boundaries (i.e. indices 0, 4, 5, and 8)
consider only 2 values each, in spite of the window-size being 3.
Each aggregation operation cannot cross group boundaries.

[ 10, 20, 10, 50, 60, 20, 30, 80, 40 ] <—窗口–> ^ | 当前行

同样地，preceding 可能有一个负值，表示窗口从当前行之后的位置开始。它与 following 的语义略有不同，因为 preceding 包括当前行。因此：

preceding=1 => 窗口从当前行开始。
preceding=0 => 窗口从当前行的下一行开始。
preceding=-1 => 窗口从当前行的前2行开始。等等。

Parameters:

group_keys – [in] 分组列（已预排序）
input – [in] 输入列（待聚合）
preceding_window – [in] 静态滚动窗口的大小，向后方向（对于正值）或向前方向（对于负值）
following_window – [in] 静态滚动窗口在正向（对于正值）或反向（对于负值）的大小
min_periods – [in] 窗口中需要的最小观测值数量，否则元素 i 为 null。
aggr – [in] 滚动窗口聚合类型（SUM, MAX, MIN 等）
stream – [in] 用于设备内存操作和内核启动的CUDA流
mr – [in] 用于分配返回列的设备内存的设备内存资源

Returns:

一个可为空的输出列，包含滚动窗口的结果

std::unique_ptr<column> grouped_rolling_window(table_view const &group_keys, column_view const &input, column_view const &default_outputs, size_type preceding_window, size_type following_window, size_type min_periods, rolling_aggregation const &aggr, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

对列中的值应用一个分组感知的、固定大小的滚动窗口函数。

注意：此方法要求行已按group_key值预先排序。

Example: Consider a user-sales dataset, where the rows look as follows:
{ "user_id", sales_amt, day }

The `grouped_rolling_window()` method enables windowing queries such as grouping a dataset by
`user_id`, and summing up the `sales_amt` column over a window of 3 rows (2 preceding (including
current row), 1 row following).

In this example,
   1. `group_keys == [ user_id ]`
   2. `input == sales_amt`
The data are grouped by `user_id`, and ordered by `day`-string. The aggregation
(SUM) is then calculated for a window of 3 values around (and including) each row.

For the following input:

 [ // user,  sales_amt
   { "user1",   10      },
   { "user2",   20      },
   { "user1",   20      },
   { "user1",   10      },
   { "user2",   30      },
   { "user2",   80      },
   { "user1",   50      },
   { "user1",   60      },
   { "user2",   40      }
 ]

Partitioning (grouping) by `user_id` yields the following `sales_amt` vector
(with 2 groups, one for each distinct `user_id`):

   [ 10,  20,  10,  50,  60,  20,  30,  80,  40 ]
     <-------user1-------->|<------user2------->

The SUM aggregation is applied with 1 preceding and 1 following
row, with a minimum of 1 period. The aggregation window is thus 3 rows wide,
yielding the following column:

   [ 30, 40,  80, 120, 110,  50, 130, 150, 120 ]

Note: The SUMs calculated at the group boundaries (i.e. indices 0, 4, 5, and 8)
consider only 2 values each, in spite of the window-size being 3.
Each aggregation operation cannot cross group boundaries.

[ 10, 20, 10, 50, 60, 20, 30, 80, 40 ] <—窗口–> ^ | 当前行

同样地，preceding 可能有一个负值，表示窗口从当前行之后的位置开始。它与 following 的语义略有不同，因为 preceding 包括当前行。因此：

preceding=1 => 窗口从当前行开始。
preceding=0 => 窗口从当前行的下一行开始。
preceding=-1 => 窗口从当前行的前2行开始。等等。

Parameters:

group_keys – [in] 分组列（已预排序）
input – [in] 输入列（待聚合）
preceding_window – [in] 静态滚动窗口的大小，向后方向（对于正值）或向前方向（对于负值）
following_window – [in] 正向（对于正值）或反向（对于负值）的静态滚动窗口大小
min_periods – [in] 窗口中需要的最小观测值数量，否则元素 i 为 null。
aggr – [in] 滚动窗口聚合类型（SUM, MAX, MIN 等）
stream – [in] 用于设备内存操作和内核启动的CUDA流
mr – [in] 用于分配返回列的设备内存的设备内存资源
default_outputs – 每行的默认值列，用于返回而不是空值。用于LEAD()/LAG()，如果行偏移超出列或组的边界。

Returns:

一个可空的输出列，包含滚动窗口的结果

std::unique_ptr<column> grouped_rolling_window(table_view const &group_keys, column_view const &input, column_view const &default_outputs, window_bounds preceding_window, window_bounds following_window, size_type min_periods, rolling_aggregation const &aggr, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

对列中的值应用一个分组感知的、固定大小的滚动窗口函数。

注意：此方法要求行已按group_key值预先排序。

Example: Consider a user-sales dataset, where the rows look as follows:
{ "user_id", sales_amt, day }

The `grouped_rolling_window()` method enables windowing queries such as grouping a dataset by
`user_id`, and summing up the `sales_amt` column over a window of 3 rows (2 preceding (including
current row), 1 row following).

In this example,
   1. `group_keys == [ user_id ]`
   2. `input == sales_amt`
The data are grouped by `user_id`, and ordered by `day`-string. The aggregation
(SUM) is then calculated for a window of 3 values around (and including) each row.

For the following input:

 [ // user,  sales_amt
   { "user1",   10      },
   { "user2",   20      },
   { "user1",   20      },
   { "user1",   10      },
   { "user2",   30      },
   { "user2",   80      },
   { "user1",   50      },
   { "user1",   60      },
   { "user2",   40      }
 ]

Partitioning (grouping) by `user_id` yields the following `sales_amt` vector
(with 2 groups, one for each distinct `user_id`):

   [ 10,  20,  10,  50,  60,  20,  30,  80,  40 ]
     <-------user1-------->|<------user2------->

The SUM aggregation is applied with 1 preceding and 1 following
row, with a minimum of 1 period. The aggregation window is thus 3 rows wide,
yielding the following column:

   [ 30, 40,  80, 120, 110,  50, 130, 150, 120 ]

Note: The SUMs calculated at the group boundaries (i.e. indices 0, 4, 5, and 8)
consider only 2 values each, in spite of the window-size being 3.
Each aggregation operation cannot cross group boundaries.

[ 10, 20, 10, 50, 60, 20, 30, 80, 40 ] <—窗口–> ^ | 当前行

同样地，preceding 可能有一个负值，表示窗口从当前行之后的位置开始。它与 following 的语义略有不同，因为 preceding 包括当前行。因此：

preceding=1 => 窗口从当前行开始。
preceding=0 => 窗口从当前行的下一行开始。
preceding=-1 => 窗口从当前行的前2行开始。等等。

Parameters:

group_keys – [in] 分组列（已预排序）
input – [in] 输入列（待聚合）
preceding_window – [in] 静态滚动窗口的大小，向后方向（对于正值）或向前方向（对于负值）
following_window – [in] 正向（对于正值）或反向（对于负值）的静态滚动窗口大小
min_periods – [in] 窗口中需要的最小观测值数量，否则元素 i 为 null。
aggr – [in] 滚动窗口聚合类型（SUM, MAX, MIN 等）
stream – [in] 用于设备内存操作和内核启动的CUDA流
mr – [in] 用于分配返回列的设备内存的设备内存资源
default_outputs – 每行的默认值列，用于返回而不是空值。用于 LEAD()/LAG()，如果行偏移超出列或组的边界。

Returns:

一个可空的输出列，包含滚动窗口的结果

std::unique_ptr<column> grouped_time_range_rolling_window(table_view const &group_keys, column_view const &timestamp_column, cudf::order const &timestamp_order, column_view const &input, size_type preceding_window_in_days, size_type following_window_in_days, size_type min_periods, rolling_aggregation const &aggr, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

对列中的值应用一个基于时间戳的滚动窗口函数，该函数考虑了分组。

与 rolling_window() 类似，此函数在指定的 input 列的每个元素周围的窗口中聚合值。它与 rolling_window() 在两个方面有所不同：

input 列的元素被分组到不同的组中（例如 groupby 的结果），由 group_keys 下的列对应的值决定。窗口聚合不能跨越组边界。
在一个组内，聚合窗口是基于时间间隔计算的（例如，当前行之前/之后的天数）。输入数据的时间戳由timestamp_column参数指定。

注意：此方法要求行按组键和时间戳值预先排序。

  Example: Consider a user-sales dataset, where the rows look as follows:
   { "user_id", sales_amt, date }
 
  This method enables windowing queries such as grouping a dataset by `user_id`, sorting by
  increasing `date`, and summing up the `sales_amt` column over a window of 3 days (1 preceding
*day, the current day, and 1 following day).
 
  In this example,
     1. `group_keys == [ user_id ]`
     2. `timestamp_column == date`
     3. `input == sales_amt`
  The data are grouped by `user_id`, and ordered by `date`. The aggregation
  (SUM) is then calculated for a window of 3 days around (and including) each row.
 
  For the following input:
 
   [ // user,  sales_amt,  YYYYMMDD (date)
     { "user1",   10,      20200101    },
     { "user2",   20,      20200101    },
     { "user1",   20,      20200102    },
     { "user1",   10,      20200103    },
     { "user2",   30,      20200101    },
     { "user2",   80,      20200102    },
     { "user1",   50,      20200107    },
     { "user1",   60,      20200107    },
     { "user2",   40,      20200104    }
   ]
 
  Partitioning (grouping) by `user_id`, and ordering by `date` yields the following `sales_amt`
  vector (with 2 groups, one for each distinct `user_id`):
 
  Date :(202001-)  [ 01,  02,  03,  07,  07,    01,   01,   02,  04 ]
  Input:           [ 10,  20,  10,  50,  60,    20,   30,   80,  40 ]
                     <-------user1-------->|<---------user2--------->
 
  The SUM aggregation is applied, with 1 day preceding, and 1 day following, with a minimum of 1
  period. The aggregation window is thus 3 *days* wide, yielding the following output column:
 
   Results:        [ 30,  40,  30,  110, 110,  130,  130,  130,  40 ]

注意：参与每个窗口的行数可能会有所不同，基于组内的索引、日期戳和min_periods。相应地：

results[0] 考虑了2个值，因为它位于其组的开头，并且没有前面的值。
results[5] 考虑了3个值，尽管它位于其组的开头。根据其日期戳，它必须包括接下来的2个值。

每个聚合操作不能跨越组边界。

Parameters:

group_keys – [in] 分组列（已预排序）
timestamp_column – [in] 每行的（预排序）时间戳
timestamp_order – [in] 时间戳排序的顺序（升序/降序）
input – [in] 输入列（待聚合）
preceding_window_in_days – [in] 向后方向的滚动窗口时间间隔
following_window_in_days – [in] 向前滚动的窗口时间间隔
min_periods – [in] 窗口中需要的最小观测值数量，否则元素 i 为 null。
aggr – [in] 滚动窗口聚合类型（SUM, MAX, MIN 等）
stream – [in] 用于设备内存操作和内核启动的CUDA流
mr – [in] 用于分配返回列的设备内存的设备内存资源

Returns:

一个可为空的输出列，包含滚动窗口的结果

std::unique_ptr<column> grouped_time_range_rolling_window(table_view const &group_keys, column_view const &timestamp_column, cudf::order const &timestamp_order, column_view const &input, window_bounds preceding_window_in_days, window_bounds following_window_in_days, size_type min_periods, rolling_aggregation const &aggr, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

对列中的值应用一个基于时间戳的滚动窗口函数，该函数具有分组感知能力。

与 rolling_window() 类似，此函数在指定的 input 列的每个元素周围的窗口中聚合值。它与 rolling_window() 在两个方面有所不同：

input 列的元素被分组到不同的组中（例如 groupby 的结果），由 group_keys 下的列对应的值决定。窗口聚合不能跨越组边界。
在一个组内，聚合窗口是基于时间间隔计算的（例如，当前行之前/之后的天数）。输入数据的时间戳由timestamp_column参数指定。

注意：此方法要求行按组键和时间戳值预先排序。

  Example: Consider a user-sales dataset, where the rows look as follows:
   { "user_id", sales_amt, date }
 
  This method enables windowing queries such as grouping a dataset by `user_id`, sorting by
  increasing `date`, and summing up the `sales_amt` column over a window of 3 days (1 preceding
*day, the current day, and 1 following day).
 
  In this example,
     1. `group_keys == [ user_id ]`
     2. `timestamp_column == date`
     3. `input == sales_amt`
  The data are grouped by `user_id`, and ordered by `date`. The aggregation
  (SUM) is then calculated for a window of 3 days around (and including) each row.
 
  For the following input:
 
   [ // user,  sales_amt,  YYYYMMDD (date)
     { "user1",   10,      20200101    },
     { "user2",   20,      20200101    },
     { "user1",   20,      20200102    },
     { "user1",   10,      20200103    },
     { "user2",   30,      20200101    },
     { "user2",   80,      20200102    },
     { "user1",   50,      20200107    },
     { "user1",   60,      20200107    },
     { "user2",   40,      20200104    }
   ]
 
  Partitioning (grouping) by `user_id`, and ordering by `date` yields the following `sales_amt`
  vector (with 2 groups, one for each distinct `user_id`):
 
  Date :(202001-)  [ 01,  02,  03,  07,  07,    01,   01,   02,  04 ]
  Input:           [ 10,  20,  10,  50,  60,    20,   30,   80,  40 ]
                     <-------user1-------->|<---------user2--------->
 
  The SUM aggregation is applied, with 1 day preceding, and 1 day following, with a minimum of 1
  period. The aggregation window is thus 3 *days* wide, yielding the following output column:
 
   Results:        [ 30,  40,  30,  110, 110,  130,  130,  130,  40 ]

注意：参与每个窗口的行数可能会有所不同，基于组内的索引、日期戳和min_periods。相应地：

results[0] 考虑了2个值，因为它位于其组的开头，并且没有前面的值。
results[5] 考虑了3个值，尽管它位于其组的开头。根据其日期戳，它必须包括接下来的2个值。

每个聚合操作不能跨越组边界。

preceding_window_in_days 和 following_window_in_days 被指定为 window_bounds，并且支持“无界”窗口，如果设置为 window_bounds::unbounded()。

Parameters:

group_keys – [in] 分组列（已预排序）
timestamp_column – [in] 每行的（预排序）时间戳
timestamp_order – [in] 时间戳排序的顺序（升序/降序）
input – [in] 输入列（待聚合）
preceding_window_in_days – [in] 向后方向的滚动窗口时间间隔
following_window_in_days – [in] 向前方向的滚动窗口时间间隔
min_periods – [in] 窗口中需要的最小观测值数量，否则元素 i 为 null。
aggr – [in] 滚动窗口聚合类型（SUM, MAX, MIN 等）
stream – [in] 用于设备内存操作和内核启动的CUDA流
mr – [in] 用于分配返回列的设备内存的设备内存资源

Returns:

一个可空的输出列，包含滚动窗口的结果

std::unique_ptr<column> grouped_range_rolling_window(table_view const &group_keys, column_view const &orderby_column, cudf::order const &order, column_view const &input, range_window_bounds const &preceding, range_window_bounds const &following, size_type min_periods, rolling_aggregation const &aggr, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

将基于分组感知和值范围的滚动窗口函数应用于列中的值。

此函数在指定input列的每个元素周围的窗口中聚合行。窗口是根据有序orderby列的值以及表示orderby列值包含范围的preceding和following标量的值来确定的。

input 列的元素被分组到不同的组中（例如 groupby 的结果），由 group_keys 下的列对应的值决定。窗口聚合不能跨越组边界。
在一个组内，所有行按orderby列排序后，索引为i的行的聚合窗口确定如下：a) 如果orderby为升序，则行i的聚合窗口包括所有索引为j的input行，满足：
```
(orderby[i] - preceding) <= orderby[j] <= orderby[i] + following
```
b) 如果orderby为降序，则行i的聚合窗口包括所有索引为j的input行，满足：
```
(orderby[i] + preceding) >= orderby[j] >= orderby[i] - following
```

注意：此方法要求行按组键和排序列值预先排序。

窗口间隔被指定为适合orderby列的标量值。目前，仅支持以下orderby列类型和范围类型的组合：

如果 orderby 列是 TIMESTAMP 类型，则 preceding/following 窗口将以相同分辨率的 DURATION 标量指定。例如，对于类型为 TIMESTAMP_SECONDS 的 orderby 列，间隔只能是 DURATION_SECONDS。更高分辨率（例如 DURATION_NANOSECONDS）或更低分辨率（例如 DURATION_DAYS）的持续时间不能使用。
如果orderby列是整数类型（例如INT32），则preceding/following应该是完全相同的类型（INT32）。

Example: Consider a motor-racing statistics dataset, containing the following columns:
  1. driver_name:   (STRING) Name of the car driver
  2. num_overtakes: (INT32)  Number of times the driver overtook another car in a lap
  3. lap_number:    (INT32)  The number of the lap

The `group_range_rolling_window()` function allows one to calculate the total number of overtakes
each driver made within any 3 lap window of each entry:
  1. Group/partition the dataset by `driver_id` (This is the group_keys argument.)
  2. Sort each group by the `lap_number` (i.e. This is the orderby_column.)
  3. Calculate the SUM(num_overtakes) over a window (preceding=1, following=1)

For the following input:

 [ // driver_name,  num_overtakes,  lap_number
   {   "bottas",        1,            1        },
   {   "hamilton",      2,            1        },
   {   "bottas",        2,            2        },
   {   "bottas",        1,            3        },
   {   "hamilton",      3,            1        },
   {   "hamilton",      8,            2        },
   {   "bottas",        5,            7        },
   {   "bottas",        6,            8        },
   {   "hamilton",      4,            4        }
 ]

Partitioning (grouping) by `driver_name`, and ordering by `lap_number` yields the following
`num_overtakes` vector (with 2 groups, one for each distinct `driver_name`):

lap_number:      [ 1,  2,  3,  7,  8,   1,  1,   2,  4 ]
num_overtakes:   [ 1,  2,  1,  5,  6,   2,  3,   8,  4 ]
                   <-----bottas------>|<----hamilton--->

The SUM aggregation is applied, with 1 preceding, and 1 following, with a minimum of 1
period. The aggregation window is thus 3 (laps) wide, yielding the following output column:

 Results:        [ 3,  4,  3,  11, 11,  13, 13,  13,  4 ]

注意：参与每个窗口的行数可能会有所不同，基于组内的索引、日期戳和min_periods。相应地：

results[0] 考虑了2个值，因为它位于其组的开头，并且没有前面的值。
results[5] 考虑了3个值，尽管它位于其组的开头。根据其 orderby_column 值，它必须包括接下来的2个值。

每个聚合操作不能跨越组边界。

返回列的类型取决于输入列类型 T 和聚合方式：

COUNT 返回 INT32 列
MIN/MAX 返回 T 列
SUM 返回 T 的提升类型。在 INT32 上的求和会产生 INT64。
MEAN 返回 FLOAT64 列
COLLECT 返回类型为 LIST 的列。

LEAD/LAG/ROW_NUMBER 对于范围查询是未定义的。

Parameters:

group_keys – [in] 分组列（已预排序）
orderby_column – [in] 用于范围比较的（预排序的）排序列
order – [in] 排序的顺序（升序/降序），用于指定排序列的排序方式
input – [in] 输入列（待聚合）
preceding – [in] 向后方向的间隔值
following – [in] 向前方向的间隔值
min_periods – [in] 窗口中需要的最小观测值数量，否则元素 i 为 null。
aggr – [in] 滚动窗口聚合类型（SUM, MAX, MIN 等）
stream – [in] 用于设备内存操作和内核启动的CUDA流
mr – [in] 用于分配返回列的设备内存的设备内存资源

Returns:

一个可为空的输出列，包含滚动窗口的结果

std::unique_ptr<column> rolling_window(column_view const &input, column_view const &preceding_window, column_view const &following_window, size_type min_periods, rolling_aggregation const &agg, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

对列中的值应用可变大小的滚动窗口函数。

此函数在输入列的每个元素i周围的窗口中聚合值，如果没有足够的观察值，则使元素i的位掩码无效。窗口大小是动态的（每个元素不同）。这与Pandas的DataFrame.rolling API相匹配，但有一些显著的区别：

它使用一个两部分窗口而不是中心标志，以允许更灵活的窗口。总窗口大小 = preceding_window + following_window。元素 i 使用元素 [i-preceding_window+1, i+following_window] 进行窗口计算。
对于不满足最小观察次数的输出行，此函数不会存储NA/NaN，而是更新列的有效位掩码以指示哪些元素是有效的。
支持动态滚动窗口，即可以使用额外的数组为每个元素指定窗口大小。

计数聚合返回的列始终为INT32类型。所有其他操作符返回的列与输入的类型相同。因此，建议在进行滚动MEAN之前，将整数列类型（尤其是低精度整数）转换为FLOAT32或FLOAT64。

Throws:

cudf::logic_error – 如果窗口列类型不是INT32

Parameters:

input – [in] 输入列
preceding_window – [in] 一个非空的INT32类型列，表示向前方向的窗口大小。preceding_window[i] 指定了元素 i 的前置窗口大小。
following_window – [in] 一个不可为空的 INT32 窗口大小列，表示向后方向。following_window[i] 指定元素 i 的后续窗口大小。
min_periods – [in] 窗口中需要的最小观测值数量，否则元素 i 为 null。
agg – [in] 滚动窗口聚合类型（求和、最大值、最小值等）
stream – [in] 用于设备内存操作和内核启动的CUDA流
mr – [in] 用于分配返回列的设备内存的设备内存资源

Returns:

一个可为空的输出列，包含滚动窗口的结果

struct range_window_bounds#

#include <range_window_bounds.hpp>

窗口边界大小的抽象，用于与 grouped_range_rolling_window() 一起使用。

类似于window_bounds在grouped_rolling_window()中，range_window_bounds表示用于grouped_range_rolling_window()的窗口边界。窗口可以指定为以下之一：

一个固定宽度的数值标量值。例如：a) 一个DURATION_DAYS标量，用于TIMESTAMP_DAYS排序列 b) 一个INT32标量，用于INT32排序列
“unbounded”，表示边界延伸到组中的第一行/最后一行。
“当前行”，表示边界结束于组中与当前行值匹配的第一行/最后一行。

公共类型

enum class extent_type : int32_t#

range_window_bounds 的类型。

值：

enumerator CURRENT_ROW#

enumerator BOUNDED#: 边界定义为与当前行匹配的第一行/最后一行。

enumerator UNBOUNDED#

边界延伸到整个组中的第一行/最后一行。

边界定义为从当前行开始，落在指定范围内的第一行/最后一行。

公共函数

inline bool is_current_row() const#

窗口是否绑定到当前行。

Returns:: true 如果窗口绑定到当前行
Returns:: false 如果窗口未绑定到当前行

inline bool is_unbounded() const#

窗口是否无界。

Returns:: 如果窗口是无界的，则为 true
Returns:: 如果窗口有有限边界，则为 false

inline scalar const &range_scalar() const#

返回边界的底层标量值。

Returns:: 边界的底层标量值

range_window_bounds(range_window_bounds const&) = default#: 复制构造函数。

公共静态函数

static range_window_bounds get(scalar const &boundary, rmm::cuda_stream_view stream = cudf::get_default_stream())#

工厂方法用于构造一个有界的窗口边界。

Parameters:

boundary – 有限窗口边界
stream – 用于设备内存操作和内核启动的CUDA流

Returns:

一个有界的窗口边界对象

static range_window_bounds current_row(data_type type, rmm::cuda_stream_view stream = cudf::get_default_stream())#

工厂方法用于构造一个仅限于当前行值的窗口边界。

Parameters:

type – 窗口边界的数据类型
stream – 用于设备内存操作和内核启动的CUDA流

Returns:

一个“当前行”窗口边界对象

static range_window_bounds unbounded(data_type type, rmm::cuda_stream_view stream = cudf::get_default_stream())#

工厂方法用于构造一个无界的窗口边界。

Parameters:

type – 窗口边界的数据类型
stream – 用于设备内存操作和内核启动的CUDA流

Returns:

一个无界的窗口边界对象

struct window_bounds#

#include <rolling.hpp>

窗口边界大小的抽象。

公共函数

inline bool is_unbounded() const#

是否window_bounds是无界的。

Returns:: 如果窗口边界是无界的，则为true。
Returns:: 如果窗口边界具有有限的行边界，则为false。

inline size_type value() const#

获取此window_bounds的行边界。

Returns:: 行边界值（以天或行计）

公共静态函数

static inline window_bounds get(size_type value)#

构建有界窗口边界。

Parameters:: value – 有限的窗口边界（以天或行计）
Returns:: 一个窗口边界

static inline window_bounds unbounded()#

构造无界窗口边界。

Returns:: window_bounds