分组依据子句
描述
GROUP BY
子句用于根据一组指定的分组表达式对行进行分组,并基于一个或多个指定的聚合函数对该分组的行计算聚合。Spark 还支持高级聚合,通过
GROUPING SETS
、
CUBE
、
ROLLUP
子句对同一输入记录集进行多个聚合。分组表达式和高级聚合可以在
GROUP BY
子句中混合使用,并可以嵌套在
GROUPING SETS
子句中。有关更多细节,请参见
Mixed/Nested Grouping Analytics
部分。当一个 FILTER 子句附加到聚合函数时,仅匹配的行会传递给该函数。
语法
GROUP BY group_expression [ , group_expression [ , ... ] ] [ WITH { ROLLUP | CUBE } ]
GROUP BY { group_expression | { ROLLUP | CUBE | GROUPING SETS } (grouping_set [ , ...]) } [ , ... ]
聚合函数被定义为
aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_expression ) ]
参数
-
group_expression
指定用于将行分组的标准。行的分组是根据分组表达式的结果值执行的。分组表达式可以是列名,例如
GROUP BY a
,列的位置,如GROUP BY 0
,或者是像GROUP BY a + b
的表达式。 -
grouping_set
分组集由零个或多个用逗号分隔的表达式括在括号中指定。当分组集只有一个元素时,可以省略括号。例如,
GROUPING SETS ((a), (b))
与GROUPING SETS (a, b)
是相同的。语法:
{ ( [ expression [ , ... ] ] ) | expression }
-
GROUPING SETS
为每个在 GROUPING SETS 后指定的分组集分组行。例如,
GROUP BY GROUPING SETS ((warehouse), (product))
在语义上等同于GROUP BY warehouse
和GROUP BY product
结果的并集。该子句是UNION ALL
的简写,其中每个UNION ALL
操作符执行对在GROUPING SETS
子句中指定的每个分组集的聚合。类似地,GROUP BY GROUPING SETS ((warehouse, product), (product), ())
在语义上等同于GROUP BY warehouse, product
、GROUP BY product
和全局聚合的结果。注意: 为了与 Hive 兼容,Spark 允许
GROUP BY ... GROUPING SETS (...)
。GROUP BY 表达式通常被忽略,但如果它包含比 GROUPING SETS 表达式更多的额外表达式,则这些额外表达式将包含在分组表达式中,值始终为 null。例如,SELECT a, b, c FROM ... GROUP BY a, b, c GROUPING SETS (a, b)
,列c
的输出总是 null。 -
ROLLUP
在单个语句中指定多个聚合级别。该子句用于根据多个分组集计算聚合。
ROLLUP
是GROUPING SETS
的简写。例如,GROUP BY warehouse, product WITH ROLLUP
或GROUP BY ROLLUP(warehouse, product)
等同于GROUP BY GROUPING SETS((warehouse, product), (warehouse), ())
。GROUP BY ROLLUP(warehouse, product, (warehouse, location))
等同于GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse), ())
。ROLLUP
规范的 N 个元素将导致 N+1 个GROUPING SETS
。 -
CUBE
CUBE
子句用于根据在GROUP BY
子句中指定的分组列的组合执行聚合。CUBE
是GROUPING SETS
的简写。例如,GROUP BY warehouse, product WITH CUBE
或GROUP BY CUBE(warehouse, product)
等同于GROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ())
。GROUP BY CUBE(warehouse, product, (warehouse, location))
等同于GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (product, warehouse, location), (warehouse), (product), (warehouse, product), ())
。CUBE
规范的 N 个元素将导致 2^N 个GROUPING SETS
。 -
Mixed/Nested Grouping Analytics
A GROUP BY 子句可以包含多个
group_expression
和多个CUBE|ROLLUP|GROUPING SETS
。GROUPING SETS
还可以具有嵌套的CUBE|ROLLUP|GROUPING SETS
子句,例如GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))
,GROUPING SETS(warehouse, GROUPING SETS(location, GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))))
。CUBE|ROLLUP
只是GROUPING SETS
的语法糖,请参阅上面的部分以了解如何将CUBE|ROLLUP
转换为GROUPING SETS
。在这种情况下,group_expression
可以视为一个单一的分组GROUPING SETS
。对于GROUP BY
子句中的多个GROUPING SETS
,我们通过对原始GROUPING SETS
进行交叉乘积生成一个单独的GROUPING SETS
。对于GROUPING SETS
子句中的嵌套GROUPING SETS
,我们只需获取其分组集并去除它。例如,GROUP BY warehouse, GROUPING SETS((product), ()), GROUPING SETS((location, size), (location), (size), ())
和GROUP BY warehouse, ROLLUP(product), CUBE(location, size)
等同于GROUP BY GROUPING SETS( (warehouse, product, location, size), (warehouse, product, location), (warehouse, product, size), (warehouse, product), (warehouse, location, size), (warehouse, location), (warehouse, size), (warehouse))
。GROUP BY GROUPING SETS(GROUPING SETS(warehouse), GROUPING SETS((warehouse, product)))
等同于GROUP BY GROUPING SETS((warehouse), (warehouse, product))
。 -
aggregate_name
指定聚合函数名称(MIN、MAX、COUNT、SUM、AVG等)。
-
DISTINCT
在输入行传递给聚合函数之前去除重复值。
-
FILTER
过滤输入行,其中
WHERE
子句中的boolean_expression
计算为 true 的行会传递到聚合函数;其他行会被丢弃。
示例
CREATE TABLE dealer (id INT, city STRING, car_model STRING, quantity INT);
INSERT INTO dealer VALUES
(100, 'Fremont', 'Honda Civic', 10),
(100, 'Fremont', 'Honda Accord', 15),
(100, , 'Honda CRV', 7),
(200, , , 20),
(200, , , 10),
(200, , , 3),
(300, , , 5),
(300, , , 8);
-- 每个经销商的数量总和。按 `id` 分组。
SELECT id, sum(quantity) FROM dealer GROUP BY id ORDER BY id;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100| 32|
|200| 33|
|300| 13|
+---+-------------+
-- 在 GROUP by 子句中使用列位置。
SELECT id, sum(quantity) FROM dealer GROUP BY 1 ORDER BY 1;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100| 32|
|200| 33|
|300| 13|
+---+-------------+
-- 多重聚合。
-- 1. 每个经销商的数量总和。
-- 2. 每个经销商的最大数量。
SELECT id, sum(quantity) AS sum, max(quantity) AS max FROM dealer GROUP BY id ORDER BY id;
+---+---+---+
| id|sum|max|
+---+---+---+
|100| 32| 15|
|200| 33| 20|
|300| 13| 8|
+---+---+---+
-- 计算每个汽车型号的不同经销商城市数量。
SELECT car_model, count(DISTINCT city) AS count FROM dealer GROUP BY car_model;
+------------+-----+
| car_model|count|
+------------+-----+
| Honda Civic| 3|
| Honda CRV| 2|
|Honda Accord| 3|
+------------+-----+
-- 仅对 'Honda Civic' 和 'Honda CRV' 的每个经销商的数量求和。
SELECT id, sum(quantity) FILTER (
WHERE car_model IN ('Honda Civic', )
) AS `sum(quantity)` FROM dealer
GROUP BY id ORDER BY id;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100| 17|
|200| 23|
|300| 5|
+---+-------------+
-- 在单个语句中使用多个分组列进行聚合。
-- 以下根据四组分组列执行聚合。
-- 1. city, car_model
-- 2. city
-- 3. car_model
-- 4. 空分组集。返回所有城市和汽车型号的数量。
SELECT city, car_model, sum(quantity) AS sum FROM dealer
GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
ORDER BY city;
+---------+------------+---+
| city| car_model|sum|
+---------+------------+---+
| null| null| 78|
| null| HondaAccord| 33|
| null| HondaCRV| 10|
| null| HondaCivic| 35|
| Dublin| null| 33|
| Dublin| HondaAccord| 10|
| Dublin| HondaCRV| 3|
| Dublin| HondaCivic| 20|
| Fremont| null| 32|
| Fremont| HondaAccord| 15|
| Fremont| HondaCRV| 7|
| Fremont| HondaCivic| 10|
| San Jose| null| 13|
| San Jose| HondaAccord| 8|
| San Jose| HondaCivic| 5|
+---------+------------+---+
-- 使用 `ROLLUP` 子句进行分组处理。
-- 等价于 GROUP BY GROUPING SETS ((city, car_model), (city), ())
SELECT city, car_model, sum(quantity) AS sum FROM dealer
GROUP BY city, car_model WITH ROLLUP
ORDER BY city, car_model;
+---------+------------+---+
| city| car_model|sum|
+---------+------------+---+
| null| null| 78|
| Dublin| null| 33|
| Dublin| HondaAccord| 10|
| Dublin| HondaCRV| 3|
| Dublin| HondaCivic| 20|
| Fremont| null| 32|
| Fremont| HondaAccord| 15|
| Fremont| HondaCRV| 7|
| Fremont| HondaCivic| 10|
| San Jose| null| 13|
| San Jose| HondaAccord| 8|
| San Jose| HondaCivic| 5|
+---------+------------+---+
-- 使用 `CUBE` 子句进行分组处理。
-- 等价于 GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
SELECT city, car_model, sum(quantity) AS sum FROM dealer
GROUP BY city, car_model WITH CUBE
ORDER BY city, car_model;
+---------+------------+---+
| city| car_model|sum|
+---------+------------+---+
| null| null| 78|
| null| HondaAccord| 33|
| null| HondaCRV| 10|
| null| HondaCivic| 35|
| Dublin| null| 33|
| Dublin| HondaAccord| 10|
| Dublin| HondaCRV| 3|
| Dublin| HondaCivic| 20|
| Fremont| null| 32|
| Fremont| HondaAccord| 15|
| Fremont| HondaCRV| 7|
| Fremont| HondaCivic| 10|
| San Jose| null| 13|
| San Jose| HondaAccord| 8|
| San Jose| HondaCivic| 5|
+---------+------------+---+
-- 准备数据以进行忽略空值的示例
CREATE TABLE person (id INT, name STRING, age INT);
INSERT INTO person VALUES
(100, , NULL),
(200, , 30),
(300, , 80),
(400, , 50);
-- 选择列 age 中的第一行
SELECT FIRST(age) FROM person;
+--------------------+
| first(age, false) |
+--------------------+
| NULL |
+--------------------+
-- 获取列 `age` 中的第一行,忽略空值,列 `id` 中的最后一行,以及列 `id` 的总和。
SELECT FIRST(age IGNORE NULLS), LAST(id), SUM(id) FROM person;
+-------------------+------------------+----------+
| first(age, true) | last(id, false) | sum(id) |
+-------------------+------------------+----------+
| 30 | 400 | 1000 |
+-------------------+------------------+----------+