GROUP BY Modifiers
Starting from v7.4.0, the GROUP BY clause of TiDB supports the WITH ROLLUP modifier.
In the GROUP BY clause, you can specify one or more columns as a group list and append the WITH ROLLUP modifier after the list. Then, TiDB will conduct multidimensional descending grouping based on the columns in the group list and provide you with summary results for each group in the output.
Grouping method:
- The first grouping dimension includes all columns in the group list.
- Subsequent grouping dimensions start from the right end of the group list and exclude one more column at a time to form new groups.
Aggregation summaries: for each dimension, the query performs aggregation operations, and then aggregates the results of this dimension with the results of all previous dimensions. This means that you can get aggregated data at different dimensions, from detailed to overall.
With this grouping method, if there are N columns in the group list, TiDB aggregates the query results on N+1 groups.
For example:
SELECT count(1) FROM t GROUP BY a,b,c WITH ROLLUP;
In this example, TiDB will aggregate the calculation results of count(1) on 4 groups (that is, {a, b, c}, {a, b}, {a}, and {}) and output the summary results for each group.
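The grouping method described above can be sketched in a few lines of Python. This is only an illustration of how the N+1 groups are derived from a group list, not TiDB's implementation; the function name rollup_grouping_sets is hypothetical.

```python
def rollup_grouping_sets(columns):
    """Return the grouping sets that WITH ROLLUP implies for a group list.

    The first set keeps every column; each later set drops one more
    column from the right end, ending with the empty (grand total) set.
    """
    return [tuple(columns[:keep]) for keep in range(len(columns), -1, -1)]


# GROUP BY a, b, c WITH ROLLUP: 3 columns give 4 grouping sets.
sets = rollup_grouping_sets(["a", "b", "c"])
for grouping_set in sets:
    print("{" + ", ".join(grouping_set) + "}")
```

For the group list a, b, c, the loop prints the four groups in the same order as listed above, from the most detailed group to the empty group.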
Use cases
Aggregating and summarizing data from multiple columns is commonly used in OLAP (Online Analytical Processing) scenarios. By using the WITH ROLLUP modifier, you can get additional rows that display super summary information from other high-level dimensions in your aggregated results. Then, you can use the super summary information for advanced data analysis and report generation.
Prerequisites
Currently, TiDB supports generating valid execution plans for the WITH ROLLUP syntax only in TiFlash MPP mode. Therefore, make sure that your TiDB cluster has been deployed with TiFlash nodes and that target fact tables are configured with TiFlash replicas properly.
For more information, see Scale out a TiFlash cluster.
Examples
Suppose you have a profit table named bank with the year, month, day, and profit columns.
CREATE TABLE bank
(
year INT,
month VARCHAR(32),
day INT,
profit DECIMAL(13, 7)
);
ALTER TABLE bank SET TIFLASH REPLICA 1; -- Add a TiFlash replica for the table
INSERT INTO bank VALUES(2000, "Jan", 1, 10.3),(2001, "Feb", 2, 22.4),(2000, "Mar", 3, 31.6);
To get the profit for the bank per year, you can use a simple GROUP BY clause as follows:
SELECT year, SUM(profit) AS profit FROM bank GROUP BY year;
+------+--------------------+
| year | profit |
+------+--------------------+
| 2001 | 22.399999618530273 |
| 2000 | 41.90000057220459 |
+------+--------------------+
2 rows in set (0.15 sec)
In addition to yearly profits, bank reports usually also need to include the overall profit for all years or monthly divided profits for detailed profit analysis. Before v7.4.0, you had to use different GROUP BY clauses in multiple queries and join the results using UNION to obtain aggregated summaries. Starting from v7.4.0, you can achieve the desired results in a single query by appending the WITH ROLLUP modifier to the GROUP BY clause.
SELECT year, month, SUM(profit) AS profit FROM bank GROUP BY year, month WITH ROLLUP ORDER BY year DESC, month DESC;
+------+-------+--------------------+
| year | month | profit |
+------+-------+--------------------+
| 2001 | Feb | 22.399999618530273 |
| 2001 | NULL | 22.399999618530273 |
| 2000 | Mar | 31.600000381469727 |
| 2000 | Jan | 10.300000190734863 |
| 2000 | NULL | 41.90000057220459 |
| NULL | NULL | 64.30000019073486 |
+------+-------+--------------------+
6 rows in set (0.025 sec)
The preceding results include aggregated data at different dimensions: by both year and month, by year, and overall. In the results, a row without NULL values indicates that the profit in that row is calculated by grouping both year and month. A row with a NULL value in the month column indicates that profit in that row is calculated by aggregating all months in a year, while a row with a NULL value in the year column indicates that profit in that row is calculated by aggregating all years.
Specifically:
- The profit value in the first row comes from the 2-dimensional group {year, month}, representing the aggregation result for the fine-grained {2001, "Feb"} group.
- The profit value in the second row comes from the 1-dimensional group {year}, representing the aggregation result for the mid-level {2001} group.
- The profit value in the last row comes from the 0-dimensional group {}, representing the overall aggregation result.
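To make the three dimensions concrete, the following Python sketch emulates the ROLLUP aggregation over the same three rows as the INSERT statement above. It is an illustration of the semantics only, not of how TiDB executes the query; the names BANK and rollup_sum are hypothetical, None stands in for NULL, and Decimal is used so that the totals come out exact.

```python
from collections import defaultdict
from decimal import Decimal

# Same three rows as the INSERT statement above: (year, month, day, profit).
BANK = [
    (2000, "Jan", 1, Decimal("10.3")),
    (2001, "Feb", 2, Decimal("22.4")),
    (2000, "Mar", 3, Decimal("31.6")),
]


def rollup_sum(rows):
    """Emulate GROUP BY year, month WITH ROLLUP for SUM(profit).

    Returns a dict keyed by (year, month); None marks a rolled-up column,
    playing the role of the NULL values in the SQL output.
    """
    totals = defaultdict(Decimal)
    for year, month, _day, profit in rows:
        totals[(year, month)] += profit  # 2-dimensional group {year, month}
        totals[(year, None)] += profit   # 1-dimensional group {year}
        totals[(None, None)] += profit   # 0-dimensional group {}
    return dict(totals)


result = rollup_sum(BANK)
for key, profit in result.items():
    print(key, profit)
```

Each input row contributes to exactly one key per dimension, which is why the three input rows yield six result rows: three detailed rows, two yearly subtotals, and one grand total.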
NULL values in the WITH ROLLUP results are generated just before the Aggregate operator is applied. Therefore, you can use NULL values in SELECT, HAVING, and ORDER BY clauses to further filter the aggregated results.
For example, you can use NULL in the HAVING clause to filter and view the aggregated results of 2-dimensional groups only:
SELECT year, month, SUM(profit) AS profit FROM bank GROUP BY year, month WITH ROLLUP HAVING year IS NOT NULL AND month IS NOT NULL;
+------+-------+--------------------+
| year | month | profit |
+------+-------+--------------------+
| 2000 | Mar | 31.600000381469727 |
| 2000 | Jan | 10.300000190734863 |
| 2001 | Feb | 22.399999618530273 |
+------+-------+--------------------+
3 rows in set (0.02 sec)
Note that if a column in the GROUP BY list contains native NULL values, the aggregation results of WITH ROLLUP might be misleading. To address this issue, you can use the GROUPING() function to distinguish native NULL values from NULL values generated by WITH ROLLUP. This function takes a grouping expression as a parameter and returns 0 or 1 to indicate whether the grouping expression is aggregated in the current result. 1 represents aggregated, and 0 represents not aggregated.
The following example shows how to use the GROUPING() function:
SELECT year, month, SUM(profit) AS profit, GROUPING(year) AS grp_year, GROUPING(month) AS grp_month FROM bank GROUP BY year, month WITH ROLLUP ORDER BY year DESC, month DESC;
+------+-------+--------------------+----------+-----------+
| year | month | profit | grp_year | grp_month |
+------+-------+--------------------+----------+-----------+
| 2001 | Feb | 22.399999618530273 | 0 | 0 |
| 2001 | NULL | 22.399999618530273 | 0 | 1 |
| 2000 | Mar | 31.600000381469727 | 0 | 0 |
| 2000 | Jan | 10.300000190734863 | 0 | 0 |
| 2000 | NULL | 41.90000057220459 | 0 | 1 |
| NULL | NULL | 64.30000019073486 | 1 | 1 |
+------+-------+--------------------+----------+-----------+
6 rows in set (0.028 sec)
From this output, you can determine the aggregation dimension of each row directly from the values of grp_year and grp_month, which prevents interference from native NULL values in the year and month grouping expressions.
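The bank table above contains no native NULL values, so the ambiguity does not show up in its output. The following Python sketch uses hypothetical data in which one row has a native NULL (None) in month, to show why the GROUPING() flags are needed. The function rollup_with_grouping is an illustrative emulation of the semantics, not TiDB code.

```python
def rollup_with_grouping(rows):
    """Emulate GROUP BY year, month WITH ROLLUP with SUM(profit) and GROUPING() flags.

    Each result is (year, month, profit, grp_year, grp_month).
    """
    totals = {}
    for year, month, profit in rows:
        # (year, month, grp_year, grp_month) for the three ROLLUP levels.
        for key in ((year, month, 0, 0), (year, None, 0, 1), (None, None, 1, 1)):
            totals[key] = totals.get(key, 0) + profit
    return [(y, m, total, gy, gm) for (y, m, gy, gm), total in totals.items()]


# Hypothetical data: the second row has a native NULL (None) in month.
rows = [(2000, "Jan", 10), (2000, None, 5)]
result = rollup_with_grouping(rows)
for row in result:
    print(row)
```

Two of the result rows show year 2000 with a NULL month: one is the detailed group for the native NULL (profit 5, grp_month 0) and the other is the yearly subtotal (profit 15, grp_month 1). Without the flag columns, the two rows could not be told apart.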
The GROUPING() function can accept up to 64 grouping expressions as parameters. In the output of multiple parameters, each parameter generates a result of 0 or 1, and these results collectively form a 64-bit UNSIGNED LONGLONG with each bit as 0 or 1. The following formula gives the bit position of each parameter:
GROUPING(day, month, year):
result for GROUPING(year)
+ result for GROUPING(month) << 1
+ result for GROUPING(day) << 2
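The formula can be checked with a short Python sketch. It reads each shift as applying before the additions, so the rightmost argument occupies the lowest bit. The helper name grouping is hypothetical and only mirrors the arithmetic; it does not call TiDB.

```python
def grouping(*flags):
    """Combine per-expression GROUPING() results into one integer.

    flags are the 0/1 results in argument order; the rightmost argument
    occupies the lowest bit, matching the formula above.
    """
    if len(flags) > 64:
        raise ValueError("at most 64 grouping expressions are supported")
    value = 0
    for flag in flags:
        value = (value << 1) | flag
    return value


# GROUPING(day, month, year) for a row where day and month are rolled up
# but year is not: 0 + (1 << 1) + (1 << 2).
print(grouping(1, 1, 0))  # 6

# GROUPING(year, month) for the three ROLLUP levels of GROUP BY year, month.
print(grouping(0, 0), grouping(0, 1), grouping(1, 1))  # 0 1 3
```

Only the fully detailed level yields 0, so a condition such as GROUPING(year, month) <> 0 keeps the subtotal and grand total rows.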
By using multiple parameters in the GROUPING() function, you can efficiently filter aggregate results at any high dimension. For example, you can quickly filter the aggregate results for each year and all years by using GROUPING(year, month).
SELECT year, month, SUM(profit) AS profit, GROUPING(year) AS grp_year, GROUPING(month) AS grp_month FROM bank GROUP BY year, month WITH ROLLUP HAVING GROUPING(year, month) <> 0 ORDER BY year DESC, month DESC;
+------+-------+--------------------+----------+-----------+
| year | month | profit | grp_year | grp_month |
+------+-------+--------------------+----------+-----------+
| 2001 | NULL | 22.399999618530273 | 0 | 1 |
| 2000 | NULL | 41.90000057220459 | 0 | 1 |
| NULL | NULL | 64.30000019073486 | 1 | 1 |
+------+-------+--------------------+----------+-----------+
3 rows in set (0.023 sec)
How to interpret the ROLLUP execution plan
To meet the requirements of multidimensional grouping, multidimensional data aggregation uses the Expand operator to replicate data. Each replica corresponds to a group at a specific dimension. With the data shuffling capability of MPP, the Expand operator can rapidly reorganize and calculate a large volume of data between multiple TiFlash nodes, fully utilizing the computational power of each node.
The implementation of the Expand operator is similar to that of the Projection operator. The difference is that Expand is a multi-level Projection, which contains multiple levels of projection operation expressions. For each row of the raw data, the Projection operator generates only one row in results, whereas the Expand operator generates multiple rows in results (the number of rows is equal to the number of levels in projection operation expressions).
The following is an example of an execution plan:
EXPLAIN SELECT year, month, GROUPING(year), GROUPING(month), SUM(profit) AS profit FROM bank GROUP BY year, month WITH ROLLUP;
+----------------------------------------+---------+--------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+----------------------------------------+---------+--------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| TableReader_44 | 2.40 | root | | MppVersion: 2, data:ExchangeSender_43 |
| └─ExchangeSender_43 | 2.40 | mpp[tiflash] | | ExchangeType: PassThrough |
| └─Projection_8 | 2.40 | mpp[tiflash] | | Column#6->Column#12, Column#7->Column#13, grouping(gid)->Column#14, grouping(gid)->Column#15, Column#9->Column#16 |
| └─Projection_38 | 2.40 | mpp[tiflash] | | Column#9, Column#6, Column#7, gid |
| └─HashAgg_36 | 2.40 | mpp[tiflash] | | group by:Column#6, Column#7, gid, funcs:sum(test.bank.profit)->Column#9, funcs:firstrow(Column#6)->Column#6, funcs:firstrow(Column#7)->Column#7, funcs:firstrow(gid)->gid, stream_count: 8 |
| └─ExchangeReceiver_22 | 3.00 | mpp[tiflash] | | stream_count: 8 |
| └─ExchangeSender_21 | 3.00 | mpp[tiflash] | | ExchangeType: HashPartition, Compression: FAST, Hash Cols: [name: Column#6, collate: binary], [name: Column#7, collate: utf8mb4_bin], [name: gid, collate: binary], stream_count: 8 |
| └─Expand_20 | 3.00 | mpp[tiflash] | | level-projection:[test.bank.profit, <nil>->Column#6, <nil>->Column#7, 0->gid],[test.bank.profit, Column#6, <nil>->Column#7, 1->gid],[test.bank.profit, Column#6, Column#7, 3->gid]; schema: [test.bank.profit,Column#6,Column#7,gid] |
| └─Projection_16 | 3.00 | mpp[tiflash] | | test.bank.profit, test.bank.year->Column#6, test.bank.month->Column#7 |
| └─TableFullScan_17 | 3.00 | mpp[tiflash] | table:bank | keep order:false, stats:pseudo |
+----------------------------------------+---------+--------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
10 rows in set (0.05 sec)
In this example execution plan, you can view the multiple-level expression of the Expand operator in the operator info column of the Expand_20 row. It consists of 2-dimensional expressions (one list of projection expressions per grouping level), and you can view the schema information of the Expand operator at the end of the row, which is schema: [test.bank.profit, Column#6, Column#7, gid].
In the schema information of the Expand operator, GID is generated as an additional column. Its value is calculated by the Expand operator based on the grouping logic of different dimensions, and the value reflects the relationship between the current data replica and the grouping set. In most cases, the Expand operator uses a Bit-And operation, which can represent 63 combinations of grouping items for ROLLUP, corresponding to 64 dimensions of grouping. In this mode, TiDB generates the GID value depending on whether the grouping set of the required dimension contains grouping expressions when the current data replica is replicated, and it fills a 64-bit UINT64 value in the order of columns to be grouped.
In the preceding example, the order of columns in the grouping list is [year, month], and the dimension groups generated by the ROLLUP syntax are {year, month}, {year}, and {}. For the dimension group {year, month}, both year and month are required columns, so TiDB fills the bit positions for them with 1 and 1 correspondingly. This forms a UINT64 of 11...0 (written lowest bit first, with all remaining bits 0), which is 3 in decimal. Therefore, the projection expression is [test.bank.profit, Column#6, Column#7, 3->gid] (where Column#6 corresponds to year, and Column#7 corresponds to month). By the same rule, the plan assigns gid 1 to the dimension group {year} and gid 0 to the dimension group {}.
The following is an example row of the raw data:
+------+-------+------+------------+
| year | month | day | profit |
+------+-------+------+------------+
| 2000 | Jan | 1 | 10.3000000 |
+------+-------+------+------------+
After the Expand operator is applied, you can get the following three rows of results:
+------------+------+-------+-----+
| profit | year | month | gid |
+------------+------+-------+-----+
| 10.3000000 | 2000 | Jan | 3 |
| 10.3000000 | 2000 | NULL | 1 |
| 10.3000000 | NULL | NULL | 0 |
+------------+------+-------+-----+
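The replication step can be imitated with a small Python sketch. It assumes the bit layout seen in the example plan (the i-th column of the group list maps to bit i of gid), and the names GROUP_LIST, gid_for, and expand are hypothetical. It shows the idea of a multi-level projection, not the TiFlash implementation.

```python
GROUP_LIST = ("year", "month")


def gid_for(kept_columns):
    """Set bit i when the i-th column of the group list is kept (assumed layout)."""
    return sum(1 << i for i, col in enumerate(GROUP_LIST) if col in kept_columns)


def expand(row):
    """Replicate one input row once per ROLLUP level, nulling dropped columns."""
    replicas = []
    for keep in range(len(GROUP_LIST), -1, -1):
        kept = GROUP_LIST[:keep]
        replica = {"profit": row["profit"]}
        for col in GROUP_LIST:
            replica[col] = row[col] if col in kept else None
        replica["gid"] = gid_for(kept)
        replicas.append(replica)
    return replicas


raw = {"year": 2000, "month": "Jan", "day": 1, "profit": "10.3000000"}
replicas = expand(raw)
for replica in replicas:
    print(replica)
```

The single raw row becomes three replicas with gid values 3, 1, and 0, which the downstream HashAgg can then group by year, month, and gid in one pass.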
Note that the SELECT clause in the query uses the GROUPING function. When the GROUPING function is used in the SELECT, HAVING, or ORDER BY clauses, TiDB rewrites it during the logical optimization phase, transforms the relationship between the GROUPING function and the GROUP BY items into a GID related to the logic of dimension group (also known as grouping set), and fills this GID as metadata into the new GROUPING function.
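As a rough illustration of how a rewritten GROUPING function could be evaluated from the gid column alone, the following Python sketch tests one bit of gid per column. The bit layout (year as bit 0, month as bit 1) is inferred from the example plan, and the names COLUMN_BIT and grouping_from_gid are hypothetical; the actual metadata format inside TiDB may differ.

```python
# Assumed bit layout, taken from the example plan: year -> bit 0, month -> bit 1.
COLUMN_BIT = {"year": 1 << 0, "month": 1 << 1}


def grouping_from_gid(gid, column):
    """Return 1 if `column` is rolled up in the replica tagged with `gid`."""
    return 0 if gid & COLUMN_BIT[column] else 1


# One line per gid produced by the Expand operator: gid, grp_year, grp_month.
for gid in (3, 1, 0):
    print(gid, grouping_from_gid(gid, "year"), grouping_from_gid(gid, "month"))
```

Under these assumptions, gid 3 maps to flags (0, 0), gid 1 to (0, 1), and gid 0 to (1, 1), which agrees with the grp_year and grp_month columns in the earlier GROUPING() example.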