Decorrelation of Correlated Subquery
Subquery related optimizations 描述了当没有相关列时,TiDB 如何处理子查询。由于相关子查询的去相关化较为复杂,本文介绍一些简单场景以及优化规则适用的范围。
引言
以 select * from t1 where t1.a < (select sum(t2.a) from t2 where t2.b = t1.b)
为例。这里的子查询 t1.a < (select sum(t2.a) from t2 where t2.b = t1.b)
涉及到查询条件中的相关列 t2.b=t1.b
,这个条件恰好是等价条件,因此可以将查询重写为 select t1.* from t1, (select b, sum(a) sum_a from t2 group by b) t2 where t1.b = t2.b and t1.a < t2.sum_a;
。通过这种方式,相关子查询被重写为 JOIN
。
TiDB 需要进行此类重写的原因在于,相关子查询每次执行时都必须绑定到外部查询的结果。在上述示例中,如果 t1.a
有 1000 万个值,这个子查询会重复执行 1000 万次,因为条件 t2.b=t1.b
会随着 t1.a
的值变化。当某种程度上解除相关性后,这个子查询只需执行一次。
限制条件
这种重写的缺点在于,当没有解除相关性时,优化器可以利用相关列上的索引。也就是说,虽然这个子查询可能会重复多次,但每次都可以利用索引过滤数据。使用重写规则后,相关列的位置通常会发生变化。虽然子查询只执行一次,但单次执行的时间可能会比未去相关化时更长。
因此,当外部值较少时,不建议进行去相关化,这样可能会带来更好的执行性能。在这种情况下,你可以通过使用 NO_DECORRELATE
优化器提示,或者在 优化规则和表达式下推的黑名单 中禁用“子查询去相关化”优化规则。大多数情况下,建议结合 SQL Plan Management 使用优化器提示,以禁用去相关化。
示例
create table t1(a int, b int);
create table t2(a int, b int, index idx(b));
explain select * from t1 where t1.a < (select sum(t2.a) from t2 where t2.b = t1.b);
+----------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+----------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------+
| HashJoin_11 | 9990.00 | root | | inner join, equal:[eq(test.t1.b, test.t2.b)], other cond:lt(cast(test.t1.a), Column#7) |
| ├─HashAgg_23(Build) | 7992.00 | root | | group by:test.t2.b, funcs:sum(Column#8)->Column#7, funcs:firstrow(test.t2.b)->test.t2.b |
| │ └─TableReader_24 | 7992.00 | root | | data:HashAgg_16 |
| │ └─HashAgg_16 | 7992.00 | cop[tikv] | | group by:test.t2.b, funcs:sum(test.t2.a)->Column#8 |
| │ └─Selection_22 | 9990.00 | cop[tikv] | | not(isnull(test.t2.b)) |
| │ └─TableFullScan_21 | 10000.00 | cop[tikv] | table:t2 | keep order:false, stats:pseudo |
| └─TableReader_15(Probe) | 9990.00 | root | | data:Selection_14 |
| └─Selection_14 | 9990.00 | cop[tikv] | | not(isnull(test.t1.b)) |
| └─TableFullScan_13 | 10000.00 | cop[tikv] | table:t1 | keep order:false, stats:pseudo |
+----------------------------------+----------+-----------+---------------+-----------------------------------------------------------------------------------------+
以上是优化生效的示例。HashJoin_11
是一个普通的 inner join
。
然后,你可以使用 NO_DECORRELATE
优化器提示,告诉优化器不要对子查询进行去相关化:
explain select * from t1 where t1.a < (select /*+ NO_DECORRELATE() */ sum(t2.a) from t2 where t2.b = t1.b);
+------------------------------------------+-----------+-----------+------------------------+--------------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+------------------------------------------+-----------+-----------+------------------------+--------------------------------------------------------------------------------------+
| Projection_10 | 10000.00 | root | | test.t1.a, test.t1.b |
| └─Apply_12 | 10000.00 | root | | CARTESIAN inner join, other cond:lt(cast(test.t1.a, decimal(10,0) BINARY), Column#7) |
| ├─TableReader_14(Build) | 10000.00 | root | | data:TableFullScan_13 |
| │ └─TableFullScan_13 | 10000.00 | cop[tikv] | table:t1 | keep order:false, stats:pseudo |
| └─MaxOneRow_15(Probe) | 10000.00 | root | | |
| └─StreamAgg_20 | 10000.00 | root | | funcs:sum(Column#14)->Column#7 |
| └─Projection_45 | 100000.00 | root | | cast(test.t2.a, decimal(10,0) BINARY)->Column#14 |
| └─IndexLookUp_44 | 100000.00 | root | | |
| ├─IndexRangeScan_42(Build) | 100000.00 | cop[tikv] | table:t2, index:idx(b) | range: decided by [eq(test.t2.b, test.t1.b)], keep order:false, stats:pseudo |
| └─TableRowIDScan_43(Probe) | 100000.00 | cop[tikv] | table:t2 | keep order:false, stats:pseudo |
+------------------------------------------+-----------+-----------+------------------------+--------------------------------------------------------------------------------------+
禁用子查询去相关化规则后,也可以达到相同效果:
insert into mysql.opt_rule_blacklist values("decorrelate");
admin reload opt_rule_blacklist;
explain select * from t1 where t1.a < (select sum(t2.a) from t2 where t2.b = t1.b);
+------------------------------------------+-----------+-----------+------------------------+--------------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+------------------------------------------+-----------+-----------+------------------------+--------------------------------------------------------------------------------------+
| Projection_10 | 10000.00 | root | | test.t1.a, test.t1.b |
| └─Apply_12 | 10000.00 | root | | CARTESIAN inner join, other cond:lt(cast(test.t1.a, decimal(10,0) BINARY), Column#7) |
| ├─TableReader_14(Build) | 10000.00 | root | | data:TableFullScan_13 |
| │ └─TableFullScan_13 | 10000.00 | cop[tikv] | table:t1 | keep order:false, stats:pseudo |
| └─MaxOneRow_15(Probe) | 10000.00 | root | | |
| └─StreamAgg_20 | 10000.00 | root | | funcs:sum(Column#14)->Column#7 |
| └─Projection_45 | 100000.00 | root | | cast(test.t2.a, decimal(10,0) BINARY)->Column#14 |
| └─IndexLookUp_44 | 100000.00 | root | | |
| ├─IndexRangeScan_42(Build) | 100000.00 | cop[tikv] | table:t2, index:idx(b) | range: decided by [eq(test.t2.b, test.t1.b)], keep order:false, stats:pseudo |
| └─TableRowIDScan_43(Probe) | 100000.00 | cop[tikv] | table:t2 | keep order:false, stats:pseudo |
+------------------------------------------+-----------+-----------+------------------------+--------------------------------------------------------------------------------------+
禁用子查询去相关化规则后,你可以在 operator info
中看到 range: decided by [eq(test.t2.b, test.t1.b)]
,这意味着没有进行相关子查询的去相关化,TiDB 使用了索引范围查询。