Semi-join Materialization is a special kind of subquery materialization used for Semi-join subqueries. It actually includes two strategies:
半连接物化策略是半连接子查询物化的一种特殊情况,实际它包换两种策略:
- Materialization/lookup
- Materialization/scan
Consider a query that finds countries in Europe which have big cities:
思考查询欧洲拥有指定人口的城市所属国家:
select * from Country
where Country.code IN (select City.Country
from City
where City.Population > 7*1000*1000)
and Country.continent='Europe'
The subquery is uncorrelated, that is, we can run it independently of the upper query. The idea of semi-join materialization is to do just that, and fill a temporary table with possible values of the City.country field of big cities, and then do a join with countries in Europe:
其中子查询与主查询不相关,我们可以独立于主查询运行它。半连接物化思路大致如此,并会生成一张拥有 City.country
字段大城市的表,然后再与国家表中的欧洲国家记录进行连接。
The join can be done in two directions:
连接可能出现两个方向
- From the materialized table to countries in Europe
从物化表连接欧洲国家表
- From countries in Europe to the materialized table
从欧洲国家表连接物化表
The first way involves doing a full scan on the materialized table, so we call it "Materialization-scan".
第一种方法是对物化表进行完全扫描,所以我们称之为“物化扫描”。
If you run a join from Countries to the materialized table, the cheapest way to find a match in the materialized table is to make a lookup on its primary key (it has one: we used it to remove duplicates). Because of that, we call the strategy "Materialization-lookup".
如果您运行从国家表到物化表的连接,在物化表中找到匹配的最有效的方法是在其主键上进行查找(它只有一行记录:我们用它来删除重复项)。 因此,我们将策略称为 Materialization-lookup
(物化查找)。
半连接物化执行
物化扫描
If we chose to look for cities with a population greater than 7 million, the optimizer will use Materialization-Scan and EXPLAIN will show this:
如果我们查找人口超过700万的城市,优化器将使用 Materialization-Scan
,EXPLAIN
将显示:
MariaDB [world]> explain select * from Country where Country.code IN (select City.Country from City where City.Population > 7*1000*1000);
+----+--------------+-------------+--------+--------------------+------------+---------+--------------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+-------------+--------+--------------------+------------+---------+--------------------+------+-----------------------+
| 1 | PRIMARY | <subquery2> | ALL | distinct_key | NULL | NULL | NULL | 15 | |
| 1 | PRIMARY | Country | eq_ref | PRIMARY | PRIMARY | 3 | world.City.Country | 1 | |
| 2 | MATERIALIZED | City | range | Population,Country | Population | 4 | NULL | 15 | Using index condition |
+----+--------------+-------------+--------+--------------------+------------+---------+--------------------+------+-----------------------+
3 rows in set (0.01 sec)
Here, you can see:
这里,我们可以看到:
- There are still two SELECTs (look for columns with id=1 and id=2)
- 这里会有两个
SELECT
(查看id=1 and id=2
的列) - The second select (with id=2) has select_type=MATERIALIZED. This means it will be executed and its results will be stored in a temporary table with a unique key over all columns. The unique key is there to prevent the table from containing any duplicate records.
- 第二个
SELECT (id=2)
的select_type
为MATERIALIZED
。这意味着它将被执行,其结果将被存储在一个临时表中,在所有列上都有一个唯一键。唯一键用于防止表包含任何重复记录。 - The first select received the table name . This is the table that we got as a result of the materialization of the select with id=2.
- 第一次
SELECT
接收到表名<subquery2>
。这是id=2
的SELECT
物化得到的结果表。
The optimizer chose to do a full scan over the materialized table, so this is an example of a use of the Materialization-Scan strategy.
优化器选择对物化表进行完全扫描,因此这是使用物化扫描策略的一个示例。
As for execution costs, we're going to read 15 rows from table City, write 15 rows to materialized table, read them back (the optimizer assumes there won't be any duplicates), and then do 15 eq_ref accesses to table Country. In total, we'll do 45 reads and 15 writes.
至于执行成本,我们将从 City
表中读取 15 行,将 15 行写入物化表,然后将它们读回(优化器保证不会有任何重复),然后执行 15 个 eq-ref
访问 Country
表。我们总共要读 45 次,写 15 次。
译者注 45 次为读 City
表获取 15 行 + 写入后再读取 15 行 + 15 行 eq-ref
访问 Country
表
By comparison, if you run the EXPLAIN in MySQL, you'll get this:
相比之下,如果您在 MySQL
中运行 EXPLAIN
,您将得到:
MySQL [world]> explain select * from Country where Country.code IN (select City.Country from City where City.Population > 7*1000*1000);
+----+--------------------+---------+-------+--------------------+------------+---------+------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------+-------+--------------------+------------+---------+------+------+------------------------------------+
| 1 | PRIMARY | Country | ALL | NULL | NULL | NULL | NULL | 239 | Using where |
| 2 | DEPENDENT SUBQUERY | City | range | Population,Country | Population | 4 | NULL | 15 | Using index condition; Using where |
+----+--------------------+---------+-------+--------------------+------------+---------+------+------+------------------------------------+
...which is a plan to do (239 + 239*15) = 3824 table reads.
… 这是一个读取 (239+239\*15) = 3824
表行数的计划。
译者注 239 行表数据 + 239 行全扫子查询表 15 行 (239*15)
物化查找
Let's modify the query slightly and look for countries which have cities with a population over one millon (instead of seven):
让我们稍微修改一下查询,并查找人口超过 1 百万(而不是 7 百万)的城市所在的国家:
MariaDB [world]> explain select * from Country where Country.code IN (select City.Country from City where City.Population > 1*1000*1000) ;
+----+--------------+-------------+--------+--------------------+--------------+---------+------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+-------------+--------+--------------------+--------------+---------+------+------+-----------------------+
| 1 | PRIMARY | Country | ALL | PRIMARY | NULL | NULL | NULL | 239 | |
| 1 | PRIMARY | <subquery2> | eq_ref | distinct_key | distinct_key | 3 | func | 1 | |
| 2 | MATERIALIZED | City | range | Population,Country | Population | 4 | NULL | 238 | Using index condition |
+----+--------------+-------------+--------+--------------------+--------------+---------+------+------+-----------------------+
3 rows in set (0.00 sec)
The EXPLAIN output is similar to the one which used Materialization-scan, except that:
EXPLAIN
输出类似于使用 Materialization-scan
的输出,但以下情况除外:
- the table is accessed with the eq_ref access method
<subquery2>
表使用eq_ref
进行访问- the access uses an index named distinct_key
- 使用访问名为
distinct_key
的索引
This means that the optimizer is planning to do index lookups into the materialized table. In other words, we're going to use the Materialization-lookup strategy.
这意味着优化器计划对物化表进行索引查找。 换句话说,我们将使用 Materialization-lookup
策略。
In MySQL (or with optimizer_switch='semijoin=off,materialization=off'), one will get this EXPLAIN:
在 MySQL
或者开启 optimizer_switch='semijoin=off,materialization=off'
,EXPLAIN
将显示:
MySQL [world]> explain select * from Country where Country.code IN (select City.Country from City where City.Population > 1*1000*1000) ;
+----+--------------------+---------+----------------+--------------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------+----------------+--------------------+---------+---------+------+------+-------------+
| 1 | PRIMARY | Country | ALL | NULL | NULL | NULL | NULL | 239 | Using where |
| 2 | DEPENDENT SUBQUERY | City | index_subquery | Population,Country | Country | 3 | func | 18 | Using where |
+----+--------------------+---------+----------------+--------------------+---------+---------+------+------+-------------+
One can see that both plans will do a full scan on the Country table. For the second step, MariaDB will fill the materialized table (238 rows read from table City and written to the temporary table) and then do a unique key lookup for each record in table Country, which works out to 238 unique key lookups. In total, the second step will cost (239+238) = 477 reads and 238 temp.table writes.
第一步,这两个执行计划都将对 Country
表进行全面扫描。第二步,MariaDB
将填充物化表(从 City
表读取238行并写入临时表),然后对 Country
表的每个记录进行唯一键查找,得出 238 个唯一键查找。总的来说,第二步将花费(239 + 238) = 477
次读取和 238 次临时表写入。
MySQL's plan for the second step is to read 18 rows using an index on City.Country for each record it receives for table Country. This works out to a cost of (18*239) = 4302 reads. Had there been fewer subquery invocations, this plan would have been better than the one with Materialization. By the way, MariaDB has an option to use such a query plan, too (see FirstMatch Strategy), but it did not choose it.
MySQL
的第二步计划是使用 City.Country
的索引读取 18 行,以及每个记录对应的 Country
表记录。这就需要花费 (18 \* 239)= 4302
次读取。如果子查询调用更少,这个计划会比物化的更好。顺便说一下,MariaDB 也可以选择使用这样的查询计划(参见 FirstMatch
策略),但它没有选择使用。
带 group 的子查询
MariaDB is able to use Semi-join materialization strategy when the subquery has grouping (other semi-join strategies are not applicable in this case).
当子查询具有分组时,MariaDB
能够使用半连接物化策略(在这种情况下其他半连接策略不适用)。
This allows for efficient execution of queries that search for the best/last element in a certain group.
这将有效执行搜索特定组中的首尾元素的查询。
For example, let's find cities that have the biggest population on their continent:
例如,让我们找到每个洲上人口最多的城市:
explain
select * from City
where City.Population in (select max(City.Population) from City, Country
where City.Country=Country.Code
group by Continent)
+------+--------------+-------------+------+---------------+------------+---------+----------------------------------+------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------+-------------+------+---------------+------------+---------+----------------------------------+------+-----------------+
| 1 | PRIMARY | <subquery2> | ALL | distinct_key | NULL | NULL | NULL | 239 | |
| 1 | PRIMARY | City | ref | Population | Population | 4 | <subquery2>.max(City.Population) | 1 | |
| 2 | MATERIALIZED | Country | ALL | PRIMARY | NULL | NULL | NULL | 239 | Using temporary |
| 2 | MATERIALIZED | City | ref | Country | Country | 3 | world.Country.Code | 18 | |
+------+--------------+-------------+------+---------------+------------+---------+----------------------------------+------+-----------------+
4 rows in set (0.00 sec)
the cities are:
城市为:
+------+-------------------+---------+------------+
| ID | Name | Country | Population |
+------+-------------------+---------+------------+
| 1024 | Mumbai (Bombay) | IND | 10500000 |
| 3580 | Moscow | RUS | 8389200 |
| 2454 | Macao | MAC | 437500 |
| 608 | Cairo | EGY | 6789479 |
| 2515 | Ciudad de México | MEX | 8591309 |
| 206 | São Paulo | BRA | 9968485 |
| 130 | Sydney | AUS | 3276207 |
+------+-------------------+---------+------------+
Semi-join materialization
半连接物化
- Can be used for uncorrelated IN-subqueries. The subselect may use grouping and/or aggregate functions.
- 可用于不相关的
IN
子查询。子选择可以使用分组或聚合函数。 - Is shown in EXPLAIN as type=MATERIALIZED for the subquery, and a line with table= in the parent subquery.
- 在
EXPLAIN
中子查询类型显示为type = MATERIALIZED
,以及父的子查询中具有table = <subqueryN>
的行。 - Is enabled when one has both materialization=on and semijoin=on in the optimizer_switch variable.
- 当一个在
optimizer_switch
变量中同时开启materialization = on
和semijoin = on
时启用。 - The materialization=on|off flag is shared with Non-semijoin materialization.
materialization = on|off
标志与是否开启物化连接一同使用。
Semi-join Materialization Strategy
THE END [ 2019-03-18 ]