Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up optimization docs & add syntactic join ordering section #13608

Merged
merged 3 commits into from
Oct 17, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
129 changes: 86 additions & 43 deletions docs/src/main/sphinx/optimizer/cost-based-optimizations.rst
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
========================
Cost based optimizations
Cost-based optimizations
colebow marked this conversation as resolved.
Show resolved Hide resolved
========================

Trino supports several cost based optimizations, described below.
Expand All @@ -10,60 +10,63 @@ Join enumeration
The order in which joins are executed in a query can have a significant impact
on the query's performance. The aspect of join ordering that has the largest
impact on performance is the size of the data being processed and transferred
over the network. If a join, that produces a lot of data, is performed early in
the execution, then subsequent stages need to process large amounts of
over the network. If a join which produces a lot of data is performed early in
the query's execution, then subsequent stages need to process large amounts of
data for longer than necessary, increasing the time and resources needed for
the query.
processing the query.

With cost based join enumeration, Trino uses
:doc:`/optimizer/statistics` provided by connectors to estimate
the costs for different join orders and automatically picks the
join order with the lowest computed costs.
With cost-based join enumeration, Trino uses :doc:`/optimizer/statistics`
provided by connectors to estimate the costs for different join orders and
colebow marked this conversation as resolved.
Show resolved Hide resolved
automatically picks the join order with the lowest computed costs.

The join enumeration strategy is governed by the ``join_reordering_strategy``
session property, with the ``optimizer.join-reordering-strategy``
configuration property providing the default value.
:ref:`session property <session-properties-definition>`, with the
``optimizer.join-reordering-strategy`` configuration property providing the
default value.

The valid values are:
* ``AUTOMATIC`` (default) - full automatic join enumeration enabled
The possible values are:

* ``AUTOMATIC`` (default) - enable full automatic join enumeration
* ``ELIMINATE_CROSS_JOINS`` - eliminate unnecessary cross joins
* ``NONE`` - purely syntactic join order

If using ``AUTOMATIC`` and statistics are not available, or if for any other
reason a cost could not be computed, the ``ELIMINATE_CROSS_JOINS`` strategy is
used instead.
If you are using ``AUTOMATIC`` join enumeration and statistics are not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the properties reference docs for this property link to here ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like there is no link to this from properties-optimizer.rst .. we should add that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @colebow

available or a cost can not be computed for any other reason, the
``ELIMINATE_CROSS_JOINS`` strategy is used instead.

Join distribution selection
---------------------------

Trino uses a hash based join algorithm. That implies that for each join
operator a hash table must be created from one join input, called build side.
The other input, called probe side, is then iterated, and for each row the hash table is
queried to find matching rows.
Trino uses a hash-based join algorithm. For each join operator, a hash table
must be created from one join input, referred to as the build side. The other
input, called the probe side, is then iterated on. For each row, the hash table
is queried to find matching rows.

There are two types of join distributions:
* Partitioned: each node participating in the query builds a hash table
from only fraction of the data
* Broadcast: each node participating in the query builds a hash table
from all of the data (data is replicated to each node)

Each type have their trade offs. Partitioned joins require redistributing both
tables using a hash of the join key. This can be slower, sometimes
substantially slower, than broadcast joins, but allows much larger joins. In
particular, broadcast joins are faster if the build side is much smaller
than the probe side. However, broadcast joins require that the tables on the
build side of the join after filtering fit in memory on each node, whereas
distributed joins only need to fit in distributed memory across all nodes.

With cost based join distribution selection, Trino automatically chooses to
use a partitioned or broadcast join. With cost based join enumeration, Trino
automatically chooses which side is the probe and which is the build.

* Partitioned: each node participating in the query builds a hash table from
colebow marked this conversation as resolved.
Show resolved Hide resolved
only a fraction of the data
* Broadcast: each node participating in the query builds a hash table from all
of the data. The data is replicated to each node.

Each type has advantages and disadvantages. Partitioned joins require
redistributing both tables using a hash of the join key. These joins can be much
slower than broadcast joins, but they allow much larger joins overall. Broadcast
joins are faster if the build side is much smaller than the probe side. However,
broadcast joins require that the tables on the build side of the join after
filtering fit in memory on each node, whereas distributed joins only need to fit
in distributed memory across all nodes.

With cost-based join distribution selection, Trino automatically chooses whether
to use a partitioned or broadcast join. With cost-based join enumeration, Trino
automatically chooses which sides are probe and build.

The join distribution strategy is governed by the ``join_distribution_type``
session property, with the ``join-distribution-type`` configuration property
providing the default value.

The valid values are:

* ``AUTOMATIC`` (default) - join distribution type is determined automatically
for each join
* ``BROADCAST`` - broadcast join distribution is used for all joins
Expand All @@ -73,14 +76,54 @@ The valid values are:
Capping replicated table size
-----------------------------

Join distribution type will be chosen automatically when join reordering strategy
is set to ``AUTOMATIC`` or when join distribution type is set to ``AUTOMATIC``.
In such case it is possible to cap the maximum size of replicated table via
``join-max-broadcast-table-size`` config property (e.g. ``join-max-broadcast-table-size=100MB``)
or via ``join_max_broadcast_table_size`` session property (e.g. ``set session join_max_broadcast_table_size='100MB';``)
This allows to improve cluster concurrency and to prevent bad plans when CBO misestimates size of joined tables.

By default replicated table size is capped to 100MB.
The join distribution type is automatically chosen when the join reordering
strategy is set to ``AUTOMATIC`` or when the join distribution type is set to
``AUTOMATIC``. In both cases, it is possible to cap the maximum size of the
replicated table with the ``join-max-broadcast-table-size`` configuration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add this properties to properties-optimizer.rst .. even if just a small section with link to here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @colebow

property or with the ``join_max_broadcast_table_size`` session property. This
allows you to improve cluster concurrency and prevent bad plans when the
cost-based optimizer misestimates the size of the joined tables.

By default, the replicated table size is capped to 100MB.

Syntactic join order
colebow marked this conversation as resolved.
Show resolved Hide resolved
--------------------

If not using cost-based optimization, Trino defaults to syntactic join ordering.
While there is no formal way to optimize queries for this case, it is possible
to take advantage of how Trino implements joins to make them more performant.

Trino uses in-memory hash joins. When processing a join statement, Trino loads
the right-most table of the join into memory as the build side, then streams the
colebow marked this conversation as resolved.
Show resolved Hide resolved
next right-most table as the probe side to execute the join. If a query has
multiple joins, the result of this first join stays in memory as the build side,
and the third right-most table is then used as the probe side, and so on for
additional joins. In the case where join order is made more complex, such as
when using parentheses to specify specific parents for joins, Trino may execute
multiple lower-level joins at once, but each step of that process follows the
same logic, and the same applies when the results are ultimately joined
together.

Because of this behavior, it is optimal to syntactically order joins in your SQL
queries from the largest tables to the smallest, as this minimizes memory usage.

As an example, if you have a small, medium, and large table and are using left
joins:

.. code-block:: sql

SELECT
*
FROM
large_table l
LEFT JOIN medium_table m ON l.user_id = m.user_id
LEFT JOIN small_table s ON s.user_id = l.user_id

.. warning::

This means of optimization is not a feature of Trino. It is an artifact of
how joins are implemented, and therefore this behavior may change without
notice.

Connector implementations
-------------------------
Expand Down