trinodb · losipiuk · Oct 17, 2022 · Aug 10, 2022 · Aug 10, 2022 · Aug 10, 2022
diff --git a/docs/src/main/sphinx/optimizer/cost-based-optimizations.rst b/docs/src/main/sphinx/optimizer/cost-based-optimizations.rst
@@ -1,5 +1,5 @@
 ========================
-Cost based optimizations
+Cost-based optimizations
 ========================
 
 Trino supports several cost based optimizations, described below.
@@ -10,60 +10,63 @@ Join enumeration
 The order in which joins are executed in a query can have a significant impact
 on the query's performance. The aspect of join ordering that has the largest
 impact on performance is the size of the data being processed and transferred
-over the network. If a join, that produces a lot of data, is performed early in
-the execution, then subsequent stages need to process large amounts of
+over the network. If a join which produces a lot of data is performed early in
+the query's execution, then subsequent stages need to process large amounts of
 data for longer than necessary, increasing the time and resources needed for
-the query.
+processing the query.
 
-With cost based join enumeration, Trino uses
-:doc:`/optimizer/statistics` provided by connectors to estimate
-the costs for different join orders and automatically picks the
-join order with the lowest computed costs.
+With cost-based join enumeration, Trino uses :doc:`/optimizer/statistics`
+provided by connectors to estimate the costs for different join orders and
+automatically picks the join order with the lowest computed costs.
 
 The join enumeration strategy is governed by the ``join_reordering_strategy``
-session property, with the ``optimizer.join-reordering-strategy``
-configuration property providing the default value.
+:ref:`session property <session-properties-definition>`, with the
+``optimizer.join-reordering-strategy`` configuration property providing the
+default value.
 
-The valid values are:
- * ``AUTOMATIC`` (default) - full automatic join enumeration enabled
+The possible values are:
+
+ * ``AUTOMATIC`` (default) - enable full automatic join enumeration
  * ``ELIMINATE_CROSS_JOINS`` - eliminate unnecessary cross joins
  * ``NONE`` - purely syntactic join order
 
-If using ``AUTOMATIC`` and statistics are not available, or if for any other
-reason a cost could not be computed, the ``ELIMINATE_CROSS_JOINS`` strategy is
-used instead.
+If you are using ``AUTOMATIC`` join enumeration and statistics are not
+available or a cost can not be computed for any other reason, the
+``ELIMINATE_CROSS_JOINS`` strategy is used instead.
 
 Join distribution selection
 ---------------------------
 
-Trino uses a hash based join algorithm. That implies that for each join
-operator a hash table must be created from one join input, called build side.
-The other input, called probe side, is then iterated, and for each row the hash table is
-queried to find matching rows.
+Trino uses a hash-based join algorithm. For each join operator, a hash table
+must be created from one join input, referred to as the build side. The other
+input, called the probe side, is then iterated on. For each row, the hash table
+is queried to find matching rows.
 
 There are two types of join distributions:
- * Partitioned: each node participating in the query builds a hash table
-   from only fraction of the data
- * Broadcast: each node participating in the query builds a hash table
-   from all of the data (data is replicated to each node)
-
-Each type have their trade offs. Partitioned joins require redistributing both
-tables using a hash of the join key. This can be slower, sometimes
-substantially slower, than broadcast joins, but allows much larger joins. In
-particular, broadcast joins are faster if the build side is much smaller
-than the probe side. However, broadcast joins require that the tables on the
-build side of the join after filtering fit in memory on each node, whereas
-distributed joins only need to fit in distributed memory across all nodes.
-
-With cost based join distribution selection, Trino automatically chooses to
-use a partitioned or broadcast join. With cost based join enumeration, Trino
-automatically chooses which side is the probe and which is the build.
+
+ * Partitioned: each node participating in the query builds a hash table from
+   only a fraction of the data
+ * Broadcast: each node participating in the query builds a hash table from all
+   of the data. The data is replicated to each node.
+
+Each type has advantages and disadvantages. Partitioned joins require
+redistributing both tables using a hash of the join key. These joins can be much
+slower than broadcast joins, but they allow much larger joins overall. Broadcast
+joins are faster if the build side is much smaller than the probe side. However,
+broadcast joins require that the tables on the build side of the join after
+filtering fit in memory on each node, whereas distributed joins only need to fit
+in distributed memory across all nodes.
+
+With cost-based join distribution selection, Trino automatically chooses whether
+to use a partitioned or broadcast join. With cost-based join enumeration, Trino
+automatically chooses which sides are probe and build.
 
 The join distribution strategy is governed by the ``join_distribution_type``
 session property, with the ``join-distribution-type`` configuration property
 providing the default value.
 
 The valid values are:
+
  * ``AUTOMATIC`` (default) - join distribution type is determined automatically
    for each join
  * ``BROADCAST`` - broadcast join distribution is used for all joins
@@ -73,14 +76,54 @@ The valid values are:
 Capping replicated table size
 -----------------------------
 
-Join distribution type will be chosen automatically when join reordering strategy
-is set to ``AUTOMATIC`` or when join distribution type is set to ``AUTOMATIC``.
-In such case it is possible to cap the maximum size of replicated table via
-``join-max-broadcast-table-size`` config property (e.g. ``join-max-broadcast-table-size=100MB``)
-or via ``join_max_broadcast_table_size`` session property (e.g. ``set session join_max_broadcast_table_size='100MB';``)
-This allows to improve cluster concurrency and to prevent bad plans when CBO misestimates size of joined tables.
-
-By default replicated table size is capped to 100MB.
+The join distribution type is automatically chosen when the join reordering
+strategy is set to ``AUTOMATIC`` or when the join distribution type is set to
+``AUTOMATIC``. In both cases, it is possible to cap the maximum size of the
+replicated table with the ``join-max-broadcast-table-size`` configuration
+property or with the ``join_max_broadcast_table_size`` session property. This
+allows you to improve cluster concurrency and prevent bad plans when the
+cost-based optimizer misestimates the size of the joined tables.
+
+By default, the replicated table size is capped to 100MB.
+
+Syntactic join order
+--------------------
+
+If not using cost-based optimization, Trino defaults to syntactic join ordering.
+While there is no formal way to optimize queries for this case, it is possible
+to take advantage of how Trino implements joins to make them more performant.
+
+Trino uses in-memory hash joins. When processing a join statement, Trino loads
+the right-most table of the join into memory as the build side, then streams the
+next right-most table as the probe side to execute the join. If a query has
+multiple joins, the result of this first join stays in memory as the build side,
+and the third right-most table is then used as the probe side, and so on for
+additional joins. In the case where join order is made more complex, such as
+when using parentheses to specify specific parents for joins, Trino may execute
+multiple lower-level joins at once, but each step of that process follows the
+same logic, and the same applies when the results are ultimately joined
+together.
+
+Because of this behavior, it is optimal to syntactically order joins in your SQL
+queries from the largest tables to the smallest, as this minimizes memory usage.
+
+As an example, if you have a small, medium, and large table and are using left
+joins:
+
+.. code-block:: sql
+
+    SELECT
+      *
+    FROM
+      large_table l
+      LEFT JOIN medium_table m ON l.user_id = m.user_id
+      LEFT JOIN small_table s ON s.user_id = l.user_id
+
+.. warning::
+
+    This means of optimization is not a feature of Trino. It is an artifact of
+    how joins are implemented, and therefore this behavior may change without
+    notice.
 
 Connector implementations
 -------------------------