Join intermediate tasks for small partitions #11023

losipiuk · 2022-02-11T16:56:50Z

Description

If the partitions produced by upstream tasks are small, it is sub-optimal to create a separate task for each partition.
With this commit, a single task can read data from multiple input partitions; the target input size is configured via fault-tolerant-execution-target-task-input-size.
If the task also reads source data (for example, when a join runs against a bucketed table and the join key matches the bucketing), task sizing also takes input split weights into account (configured via fault-tolerant-execution-target-task-split-count).
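
To illustrate the size-based part of the description, here is a minimal sketch of packing small partitions into tasks. It is not the code from this PR: the class `PartitionGroupingSketch`, the method `group`, and the parameter `targetTaskInputSize` (standing in for the value of fault-tolerant-execution-target-task-input-size, in bytes) are hypothetical names chosen for the example, and the greedy, order-preserving strategy is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names and strategy are assumptions, not the PR's code.
final class PartitionGroupingSketch
{
    private PartitionGroupingSketch() {}

    // Greedily packs consecutive partitions into one task until adding the next
    // partition would push the task over targetTaskInputSize. A partition that is
    // larger than the target on its own still gets a task to itself.
    static List<List<Integer>> group(long[] partitionSizes, long targetTaskInputSize)
    {
        List<List<Integer>> tasks = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentSize = 0;
        for (int partition = 0; partition < partitionSizes.length; partition++) {
            long size = partitionSizes[partition];
            if (!current.isEmpty() && currentSize + size > targetTaskInputSize) {
                // Close the current task and start a new one
                tasks.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(partition);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            tasks.add(current);
        }
        return tasks;
    }
}
```

With partition sizes 10, 20, 50 and 5 and a target of 40, this sketch yields three tasks (partitions 0 and 1 together, then 2, then 3) instead of four.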

~~Based on #10837 (review just last commit)~~

General information

Is this change a fix, improvement, new feature, refactoring, or other?

improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

core query engine

How would you describe this change to a non-technical end user or system administrator?

N/A

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

(x) No release notes entries required.
( ) Release notes entries required with the following suggested text:

linzebing · 2022-02-15T01:59:58Z

core/trino-main/src/main/java/io/trino/execution/scheduler/StageTaskSourceFactory.java

+                    if ((splitsWeight > 0 || exchangeSourcesSize > 0)
+                            && ((splitsWeight + taskSplitWeight) > targetPartitionSplitWeight || (exchangeSourcesSize + taskExchangeSourcesSize + replicatedExchangeSourcesSize) > targetPartitionSourceSizeInBytes)) {
+                        exchangeSources.putAll(replicatedExchangeSourceHandles); // add replicated exchanges
+                        joinedTasks.add(new TaskDescriptor(taskPartitionId++, splits.build(), exchangeSources.build(), groupNodeRequirements));


QQ: does task execution automatically take care of processing the splits and partitions that correspond to the same hash?

Execution is dumb. But each of the TaskDescriptors passed to postprocessTasks handles a single partition (its splits read data from the table buckets that match the exchange partition).
In postprocessTasks I merge some TaskDescriptors if they are too small.
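
The merge step described in this reply can be sketched roughly as follows. This is a simplified, hypothetical model rather than the actual postprocessTasks implementation: `PartitionTask` stands in for a per-partition TaskDescriptor, and the method name `join` and its parameters are assumptions. Only the flush condition is modeled on the diff excerpt above (a non-empty accumulator, and either the split weight or the exchange source size, including replicated sources, going over its target).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code from StageTaskSourceFactory.
final class TaskJoiningSketch
{
    private TaskJoiningSketch() {}

    // Stand-in for a per-partition task: total weight of its table splits and
    // total size in bytes of its (non-replicated) exchange sources.
    static final class PartitionTask
    {
        final long splitsWeight;
        final long exchangeSourcesSize;

        PartitionTask(long splitsWeight, long exchangeSourcesSize)
        {
            this.splitsWeight = splitsWeight;
            this.exchangeSourcesSize = exchangeSourcesSize;
        }
    }

    static List<List<PartitionTask>> join(
            List<PartitionTask> tasks,
            long targetSplitWeight,
            long targetSourceSizeInBytes,
            long replicatedSourcesSize)
    {
        List<List<PartitionTask>> joined = new ArrayList<>();
        List<PartitionTask> current = new ArrayList<>();
        long splitsWeight = 0;
        long exchangeSourcesSize = 0;
        for (PartitionTask task : tasks) {
            boolean hasData = splitsWeight > 0 || exchangeSourcesSize > 0;
            boolean overWeight = splitsWeight + task.splitsWeight > targetSplitWeight;
            // Replicated sources are read by every joined task, so they count once per task
            boolean overSize = exchangeSourcesSize + task.exchangeSourcesSize + replicatedSourcesSize > targetSourceSizeInBytes;
            if (hasData && (overWeight || overSize)) {
                joined.add(current);
                current = new ArrayList<>();
                splitsWeight = 0;
                exchangeSourcesSize = 0;
            }
            current.add(task);
            splitsWeight += task.splitsWeight;
            exchangeSourcesSize += task.exchangeSourcesSize;
        }
        if (!current.isEmpty()) {
            joined.add(current);
        }
        return joined;
    }
}
```

Because each input task covers exactly one partition, joining whole tasks keeps splits and exchange sources for the same partition together, which is why execution itself does not need to know about the hash.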

losipiuk · 2022-02-17T10:26:01Z

rebased

arhimondr · 2022-02-17T20:47:26Z

core/trino-main/src/main/java/io/trino/execution/scheduler/StageTaskSourceFactory.java

+        private long sourceHandleSize(ExchangeSourceHandle handle)
+        {
+            Exchange exchange = exchangeForHandle.get(handle);
+            ExchangeSourceStatistics exchangeSourceStatistics = exchange.getExchangeSourceStatistics(handle);


I'm wondering whether it is a design mistake to have the getExchangeSourceStatistics method on the Exchange interface instead of the ExchangeManager. Maintaining these mappings could be costly when the number of partitions is high, and it seems completely unnecessary. Do you think it is worth fixing? (It could be done in a follow-up PR.)

It could be moved there.
But then so should ExchangeSourceSplitter split(ExchangeSourceHandle handle, long targetSizeInBytes);, right?

Leaving as a follow-up.
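
For readers following the design question, the trade-off can be sketched with two toy shapes. Everything here is hypothetical and deliberately simplified: `Handle`, `ExchangeLike` and `ManagerLike` are not Trino SPI types, and the assumption that a handle alone carries enough information to answer a size query is made only for the example.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; these are not the Trino SPI interfaces.
final class StatisticsPlacementSketch
{
    private StatisticsPlacementSketch() {}

    // Toy source handle; assumed to carry its own size for the sake of the example.
    static final class Handle
    {
        final long sizeInBytes;

        Handle(long sizeInBytes)
        {
            this.sizeInBytes = sizeInBytes;
        }
    }

    // Shape A: statistics are asked of the exchange that produced the handle,
    // so the caller has to remember which exchange each handle came from.
    interface ExchangeLike
    {
        long sourceSize(Handle handle);
    }

    // Shape B: statistics are asked of one manager-level object,
    // so no handle-to-exchange mapping is needed.
    interface ManagerLike
    {
        long sourceSize(Handle handle);
    }

    static long totalViaExchanges(List<Handle> handles, Map<Handle, ExchangeLike> exchangeForHandle)
    {
        long total = 0;
        for (Handle handle : handles) {
            // One map entry per handle; this is the mapping the review comment calls costly
            total += exchangeForHandle.get(handle).sourceSize(handle);
        }
        return total;
    }

    static long totalViaManager(List<Handle> handles, ManagerLike manager)
    {
        long total = 0;
        for (Handle handle : handles) {
            total += manager.sourceSize(handle);
        }
        return total;
    }
}
```

Both methods compute the same total; the difference is only in the bookkeeping the caller has to maintain, which grows with the number of handles in shape A.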

losipiuk · 2022-03-02T19:02:38Z

(rebased)

sopel39 · 2022-03-07T10:17:50Z

core/trino-main/src/main/java/io/trino/execution/scheduler/StageTaskSourceFactory.java

@@ -22,6 +22,7 @@
 import com.google.common.collect.ImmutableSet;


Could you extend the commit message to describe what this actually means?

Changed. Hope it is better now.

If the partitions produced by upstream tasks are small, it is sub-optimal to create a separate task for each partition. With this commit, a single task can read data from multiple input partitions; the target input size is configured via fault-tolerant-execution-target-task-input-size. If the task also reads source data (for example, when a join runs against a bucketed table and the join key matches the bucketing), task sizing also takes input split weights into account (configured via fault-tolerant-execution-target-task-split-count).

losipiuk · 2022-03-09T11:04:27Z

CI: #11388

losipiuk force-pushed the lo/adaptive-task-sizeing branch from 9f0c8a6 to cd428c0 Compare February 14, 2022 19:56

cla-bot bot added the cla-signed label Feb 14, 2022

losipiuk marked this pull request as ready for review February 14, 2022 20:03

losipiuk requested review from arhimondr and linzebing February 14, 2022 20:03

losipiuk force-pushed the lo/adaptive-task-sizeing branch from cd428c0 to f8e1b1b Compare February 14, 2022 20:04

linzebing approved these changes Feb 15, 2022

linzebing reviewed Feb 15, 2022

arhimondr mentioned this pull request Feb 15, 2022

Support Failure Recovery #9101

Closed

31 tasks

arhimondr assigned losipiuk Feb 15, 2022

losipiuk force-pushed the lo/adaptive-task-sizeing branch from f8e1b1b to f8e1f5b Compare February 17, 2022 10:25

arhimondr approved these changes Feb 17, 2022

losipiuk requested a review from findepi February 18, 2022 12:04

losipiuk force-pushed the lo/adaptive-task-sizeing branch from f8e1f5b to 5c1dc6a Compare February 18, 2022 12:32

losipiuk requested a review from sopel39 February 23, 2022 13:42

losipiuk force-pushed the lo/adaptive-task-sizeing branch from 5c1dc6a to d9c63f1 Compare March 2, 2022 19:02

sopel39 reviewed Mar 7, 2022

losipiuk force-pushed the lo/adaptive-task-sizeing branch from d9c63f1 to 4456068 Compare March 8, 2022 11:01

losipiuk force-pushed the lo/adaptive-task-sizeing branch from 4456068 to 85ad8ca Compare March 8, 2022 16:37

losipiuk merged commit c9f6e77 into trinodb:master Mar 9, 2022

github-actions bot added this to the 373 milestone Mar 9, 2022

mosabua mentioned this pull request Mar 9, 2022

Add Trino 373 release notes #11290

Merged
