Implement split enumeration back pressure for FTE #15514
Conversation
@@ -974,6 +935,28 @@ private StateChangeListener<TaskStatus> createExchangeSinkInstanceHandleUpdateRe
        };
    }

    private void loadMoreTaskDescriptorsIfNecessary()
    {
        if (schedulingQueue.size() - schedulingQueue.getSpeculativeCount() < 100) {
nit: make 100 configurable
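A minimal sketch of what this nit could look like, using an Airlift-style configuration setter as elsewhere in Trino; the config class name and property name below are assumptions for illustration, not the actual ones:

```java
import io.airlift.configuration.Config;

// Hypothetical config class and property name; the real Trino names may differ.
public class FaultTolerantSchedulingConfig
{
    private int minNonSpeculativeTaskDescriptors = 100;

    public int getMinNonSpeculativeTaskDescriptors()
    {
        return minNonSpeculativeTaskDescriptors;
    }

    @Config("fault-tolerant-execution-min-non-speculative-task-descriptors")
    public FaultTolerantSchedulingConfig setMinNonSpeculativeTaskDescriptors(int value)
    {
        this.minNonSpeculativeTaskDescriptors = value;
        return this;
    }
}
```

The scheduler would then compare `schedulingQueue.size() - schedulingQueue.getSpeculativeCount()` against the injected value instead of the literal 100.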
Correct me if I am wrong here. If we have an arbitrary distribution stage with a replicated source, then we do not really have much control over how many "speculative" task descriptors we would produce before we enumerate all the splits for the replicated source.
Only after the split enumeration for the replicated source completes will the priority of the tasks change from "speculative" to "non-speculative".
Probably a corner case that is not worth bothering about, but I would like to be sure I understand the logic here correctly.
Yeah, you are right. I think there's an even more generic problem: currently we do not account memory for task descriptors that are not complete. It is hard to say how big this problem is in practice. The scheduler should not schedule downstream stages before upstream stages are done (at least done with scheduling tasks), so in theory the time it takes to enumerate splits for broadcast tasks shouldn't be long. However, today it is inherently racy and it is possible for broadcast split enumeration to be slow. We may need to further improve this algorithm to have more control over it at some point.
Resolved (outdated) comments on core/trino-main/src/main/java/io/trino/execution/scheduler/EventDrivenTaskSource.java
    }

-   public synchronized void start()
+   public synchronized ListenableFuture<AssignmentResult> process()
If this method is called twice, and the second call happens before the future returned by the first call is done, then we would register the same splits twice in the assigner. I think we should guard against it.
Maybe you just need to verify that getSplitBatchAndAdvance is never called twice.
Added a check to make sure process is not called before the previous one is finished.
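A minimal sketch of such a guard, with hypothetical names (GuardedTaskSource, activeProcessFuture, doProcess); this is not the exact code in the PR:

```java
import static com.google.common.base.Preconditions.checkState;

import com.google.common.util.concurrent.ListenableFuture;

// Hypothetical names for illustration only.
class GuardedTaskSource<T>
{
    private ListenableFuture<T> activeProcessFuture;

    public synchronized ListenableFuture<T> process()
    {
        // Reject overlapping calls: registering the same split batches twice in the
        // assigner would corrupt the assignment state.
        checkState(activeProcessFuture == null || activeProcessFuture.isDone(),
                "process() called before the previous invocation completed");
        activeProcessFuture = doProcess();
        return activeProcessFuture;
    }

    private ListenableFuture<T> doProcess()
    {
        // the real split enumeration / assignment work would go here
        throw new UnsupportedOperationException("sketch only");
    }
}
```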
        return immediateFuture(assigner.finish());
    }

    ListenableFuture<IdempotentSplitSource.SplitBatchReference> firstCompleted = whenAnyComplete(futures);
This attaches a callback to each future on each call to process. If some sources are slow, the list of callbacks can grow large.
You can avoid that by keeping a SettableFuture<AssignmentResult> field that is completed when new splits are discovered. But I think it overall gets more complicated (you cannot easily exploit IdempotentSplitSource, I think).
Maybe this is not a big deal.
Yeah, that's a good point. Added a proxy future as an attempt to address the accumulating callbacks problem. Not sure if that's the cleanest solution though. Please take a look and let me know what you think.
Tricky, but should work. I really hope I will not need to touch this code in the future :)
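For readers following along, a rough sketch of the proxy-future idea discussed above: a single long-lived callback per source completes a shared future, so callbacks no longer pile up on slow sources across repeated process() calls. All names are hypothetical and this is not the code that was merged:

```java
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.SettableFuture;

import static com.google.common.util.concurrent.MoreExecutors.directExecutor;

// Hypothetical illustration of the idea, not the PR code.
class SplitArrivalProxy
{
    // Completed when any source produces a batch; a full implementation would swap in a
    // fresh future once the batch is consumed (re-arming is not shown in this sketch).
    private final SettableFuture<Void> splitsAvailable = SettableFuture.create();

    // Registered once per source future, not once per process() call, so callbacks do
    // not accumulate on slow sources across repeated process() invocations.
    void listenTo(ListenableFuture<?> sourceFuture)
    {
        Futures.addCallback(sourceFuture, new FutureCallback<Object>()
        {
            @Override
            public void onSuccess(Object result)
            {
                splitsAvailable.set(null);
            }

            @Override
            public void onFailure(Throwable failure)
            {
                splitsAvailable.setException(failure);
            }
        }, directExecutor());
    }

    // process() would wait on this shared future instead of calling whenAnyComplete(futures).
    ListenableFuture<Void> whenSplitsAvailable()
    {
        return splitsAvailable;
    }
}
```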
@@ -272,116 +269,19 @@ public void stressTest()
        testStageTaskSourceSuccess(sourceHandles, remoteSources, splits);
    }

    @Test(invocationCount = INVOCATION_COUNT)
    public void testFailures()
how is that tested now?
The failure handling logic is now trivial (there's no custom error handling code; the assumption is that Future#transform handles failures as expected). Thought it might not be worth maintaining this extra complexity.
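A small standalone illustration of the assumption being relied on here, taking "Future#transform" to mean Guava's Futures.transform: a failure on the input future propagates to the derived future without any custom error handling.

```java
import com.google.common.util.concurrent.ListenableFuture;

import java.util.concurrent.ExecutionException;

import static com.google.common.util.concurrent.Futures.immediateFailedFuture;
import static com.google.common.util.concurrent.Futures.transform;
import static com.google.common.util.concurrent.MoreExecutors.directExecutor;

public class TransformFailurePropagation
{
    public static void main(String[] args) throws InterruptedException
    {
        ListenableFuture<Integer> failed = immediateFailedFuture(new RuntimeException("split enumeration failed"));

        // The transformation function is never invoked for a failed input;
        // the original exception is propagated to the derived future.
        ListenableFuture<Integer> derived = transform(failed, value -> value + 1, directExecutor());

        try {
            derived.get();
        }
        catch (ExecutionException e) {
            System.out.println("propagated: " + e.getCause().getMessage());
        }
    }
}
```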
As often, I had a hard time reading this (it's complex by nature). It looks good. Some comments.
It's definitely very hard to digest here given the complexity of the scheduler. Can you elaborate more on how we achieve back pressure here?
Before, the process of enumerating splits was launched in the background and a set of callbacks was invoked whenever a new task was created. With such an implementation there was no way for the scheduler to tell the task factory to "suspend". In this PR the model is changed to more of a "pull" model: whenever the scheduler decides it needs more tasks, it asks the task factory to create more.
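A very small sketch of the contrast, in hypothetical interface form (the real EventDrivenTaskSource and scheduler are considerably more involved):

```java
import com.google.common.util.concurrent.ListenableFuture;

import java.util.function.Consumer;

// Push model (before): splits are enumerated in the background and a callback fires
// for every new task descriptor, so the scheduler cannot tell the source to slow down.
interface PushTaskSource<T>
{
    void start(Consumer<T> onNewTaskDescriptor);
}

// Pull model (this PR): the scheduler calls process() only when its queue of
// non-speculative task descriptors runs low (see loadMoreTaskDescriptorsIfNecessary
// above), so task-descriptor creation follows the scheduler's demand.
interface PullTaskSource<R>
{
    ListenableFuture<R> process();
}
```

The back pressure point is simply that the scheduler stops calling process() while its queue is full enough.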
Force-pushed b8b8f15 to 05ab2f1
Updated and ready for review
    }
    splitSource.close();
    boolean result = delegate().cancel(mayInterruptIfRunning);
    propagateIfNecessary();
Why do you need this when propagateIfNecessary is already attached to the delegate?
Hmm, good point. Let me remove the override
Force-pushed 05ab2f1 to 2675557
Description
Prevent the scheduler from creating more task descriptors than necessary
Additional context and related issues
TODO
Release notes
(X) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: