-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add guided walkthrough on configuring fault-tolerant execution #14861
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good so far. I think we need to emphasize a bit more that this allows processing queries that otherwise would fail with out of memory issue, and it also allows using spot instances since a worker dying is no longer an issues and a query can still finish..
Both of these are crucial for batch workloads .. which are often long running queries that take up lots of memory
20ef4d5
to
0c75911
Compare
3. Add the following configuration for fault-tolerant execution in the | ||
``additionalConfigProperties:`` section of the Helm chart: | ||
|
||
.. code-block:: yaml | ||
|
||
additionalConfigProperties: | ||
retry-policy=TASK | ||
exchange.compression-enabled=true | ||
query.low-memory-killer.delay=0s | ||
|
||
In non-Kubernetes installations, the same properties must be defined in the | ||
``config.properties`` file on the coordinator and all worker nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should you merge 3. with 2. to have single section about k8s deployemnt and single section about non-k8s deployment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to keep in separate steps, because they're broken up by which config file needs to be edited in the non-k8s deployment - exchange-manager.properties
and config.properties
, respectively. That may be confusing if condensed into a single step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - minor commens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
||
additionalConfigProperties: | ||
retry-policy=TASK | ||
exchange.compression-enabled=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is also strongly encouraged to change the default task concurrency, e.g.: "task.concurrency=8"
.
Ideally it would be great if the default recommended settings were being applied automatically when retry-policy=TASK
is enabled. Let me think about it how it can be done (#14955). For now let's keep the settings here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. We'll also need to modify the FTE reference doc (https://trino.io/docs/current/admin/fault-tolerant-execution.html) to describe what settings change when enabling task retries, either in your PR or I can update it afterwards once yours is merged and we know exactly what the setting changes will be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added task.concurrency=8
to the YAML anyway, even if it's set as the default in the future it's helpful to show what properties are relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added task.concurrency=8 to the YAML anyway, even if it's set as the default in the future it's helpful to show what properties are relevant.
I would prefer to avoid it. It may tend to evolve over time and ideally we would like to avoid users setting anything explicitly if not necessary, as then it might not be possible to easily change the recommended settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, I'll remove it then. This page is intended as a "quick setup" guide anyway, so if we want to discuss more nuances for this property then the reference docs are more appropriate for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove exchange.compression-enabled=true
and query.low-memory-killer.delay=0s
from here as well, as now those are set automatically
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arhimondr I thought they aren't set automatically if the user has them explicitly configured prior to setting retry-policy=TASK
? My thought is that it can't hurt to leave them in, in case the reader has the property configured and would then need to change its value in order to meet best practices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, maybe it's worth mentioning it explicitly, something like
"If you have exchange.compression-enabled, query.low-memory-killer.delay configured explicitly - remove them", or something like that?
Basically what I'm afraid of is that somebody may configure query.low-memory-killer.delay=0s
explicitly following our recommendation, but the recommended value might need to be changed in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me. I removed those properties from this section, then to address the new auto-default change behavior I rewrote the paragraph on line 125 of the /admin/ guide in this PR.
1181030
to
7ff63ad
Compare
@arhimondr or @losipiuk .. can we get this merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ship it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % comment
|
||
additionalConfigProperties: | ||
retry-policy=TASK | ||
exchange.compression-enabled=true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove exchange.compression-enabled=true
and query.low-memory-killer.delay=0s
from here as well, as now those are set automatically
7ff63ad
to
cbbb8fc
Compare
cbbb8fc
to
d33c9c5
Compare
Everything is good to go. Can we please just get this merged @arhimondr @losipiuk ? |
Description
Create a simplified guide on how to configure Trino with fault-tolerant execution, emphasizing the most common use case with an AWS S3 bucket for exchange storage. This is a companion to the existing
/admin/fault-tolerant-execution
reference material.Non-technical explanation
Add a guide explaining how to configure fault-tolerant execution on a Trino cluster, simplifying the feature for the average reader.
Release notes
(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: