Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add guided walkthrough on configuring fault-tolerant execution #14861

Merged
merged 1 commit into from
Nov 21, 2022

Conversation

jhlodin
Copy link
Contributor

@jhlodin jhlodin commented Nov 1, 2022

Description

Create a simplified guide on how to configure Trino with fault-tolerant execution, emphasizing the most common use case with an AWS S3 bucket for exchange storage. This is a companion to the existing /admin/fault-tolerant-execution reference material.

Non-technical explanation

Add a guide explaining how to configure fault-tolerant execution on a Trino cluster, simplifying the feature for the average reader.

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Nov 1, 2022
@jhlodin jhlodin requested a review from mosabua November 1, 2022 21:11
@github-actions github-actions bot added the docs label Nov 1, 2022
Copy link
Member

@mosabua mosabua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far. I think we need to emphasize a bit more that this allows processing queries that otherwise would fail with out of memory issue, and it also allows using spot instances since a worker dying is no longer an issues and a query can still finish..

Both of these are crucial for batch workloads .. which are often long running queries that take up lots of memory

docs/src/main/sphinx/installation/query-resiliency.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/installation/query-resiliency.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/installation/query-resiliency.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/installation/query-resiliency.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/installation/query-resiliency.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/installation/query-resiliency.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/installation/query-resiliency.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/installation/query-resiliency.rst Outdated Show resolved Hide resolved
@jhlodin jhlodin force-pushed the jl/fte-howto-guide branch from 20ef4d5 to 0c75911 Compare November 3, 2022 21:46
@jhlodin jhlodin requested review from arhimondr and colebow November 3, 2022 21:46
@jhlodin jhlodin marked this pull request as ready for review November 3, 2022 21:47
@jhlodin jhlodin requested review from brianzhan and hashhar November 4, 2022 17:20
@hashhar hashhar requested review from losipiuk and removed request for hashhar November 7, 2022 18:10
Comment on lines 93 to 102
3. Add the following configuration for fault-tolerant execution in the
``additionalConfigProperties:`` section of the Helm chart:

.. code-block:: yaml

additionalConfigProperties:
retry-policy=TASK
exchange.compression-enabled=true
query.low-memory-killer.delay=0s

In non-Kubernetes installations, the same properties must be defined in the
``config.properties`` file on the coordinator and all worker nodes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should you merge 3. with 2. to have single section about k8s deployemnt and single section about non-k8s deployment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to keep in separate steps, because they're broken up by which config file needs to be edited in the non-k8s deployment - exchange-manager.properties and config.properties, respectively. That may be confusing if condensed into a single step.

Copy link
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - minor commens

Copy link
Contributor

@arhimondr arhimondr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


additionalConfigProperties:
retry-policy=TASK
exchange.compression-enabled=true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is also strongly encouraged to change the default task concurrency, e.g.: "task.concurrency=8".

Ideally it would be great if the default recommended settings were being applied automatically when retry-policy=TASK is enabled. Let me think about it how it can be done (#14955). For now let's keep the settings here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jhlodin #14957

Let's have this landed and make proper adjustments to this guide before landing.

Big thanks for helping us with this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. We'll also need to modify the FTE reference doc (https://trino.io/docs/current/admin/fault-tolerant-execution.html) to describe what settings change when enabling task retries, either in your PR or I can update it afterwards once yours is merged and we know exactly what the setting changes will be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added task.concurrency=8 to the YAML anyway, even if it's set as the default in the future it's helpful to show what properties are relevant.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added task.concurrency=8 to the YAML anyway, even if it's set as the default in the future it's helpful to show what properties are relevant.

I would prefer to avoid it. It may tend to evolve over time and ideally we would like to avoid users setting anything explicitly if not necessary, as then it might not be possible to easily change the recommended settings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I'll remove it then. This page is intended as a "quick setup" guide anyway, so if we want to discuss more nuances for this property then the reference docs are more appropriate for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove exchange.compression-enabled=true and query.low-memory-killer.delay=0s from here as well, as now those are set automatically

Copy link
Contributor Author

@jhlodin jhlodin Nov 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arhimondr I thought they aren't set automatically if the user has them explicitly configured prior to setting retry-policy=TASK? My thought is that it can't hurt to leave them in, in case the reader has the property configured and would then need to change its value in order to meet best practices.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe it's worth mentioning it explicitly, something like
"If you have exchange.compression-enabled, query.low-memory-killer.delay configured explicitly - remove them", or something like that?

Basically what I'm afraid of is that somebody may configure query.low-memory-killer.delay=0s explicitly following our recommendation, but the recommended value might need to be changed in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me. I removed those properties from this section, then to address the new auto-default change behavior I rewrote the paragraph on line 125 of the /admin/ guide in this PR.

@jhlodin jhlodin force-pushed the jl/fte-howto-guide branch 2 times, most recently from 1181030 to 7ff63ad Compare November 10, 2022 22:14
@mosabua
Copy link
Member

mosabua commented Nov 18, 2022

@arhimondr or @losipiuk .. can we get this merged?

Copy link
Member

@mosabua mosabua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ship it!

Copy link
Contributor

@arhimondr arhimondr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % comment


additionalConfigProperties:
retry-policy=TASK
exchange.compression-enabled=true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove exchange.compression-enabled=true and query.low-memory-killer.delay=0s from here as well, as now those are set automatically

@mosabua
Copy link
Member

mosabua commented Nov 21, 2022

Everything is good to go. Can we please just get this merged @arhimondr @losipiuk ?

@arhimondr arhimondr merged commit 4206d1a into trinodb:master Nov 21, 2022
@jhlodin jhlodin deleted the jl/fte-howto-guide branch November 21, 2022 19:02
@github-actions github-actions bot added this to the 404 milestone Nov 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants