
RFC: Per-Tenant GetPage@LSN Throttling #5648

Merged: 16 commits into main, Dec 19, 2023
Conversation

@problame (Contributor) commented Oct 24, 2023

Rendered

Implementation epic: #5899

@github-actions bot commented Oct 24, 2023

2190 tests run: 2105 passed, 0 failed, 85 skipped (full report)


Flaky tests (2): Postgres 15

Code coverage (full report)

  • functions: 55.1% (9695 of 17596 functions)
  • lines: 82.3% (55697 of 67703 lines)

The comment gets automatically updated with the latest test results
75be184 at 2023-12-18T20:17:44.759Z :recycle:

@MMeent (Contributor) left a comment


Some initial comments. Overall I think this contains good ideas, with some rough edges that could use polish.

7 review threads on docs/rfcs/029-getpage-rate-limiting.md (resolved)
@lubennikovaav (Contributor):
I would like to understand the problem better before we enable throttling.
Are we sure that GetPage@LSN traffic is caused by the client's load? Or at least that it correlates with it?

My concern is that if it is caused by some bug or misconfiguration of our mechanisms (postgres config, prefetch, LFC), we could start throttling normal user loads. And there won't be any symptoms on the user side, apart from "my postgres is slow", and it will be as hard to investigate as it is now.

@problame (Contributor, author):

> Are we sure that GetPage@LSN traffic is caused by the client's load?

Yes, 100% sure. The metric that I'm quoting in the RFC is only counted in response to requests from compute. We have separate metrics for the lower-level Timeline::get call.

> My concern is that if it is caused by some bug or misconfiguration of our mechanisms (postgres config, prefetch, LFC), we could start throttling normal user loads. And there won't be any symptoms on the user side, apart from "my postgres is slow", and it will be as hard to investigate as it is now.

The limit will be a lot higher than our average users' workload.
I.e., we're defining "the highest GetPage/second that Pageserver can support for a single tenant".

We will define the value based on experimentation in staging and production.

Further, we will have sufficient observability on the pageserver side to tell whether a tenant is running into the rate limit.
And we will have a safety switch to disable the rate limiting on a per-tenant basis.
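For illustration only, here is a minimal sketch of what such a per-tenant knob could look like; the type and field names (`GetPageThrottleConfig`, `enabled`, `requests_per_second`, `burst`) are hypothetical and not the schema chosen in the implementation epic:

```rust
// Hypothetical shape of a per-tenant throttle setting (illustrative only;
// not the actual pageserver tenant config schema). In practice this would
// be deserialized from the tenant config and be hot-reloadable.
#[derive(Debug, Clone, PartialEq)]
pub struct GetPageThrottleConfig {
    /// Safety switch: disable throttling for this tenant only.
    pub enabled: bool,
    /// Steady-state allowance, in GetPage requests per second.
    pub requests_per_second: u32,
    /// Extra burst capacity on top of the steady-state rate.
    pub burst: u32,
}

impl Default for GetPageThrottleConfig {
    fn default() -> Self {
        Self {
            enabled: true,
            // Placeholder values; the real defaults come from the
            // staging/production experiments mentioned above.
            requests_per_second: 10_000,
            burst: 50_000,
        }
    }
}
```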

I reflected these concerns in my recent commit to this PR.

I think the compute team should start working on the "Future Work" item "Compute-side metrics for GetPage latency" in parallel with the Pageserver team implementing this RFC.
But, given our careful rollout plan, I don't think we need to wait for the compute-side metrics to arrive.

Agree?

@MMeent added the t/tech_design_rfc (Issue type: tech design RFC) label on Oct 25, 2023
@hlinnaka (Contributor) commented Nov 1, 2023

Replying to @knizhnik's comment on Slack:

> There are normal situations where there are bursts of requests to the pageserver, for example a compute restart. Should we somehow limit the rate of get_page requests at compute startup even if it has a negative impact on other tenants? The answer is not obvious to me... We should minimize negative user feedback. If the execution time of some query increases from 100 to 200 msec, that is most likely not critical. But if it increases to several seconds (which can easily happen at startup), then for many customers it is not acceptable.

One approach is to allow for a higher burst capacity, but if you keep doing IOs at a high rate, you get throttled. EBS behaves like that. E.g. use a leaky bucket algorithm for the throttling, with a fairly large bucket. One problem with that is that a query can appear to be fast when you run it on its own, but when the system as a whole is more loaded and you exceed the "burst budget", performance tanks badly. People running PostgreSQL on EBS run into that, and it's a nasty surprise.

Or we could give a larger allowance specifically after compute restart.
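A minimal, self-contained sketch of the leaky-bucket idea with a large burst budget described above (illustrative only, not the pageserver's actual implementation; the `TokenBucket` name and its parameters are assumptions):

```rust
use std::time::{Duration, Instant};

/// Illustrative token bucket: refills at `rate` tokens/second up to `burst`.
/// A GetPage request is admitted if a token is available; otherwise the
/// caller learns how long to wait, i.e. gets throttled.
struct TokenBucket {
    rate: f64,            // steady-state tokens per second
    burst: f64,           // bucket capacity, the "burst budget"
    tokens: f64,          // current fill level
    last_refill: Instant,
}

impl TokenBucket {
    fn new(rate: f64, burst: f64) -> Self {
        // Start full so a freshly started compute gets the whole burst budget.
        Self { rate, burst, tokens: burst, last_refill: Instant::now() }
    }

    fn refill(&mut self, now: Instant) {
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.rate).min(self.burst);
        self.last_refill = now;
    }

    /// Try to admit one request: Ok(()) if admitted, otherwise Err(wait)
    /// with the suggested sleep before retrying.
    fn try_acquire(&mut self, now: Instant) -> Result<(), Duration> {
        self.refill(now);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Ok(())
        } else {
            let deficit = 1.0 - self.tokens;
            Err(Duration::from_secs_f64(deficit / self.rate))
        }
    }
}
```

The "nasty surprise" behaviour described above falls out of this shape: while `tokens` is near `burst` a query sees no delay at all, but once the budget is drained every request pays roughly `1/rate` seconds.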

@problame (Contributor, author) commented Nov 3, 2023

The implementation is most likely going to be leaky-bucket based, maybe a hierarchical leaky bucket.

> Or we could give a larger allowance specifically after compute restart.

This is not going to happen as part of this PR, because knowing whether the compute is currently restarting requires integration with the control plane, which is already our bottleneck for other important features.
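For the "hierarchical leaky bucket" mentioned above, one possible (purely illustrative) composition is a shared per-tenant bucket plus a child bucket per endpoint or connection, reusing the hypothetical `TokenBucket` type from the sketch earlier in this thread:

```rust
use std::time::Instant;

// Illustrative two-level admission: a request needs a token from its child
// bucket (per endpoint/connection) AND from the shared per-tenant bucket.
fn admit(tenant: &mut TokenBucket, endpoint: &mut TokenBucket) -> bool {
    let now = Instant::now();
    // Check the child first so one saturated endpoint cannot drain the
    // tenant-wide budget that other endpoints share.
    if endpoint.try_acquire(now).is_err() {
        return false;
    }
    // A real implementation would return the child's token if the parent
    // rejects; omitted here for brevity.
    tenant.try_acquire(now).is_ok()
}
```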

@problame (Contributor, author):
The last pushes adjust the wording in this RFC to use "throttling" instead of "rate limiting", aligning with the terminology proposed by @hlinnaka in https://github.com/neondatabase/cloud/pull/7746.

@problame changed the title from "RFC: GetPage@LSN Rate Limit" to "RFC: Per-Tenant GetPage@LSN Throttling" on Nov 30, 2023
@problame (Contributor, author):
I have made this RFC part of #5899 and would like to freeze & merge it.

There are still some open comments, and there are no approvals.

Those in favor, please hit approve.

If you have open conversations, please resolve them or make clear what you'd like to see changed in this RFC before it's merged.

@hlinnaka (Contributor):
+1 for this overall.

In the long term, I think we should throttle per-endpoint rather than per-tenant. If you create 100 read-only endpoints on the same tenant, each one should get X amount of I/O capacity. One reason to create read-only endpoints is to scale horizontally, and if the I/O capacity doesn't also scale, it doesn't really achieve that.

I know that's harder to implement in the storage layer currently, as we don't distinguish between different endpoints there. So we can do per-tenant for now, and switch to per-endpoint later as a separate project.

5 review threads on docs/rfcs/029-getpage-throttling.md (resolved)
@problame merged commit c272c68 into main on Dec 19, 2023
41 checks passed
@problame deleted the problame/rfc-29-rate-limiting branch on December 19, 2023 at 10:20
@problame (Contributor, author):
Implementation epic: #5899

@problame mentioned this pull request on Feb 9, 2024
Labels
t/tech_design_rfc Issue type: tech design RFC
7 participants