RFC: Per-Tenant GetPage@LSN Throttling #5648
Conversation
2190 tests run: 2105 passed, 0 failed, 85 skipped (full report). Code coverage (full report).
This comment gets automatically updated with the latest test results.
75be184 at 2023-12-18T20:17:44.759Z
Some initial comments. Overall I think this contains good ideas, with some rough edges that could use polish.
I would like to understand the problem better before we enable throttling. My concern is that if it is caused by some bug or misconfiguration of our mechanisms (postgres config, prefetch, LFC), we could end up throttling normal user loads. And there would be no symptoms on the user side, apart from "my postgres is slow", and it would be as hard to investigate as it is now.
Yes, 100% sure. The metric that I'm quoting in the RFC is only counted in response to requests from compute. We have separate metrics for the lower-level Timeline::get call.
The limit will be a lot higher than our average users' workload. We will define the value based on experimentation in staging and production. Further, we will have sufficient observability on the pageserver side to tell whether a tenant is running into the rate limit. I reflected these concerns in my recent commit to this PR. I think team compute should start working on the "Future Work" item "Compute-side metrics for GetPage latency" in parallel with the Pageserver team implementing this RFC. Agree?
Replying to @knizhnik's comment on Slack:
One approach is to allow for a higher burst capacity, but if you keep doing IOs at a high rate, then you get throttled. EBS behaves like that. E.g. use a leaky bucket algorithm for the throttling, with a fairly large bucket. One problem with that is that a query can appear to be fast when you run it on its own, but as the system as a whole gets more loaded and you exceed the "burst budget", performance tanks badly. People running PostgreSQL on EBS run into that, and it's a nasty surprise. Or we could give a larger allowance specifically after compute restart.
The implementation is most likely going to be leaky-bucket based, maybe a hierarchical leaky bucket.
A larger allowance after compute restart is not going to happen as part of this PR, because knowing whether the compute is currently restarting requires integration with the control plane, which is already our bottleneck for other important features.
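For concreteness, here is a minimal sketch (not the actual pageserver implementation) of a leaky-bucket throttle with a burst budget, along the lines discussed above: requests are admitted as long as they fit within the bucket's capacity, and once the budget is exhausted the caller is told how long to delay. The names `LeakyBucket`, `capacity`, and `rate_per_sec` are assumptions for this sketch only.

```rust
use std::time::{Duration, Instant};

/// Minimal leaky-bucket throttle: the bucket drains at `rate_per_sec`
/// units per second and holds at most `capacity` units (the burst budget).
struct LeakyBucket {
    capacity: f64,
    rate_per_sec: f64,
    level: f64,
    last_update: Instant,
}

impl LeakyBucket {
    fn new(capacity: f64, rate_per_sec: f64) -> Self {
        Self { capacity, rate_per_sec, level: 0.0, last_update: Instant::now() }
    }

    /// Drain the bucket according to the time elapsed since the last call.
    fn drain(&mut self, now: Instant) {
        let elapsed = now.duration_since(self.last_update).as_secs_f64();
        self.level = (self.level - elapsed * self.rate_per_sec).max(0.0);
        self.last_update = now;
    }

    /// Try to admit one request. Returns `None` if it fits within the burst
    /// budget, otherwise how long the caller should be delayed before the
    /// request would fit.
    fn admit_one(&mut self, now: Instant) -> Option<Duration> {
        self.drain(now);
        if self.level + 1.0 <= self.capacity {
            self.level += 1.0;
            None
        } else {
            let overflow = self.level + 1.0 - self.capacity;
            Some(Duration::from_secs_f64(overflow / self.rate_per_sec))
        }
    }
}

fn main() {
    // Burst budget of 1000 requests, sustained rate of 100 requests/second.
    let mut bucket = LeakyBucket::new(1000.0, 100.0);
    let now = Instant::now();
    let throttled = (0..1500).filter(|_| bucket.admit_one(now).is_some()).count();
    println!("throttled {throttled} of 1500 back-to-back requests");
}
```

With a capacity that is large relative to the sustained rate, a query running on its own bursts through quickly, while a tenant that keeps issuing GetPage requests at a high rate is eventually delayed; that is also where the EBS-style "burst budget exhausted" surprise described above comes from if the parameters are sized poorly.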
The last pushes adjust the wording in this RFC to use "throttling" instead of "rate limiting", aligning with the terminology proposed by @hlinnaka in https://github.com/neondatabase/cloud/pull/7746.
I have made this RFC part of #5899 and would like to freeze & merge it. There are still some open comments, and there are no approvals. Those in favor, please hit approve. If you have open conversations, please resolve them or make clear what you'd like to see changed in this RFC before it's merged.
+1 for this overall. In the long term, I think we should throttle per-endpoint rather than per-tenant. If you create 100 read-only endpoints on the same tenant, each one should get X amount of I/O capacity. One reason to create read-only endpoints is to scale horizontally, and if the I/O capacity doesn't also scale, it doesn't really achieve that. I know that's harder to implement in the storage currently, as we don't distinguish between different endpoints there. So we can do per-tenant for now, and switch to per-endpoint later as a separate project.
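To illustrate the per-endpoint idea, here is a hypothetical sketch that keys the throttle state by a (tenant, endpoint) pair instead of the tenant alone, so that each read-only endpoint gets its own budget. The `EndpointId` type and the way it would reach the pageserver are assumptions; as noted above, the storage does not currently distinguish endpoints.

```rust
use std::collections::HashMap;
use std::time::Instant;

// Hypothetical identifiers: today the pageserver only knows the tenant, so
// per-endpoint keying would first require plumbing an endpoint ID from the
// compute connection through to the pageserver.
type TenantId = String;
type EndpointId = String;

/// Stand-in for a real throttle, e.g. the leaky bucket sketched earlier.
struct Throttle;

impl Throttle {
    fn new() -> Self {
        Throttle
    }
    fn admit_one(&mut self, _now: Instant) -> bool {
        true // placeholder: a real throttle would apply its rate limit here
    }
}

/// Per-endpoint throttling: each (tenant, endpoint) pair gets its own I/O
/// budget, so adding read-only endpoints also adds GetPage capacity.
struct GetPageThrottles {
    per_endpoint: HashMap<(TenantId, EndpointId), Throttle>,
}

impl GetPageThrottles {
    fn admit(&mut self, tenant: &str, endpoint: &str) -> bool {
        self.per_endpoint
            .entry((tenant.to_string(), endpoint.to_string()))
            .or_insert_with(Throttle::new)
            .admit_one(Instant::now())
    }
}

fn main() {
    let mut throttles = GetPageThrottles { per_endpoint: HashMap::new() };
    // Two read-only endpoints on the same tenant draw from separate budgets.
    assert!(throttles.admit("tenant-a", "ro-endpoint-1"));
    assert!(throttles.admit("tenant-a", "ro-endpoint-2"));
}
```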
Implementation epic: #5899