
RFC: Per-Tenant GetPage@LSN Throttling #5648

Merged: 16 commits into main, Dec 19, 2023
Conversation

@problame (Contributor) commented Oct 24, 2023

Rendered

Implementation epic: #5899

@github-actions bot commented Oct 24, 2023

2190 tests run: 2105 passed, 0 failed, 85 skipped (full report)


Flaky tests (2): Postgres 15

Code coverage (full report)

  • functions: 55.1% (9695 of 17596 functions)
  • lines: 82.3% (55697 of 67703 lines)

The comment gets automatically updated with the latest test results
75be184 at 2023-12-18T20:17:44.759Z :recycle:

@MMeent (Contributor) left a comment


Some initial comments. Overall I think this contains good ideas, with some rough edges that could use polish.

7 review threads on docs/rfcs/029-getpage-rate-limiting.md (resolved)
@lubennikovaav (Contributor):
I would like to understand the problem better before we enable throttling.
Are we sure that GetPage@LSN traffic is caused by the client's load? Or at least that it correlates with it?

My concern is that if it is caused by some bug or misconfiguration of our mechanisms (postgres config, prefetch, LFC), we could start throttling normal user loads. And there won't be any symptoms on the user side, apart from "my postgres is slow", and it will be as hard to investigate as it is now.

@problame (Contributor, author):

> Are we sure that GetPage@LSN traffic is caused by the client's load?

Yes, 100% sure. The metric that I'm quoting in the RFC is only counted in response to requests from compute. We have separate metrics for the lower-level Timeline::get call.

> My concern is that if it is caused by some bug or misconfiguration of our mechanisms (postgres config, prefetch, LFC), we could start throttling normal user loads. And there won't be any symptoms on the user side, apart from "my postgres is slow", and it will be as hard to investigate as it is now.

The limit will be a lot higher than our average users' workload.
I.e., we're defining "the highest GetPage/second that Pageserver can support for a single tenant".

We will define the value based on experimentation in staging and production.

Further, we will have sufficient observability on the pageserver side to tell whether a tenant is running into the rate limit.
And we will have a safety switch to disable the rate limiting on a per-tenant basis.
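For illustration only, here is a minimal sketch of what such a per-tenant knob could look like; the type and field names (`GetPageThrottleConfig`, `enabled`, `requests_per_second`, `burst`) are hypothetical and not the schema chosen in the implementation epic:

```rust
// Hypothetical shape of a per-tenant throttle setting (illustrative only;
// not the actual pageserver tenant config schema). In practice this would
// be deserialized from the tenant config and be hot-reloadable.
#[derive(Debug, Clone, PartialEq)]
pub struct GetPageThrottleConfig {
    /// Safety switch: disable throttling for this tenant only.
    pub enabled: bool,
    /// Steady-state allowance, in GetPage requests per second.
    pub requests_per_second: u32,
    /// Extra burst capacity on top of the steady-state rate.
    pub burst: u32,
}

impl Default for GetPageThrottleConfig {
    fn default() -> Self {
        Self {
            enabled: true,
            // Placeholder values; the real defaults come from the
            // staging/production experiments mentioned above.
            requests_per_second: 10_000,
            burst: 50_000,
        }
    }
}
```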

I reflected these concerns in my recent commit to this PR.

I think the compute team should start working on the "Future Work" item "Compute-side metrics for GetPage latency" in parallel with the Pageserver team implementing this RFC.
But, given our careful rollout plan, I don't think we need to wait for the compute-side metrics to arrive.

Agree?

@MMeent added the t/tech_design_rfc (Issue type: tech design RFC) label on Oct 25, 2023
@hlinnaka (Contributor) commented Nov 1, 2023

Replying to @knizhnik's comment on Slack:

> There are normal situations where there are bursts of requests to the pageserver, for example a compute restart. Should we somehow limit the rate of get_page requests at compute startup even if it has a negative impact on other tenants? The answer is not obvious to me... We should minimize negative user feedback. If the execution time of some query increases from 100 to 200 msec, that is most likely not critical. But if it increases to several seconds (which can easily happen at startup), then for many customers it is not acceptable.

One approach is to allow for a higher burst capacity, but if you keep doing IOs at a high rate, you get throttled. EBS behaves like that. E.g. use a leaky bucket algorithm for the throttling, with a fairly large bucket. One problem with that is that a query can appear to be fast when you run it on its own, but when the system as a whole is more loaded and you exceed the "burst budget", performance tanks badly. People running PostgreSQL on EBS run into that, and it's a nasty surprise.

Or we could give a larger allowance specifically after compute restart.
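A minimal, self-contained sketch of the leaky-bucket idea with a large burst budget described above (illustrative only, not the pageserver's actual implementation; the `TokenBucket` name and its parameters are assumptions):

```rust
use std::time::{Duration, Instant};

/// Illustrative token bucket: refills at `rate` tokens/second up to `burst`.
/// A GetPage request is admitted if a token is available; otherwise the
/// caller learns how long to wait, i.e. gets throttled.
struct TokenBucket {
    rate: f64,            // steady-state tokens per second
    burst: f64,           // bucket capacity, the "burst budget"
    tokens: f64,          // current fill level
    last_refill: Instant,
}

impl TokenBucket {
    fn new(rate: f64, burst: f64) -> Self {
        // Start full so a freshly started compute gets the whole burst budget.
        Self { rate, burst, tokens: burst, last_refill: Instant::now() }
    }

    fn refill(&mut self, now: Instant) {
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.rate).min(self.burst);
        self.last_refill = now;
    }

    /// Try to admit one request: Ok(()) if admitted, otherwise Err(wait)
    /// with the suggested sleep before retrying.
    fn try_acquire(&mut self, now: Instant) -> Result<(), Duration> {
        self.refill(now);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            Ok(())
        } else {
            let deficit = 1.0 - self.tokens;
            Err(Duration::from_secs_f64(deficit / self.rate))
        }
    }
}
```

The "nasty surprise" behaviour described above falls out of this shape: while `tokens` is near `burst` a query sees no delay at all, but once the budget is drained every request pays roughly `1/rate` seconds.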

@problame (Contributor, author) commented Nov 3, 2023

The implementation is most likely going to be leaky-bucket based, maybe a hierarchical leaky bucket.

> Or we could give a larger allowance specifically after compute restart.

This is not going to happen as part of this PR, because knowing whether the compute is currently restarting requires integration with the control plane, which is already our bottleneck for other important features.
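For the "hierarchical leaky bucket" mentioned above, one possible (purely illustrative) composition is a shared per-tenant bucket plus a child bucket per endpoint or connection, reusing the hypothetical `TokenBucket` type from the sketch earlier in this thread:

```rust
use std::time::Instant;

// Illustrative two-level admission: a request needs a token from its child
// bucket (per endpoint/connection) AND from the shared per-tenant bucket.
fn admit(tenant: &mut TokenBucket, endpoint: &mut TokenBucket) -> bool {
    let now = Instant::now();
    // Check the child first so one saturated endpoint cannot drain the
    // tenant-wide budget that other endpoints share.
    if endpoint.try_acquire(now).is_err() {
        return false;
    }
    // A real implementation would return the child's token if the parent
    // rejects; omitted here for brevity.
    tenant.try_acquire(now).is_ok()
}
```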

@problame (Contributor, author):
The last pushes adjust the wording in this RFC to use "throttling" instead of "rate limiting", aligning with the terminology proposed by @hlinnaka in https://github.com/neondatabase/cloud/pull/7746.

@problame changed the title from "RFC: GetPage@LSN Rate Limit" to "RFC: Per-Tenant GetPage@LSN Throttling" on Nov 30, 2023
@problame (Contributor, author):
I have made this RFC part of #5899 and would like to freeze & merge it.

There are still some open comments, and there are no approvals.

Those in favor, please hit approve.

If you have open conversations, please resolve them or make clear what you'd like to see changed in this RFC before it's merged.

@hlinnaka (Contributor):
+1 for this overall.

In the long term, I think we should throttle per-endpoint rather than per-tenant. If you create 100 read-only endpoints on the same tenant, each one should get X amount of I/O capacity. One reason to create read-only endpoints is to scale horizontally, and if the I/O capacity doesn't also scale, it doesn't really achieve that.

I know that's harder to implement in the storage layer currently, as we don't distinguish between different endpoints there. So we can do per-tenant for now, and switch to per-endpoint later as a separate project.

5 review threads on docs/rfcs/029-getpage-throttling.md (resolved)
@problame merged commit c272c68 into main on Dec 19, 2023
41 checks passed
@problame deleted the problame/rfc-29-rate-limiting branch on December 19, 2023 at 10:20
@problame (Contributor, author):
Implementation epic: #5899

@problame mentioned this pull request on Feb 9, 2024
Labels
t/tech_design_rfc Issue type: tech design RFC
7 participants