Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Epic: per-tenant read path throttling #5899

Closed
17 of 18 tasks
problame opened this issue Nov 22, 2023 · 11 comments
Closed
17 of 18 tasks

Epic: per-tenant read path throttling #5899

problame opened this issue Nov 22, 2023 · 11 comments
Assignees
Labels
c/storage/pageserver Component: storage: pageserver c/storage Component: storage t/Epic Issue type: Epic

Comments

@problame
Copy link
Contributor

problame commented Nov 22, 2023

Motivation

See #5648

tl;dr: we currently serve reads (and write path => #7564) as fast as possible.
This sets the wrong incentives and poses operational and economic problems:

  • noisy neighbor issues: Pageservers are multi-tenant. But currently, a single tenant can max out the resources of a PS on both read and write path.
  • reads specifically: computes that should be increasing their DRAM can instead issue more GetPage requests (we currently don't charge for IOPS)

DoD

Pageserver artificially caps the per-tenant throughput on the read path.
I.e., to all upstream Neon components, this cap will appear to be the maximum read performance that you can get per tenant per pageserver.

The limits will be chosen such that a TBD (small single-digit) number of tenants can run at the limit. Discovery of the limit values is done through gradual rollout, conservative experimentation, and informed by benchmarks.

The upstream (compute) responds to the limit-induced backpressure efficiently, gracefully, and without risk of starvation.

There is enough observability to clearly disambiguate slowness induced by limiting from slowness caused by otherwise slow pageserver. This disambiguation must be on per-tenant (better: per-timeline) granularity.

The throttle are on-by-default and cannot be permanently overridden on a per-tenant basis.
I.e., the implementation need not be suited for productization as "performance tier" or "QoS" feature.

Interactions

Sharding: with sharding, above limits will be per shard instead of per tenant.
However, we may need to (re-)introduce per-tenant limits within a single pageserver process to incentivize placement of shards across different nodes for increased performance & load spreading.
However, that's subject to future work.

High-Level Plan

High Level

Preview Give feedback

Impl

Preview Give feedback
@problame problame added c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic c/storage Component: storage labels Nov 22, 2023
@problame problame assigned problame and unassigned problame Nov 22, 2023
@problame problame changed the title Epic: per-tenant rate limiting Epic: per-tenant throttling Nov 30, 2023
problame added a commit that referenced this issue Dec 4, 2023
…6000)

Problem
-------

Before this PR, there was no concurrency limit on initial logical size
computations.

While logical size computations are lazy in theory, in practice
(production), they happen in a short timeframe after restart.

This means that on a PS with 20k tenants, we'd have up to 20k concurrent
initial logical size calculation requests.

This is self-inflicted needless overload.

This hasn't been a problem so far because the `.await` points on the
logical size calculation path never return `Pending`, hence we have a
natural concurrency limit of the number of executor threads.
But, as soon as we return `Pending` somewhere in the logical size
calculation path, other concurrent tasks get scheduled by tokio.
If these other tasks are also logical size calculations, they eventually
pound on the same bottleneck.

For example, in #5479, we want to switch the VirtualFile descriptor
cache to a `tokio::sync::RwLock`, which makes us return `Pending`, and
without measures like this patch, after PS restart, VirtualFile
descriptor cache thrashes heavily for 2 hours until all the logical size
calculations have been computed and the degree of concurrency /
concurrent VirtualFile operations is down to regular levels.
See the *Experiment* section below for details.

<!-- Experiments (see below) show that plain #5479 causes heavy
thrashing of the VirtualFile descriptor cache.
The high degree of concurrency is too much for 
In the case of #5479 the VirtualFile descriptor cache size starts
thrashing heavily.


-->

Background
----------

Before this PR, initial logical size calculation was spawned lazily on
first call to `Timeline::get_current_logical_size()`.

In practice (prod), the lazy calculation is triggered by
`WalReceiverConnectionHandler` if the timeline is active according to
storage broker, or by the first iteration of consumption metrics worker
after restart (`MetricsCollection`).

The spawns by walreceiver are high-priority because logical size is
needed by Safekeepers (via walreceiver `PageserverFeedback`) to enforce
the project logical size limit.
The spawns by metrics collection are not on the user-critical path and
hence low-priority. [^consumption_metrics_slo]

[^consumption_metrics_slo]: We can't delay metrics collection
indefintely because there are TBD internal SLOs tied to metrics
collection happening in a timeline manner
(neondatabase/cloud#7408). But let's ignore
that in this issue.

The ratio of walreceiver-initiated spawns vs
consumption-metrics-initiated spawns can be reconstructed from logs
(`spawning logical size computation from context of task kind {:?}"`).
PR #5995 and #6018 adds metrics for this.

First investigation of the ratio lead to the discovery that walreceiver
spawns 75% of init logical size computations.
That's because of two bugs:
- In Safekeepers: #5993
- In interaction between Pageservers and Safekeepers:
#5962

The safekeeper bug is likely primarily responsible but we don't have the
data yet. The metrics will hopefully provide some insights.

When assessing production-readiness of this PR, please assume that
neither of these bugs are fixed yet.


Changes In This PR
------------------

With this PR, initial logical size calculation is reworked as follows:

First, all initial logical size calculation task_mgr tasks are started
early, as part of timeline activation, and run a retry loop with long
back-off until success. This removes the lazy computation; it was
needless complexity because in practice, we compute all logical sizes
anyways, because consumption metrics collects it.

Second, within the initial logical size calculation task, each attempt
queues behind the background loop concurrency limiter semaphore. This
fixes the performance issue that we pointed out in the "Problem" section
earlier.

Third, there is a twist to queuing behind the background loop
concurrency limiter semaphore. Logical size is needed by Safekeepers
(via walreceiver `PageserverFeedback`) to enforce the project logical
size limit. However, we currently do open walreceiver connections even
before we have an exact logical size. That's bad, and I'll build on top
of this PR to fix that
(#5963). But, for the
purposes of this PR, we don't want to introduce a regression, i.e., we
don't want to provide an exact value later than before this PR. The
solution is to introduce a priority-boosting mechanism
(`GetLogicalSizePriority`), allowing callers of
`Timeline::get_current_logical_size` to specify how urgently they need
an exact value. The effect of specifying high urgency is that the
initial logical size calculation task for the timeline will skip the
concurrency limiting semaphore. This should yield effectively the same
behavior as we had before this PR with lazy spawning.

Last, the priority-boosting mechanism obsoletes the `init_order`'s grace
period for initial logical size calculations. It's a separate commit to
reduce the churn during review. We can drop that commit if people think
it's too much churn, and commit it later once we know this PR here
worked as intended.

Experiment With #5479 
---------------------

I validated this PR combined with #5479 to assess whether we're making
forward progress towards asyncification.

The setup is an `i3en.3xlarge` instance with 20k tenants, each with one
timeline that has 9 layers.
All tenants are inactive, i.e., not known to SKs nor storage broker.
This means all initial logical size calculations are spawned by
consumption metrics `MetricsCollection` task kind.
The consumption metrics worker starts requesting logical sizes at low
priority immediately after restart. This is achieved by deleting the
consumption metrics cache file on disk before starting
PS.[^consumption_metrics_cache_file]

[^consumption_metrics_cache_file] Consumption metrics worker persists
its interval across restarts to achieve persistent reporting intervals
across PS restarts; delete the state file on disk to get predictable
(and I believe worst-case in terms of concurrency during PS restart)
behavior.

Before this patch, all of these timelines would all do their initial
logical size calculation in parallel, leading to extreme thrashing in
page cache and virtual file cache.

With this patch, the virtual file cache thrashing is reduced
significantly (from 80k `open`-system-calls/second to ~500
`open`-system-calls/second during loading).


### Critique

The obvious critique with above experiment is that there's no skipping
of the semaphore, i.e., the priority-boosting aspect of this PR is not
exercised.

If even just 1% of our 20k tenants in the setup were active in
SK/storage_broker, then 200 logical size calculations would skip the
limiting semaphore immediately after restart and run concurrently.

Further critique: given the two bugs wrt timeline inactive vs active
state that were mentioned in the Background section, we could have 75%
of our 20k tenants being (falsely) active on restart.

So... (next section)

This Doesn't Make Us Ready For Async VirtualFile
------------------------------------------------

This PR is a step towards asynchronous `VirtualFile`, aka, #5479 or even
#4744.

But it doesn't yet enable us to ship #5479.

The reason is that this PR doesn't limit the amount of high-priority
logical size computations.
If there are many high-priority logical size calculations requested,
we'll fall over like we did if #5479 is applied without this PR.
And currently, at very least due to the bugs mentioned in the Background
section, we run thousands of high-priority logical size calculations on
PS startup in prod.

So, at a minimum, we need to fix these bugs.

Then we can ship #5479 and #4744, and things will likely be fine under
normal operation.

But in high-traffic situations, overload problems will still be more
likely to happen, e.g., VirtualFile cache descriptor thrashing.
The solution candidates for that are orthogonal to this PR though:
* global concurrency limiting
* per-tenant rate limiting => #5899
* load shedding
* scaling bottleneck resources (fd cache size (neondatabase/cloud#8351),
page cache size(neondatabase/cloud#8351), spread load across more PSes,
etc)

Conclusion
----------

Even with the remarks from in the previous section, we should merge this
PR because:
1. it's an improvement over the status quo (esp. if the aforementioned
bugs wrt timeline active / inactive are fixed)
2. it prepares the way for
#6010
3. it gets us close to shipping #5479 and #4744
problame added a commit that referenced this issue Dec 19, 2023
problame added a commit that referenced this issue Jan 11, 2024
)

This reverts commit ab1f37e.
Thereby
fixes #5479

Updated Analysis
================

The problem with the original patch was that it, for the first time,
exposed the `VirtualFile` code to tokio task concurrency instead of just
thread-based concurrency. That caused the VirtualFile file descriptor
cache to start thrashing, effectively grinding the system to a halt.

Details
-------

At the time of the original patch, we had a _lot_ of runnable tasks in
the pageserver.
The symptom that prompted the revert (now being reverted in this PR) is
that our production systems fell into a valley of zero goodput, high
CPU, and zero disk IOPS shortly after PS restart.
We lay out the root cause for that behavior in this subsection.

At the time, there was no concurrency limit on the number of concurrent
initial logical size calculations.
Initial size calculation was initiated for all timelines within the
first 10 minutes as part of consumption metrics collection.
On a PS with 20k timelines, we'd thus have 20k runnable tasks.

Before the original patch, the `VirtualFile` code never returned
`Poll::Pending`.
That meant that once we entered it, the calling tokio task would not
yield to the tokio executor until we were done performing the
VirtualFile operation, i.e., doing a blocking IO system call.

The original patch switched the VirtualFile file descriptor cache's
synchronization primitives to those from `tokio::sync`.
It did not change that we were doing synchronous IO system calls.
And the cache had more slots than we have tokio executor threads.
So, these primitives never actually needed to return `Poll::Pending`.
But, the tokio scheduler makes tokio sync primitives return `Pending`
*artificially*, as a mechanism for the scheduler to get back into
control more often
([example](https://docs.rs/tokio/1.35.1/src/tokio/sync/batch_semaphore.rs.html#570)).

So, the new reality was that VirtualFile calls could now yield to the
tokio executor.
Tokio would pick one of the other 19999 runnable tasks to run.
These tasks were also using VirtualFile.
So, we now had a lot more concurrency in that area of the code.

The problem with more concurrency was that caches started thrashing,
most notably the VirtualFile file descriptor cache: each time a task
would be rescheduled, it would want to do its next VirtualFile
operation. For that, it would first need to evict another (task's)
VirtualFile fd from the cache to make room for its own fd. It would then
do one VirtualFile operation before hitting an await point and yielding
to the executor again. The executor would run the other 19999 tasks for
fairness before circling back to the first task, which would find its fd
evicted.

The other cache that would theoretically be impacted in a similar way is
the pageserver's `PageCache`.
However, for initial logical size calculation, it seems much less
relevant in experiments, likely because of the random access nature of
initial logical size calculation.

Fixes
=====

We fixed the above problems by
- raising VirtualFile cache sizes
  - neondatabase/cloud#8351
- changing code to ensure forward-progress once cache slots have been
acquired
  - #5480
  - #5482
  - tbd: #6065
- reducing the amount of runnable tokio tasks
  - #5578
  - #6000
- fix bugs that caused unnecessary concurrency induced by connection
handlers
  - #5993

I manually verified that this PR doesn't negatively affect startup
performance as follows:
create a pageserver in production configuration, with 20k
tenants/timelines, 9 tiny L0 layer files each; Start it, and observe

```
INFO Startup complete (368.009s since start) elapsed_ms=368009
```

I further verified in that same setup that, when using `pagebench`'s
getpage benchmark at as-fast-as-possible request rate against 5k of the
20k tenants, the achieved throughput is identical. The VirtualFile cache
isn't thrashing in that case.

Future Work
===========

We will still exposed to the cache thrashing risk from outside factors,
e.g., request concurrency is unbounded, and initial size calculation
skips the concurrency limiter when we establish a walreceiver
connection.

Once we start thrashing, we will degrade non-gracefully, i.e., encounter
a valley as was seen with the original patch.

However, we have sufficient means to deal with that unlikely situation:
1. we have dashboards & metrics to monitor & alert on cache thrashing
2. we can react by scaling the bottleneck resources (cache size) or by
manually shedding load through tenant relocation

Potential systematic solutions are future work:
* global concurrency limiting
* per-tenant rate limiting => #5899
* pageserver-initiated load shedding

Related Issues
==============

This PR unblocks the introduction of tokio-epoll-uring for asynchronous
disk IO ([Epic](#4744)).
@problame problame self-assigned this Jan 22, 2024
@jcsp
Copy link
Collaborator

jcsp commented Feb 5, 2024

Status:

  • bumped last week by walredo latency
  • will start after io_uring stuff is wrapped.

problame added a commit that referenced this issue Feb 6, 2024
Will need this to validate per-tenant throttling in
#5899
@jcsp
Copy link
Collaborator

jcsp commented Feb 12, 2024

Initial draft PR is up for review -- could land this week.

Testing:

  • In staging we can enable throttling for everyone? Benchmarks will be unhappy. So we can set the limit to something very close to what the benchmarks currently do at peak.
  • Need to monitor impact on CPU usage from the overhead of throttling, while realizing that throttling will also decrease CPU usage due to lower ops/s

@jcsp
Copy link
Collaborator

jcsp commented Feb 19, 2024

Status:

  • Initial code landed on Friday
  • Next: enable for all tenants in staging, watch for ~2 days and overlap with Peter's bench.
  • If staging is OK, identify troublesome tenants in prod (k8s pod small ones) + work with compute/autoscaling team to remediate them. Target free tier users who are generating outsized I/O.

problame added a commit that referenced this issue Feb 22, 2024
…age (#6869)

The `refill_interval` switched from a milliseconds usize to a Duration
during a review follow-up, hence this slipped through manual testing.

Part of #5899
@jcsp
Copy link
Collaborator

jcsp commented Feb 26, 2024

This week:

  • Figure out our plan for making our latency metrics exclude the throttling time, and other metrics to measure impact/activity of throttling (currently just have logs)
  • Go with 20k in prod, with 40k burst limit: recognizing that this is higher than we want it to be ultimately, but we don't want to clamp things like full postgres restarts to a lower 5k limit.
  • notion page explaining how to adjust the throttle for a tenant in the field.

problame added a commit that referenced this issue Feb 29, 2024
…questContextAdaptor` uses it (#6961)

Extracted from #6953

Part of #5899
problame added a commit that referenced this issue Mar 1, 2024
…in callers (#6960)

Extracted from #6953

Part of #5899

Core Change
-----------

In #6953, we need the ability to scan the log _after_ a specific line
and ignore anything before that line.

This PR changes `log_contains` to returns a tuple of `(matching line,
cursor)`.
Hand that cursor to a subsequent `log_contains` call to search the log
for the next occurrence of the pattern.

Other Changes
-------------

- Inspect all the callsites of `log_contains` to handle the new tuple
return type.
- Above inspection unveiled many callers aren't using `assert
log_contains(...) is not None` but some weaker version of the code that
breaks if `log_contains` ever returns a not-None but falsy value. Fix
that.
- Above changes unveiled that `test_remote_storage_upload_queue_retries`
was using `wait_until` incorrectly; after fixing the usage, I had to
raise the `wait_until` timeout. So, maybe this will fix its flakiness.
@problame
Copy link
Contributor Author

problame commented Mar 4, 2024

problame added a commit that referenced this issue Mar 5, 2024
… metrics + regression test (#6953)

part of #5899

Problem
-------

Before this PR, the time spent waiting on the throttle was charged
towards the higher-level page_service metrics, i.e.,
`pageserver_smgr_query_seconds`.
The metrics are the foundation of internal SLIs / SLOs.
A throttled tenant would cause the SLI to degrade / SLO alerts to fire.

Changes
-------


- don't charge time spent in throttle towards the page_service metrics
- record time spent in throttle in RequestContext and subtract it from
the elapsed time
- this works because the page_service path doesn't create child context,
so, all the throttle time is recorded in the parent
- it's quite brittle and will break if we ever decide to spawn child
tasks that need child RequestContexts, which would have separate
instances of the `micros_spent_throttled` counter.
- however, let's punt that to a more general refactoring of
RequestContext
- add a test case that ensures that
- throttling happens for getpage requests; this aspect of the test
passed before this PR
- throttling delays aren't charged towards the page_service metrics;
this aspect of the test only passes with this PR
- drive-by: make the throttle log message `info!`, it's an expected
condition

Performance
-----------

I took the same measurements as in #6706 , no meaningful change in CPU
overhead.

Future Work
-----------

This PR enables us to experiment with the throttle for select tenants
without affecting the SLI metrics / triggering SLO alerts.

Before declaring this feature done, we need more work to happen,
specifically:

- decide on whether we want to retain the flexibility of throttling any
`Timeline::get` call, filtered by TaskKind
- versus: separate throttles for each page_service endpoint, potentially
with separate config options
- the trouble here is that this decision implies changes to the
TenantConfig, so, if we start using the current config style now, then
decide to switch to a different config, it'll be a breaking change

Nice-to-haves but probably not worth the time right now:

- Equivalent tests to ensure the throttle applies to all other
page_service handlers.
@jcsp
Copy link
Collaborator

jcsp commented Mar 11, 2024

Status:

  • We alert if a tenant is throttled on >2% of requests
    • Actions on the alerts:
      • if a client is a legacy pod that should be migrated to an autoscaled VM, migrate it
      • if a client is legitimately high throughput and large: shard it.
  • Reconciling with vecfored get changes to ensure we aren't double throttling.

@problame
Copy link
Contributor Author

Apart from

Reconciling with vecfored get changes to ensure we aren't double throttling.

nothing happened last week.

This week:

  • investigate Sentry panic from prod
  • notion page explaining how to adjust the throttle for a tenant in the field.
  • examine amount of throttling in prod => take action (see previous comment)

@problame
Copy link
Contributor Author

problame commented Mar 25, 2024

Status update:

This week:

  • examination of throttling in prod

@problame
Copy link
Contributor Author

Status update:

  • no progress this week

@problame problame changed the title Epic: per-tenant throttling Epic: per-tenant read path throttling Apr 30, 2024
@problame
Copy link
Contributor Author

I split off the write throttling aspect of this epic into a separate draft epic: #7564

(We do not expect to work on write throttling this quarter)

@problame
Copy link
Contributor Author

Closing this epic, the development work has finished long ago.

The last item

enable gradually, starting with high default, then going downwards to a value where we want to be

was and still is dependent on sharding + sharded ingest rollout, so that users who hit the throttle have an option to acquire more IOPS through sharding as needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver c/storage Component: storage t/Epic Issue type: Epic
Projects
None yet
Development

No branches or pull requests

2 participants