-
Notifications
You must be signed in to change notification settings - Fork 482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test_runner/performance: add sharded ingest benchmark #9591
Conversation
@VladLazar @jcsp Initial local results without sharded ingest:
Ingestion is IO bound with 1 shard (i.e. Safekeeper is bottleneck), and pageserver ingestion is single-threaded so can use at most 1 CPU core. As long as pageservers have CPU headroom, the additional work due to sharding does not affect throughput and remains IO bound (i.e. Safekeeper-bound). Only when we saturate the CPU at ~10 shards does throughput drop, and then drops ~linearly up to 32 shards as expected. This implies that we won't see ingestion throughput improvements unless pageservers see CPU exhaustion. We will see a reduction in CPU usage though (and thus potential cost savings). However, we may see throughput increases if we saturate the bandwidth capacity of the Safekeepers too (bandwidth use currently scales linearly with shards). Just to set expectations for throughput improvements @jcsp. |
Btw, writing this as a Rust in-memory benchmark wasn't practical due to the tight code coupling, so I opted for a Python benchmark. A Pageserver WAL receiver pulls in most Pageserver infrastructure via the timeline/tenant, and all of this had to be made public for construction from benchmark binaries. We should add appropriate trait abstraction boundaries to allow constructing components in isolation for tests/benchmarks, but that's more work that it's worth right now. |
5328 tests run: 5106 passed, 0 failed, 222 skipped (full report)Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
7820750 at 2024-11-02T16:35:33.227Z :recycle: |
## Problem `tenant_get_shards()` does not work with a sharded tenant on 1 pageserver, as it assumes an unsharded tenant in this case. This special case appears to have been added to handle e.g. `test_emergency_mode`, where the storage controller is stopped. This breaks e.g. the sharded ingest benchmark in #9591 when run with a single shard. ## Summary of changes Correctly look up shards even with a single pageserver, but add a special case that assumes an unsharded tenant if the storage controller is stopped and the caller provides an explicit pageserver, in order to accomodate `test_emergency_mode`.
This isn't entirely accurate. With 1 shard, the Safekeeper runs at 70% CPU while the Pageserver runs at 170% CPU -- no significant IO-wait. I think this indicates that we may get a 2x speedup with 2 shards, but it'll flatten out there (unless CPU is saturated). We'll find out once sharded ingest is implemented. |
Problem
We need a benchmark to quantify the improvement of sharded ingestion.
Resolves #9569.
Summary of changes
Adds a Python benchmark for sharded ingestion. This ingests 7 GB of WAL (100M rows) into a Safekeeper and fans out to 10 shards running on 10 different pageservers. The ingest volume and duration is recorded.
Checklist before requesting a review
Checklist before merging