
test_runner/performance: add sharded ingest benchmark #9591

Merged
merged 6 commits into main from erik/sharded-ingest-benchmark on Nov 2, 2024

Conversation

erikgrinaker
Contributor

Problem

We need a benchmark to quantify the improvement of sharded ingestion.

Resolves #9569.

Summary of changes

Adds a Python benchmark for sharded ingestion. It ingests 7 GB of WAL (100M rows) into a Safekeeper, which fans the WAL out to 10 shards running on 10 different pageservers. The ingest volume and duration are recorded.
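As a back-of-the-envelope sanity check on those parameters (an illustrative calculation, not from the PR itself), 7 GB of WAL across 100M rows works out to roughly 75 bytes of WAL per row:

```python
# Bytes of WAL generated per inserted row in the benchmark.
# Assumption: 7 GB means 7 * 2**30 bytes.
rows = 100_000_000
wal_bytes = 7 * 2**30
wal_per_row = wal_bytes / rows
print(f"~{wal_per_row:.0f} bytes of WAL per row")  # ~75 bytes
```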

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@erikgrinaker
Contributor Author

erikgrinaker commented Oct 31, 2024

@VladLazar @jcsp Initial local results without sharded ingest:

MacBook Pro M3, 11 cores, 7 GB WAL (100M rows)

1 shard: 41.382s
2 shards: 40.805s
4 shards: 43.169s
10 shards: 61.436s
32 shards: 148.067s

Ingestion is IO bound with 1 shard (i.e. the Safekeeper is the bottleneck), and pageserver ingestion is single-threaded, so it can use at most 1 CPU core. As long as pageservers have CPU headroom, the additional work due to sharding does not affect throughput: ingestion remains IO bound (i.e. Safekeeper-bound). Only when we saturate the CPU at ~10 shards does throughput drop, falling ~linearly up to 32 shards, as expected.
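Converting those wall-clock times into throughput makes the shape clearer (a quick derivation from the numbers above, assuming 7 GB = 7 × 1024 MB):

```python
# Ingest throughput implied by the local results above
# (each run ingests the same 7 GB of WAL).
runs = {1: 41.382, 2: 40.805, 4: 43.169, 10: 61.436, 32: 148.067}
for shards, secs in runs.items():
    mb_s = 7 * 1024 / secs
    print(f"{shards:>2} shards: {mb_s:6.1f} MB/s")
```

Throughput is flat at ~170 MB/s through 4 shards, then degrades once pageserver CPU saturates.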

This implies that we won't see ingestion throughput improvements unless pageservers are hitting CPU exhaustion. We will see a reduction in CPU usage, though (and thus potential cost savings). However, we may also see throughput increases if we saturate the Safekeepers' bandwidth capacity (bandwidth use currently scales linearly with the shard count). Just to set expectations for throughput improvements, @jcsp.

@erikgrinaker
Contributor Author

erikgrinaker commented Oct 31, 2024

Btw, writing this as an in-memory Rust benchmark wasn't practical due to the tight code coupling, so I opted for a Python benchmark. A Pageserver WAL receiver pulls in most of the Pageserver infrastructure via the timeline/tenant, and all of this would have had to be made public for construction from benchmark binaries. We should add appropriate trait abstraction boundaries to allow constructing components in isolation for tests/benchmarks, but that's more work than it's worth right now.


github-actions bot commented Oct 31, 2024

5328 tests run: 5106 passed, 0 failed, 222 skipped (full report)


Flaky tests (1)

Postgres 17

Code coverage* (full report)

  • functions: 31.5% (7771 of 24690 functions)
  • lines: 48.9% (61030 of 124696 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results: 7820750 at 2024-11-02T16:35:33.227Z :recycle:

erikgrinaker added a commit that referenced this pull request Nov 1, 2024
## Problem

`tenant_get_shards()` does not work with a sharded tenant on 1
pageserver, as it assumes an unsharded tenant in this case. This special
case appears to have been added to handle e.g. `test_emergency_mode`,
where the storage controller is stopped. This breaks e.g. the sharded
ingest benchmark in #9591 when run with a single shard.

## Summary of changes

Correctly look up shards even with a single pageserver, but add a
special case that assumes an unsharded tenant if the storage controller
is stopped and the caller provides an explicit pageserver, in order to
accommodate `test_emergency_mode`.
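A hedged sketch of the lookup behavior that commit describes (illustrative Python only; the function name, signature, and types are not the actual `tenant_get_shards()` API in the test fixtures):

```python
# Sketch of the fixed lookup logic: resolve real shards even when the
# whole tenant lives on a single pageserver, and only fall back to the
# "unsharded" assumption when the storage controller is stopped and the
# caller named a pageserver explicitly (the test_emergency_mode case).
# All names here are illustrative, not the real fixture API.
def tenant_get_shards_sketch(shard_map, controller_up, explicit_pageserver=None):
    """shard_map: {shard_id: pageserver_id}, as the controller reports it."""
    if not controller_up:
        if explicit_pageserver is None:
            raise RuntimeError("storage controller stopped and no pageserver given")
        return [(None, explicit_pageserver)]  # assume an unsharded tenant
    # Normal path: return the real shard layout, even with one pageserver.
    return sorted(shard_map.items())

# A sharded tenant on one pageserver now resolves all shards correctly.
print(tenant_get_shards_sketch({"0002": "ps1", "0102": "ps1"}, controller_up=True))
# Emergency mode: controller down, caller names the pageserver explicitly.
print(tenant_get_shards_sketch({}, controller_up=False, explicit_pageserver="ps1"))
```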
@erikgrinaker erikgrinaker enabled auto-merge (squash) November 2, 2024 15:43
@erikgrinaker erikgrinaker merged commit 0058eb0 into main Nov 2, 2024
80 checks passed
@erikgrinaker erikgrinaker deleted the erik/sharded-ingest-benchmark branch November 2, 2024 16:42
@erikgrinaker
Contributor Author

> Ingestion is IO bound with 1 shard (i.e. Safekeeper is bottleneck), and pageserver ingestion is single-threaded so can use at most 1 CPU core. As long as pageservers have CPU headroom, the additional work due to sharding does not affect throughput and remains IO bound (i.e. Safekeeper-bound).

This isn't entirely accurate. With 1 shard, the Safekeeper runs at 70% CPU while the Pageserver runs at 170% CPU -- no significant IO-wait. I think this indicates that we may get a 2x speedup with 2 shards, but it'll flatten out there (unless CPU is saturated). We'll find out once sharded ingest is implemented.

Successfully merging this pull request may close the following issue: safekeeper: add WAL ingestion benchmarks with pageserver fan-out (#9569).