Epic: getpage@lsn benchmark #5771

Open
44 of 47 tasks
problame opened this issue Nov 2, 2023 · 2 comments
Assignees
Labels
c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic

Comments

@problame
Contributor

problame commented Nov 2, 2023

Motivation

https://www.notion.so/neondatabase/Test-Tools-633ddee0cf1d4e6b9962ff5df433a27d?pvs=4

DoD

There is a benchmark for getpage@lsn performance that does not require a compute to run for benchmark execution (ok for benchmark setup).

The benchmark is run as part of the performance regression tests

  • nightly against staging
  • as part of the pre-release testing

The benchmark results are reproducible for Pageserver developers.

Pageserver developers get alerted about perf regressions.

Tasks

High Level


Impl

  1. problame
  2. c/storage/pageserver t/bug
  3. c/storage/pageserver t/bug
    problame
  4. 1 of 1
    bayandin
  5. c/storage/pageserver
    problame
  6. c/storage/pageserver m/good_first_issue
  7. jcsp
  8. run-benchmarks
  9. 4 of 6
  10. bayandin
  11. a/ci a/performance c/infra c/storage/pageserver
    Bodobolero

Follow-Ups

  1. c/storage/pageserver
    problame
@problame problame self-assigned this Nov 2, 2023
@problame problame added c/storage/pageserver Component: storage: pageserver t/Epic Issue type: Epic labels Nov 2, 2023
problame added a commit that referenced this issue Nov 8, 2023
problame added a commit that referenced this issue Nov 22, 2023
(part of the getpage benchmarking epic #5771)

The plan is to make the benchmarking tool log on stderr and emit results
as JSON on stdout. That way, the test suite can simply capture
stdout and json.loads() it, while interactive users of the benchmarking
tool have a reasonable experience as well.

Existing logging users continue to print to stdout, so, this change
should be a no-op functionally and performance-wise.
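
A minimal sketch of the consumer side of that plan, assuming only what the commit message states (results as JSON on stdout, logs on stderr). The `run_benchmark` helper and the `FAKE_TOOL` stand-in command are hypothetical and are not part of the test suite or of the benchmarking tool:

```python
import json
import subprocess
import sys


def run_benchmark(cmd: list[str]) -> dict:
    """Run a command, parse its stdout as JSON, and let its stderr pass through."""
    completed = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,  # results: captured so they can be parsed
        stderr=None,             # logs: inherited, so a human still sees them
        check=True,              # raise if the tool exits non-zero
        text=True,
    )
    return json.loads(completed.stdout)


# Hypothetical stand-in for the benchmarking tool: it logs to stderr
# and writes a single JSON document to stdout.
FAKE_TOOL = [
    sys.executable,
    "-c",
    "import sys, json; "
    "print('starting', file=sys.stderr); "
    "json.dump({'requests': 3}, sys.stdout)",
]

if __name__ == "__main__":
    print(run_benchmark(FAKE_TOOL))
```

Keeping stdout free of log lines is what makes the single `json.loads()` call sufficient; any log output mixed into stdout would make the parse fail.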
problame added a commit that referenced this issue Nov 28, 2023
problame added a commit that referenced this issue Nov 30, 2023
problame added a commit that referenced this issue Dec 13, 2023
Part of getpage@lsn benchmark epic: #5771
problame added a commit that referenced this issue Dec 13, 2023
Part of getpage@lsn benchmark epic: #5771
problame added a commit that referenced this issue Dec 14, 2023
Part of getpage@lsn benchmark epic:
#5771

This PR moves the control plane's spread-all-over-the-place client for
the pageserver management API into a separate module within the
pageserver crate.

It also switches to the async version of reqwest, which I think is
generally the right direction, and I need an async client API in the
benchmark epic.
problame added a commit that referenced this issue Dec 14, 2023
Part of getpage@lsn benchmark epic:
#5771
problame added a commit that referenced this issue Dec 15, 2023
Part of getpage@lsn benchmark epic:
#5771

This PR moves the control plane's spread-all-over-the-place client for
the pageserver management API into a separate module within the
pageserver crate.

I need that client to be async in my benchmarking work, so, this PR
switches to the async version of `reqwest`.
That is also the right direction generally IMO.

The switch to async in turn mandated converting most of the
`control_plane/` code to async.

Note that some of the client methods should be taking `TenantShardId`
instead of `TenantId`, but, none of the callers seem to be
sharding-aware.
Leaving that for another time:
#6154
problame added a commit that referenced this issue Dec 16, 2023
problame added a commit that referenced this issue Dec 18, 2023
Part of getpage@lsn benchmark epic:
#5771

Stacked atop #6145
problame added a commit that referenced this issue Dec 19, 2023
Part of getpage@lsn benchmark epic:
#5771

This allows getting the list of tenants and timelines without triggering
initial logical size calculation by requesting the timeline details API
response, which would skew our results.
problame added a commit that referenced this issue Dec 21, 2023
This PR adds a component-level benchmarking utility for pageserver.
Its name is `pagebench`.

The problem solved by `pagebench` is that we want to put Pageserver
under high load.

This isn't easily achieved with `pgbench` because it needs to go through
a compute, which has significant performance overhead compared to
accessing Pageserver directly.

Further, compute has its own performance optimizations (most
importantly: caches). Instead of designing a compute-facing workload
that defeats those internal optimizations, `pagebench` simply bypasses
them by accessing pageserver directly.

Supported benchmarks:

* getpage@latest_lsn
* basebackup
* triggering logical size calculation

This code has no automated users yet.
A performance regression test for getpage@latest_lsn will be added in a
later PR.

part of #5771
problame added a commit that referenced this issue Jan 8, 2024
Part of #5771
Extracted from #6214

This PR makes the test suite sensitive to the new env var
`NEON_ENV_BUILDER_FROM_REPO_DIR_USE_OVERLAYFS`.
If it is set, `NeonEnvBuilder.from_repo_dir` uses overlayfs
to duplicate the snapshot repo dir contents.

Since mounting requires root privileges, we use sudo to perform
the mounts. That, and macOS support, is also why copytree remains
the default.

If we ever run on a filesystem with copy reflink support, we should
consider that as an alternative.

This PR can be tried on a Linux machine on the
`test_backward_compatiblity` test. I took the opportunity to create a
session-scoped fixture for the
compatibility snapshot directory, as a hint to where I hope the
remainder of #6214 will evolve.
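
The choice between overlayfs and copytree described above could look roughly like the following. This is a sketch under assumptions, not the code from the PR: the `duplicate_repo_dir` helper, its parameters, and the `upper`/`work` directory layout are hypothetical; only the env var name, the use of sudo for mounting, and copytree as the default come from the commit message.

```python
import os
import shutil
import subprocess
from pathlib import Path

ENV_VAR = "NEON_ENV_BUILDER_FROM_REPO_DIR_USE_OVERLAYFS"


def duplicate_repo_dir(src: Path, dst: Path, scratch: Path) -> str:
    """Make the contents of src available at dst; return the strategy used."""
    if ENV_VAR in os.environ:
        # overlayfs: src stays untouched as the read-only lower layer,
        # and all writes made through dst land in the upper directory.
        upper = scratch / "upper"  # hypothetical layout
        work = scratch / "work"    # overlayfs needs an empty work dir
        for d in (upper, work, dst):
            d.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            [
                "sudo", "mount", "-t", "overlay", "overlay",
                "-o", f"lowerdir={src},upperdir={upper},workdir={work}",
                str(dst),
            ],
            check=True,
        )
        return "overlayfs"
    # Default: a full copy, which needs no privileges and works on macOS.
    shutil.copytree(src, dst)
    return "copytree"
```

The overlay mount avoids copying the snapshot at all, which is the point when the snapshot is large; the matching unmount at teardown is omitted here.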
problame added a commit that referenced this issue Jan 9, 2024
Part of #5771
Extracted from #6214

This PR makes the test suite sensitive to the new env var
`NEON_ENV_BUILDER_FROM_REPO_DIR_USE_OVERLAYFS`.
If it is set, `NeonEnvBuilder.from_repo_dir` uses overlayfs
to duplicate the snapshot repo dir contents.

Since mounting requires root privileges, we use sudo to perform
the mounts. That, and macOS support, is also why copytree remains
the default.

If we ever run on a filesystem with copy reflink support, we should
consider that as an alternative.

This PR can be tried on a Linux machine on the
`test_backward_compatiblity` test, which uses `from_repo_dir`.
problame added a commit that referenced this issue Jan 12, 2024
problame added a commit that referenced this issue Jan 12, 2024
…rking (#6350)

Been using this all the time in
#6214

Part of #5771

Should consider this in #6297
@jcsp
Collaborator

jcsp commented Mar 11, 2024

Status from Alexander:

@jcsp
Collaborator

jcsp commented Sep 23, 2024

@bayandin this one has been stuck as In Progress for a while?


3 participants