
page_service: getpage batching: refactor & minor fixes #9792

Closed
wants to merge 18 commits

Conversation

problame (Contributor) commented Nov 18, 2024

This PR refactors the page_service server-side batching code that was recently added in
#9377.

Changes:

  • Store the carried-over batching state in a local variable inside handle_pagestream, instead of in a field on PageServerHandler. This is more robust because it systematically rules out bugs where stale batching state lingers on the handler across calls.
  • Rename next_batch to carry.
  • When starting a new batch read, take into account the time carried over from the previous call.
  • Build the batch through a &mut Option<Carry> that points at a local variable (see the sketch below).
    • Before, we had to make sure to move the batch back into self.next_batch whenever we bailed.
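
To make the shape of this refactor concrete, here is a minimal, self-contained sketch. It is not the pageserver's actual code; `Carry`, `Msg`, `recv_with_deadline`, and `process_batch` are hypothetical stand-ins. The point is only that the carried batch lives in a local `Option<Carry>` inside the handler loop, so bailing early cannot leave stale state on the handler struct:

```rust
use std::time::{Duration, Instant};

struct Msg; // stand-in for a parsed getpage request

struct Carry {
    batch: Vec<Msg>,
    started_at: Instant, // when this batch's window opened (carried across iterations)
}

// Stand-in for reading the next request off the connection; `None` means
// the batching window expired with no further message.
fn recv_with_deadline(_deadline: Instant) -> Option<Msg> {
    Some(Msg)
}

// Stand-in for issuing one vectored read for the whole batch.
fn process_batch(_batch: Vec<Msg>) {}

fn handle_pagestream(batch_timeout: Duration) {
    // Carried-over batching state lives in a local, not in a struct field,
    // so an early return cannot leave stale state behind on the handler.
    let mut carry: Option<Carry> = None;

    for _ in 0..3 {
        // The window deadline honours time already spent building the carried batch.
        let deadline = match &carry {
            Some(c) => c.started_at + batch_timeout,
            None => Instant::now() + batch_timeout,
        };

        // Build the batch through `&mut Option<Carry>` pointing at the local;
        // there is no `self.next_batch` to remember to restore when bailing.
        match recv_with_deadline(deadline) {
            Some(msg) => carry
                .get_or_insert_with(|| Carry {
                    batch: Vec::new(),
                    started_at: Instant::now(),
                })
                .batch
                .push(msg),
            None => {
                if let Some(full) = carry.take() {
                    process_batch(full.batch);
                }
            }
        }
    }
    // Flush anything still carried when the loop ends.
    if let Some(full) = carry.take() {
        process_batch(full.batch);
    }
}

fn main() {
    handle_pagestream(Duration::from_micros(100));
}
```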

github-actions bot commented Nov 18, 2024

5535 tests run: 5309 passed, 0 failed, 226 skipped (full report)


Flaky tests (1)

Postgres 15

Code coverage* (full report)

  • functions: 31.4% (7951 of 25322 functions)
  • lines: 49.3% (63084 of 127831 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results.
Last update: 058b35f at 2024-11-21T11:30:41.601Z

problame added a commit that referenced this pull request Nov 18, 2024
## Problem

We don't take advantage of queue depth generated by the compute
on the pageserver. We can process getpage requests more efficiently
by batching them. 

## Summary of changes

Batch up incoming getpage requests that arrive within a configurable
time window (`server_side_batch_timeout`).
Then process the entire batch via one `get_vectored` timeline operation.
By default, no merging takes place.
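
As a rough illustration of that flow (the types and functions below, including `GetPageRequest`, `Timeline::get_vectored`, and `drain_window`, are simplified placeholders rather than the real pageserver API), requests collected within one batching window are resolved through a single vectored call:

```rust
use std::time::{Duration, Instant};

struct GetPageRequest {
    key: u64,
}

struct Timeline;

impl Timeline {
    /// Stand-in for the vectored read: one call resolves many keys at once.
    fn get_vectored(&self, keys: &[u64]) -> Vec<Vec<u8>> {
        keys.iter().map(|_| vec![0u8; 8192]).collect()
    }
}

/// Collect every request that shows up before the batching window closes.
fn drain_window(incoming: &mut Vec<GetPageRequest>, window: Duration) -> Vec<GetPageRequest> {
    let deadline = Instant::now() + window;
    let mut batch = Vec::new();
    while Instant::now() < deadline {
        match incoming.pop() {
            Some(req) => batch.push(req),
            None => break,
        }
    }
    batch
}

fn main() {
    let timeline = Timeline;
    let mut incoming = vec![GetPageRequest { key: 1 }, GetPageRequest { key: 2 }];
    // `server_side_batch_timeout`-style window; the value here is illustrative.
    let batch = drain_window(&mut incoming, Duration::from_micros(100));
    let keys: Vec<u64> = batch.iter().map(|r| r.key).collect();
    // One timeline operation for the whole batch.
    let pages = timeline.get_vectored(&keys);
    assert_eq!(pages.len(), keys.len());
}
```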

## Testing

* **Functional**: #9792
* **Performance**: will be done in staging/pre-prod

# Refs

* #9377
* #9376

Co-authored-by: Christian Schwarz <christian@neon.tech>
Base automatically changed from vlad/pageserver-merge-get-page-requests to main November 18, 2024 20:24
@problame problame changed the title from "WIP: page_service: add basic testcase for merging" to "WIP: page_service: add basic benchmark" Nov 18, 2024
@problame problame changed the title from "WIP: page_service: add basic benchmark" to "WIP: page_service: add basic test & benchmark" Nov 18, 2024
The steps in the test work in neon_local + psql
but for some reason they don't work in the test.

Asked compute team on Slack for help:
https://neondb.slack.com/archives/C04DGM6SMTM/p1731952688386789
@problame problame force-pushed the problame/merge-getpage-test branch from a149e89 to 15e21c7 on November 18, 2024 22:57
@problame problame changed the title from "WIP: page_service: add basic test & benchmark" to "WIP: page_service: add basic sequential scan benchmark" Nov 18, 2024
@problame problame changed the title from "WIP: page_service: add basic sequential scan benchmark" to "WIP: batching: bugfixes & benchmark" Nov 19, 2024
@problame problame changed the base branch from main to problame/batching-benchmark November 20, 2024 13:24
@problame problame changed the title from "WIP: batching: bugfixes & benchmark" to "page_service: refactor & minor fixes for batching code" Nov 20, 2024
@problame problame requested a review from VladLazar November 20, 2024 13:36
@problame problame changed the title from "page_service: refactor & minor fixes for batching code" to "page_service: getpage batching: refactor & minor fixes" Nov 20, 2024
@problame problame marked this pull request as ready for review November 21, 2024 10:24
@problame problame requested a review from a team as a code owner November 21, 2024 10:24
VladLazar (Contributor) left a comment


Checked it out locally and stepped through it a number of times. Looks correct, but have a look at the comments.

Three review comments on pageserver/src/page_service.rs, all resolved.
github-merge-queue bot pushed a commit that referenced this pull request Nov 25, 2024
This PR adds two benchmarks to demonstrate the effect of server-side
getpage request batching added in
#9321.

For CPU usage, I found that the `prometheus` crate's built-in CPU usage
metric accounts CPU seconds at integer granularity. That's not enough once
you reduce the target benchmark runtime for local iteration. So, add a new
`libmetrics` metric and report that instead.
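
For illustration only: one way to obtain per-process CPU time at microsecond granularity is `getrusage(2)` via the `libc` crate. Whether the new `libmetrics` metric is implemented this way is not stated here, so treat this as an assumption-laden sketch of why sub-second resolution is obtainable.

```rust
// Requires the `libc` crate as a dependency; Linux/Unix only.
fn cpu_seconds() -> f64 {
    // SAFETY: `rusage` is a plain C struct; getrusage only writes into it.
    let mut ru: libc::rusage = unsafe { std::mem::zeroed() };
    let rc = unsafe { libc::getrusage(libc::RUSAGE_SELF, &mut ru) };
    assert_eq!(rc, 0, "getrusage failed");
    // User and system time both carry microsecond components.
    let user = ru.ru_utime.tv_sec as f64 + ru.ru_utime.tv_usec as f64 / 1e6;
    let sys = ru.ru_stime.tv_sec as f64 + ru.ru_stime.tv_usec as f64 / 1e6;
    user + sys
}

fn main() {
    println!("CPU seconds so far: {:.6}", cpu_seconds());
}
```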

The benchmarks are disabled because [on our benchmark nodes, timer
resolution isn't high
enough](https://neondb.slack.com/archives/C059ZC138NR/p1732264223207449).
They work (no statement about quality) on my bare-metal devbox.

They will be refined and enabled once we find a fix. Candidates at time
of writing are:
- #9822
- #9851


Refs:

- Epic: #9376
- Extracted from #9792
Base automatically changed from problame/batching-benchmark to main November 25, 2024 15:54
problame (Contributor, Author) commented

The changes in this PR will be largely replaced by

which is currently stacked on top.

Closing this to avoid one merge-rebase-CI roundtrip.

@problame problame closed this Nov 29, 2024
Development

Successfully merging this pull request may close these issues.

pageserver: batch get page requests and serve them with one vectored get
2 participants