Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: add no_sync for use in regression tests (1/2) #9677

Merged
merged 3 commits into from
Nov 8, 2024

Conversation

jcsp
Copy link
Collaborator

@jcsp jcsp commented Nov 7, 2024

Problem

In test environments, the syncfs that the pageserver does on startup can take a long time, as other tests running concurrently might have many gigabytes of dirty pages.

Summary of changes

  • Add a no_sync option to the pageserver's config.
  • Skip syncfs on startup if this is set
  • A subsequent PR (pageserver: add no_sync for use in regression tests (2/2) #9678) will enable this by default in tests. We need to wait until after the next release to avoid breaking compat tests, which would fail if we set no_sync & use an old pageserver binary.

Q: Why is this a different mechanism than safekeeper, which as a --no-sync CLI?
A: Because the way we manage pageservers in neon_local depends on the pageserver.toml containing the full configuration, whereas safekeepers have a config file which is neon-local-specific and can drive a CLI flag.

Q: Why is the option no_sync rather than sync?
A: For boolean configs with a dangerous value, it's preferable to make "false" the safe option, so that any downstream future config tooling that might have a "booleans are false by default" behavior (e.g. golang structs) is safe by default.

Q: Why only skip the syncfs, and not all fsyncs?
A: Skipping all fsyncs would require more code changes, and the most acute problem isn't fsyncs themselves (these just slow down a running test), it's the syncfs (which makes a pageserver startup slow as a result of other tests)

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/test Area: related to testing a/tech_debt Area: related to tech debt labels Nov 7, 2024
@jcsp jcsp changed the title Jcsp/pageserver no sync pageserver: add no_sync and use it in regression tests Nov 7, 2024
@jcsp jcsp marked this pull request as ready for review November 7, 2024 11:57
@jcsp jcsp requested a review from a team as a code owner November 7, 2024 11:57
@jcsp jcsp requested a review from VladLazar November 7, 2024 11:57
Copy link

github-actions bot commented Nov 7, 2024

5337 tests run: 5115 passed, 0 failed, 222 skipped (full report)


Flaky tests (1)

Postgres 17

Code coverage* (full report)

  • functions: 31.7% (7854 of 24775 functions)
  • lines: 49.4% (62119 of 125849 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
2aa794a at 2024-11-07T16:55:01.305Z :recycle:

@jcsp jcsp force-pushed the jcsp/pageserver-no-sync branch from 9e43b43 to d3b7850 Compare November 7, 2024 13:43
@jcsp jcsp changed the title pageserver: add no_sync and use it in regression tests pageserver: add no_sync for use in regression tests (1/2) Nov 7, 2024
@arpad-m
Copy link
Member

arpad-m commented Nov 7, 2024

Maybe rename it to no_startup_sync instead of no_sync? Latter name might be confused for a "don't do any fsyncs at all" option (which it isn't). In fact, the safekeeper no_sync option is such a "don't do any fsyncs at all" option.

@jcsp
Copy link
Collaborator Author

jcsp commented Nov 7, 2024

Maybe rename it to no_startup_sync instead of no_sync? Latter name might be confused for a "don't do any fsyncs at all" option (which it isn't). In fact, the safekeeper no_sync option is such a "don't do any fsyncs at all" option.

I was a bit intentionally vague with the name, as in future we might extend it to fsyncs in general, and I wouldn't want to have two separate configs

@jcsp jcsp enabled auto-merge (squash) November 8, 2024 08:52
@jcsp jcsp merged commit aa9112e into main Nov 8, 2024
84 checks passed
@jcsp jcsp deleted the jcsp/pageserver-no-sync branch November 8, 2024 10:16
jcsp added a commit that referenced this pull request Nov 13, 2024
## Problem

Followup to #9677 which enables
`no_sync` in tests. This can be merged once the next release has
happened.

## Summary of changes

- Always run pageserver with `no_sync = true` in tests.
jcsp added a commit that referenced this pull request Nov 18, 2024
## Problem

`no_sync` initially just skipped syncfs on startup (#9677). I'm also
interested in flaky tests that time out during pageserver shutdown while
flushing l0s, so to eliminate disk throughput as a source of issues
there,

## Summary of changes

- Drive-by change for test timeouts: add a couple more ::info logs
during pageserver startup so it's obvious which part got stuck.
- Add a SyncMode enum to configure VirtualFile and respect it in
sync_all and sync_data functions
- During pageserver startup, set SyncMode according to `no_sync`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/tech_debt Area: related to tech debt a/test Area: related to testing c/storage/pageserver Component: storage: pageserver
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants