Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: do graceful shutdown by default #8655

Merged
merged 13 commits into from
Aug 12, 2024
Merged

Conversation

koivunej
Copy link
Member

@koivunej koivunej commented Aug 8, 2024

It should give us all possible allowed_errors more consistently.

While getting the workflows to pass on #8632 it was noticed that allowed_errors are rarely hit (1/4). This made me realize that we always do an immediate stop by default. Doing a graceful shutdown would had made the draining more apparent and likely we would not have needed the #8632 hotfix.

Downside of doing this is that we will see more timeouts if tests are randomly leaving pause failpoints which fail the shutdown.

The net outcome should however be positive, we could even detect too slow shutdowns caused by a bug or deadlock.

it should give us all allowed_errors and be less racy.
@koivunej
Copy link
Member Author

koivunej commented Aug 8, 2024

There should be some failures. (I did not get any of these locally.)

Copy link

github-actions bot commented Aug 8, 2024

2116 tests run: 2047 passed, 0 failed, 69 skipped (full report)


Code coverage* (full report)

  • functions: 32.5% (7167 of 22034 functions)
  • lines: 50.5% (57920 of 114712 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
f9e4531 at 2024-08-12T12:29:10.334Z :recycle:

@koivunej koivunej requested a review from a team as a code owner August 8, 2024 11:59
@koivunej koivunej requested a review from skyzh August 8, 2024 11:59
@skyzh
Copy link
Member

skyzh commented Aug 8, 2024

Just FYI, changes to the shutdown mode was recent -> #8289

@koivunej
Copy link
Member Author

koivunej commented Aug 8, 2024

Just FYI, changes to the shutdown mode was recent -> #8289

That's unrelated, this is mostly about pageserver shutdown, and it has been using SIGQUIT at least 2y -- but I closed the blame page already.

@koivunej
Copy link
Member Author

koivunej commented Aug 8, 2024

Excellent, the rest reproduce locally.

Copy link
Member

@skyzh skyzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

early approval as I'm off next week and I expect this patch only changes error handling / failpoints and won't affect the actual code behavior :)

pageserver/src/tenant/timeline/compaction.rs Show resolved Hide resolved
pageserver/src/tenant/tasks.rs Outdated Show resolved Hide resolved
pageserver/src/tenant.rs Outdated Show resolved Hide resolved
@koivunej
Copy link
Member Author

koivunej commented Aug 9, 2024

  • test_sliding_working_set_approximation: debug

With the verbose pausable failpoint, it would appear that the logging is now slowing down the test.

Probably the same for the rest.

@koivunej koivunej force-pushed the joonas/always_graceful_shutdown branch 2 times, most recently from e77a784 to b0d118c Compare August 12, 2024 11:21
@koivunej koivunej force-pushed the joonas/always_graceful_shutdown branch from b0d118c to f9e4531 Compare August 12, 2024 11:36
@koivunej koivunej merged commit 9dc9a9b into main Aug 12, 2024
63 checks passed
@koivunej koivunej deleted the joonas/always_graceful_shutdown branch August 12, 2024 12:37
koivunej added a commit that referenced this pull request Aug 13, 2024
A few of the benchmarks have started failing after #8655 where they are
waiting for compactor task. Reads done by image layer creation should
already be cancellation sensitive because vectored get does a check each
time, but try sprinkling additional cancellation points to:

- each partition
- after each vectored read batch
koivunej added a commit that referenced this pull request Aug 13, 2024
After #8655 we've had a few issues (mostly tracked on #8708) with the
graceful shutdown. In order to shutdown more of the processes and catch
more errors, for example, from all pageservers, do an immediate shutdown
for those nodes which fail the initial (possibly graceful) shutdown.

Cc: #6485
koivunej added a commit that referenced this pull request Aug 14, 2024
Some benchmarks and tests might still fail because of #8655 (tracked in
#8708) because we are not fast enough to shut down ([one evidence]).
Partially this is explained by the current validation mode of streaming
k-merge, but otherwise because that is where we use a lot of time in
compaction. Outside of L0 => L1 compaction, the image layer generation
is already guarded by vectored reads doing cancellation checks.

32768 is a wild guess based on looking how many keys we put in each
layer in a bench (1-2 million), but I assume it will be good enough
divisor. Doing checks more often will start showing up as contention
which we cannot currently measure. Doing checks less often might be
reasonable.

[one evidence]:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/10384136483/index.html#suites/9681106e61a1222669b9d22ab136d07b/96e6d53af234924/

Earlier PR: #8706.
koivunej added a commit that referenced this pull request Aug 21, 2024
After #8655, we needed to mark some tests to shut down immediately. To
aid these tests, try the new pattern of `flush_ep_to_pageserver`
followed by a non-compacting checkpoint. This moves the general graceful
shutdown problem of having too much to flush at shutdown into the test.
Also, add logging for how long the graceful shutdown took, if we got to
complete it for faster log eyeballing.

Fixes: #8712
Cc: #8715, #8708
koivunej added a commit that referenced this pull request Aug 22, 2024
`test_forward_compatibility` is still often failing at graceful
shutdown. Fix this by explicit flush before shutdown.

Example:
https://neon-github-public-dev.s3.amazonaws.com/reports/main/10506613738/index.html#testresult/5e7111907f7ecfb2/

Cc: #8655 and #8708
Previous attempt: #8787
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants