test: do graceful shutdown by default #8655
Conversation
It should give us all allowed_errors and be less racy.
There should be some failures. (I did not get any of these locally.)
2116 tests run: 2047 passed, 0 failed, 69 skipped (full report)
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results
f9e4531 at 2024-08-12T12:29:10.334Z :recycle:
Just FYI, the changes to the shutdown mode were recent -> #8289
That's unrelated; this is mostly about pageserver shutdown, which has been using SIGQUIT for at least 2 years -- but I closed the blame page already.
Excellent, the rest reproduce locally.
Early approval, as I'm off next week and I expect this patch only changes error handling / failpoints and won't affect the actual code behavior :)
With the verbose pausable failpoint, it would appear that the logging is now slowing down the test. Probably the same for the rest.
Force-pushed from e77a784 to b0d118c
Also, we don't need this extra special conversion.
I don't know what the test is testing, but ... shutdown does not need to be delayed 10s.
This reverts commit 322ff85.
Force-pushed from b0d118c to f9e4531
A few of the benchmarks have started failing after #8655 because they are waiting for the compactor task. Reads done by image layer creation should already be cancellation sensitive because vectored get does a check each time, but try sprinkling additional cancellation points:
- at each partition
- after each vectored read batch
Some benchmarks and tests might still fail because of #8655 (tracked in #8708) because we are not fast enough to shut down ([one example]). This is partially explained by the current validation mode of streaming k-merge, but otherwise it is because that is where we spend a lot of time in compaction. Outside of L0 => L1 compaction, image layer generation is already guarded by vectored reads doing cancellation checks.

32768 is a wild guess based on looking at how many keys we put in each layer in a bench (1-2 million), but I assume it will be a good enough divisor. Doing checks more often will start showing up as contention which we cannot currently measure. Doing checks less often might be reasonable.

[one example]: https://neon-github-public-dev.s3.amazonaws.com/reports/main/10384136483/index.html#suites/9681106e61a1222669b9d22ab136d07b/96e6d53af234924/

Earlier PR: #8706.
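For illustration only, here is a minimal Python-style sketch of the amortized cancellation-check pattern described above; the actual implementation lives in the Rust pageserver's compaction code, and the names `CHECK_INTERVAL`, `cancel`, `write_key`, and `CompactionCancelled` are hypothetical.

```python
import threading

# Hypothetical sketch: consult the cancellation token only every
# CHECK_INTERVAL keys, so compaction can stop promptly during a graceful
# shutdown without paying a per-key cost. The interval mirrors the 32768
# guess discussed above.
CHECK_INTERVAL = 32768


class CompactionCancelled(Exception):
    pass


def write_image_layer(keys, cancel: threading.Event, write_key):
    for i, key in enumerate(keys):
        if i % CHECK_INTERVAL == 0 and cancel.is_set():
            # Bail out quickly so shutdown is not blocked on compaction.
            raise CompactionCancelled("shutdown requested")
        write_key(key)
```

Checking more often would shorten the worst-case shutdown delay but, as noted above, could start showing up as contention; checking less often trades the other way.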
After #8655, we needed to mark some tests to shut down immediately. To aid these tests, try the new pattern of `flush_ep_to_pageserver` followed by a non-compacting checkpoint. This moves the general graceful-shutdown problem of having too much to flush at shutdown into the test. Also, add logging for how long the graceful shutdown took, if it completed, for faster log eyeballing.

Fixes: #8712
Cc: #8715, #8708
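A rough sketch of that test-side pattern, assuming the fixtures expose helpers roughly like these; the exact signatures of `flush_ep_to_pageserver`, the `compact=False` checkpoint flag, and `stop(immediate=...)` are assumptions here, not taken verbatim from the framework.

```python
# Illustrative only: flush everything the endpoint produced to the pageserver,
# then take a non-compacting checkpoint, so the later graceful shutdown has
# almost nothing left to flush. Names and signatures are assumptions.
def shutdown_with_prior_flush(env, endpoint, tenant_id, timeline_id):
    # Wait until the pageserver has ingested all WAL from the endpoint.
    flush_ep_to_pageserver(env, endpoint, tenant_id, timeline_id)

    # Freeze and flush in-memory layers without kicking off compaction, so
    # the shutdown itself does not have to wait for compaction to finish.
    env.pageserver.http_client().timeline_checkpoint(
        tenant_id, timeline_id, compact=False
    )

    # A graceful (non-immediate) shutdown should now complete quickly.
    env.pageserver.stop(immediate=False)
```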
`test_forward_compatibility` is still often failing at graceful shutdown. Fix this with an explicit flush before shutdown.

Example: https://neon-github-public-dev.s3.amazonaws.com/reports/main/10506613738/index.html#testresult/5e7111907f7ecfb2/
Cc: #8655 and #8708
Previous attempt: #8787
It should give us all possible allowed_errors more consistently.
While getting the workflows to pass on #8632, it was noticed that allowed_errors are rarely hit (1/4). This made me realize that we always do an immediate stop by default. Doing a graceful shutdown would have made the draining more apparent, and likely we would not have needed the #8632 hotfix.
The downside of doing this is that we will see more timeouts if tests leave pause failpoints enabled, which makes the graceful shutdown fail.
The net outcome should, however, be positive: we could even detect overly slow shutdowns caused by a bug or deadlock.
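For illustration, a sketch of how a test might adapt to the new default; the failpoint name and the exact `configure_failpoints` / `stop` usage are assumptions for this example, not taken from this PR.

```python
# Illustrative only: with graceful shutdown as the new default, a test that
# deliberately pauses on a failpoint must un-pause (or request an immediate
# stop) before shutdown, otherwise the graceful shutdown will time out.
def test_pausing_example(neon_env_builder):
    env = neon_env_builder.init_start()
    ps_http = env.pageserver.http_client()

    # Hypothetical pausable failpoint used by the test body.
    ps_http.configure_failpoints(("example-pausable-failpoint", "pause"))

    # ... test body that relies on the paused operation ...

    # Turn the failpoint off so the default graceful stop can finish in time.
    ps_http.configure_failpoints(("example-pausable-failpoint", "off"))
    env.pageserver.stop()  # graceful by default after this change
```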