test: do graceful shutdown by default #8655
Conversation
It should give us all allowed_errors and be less racy.
There should be some failures. (I did not get any of these locally.)
2116 tests run: 2047 passed, 0 failed, 69 skipped (full report)
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results
f9e4531 at 2024-08-12T12:29:10.334Z :recycle:
Just FYI, the changes to the shutdown mode were recent -> #8289
That's unrelated; this is mostly about pageserver shutdown, which has been using SIGQUIT for at least 2 years -- but I closed the blame page already.
Excellent, the rest reproduce locally.
Early approval, as I'm off next week and I expect this patch only changes error handling / failpoints and won't affect the actual code behavior :)
With the verbose pausable failpoint, it would appear that the logging is now slowing down the test. Probably the same for the rest.
Force-pushed from e77a784 to b0d118c
Also, we don't need this extra special conversion.
I don't know what the test is testing, but ... shutdown does not need to be delayed 10s.
This reverts commit 322ff85.
Force-pushed from b0d118c to f9e4531
A few of the benchmarks have started failing after #8655 because they are waiting for the compactor task. Reads done by image layer creation should already be cancellation sensitive because vectored get does a check each time, but try sprinkling additional cancellation points:
- at each partition
- after each vectored read batch
Some benchmarks and tests might still fail because of #8655 (tracked in #8708) because we are not fast enough to shut down ([one example]). This is partially explained by the current validation mode of streaming k-merge, but otherwise it is because that is where we spend a lot of time in compaction. Outside of L0 => L1 compaction, image layer generation is already guarded by vectored reads doing cancellation checks.

32768 is a wild guess based on looking at how many keys we put in each layer in a bench (1-2 million), but I assume it will be a good enough divisor. Doing checks more often will start showing up as contention which we cannot currently measure. Doing checks less often might be reasonable.

[one example]: https://neon-github-public-dev.s3.amazonaws.com/reports/main/10384136483/index.html#suites/9681106e61a1222669b9d22ab136d07b/96e6d53af234924/

Earlier PR: #8706.
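For illustration only, here is a minimal Python-style sketch of the amortized cancellation-check pattern described above; the actual implementation lives in the Rust pageserver's compaction code, and the names `CHECK_INTERVAL`, `cancel`, `write_key`, and `CompactionCancelled` are hypothetical.

```python
import threading

# Hypothetical sketch: consult the cancellation token only every
# CHECK_INTERVAL keys, so compaction can stop promptly during a graceful
# shutdown without paying a per-key cost. The interval mirrors the 32768
# guess discussed above.
CHECK_INTERVAL = 32768


class CompactionCancelled(Exception):
    pass


def write_image_layer(keys, cancel: threading.Event, write_key):
    for i, key in enumerate(keys):
        if i % CHECK_INTERVAL == 0 and cancel.is_set():
            # Bail out quickly so shutdown is not blocked on compaction.
            raise CompactionCancelled("shutdown requested")
        write_key(key)
```

Checking more often would shorten the worst-case shutdown delay but, as noted above, could start showing up as contention; checking less often trades the other way.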
After #8655, we needed to mark some tests to shut down immediately. To aid these tests, try the new pattern of `flush_ep_to_pageserver` followed by a non-compacting checkpoint. This moves the general graceful-shutdown problem of having too much to flush at shutdown into the test. Also, add logging for how long the graceful shutdown took, if it completed, for faster log eyeballing.

Fixes: #8712
Cc: #8715, #8708
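A rough sketch of that test-side pattern, assuming the fixtures expose helpers roughly like these; the exact signatures of `flush_ep_to_pageserver`, the `compact=False` checkpoint flag, and `stop(immediate=...)` are assumptions here, not taken verbatim from the framework.

```python
# Illustrative only: flush everything the endpoint produced to the pageserver,
# then take a non-compacting checkpoint, so the later graceful shutdown has
# almost nothing left to flush. Names and signatures are assumptions.
def shutdown_with_prior_flush(env, endpoint, tenant_id, timeline_id):
    # Wait until the pageserver has ingested all WAL from the endpoint.
    flush_ep_to_pageserver(env, endpoint, tenant_id, timeline_id)

    # Freeze and flush in-memory layers without kicking off compaction, so
    # the shutdown itself does not have to wait for compaction to finish.
    env.pageserver.http_client().timeline_checkpoint(
        tenant_id, timeline_id, compact=False
    )

    # A graceful (non-immediate) shutdown should now complete quickly.
    env.pageserver.stop(immediate=False)
```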
`test_forward_compatibility` is still often failing at graceful shutdown. Fix this with an explicit flush before shutdown.

Example: https://neon-github-public-dev.s3.amazonaws.com/reports/main/10506613738/index.html#testresult/5e7111907f7ecfb2/
Cc: #8655 and #8708
Previous attempt: #8787
It should give us all possible allowed_errors more consistently.
While getting the workflows to pass on #8632, it was noticed that allowed_errors are rarely hit (1/4). This made me realize that we always do an immediate stop by default. Doing a graceful shutdown would have made the draining more apparent, and likely we would not have needed the #8632 hotfix.
The downside of doing this is that we will see more timeouts if tests leave pause failpoints enabled, which makes the graceful shutdown fail.
The net outcome should, however, be positive: we could even detect overly slow shutdowns caused by a bug or deadlock.
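For illustration, a sketch of how a test might adapt to the new default; the failpoint name and the exact `configure_failpoints` / `stop` usage are assumptions for this example, not taken from this PR.

```python
# Illustrative only: with graceful shutdown as the new default, a test that
# deliberately pauses on a failpoint must un-pause (or request an immediate
# stop) before shutdown, otherwise the graceful shutdown will time out.
def test_pausing_example(neon_env_builder):
    env = neon_env_builder.init_start()
    ps_http = env.pageserver.http_client()

    # Hypothetical pausable failpoint used by the test body.
    ps_http.configure_failpoints(("example-pausable-failpoint", "pause"))

    # ... test body that relies on the paused operation ...

    # Turn the failpoint off so the default graceful stop can finish in time.
    ps_http.configure_failpoints(("example-pausable-failpoint", "off"))
    env.pageserver.stop()  # graceful by default after this change
```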