pageserver: flush deletion queue on detach #5452

jcsp · 2023-10-03T13:11:09Z

Problem

If a caller detaches a tenant and then attaches it again, pending deletions from the old attachment might not have happened yet. This is not a correctness problem, but it causes:

- Risk of leaking some objects in S3.
- Some warnings from the deletion queue when pending LSN updates and pending deletions don't pass validation.

Summary of changes

- Deletion queue now uses UnboundedChannel so that the push interfaces don't have to be async.
  - This was pulled out of pageserver: enable compaction to proceed while live-migrating #5397, where it is also useful to be able to drive the queue from non-async contexts.
  - Why is it okay for this to be unbounded? The only way the unbounded-ness of the channel can become a problem is if writing out deletion lists can't keep up, but if the system were that overloaded then the code generating deletions (GC, compaction) would also be impacted.
- DeletionQueueClient gets a new flush_advisory function, which is like flush_execute but doesn't wait for completion. It is appropriate in contexts where we would like to encourage the deletion queue to flush, but don't need to block on it.
  - This function is also expected to be useful in next steps for seamless migration, where the option to flush to S3 while transitioning into AttachedStale will also include flushing the deletion queue, but we wouldn't want to block on that flush.
- The tenant_detach code in mgr.rs invokes flush_advisory after stopping the Tenant object.

This avoids the risk of discarded LSN update warnings in the logs when doing a fast detach/attach cycle.
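
To illustrate the difference between a blocking flush and an advisory one, here is a minimal sketch. It is not the pageserver implementation: it uses the standard library's unbounded `std::sync::mpsc` channel and a worker thread as a stand-in for the async channel and background task, and the `QueueMessage` type, the `push` method, and `spawn_worker` are hypothetical names chosen for the example. Only `DeletionQueueClient`, `flush_advisory` and `flush_execute` are names taken from the description above, and their signatures here are assumptions.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

// Hypothetical message type; the real queue's messages are not shown in this PR.
enum QueueMessage {
    Delete(String),
    // `Some(ack)` asks the worker to report completion; `None` is advisory.
    Flush(Option<Sender<()>>),
}

#[derive(Clone)]
struct DeletionQueueClient {
    tx: Sender<QueueMessage>,
}

impl DeletionQueueClient {
    // Plain (non-async) push: sending on an unbounded channel never waits for capacity.
    fn push(&self, key: &str) -> Result<(), &'static str> {
        self.tx
            .send(QueueMessage::Delete(key.to_string()))
            .map_err(|_| "queue shut down")
    }

    // Ask the worker to flush, but do not wait for it to finish.
    fn flush_advisory(&self) {
        let _ = self.tx.send(QueueMessage::Flush(None));
    }

    // Ask the worker to flush and block until it acknowledges.
    fn flush_execute(&self) -> Result<(), &'static str> {
        let (ack_tx, ack_rx) = channel();
        self.tx
            .send(QueueMessage::Flush(Some(ack_tx)))
            .map_err(|_| "queue shut down")?;
        ack_rx.recv().map_err(|_| "queue shut down")
    }
}

// Stand-in worker: collects pending keys and "executes" them on each flush.
// Returns every executed key once all clients have been dropped.
fn spawn_worker() -> (DeletionQueueClient, thread::JoinHandle<Vec<String>>) {
    let (tx, rx) = channel();
    let handle = thread::spawn(move || {
        let mut pending = Vec::new();
        let mut executed = Vec::new();
        for msg in rx {
            match msg {
                QueueMessage::Delete(key) => pending.push(key),
                QueueMessage::Flush(ack) => {
                    executed.append(&mut pending);
                    if let Some(ack) = ack {
                        // The caller may have gone away; ignore a failed ack.
                        let _ = ack.send(());
                    }
                }
            }
        }
        executed
    });
    (DeletionQueueClient { tx }, handle)
}

fn main() {
    let (client, worker) = spawn_worker();
    client.push("layer-a").unwrap();
    client.push("layer-b").unwrap();
    client.flush_advisory(); // returns immediately
    client.push("layer-c").unwrap();
    client.flush_execute().unwrap(); // blocks until the worker has flushed
    drop(client); // closing the channel lets the worker exit
    let executed = worker.join().unwrap();
    println!("executed {} deletions", executed.len()); // prints "executed 3 deletions"
}
```

In this sketch the detach path would correspond to the `flush_advisory` call: the caller nudges the queue and moves on, so a stuck queue cannot stall the detach itself.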

github-actions · 2023-10-03T13:49:40Z

2280 tests run: 2164 passed, 0 failed, 116 skipped (full report)

Flaky tests (1)

Postgres 16

test_wal_lagging: release

Code coverage (full report)

functions: 52.3% (8144 of 15558 functions)
lines: 81.0% (47637 of 58775 lines)

The comment gets automatically updated with the latest test results: 16977d2 at 2023-10-10T09:48:35.577Z

arpad-m

Needs a rebase.

Right now we would have requests get stuck and eventually time out if the queue gets stuck, which creates noisy errors.

I also wonder if we should somehow measure the queue length in a metric or warn if the queue length is larger than some limit. There is the risk of having a memory "leak" like the one fixed in #5472 ...

jcsp · 2023-10-09T16:14:28Z

> I also wonder if we should somehow measure the queue length in a metric or warn if the queue length is larger than some limit. There is the risk of having a memory "leak" like the one fixed in #5472 ...

We do already have metrics that enable calculating queue length -- this was a good reminder for me to add charts for those to the pageserver dashboard. I'll follow up to add an alert for queue length.
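
As a rough illustration of deriving a queue length from existing counters, the sketch below assumes three monotonically increasing counters (submitted, executed, dropped). These counter names, the `QueueMetrics` type, and the threshold are hypothetical and are not taken from the pageserver's actual metrics.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical counters; the real metric names are not given in this thread.
struct QueueMetrics {
    submitted: AtomicU64, // deletions pushed into the queue
    executed: AtomicU64,  // deletions carried out
    dropped: AtomicU64,   // deletions discarded (e.g. failed validation)
}

impl QueueMetrics {
    fn new() -> Self {
        QueueMetrics {
            submitted: AtomicU64::new(0),
            executed: AtomicU64::new(0),
            dropped: AtomicU64::new(0),
        }
    }

    // Queue length = everything submitted minus everything already resolved.
    // saturating_sub guards against a transiently inconsistent read.
    fn queue_length(&self) -> u64 {
        let submitted = self.submitted.load(Ordering::Relaxed);
        let resolved =
            self.executed.load(Ordering::Relaxed) + self.dropped.load(Ordering::Relaxed);
        submitted.saturating_sub(resolved)
    }

    // True when the backlog is larger than the given (assumed) alert limit.
    fn exceeds(&self, limit: u64) -> bool {
        self.queue_length() > limit
    }
}

fn main() {
    let metrics = QueueMetrics::new();
    metrics.submitted.fetch_add(10, Ordering::Relaxed);
    metrics.executed.fetch_add(6, Ordering::Relaxed);
    metrics.dropped.fetch_add(1, Ordering::Relaxed);
    println!("queue length: {}", metrics.queue_length()); // prints "queue length: 3"
    println!("over limit of 2: {}", metrics.exceeds(2)); // prints "over limit of 2: true"
}
```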

jcsp · 2023-10-09T16:33:07Z

Push:

- Merge main into this branch
- Since the LocationConf stuff merged, we should also do this flush-on-detach behavior when we shut down a tenant in upsert_location -- added that.

jcsp added 2 commits October 3, 2023 13:48

pageserver: enable non-async push to deletion queue

656dcc8

pageserver: flush deletion queue validation on detach

7f036fe

This avoids risk of discarded LSN update warnings in the logs when doing a fast detach/attach cycle

jcsp added the c/storage/pageserver (Component: storage: pageserver) and a/tech_debt (Area: related to tech debt) labels Oct 3, 2023

jcsp marked this pull request as ready for review October 4, 2023 08:21

jcsp requested a review from a team as a code owner October 4, 2023 08:21

jcsp requested review from hlinnaka, problame and arpad-m and removed request for a team and hlinnaka October 4, 2023 08:21

arpad-m reviewed Oct 5, 2023

jcsp added 3 commits October 9, 2023 17:17

Merge remote-tracking branch 'upstream/main' into jcsp/flush-on-detach

85a6def

Update put_tenant_location_config_handler for drain-on-detach

074549a

pageserver: extend flush-on-detach behavior to location config change

51926d9

jcsp requested a review from arpad-m October 9, 2023 16:33

arpad-m approved these changes Oct 9, 2023

pageserver/src/tenant/mgr.rs (outdated, resolved)

Update pageserver/src/tenant/mgr.rs

16977d2

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>

jcsp enabled auto-merge (squash) October 10, 2023 09:05

jcsp merged commit acefee9 into main Oct 10, 2023
34 checks passed

jcsp deleted the jcsp/flush-on-detach branch October 10, 2023 09:46
