Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: flush deletion queue on detach #5452

Merged
merged 6 commits into from
Oct 10, 2023
Merged

pageserver: flush deletion queue on detach #5452

merged 6 commits into from
Oct 10, 2023

Conversation

jcsp
Copy link
Collaborator

@jcsp jcsp commented Oct 3, 2023

Problem

If a caller detaches a tenant and then attaches it again, pending deletions from the old attachment might not have happened yet. This is not a correctness problem, but it causes:

  • Risk of leaking some objects in S3
  • Some warnings from the deletion queue when pending LSN updates and pending deletions don't pass validation.

Summary of changes

  • Deletion queue now uses UnboundedChannel so that the push interfaces don't have to be async.
    • This was pulled out of pageserver: enable compaction to proceed while live-migrating #5397, where it is also useful to be able to drive the queue from non-async contexts.
    • Why is it okay for this to be unbounded? The only way the unbounded-ness of the channel can become a problem is if writing out deletion lists can't keep up, but if the system were that overloaded then the code generating deletions (GC, compaction) would also be impacted.
  • DeletionQueueClient gets a new flush_advisory function, which is like flush_execute, but doesn't wait for completion: this is appropriate for use in contexts where we would like to encourage the deletion queue to flush, but don't need to block on it.
    • This function is also expected to be useful in next steps for seamless migration, where the option to flush to S3 while transitioning into AttachedStale will also include flushing deletion queue, but we wouldn't want to block on that flush.
  • The tenant_detach code in mgr.rs invokes flush_advisory after stopping the Tenant object.

jcsp added 2 commits October 3, 2023 13:48
This avoids risk of discarded LSN update warnings in
the logs when doing a fast detach/attach cycle
@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt labels Oct 3, 2023
@github-actions
Copy link

github-actions bot commented Oct 3, 2023

2280 tests run: 2164 passed, 0 failed, 116 skipped (full report)


Flaky tests (1)

Postgres 16

Code coverage (full report)

  • functions: 52.3% (8144 of 15558 functions)
  • lines: 81.0% (47637 of 58775 lines)

The comment gets automatically updated with the latest test results
16977d2 at 2023-10-10T09:48:35.577Z :recycle:

@jcsp jcsp marked this pull request as ready for review October 4, 2023 08:21
@jcsp jcsp requested a review from a team as a code owner October 4, 2023 08:21
@jcsp jcsp requested review from hlinnaka, problame and arpad-m and removed request for a team and hlinnaka October 4, 2023 08:21
Copy link
Member

@arpad-m arpad-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a rebase.

Right now we would have requests get stuck and eventually time out if the queue gets stuck, which creates noisy errors.

I also wonder if we should somehow measure the queue length in a metric or warn if the queue length is larger than some limit. There is the risk of having a memory "leak" like the one fixed in #5472 ...

@jcsp
Copy link
Collaborator Author

jcsp commented Oct 9, 2023

I also wonder if we should somehow measure the queue length in a metric or warn if the queue length is larger than some limit. There is the risk of having a memory "leak" like the one fixed in #5472 ...

We do already have metrics that enable calculating queue length -- this was a good reminder for me to add charts for those to the pageserver dashboard. I'll follow up to add an alert for queue length.

@jcsp
Copy link
Collaborator Author

jcsp commented Oct 9, 2023

Push:

  • Merge main into this branch
  • Since the LocationConf stuff merged, we should also do this flush-on-detach behavior when we shut down a tenant in upsert_location -- added that.

@jcsp jcsp requested a review from arpad-m October 9, 2023 16:33
pageserver/src/tenant/mgr.rs Outdated Show resolved Hide resolved
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
@jcsp jcsp enabled auto-merge (squash) October 10, 2023 09:05
@jcsp jcsp merged commit acefee9 into main Oct 10, 2023
34 checks passed
@jcsp jcsp deleted the jcsp/flush-on-detach branch October 10, 2023 09:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants