Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: add Gate as a partner to CancellationToken for safe shutdown of Tenant & Timeline #5711

Merged
merged 24 commits into from
Nov 6, 2023

Conversation

jcsp
Copy link
Collaborator

@jcsp jcsp commented Oct 30, 2023

Problem

When shutting down a Tenant, it isn't just important to cause any background tasks to stop. It's also important to wait until they have stopped before declaring shutdown complete, in cases where we may re-use the tenant's local storage for something else, such as running in secondary mode, or creating a new tenant with the same ID.

Summary of changes

A Gate class is added, inspired by seastar::gate. For types that have an important lifetime that corresponds to some physical resource, use of a Gate as well as a CancellationToken provides a robust pattern for async requests & shutdown:

  • Requests must always acquire the gate as long as they are using the object
  • Shutdown must set the cancellation token, and then close() the gate to wait for requests in progress before returning.

This is not for memory safety: it's for expressing the difference between "Arc exists", and "This tenant's files on disk are eligible to be read/written".

  • Both Tenant and Timeline get a Gate & CancellationToken.
  • The Timeline gate is held during eviction of layers, and during page_service requests.
  • Existing cancellation support in page_service is refined to use the timeline-scope cancellation token instead of a process-scope cancellation token. This replaces the use of task_mgr::associate_with: tasks no longer change their tenant/timelineidentity after being spawned.

The Tenant's Gate is not yet used, but will be important for Tenant-scoped operations in secondary mode, where we must ensure that our secondary-mode downloads for a tenant are gated wrt the activity of an attached Tenant.

This is part of a broader move away from using the global-state driven task_mgr shutdown tokens:

  • less global state where we rely on implicit knowledge of what task a given function is running in, and more explicit references to the cancellation token that a particular function/type will respect, making shutdown easier to reason about.
  • eventually avoid the big global TASKS mutex.

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt labels Oct 30, 2023
@jcsp jcsp changed the title Jcsp/tenant shutdown gate pageserver: add Gate as a partner to CancellationToken for safe shutdown of Tenant & Timeline Oct 30, 2023
@jcsp jcsp force-pushed the jcsp/tenant-shutdown-gate branch from d8245f2 to ce3e371 Compare October 30, 2023 12:40
@github-actions
Copy link

github-actions bot commented Oct 30, 2023

2358 tests run: 2243 passed, 0 failed, 115 skipped (full report)


Flaky tests (2)

Postgres 16

  • test_crafted_wal_end[last_wal_record_crossing_segment]: release

Postgres 14

  • test_crafted_wal_end[last_wal_record_xlog_switch_ends_on_page_boundary]: debug

Code coverage (full report)

  • functions: 54.6% (8858 of 16232 functions)
  • lines: 81.6% (50930 of 62421 lines)

The comment gets automatically updated with the latest test results
0eb1060 at 2023-11-06T12:17:44.227Z :recycle:

@jcsp jcsp force-pushed the jcsp/tenant-shutdown-gate branch from ce3e371 to 13fd042 Compare October 30, 2023 12:50
Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike how this seems to duplicate what utils::completion aims to do (which tokio now has as TaskTracker), but I don't have more time to review right now.

@jcsp
Copy link
Collaborator Author

jcsp commented Oct 30, 2023

I dislike how this seems to duplicate what utils::completion aims to do (which tokio now has as TaskTracker), but I don't have more time to review right now.

I wouldn't say it duplicates it: neither one is a concurrency primitive (both are wrappers), and they have different goals:

  • Completion is a "you must have at least one clone alive until you're done". As long as you want to keep a barrier blocked, you have to keep at least one Completion around.
  • Gate is a "you can have zero guards, and acquire more later, but you'll be refused if the gate is closed". But you don't have to keep a guard around all the time to clone it.

One could implement Gate as a funky wrapper on Barrier, but the way I look at it they're both ergonomic wrappers around tokio concurrency types, so it makes sense to have two different types to implement two different usage styles.

@jcsp jcsp force-pushed the jcsp/tenant-shutdown-gate branch from 13fd042 to c1ad27a Compare October 30, 2023 18:37
@koivunej
Copy link
Member

I wouldn't say it duplicates it: neither one is a concurrency primitive (both are wrappers), and they have different goals:

  • Completion is a "you must have at least one clone alive until you're done". As long as you want to keep a barrier blocked, you have to keep at least one Completion around.
  • Gate is a "you can have zero guards, and acquire more later, but you'll be refused if the gate is closed". But you don't have to keep a guard around all the time to clone it.

It introduces more state than we need: my idea was to add a completions to state (task local) from which we can start new work. If we are not in that state, we cannot obtain a guard thus we cannot start new work. If we transitioning from the state, then we must wait out all outstanding work.

libs/utils/src/gate.rs Outdated Show resolved Hide resolved
libs/utils/src/gate.rs Outdated Show resolved Hide resolved
@jcsp jcsp force-pushed the jcsp/tenant-shutdown-gate branch 4 times, most recently from 5243245 to 8855605 Compare November 2, 2023 10:52
@jcsp jcsp force-pushed the jcsp/tenant-shutdown-gate branch from 8855605 to 5cf7b7c Compare November 2, 2023 12:20
libs/utils/src/gate.rs Outdated Show resolved Hide resolved
@jcsp jcsp force-pushed the jcsp/tenant-shutdown-gate branch from 854425f to 28ffd2b Compare November 3, 2023 16:31
@jcsp jcsp requested a review from koivunej November 3, 2023 16:38
@jcsp jcsp force-pushed the jcsp/tenant-shutdown-gate branch from 4b7bdcd to 441fed8 Compare November 3, 2023 16:58
@jcsp jcsp enabled auto-merge (squash) November 6, 2023 12:37
@jcsp jcsp merged commit 6defa2b into main Nov 6, 2023
34 checks passed
@jcsp jcsp deleted the jcsp/tenant-shutdown-gate branch November 6, 2023 12:39
jcsp added a commit that referenced this pull request Nov 8, 2023
## Problem

#5711 and #5367 raced -- the `SlotGuard` type needs `Gate` to properly
enforce its invariant that we may not drop an `Arc<Tenant>` from a slot.

## Summary of changes

Replace the TODO with the intended check of Gate.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants