Epic: Pageserver Timeline Archival #8088

Open
35 of 46 tasks
jcsp opened this issue Jun 18, 2024 · 12 comments
Labels: c/storage/pageserver (Component: storage: pageserver), t/Epic (Issue type: Epic), t/feature (Issue type: feature, for new features or requests)

jcsp (Collaborator) commented Jun 18, 2024

Purpose

Enable users to create branches fearlessly: no branch count limits to hit, and no need to clean up old branches unless they want to.

Background

Currently, all timelines have significant physical overhead on the pageserver, even if they haven't been used for days/weeks/months:

  • scanning the timeline's remote storage path on tenant startup and loading its index
  • pinning some of the timeline's layers in local storage for logical size calculations
  • running a WAL receiver for the timeline

Changes

This section isn't an authoritative design, but calls out functional areas that will need work.

  • We'll need some manifest in remote storage that the tenant can read on startup to learn which timelines should be loaded in an active state vs. which timelines are hibernated. Keeping this properly up to date with timeline create/delete operations will be a key correctness point (see the sketch after this list).
  • Persist enough information about hibernated timelines that we can know their logical size (& any other key stats) without having to load them fully. It probably makes sense to inline this into the per-tenant object that lists the timelines.
  • Our runtime state in Tenant will need to only store active timelines in Tenant::timelines, and have some other map of hibernated timelines.
  • APIs that list timelines will need either to change their semantics to report only active timelines, to avoid unreasonably large responses when users have many thousands of branches, or to become paginated/queryable.
  • An external API to enable the control plane to tell us when a timeline should be hibernated or awoken. We could also choose to auto-hibernate after some period of inactivity, but that might be duplicative wrt the externally driven mechanism.
  • A cache-warming routine that loads enough layers to serve reads at the tip of the branch, so that when we activate a timeline, the user doesn't encounter a long slow period while data is promoted to local storage.
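
To make the manifest idea in the first bullet concrete, here is a minimal sketch of what such a per-tenant object could look like, assuming serde-serialized JSON; every name and field here is illustrative, not an authoritative schema.

```rust
use serde::{Deserialize, Serialize};

// Hypothetical per-tenant manifest at a well-known key in remote storage.
// All names are assumptions for illustration.
#[derive(Serialize, Deserialize)]
struct TenantManifest {
    /// Bumped whenever the schema changes.
    version: u32,
    /// Timelines that should NOT be loaded into an active state on startup.
    hibernated_timelines: Vec<HibernatedTimeline>,
}

#[derive(Serialize, Deserialize)]
struct HibernatedTimeline {
    timeline_id: String,
    /// Inlined so we can answer size queries without loading the timeline
    /// fully (per the second bullet above).
    logical_size: u64,
    ancestor_timeline_id: Option<String>,
}
```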

Milestone: archived branches are cheap locally -- (no index load on startup, no layers on disk, no Timeline at runtime)


Misc lower priority changes


Milestone: archived branches are cheap in remote storage -- eventually written as compressed image layers at a single LSN

jcsp added the t/feature, c/storage/pageserver, and t/Epic labels on Jun 18, 2024
jcsp changed the title from "Epic: Pageserver Timeline Hibernation" to "Epic: Pageserver Timeline Archival" on Jul 1, 2024
jcsp added a commit that referenced this issue Jul 3, 2024
## Problem

The metrics we have today aren't convenient for planning around the
impact of timeline archival on costs.

Closes: #8108

## Summary of changes

- Add metric `pageserver_archive_size`, which indicates the logical
bytes of data which we would expect to write into an archived branch.
- Add metric `pageserver_pitr_history_size`, which indicates the
distance between last_record_lsn and the PITR cutoff (see the sketch
after this commit message).

These metrics are somewhat temporary: when we implement #8088 and
associated consumption metric changes, these will reach a final form.
For now, an "archived" branch is just any branch outside of its parent's
PITR window: later, archival will become an explicit state (which will
_usually_ correspond to falling outside the parent's PITR window).

The overall volume of timeline metrics is something to watch, but we are
removing many more in #8245
than this PR is adding.
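
For intuition, `pageserver_pitr_history_size` is plain LSN arithmetic; below is a sketch with a simplified `Lsn` stand-in, not the pageserver's actual metric code.

```rust
/// Simplified stand-in for the pageserver's LSN type (a byte offset into
/// the WAL stream).
#[derive(Clone, Copy)]
struct Lsn(u64);

/// pageserver_pitr_history_size: bytes of WAL between the PITR cutoff and
/// last_record_lsn. Saturating, because the cutoff may not have been
/// computed yet early in a timeline's life.
fn pitr_history_size(last_record_lsn: Lsn, pitr_cutoff: Lsn) -> u64 {
    last_record_lsn.0.saturating_sub(pitr_cutoff.0)
}
```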
jcsp added a commit that referenced this issue Jul 11, 2024
A design for a cheap low-resource state for idle timelines:
- #8088
arpad-m added a commit that referenced this issue Jul 19, 2024
This adds an archival_config endpoint to the pageserver. Currently it
has no effect, and always "works", but later the intent is that it will
make a timeline archived/unarchived.

- [x] add yml spec
- [x] add endpoint handler

Part of #8088
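
A plausible shape for the endpoint's request body, sketched with serde; the exact field and variant names are assumptions based on the description above, not the authoritative spec.

```rust
use serde::{Deserialize, Serialize};

// Hedged sketch of the archival_config request body.
#[derive(Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
enum TimelineArchivalState {
    Archived,
    Unarchived,
}

#[derive(Serialize, Deserialize)]
struct ArchivalConfigRequest {
    state: TimelineArchivalState,
}
```

At this stage the handler would accept such a body and return success without changing any state; the wiring that actually archives or unarchives a timeline comes later.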
arpad-m (Member) commented Jul 22, 2024

arpad-m added a commit that referenced this issue Jul 22, 2024
arpad-m added a commit that referenced this issue Jul 27, 2024
Persists whether a timeline is archived or not in `index_part.json`. We
only return success if the upload has actually worked successfully.

Also introduces a new `index_part.json` version number.

Fixes #8459

Part of #8088
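
For illustration, the persisted state might look like the slice of `index_part.json` below; the version bump and the archived flag come from the description above, while the concrete field name and timestamp-as-string encoding are assumptions.

```rust
use serde::{Deserialize, Serialize};

// Sketch of the relevant slice of index_part.json; everything else elided.
#[derive(Serialize, Deserialize)]
struct IndexPart {
    /// Bumped by this change so readers can detect the new field.
    version: u32,
    /// When set, the timeline is archived; `None` means active.
    /// (An RFC 3339 string is an assumption for this sketch.)
    #[serde(skip_serializing_if = "Option::is_none")]
    archived_at: Option<String>,
    // ... layer file metadata etc. ...
}
```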
arpad-m (Member) commented Aug 19, 2024

This week:

arpad-m added a commit that referenced this issue Oct 29, 2024
Currently, all callers of `unoffload_timeline` ensure that the tenant
the unoffload operation is called on is active. We rely on it being
active as we activate the timeline below and don't want to race with the
activation code of the tenant (in the worst case, activating a timeline
twice).

Therefore, add this assertion.

Part of #8088
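
A minimal sketch of that assertion, with a simplified stand-in for the tenant state:

```rust
// Simplified stand-in for the pageserver's tenant state machine.
enum TenantState {
    Attaching,
    Active,
    Stopping,
}

fn unoffload_timeline(tenant_state: &TenantState) {
    // Callers guarantee the tenant is active; assert it so we can never
    // race tenant activation and end up activating a timeline twice.
    assert!(
        matches!(tenant_state, TenantState::Active),
        "unoffload_timeline must only be called on an active tenant"
    );
    // ... recreate the Timeline object and activate it ...
}
```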
jcsp pushed a commit that referenced this issue Oct 29, 2024
As pointed out in
#9489 (comment),
we didn't previously support deletion of offloaded timelines after the
timeline had been loaded from the manifest rather than having been
offloaded in the same process run.

This was because the upload queue hadn't been initialized yet. This PR
thus initializes the timeline and shuts it down immediately.

Part of #8088
arpad-m added a commit that referenced this issue Oct 30, 2024
Disallow a timeline ancestor detach request if either the to-be-detached
timeline or any of the to-be-reparented timelines is offloaded or
archived.

In theory we could support timelines that are archived but not
offloaded, but archived timelines are at the risk of being offloaded, so
we treat them like offloaded timelines. As for offloaded timelines, any
code to "support" them would amount to unoffloading them, at which point
we can just demand to have the timelines be unarchived.

Part of #8088
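
Sketched as a guard function; names and types are simplified stand-ins, not the actual pageserver types.

```rust
// Simplified stand-in: a timeline is either loaded (and maybe archived)
// or offloaded entirely.
enum TimelineOrOffloaded {
    Timeline { archived: bool },
    Offloaded,
}

fn check_detach_ancestor_allowed(
    detached: &TimelineOrOffloaded,
    reparented: &[TimelineOrOffloaded],
) -> Result<(), String> {
    for t in std::iter::once(detached).chain(reparented.iter()) {
        match t {
            TimelineOrOffloaded::Offloaded => {
                return Err("timeline is offloaded".to_string());
            }
            // Archived-but-not-offloaded could work in theory, but it may
            // become offloaded at any moment, so treat it the same way.
            TimelineOrOffloaded::Timeline { archived: true } => {
                return Err("timeline is archived".to_string());
            }
            TimelineOrOffloaded::Timeline { archived: false } => {}
        }
    }
    Ok(())
}
```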
arpad-m added a commit that referenced this issue Oct 30, 2024
Constructing a remote client is no big deal. Yes, it means an extra
download from S3, but it's not that expensive. This simplifies the code
paths and the scenarios to test, and it unifies timelines that were
recently offloaded with timelines that were offloaded in an earlier
invocation of the process.

Part of #8088
arpad-m added a commit that referenced this issue Oct 30, 2024
If we delete a timeline that has children, those children will have
their data corrupted. Therefore, extend the already existing safety
check to offloaded timelines as well.

Part of #8088
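
A sketch of the extended check, treating loaded and offloaded timelines uniformly; the types are illustrative stand-ins.

```rust
use std::collections::HashMap;

type TimelineId = u128;

struct TimelineEntry {
    ancestor_timeline_id: Option<TimelineId>,
}

/// Refuse deletion while any child, loaded or offloaded, still names the
/// target as its ancestor.
fn ensure_no_children(
    target: TimelineId,
    timelines: &HashMap<TimelineId, TimelineEntry>,
    offloaded_timelines: &HashMap<TimelineId, TimelineEntry>,
) -> Result<(), String> {
    let has_child = timelines
        .values()
        .chain(offloaded_timelines.values())
        .any(|t| t.ancestor_timeline_id == Some(target));
    if has_child {
        Err("timeline has children; deleting it would corrupt their data".to_string())
    } else {
        Ok(())
    }
}
```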
arpad-m (Member) commented Nov 4, 2024

Last week I made and merged a lot of pull requests. Although some of them are quite small, they fix a lot of possible misuses/edge cases that can lead to corruption:

There has also been work by John for #9386, to make manifests more robust/generation ready:

This week:

  • merge the two open PRs to add a test and make it possible to enable offloading for single tenants
  • scrubber changes for persisting the offloaded state (#9386), also deleting old generations of the manifest
  • monitor staging and see if there is any corruption

So it's going really well, and the work is mostly complete. Now the main task is to ensure it rolls out safely, reducing the impact of any possible issue by doing a staged rollout.

arpad-m added a commit that referenced this issue Nov 4, 2024
Allow us to enable timeline offloading for single tenants without having
to enable it for the entire pageserver.

Part of #8088.
arpad-m (Member) commented Nov 11, 2024

Last week:

This week:

I'll focus on the scrubber side of #9386 and continue analyzing tenants.

arpad-m added a commit that referenced this issue Nov 11, 2024
Add a test that ensures the `retain_lsn` functionality works. Right now,
not a single test breaks if offloaded or non-offloaded timelines don't
get registered at their parents (the registration that prevents gc from
discarding the ancestor_lsns of the children). This PR fills that gap.

The test has four modes:

* `offloaded`: offload the child timeline, run compaction on the parent
timeline, unarchive the child timeline, then try reading from it.
Hopefully the data is still there.
* `offloaded-corrupted`: offload the child timeline, then corrupt the
manifest in a way that makes the pageserver believe the timeline was
flattened. This is the closest we can get to pretending the `retain_lsn`
mechanism doesn't exist for offloaded timelines, so we can avoid adding
endpoints to the pageserver that do this manually for tests. The test
then checks that the data is indeed corrupted and the endpoint can't be
started. That way we know that the test is actually working, and
actually tests the `retain_lsn` mechanism, instead of, say, the lsn
lease mechanism or one of the many other mechanisms that impede gc.
* `archived`: the child timeline gets archived but doesn't get
offloaded. This currently matches the `None` case, but we might have
refactors in the future that make archived timelines sufficiently
different from non-archived ones.
* `None`: the child timeline doesn't even get archived. This tests that
normal timelines participate in `retain_lsn`. I've made them locally not
participate in `retain_lsn` (by commenting out the respective
`ancestor_children.push` statement in tenant.rs) and ran the test suite,
and not a single test failed. So this test is the first of its kind.

Part of #8088.
arpad-m added a commit that referenced this issue Nov 15, 2024
PR #9308 has modified tenant activation code to take offloaded child
timelines into account for populating the list of `retain_lsn` values.
However, there are more places than just tenant activation where one
needs to update the `retain_lsn`s.

This PR fixes some bugs of the current code that could lead to
corruption in the worst case:

1. Deleting an offloaded timeline would not get its `retain_lsn`
purged from its parent. With the patch we now do it, but as the parent
can be offloaded as well, the situation is a bit trickier than for
non-offloaded timelines, which can just keep a pointer to their parent.
Here we can't keep a pointer because the parent might get offloaded,
then unoffloaded again, creating a dangling pointer situation. Keeping a
pointer to the *tenant* is not good either, because we might drop the
offloaded timeline in a context where an `offloaded_timelines` lock is
already held: so we don't want to acquire a lock in the drop code of
OffloadedTimeline.
2. Unoffloading a timeline would not get its `retain_lsn` values
populated, leading to it maybe garbage collecting values that its
children might need. We now call `initialize_gc_info` on the parent.
3. Offloading of a timeline would not get its `retain_lsn` values
registered as offloaded at the parent. So if we drop the `Timeline`
object, and its registration is removed, the parent would not have any
of the child's `retain_lsn`s around. Also, before, the `Timeline` object
would delete anything related to its timeline ID, now it only deletes
`retain_lsn`s that have `MaybeOffloaded::No` set.

Incorporates Chi's reproducer from #9753. cc
neondatabase/cloud#20199

The `test_timeline_retain_lsn` test is extended:

1. it gains a new dimension, duplicating each mode, to either have the
"main" branch be the direct parent of the timeline we archive, or the
"test_archived_parent" branch intermediary, creating a three timeline
structure. This doesn't test anything fixed by this PR in particular,
just explores the vast space of possible configurations a little bit
more.
2. it gains two new modes, `offload-parent`, which tests the second
point, and `offload-no-restart` which tests the third point.

It's easy to verify the test actually is "sharp" by removing one of the
respective `self.initialize_gc_info()`, `gc_info.insert_child()` or
`ancestor_children.push()`.

Part of #8088

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Alex Chi Z <chi@neon.tech>
arpad-m added a commit that referenced this issue Nov 18, 2024
…'s parent (#9791)

There is a potential data corruption issue, not one I've encountered,
but it's still not hard to hit with some correct-looking code given our
current architecture. It has to do with timeline objects in memory being
stored via reference-counted `Arc`s, and with the removal of `retain_lsn`
entries at the drop of the last `Arc` reference.

The corruption steps are as follows:

1. timeline gets offloaded. timeline object A doesn't get dropped
though, because some long-running task accesses it
2. the same timeline gets unoffloaded again. timeline object B gets
created for it, timeline object A still referenced. both point to the
same timeline.
3. the task keeping the reference to timeline object A exits. The
destructor for object A runs, removing the `retain_lsn` in the
timeline's parent.
4. the timeline's parent runs gc without the `retain_lsn` of the
still-extant child timeline, leading to data corruption.

In general we are susceptible each time we recreate a `Timeline`
object in the same process, which happens both during a timeline
offload/unoffload cycle and during an ancestor detach operation.

The solution this PR implements is to make the destructor for a timeline
as well as an offloaded timeline remove at most one `retain_lsn`.

PR #9760 has added a log line to print the refcounts at timeline
offload, but this only detects one of the places where we do such a
recycle operation. Plus it doesn't prevent the actual issue.

I doubt that this occurs in practice; it is more a defense-in-depth
measure. Usually I'd assume that the timeline gets dropped immediately
in step 1, as there are no background tasks referencing it after its
shutdown. But one never knows, and reducing the stakes of step 1
actually occurring (from potential data corruption down to wasted CPU
time) is a really good idea.

Part of #8088
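
The essence of the fix, sketched with simplified types: the parent's gc bookkeeping holds one `retain_lsn` entry per registration, and a drop removes at most one matching entry, so a still-live second object for the same timeline keeps its entry.

```rust
type TimelineId = u64;
type Lsn = u64;

// Simplified stand-in for the parent timeline's gc bookkeeping.
struct GcInfo {
    retain_lsns: Vec<(Lsn, TimelineId)>,
}

impl GcInfo {
    /// Called from a child's destructor: remove at most ONE matching
    /// entry. If objects A and B for the same child both registered,
    /// dropping A leaves B's entry (and thus gc safety) intact.
    fn remove_child_once(&mut self, child: TimelineId, lsn: Lsn) {
        if let Some(pos) = self
            .retain_lsns
            .iter()
            .position(|&(l, id)| l == lsn && id == child)
        {
            self.retain_lsns.remove(pos);
        }
    }
}
```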
arpad-m (Member) commented Nov 18, 2024

Last week:

This week:

arpad-m added a commit that referenced this issue Nov 20, 2024
In timeline preloading, we also do a preload for offloaded timelines.
This includes the download of `index-part.json`. Ultimately, such a
download is wasteful, so avoid it. The same goes for the remote
client: we would just discard it immediately thereafter.

Part of #8088

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
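
A sketch of the short-circuit, assuming the set of offloaded timeline IDs is already known from the tenant manifest; the types are illustrative.

```rust
use std::collections::HashSet;

type TimelineId = u64;
struct IndexPart; // parsed index-part.json, details elided

/// Returns None for offloaded timelines: the manifest already carries
/// everything we need, so the index-part download (and the remote client
/// we'd build for it) would be wasted work.
fn preload_timeline(
    timeline_id: TimelineId,
    offloaded: &HashSet<TimelineId>,
) -> Option<IndexPart> {
    if offloaded.contains(&timeline_id) {
        return None;
    }
    // ... download and parse index-part.json for active timelines ...
    Some(IndexPart)
}
```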
arpad-m (Member) commented Nov 25, 2024

Last week:

This week:

If I get time, also:

  • test that offloaded timelines are excluded from heatmaps and never downloaded to secondaries
  • test for many timelines depending on each other
  • test: offload but pageserver crashes somewhere in delete_local_timeline_directory: can the pageserver deal with remnants after a restart?

arpad-m (Member) commented Nov 29, 2024

This week I have:

  • filed Support tenant manifests in the scrubber #9942
  • analyzed "could not find data for key" errors found in staging -> could see that the errors stemmed from different issues
  • identified three prod tenants to try offloading on, and applied the config to them

Next week:

  • get #9942 through review
  • monitor prod tenants for any corruption
  • prepare offloading in a prod region in the release of the week of Dec 9 - Dec 13

github-merge-queue bot pushed a commit that referenced this issue Dec 3, 2024
Support tenant manifests in the storage scrubber:

* list the manifests, order them by generation
* delete all manifests except for the two most recent generations (see
the sketch after this commit message)
* for the latest manifest: try parsing it.

I've tested this patch by running it against a staging bucket, and it
successfully deleted stuff (and avoided deleting the latest two
generations).

In follow-up work, we might want to also check some invariants of the
manifest, as mentioned in #8088.

Part of #9386
Part of #8088

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
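
The retention rule reduces to a sort and a split; here is a sketch under the assumption that each manifest key carries a numeric generation.

```rust
/// Given (generation, key) pairs for one tenant's manifests, return the
/// keys to keep (the two most recent generations) and the keys to delete.
fn partition_manifests(
    mut manifests: Vec<(u64, String)>,
) -> (Vec<String>, Vec<String>) {
    // Order by generation, most recent last.
    manifests.sort_by_key(|&(generation, _)| generation);
    // Everything except the last two generations gets deleted.
    let keep_from = manifests.len().saturating_sub(2);
    let delete: Vec<String> = manifests.drain(..keep_from).map(|(_, k)| k).collect();
    let keep: Vec<String> = manifests.into_iter().map(|(_, k)| k).collect();
    (keep, delete)
}
```

The latest kept manifest is then parsed, confirming it is at least readable.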
arpad-m (Member) commented Dec 6, 2024

This week I have:

Next week:

I think the issue has progressed enough that weekly updates are no longer necessary from now on.

github-merge-queue bot pushed a commit that referenced this issue Dec 11, 2024
This adds some validation of invariants that we want to uphold wrt the
tenant manifest and `index_part.json`:

* the data the manifest has about a timeline must match the data in
`index_part.json`. It might actually change, e.g. when we do reparenting
during detach ancestor, but that requires the timeline to be
unoffloaded, i.e. removed from the manifest.
* any timeline mentioned in the manifest must, if present in
`index_part.json`, be archived there (see the sketch after this commit
message). If we unarchive, we first update the tenant manifest to
unoffload, and only then update the index part. And one needs to archive
before offloading.
* it is legal for timelines to be mentioned in the manifest but have no
`index_part`: this is a temporary state visible during deletion of the
timeline. If the pageserver crashed, an attach of the tenant will clean
the state up.
* it is also legal for offloaded timelines to have an
`ancestor_retain_lsn` of None while having an `ancestor_timeline_id`.
This is for the to-be-added flattening functionality: the plan is to
set the former to None once we have flattened a timeline.

follow-up of #9942
part of #8088
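
Roughly what the archived/offloaded cross-check might look like; the types are simplified stand-ins, not the actual validation code.

```rust
// Views of one timeline from the two sources of truth.
struct ManifestEntry; // presence alone means "offloaded"

struct IndexPartView {
    archived: bool,
}

fn check_offloaded_timeline(
    _manifest: &ManifestEntry,
    index_part: Option<&IndexPartView>,
) -> Result<(), String> {
    match index_part {
        // Legal: transient state during timeline deletion; a tenant
        // attach will clean it up.
        None => Ok(()),
        // Archiving precedes offloading, and unarchiving removes the
        // manifest entry before touching index_part, so an offloaded
        // timeline's index_part must say "archived".
        Some(ip) if !ip.archived => {
            Err("offloaded timeline not archived in index_part".to_string())
        }
        Some(_) => Ok(()),
    }
}
```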