Epic: Pageserver Timeline Archival #8088

Open
35 of 46 tasks
jcsp opened this issue Jun 18, 2024 · 12 comments
Labels: c/storage/pageserver (Component: storage: pageserver), t/Epic (Issue type: Epic), t/feature (Issue type: feature, for new features or requests)

jcsp (Collaborator) commented Jun 18, 2024

Purpose

Enable users to create branches fearlessly: no branch count limits to hit, and no need to clean up old branches unless they want to.

Background

Currently, all timelines have significant physical overhead on the pageserver, even if they haven't been used for days/weeks/months:

  • scanning the timeline's remote storage path on tenant startup and loading its index
  • pinning some of the timeline's layers in local storage for logical size calculations
  • running a WAL receiver for the timeline

Changes

This section isn't an authoritative design, but calls out functional areas that will need work.

  • We'll need some manifest in remote storage that the tenant can read on startup to learn which timelines should be loaded in an active state vs. which timelines are hibernated. Keeping this properly up to date with timeline create/delete operations will be a key correctness point (see the sketch after this list).
  • Persist enough information about hibernated timelines that we can know their logical size (& any other key stats) without having to load them fully. It probably makes sense to inline this into the per-tenant object that lists the timelines.
  • Our runtime state in Tenant will need to only store active timelines in Tenant::timelines, and have some other map of hibernated timelines.
  • APIs that list timelines will need either to change their semantics to report only active timelines, to avoid unreasonably large responses when users have many thousands of branches, or to become paginated/queryable.
  • An external API to enable the control plane to tell us when a timeline should be hibernated or awoken. We could also choose to auto-hibernate after some period of inactivity, but that might be duplicative wrt the externally driven mechanism.
  • A cache-warming routine that loads enough layers to serve reads at the tip of the branch, so that when we activate a timeline, the user doesn't encounter a long slow period while data is promoted to local storage.
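
To make the manifest idea in the first bullet concrete, here is a minimal sketch of what such a per-tenant object could look like, assuming serde-serialized JSON; every name and field here is illustrative, not an authoritative schema.

```rust
use serde::{Deserialize, Serialize};

// Hypothetical per-tenant manifest at a well-known key in remote storage.
// All names are assumptions for illustration.
#[derive(Serialize, Deserialize)]
struct TenantManifest {
    /// Bumped whenever the schema changes.
    version: u32,
    /// Timelines that should NOT be loaded into an active state on startup.
    hibernated_timelines: Vec<HibernatedTimeline>,
}

#[derive(Serialize, Deserialize)]
struct HibernatedTimeline {
    timeline_id: String,
    /// Inlined so we can answer size queries without loading the timeline
    /// fully (per the second bullet above).
    logical_size: u64,
    ancestor_timeline_id: Option<String>,
}
```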

Milestone: archived branches are cheap locally -- (no index load on startup, no layers on disk, no Timeline at runtime)


Misc lower priority changes


Milestone: archived branches are cheap in remote storage -- eventually written as compressed image layers at a single LSN

jcsp added the t/feature, c/storage/pageserver, and t/Epic labels on Jun 18, 2024
jcsp changed the title from "Epic: Pageserver Timeline Hibernation" to "Epic: Pageserver Timeline Archival" on Jul 1, 2024
jcsp added a commit that referenced this issue Jul 3, 2024
## Problem

The metrics we have today aren't convenient for planning around the
impact of timeline archival on costs.

Closes: #8108

## Summary of changes

- Add metric `pageserver_archive_size`, which indicates the logical
bytes of data which we would expect to write into an archived branch.
- Add metric `pageserver_pitr_history_size`, which indicates the
distance between last_record_lsn and the PITR cutoff (see the sketch
after this commit message).

These metrics are somewhat temporary: when we implement #8088 and
associated consumption metric changes, these will reach a final form.
For now, an "archived" branch is just any branch outside of its parent's
PITR window: later, archival will become an explicit state (which will
_usually_ correspond to falling outside the parent's PITR window).

The overall volume of timeline metrics is something to watch, but we are
removing many more in #8245
than this PR is adding.
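
For intuition, `pageserver_pitr_history_size` is plain LSN arithmetic; below is a sketch with a simplified `Lsn` stand-in, not the pageserver's actual metric code.

```rust
/// Simplified stand-in for the pageserver's LSN type (a byte offset into
/// the WAL stream).
#[derive(Clone, Copy)]
struct Lsn(u64);

/// pageserver_pitr_history_size: bytes of WAL between the PITR cutoff and
/// last_record_lsn. Saturating, because the cutoff may not have been
/// computed yet early in a timeline's life.
fn pitr_history_size(last_record_lsn: Lsn, pitr_cutoff: Lsn) -> u64 {
    last_record_lsn.0.saturating_sub(pitr_cutoff.0)
}
```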
jcsp added a commit that referenced this issue Jul 11, 2024
A design for a cheap low-resource state for idle timelines:
- #8088
arpad-m added a commit that referenced this issue Jul 19, 2024
This adds an archival_config endpoint to the pageserver. Currently it
has no effect, and always "works", but later the intent is that it will
make a timeline archived/unarchived.

- [x] add yml spec
- [x] add endpoint handler

Part of #8088
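
A plausible shape for the endpoint's request body, sketched with serde; the exact field and variant names are assumptions based on the description above, not the authoritative spec.

```rust
use serde::{Deserialize, Serialize};

// Hedged sketch of the archival_config request body.
#[derive(Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
enum TimelineArchivalState {
    Archived,
    Unarchived,
}

#[derive(Serialize, Deserialize)]
struct ArchivalConfigRequest {
    state: TimelineArchivalState,
}
```

At this stage the handler would accept such a body and return success without changing any state; the wiring that actually archives or unarchives a timeline comes later.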
arpad-m (Member) commented Jul 22, 2024

arpad-m added a commit that referenced this issue Jul 22, 2024
arpad-m added a commit that referenced this issue Jul 27, 2024
Persists whether a timeline is archived or not in `index_part.json`. We
only return success if the upload has actually worked successfully.

Also introduces a new `index_part.json` version number.

Fixes #8459

Part of #8088
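
For illustration, the persisted state might look like the slice of `index_part.json` below; the version bump and the archived flag come from the description above, while the concrete field name and timestamp-as-string encoding are assumptions.

```rust
use serde::{Deserialize, Serialize};

// Sketch of the relevant slice of index_part.json; everything else elided.
#[derive(Serialize, Deserialize)]
struct IndexPart {
    /// Bumped by this change so readers can detect the new field.
    version: u32,
    /// When set, the timeline is archived; `None` means active.
    /// (An RFC 3339 string is an assumption for this sketch.)
    #[serde(skip_serializing_if = "Option::is_none")]
    archived_at: Option<String>,
    // ... layer file metadata etc. ...
}
```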
arpad-m (Member) commented Aug 19, 2024

This week:

arpad-m added a commit that referenced this issue Oct 29, 2024
Currently, all callers of `unoffload_timeline` ensure that the tenant
the unoffload operation is called on is active. We rely on it being
active as we activate the timeline below and don't want to race with the
activation code of the tenant (in the worst case, activating a timeline
twice).

Therefore, add this assertion.

Part of #8088
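
A minimal sketch of that assertion, with a simplified stand-in for the tenant state:

```rust
// Simplified stand-in for the pageserver's tenant state machine.
enum TenantState {
    Attaching,
    Active,
    Stopping,
}

fn unoffload_timeline(tenant_state: &TenantState) {
    // Callers guarantee the tenant is active; assert it so we can never
    // race tenant activation and end up activating a timeline twice.
    assert!(
        matches!(tenant_state, TenantState::Active),
        "unoffload_timeline must only be called on an active tenant"
    );
    // ... recreate the Timeline object and activate it ...
}
```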
jcsp pushed a commit that referenced this issue Oct 29, 2024
As pointed out in
#9489 (comment),
we didn't previously support deletion of offloaded timelines after the
timeline had been loaded from the manifest rather than having been
offloaded in the same process run.

This was because the upload queue hadn't been initialized yet. This PR
thus initializes the timeline and shuts it down immediately.

Part of #8088
arpad-m added a commit that referenced this issue Oct 30, 2024
Disallow a timeline ancestor detach request if either the to-be-detached
timeline or any of the to-be-reparented timelines is offloaded or
archived.

In theory we could support timelines that are archived but not
offloaded, but archived timelines are at the risk of being offloaded, so
we treat them like offloaded timelines. As for offloaded timelines, any
code to "support" them would amount to unoffloading them, at which point
we can just demand to have the timelines be unarchived.

Part of #8088
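
Sketched as a guard function; names and types are simplified stand-ins, not the actual pageserver types.

```rust
// Simplified stand-in: a timeline is either loaded (and maybe archived)
// or offloaded entirely.
enum TimelineOrOffloaded {
    Timeline { archived: bool },
    Offloaded,
}

fn check_detach_ancestor_allowed(
    detached: &TimelineOrOffloaded,
    reparented: &[TimelineOrOffloaded],
) -> Result<(), String> {
    for t in std::iter::once(detached).chain(reparented.iter()) {
        match t {
            TimelineOrOffloaded::Offloaded => {
                return Err("timeline is offloaded".to_string());
            }
            // Archived-but-not-offloaded could work in theory, but it may
            // become offloaded at any moment, so treat it the same way.
            TimelineOrOffloaded::Timeline { archived: true } => {
                return Err("timeline is archived".to_string());
            }
            TimelineOrOffloaded::Timeline { archived: false } => {}
        }
    }
    Ok(())
}
```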
arpad-m added a commit that referenced this issue Oct 30, 2024
Constructing a remote client is no big deal. Yes, it means an extra
download from S3, but it's not that expensive. This simplifies the code
paths and the scenarios to test, and it unifies timelines that were
recently offloaded with timelines that were offloaded in an earlier
invocation of the process.

Part of #8088
arpad-m added a commit that referenced this issue Oct 30, 2024
If we delete a timeline that has children, those children will have
their data corrupted. Therefore, extend the already existing safety
check to offloaded timelines as well.

Part of #8088
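
A sketch of the extended check, treating loaded and offloaded timelines uniformly; the types are illustrative stand-ins.

```rust
use std::collections::HashMap;

type TimelineId = u128;

struct TimelineEntry {
    ancestor_timeline_id: Option<TimelineId>,
}

/// Refuse deletion while any child, loaded or offloaded, still names the
/// target as its ancestor.
fn ensure_no_children(
    target: TimelineId,
    timelines: &HashMap<TimelineId, TimelineEntry>,
    offloaded_timelines: &HashMap<TimelineId, TimelineEntry>,
) -> Result<(), String> {
    let has_child = timelines
        .values()
        .chain(offloaded_timelines.values())
        .any(|t| t.ancestor_timeline_id == Some(target));
    if has_child {
        Err("timeline has children; deleting it would corrupt their data".to_string())
    } else {
        Ok(())
    }
}
```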
arpad-m (Member) commented Nov 4, 2024

Last week I made and merged a lot of pull requests. Although some of them are quite small, they fix a lot of possible misuses/edge cases that can lead to corruption:

There has also been work by John for #9386, to make manifests more robust/generation ready:

This week:

  • merge the two open PRs to add a test and make it possible to enable offloading for single tenants
  • scrubber changes for persisting the offloaded state (#9386), also deleting old generations of the manifest
  • monitor staging and see if there is any corruption

So it's going really well, and the work is mostly complete. Now the main task is to ensure it rolls out safely, reducing the impact of any possible issue by doing a staged rollout.

arpad-m added a commit that referenced this issue Nov 4, 2024
Allow us to enable timeline offloading for single tenants without having
to enable it for the entire pageserver.

Part of #8088.
arpad-m (Member) commented Nov 11, 2024

Last week:

This week:

I'll focus on the scrubber side of #9386 and continue analyzing tenants.

arpad-m added a commit that referenced this issue Nov 11, 2024
Add a test that ensures the `retain_lsn` functionality works. Right now,
not a single test breaks if offloaded or non-offloaded timelines don't
get registered at their parents (the registration that prevents gc from
discarding the ancestor_lsns of the children). This PR fills that gap.

The test has four modes:

* `offloaded`: offload the child timeline, run compaction on the parent
timeline, unarchive the child timeline, then try reading from it.
Hopefully the data is still there.
* `offloaded-corrupted`: offload the child timeline, then corrupt the
manifest in a way that makes the pageserver believe the timeline was
flattened. This is the closest we can get to pretending the `retain_lsn`
mechanism doesn't exist for offloaded timelines, so we can avoid adding
endpoints to the pageserver that do this manually for tests. The test
then checks that the data is indeed corrupted and the endpoint can't be
started. That way we know that the test is actually working, and
actually tests the `retain_lsn` mechanism, instead of, say, the lsn
lease mechanism or one of the many other mechanisms that impede gc.
* `archived`: the child timeline gets archived but doesn't get
offloaded. This currently matches the `None` case, but we might have
refactors in the future that make archived timelines sufficiently
different from non-archived ones.
* `None`: the child timeline doesn't even get archived. This tests that
normal timelines participate in `retain_lsn`. I've made them locally not
participate in `retain_lsn` (by commenting out the respective
`ancestor_children.push` statement in tenant.rs) and ran the test suite,
and not a single test failed. So this test is the first of its kind.

Part of #8088.
arpad-m added a commit that referenced this issue Nov 15, 2024
PR #9308 has modified tenant activation code to take offloaded child
timelines into account for populating the list of `retain_lsn` values.
However, there are more places than just tenant activation where one
needs to update the `retain_lsn`s.

This PR fixes some bugs of the current code that could lead to
corruption in the worst case:

1. Deleting an offloaded timeline would not get its `retain_lsn`
purged from its parent. With the patch we now do it, but as the parent
can be offloaded as well, the situation is a bit trickier than for
non-offloaded timelines, which can just keep a pointer to their parent.
Here we can't keep a pointer because the parent might get offloaded,
then unoffloaded again, creating a dangling pointer situation. Keeping a
pointer to the *tenant* is not good either, because we might drop the
offloaded timeline in a context where an `offloaded_timelines` lock is
already held: so we don't want to acquire a lock in the drop code of
OffloadedTimeline.
2. Unoffloading a timeline would not get its `retain_lsn` values
populated, leading to it maybe garbage collecting values that its
children might need. We now call `initialize_gc_info` on the parent.
3. Offloading of a timeline would not get its `retain_lsn` values
registered as offloaded at the parent. So if we drop the `Timeline`
object, and its registration is removed, the parent would not have any
of the child's `retain_lsn`s around. Also, before, the `Timeline` object
would delete anything related to its timeline ID, now it only deletes
`retain_lsn`s that have `MaybeOffloaded::No` set.

Incorporates Chi's reproducer from #9753. cc
neondatabase/cloud#20199

The `test_timeline_retain_lsn` test is extended:

1. it gains a new dimension, duplicating each mode, to either have the
"main" branch be the direct parent of the timeline we archive, or the
"test_archived_parent" branch intermediary, creating a three timeline
structure. This doesn't test anything fixed by this PR in particular,
just explores the vast space of possible configurations a little bit
more.
2. it gains two new modes, `offload-parent`, which tests the second
point, and `offload-no-restart` which tests the third point.

It's easy to verify the test actually is "sharp" by removing one of the
respective `self.initialize_gc_info()`, `gc_info.insert_child()` or
`ancestor_children.push()`.

Part of #8088

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Alex Chi Z <chi@neon.tech>
arpad-m added a commit that referenced this issue Nov 18, 2024
…'s parent (#9791)

There is a potential data corruption issue, not one I've encountered,
but it's still not hard to hit with some correct-looking code given our
current architecture. It has to do with timeline objects in memory being
stored via reference-counted `Arc`s, and with the removal of `retain_lsn`
entries at the drop of the last `Arc` reference.

The corruption steps are as follows:

1. timeline gets offloaded. timeline object A doesn't get dropped
though, because some long-running task accesses it
2. the same timeline gets unoffloaded again. timeline object B gets
created for it, timeline object A still referenced. both point to the
same timeline.
3. the task keeping the reference to timeline object A exits. The
destructor for object A runs, removing the `retain_lsn` in the
timeline's parent.
4. the timeline's parent runs gc without the `retain_lsn` of the
still-extant child timeline, leading to data corruption.

In general we are susceptible each time we recreate a `Timeline`
object in the same process, which happens both during a timeline
offload/unoffload cycle and during an ancestor detach operation.

The solution this PR implements is to make the destructor for a timeline
as well as an offloaded timeline remove at most one `retain_lsn`.

PR #9760 has added a log line to print the refcounts at timeline
offload, but this only detects one of the places where we do such a
recycle operation. Plus it doesn't prevent the actual issue.

I doubt that this occurs in practice; it is more a defense-in-depth
measure. Usually I'd assume that the timeline gets dropped immediately
in step 1, as there are no background tasks referencing it after its
shutdown. But one never knows, and reducing the stakes of step 1
actually occurring (from potential data corruption down to wasted CPU
time) is a really good idea.

Part of #8088
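
The essence of the fix, sketched with simplified types: the parent's gc bookkeeping holds one `retain_lsn` entry per registration, and a drop removes at most one matching entry, so a still-live second object for the same timeline keeps its entry.

```rust
type TimelineId = u64;
type Lsn = u64;

// Simplified stand-in for the parent timeline's gc bookkeeping.
struct GcInfo {
    retain_lsns: Vec<(Lsn, TimelineId)>,
}

impl GcInfo {
    /// Called from a child's destructor: remove at most ONE matching
    /// entry. If objects A and B for the same child both registered,
    /// dropping A leaves B's entry (and thus gc safety) intact.
    fn remove_child_once(&mut self, child: TimelineId, lsn: Lsn) {
        if let Some(pos) = self
            .retain_lsns
            .iter()
            .position(|&(l, id)| l == lsn && id == child)
        {
            self.retain_lsns.remove(pos);
        }
    }
}
```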
arpad-m (Member) commented Nov 18, 2024

Last week:

This week:

arpad-m added a commit that referenced this issue Nov 20, 2024
In timeline preloading, we also do a preload for offloaded timelines.
This includes the download of `index-part.json`. Ultimately, such a
download is wasteful, so avoid it. The same goes for the remote
client: we would just discard it immediately thereafter.

Part of #8088

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
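
A sketch of the short-circuit, assuming the set of offloaded timeline IDs is already known from the tenant manifest; the types are illustrative.

```rust
use std::collections::HashSet;

type TimelineId = u64;
struct IndexPart; // parsed index-part.json, details elided

/// Returns None for offloaded timelines: the manifest already carries
/// everything we need, so the index-part download (and the remote client
/// we'd build for it) would be wasted work.
fn preload_timeline(
    timeline_id: TimelineId,
    offloaded: &HashSet<TimelineId>,
) -> Option<IndexPart> {
    if offloaded.contains(&timeline_id) {
        return None;
    }
    // ... download and parse index-part.json for active timelines ...
    Some(IndexPart)
}
```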
arpad-m (Member) commented Nov 25, 2024

Last week:

This week:

If I get time, also:

  • test that offloaded timelines are excluded from heatmaps and never downloaded to secondaries
  • test for many timelines depending on each other
  • test: offload but pageserver crashes somewhere in delete_local_timeline_directory: can the pageserver deal with remnants after a restart?

arpad-m (Member) commented Nov 29, 2024

This week I have:

  • filed Support tenant manifests in the scrubber #9942
  • analyzed "could not find data for key" errors found in staging -> could see that the errors stemmed from different issues
  • identified three prod tenants to try offloading on, and applied the config to them

Next week:

  • get #9942 through review
  • monitor prod tenants for any corruption
  • prepare offloading in a prod region in the release of the week of Dec 9 - Dec 13

github-merge-queue bot pushed a commit that referenced this issue Dec 3, 2024
Support tenant manifests in the storage scrubber:

* list the manifests, order them by generation
* delete all manifests except for the two most recent generations (see
the sketch after this commit message)
* for the latest manifest: try parsing it.

I've tested this patch by running it against a staging bucket, and it
successfully deleted stuff (and avoided deleting the latest two
generations).

In follow-up work, we might want to also check some invariants of the
manifest, as mentioned in #8088.

Part of #9386
Part of #8088

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
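
The retention rule reduces to a sort and a split; here is a sketch under the assumption that each manifest key carries a numeric generation.

```rust
/// Given (generation, key) pairs for one tenant's manifests, return the
/// keys to keep (the two most recent generations) and the keys to delete.
fn partition_manifests(
    mut manifests: Vec<(u64, String)>,
) -> (Vec<String>, Vec<String>) {
    // Order by generation, most recent last.
    manifests.sort_by_key(|&(generation, _)| generation);
    // Everything except the last two generations gets deleted.
    let keep_from = manifests.len().saturating_sub(2);
    let delete: Vec<String> = manifests.drain(..keep_from).map(|(_, k)| k).collect();
    let keep: Vec<String> = manifests.into_iter().map(|(_, k)| k).collect();
    (keep, delete)
}
```

The latest kept manifest is then parsed, confirming it is at least readable.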
arpad-m (Member) commented Dec 6, 2024

This week I have:

Next week:

I think the issue has progressed enough that weekly updates are no longer necessary from now on.

github-merge-queue bot pushed a commit that referenced this issue Dec 11, 2024
This adds some validation of invariants that we want to uphold wrt the
tenant manifest and `index_part.json`:

* the data the manifest has about a timeline must match the data in
`index_part.json`. It might actually change, e.g. when we do reparenting
during detach ancestor, but that requires the timeline to be
unoffloaded, i.e. removed from the manifest.
* any timeline mentioned in the manifest must, if present in
`index_part.json`, be archived there (see the sketch after this commit
message). If we unarchive, we first update the tenant manifest to
unoffload, and only then update the index part. And one needs to archive
before offloading.
* it is legal for timelines to be mentioned in the manifest but have no
`index_part`: this is a temporary state visible during deletion of the
timeline. If the pageserver crashed, an attach of the tenant will clean
the state up.
* it is also legal for offloaded timelines to have an
`ancestor_retain_lsn` of None while having an `ancestor_timeline_id`.
This is for the to-be-added flattening functionality: the plan is to
set the former to None once we have flattened a timeline.

follow-up of #9942
part of #8088
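
Roughly what the archived/offloaded cross-check might look like; the types are simplified stand-ins, not the actual validation code.

```rust
// Views of one timeline from the two sources of truth.
struct ManifestEntry; // presence alone means "offloaded"

struct IndexPartView {
    archived: bool,
}

fn check_offloaded_timeline(
    _manifest: &ManifestEntry,
    index_part: Option<&IndexPartView>,
) -> Result<(), String> {
    match index_part {
        // Legal: transient state during timeline deletion; a tenant
        // attach will clean it up.
        None => Ok(()),
        // Archiving precedes offloading, and unarchiving removes the
        // manifest entry before touching index_part, so an offloaded
        // timeline's index_part must say "archived".
        Some(ip) if !ip.archived => {
            Err("offloaded timeline not archived in index_part".to_string())
        }
        Some(_) => Ok(()),
    }
}
```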