Commit
Remove at most one retain_lsn entry from (possibly offloaded) timeline's parent (#9791)

There is a potential data corruption issue. It is not one I've encountered, but it is not hard to hit with correct-looking code given our current architecture. It has to do with the timeline's in-memory object being stored via reference-counted `Arc`s, and with the removal of `retain_lsn` entries when the last `Arc` reference is dropped.

The corruption steps are as follows:

1. The timeline gets offloaded. Timeline object A doesn't get dropped though, because some long-running task still accesses it.
2. The same timeline gets unoffloaded again. Timeline object B gets created for it while timeline object A is still referenced. Both point to the same timeline.
3. The task keeping the reference to timeline object A exits. The destructor for object A runs, removing the `retain_lsn` in the timeline's parent.
4. The timeline's parent runs gc without the `retain_lsn` of the still extant child timeline, leading to data corruption.

In general we are susceptible every time we recreate a `Timeline` object in the same process, which happens both during a timeline offload/unoffload cycle and during an ancestor detach operation.

The solution this PR implements is to make the destructor for a timeline, as well as for an offloaded timeline, remove at most one `retain_lsn`.

PR #9760 has added a log line to print the refcounts at timeline offload, but this only detects one of the places where we do such a recycle operation. Plus, it doesn't prevent the actual issue.

I doubt that this occurs in practice. It is more of a defense-in-depth measure. Usually I'd assume that the timeline gets dropped immediately in step 1, as there are no background tasks referencing it after its shutdown. But one never knows, and reducing the stakes of step 1 actually occurring, from potential data corruption to a waste of CPU time, is a really good idea.

Part of #8088
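The four steps and the fix can be illustrated with a minimal, self-contained sketch. It is not the pageserver code: `Parent`, `ChildObject`, `register`, `remove_one`, `retained`, and `simulate_offload_cycle` are hypothetical names, LSNs and timeline IDs are plain integers, and it assumes that each in-memory object registers its own `retain_lsn` entry in the parent when it is created. Under that assumption, a destructor that removes at most one matching entry leaves the entry of the newer object B in place when the stale object A is finally dropped.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the parent timeline's GC bookkeeping.
#[derive(Default)]
struct Parent {
    /// (branch point LSN, child timeline id). GC must keep data at these LSNs.
    retain_lsns: Mutex<Vec<(u64, u32)>>,
}

impl Parent {
    /// Record that a child object needs data at `lsn` to be retained.
    fn register(&self, lsn: u64, child_id: u32) {
        self.retain_lsns.lock().unwrap().push((lsn, child_id));
    }

    /// Remove at most one matching entry, instead of every matching entry.
    fn remove_one(&self, lsn: u64, child_id: u32) {
        let mut entries = self.retain_lsns.lock().unwrap();
        if let Some(pos) = entries.iter().position(|e| *e == (lsn, child_id)) {
            entries.remove(pos);
        }
    }

    /// Number of entries currently held on behalf of `child_id`.
    fn retained(&self, child_id: u32) -> usize {
        let entries = self.retain_lsns.lock().unwrap();
        entries.iter().filter(|(_, id)| *id == child_id).count()
    }
}

/// Hypothetical in-memory object for a child timeline.
struct ChildObject {
    parent: Arc<Parent>,
    lsn: u64,
    id: u32,
}

impl ChildObject {
    /// Assumption: every new object registers its own entry in the parent.
    fn new(parent: Arc<Parent>, lsn: u64, id: u32) -> Arc<Self> {
        parent.register(lsn, id);
        Arc::new(ChildObject { parent, lsn, id })
    }
}

impl Drop for ChildObject {
    fn drop(&mut self) {
        // Runs only when the last Arc reference to this object goes away.
        self.parent.remove_one(self.lsn, self.id);
    }
}

/// Replays steps 1 to 3 and returns the entry counts for the child
/// (while object B is alive, after object B is dropped).
fn simulate_offload_cycle() -> (usize, usize) {
    let parent = Arc::new(Parent::default());
    let a = ChildObject::new(Arc::clone(&parent), 100, 7);
    let task_ref = Arc::clone(&a); // a long-running task holds object A
    drop(a); // step 1: offload, but A survives via task_ref
    let b = ChildObject::new(Arc::clone(&parent), 100, 7); // step 2: unoffload
    drop(task_ref); // step 3: the task exits and A's destructor runs
    let while_b_alive = parent.retained(7);
    drop(b);
    (while_b_alive, parent.retained(7))
}
```

Calling `simulate_offload_cycle()` in this sketch yields one remaining entry while object B is alive and none after B is dropped. A destructor that removed every matching entry would instead leave B unprotected after step 3, which is the situation step 4 describes.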
4fc3af1
5499 tests run: 5272 passed, 1 failed, 226 skipped (full report)
Failures on Postgres 17
test_crafted_wal_end[simple]: debug-x86-64
Test coverage report is not available
4fc3af1 at 2024-11-18T21:35:15.946Z :recycle: