-
Notifications
You must be signed in to change notification settings - Fork 485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
is_rel_block_key: exclude the relsize key #6266
is_rel_block_key: exclude the relsize key #6266
Conversation
Before this PR, `is_rel_block_key` returns true for the blknum `0xffffffff`, which is a blknum that's actually never written by Postgres, but used by Neon Pageserver to store the relsize. Quoting @MMeent: > PostgreSQL can't extend the relation beyond size of 0xFFFFFFFF blocks, > so block number 0xFFFFFFFE is the last valid block number. This PR changes the definition of the function to exclude blknum 0xffffffff. My motivation for doing this change is to fix the `pagebench` getpage benchmark, which uses `is_rel_block_key` to filter the keyspace for valid pages to request from page_service. fixes #6210 I checked other users of the function. The first one is `key_is_shard0`, which already had added an exemption for 0xffffffff. So, there's no functional change with this PR. The second one is `DatadirModification::flush`[^1]. With this PR, `.flush()` will skip the relsize key, whereas it didn't before. This means we will pile up all the relsize key-value pairs `(Key,u32)` in `DatadirModification::pending_updates` until `.commit()` is called. The only place I can think of where that would be a problem is if we import from a full basebackup, and don't `.commit()` regularly, like we currently don't do in `import_basebackup_from_tar`. It exposes us to input-controlled allocations. However, that was already the case for the other keys that are skipped, so, one can argue that this change is not making the situation much worse. [^1]: That type's `flush()` and `commit()` methods are terribly named, but, that's for another time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems OK to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. The reason I didn't make this change when writing shard.rs was out of an abundance of caution, in case there were any code paths using the function in its literal sense to mean "Is this a relation block or the relsize for that relation?". You've done that diligence now, thank you.
This seems okay to me: we already have O(N_relations) memory structures e.g. in the relsize cache on Timeline. |
2208 tests run: 2124 passed, 0 failed, 84 skipped (full report)Flaky tests (2)Postgres 16
Postgres 15
Code coverage (full report)
The comment gets automatically updated with the latest test results
9f8ff6a at 2024-01-04T09:36:25.110Z :recycle: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was never noticed in this PR that blkno=0xffffffff is not just invalid block number from Postgres point of view, it corresponds to the completely different key - relation size. I have noticed that it is necessary either to change name of this mathod (if we really wan to scatter also relation size data) either change definition of this method.
Just an interesting fact:
But with original |
I covered that in the PR description
Breaking the block comment down:
Skipping the relsize key seems acceptable here, see arguments in the PR description.
Regarding this Given you posted a comment after you approved the PR, I'd ask you to re-affirm that you still think this PR is correct. |
Yes, it is correct |
PR #6266 broke the getpage_latest_lsn benchmark. Before this patch, we'd fail with ``` not implemented: split up range ``` because `r.start = rel size key` and `r.end = rel size key + 1`.
PR #6266 broke the getpage_latest_lsn benchmark. Before this patch, we'd fail with ``` not implemented: split up range ``` because `r.start = rel size key` and `r.end = rel size key + 1`.
PR #6266 broke the getpage_latest_lsn benchmark. Before this patch, we'd fail with ``` not implemented: split up range ``` because `r.start = rel size key` and `r.end = rel size key + 1`. The filtering of the key ranges in that loop is a bit ugly, but, I measured: * setup with 180k layer files (20k tenants * 9 layers). * total physical size is 463GiB * 5k tenants, the range filtering takes `0.6 seconds` on an i3en.3xlarge. That's a tiny fraction of the overall time it takes for pagebench to get ready to send requests. So, this is good enough for now / there are other bottlenecks that are bigger.
Before this PR,
is_rel_block_key
returns true for the blknum0xffffffff
,which is a blknum that's actually never written by Postgres, but used by
Neon Pageserver to store the relsize.
Quoting @MMeent:
This PR changes the definition of the function to exclude blknum 0xffffffff.
My motivation for doing this change is to fix the
pagebench
getpagebenchmark, which uses
is_rel_block_key
to filter the keyspace forvalid pages to request from page_service.
fixes #6210
I checked other users of the function.
The first one is
key_is_shard0
, which already had added an exemptionfor 0xffffffff. So, there's no functional change with this PR.
The second one is
DatadirModification::flush
1. With this PR,.flush()
will skip the relsize key, whereas it didn'tbefore. This means we will pile up all the relsize key-value pairs
(Key,u32)
in
DatadirModification::pending_updates
until.commit()
is called.The only place I can think of where that would be a problem is if we
import from a full basebackup, and don't
.commit()
regularly,like we currently don't do in
import_basebackup_from_tar
.It exposes us to input-controlled allocations.
However, that was already the case for the other keys that are skipped,
so, one can argue that this change is not making the situation much
worse.
Footnotes
That type's
flush()
andcommit()
methods are terribly named, but,that's for another time ↩