
is_rel_block_key: exclude the relsize key #6266

Merged
merged 1 commit into from
Jan 5, 2024

Conversation

@problame problame (Contributor) commented Jan 3, 2024


Before this PR, `is_rel_block_key` returned true for blknum `0xffffffff`,
a blknum that is never actually written by Postgres, but is used by the
Neon Pageserver to store the relsize.

Quoting @MMeent:

> PostgreSQL can't extend the relation beyond size of 0xFFFFFFFF blocks,
> so block number 0xFFFFFFFE is the last valid block number.

This PR changes the definition of the function to exclude blknum `0xffffffff`.
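
To make the shape of the change concrete, here is a minimal, hedged sketch. The `Key` struct below is a simplified stand-in, not the pageserver's real `Key` (which has more fields); only the roles of the tag, relnode, and blknum fields matter here:

```rust
// Sketch only: simplified stand-in for the pageserver's Key type.
struct Key {
    field1: u8,  // key-kind tag; 0x00 marks the relation keyspace
    field4: u32, // relnode; 0 is reserved for non-block metadata
    field6: u32, // blknum; 0xffffffff is the relsize sentinel
}

/// Before this PR: any key in the relation keyspace counted as a block key,
/// including the relsize sentinel.
fn is_rel_block_key_old(key: &Key) -> bool {
    key.field1 == 0x00 && key.field4 != 0
}

/// After this PR: only block numbers Postgres can actually write
/// (0 ..= 0xfffffffe) count as block keys.
fn is_rel_block_key(key: &Key) -> bool {
    key.field1 == 0x00 && key.field4 != 0 && key.field6 != 0xffffffff
}

fn main() {
    let relsize = Key { field1: 0x00, field4: 1, field6: 0xffffffff };
    assert!(is_rel_block_key_old(&relsize)); // old: relsize looked like a block
    assert!(!is_rel_block_key(&relsize));    // new: it is excluded
}
```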

My motivation for this change is to fix the `pagebench` getpage
benchmark, which uses `is_rel_block_key` to filter the keyspace for
valid pages to request from page_service.
fixes #6210

I checked other users of the function.

The first one is `key_is_shard0`, which had already added an exemption
for `0xffffffff`, so there is no functional change for it with this PR.

The second one is `DatadirModification::flush`[^1]. With this PR,
`.flush()` will skip the relsize key, whereas it didn't
before. This means we will pile up all the relsize key-value pairs `(Key,u32)`
in `DatadirModification::pending_updates` until `.commit()` is called.

The only place I can think of where that would be a problem is if we
import from a full basebackup, and don't `.commit()` regularly,
like we currently don't do in `import_basebackup_from_tar`.
It exposes us to input-controlled allocations.
However, that was already the case for the other keys that are skipped,
so, one can argue that this change is not making the situation much
worse.

[^1]: That type's `flush()` and `commit()` methods are terribly named, but
      that's for another time.
@problame problame requested review from a team as code owners January 3, 2024 16:39
@problame problame requested review from save-buffer, koivunej, hlinnaka, jcsp and MMeent and removed request for a team, save-buffer and koivunej January 3, 2024 16:39
@MMeent MMeent (Contributor) left a comment

Seems OK to me.

@jcsp jcsp (Collaborator) left a comment

LGTM. The reason I didn't make this change when writing shard.rs was out of an abundance of caution, in case there were any code paths using the function in its literal sense to mean "Is this a relation block or the relsize for that relation?". You've done that diligence now, thank you.

@jcsp jcsp (Collaborator) commented Jan 3, 2024

> This means we will pile up all the relsize key-value pairs `(Key,u32)`
> in `DatadirModification::pending_updates` until `.commit()` is called.

This seems okay to me: we already have O(N_relations) memory structures e.g. in the relsize cache on Timeline.
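
A rough back-of-envelope supports this (the per-entry figures below are assumptions for illustration, not measurements):

```rust
fn main() {
    // Assumed: ~18 bytes of serialized Key + 4 bytes of relsize value,
    // with ~3x overhead for the in-memory HashMap/Vec representation.
    let n_relations: u64 = 1_000_000;
    let bytes_per_entry: u64 = (18 + 4) * 3;
    let total_mib = n_relations * bytes_per_entry / (1024 * 1024);
    // Even a million relations retain only tens of MiB until commit().
    println!("~{total_mib} MiB of pending relsize entries"); // ~62 MiB
}
```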

@github-actions bot commented Jan 3, 2024

2208 tests run: 2124 passed, 0 failed, 84 skipped (full report)


Flaky tests (2)

Postgres 16

  • test_pageserver_restarts_under_worload: debug

Postgres 15

  • test_pageserver_restarts_under_worload: debug

Code coverage (full report)

  • functions: 54.4% (9816 of 18049 functions)
  • lines: 81.6% (56455 of 69217 lines)

The comment gets automatically updated with the latest test results
9f8ff6a at 2024-01-04T09:36:25.110Z :recycle:

@problame problame requested review from knizhnik and removed request for hlinnaka and knizhnik January 3, 2024 18:24
@problame problame enabled auto-merge (squash) January 3, 2024 18:29
@knizhnik knizhnik (Contributor) left a comment

It was never noted in this PR that blkno=0xffffffff is not just an invalid block number from the Postgres point of view: it corresponds to a completely different key, the relation size. I have noted that it is necessary either to change the name of this method (if we really want to also scatter the relation-size data) or to change its definition.

@knizhnik knizhnik (Contributor) commented Jan 4, 2024

Just an interesting fact: `is_rel_block_key` is used not only for sharding (shard.rs) but also in `DatadirModification::flush`.
And here the intention was to flush data before metadata (there is a long comment explaining why this is needed):

```rust
///
/// Flush changes accumulated so far to the underlying repository.
///
/// Usually, changes made in DatadirModification are atomic, but this allows
/// you to flush them to the underlying repository before the final `commit`.
/// That allows to free up the memory used to hold the pending changes.
///
/// Currently only used during bulk import of a data directory. In that
/// context, breaking the atomicity is OK. If the import is interrupted, the
/// whole import fails and the timeline will be deleted anyway.
/// (Or to be precise, it will be left behind for debugging purposes and
/// ignored, see <https://github.com/neondatabase/neon/pull/1809>)
///
/// Note: A consequence of flushing the pending operations is that they
/// won't be visible to subsequent operations until `commit`. The function
/// retains all the metadata, but data pages are flushed. That's again OK
/// for bulk import, where you are just loading data pages and won't try to
/// modify the same pages twice.
pub async fn flush(&mut self, ctx: &RequestContext) -> anyhow::Result<()> {
    ...
    // Flush relation and SLRU data blocks, keep metadata.
    let mut retained_pending_updates = HashMap::<_, Vec<_>>::new();
    for (key, values) in self.pending_updates.drain() {
        for (lsn, value) in values {
            if is_rel_block_key(&key) || is_slru_block_key(key) {
                // This bails out on first error without modifying pending_updates.
                // That's Ok, cf this function's doc comment.
                writer.put(key, lsn, &value, ctx).await?;
            } else {
                retained_pending_updates
                    .entry(key)
                    .or_default()
                    .push((lsn, value));
            }
        }
    }
```

But with the original `is_rel_block_key` definition, it returned true also for metadata (the relation size).
So the assumption that we update the relation size only at commit was not correct.

@problame problame (Contributor, Author) commented Jan 4, 2024

> Just an interesting fact: `is_rel_block_key` is used not only for sharding (shard.rs) but also in `DatadirModification::flush`. And here the intention was to flush data before metadata (there is a long comment explaining why this is needed): [...] But with the original `is_rel_block_key` definition, it returned true also for metadata (the relation size). So the assumption that we update the relation size only at commit was not correct.

I covered that in the PR description:

> The second one is `DatadirModification::flush`. With this PR, `.flush()` will skip the relsize key, whereas it didn't before. This means we will pile up all the relsize key-value pairs `(Key,u32)` in `DatadirModification::pending_updates` until `.commit()` is called.
>
> The only place I can think of where that would be a problem is if we import from a full basebackup, and don't `.commit()` regularly, like we currently don't do in `import_basebackup_from_tar`. It exposes us to input-controlled allocations. However, that was already the case for the other keys that are skipped, so one can argue that this change is not making the situation much worse.

Breaking the block comment down:

```rust
/// Usually, changes made in DatadirModification are atomic, but this allows
/// you to flush them to the underlying repository before the final `commit`.
/// That allows to free up the memory used to hold the pending changes.
///
/// Currently only used during bulk import of a data directory. In that
/// context, breaking the atomicity is OK. If the import is interrupted, the
/// whole import fails and the timeline will be deleted anyway.
/// (Or to be precise, it will be left behind for debugging purposes and
/// ignored, see <https://github.com/neondatabase/neon/pull/1809>)
```

Skipping the relsize key seems acceptable here, see arguments in the PR description.

```rust
/// Note: A consequence of flushing the pending operations is that they
/// won't be visible to subsequent operations until `commit`. The function
/// retains all the metadata, but data pages are flushed. That's again OK
/// for bulk import, where you are just loading data pages and won't try to
/// modify the same pages twice.
```

Regarding this Note, I think this PR is a correction, as my understanding is that relsize is metadata.


Given you posted a comment after you approved the PR, I'd ask you to re-affirm that you still think this PR is correct.

@problame problame disabled auto-merge January 4, 2024 09:07
@knizhnik knizhnik (Contributor) commented Jan 4, 2024

> Given you posted a comment after you approved the PR, I'd ask you to re-affirm that you still think this PR is correct.

Yes, it is correct

@problame problame merged commit d260426 into main Jan 5, 2024
47 checks passed
@problame problame deleted the problame/benchmarking/pr/fix-6210-block-0xffffffff branch January 5, 2024 10:48
problame added commits that referenced this pull request Jan 8, 2024
problame added a commit that referenced this pull request Jan 9, 2024
PR #6266 broke the getpage_latest_lsn benchmark.

Before this patch, we'd fail with

```
not implemented: split up range
```

because `r.start = rel size key` and `r.end = rel size key + 1`.

The filtering of the key ranges in that loop is a bit ugly, but I measured:
* setup with 180k layer files (20k tenants * 9 layers)
* total physical size is 463GiB
* for 5k tenants, the range filtering takes `0.6 seconds` on an i3en.3xlarge

That's a tiny fraction of the overall time it takes for pagebench to get
ready to send requests. So, this is good enough for now / there are
other bottlenecks that are bigger.
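
For illustration, a self-contained sketch of the filtering idea described above, with keys modeled as plain `u64` (the real pageserver `Key` is wider, and `is_relsize_sentinel` is a hypothetical helper, not the actual code):

```rust
use std::ops::Range;

// Sketch only: drop keyspace ranges that contain nothing but the relsize
// sentinel, so pagebench never issues a getpage for block 0xffffffff.
fn is_relsize_sentinel(key: u64) -> bool {
    // Hypothetical layout: the low 32 bits play the role of blknum.
    key & 0xffff_ffff == 0xffff_ffff
}

fn filter_keyspace(ranges: Vec<Range<u64>>) -> Vec<Range<u64>> {
    ranges
        .into_iter()
        // A range [relsize_key, relsize_key + 1) covers only the sentinel;
        // exactly the `r.start = rel size key`, `r.end = rel size key + 1`
        // case that triggered "not implemented: split up range".
        .filter(|r| !(is_relsize_sentinel(r.start) && r.end == r.start + 1))
        .collect()
}

fn main() {
    let blocks = 0x0100_0000_0000u64..0x0100_0000_0400; // 1024 real blocks
    let relsize = 0x0100_ffff_ffffu64..0x0101_0000_0000; // sentinel only
    assert_eq!(filter_keyspace(vec![blocks.clone(), relsize]), vec![blocks]);
}
```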
Successfully merging this pull request may close these issues:

pagebench: sends requests for block 0xffffffff on some relations (#6210)