
is_rel_block_key: exclude the relsize key #6266

Merged
merged 1 commit into from
Jan 5, 2024

Conversation

@problame problame (Contributor) commented Jan 3, 2024


Before this PR, `is_rel_block_key` returned true for blknum `0xffffffff`,
a blknum that is never actually written by Postgres, but is used by the
Neon Pageserver to store the relsize.

Quoting @MMeent:

> PostgreSQL can't extend the relation beyond size of 0xFFFFFFFF blocks,
> so block number 0xFFFFFFFE is the last valid block number.

This PR changes the definition of the function to exclude blknum `0xffffffff`.
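
To make the shape of the change concrete, here is a minimal, hedged sketch. The `Key` struct below is a simplified stand-in, not the pageserver's real `Key` (which has more fields); only the roles of the tag, relnode, and blknum fields matter here:

```rust
// Sketch only: simplified stand-in for the pageserver's Key type.
struct Key {
    field1: u8,  // key-kind tag; 0x00 marks the relation keyspace
    field4: u32, // relnode; 0 is reserved for non-block metadata
    field6: u32, // blknum; 0xffffffff is the relsize sentinel
}

/// Before this PR: any key in the relation keyspace counted as a block key,
/// including the relsize sentinel.
fn is_rel_block_key_old(key: &Key) -> bool {
    key.field1 == 0x00 && key.field4 != 0
}

/// After this PR: only block numbers Postgres can actually write
/// (0 ..= 0xfffffffe) count as block keys.
fn is_rel_block_key(key: &Key) -> bool {
    key.field1 == 0x00 && key.field4 != 0 && key.field6 != 0xffffffff
}

fn main() {
    let relsize = Key { field1: 0x00, field4: 1, field6: 0xffffffff };
    assert!(is_rel_block_key_old(&relsize)); // old: relsize looked like a block
    assert!(!is_rel_block_key(&relsize));    // new: it is excluded
}
```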

My motivation for this change is to fix the `pagebench` getpage
benchmark, which uses `is_rel_block_key` to filter the keyspace for
valid pages to request from page_service.
fixes #6210

I checked other users of the function.

The first one is `key_is_shard0`, which had already added an exemption
for `0xffffffff`, so there is no functional change for it with this PR.

The second one is `DatadirModification::flush`[^1]. With this PR,
`.flush()` will skip the relsize key, whereas it didn't
before. This means we will pile up all the relsize key-value pairs `(Key,u32)`
in `DatadirModification::pending_updates` until `.commit()` is called.

The only place I can think of where that would be a problem is if we
import from a full basebackup, and don't `.commit()` regularly,
like we currently don't do in `import_basebackup_from_tar`.
It exposes us to input-controlled allocations.
However, that was already the case for the other keys that are skipped,
so, one can argue that this change is not making the situation much
worse.

[^1]: That type's `flush()` and `commit()` methods are terribly named, but
      that's for another time.
@problame problame requested review from a team as code owners January 3, 2024 16:39
@problame problame requested review from save-buffer, koivunej, hlinnaka, jcsp and MMeent and removed request for a team, save-buffer and koivunej January 3, 2024 16:39
@MMeent MMeent (Contributor) left a comment

Seems OK to me.

@jcsp jcsp (Collaborator) left a comment

LGTM. The reason I didn't make this change when writing shard.rs was out of an abundance of caution, in case there were any code paths using the function in its literal sense to mean "Is this a relation block or the relsize for that relation?". You've done that diligence now, thank you.

@jcsp jcsp (Collaborator) commented Jan 3, 2024

> This means we will pile up all the relsize key-value pairs `(Key,u32)`
> in `DatadirModification::pending_updates` until `.commit()` is called.

This seems okay to me: we already have O(N_relations) memory structures e.g. in the relsize cache on Timeline.
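
A rough back-of-envelope supports this (the per-entry figures below are assumptions for illustration, not measurements):

```rust
fn main() {
    // Assumed: ~18 bytes of serialized Key + 4 bytes of relsize value,
    // with ~3x overhead for the in-memory HashMap/Vec representation.
    let n_relations: u64 = 1_000_000;
    let bytes_per_entry: u64 = (18 + 4) * 3;
    let total_mib = n_relations * bytes_per_entry / (1024 * 1024);
    // Even a million relations retain only tens of MiB until commit().
    println!("~{total_mib} MiB of pending relsize entries"); // ~62 MiB
}
```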

@github-actions bot commented Jan 3, 2024

2208 tests run: 2124 passed, 0 failed, 84 skipped (full report)


Flaky tests (2)

Postgres 16

  • test_pageserver_restarts_under_worload: debug

Postgres 15

  • test_pageserver_restarts_under_worload: debug

Code coverage (full report)

  • functions: 54.4% (9816 of 18049 functions)
  • lines: 81.6% (56455 of 69217 lines)

The comment gets automatically updated with the latest test results
9f8ff6a at 2024-01-04T09:36:25.110Z :recycle:

@problame problame requested review from knizhnik and removed request for hlinnaka and knizhnik January 3, 2024 18:24
@problame problame enabled auto-merge (squash) January 3, 2024 18:29
@knizhnik knizhnik (Contributor) left a comment

It was never noted in this PR that blkno=0xffffffff is not just an invalid block number from the Postgres point of view: it corresponds to a completely different key, the relation size. I have noted that it is necessary either to change the name of this method (if we really want to also scatter the relation-size data) or to change its definition.

@knizhnik knizhnik (Contributor) commented Jan 4, 2024

Just an interesting fact: `is_rel_block_key` is used not only for sharding (shard.rs) but also in `DatadirModification::flush`.
And here the intention was to flush data before metadata (there is a long comment explaining why this is needed):

```rust
///
/// Flush changes accumulated so far to the underlying repository.
///
/// Usually, changes made in DatadirModification are atomic, but this allows
/// you to flush them to the underlying repository before the final `commit`.
/// That allows to free up the memory used to hold the pending changes.
///
/// Currently only used during bulk import of a data directory. In that
/// context, breaking the atomicity is OK. If the import is interrupted, the
/// whole import fails and the timeline will be deleted anyway.
/// (Or to be precise, it will be left behind for debugging purposes and
/// ignored, see <https://github.com/neondatabase/neon/pull/1809>)
///
/// Note: A consequence of flushing the pending operations is that they
/// won't be visible to subsequent operations until `commit`. The function
/// retains all the metadata, but data pages are flushed. That's again OK
/// for bulk import, where you are just loading data pages and won't try to
/// modify the same pages twice.
pub async fn flush(&mut self, ctx: &RequestContext) -> anyhow::Result<()> {
    ...
    // Flush relation and SLRU data blocks, keep metadata.
    let mut retained_pending_updates = HashMap::<_, Vec<_>>::new();
    for (key, values) in self.pending_updates.drain() {
        for (lsn, value) in values {
            if is_rel_block_key(&key) || is_slru_block_key(key) {
                // This bails out on first error without modifying pending_updates.
                // That's Ok, cf this function's doc comment.
                writer.put(key, lsn, &value, ctx).await?;
            } else {
                retained_pending_updates
                    .entry(key)
                    .or_default()
                    .push((lsn, value));
            }
        }
    }
```

But with the original `is_rel_block_key` definition, it returned true also for metadata (the relation size).
So the assumption that we update the relation size only at commit was not correct.

@problame problame (Contributor, Author) commented Jan 4, 2024

> Just an interesting fact: `is_rel_block_key` is used not only for sharding (shard.rs) but also in `DatadirModification::flush`. And here the intention was to flush data before metadata (there is a long comment explaining why this is needed): [...] But with the original `is_rel_block_key` definition, it returned true also for metadata (the relation size). So the assumption that we update the relation size only at commit was not correct.

I covered that in the PR description:

> The second one is `DatadirModification::flush`. With this PR, `.flush()` will skip the relsize key, whereas it didn't before. This means we will pile up all the relsize key-value pairs `(Key,u32)` in `DatadirModification::pending_updates` until `.commit()` is called.
>
> The only place I can think of where that would be a problem is if we import from a full basebackup, and don't `.commit()` regularly, like we currently don't do in `import_basebackup_from_tar`. It exposes us to input-controlled allocations. However, that was already the case for the other keys that are skipped, so one can argue that this change is not making the situation much worse.

Breaking the block comment down:

```rust
/// Usually, changes made in DatadirModification are atomic, but this allows
/// you to flush them to the underlying repository before the final `commit`.
/// That allows to free up the memory used to hold the pending changes.
///
/// Currently only used during bulk import of a data directory. In that
/// context, breaking the atomicity is OK. If the import is interrupted, the
/// whole import fails and the timeline will be deleted anyway.
/// (Or to be precise, it will be left behind for debugging purposes and
/// ignored, see <https://github.com/neondatabase/neon/pull/1809>)
```

Skipping the relsize key seems acceptable here, see arguments in the PR description.

```rust
/// Note: A consequence of flushing the pending operations is that they
/// won't be visible to subsequent operations until `commit`. The function
/// retains all the metadata, but data pages are flushed. That's again OK
/// for bulk import, where you are just loading data pages and won't try to
/// modify the same pages twice.
```

Regarding this Note, I think this PR is a correction, as my understanding is that relsize is metadata.


Given you posted a comment after you approved the PR, I'd ask you to re-affirm that you still think this PR is correct.

@problame problame disabled auto-merge January 4, 2024 09:07
@knizhnik knizhnik (Contributor) commented Jan 4, 2024

> Given you posted a comment after you approved the PR, I'd ask you to re-affirm that you still think this PR is correct.

Yes, it is correct

@problame problame merged commit d260426 into main Jan 5, 2024
47 checks passed
@problame problame deleted the problame/benchmarking/pr/fix-6210-block-0xffffffff branch January 5, 2024 10:48
problame added commits that referenced this pull request Jan 8, 2024
problame added a commit that referenced this pull request Jan 9, 2024
PR #6266 broke the getpage_latest_lsn benchmark.

Before this patch, we'd fail with

```
not implemented: split up range
```

because `r.start = rel size key` and `r.end = rel size key + 1`.

The filtering of the key ranges in that loop is a bit ugly, but I measured:
* setup with 180k layer files (20k tenants * 9 layers)
* total physical size is 463GiB
* for 5k tenants, the range filtering takes `0.6 seconds` on an i3en.3xlarge

That's a tiny fraction of the overall time it takes for pagebench to get
ready to send requests. So, this is good enough for now / there are
other bottlenecks that are bigger.
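
For illustration, a self-contained sketch of the filtering idea described above, with keys modeled as plain `u64` (the real pageserver `Key` is wider, and `is_relsize_sentinel` is a hypothetical helper, not the actual code):

```rust
use std::ops::Range;

// Sketch only: drop keyspace ranges that contain nothing but the relsize
// sentinel, so pagebench never issues a getpage for block 0xffffffff.
fn is_relsize_sentinel(key: u64) -> bool {
    // Hypothetical layout: the low 32 bits play the role of blknum.
    key & 0xffff_ffff == 0xffff_ffff
}

fn filter_keyspace(ranges: Vec<Range<u64>>) -> Vec<Range<u64>> {
    ranges
        .into_iter()
        // A range [relsize_key, relsize_key + 1) covers only the sentinel;
        // exactly the `r.start = rel size key`, `r.end = rel size key + 1`
        // case that triggered "not implemented: split up range".
        .filter(|r| !(is_relsize_sentinel(r.start) && r.end == r.start + 1))
        .collect()
}

fn main() {
    let blocks = 0x0100_0000_0000u64..0x0100_0000_0400; // 1024 real blocks
    let relsize = 0x0100_ffff_ffffu64..0x0101_0000_0000; // sentinel only
    assert_eq!(filter_keyspace(vec![blocks.clone(), relsize]), vec![blocks]);
}
```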
Successfully merging this pull request may close these issues:

pagebench: sends requests for block 0xffffffff on some relations (#6210)