Compute release 2024-11-28 #9935
Merged

Conversation
## Problem

We have a couple of CI workflows that still run on Debian Bullseye, and the default Debian version in images is Bullseye as well (we explicitly set building on Bookworm).

## Summary of changes

- Run `pgbench-pgvector` on Bookworm (fix a couple of packages)
- Run `trigger_bench_on_ec2_machine_in_eu_central_1` on Bookworm
- Change the default `DEBIAN_VERSION` in Dockerfiles to Bookworm
- Make the `pinned` docker tag an alias for `pinned-bookworm`
Rust 1.82.0 will be stable soon; let's get the clippy lint fixes in before the compiler version upgrade.
## Problem

Prefetch is disabled on macOS because `posix_fadvise` is not available. But Neon prefetch does not use this function, and for testing on macOS it is very convenient to have prefetch available.

## Summary of changes

Define `USE_PREFETCH` in the Makefile.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
This PR adds two benchmarks to demonstrate the effect of server-side getpage request batching added in #9321. For the CPU usage, I found that the `prometheus` crate's built-in CPU usage metric accounts seconds at integer granularity. That's not enough once you reduce the target benchmark runtime for local iteration. So, add a new `libmetrics` metric and report that. The benchmarks are disabled because [on our benchmark nodes, timer resolution isn't high enough](https://neondb.slack.com/archives/C059ZC138NR/p1732264223207449). They work (no statement about quality) on my bare-metal devbox. They will be refined and enabled once we find a fix. Candidates at time of writing are:

- #9822
- #9851

Refs:

- Epic: #9376
- Extracted from #9792
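As a rough illustration of the finer-grained CPU accounting such a metric needs (a minimal sketch using the `libc` crate on Linux/macOS, not the actual neon `libmetrics` implementation): `getrusage(2)` reports CPU time at microsecond resolution instead of truncating to whole seconds.

```rust
// Minimal sketch: read the process's user+system CPU time at microsecond
// resolution via getrusage(2), rather than truncating to whole seconds.
fn process_cpu_seconds() -> f64 {
    let mut usage: libc::rusage = unsafe { std::mem::zeroed() };
    // SAFETY: getrusage only writes into the struct we pass in.
    let rc = unsafe { libc::getrusage(libc::RUSAGE_SELF, &mut usage) };
    assert_eq!(rc, 0, "getrusage failed");
    let tv_to_secs =
        |tv: libc::timeval| tv.tv_sec as f64 + tv.tv_usec as f64 / 1_000_000.0;
    tv_to_secs(usage.ru_utime) + tv_to_secs(usage.ru_stime)
}

fn main() {
    // Burn a little CPU so the reading is non-zero.
    let mut x = 0u64;
    for i in 0..10_000_000u64 {
        x = x.wrapping_add(i);
    }
    println!("cpu seconds: {:.6} (checksum {x})", process_cpu_seconds());
}
```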
…#9746)

## Problem

For any given tenant shard, pageservers receive all of the tenant's WAL from the safekeeper. This soft-blocks us from using larger shard counts due to bandwidth concerns and the CPU overhead of filtering out the records.

## Summary of changes

This PR lifts the decoding and interpretation of WAL from the pageserver into the safekeeper. A customised PG replication protocol is used where, instead of sending raw WAL, the safekeeper sends filtered, interpreted records. The receiver drives the protocol selection, so, on the pageserver side, usage of the new protocol is gated by a new pageserver config: `wal_receiver_protocol`.

More granularly, the changes are:

1. Optionally inject the protocol and shard identity into the arguments used for starting replication
2. On the safekeeper side, implement a new WAL sending primitive which decodes and interprets records before sending them over
3. On the pageserver side, implement the ingestion of this new replication message type. It's very similar to what we already have for raw WAL (minus decoding and interpreting).

## Notes

* This PR currently uses my [branch of rust-postgres](https://github.com/neondatabase/rust-postgres/tree/vlad/interpreted-wal-record-replication-support) which includes the deserialization logic for the new replication message type. PR for that is open [here](neondatabase/rust-postgres#32).
* This PR contains changes for both pageservers and safekeepers. It's safe to merge because the new protocol is disabled by default on the pageserver side. We can gradually start enabling it in subsequent releases.
* CI tests are running on #9747

## Links

Related: #9336
Epic: #9329
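The receiver-driven selection boils down to a small config knob on the pageserver. The following is only an illustrative sketch of such a setting, assuming a serde-deserialized config struct; the field and variant names here are hypothetical, not the actual `wal_receiver_protocol` definition.

```rust
use serde::Deserialize;

// Hypothetical sketch of a protocol-selection knob; variant names are
// illustrative, not the real pageserver config.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Deserialize)]
#[serde(rename_all = "kebab-case")]
enum WalReceiverProtocol {
    // Safekeeper streams raw WAL; the pageserver decodes and filters it.
    Vanilla,
    // Safekeeper sends filtered, pre-interpreted records for this shard.
    Interpreted,
}

#[derive(Debug, Deserialize)]
struct PageserverConfigSketch {
    // Defaults to the old protocol so the new one can be rolled out gradually.
    #[serde(default = "default_protocol")]
    wal_receiver_protocol: WalReceiverProtocol,
}

fn default_protocol() -> WalReceiverProtocol {
    WalReceiverProtocol::Vanilla
}
```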
## Problem

The vast majority of the error/warn logs from cplane are about time or data transfer quotas exceeded or endpoint-not-found errors, and not operational errors in proxy or cplane.

## Summary of changes

* Demote cplane error replies to info level.
* Raise other errors from warn back to error.
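As a hedged sketch of the intent (the proxy's real error types differ), the demotion amounts to picking the `tracing` level based on whether the reply is a user-facing rejection or an operational failure:

```rust
use tracing::{error, info};

// Hypothetical classification of control-plane replies; the real proxy
// error types are richer than this.
enum CplaneReply {
    QuotaExceeded,
    EndpointNotFound,
    Operational(String),
}

fn log_cplane_reply(reply: &CplaneReply) {
    match reply {
        // Expected, user-facing outcomes: informative, not operational errors.
        CplaneReply::QuotaExceeded | CplaneReply::EndpointNotFound => {
            info!("control plane rejected the request");
        }
        // Genuine operational failures stay at error level.
        CplaneReply::Operational(msg) => {
            error!(%msg, "control plane request failed");
        }
    }
}
```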
## Problem

Any errors from these async blocks are unconditionally logged at error level, even though we already handle such errors based on context.

## Summary of changes

* Log raw errors from creating and executing cplane requests at debug level.
* Inline macro calls to retain the correct callsite.
Before, we hardcoded the pg_version to 140000, while the code expected version numbers like 14. Now we use an enum, and code from `extension_server.rs` to auto-detect the correct version. The enum helps when we add support for a new version: enums ensure that compilation fails if one forgets to add the version to one of the `match` locations. cc #9218
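A minimal sketch of the idea (names are illustrative, not the exact enum in the codebase): an exhaustive `match` forces every call site to be updated when a new major version is added.

```rust
// Illustrative sketch of a Postgres major-version enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PgMajorVersion {
    V14,
    V15,
    V16,
    V17,
}

impl PgMajorVersion {
    // Exhaustive match: adding a V18 variant without handling it here is a
    // compile error, which is exactly the safety net the enum provides.
    fn dir_name(self) -> &'static str {
        match self {
            PgMajorVersion::V14 => "v14",
            PgMajorVersion::V15 => "v15",
            PgMajorVersion::V16 => "v16",
            PgMajorVersion::V17 => "v17",
        }
    }
}
```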
## Problem

The RequestContext::span shouldn't live for the entire postgres connection, only the handshake.

## Summary of changes

* Slight refactor to the RequestContext to discard the span upon handshake completion.
* Make sure the temporary future for the handshake is dropped (not bound to a variable).
* Runs our nightly fmt script.
## Problem

We don't know how much time the pageserver (PS) is losing during ingest when waiting for remote storage uploads in the flush frozen layer loop. We also don't know how many remote storage requests get a permit without waiting (i.e. are not throttled by the remote_storage concurrency_limit).

## Summary of changes

- Add a metric that accumulates the time waited per shard/PS
- In [remote storage semaphore wait seconds](https://neonprod.grafana.net/d/febd9732-9bcf-4992-a821-49b1f6b02724/remote-storage?orgId=1&var-datasource=HUNg6jvVk&var-instance=pageserver-26.us-east-2.aws.neon.build&var-instance=pageserver-27.us-east-2.aws.neon.build&var-instance=pageserver-28.us-east-2.aws.neon.build&var-instance=pageserver-29.us-east-2.aws.neon.build&var-instance=pageserver-30.us-east-2.aws.neon.build&var-instance=pageserver-31.us-east-2.aws.neon.build&var-instance=pageserver-36.us-east-2.aws.neon.build&var-instance=pageserver-37.us-east-2.aws.neon.build&var-instance=pageserver-38.us-east-2.aws.neon.build&var-instance=pageserver-39.us-east-2.aws.neon.build&var-instance=pageserver-40.us-east-2.aws.neon.build&var-instance=pageserver-41.us-east-2.aws.neon.build&var-request_type=put_object&from=1731961336340&to=1731964762933&viewPanel=3), add a first bucket of 100 microseconds to count requests that do not need to wait on the semaphore

Update: created a new version that uses a Gauge (one increasing value per PS/shard) instead of a histogram, as suggested in review.
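A sketch of the Gauge variant under assumed names (using `tokio::sync::Semaphore` and the `prometheus` crate; the real remote_storage wiring differs): measure the wait for a permit and accumulate it into a per-pageserver/shard gauge.

```rust
use std::time::Instant;

use prometheus::Gauge;
use tokio::sync::{Semaphore, SemaphorePermit};

// Accumulate the seconds spent waiting for a concurrency-limit permit into a
// monotonically increasing gauge (one value per pageserver/shard).
async fn acquire_with_wait_metric<'a>(
    semaphore: &'a Semaphore,
    wait_seconds: &Gauge,
) -> SemaphorePermit<'a> {
    let started = Instant::now();
    let permit = semaphore
        .acquire()
        .await
        .expect("semaphore closed during shutdown");
    wait_seconds.add(started.elapsed().as_secs_f64());
    permit
}
```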
This bump comes from a recommendation from Chi. Signed-off-by: Tristan Partin <tristan@neon.tech>
…e compute side parallelism (#9904)

## Problem

The ingest benchmark tests project migration to Neon, involving these steps:

- COPY relation data
- create indexes
- create constraints

Previously we used only 4 copy jobs, 4 create index jobs, and 7 maintenance workers. After increasing effective_io_concurrency on compute, we see that we can sustain more parallelism in the ingest bench.

## Summary of changes

Increase copy jobs to 8, create index jobs to 8, and maintenance workers to 16.
## Problem

The `pre-merge-checks` workflow relies on the build-tools image. If changes to the `build-tools` image have been merged into the main branch since the last CI run for a PR (with other changes to the `build-tools`), the image will be rebuilt during the merge queue run. Otherwise, cached images are used. Rebuilding the image adds approximately 10 minutes on x86-64 and 20 minutes on arm64 to the process.

## Summary of changes

- Parametrise the `build-build-tools-image` job with arch and Debian version
- Run `pre-merge-checks` only on the Debian 12 x86-64 image
…#9821)

## Problem

#9746 lifted decoding and interpretation of WAL to the safekeeper. This reduced the ingested amount on the pageservers by around 10x for a tenant with 8 shards, but doubled the ingested amount for single sharded tenants. Also, #9746 uses bincode, which doesn't support schema evolution. Technically the schema can be evolved, but it's very cumbersome.

## Summary of changes

This patch set addresses both problems by adding protobuf support for the interpreted WAL records and adding compression support. Compressed protobuf reduced the ingested amount by 100x on the 32 shards `test_sharded_ingest` case (compared to non-interpreted proto). For the 1 shard case the reduction is 5x.

Sister change to `rust-postgres` is [here](neondatabase/rust-postgres#33).

## Links

Related: #9336
Epic: #9329
## Problem

We don't have any observability for the relation size cache. We have seen cache misses cause significant performance impact with high relation counts. Touches #9855.

## Summary of changes

Adds the following metrics:

* `pageserver_relsize_cache_entries`
* `pageserver_relsize_cache_hits`
* `pageserver_relsize_cache_misses`
* `pageserver_relsize_cache_misses_old`
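For illustration only, counters like these are typically plain `IntCounter`s registered once and bumped on the cache's hit/miss paths. This sketch uses the `prometheus` crate directly with hypothetical wiring, not the pageserver's actual metrics module.

```rust
use prometheus::{IntCounter, Registry};

struct RelSizeCacheMetrics {
    hits: IntCounter,
    misses: IntCounter,
}

// Register the hit/miss counters once; the cache then bumps them on lookups.
fn register_relsize_cache_metrics(
    registry: &Registry,
) -> prometheus::Result<RelSizeCacheMetrics> {
    let hits = IntCounter::new(
        "pageserver_relsize_cache_hits",
        "Number of relation size cache hits",
    )?;
    let misses = IntCounter::new(
        "pageserver_relsize_cache_misses",
        "Number of relation size cache misses",
    )?;
    registry.register(Box::new(hits.clone()))?;
    registry.register(Box::new(misses.clone()))?;
    Ok(RelSizeCacheMetrics { hits, misses })
}
```

On each lookup the cache would call `metrics.hits.inc()` or `metrics.misses.inc()` accordingly.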
The valid-layer assumption is a necessary condition for a layer map to be valid, but the check imposed by gc-compaction is stronger than the actual valid-layermap definition. The system can work as long as there are no overlapping layer maps, so we degrade that check into a warning. Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem

When ingesting implicit `ClearVmBits` operations, we silently drop the writes if the relation or page is unknown. There are implicit assumptions around VM pages wrt. explicit/implicit updates, sharding, and relation sizes, which can possibly drop writes incorrectly. Adding a few metrics will allow us to investigate further and tighten up the logic. Touches #9855.

## Summary of changes

Add a `pageserver_wal_ingest_clear_vm_bits_unknown` metric to record dropped `ClearVmBits` writes. Also add comments clarifying the behavior of relation sizes on non-zero shards.
* Promote two logs from mpsc send errors to error level. The channels are unbounded and there shouldn't be errors.
* Fix one multiline log from anyhow::Error. Use Debug instead of Display.
## Problem

close #9859

## Summary of changes

Ensure that the deletion queue gets fully flushed (i.e., the deletion lists get applied) during a graceful shutdown. It is still possible that an incomplete shutdown would leave deletion lists behind and cause a race upon the next startup, but we assume this is unlikely to happen; even if it does, the pageserver should already be in a tainted state and the tenant should be moved elsewhere with a new generation number.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
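The "flush on graceful shutdown" pattern can be sketched generically: enqueue a flush operation carrying an acknowledgment channel and wait for the queue worker to confirm that pending deletion lists have been applied. Names below are hypothetical, not the pageserver's deletion-queue API.

```rust
use tokio::sync::{mpsc, oneshot};

// Hypothetical deletion-queue operations.
enum DeletionQueueOp {
    Delete(String),
    // The worker sends (or drops) the ack once all prior ops are applied.
    Flush(oneshot::Sender<()>),
}

async fn flush_on_shutdown(queue: &mpsc::UnboundedSender<DeletionQueueOp>) {
    let (ack_tx, ack_rx) = oneshot::channel();
    if queue.send(DeletionQueueOp::Flush(ack_tx)).is_ok() {
        // Wait until the worker has drained and applied the deletion lists.
        let _ = ack_rx.await;
    }
}
```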
# Problem

VM (visibility map) pages are stored and managed as any regular relation page, in the VM fork of the main relation. They are also sharded like other pages. Regular WAL writes to the VM pages (typically performed by vacuum) are routed to the correct shard as usual. However, VM pages are also updated via `ClearVmBits` metadata records emitted when main relation pages are updated. These metadata records were sent to all shards, like other metadata records. This had the following effects:

* On shards responsible for VM pages, the `ClearVmBits` applies as expected.
* On shard 0, which knows about the VM relation and its size but doesn't necessarily have any VM pages, the `ClearVmBits` writes may have been applied without also having applied the explicit WAL writes to VM pages.
* If VM pages are spread across multiple shards (unlikely with 256MB stripe size), all shards may have applied `ClearVmBits` if the pages fall within their local view of the relation size, even for pages they do not own.
* On other shards, this caused a relation size cache miss and a DbDir and RelDir lookup before dropping the `ClearVmBits`. With many relations, this could cause significant CPU overhead.

This is not believed to be a correctness problem, but this will be verified in #9914.

Resolves #9855.

# Changes

Route `ClearVmBits` metadata records only to the shards responsible for the VM pages.

Verification of the current VM handling and cleanup of incomplete VM pages on shard 0 (and potentially elsewhere) is left as follow-up work.
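To make the routing concrete: each heap page is tracked by two bits in the VM, so (ignoring the VM page header) one VM page covers roughly `BLCKSZ * 8 / 2` heap pages, and a `ClearVmBits` for a heap block maps to exactly one VM block. A hedged sketch, assuming 8 KB pages and a hypothetical ownership check, not the actual pageserver routing code:

```rust
// Postgres constant, assuming the default 8 KB page size.
const BLCKSZ: u32 = 8192;
// Two visibility bits per heap page. The real formula also subtracts the
// VM page header; it is omitted here for simplicity.
const HEAPBLOCKS_PER_VM_PAGE: u32 = BLCKSZ * 8 / 2;

// The VM block that holds the visibility bits for a given heap block.
fn vm_block_for_heap_block(heap_blkno: u32) -> u32 {
    heap_blkno / HEAPBLOCKS_PER_VM_PAGE
}

// Hypothetical ownership check: route the ClearVmBits record only to the
// shard that owns the VM page, instead of broadcasting it to all shards.
fn shard_owns_clear_vm_bits(
    owns_block: impl Fn(u32) -> bool, // e.g. derived from the shard's key mapping
    heap_blkno: u32,
) -> bool {
    owns_block(vm_block_for_heap_block(heap_blkno))
}
```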
## Problem

For cancellation, a connection is kept open during all the cancel checks.

## Summary of changes

Spawn cancellation checks in the background, and close the connection immediately. Use task_tracker for the cancellation checks.
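A hedged sketch of the spawn-and-return shape using `tokio_util::task::TaskTracker` (the proxy's actual cancellation types are not shown): the check runs in the background, so the client connection can be closed immediately, and the tracker lets shutdown wait for in-flight checks.

```rust
use tokio_util::task::TaskTracker;

// Hypothetical cancellation work; stands in for the real cancel-key checks.
async fn run_cancel_checks(cancel_key: u64) {
    let _ = cancel_key; // ... look up the session and forward the cancel request ...
}

fn spawn_cancel_checks(tracker: &TaskTracker, cancel_key: u64) {
    // Fire and forget: the caller can close the client connection right away.
    let _ = tracker.spawn(run_cancel_checks(cancel_key));
}

async fn shutdown(tracker: TaskTracker) {
    // During shutdown, stop accepting new tasks and wait for in-flight checks.
    tracker.close();
    tracker.wait().await;
}
```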
## Problem

We currently see elevated levels of errors for GetBlob requests. This is because 404 and 304 are counted as errors for metric reporting.

## Summary of changes

Bring the implementation in line with the S3 client and treat 404 and 304 responses as ok for metric purposes.

Related: neondatabase/cloud#20666
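In effect this is a metrics-classification tweak; a minimal sketch (with made-up labels, not the real storage client code) looks like this:

```rust
// Map an HTTP status to the label used for request metrics. 404 (Not Found)
// and 304 (Not Modified) are expected outcomes for GetBlob-style requests,
// so they count as "ok" rather than inflating the error rate.
fn request_outcome_label(status: u16) -> &'static str {
    match status {
        200..=299 | 304 | 404 => "ok",
        _ => "error",
    }
}

fn main() {
    assert_eq!(request_outcome_label(200), "ok");
    assert_eq!(request_outcome_label(404), "ok");
    assert_eq!(request_outcome_label(500), "error");
}
```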
Build the `pg_visibility` extension for use with `neon_local`. This is useful to inspect the visibility map for debugging. Touches #9914.
We follow the practice of keeping the compiler up to date, pointing to the latest release. This is done by many other projects in the Rust ecosystem as well. [Release notes](https://releases.rs/docs/1.83.0/). Also update `cargo-hakari`, `cargo-deny`, `cargo-hack` and `cargo-nextest` to their latest versions. Prior update was in #9445.
…nses (#9928)

## Problem

For the interpreted proto, the pageserver is not returning the correct LSN in replies to keep-alive requests. This is because the interpreted protocol arm was not updating `last_rec_lsn`.

## Summary of changes

* Return the correct LSN in keep-alive responses
* Fix the shard field in WAL sender traces
## Problem

Currently, we rerun only known flaky tests. This approach was chosen to reduce the number of tests that go unnoticed (by forcing people to take a look at failed tests and rerun the job manually), but it has some drawbacks:

- In PRs, people tend to push new changes without checking failed tests (that's ok)
- In the main branch, tests are just restarted without checking (understandable)
- Parametrised tests become flaky one by one, i.e. if `test[1]` is flaky, `test[2]` is not marked as flaky automatically (which may or may not be the case)

I suggest rerunning all failed tests to increase the stability of GitHub jobs and using the Grafana dashboard with flaky tests for deeper analysis.

## Summary of changes

- Rerun all failed tests twice at max
## Problem

We used `set_path()` to replace the database name in the connection string. It automatically does url-safe encoding if the path is not already encoded, but it does it as per the URL standard, which assumes that tabs can be safely removed from the path without changing the meaning of the URL. See, e.g., https://url.spec.whatwg.org/#concept-basic-url-parser. It also breaks for DBs with properly %-encoded names, like with `%20`, as they are kept intact, but actually should be escaped. Yet, this is not true for Postgres, where it's completely valid to have trailing tabs in the database name.

I think this is the PR that caused this regression #9717, as it switched from `postgres::config::Config` back to `set_path()`. This was fixed a while ago already [1], btw, I just haven't added a test to catch this regression back then :(

## Summary of changes

This commit changes the code back to use `postgres/tokio_postgres::Config` everywhere. While on it, also do some changes around, as I had to touch this code:

1. Bump some logging from `debug` to `info` in the spec apply path. We do not use `debug` in prod, and it was tricky to understand what was going on with this bug in prod.
2. Refactor the configuration concurrency calculation code so it is reusable. Yet, still keep `1` in the case of reconfiguration. The database can be actively used at this moment, so we cannot guarantee that there will be enough spare connection slots, and the underlying code won't handle connection errors properly.
3. Simplify the installed extensions code. It was spawning a blocking task inside an async function, which doesn't make much sense. Instead, just have a main sync function and call it with `spawn_blocking` in the API code -- the only place we need it to be async.
4. Add a regression python test to cover this and related problems in the future. Also, add more extensive testing of schema dump and DBs and roles listing API.

[1]: 4d1e48f
[2]: https://www.postgresql.org/message-id/flat/20151023003445.931.91267%40wrigleys.postgresql.org

Resolves neondatabase/cloud#20869
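The gist of the fix: build the connection parameters with `tokio_postgres::Config`, which treats the database name as an opaque string, instead of splicing it into a URL where percent-encoding rules and whitespace trimming apply. A small sketch (connection details are placeholders, not the compute node's real settings):

```rust
use tokio_postgres::{Config, NoTls};

#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    // A database name with a trailing tab is valid in Postgres, but gets
    // mangled if pushed through URL path manipulation such as set_path().
    let dbname = "neondb\t";

    let mut config = Config::new();
    config
        .host("localhost") // placeholder connection details
        .port(5432)
        .user("cloud_admin")
        .dbname(dbname); // passed through verbatim, no URL escaping

    let (client, connection) = config.connect(NoTls).await?;
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {e}");
        }
    });

    let row = client.query_one("SELECT current_database()", &[]).await?;
    assert_eq!(row.get::<_, String>(0), dbname);
    Ok(())
}
```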
ololobus requested review from problame, conradludgate, hlinnaka, sharnoff, clipperhouse and NanoBjorn and removed request for a team on November 28, 2024 at 21:47.

bayandin approved these changes on Nov 28, 2024.
6952 tests run: 6644 passed, 0 failed, 308 skipped (full report)

Flaky tests (2): Postgres 17, Postgres 15

Code coverage* (full report)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.

42fb3c4 at 2024-11-28T23:49:17.313Z :recycle:
sharnoff approved these changes on Nov 28, 2024.

hlinnaka approved these changes on Nov 29, 2024.
## Problem

We have this incident ongoing on prod: #investigation-2024-11-28-stuck-projects-with-tabs-in-database-names

## Summary of changes

Release the current main; it should be OK.