Proxy release 2024-08-22 #8799

vipvap · 2024-08-22T06:01:54Z

Proxy release 2024-08-22

Please merge this Pull Request using 'Create a merge commit' button

With additional phases from #8430 the `detach_ancestor::Error` became untenable. Split it up into phases, and introduce laundering for remaining `anyhow::Error` to propagate them as most often `Error::ShuttingDown`. Additionally, complete FIXMEs. Cc: #6994

## Problem When pageservers do compaction, they frequently create image layers that make earlier layers un-needed for reads, but then keep those earlier layers around for 24 hours waiting for time-based eviction to expire them. Now that we track layer visibility, we can use it as an input to eviction, and avoid the 24 hour "disk bump" that happens around pageserver restarts. ## Summary of changes - During time-based eviction, if a layer is marked Covered, use the eviction period as the threshold: i.e. these layers get to remain resident for at least one iteration of the eviction loop, but then get evicted. With current settings this means they get evicted after 1h instead of 24h. - During disk usage eviction, prioritized evicting covered layers above all other layers. Caveats: - Using the period as the threshold for time based eviction in this case is a bit of a hack, but it avoids adding yet another configuration property, and in any case the value of a new property would be somewhat arbitrary: there's no "right" length of time to keep covered layers around just in case. - We had previously planned on removing time-based eviction: this change would motivate us to keep it around, but we can still simplify the code later to just do the eviction of covered layers, rather than applying a TTL policy to all layers.

## Problem `author_association` doesn't properly work if a GitHub user decides not to show affiliation with the org in their profile (i.e. if it's private) ## Summary of changes - Call `/orgs/ORG/members/USERNAME` API to check whether a PR/issue author is a member of the org

We can get CompactionError::Other(Cancelled) via the error handling with a few ways. [evidence](https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8655/10301613380/index.html#suites/cae012a1e6acdd9fdd8b81541972b6ce/653a33de17802bb1/). Hopefully fix it by: 1. replace the `map_err` which hid the `GetReadyAncestorError::Cancelled` with `From<GetReadyAncestorError> for GetVectoredError` conversion 2. simplifying the code in pgdatadir_mapping to eliminate the token anyhow wrapping for deserialization errors 3. stop wrapping GetVectoredError as anyhow errors 4. stop wrapping PageReconstructError as anyhow errors Additionally, produce warnings if we treat any other error (as was legal before this PR) as missing key. Cc: #8708.

basic JWT implementation that caches JWKs and verifies signatures. this code is currently not reachable from proxy, I just wanted to get something merged in.

) ## Problem This command is kind of a hack, used when we're migrating large tenants and want to get their resident size down. It sets the tenant config to a fixed value, which omitted heatmap_period, so caused secondaries to get out of date. ## Summary of changes - Set heatmap period to the same 300s default that we use elsewhere when updating eviction settings This is not as elegant as some general purpose partial modification of the config, but it practically makes the command safer to use.

## Problem During `Run rust tests` step (for debug builds), we accidentally rebuild neon twice (by `cargo test --doc` and by `cargo nextest run`). It happens because we don't set `cov_prefix` for the `cargo test --doc` command, which triggers rebuilding with different build flags, and one more rebuild by `cargo nextest run`. ## Summary of changes - Set `cov_prefix` for `cargo test --doc` to prevent unneeded rebuilds

Some benchmarks and tests might still fail because of #8655 (tracked in #8708) because we are not fast enough to shut down ([one evidence]). Partially this is explained by the current validation mode of streaming k-merge, but otherwise because that is where we use a lot of time in compaction. Outside of L0 => L1 compaction, the image layer generation is already guarded by vectored reads doing cancellation checks. 32768 is a wild guess based on looking how many keys we put in each layer in a bench (1-2 million), but I assume it will be good enough divisor. Doing checks more often will start showing up as contention which we cannot currently measure. Doing checks less often might be reasonable. [one evidence]: https://neon-github-public-dev.s3.amazonaws.com/reports/main/10384136483/index.html#suites/9681106e61a1222669b9d22ab136d07b/96e6d53af234924/ Earlier PR: #8706.

## Problem It's recommended that a couple of additional RUSTFLAGS be set up to improve the performance of Rust applications on AWS Graviton. See https://github.com/aws/aws-graviton-getting-started/blob/57dc813626d0266f1cc12ef83474745bb1f31fb4/rust.md Note: Apple Silicon is compatible with neoverse-n1: ``` $ clang --version Apple clang version 15.0.0 (clang-1500.3.9.4) Target: arm64-apple-darwin23.6.0 Thread model: posix InstalledDir: /Applications/Xcode_15.4.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin $ $ clang --print-supported-cpus 2>&1 | grep neoverse- neoverse-512tvb neoverse-e1 neoverse-n1 neoverse-n2 neoverse-v1 neoverse-v2 ``` ## Summary of changes - Add `-Ctarget-feature=+lse -Ctarget-cpu=neoverse-n1` to RUSTFLAGS for ARM images

## Problem A bunch of small fixes and improvements for CI, that are too small to have a separate PR for them ## Summary of changes - CI(build-and-test): fix parenthesis - CI(actionlint): fix path to workflow file - CI: remove default args from actions/checkout - CI: remove `gen3` label, using a couple `self-hosted` + `small{,-arm64}`/`large{,-arm64}` is enough - CI: prettify Slack messages, hide links behind text messages - C(build-and-test): add more dependencies to `conclusion` job

…ces (#8717) The `tokio_epoll_uring::Slice` / `tokio_uring::Slice` type is weird. The new `FullSlice` newtype is better. See the doc comment for details. The naming is not ideal, but we'll clean that up in a future refactoring where we move the `FullSlice` into `tokio_epoll_uring`. Then, we'll do the following: * tokio_epoll_uring::Slice is removed * `FullSlice` becomes `tokio_epoll_uring::IoBufView` * new type `tokio_epoll_uring::IoBufMutView` for the current `tokio_epoll_uring::Slice<IoBufMut>` Context ------- I did this work in preparation for #8537. There, I'm changing the type that the `inmemory_layer.rs` passes to `DeltaLayerWriter::put_value_bytes` and thus it seemed like a good opportunity to make this cleanup first.

## Problem The control file contains the id of the safekeeper that uploaded it. Previously, when sending a snapshot of the control file to another sk, it would eventually be gc-ed by the receiving sk. This is incorrect because the original sk might still need it later. ## Summary of Changes When sending a snapshot and the control file contains an uploaded segment: * Create a copy of the segment in s3 with the destination sk in the object name * Tweak the streamed control file to point to the object create in the previous step Note that the snapshot endpoint now has to know the id of the requestor, so the api has been extended to include the node if of the destination sk. Closes #8542

## Problem On macOS, clippy fails with the following error: ``` error: unused import: `crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt` --> pageserver/src/tenant/remote_timeline_client/download.rs:26:5 | 26 | use crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt; | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | = note: `-D unused-imports` implied by `-D warnings` = help: to override `-D warnings` add `#[allow(unused_imports)]` ``` Introduced in #8717 ## Summary of changes - allow `unused_imports` for `crate::virtual_file::owned_buffers_io::io_buf_ext::IoBufExt` on macOS in download.rs

…8727) Per #8674, disallow node configuration while drain/fill are ongoing. Implement it by adding a only-http wrapper `Service::external_node_configure` which checks for operation existing before configuring. Additionally: - allow cancelling drain/fill after a pageserver has restarted and transitioned to WarmingUp Fixes: #8674

this way we do not need to repeat the %node_id everywhere, and we get no stray messages in logs from within the op.

this gives spans for panics, and does not globber loki output by writing to stderr while all of the other logging is to stdout. See: #3475

## Problem Logical replication BGW checking replication lag is not reloading config ## Summary of changes Add handling of reload config request ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>

## Problem `secrets.GITHUB_TOKEN` (with any permissions) is not enough to get a user's membership info if they decide to hide it. ## Summary of changes - Use `secrets.CI_ACCESS_TOKEN` for `gh api` call - Use `pull_request_target` instead of `pull_request` event to get access to secrets

…el (#8687) ## Problem We want to store Nightly Replication test results in the database and notify the relevant Slack channel about failures ## Summary of changes - Store test results in the database - Notify `on-call-compute-staging-stream` about failures

Some benchmarks are failing with a "long" flushing, which might be because there is a queue of in-memory layers, or something else. Add logging to narrow it down. Private slack DM ref: https://neondb.slack.com/archives/D049K7HJ9JM/p1723727305238099

We've had physical replication support for a long time, but we never created an RFC for the feature. This RFC does that after the fact. Even though we've already implemented the feature, let's have a design discussion as if it hadn't done that. It can still be quite insightful. This is written from a pretty compute-centric viewpoint, not much on how it works in the control plane.

## Problem #8588 implemented the mechanism for storage controller leadership transfers. However, there's no tests that exercise the behaviour. ## Summary of changes 1. Teach `neon_local` how to handle multiple storage controller instances. Each storage controller instance gets its own subdirectory (`storage_controller_1, ...`). `storage_controller start|stop` subcommands have also been extended to optionally accept an instance id. 2. Add a storage controller proxy test fixture. It's a basic HTTP server that forwards requests from pageserver and test env to the currently configured storage controller. 3. Add a test which exercises storage controller leadership transfer. 4. Finally fix a couple bugs that the test surfaced

Previously, we protected from multiple ProposerElected messages from the same walproposer with the following condition: msg.term == self.get_last_log_term() && self.flush_lsn() > msg.start_streaming_at It is not exhaustive, i.e. we could still proceed to truncating WAL even though safekeeper inserted something since the divergence point has been calculated. While it was most likely safe because walproposer can't use safekeeper position to commit WAL until last_log_term reaches the current walproposer term, let's be more careful and properly calculate the divergence point like walproposer does.

…first, except l0s (#8729) ## Problem When a secondary location is trying to catch up while a tenant is receiving new writes, it can become quite wasteful: - Downloading L0s which are soon destroyed by compaction to L1s - Downloading older layer files which are soon made irrelevant when covered by image layers. ## Summary of changes Sort the layer files in the heatmap: - L0 layers are the lowest priority - Other layers are sorted to download the highest LSNs first.

## Problem Current superuser check always passes because it returns a tuple like `(False,)`, and then the `if not superuser` passes. ## Summary of changes Fixes the issue by unwrapping the tuple. Verified that it works against a project where I don't have superuser.

## Problem Multiple increase/decrease LFC limit may cause unlimited growth of LFC file because punched holes while LFC shrinking are not reused when LFC is extended. ## Summary of changes Keep track of holes and reused them when LFC size is increased. ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>

Part of #8128. ## Description This PR creates a unified command to run both physical gc and metadata health check as a cron job. This also enables us to add additional tasks to the cron job in the future. Signed-off-by: Yuchen Liang <yuchen@neon.tech>

After the rollout has succeeded, we now set the default image compression to be enabled. We also remove its explicit mention from `neon_fixtures.py` added in #8368 as it is now the default (and we switch to `zstd(1)` which is a bit nicer on CPU time). Part of #5431

* Missing blank lifetimes which is now deprecated. * Matching off unqualified enum variants that could act like variable. * Missing semicolons.

## Problem Storage controllers did not have the right token to speak to their peers for leadership transitions. ## Summary of changes Accept a peer jwt token for the storage controller. Epic: neondatabase/cloud#14701

## Problem We don't have a convenient way for a human to ask "how far are secondary downloads along for this tenant". This is useful when driving migrations of tenants to the storage controller, as we first create a secondary location and want to see it warm up before we cut over. That can already be done via storcon_cli, but we would like a way that doesn't require direct API access. ## Summary of changes Add a metric that reports to total size of layers in the heatmap: this may be used in conjunction with the existing `pageserver_secondary_resident_physical_size` to estimate "warmth" of the secondary location.

part of #8140 The blob writer path now uses `maybe_fatal_err` Signed-off-by: Alex Chi Z <chi@neon.tech>

) ## Problem Database preparation workflow needs Neon artifacts but does not checkout necessary download action. We were lucke in a few runs like this one https://github.com/neondatabase/neon/actions/runs/10413970941/job/28870668020 but this is flaky and a race condition which failed here https://github.com/neondatabase/neon/actions/runs/10446395644/job/28923749772#step:4:1 ## Summary of changes Checkout code (including actions) before invoking download action Successful test run https://github.com/neondatabase/neon/actions/runs/10469356296/job/28992200694

…8769) ## Problem Compaction jobs and other background loops are concurrency-limited through a global semaphore. The current counters allow quantifying how _many_ tasks are waiting. But there is no way to tell how _much_ delay is added by the semaphore. So, add a counter that aggregates the wall clock time seconds spent acquiring the semaphore. The metrics can be used as follows: * retroactively calculate average acquisition time in a given time range * compare the degree of background loop backlog among pageservers The metric is insufficient to calculate * run-up of ongoing acquisitions that haven't finished acquiring yet * Not easily feasible because ["Cancelling a call to acquire makes you lose your place in the queue"](https://docs.rs/tokio/latest/tokio/sync/struct.Semaphore.html#method.acquire) ## Summary of changes * Refactor the metrics to follow the current best practice for typed metrics in `metrics.rs`. * Add the new counter.

if a heartbeat happens during shutdown, then the task is already cancelled and will not be sending responses. Fixes: #8766

Add another allowed_error for this rarity. Fixes: #8773

## Problem We want to run our regression test suite on ARM. ## Summary of changes - run regression tests on release ARM builds - run `build-neon` (including rust tests) on debug ARM builds - add `arch` parameter to test to distinguish them in the allure report and in a database

This removes workspace hack from all libs, not from any binaries. This does not change the behaviour of the hack. Running ``` cargo clean cargo build --release --bin proxy ``` Before this change took 5m16s. After this change took 3m3s. This is because this allows the build to be parallelisable much more.

Track the number of logical snapshot files on an endpoint over time. Signed-off-by: Tristan Partin <tristan@neon.tech>

Adds endpoint GET /tenant/timeline listing all not deleted timelines.

Going through the list of recent flaky tests, trying to fix those related to graceful shutdown. - test_forward_compatibility: flush and wait for uploads to avoid graceful shutdown - test_layer_bloating: in the end the endpoint and vanilla are still up => immediate shutdown - test_lagging_sk: pageserver shutdown is not related to the test => immediate shutdown - test_lsn_lease_size: pageserver flushing is not needed => immediate shutdown Additionally: - remove `wait_for_upload` usage from workload fixture Cc: #8708 Fixes: #8710

Useful for dashboarding the replication metrics of a single endpoint. Signed-off-by: Tristan Partin <tristan@neon.tech>

After #8655, we needed to mark some tests to shut down immediately. To aid these tests, try the new pattern of `flush_ep_to_pageserver` followed by a non-compacting checkpoint. This moves the general graceful shutdown problem of having too much to flush at shutdown into the test. Also, add logging for how long the graceful shutdown took, if we got to complete it for faster log eyeballing. Fixes: #8712 Cc: #8715, #8708

…force img generation (#8776) close #8558 * Directly generate image layers for sparse keyspaces during initdb optimization. * Support force image layer generation for sparse keyspaces. * Fix a bug of incorrect image layer key range in case of duplicated keys. (The added line: `start = img_range.end;`) This can cause overlapping image layers and keys to disappear. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

github-actions · 2024-08-22T07:01:41Z

3801 tests run: 3695 passed, 0 failed, 106 skipped (full report)

Flaky tests (2)

Postgres 16

test_forward_compatibility: debug
test_lfc_resize: debug

Code coverage* (full report)

functions: 32.4% (7242 of 22345 functions)
lines: 50.4% (58463 of 116047 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
a968554 at 2024-08-22T07:01:40.556Z :recycle:}

conradludgate · 2024-08-22T08:55:45Z

Relevant commits:

koivunej and others added 30 commits August 14, 2024 10:16

proxy: start of jwk cache (#8690)

a7028d9

basic JWT implementation that caches JWKs and verifies signatures. this code is currently not reachable from proxy, I just wanted to get something merged in.

storcon: add spans to drain/fill ops (#8735)

52641eb

this way we do not need to repeat the %node_id everywhere, and we get no stray messages in logs from within the op.

storcon: use tracing for logging panics (#8734)

24d347f

this gives spans for panics, and does not globber loki output by writing to stderr while all of the other logging is to stdout. See: #3475

Add logical replication test to exercise snapfiles (#8364)

df086cd

proxy: Fix some warnings by extended clippy checks (#8748)

f246aa3

* Missing blank lifetimes which is now deprecated. * Matching off unqualified enum variants that could act like variable. * Missing semicolons.

VladLazar and others added 16 commits August 20, 2024 15:25

storcon: add peer jwt token (#8764)

fa0750a

## Problem Storage controllers did not have the right token to speak to their peers for leadership transitions. ## Summary of changes Accept a peer jwt token for the storage controller. Epic: neondatabase/cloud#14701

impr(pageserver): abort on fatal I/O writer error (#8777)

c8b9116

part of #8140 The blob writer path now uses `maybe_fatal_err` Signed-off-by: Alex Chi Z <chi@neon.tech>

storcon: handle heartbeater shutdown gracefully (#8767)

477246f

if a heartbeat happens during shutdown, then the task is already cancelled and will not be sending responses. Fixes: #8766

test: test_timeline_ancestor_detach_errors rare allowed_error (#8782)

3b80164

Add another allowed_error for this rarity. Fixes: #8773

Add compute_logical_snapshot_files metric

f4b3c31

Track the number of logical snapshot files on an endpoint over time. Signed-off-by: Tristan Partin <tristan@neon.tech>

safekeeper: add listing timelines

d919770

Adds endpoint GET /tenant/timeline listing all not deleted timelines.

Add compute_receive_lsn metric

99c19ca

Useful for dashboarding the replication metrics of a single endpoint. Signed-off-by: Tristan Partin <tristan@neon.tech>

Prefix current_lsn with compute_

04752df

vipvap requested review from a team as code owners August 22, 2024 06:01

vipvap requested review from arssher, conradludgate and knizhnik and removed request for a team August 22, 2024 06:01

conradludgate approved these changes Aug 22, 2024

View reviewed changes

conradludgate merged commit 66a9900 into release-proxy Aug 22, 2024
162 of 168 checks passed

conradludgate deleted the rc/proxy/2024-08-22 branch August 22, 2024 09:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proxy release 2024-08-22 #8799

Proxy release 2024-08-22 #8799

vipvap commented Aug 22, 2024

github-actions bot commented Aug 22, 2024

Postgres 16

conradludgate commented Aug 22, 2024

Proxy release 2024-08-22 #8799

Proxy release 2024-08-22 #8799

Conversation

vipvap commented Aug 22, 2024