[Core] Avoid reconnection if the node is Gone #2619
Conversation
Test Results 7 files ±0 7 suites ±0 3m 34s ⏱️ -54s Results for commit b446ccc. ± Comparison against base commit d641697. This pull request removes 2 tests.
♻️ This comment has been updated with latest results.
Force-pushed from 255c962 to ebc159e
Thanks for preventing connection attempts to gone nodes @AhmedSoliman. Modulo one comment about a condition, the changes look good to me.
Force-pushed from 770b6cb to 8fe1dcd
LGTM. Left a minor comment regarding the filter condition on `get_or_connect`. Apart from this, +1 for merging :-)
```rust
.lock()
.observed_generations
.get(&node_id.as_plain())
.map(|status| node_id.generation() <= status.generation && status.gone)
```
Could we make this filter even a bit more selective with `node_id.generation() < status.generation || node_id.generation() == status.generation && status.gone`?
Alrighty.
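To illustrate the difference between the two predicates discussed above, here is a minimal sketch using a simplified, hypothetical `NodeStatus` type (not the actual codebase types). The suggested condition additionally treats any generation older than the observed one as gone, even when the `gone` flag has not been set for it:

```rust
// Hypothetical simplified status type, modeling only the fields used in the
// filter condition under discussion.
struct NodeStatus {
    generation: u32,
    gone: bool,
}

// Original filter: requires the `gone` flag for every generation <= observed.
fn is_gone_original(node_gen: u32, status: &NodeStatus) -> bool {
    node_gen <= status.generation && status.gone
}

// Suggested filter: an older generation is always considered superseded,
// regardless of the flag; the current generation is gone only if marked.
fn is_gone_selective(node_gen: u32, status: &NodeStatus) -> bool {
    node_gen < status.generation
        || (node_gen == status.generation && status.gone)
}

fn main() {
    // An older generation that was never explicitly marked gone: only the
    // suggested filter rejects a connection attempt to it.
    let status = NodeStatus { generation: 5, gone: false };
    assert!(!is_gone_original(4, &status));
    assert!(is_gone_selective(4, &status));
}
```

In other words, the suggested condition also prevents reconnects to superseded generations whose `gone` flag was never observed, which is the extra selectivity the review comment asks for.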
Force-pushed from 7f86d7e to e005bdf
Fixes:
- On graceful shutdown, we had a long-standing bug where draining of some connections could get stuck because the connection-aware RPC router holds owned senders in its futures. This was addressed by not blocking on the receive stream on drain; we only process the messages we have already received after sending the shutdown signal. Note that connections terminated by the peer also skip the drain, since we don't want to process further messages from them.
- On graceful shutdown, we had a bug where peers would ignore the control frame carrying the shutting-down signal, since those messages have no header. This is now fixed; it will matter for a future PR that marks this node generation as `Gone` to avoid reconnects.
- On system shutdown, we first stop the cluster controller to:
  - make sure it doesn't react to our own partial/complete loss of connectivity during shutdown;
  - avoid any competition with other controllers that might become leader during shutdown of this node.
- We now drain connections first and stop socket handlers gracefully before continuing the shutdown, to give the shutdown control frame the best chance of being sent to peers. This should make other controllers and parts of the system realise that this node is `Gone` as early as possible, improving failover time (MTTR).
- Minor logging changes.
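The drain fix above can be sketched with a standard-library channel. This is a minimal illustration of the idea, not the actual Restate code: after the shutdown signal is sent, the drain uses a non-blocking receive to process only messages that are already buffered, so a slow or stuck peer can no longer wedge the shutdown:

```rust
use std::sync::mpsc;

// Illustrative drain: consume only the messages that have already arrived.
// `try_recv` never blocks; it returns Err(Empty) once the buffer is exhausted,
// unlike a blocking `recv` loop that would wait for the sender indefinitely.
fn drain_buffered<T>(rx: &mpsc::Receiver<T>) -> Vec<T> {
    let mut drained = Vec::new();
    while let Ok(msg) = rx.try_recv() {
        drained.push(msg);
    }
    drained
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send("a").unwrap();
    tx.send("b").unwrap();
    // The sender is still alive, so a blocking drain would never return here.
    let msgs = drain_buffered(&rx);
    assert_eq!(msgs, vec!["a", "b"]);
}
```

The design choice is that shutdown latency is bounded by work already accepted, not by how long peers keep the stream open.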
- RepairTail bugs fixed; restatectl's digest command now tolerates a failed node.
- RepairTail logging improved to explain what happened.
- Inner retries removed; most retries now live in the outer layers (TBD whether more inner retries need to be removed). Note that this causes some of the outer operations to fail more often than before. This will be evaluated as we test, and fixed at the higher level as needed.
- Minor logging fixes.
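The retry restructuring above can be sketched as follows. This is a generic illustration of the layering, with invented names (`inner_op`, `outer_with_retries`), not code from the repository: the inner operation fails fast on transient errors, and the retry policy lives only in the outer layer:

```rust
// Inner operation with no retries of its own: a transient failure surfaces
// immediately to the caller. (Names here are illustrative only.)
fn inner_op(attempt: u32) -> Result<&'static str, &'static str> {
    if attempt < 2 { Err("transient") } else { Ok("done") }
}

// Outer layer owns the retry policy, so failure handling is decided in one
// place instead of being hidden inside each inner operation.
fn outer_with_retries(max_attempts: u32) -> Result<&'static str, &'static str> {
    let mut last_err = "unattempted";
    for attempt in 0..max_attempts {
        match inner_op(attempt) {
            Ok(v) => return Ok(v),
            Err(e) => last_err = e, // outer layer decides whether to retry
        }
    }
    Err(last_err)
}

fn main() {
    assert_eq!(outer_with_retries(3), Ok("done"));
    // With too few attempts the operation fails visibly, which matches the
    // note above that some outer operations now fail more often than before.
    assert_eq!(outer_with_retries(1), Err("transient"));
}
```

The trade-off, as the description notes, is that outer operations surface failures more often until higher-level handling catches up.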
Stack created with Sapling. Best reviewed with ReviewStack.