
[Core] Improve connection termination on shutdown #2614

Closed · wants to merge 2 commits

Conversation

@AhmedSoliman (Contributor) commented Feb 3, 2025

Fixes:

- On graceful shutdown, we had a long-standing bug where draining some connections could get stuck because the connection-aware rpc router holds owned senders inside the futures it creates. This is addressed by not blocking on the receive stream during drain: we only process messages that had already been received by the time we sent the shutdown signal (see the sketch below). Note that connections already terminated by the peer also skip the drain, since we don't want to process further messages from them.
- On graceful shutdown, we had a bug where peers ignored the Control Frame carrying the shutting-down signal because those messages have no header. This is now fixed; it will matter for a future PR that marks this node generation as `Gone` to avoid reconnects.
- On system shutdown, we now stop the cluster controller first to:
  - Make sure it doesn't react to our own partial or complete loss of connectivity during shutdown
  - Avoid competing with other controllers that might become leader during shutdown of this node
- We now drain connections and stop socket handlers gracefully before continuing the shutdown, to give the shutdown control frame the best chance of reaching peers. This lets other controllers and other parts of the system realise this node is gone as early as possible, improving failover time (MTTR).
- Minor logging changes
```
// intentionally empty
```
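As a concrete illustration of the non-blocking drain described in the first bullet, here is a minimal, self-contained sketch; the stream, message values, and `main` scaffolding are invented for illustration and are not the actual restate code:

```rust
use futures::{stream, FutureExt, StreamExt};

fn main() {
    // Stand-in for a connection's incoming stream with two messages
    // already buffered; the `Result` items mirror decode results.
    let mut incoming = stream::iter(vec![Ok::<_, ()>("msg-1"), Ok("msg-2")]);

    // now_or_never() polls the next() future exactly once, so only
    // already-buffered messages are processed. A sender kept alive
    // elsewhere (e.g. inside an rpc-router future) can no longer make
    // this loop await a stream that never closes.
    while let Some(Some(Ok(msg))) = incoming.next().now_or_never() {
        println!("processed buffered message: {msg}");
    }
    // Anything not yet received is skipped and shutdown proceeds.
}
```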

Stack created with Sapling. Best reviewed with ReviewStack.

github-actions bot commented Feb 3, 2025

Test Results

  7 files  ±0    7 suites  ±0   2m 59s ⏱️ - 1m 29s
 47 tests ±0   46 ✅ ±0  1 💤 ±0  0 ❌ ±0 
182 runs  ±0  179 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit 2bc3292. ± Comparison against base commit d641697.

♻️ This comment has been updated with latest results.

@AhmedSoliman changed the title from "Improve connection termination" to "[Core] Improve connection termination on shutdown" on Feb 4, 2025
@AhmedSoliman marked this pull request as ready for review on February 4, 2025 11:19
@AhmedSoliman force-pushed the pr2614 branch 2 times, most recently from 4db47c7 to df1cfe2 on February 4, 2025 16:00
@tillrohrmann (Contributor) left a comment

Impressive work. Must have been super hard to hunt these problems down. Thanks a lot for finding and fixing them Ahmed 🦸 LGTM. +1 for merging.

```diff
@@ -632,41 +642,43 @@ where
     drop(connection);

     let drain_start = std::time::Instant::now();
-    trace!("Draining connection");
+    debug!("Draining connection");
```
Contributor:

Should this be part of the if branch?

```rust
    PeerMetadataVersion::from(header),
if needs_drain {
    // Draining of incoming queue
    while let Some(Some(Ok(msg))) = incoming.next().now_or_never() {
```
Contributor:

The ConnectionAwareRpcRouter on the sending end of this stream keeps it open because it can hold the sender for this stream in a future it created, and that future is being awaited on the other end of the connection. Did I understand it correctly? Must have been quite hard to track this down. Respect.

@AhmedSoliman (Author):

Correct
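
To make the failure mode concrete, here is a minimal, hypothetical reproduction built on tokio and tokio-stream rather than the actual restate types: an owned sender captured by a never-completing future keeps the channel open, so awaiting `next()` during drain would hang forever, while a single `now_or_never()` poll returns immediately.

```rust
use futures::{FutureExt, StreamExt};
use tokio_stream::wrappers::ReceiverStream;

#[tokio::main]
async fn main() {
    let (tx, rx) = tokio::sync::mpsc::channel::<&str>(4);
    let mut incoming = ReceiverStream::new(rx);

    // Stand-in for the rpc router: an owned sender captured in a
    // future that never completes, so the channel never closes.
    tokio::spawn(async move {
        let _held = tx; // keeps the stream open for this task's lifetime
        std::future::pending::<()>().await;
    });

    // `incoming.next().await` would hang here: the stream is empty but
    // not closed. A single non-blocking poll returns None immediately.
    assert!(incoming.next().now_or_never().is_none());
    println!("drain finished without blocking");
}
```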
