
[Core] Improve connection termination on shutdown #2614

Closed · wants to merge 2 commits

Conversation

@AhmedSoliman (Contributor) commented Feb 3, 2025

Fixes:

- On graceful shutdown, we had a long-standing bug where draining some connections could get stuck because the connection-aware rpc router holds owned senders inside the futures it creates. This is addressed by not blocking on the receive stream during drain: we only process messages that had already been received by the time we sent the shutdown signal (see the sketch below). Note that connections already terminated by the peer also skip the drain, since we don't want to process further messages from them.
- On graceful shutdown, we had a bug where peers ignored the Control Frame carrying the shutting-down signal because those messages have no header. This is now fixed; it will matter for a future PR that marks this node generation as `Gone` to avoid reconnects.
- On system shutdown, we now stop the cluster controller first to:
  - Make sure it doesn't react to our own partial or complete loss of connectivity during shutdown
  - Avoid competing with other controllers that might become leader during shutdown of this node
- We now drain connections and stop socket handlers gracefully before continuing the shutdown, to give the shutdown control frame the best chance of reaching peers. This lets other controllers and other parts of the system realise this node is gone as early as possible, improving failover time (MTTR).
- Minor logging changes
```
// intentionally empty
```
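As a concrete illustration of the non-blocking drain described in the first bullet, here is a minimal, self-contained sketch; the stream, message values, and `main` scaffolding are invented for illustration and are not the actual restate code:

```rust
use futures::{stream, FutureExt, StreamExt};

fn main() {
    // Stand-in for a connection's incoming stream with two messages
    // already buffered; the `Result` items mirror decode results.
    let mut incoming = stream::iter(vec![Ok::<_, ()>("msg-1"), Ok("msg-2")]);

    // now_or_never() polls the next() future exactly once, so only
    // already-buffered messages are processed. A sender kept alive
    // elsewhere (e.g. inside an rpc-router future) can no longer make
    // this loop await a stream that never closes.
    while let Some(Some(Ok(msg))) = incoming.next().now_or_never() {
        println!("processed buffered message: {msg}");
    }
    // Anything not yet received is skipped and shutdown proceeds.
}
```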

Stack created with Sapling. Best reviewed with ReviewStack.

github-actions bot commented Feb 3, 2025

Test Results

  7 files  ±0    7 suites  ±0   2m 59s ⏱️ - 1m 29s
 47 tests ±0   46 ✅ ±0  1 💤 ±0  0 ❌ ±0 
182 runs  ±0  179 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit 2bc3292. ± Comparison against base commit d641697.

♻️ This comment has been updated with latest results.

@AhmedSoliman changed the title from "Improve connection termination" to "[Core] Improve connection termination on shutdown" on Feb 4, 2025
@AhmedSoliman marked this pull request as ready for review on February 4, 2025 11:19
@AhmedSoliman force-pushed the pr2614 branch 2 times, most recently from 4db47c7 to df1cfe2 on February 4, 2025 16:00
@tillrohrmann (Contributor) left a comment

Impressive work. Must have been super hard to hunt these problems down. Thanks a lot for finding and fixing them Ahmed 🦸 LGTM. +1 for merging.

```diff
@@ -632,41 +642,43 @@ where
     drop(connection);

     let drain_start = std::time::Instant::now();
-    trace!("Draining connection");
+    debug!("Draining connection");
```
Contributor:

Should this be part of the if branch?

```rust
    PeerMetadataVersion::from(header),
if needs_drain {
    // Draining of incoming queue
    while let Some(Some(Ok(msg))) = incoming.next().now_or_never() {
```
Contributor:

The ConnectionAwareRpcRouter on the sending end of this stream keeps it open because it can hold the sender for this stream in a future it created, and that future is being awaited on the other end of the connection. Did I understand it correctly? Must have been quite hard to track this down. Respect.

@AhmedSoliman (Author):

Correct
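
To make the failure mode concrete, here is a minimal, hypothetical reproduction built on tokio and tokio-stream rather than the actual restate types: an owned sender captured by a never-completing future keeps the channel open, so awaiting `next()` during drain would hang forever, while a single `now_or_never()` poll returns immediately.

```rust
use futures::{FutureExt, StreamExt};
use tokio_stream::wrappers::ReceiverStream;

#[tokio::main]
async fn main() {
    let (tx, rx) = tokio::sync::mpsc::channel::<&str>(4);
    let mut incoming = ReceiverStream::new(rx);

    // Stand-in for the rpc router: an owned sender captured in a
    // future that never completes, so the channel never closes.
    tokio::spawn(async move {
        let _held = tx; // keeps the stream open for this task's lifetime
        std::future::pending::<()>().await;
    });

    // `incoming.next().await` would hang here: the stream is empty but
    // not closed. A single non-blocking poll returns None immediately.
    assert!(incoming.next().now_or_never().is_none());
    println!("drain finished without blocking");
}
```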
