[Core] Avoid reconnection if the node is Gone #2619
Conversation
Test Results 7 files ±0 7 suites ±0 3m 34s ⏱️ -54s Results for commit b446ccc. ± Comparison against base commit d641697. This pull request removes 2 tests.
♻️ This comment has been updated with latest results.
Force-pushed from 255c962 to ebc159e
Thanks for preventing connection attempts to gone nodes @AhmedSoliman. Modulo one comment about a condition, the changes look good to me.
Force-pushed from 770b6cb to 8fe1dcd
LGTM. Left a minor comment regarding the filter condition on `get_or_connect`. Apart from this, +1 for merging :-)
```rust
.lock()
.observed_generations
.get(&node_id.as_plain())
.map(|status| node_id.generation() <= status.generation && status.gone)
```
Could we make this filter even a bit more selective with `node_id.generation() < status.generation || node_id.generation() == status.generation && status.gone`?
Alrighty.
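To illustrate the difference between the two predicates discussed above, here is a minimal sketch using a simplified, hypothetical `NodeStatus` type (not the actual codebase types). The suggested condition additionally treats any generation older than the observed one as gone, even when the `gone` flag has not been set for it:

```rust
// Hypothetical simplified status type, modeling only the fields used in the
// filter condition under discussion.
struct NodeStatus {
    generation: u32,
    gone: bool,
}

// Original filter: requires the `gone` flag for every generation <= observed.
fn is_gone_original(node_gen: u32, status: &NodeStatus) -> bool {
    node_gen <= status.generation && status.gone
}

// Suggested filter: an older generation is always considered superseded,
// regardless of the flag; the current generation is gone only if marked.
fn is_gone_selective(node_gen: u32, status: &NodeStatus) -> bool {
    node_gen < status.generation
        || (node_gen == status.generation && status.gone)
}

fn main() {
    // An older generation that was never explicitly marked gone: only the
    // suggested filter rejects a connection attempt to it.
    let status = NodeStatus { generation: 5, gone: false };
    assert!(!is_gone_original(4, &status));
    assert!(is_gone_selective(4, &status));
}
```

In other words, the suggested condition also prevents reconnects to superseded generations whose `gone` flag was never observed, which is the extra selectivity the review comment asks for.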
Force-pushed from 7f86d7e to e005bdf
Fixes:
- On graceful shutdown, we had a long-standing bug where draining of some connections could get stuck because the connection-aware RPC router holds owned senders in its futures. This was addressed by not blocking on the receive stream on drain; we only process the messages we have already received after sending the shutdown signal. Note that connections terminated by the peer also skip the drain, since we don't want to process further messages from them.
- On graceful shutdown, we had a bug where peers would ignore the control frame carrying the shutting-down signal, since those messages have no header. This is now fixed; it will matter for a future PR that marks this node generation as `Gone` to avoid reconnects.
- On system shutdown, we first stop the cluster controller to:
  - make sure it doesn't react to our own partial/complete loss of connectivity during shutdown;
  - avoid any competition with other controllers that might become leader during shutdown of this node.
- We now drain connections first and stop socket handlers gracefully before continuing the shutdown, to give the shutdown control frame the best chance of being sent to peers. This should make other controllers and parts of the system realise that this node is `Gone` as early as possible, improving failover time (MTTR).
- Minor logging changes.
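The drain fix above can be sketched with a standard-library channel. This is a minimal illustration of the idea, not the actual Restate code: after the shutdown signal is sent, the drain uses a non-blocking receive to process only messages that are already buffered, so a slow or stuck peer can no longer wedge the shutdown:

```rust
use std::sync::mpsc;

// Illustrative drain: consume only the messages that have already arrived.
// `try_recv` never blocks; it returns Err(Empty) once the buffer is exhausted,
// unlike a blocking `recv` loop that would wait for the sender indefinitely.
fn drain_buffered<T>(rx: &mpsc::Receiver<T>) -> Vec<T> {
    let mut drained = Vec::new();
    while let Ok(msg) = rx.try_recv() {
        drained.push(msg);
    }
    drained
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send("a").unwrap();
    tx.send("b").unwrap();
    // The sender is still alive, so a blocking drain would never return here.
    let msgs = drain_buffered(&rx);
    assert_eq!(msgs, vec!["a", "b"]);
}
```

The design choice is that shutdown latency is bounded by work already accepted, not by how long peers keep the stream open.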
- RepairTail bugs fixed; restatectl's digest command now tolerates a failed node.
- RepairTail logging improved to explain what happened.
- Inner retries removed; most retries now live in the outer layers (TBD whether more inner retries need to be removed). Note that this causes some of the outer operations to fail more often than before. This will be evaluated as we test, and fixed at the higher level as needed.
- Minor logging fixes.
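The retry restructuring above can be sketched as follows. This is a generic illustration of the layering, with invented names (`inner_op`, `outer_with_retries`), not code from the repository: the inner operation fails fast on transient errors, and the retry policy lives only in the outer layer:

```rust
// Inner operation with no retries of its own: a transient failure surfaces
// immediately to the caller. (Names here are illustrative only.)
fn inner_op(attempt: u32) -> Result<&'static str, &'static str> {
    if attempt < 2 { Err("transient") } else { Ok("done") }
}

// Outer layer owns the retry policy, so failure handling is decided in one
// place instead of being hidden inside each inner operation.
fn outer_with_retries(max_attempts: u32) -> Result<&'static str, &'static str> {
    let mut last_err = "unattempted";
    for attempt in 0..max_attempts {
        match inner_op(attempt) {
            Ok(v) => return Ok(v),
            Err(e) => last_err = e, // outer layer decides whether to retry
        }
    }
    Err(last_err)
}

fn main() {
    assert_eq!(outer_with_retries(3), Ok("done"));
    // With too few attempts the operation fails visibly, which matches the
    // note above that some outer operations now fail more often than before.
    assert_eq!(outer_with_retries(1), Err("transient"));
}
```

The trade-off, as the description notes, is that outer operations surface failures more often until higher-level handling catches up.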
Stack created with Sapling. Best reviewed with ReviewStack.