-
Notifications
You must be signed in to change notification settings - Fork 701
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
For MEETs, save the extensions support flag immediately during MEET processing #778
Conversation
… MEET processing For backwards compatibility reasons, a node will wait until it receives a cluster message with the open source extensions flag before sending its own open source extensions. This leads to a delay in shard ID propogation that can corrupt nodes.conf with inaccurate shard IDs if a node is restarted before this can stabilize. This fixes most of that delay by immediately triggering the extensions-supported flag during the MEET processing, allowing the PONG reply to contain OSS extensions and to be marked as compatible by the sender. Signed-off-by: Ben Totten <btotten@amazon.com>
(force pushed with DCO) |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## unstable #778 +/- ##
============================================
+ Coverage 70.23% 70.33% +0.09%
============================================
Files 112 112
Lines 60602 61278 +676
============================================
+ Hits 42566 43099 +533
- Misses 18036 18179 +143
|
Signed-off-by: Ben Totten <btotten@amazon.com>
Signed-off-by: Ben Totten <btotten@amazon.com>
Signed-off-by: Ben Totten <btotten@amazon.com>
Signed-off-by: Ben Totten <btotten@amazon.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @bentotten. LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Signed-off-by: Ben Totten <btotten@amazon.com>
Signed-off-by: Ben Totten <btotten@amazon.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, if I am understanding this correctly, adding the node name here did seem to cause a problem for the next packet exchanges - it looks like adding the name here during the MEET causes sender
to be valid during the next packet received, which hits this logic and frees the node/link before the handshake flag can be removed
if (sender) {
serverLog(LL_VERBOSE,
"Handshake: we already know node %.40s (%s), "
"updating the address if needed.", sender->name, sender->human_nodename);
if (nodeUpdateAddressIfNeeded(sender,link,hdr))
{
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
CLUSTER_TODO_UPDATE_STATE);
}
/* Free this node as we already have it. This will
* cause the link to be freed as well. */
clusterDelNode(link->node);
return 0;
Follow up question:
Do we want to finish the handshake during MEET processing (maybe after sending the PONG) or leave the name as-is and keep the handshake completion for the ping/pong after the MEET?
Sorry for dropping the ball on this one, @bentotten. Let me find some time next and get back to this thread. |
any update on this? |
Thanks for your patience, @bentotten! I think this change makes sense overall. I think we should also consider updating the extension_capable flag right after completing the handshake as below: Line 3194 in 089048d
otherwise, I think we will have to wait for the next Line 3070 in 089048d
Yeah, you are right. Setting the name during Along the same line, I'm now wondering whether we should connect the |
Based on some offline discussion, we're going to merge this to see if it helps with another issue and I will create an issue with Ping's comment to followup with. |
…rocessing (valkey-io#778) For backwards compatibility reasons, a node will wait until it receives a cluster message with the extensions flag before sending its own extensions. This leads to a delay in shard ID propagation that can corrupt nodes.conf with inaccurate shard IDs if a node is restarted before this can stabilize. This fixes much of that delay by immediately triggering the extensions-supported flag during the MEET processing and attaching the node to the link, allowing the PONG reply to contain OSS extensions. Partially fixes valkey-io#774 --------- Signed-off-by: Ben Totten <btotten@amazon.com> Co-authored-by: Ben Totten <btotten@amazon.com> Signed-off-by: Ping Xie <pingxie@google.com>
…rocessing (valkey-io#778) For backwards compatibility reasons, a node will wait until it receives a cluster message with the extensions flag before sending its own extensions. This leads to a delay in shard ID propagation that can corrupt nodes.conf with inaccurate shard IDs if a node is restarted before this can stabilize. This fixes much of that delay by immediately triggering the extensions-supported flag during the MEET processing and attaching the node to the link, allowing the PONG reply to contain OSS extensions. Partially fixes valkey-io#774 --------- Signed-off-by: Ben Totten <btotten@amazon.com> Co-authored-by: Ben Totten <btotten@amazon.com> Signed-off-by: Ping Xie <pingxie@google.com>
…rocessing (valkey-io#778) For backwards compatibility reasons, a node will wait until it receives a cluster message with the extensions flag before sending its own extensions. This leads to a delay in shard ID propagation that can corrupt nodes.conf with inaccurate shard IDs if a node is restarted before this can stabilize. This fixes much of that delay by immediately triggering the extensions-supported flag during the MEET processing and attaching the node to the link, allowing the PONG reply to contain OSS extensions. Partially fixes valkey-io#774 --------- Signed-off-by: Ben Totten <btotten@amazon.com> Co-authored-by: Ben Totten <btotten@amazon.com> Signed-off-by: Ping Xie <pingxie@google.com>
…rocessing (valkey-io#778) For backwards compatibility reasons, a node will wait until it receives a cluster message with the extensions flag before sending its own extensions. This leads to a delay in shard ID propagation that can corrupt nodes.conf with inaccurate shard IDs if a node is restarted before this can stabilize. This fixes much of that delay by immediately triggering the extensions-supported flag during the MEET processing and attaching the node to the link, allowing the PONG reply to contain OSS extensions. Partially fixes valkey-io#774 --------- Signed-off-by: Ben Totten <btotten@amazon.com> Co-authored-by: Ben Totten <btotten@amazon.com>
…rocessing (valkey-io#778) For backwards compatibility reasons, a node will wait until it receives a cluster message with the extensions flag before sending its own extensions. This leads to a delay in shard ID propagation that can corrupt nodes.conf with inaccurate shard IDs if a node is restarted before this can stabilize. This fixes much of that delay by immediately triggering the extensions-supported flag during the MEET processing and attaching the node to the link, allowing the PONG reply to contain OSS extensions. Partially fixes valkey-io#774 --------- Signed-off-by: Ben Totten <btotten@amazon.com> Co-authored-by: Ben Totten <btotten@amazon.com> Signed-off-by: Ping Xie <pingxie@google.com>
…rocessing (valkey-io#778) For backwards compatibility reasons, a node will wait until it receives a cluster message with the extensions flag before sending its own extensions. This leads to a delay in shard ID propagation that can corrupt nodes.conf with inaccurate shard IDs if a node is restarted before this can stabilize. This fixes much of that delay by immediately triggering the extensions-supported flag during the MEET processing and attaching the node to the link, allowing the PONG reply to contain OSS extensions. Partially fixes valkey-io#774 --------- Signed-off-by: Ben Totten <btotten@amazon.com> Co-authored-by: Ben Totten <btotten@amazon.com> Signed-off-by: Ping Xie <pingxie@google.com>
…rocessing (#778) For backwards compatibility reasons, a node will wait until it receives a cluster message with the extensions flag before sending its own extensions. This leads to a delay in shard ID propagation that can corrupt nodes.conf with inaccurate shard IDs if a node is restarted before this can stabilize. This fixes much of that delay by immediately triggering the extensions-supported flag during the MEET processing and attaching the node to the link, allowing the PONG reply to contain OSS extensions. Partially fixes #774 --------- Signed-off-by: Ben Totten <btotten@amazon.com> Co-authored-by: Ben Totten <btotten@amazon.com> Signed-off-by: Ping Xie <pingxie@google.com>
…rocessing (valkey-io#778) For backwards compatibility reasons, a node will wait until it receives a cluster message with the extensions flag before sending its own extensions. This leads to a delay in shard ID propagation that can corrupt nodes.conf with inaccurate shard IDs if a node is restarted before this can stabilize. This fixes much of that delay by immediately triggering the extensions-supported flag during the MEET processing and attaching the node to the link, allowing the PONG reply to contain OSS extensions. Partially fixes valkey-io#774 --------- Signed-off-by: Ben Totten <btotten@amazon.com> Co-authored-by: Ben Totten <btotten@amazon.com> Signed-off-by: naglera <anagler123@gmail.com>
After valkey-io#1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink: ``` === ASSERTION FAILED === ==> '!link->node' is not true ``` In valkey-io#778, we will call setClusterNodeToInboundClusterLink to attach the node to the link during the MEET processing, so if we receive a another MEET packet in a short time, the node is still in handshake state, we will meet this assert and crash the server. If the link is bound to a node and the node is in the handshake state, and we receive a MEET packet, it may be that the sender sent multiple MEET packets when reconnecting, and in here we are dropping the MEET. Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name and not truly "known", so we don't know the sender. Dropping the MEET packet can prevent us from creating a random node, avoid incorrect link binding, and avoid duplicate MEET packet eliminate the handshake state. Signed-off-by: Binbin <binloveplay1314@qq.com>
After #1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink: ``` === ASSERTION FAILED === ==> '!link->node' is not true ``` In #778, we will call setClusterNodeToInboundClusterLink to attach the node to the link during the MEET processing, so if we receive a another MEET packet in a short time, the node is still in handshake state, we will meet this assert and crash the server. If the link is bound to a node and the node is in the handshake state, and we receive a MEET packet, it may be that the sender sent multiple MEET packets so in here we are dropping the MEET to avoid the assert in setClusterNodeToInboundClusterLink. The assert will happen if the other sends a MEET packet because it detects that there is no inbound link, this node creates a new node in HANDSHAKE state (with a random node name), and respond with a PONG. The other node receives the PONG and removes the CLUSTER_NODE_MEET flag. This node is supposed to open an outbound connection to the other node in the next cron cycle, but before this happens, the other node re-sends a MEET on the same link because it still detects no inbound connection. Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name and not truly "known", so we don't know the sender. Dropping the MEET packet can prevent us from creating a random node, avoid incorrect link binding, and avoid duplicate MEET packet eliminate the handshake state. Signed-off-by: Binbin <binloveplay1314@qq.com>
For backwards compatibility reasons, a node will wait until it receives a cluster message with the extensions flag before sending its own extensions. This leads to a delay in shard ID propagation that can corrupt nodes.conf with inaccurate shard IDs if a node is restarted before this can stabilize.
This fixes much of that delay by immediately triggering the extensions-supported flag during the MEET processing and attaching the node to the link, allowing the PONG reply to contain OSS extensions.
Partially fixes #774