Send MEET packet to node if there is no inbound link to fix inconsistency when handshake timedout #1307

pieturin · 2024-11-14T20:03:46Z

In some cases, when meeting a new node, if the handshake times out, we can end up with an inconsistent view of the cluster where the new node knows about all the nodes in the cluster, but the cluster does not know about this new node (or vice versa).
To detect this inconsistency, we now check whether a node has an outbound link but no inbound link. If so, the node probably does not know us, so we (re-)send a MEET packet to it to start a new handshake.
If we receive a MEET packet from a known node, we disconnect the outbound link to force a reconnect and the sending of a PING packet, so that the other node recognizes the link as belonging to us. This prevents cases where a node could send MEET packets in a loop because it thinks the other node does not have an inbound link.

This fixes the bug described in #1251.
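The detection rule described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: `sketch_link`, `sketch_node`, and `sketch_should_resend_meet` are hypothetical names, not the structures or functions in src/cluster_legacy.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the cluster bus structures.
 * They are NOT the definitions used in src/cluster_legacy.c. */
typedef struct sketch_link {
    bool connected;
} sketch_link;

typedef struct sketch_node {
    sketch_link *outbound; /* link we opened to the peer */
    sketch_link *inbound;  /* link the peer opened to us, once attached to this node */
    bool in_handshake;     /* a handshake with this peer is still in progress */
} sketch_node;

/* Returns true when the peer looks like it does not know us:
 * we can reach it (connected outbound link) but it has no inbound link to us. */
bool sketch_should_resend_meet(const sketch_node *n) {
    assert(n != NULL);
    if (n->in_handshake) return false; /* let the running handshake finish */
    if (n->outbound == NULL || !n->outbound->connected) return false;
    return n->inbound == NULL;
}
```

In this sketch a caller would run the check periodically for each known node and send a MEET instead of a regular PING when it returns true.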

codecov · 2024-11-14T20:32:26Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.90%. Comparing base (fd58f8d) to head (8086a07).
Report is 35 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #1307      +/-   ##
============================================
+ Coverage     70.62%   70.90%   +0.28%     
============================================
  Files           117      119       +2     
  Lines         63324    64631    +1307     
============================================
+ Hits          44722    45828    +1106     
- Misses        18602    18803     +201

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`86.83% <100.00%> (+0.14%)`	⬆️

... and 63 files with indirect coverage changes

hpatro

What if we disconnect the outbound link if the inbound link is not available? I think it will lead to the same reconnection flow. Would it help to have simpler code and one unified flow? I'm not sure if it will perform the MEET operation, though.

pieturin · 2024-11-14T22:23:27Z

What if we disconnect the outbound link if inbound link is not available?

In this case we would just re-open an outbound connection, which the other node will accept, but it won't force the other node to recognize us as being part of the cluster if it doesn't trust us yet. The only way to force the other node to add us to its cluster view is for us to send a MEET packet.

hpatro · 2024-11-14T23:19:16Z

What if we disconnect the outbound link if inbound link is not available?

In this case we would just re-open an outbound connection, which the other node will accept, but it won't force the other node to recognize us as being part of the cluster if it doesn't trust us yet. The only way to force the other node to add us to its cluster view is for us to send a MEET packet.

CLUSTER MEET is an admin operation, but I guess we are fine with re-initiating it if the operation wasn't successful in the first place.

enjoy-binbin

Can we do something like #461? Only clear the CLUSTER_NODE_MEET flag when myself receives an "ack" (not the plain PONG, but a strong ack confirming that the sender has already met myself). I haven't thought about it carefully, but I feel it would be more reliable.

madolson · 2024-11-22T01:26:22Z

Only clear the CLUSTER_NODE_MEET flag when myself receives an "ack" (not the plain PONG, but a strong ack confirming that the sender has already met myself).

Do you mean by adding a new flag? I think the concern is that we could still end up in the inverse state, where the node that received the "strong" ack puts the other node online but then might go offline.

My original thought was that as long as one node believes the other is part of the cluster, it should try to have the other node join. It's sort of like an "enhanced" version of how we build up the mesh when two disjoint clusters meet each other.

pieturin · 2024-11-23T03:24:09Z

Can we do something like #461? Only clear the CLUSTER_NODE_MEET flag when myself receives an "ack" (not the plain PONG, but a strong ack confirming that the sender has already met myself). I haven't thought about it carefully, but I feel it would be more reliable.

We could do a 3-way handshake to strengthen the handshake reliability, instead of the current -> MEET/PONG, <- PING/PONG. We could add an extra back and forth between the two nodes before clearing the flags. But we can always end up in a situation where one node thinks the handshake is done and the other node times out the handshake because the last packet got delayed or dropped.

With this solution, if the handshake has succeeded on one side and not the other, we ensure both sides will eventually know each other.

pieturin · 2024-11-23T03:27:39Z

With this fix, we can sometimes get into a (potentially) infinite loop where a node keeps sending MEET packets to the other node even though both nodes know each other. This sometimes (although rarely) happens in the second test, "Handshake eventually succeeds after node handshake timeout on one side with inconsistent view of the cluster".

The following sequence of events can trigger this issue:

1. Nodes 1 and 2 know each other, but don't know node 0.
2. We make node 0 meet node 1. The handshake times out on node 1's side but succeeds on node 0's side.
3. When node 1 marks the handshake as timed out, it closes both connections with node 0.
4. From node 0's perspective, both connections to node 1 are closed. But since it knows node 1, it re-opens an outbound connection to it.
5. Node 1 accepts the inbound connection from node 0, but it doesn't know this node, so it doesn't register this connection as belonging to any known node (i.e. link->node stays NULL).
6. With the change from this PR, node 0 detects that node 2 doesn't have any inbound connection to it, so node 0 sends a MEET packet to node 2. (For this bug to happen, node 2 must be met first.)
7. The handshake between nodes 0 and 2 succeeds.
8. Now node 2 gossips about node 0 to node 1, so node 1 adds node 0 to its list of known nodes.
9. Node 1 opens a connection to node 0. At this point node 0 has both inbound and outbound connections to node 1. But from node 1's perspective, it only has an outbound connection to node 0. The inbound connection is not attached to any node (it still has link->node set to NULL).
10. So node 1 sends a MEET packet to node 0, since it doesn't think it has an inbound connection from it.
11. The handshake completes successfully since node 0 responds to the MEET packet, but there is still no inbound connection.
12. So node 1 keeps sending MEET packets to node 0 until node 0 sends a PING packet to node 1. When node 1 receives a PING packet from node 0, it sets the node's inbound connection.

Node 0 should eventually send a PING packet to node 1, but there is no guarantee as to when that can happen. When I reproduce the issue, node 0 never gets a chance to send a PING to node 1 because node 0 overrides node->pong_received for node 1 when node 2 gossips about node 1 with a higher pong_received value. And node 2 always has a lower pong_received value compared to node 1 when trying to select a node to send a PING to here:

valkey/src/cluster_legacy.c

Line 5045 in 33f42d7

if (min_pong_node == NULL || min_pong > this->pong_received) {
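To illustrate why node 1 may never be picked, here is a minimal sketch of the "oldest PONG wins" selection built around the condition quoted above. The type `sketch_peer` and the function `sketch_pick_ping_target` are hypothetical, and the real code samples random nodes and applies further filters that are omitted here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of a candidate node. */
typedef struct sketch_peer {
    const char *name;
    long long pong_received; /* ms timestamp of the last PONG we know about */
} sketch_peer;

/* Return the candidate with the oldest pong_received, or NULL if there is none. */
const sketch_peer *sketch_pick_ping_target(const sketch_peer *cands, size_t n) {
    assert(cands != NULL || n == 0);
    const sketch_peer *min_pong_node = NULL;
    long long min_pong = 0;
    for (size_t i = 0; i < n; i++) {
        const sketch_peer *cur = &cands[i];
        /* Same shape as the quoted condition: keep the smallest timestamp seen so far. */
        if (min_pong_node == NULL || min_pong > cur->pong_received) {
            min_pong_node = cur;
            min_pong = cur->pong_received;
        }
    }
    return min_pong_node;
}
```

If gossip keeps refreshing one node's `pong_received` so that another candidate always has the smaller value, the refreshed node is never returned by this selection.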

I think there are various ways to mitigate this issue:

1. Do nothing; the MEET packet loop should eventually stop (but I have to update the test so that it's not flaky).
2. Change the pinging decision logic to force a PING to every node at least every X amount of time, even if we know that another node was able to ping it recently. X can be set to something like 2 * (server.cluster_ping_interval ? server.cluster_ping_interval : server.cluster_node_timeout / 2). This will increase cluster bus traffic on large clusters.
3. If I receive a MEET packet and I already have an outbound link for that node, then I should free my existing outbound link to it and re-open a new one.

I think option 3 should work best without making too many changes to the current logic. But I'm open to suggestions.
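As a rough sketch of the second and third mitigations: the first function evaluates the X formula with plain arguments instead of the `server` fields, and the second models "free the outbound link on MEET" on a hypothetical one-field node record. Neither function name exists in the code base, and the follow-up reconnect and PING are assumed to happen elsewhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Forced-ping period X from the formula above.
 * Both arguments are assumed to be in milliseconds. */
long long sketch_forced_ping_period(long long ping_interval, long long node_timeout) {
    assert(ping_interval >= 0 && node_timeout >= 0);
    return 2 * (ping_interval ? ping_interval : node_timeout / 2);
}

/* Hypothetical simplified record for a node we already know. */
typedef struct sketch_known_node {
    bool has_outbound_link;
} sketch_known_node;

/* Called when a MEET arrives; `sender` is NULL when the sender is unknown.
 * Returns true if the existing outbound link was dropped, so that a later
 * reconnect can send a fresh PING over a new link. */
bool sketch_on_meet_from(sketch_known_node *sender) {
    if (sender == NULL || !sender->has_outbound_link) return false;
    sender->has_outbound_link = false;
    return true;
}
```

With a ping interval of 0 and a node timeout of 15000 ms, the formula gives a forced-ping period of 15000 ms.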

pieturin · 2024-11-27T23:59:37Z

tests/unit/expire.tcl is currently failing in unstable too. I'll rebase once there is a fix.

enjoy-binbin · 2024-11-28T06:11:29Z

#1368 the expire test is fixed

Update test to check node IDs instead of relying on number of words. Rename nodeIsMeeting() to nodeInMeetState(). Introduce nodeInNormalState() macro. Signed-off-by: Pierre Turin <pieturin@amazon.com>

If we receive a MEET packet from a known node, we disconnect the outbound link to force a reconnect and sending of a PING packet so that the other node recognizes the link as belonging to us. Also deflaked one of the tests. And improved testing code following PR comment. Signed-off-by: Pierre Turin <pieturin@amazon.com>

Signed-off-by: Pierre Turin <pieturin@amazon.com>

Sometimes the outbound link from node 0 to node 1 can be disconnected. Assert that node 0 know node 1 without expecting the node to be marked as connected. Signed-off-by: Pierre Turin <pieturin@amazon.com>

hpatro · 2024-12-03T17:51:59Z

@pieturin Could you update the top comment as well about the exact behavior change?

Signed-off-by: Pierre Turin <pieturin@amazon.com>

hpatro

LGTM. But I would like someone else as well to look at it before merging this in.

pieturin · 2024-12-03T21:48:53Z

Updated the PR description.

pieturin · 2024-12-03T21:53:02Z

@PingXie, @madolson, @enjoy-binbin, any comments on this PR?

enjoy-binbin

Top comment and code LGTM, did not review the tests.

madolson

I suppose the comment isn't really a blocker for me, it's just about documentation.

Signed-off-by: Pierre Turin <pieturin@amazon.com>

PingXie

Forgot to send out my (partial) review :)

Also I don't think I fully understand the problem that we are trying to solve and why the fix works.

PingXie · 2024-12-12T06:26:26Z

Also I don't think I fully understand the problem that we are trying to solve

Do I understand correctly that we are trying to solve two problems here?

1. Inconsistent cluster topology within the existing nodes.

More specifically, a new node that hasn't gone through the full handshake process can still get gossiped around the cluster and then picked up by existing nodes that are not involved in the handshake process?

existing nodes in a cluster: a, b, c
new node: A
a is instructed to meet A.
the handshake process never completes, but A's PONG response to a's MEET should make it to a
a then gossips A's id to b and c, which then add A to their cluster view - because the server doesn't check A's MEET state (CLUSTER_STATE_MEET) today.
now a thinks A is in the MEET state, but b and c think A is normal and A sees no other node.

2. Cluster meet reliability.

If A fails to reply PONG to a's MEET, both a and A should eventually remove each other from their respective cluster views, and A will never make it to b and c. In this sense, the existing cluster, made up of a, b, and c, is still consistent.

and why the fix works.

The fix for the first problem makes sense to me: resending MEET after the handshake timeout.

I am not sure how/if the second problem is addressed by this PR? if both a and A remove each other from their cluster view respectively, how can the handshake process be restarted?

pieturin · 2024-12-12T19:22:15Z

I am not sure how/if the second problem is addressed by this PR? if both a and A remove each other from their cluster view respectively, how can the handshake process be restarted?

If the handshake times out on both sides of the meet, then this fix won't do anything. This fix is only meant to make sure we never end up with an inconsistent view of the cluster (where one side knows the other nodes, but not the other side). This makes the meet handshake more reliable but not foolproof.
We could revisit the proposal to make MEET synchronous or sticky (see: redis/redis#11095), which would solve the second problem.

The inconsistent view of the cluster can happen in two different ways (that I could see):

1. The handshake succeeds on one node but times out on the other node. This is tested by the test "Handshake eventually succeeds after node handshake timeout on one side with inconsistent view of the cluster".

            # Node 0 -- MEET -> Node 1
            # Node 0 <- PONG -- Node 1
            # Node 0 <- PING -- Node 1 [Node 0 will mark the handshake as successful]
            # Node 0 -- PONG -> Node 1 [we drop this message, so node 1 will eventually mark the handshake as timed out]

2. The handshake times out on both sides, but the new node learns about other nodes of the existing cluster through the gossip section in the MEET packet. In this case, even if the handshake times out, the new node will know some of the nodes from the existing cluster, but they will not know this new node. This is tested by the test "Handshake eventually succeeds after node handshake timeout on both sides with inconsistent view of the cluster".

            # Node 1 -- MEET -> Node 0 [Node 0 might learn about Node 2 from the gossip section of the msg]
            # Node 1 <- PONG -- Node 0 [we drop this message, so Node 1 will eventually mark the handshake as timed out]
            # Node 1 <- PING -- Node 0 [we drop this message, so Node 1 will never send a PONG and Node 0 will eventually mark the handshake as timed out]

a then gossips A's id to b and c, which then add A to their cluster view - because the server doesn't check A's MEET state (CLUSTER_STATE_MEET) today.

We don't gossip a node if it's in HANDSHAKE state. See:

valkey/src/cluster_legacy.c

Lines 4039 to 4048 in 5f7fe9e

    
            /* In the gossip section don't include:
             * 1) Nodes in HANDSHAKE state.
             * 3) Nodes with the NOADDR flag set.
             * 4) Disconnected nodes if they don't have configured slots.
             */
            if (this->flags & (CLUSTER_NODE_HANDSHAKE | CLUSTER_NODE_NOADDR) ||
                (this->link == NULL && this->numslots == 0)) {
                freshnodes--; /* Technically not correct, but saves CPU. */
                continue;
            }

I'll prepare a follow-up PR to address your comments @PingXie, thanks for the review!

After #1307 got merged, we noticed an assertion failure in setClusterNodeToInboundClusterLink:

```
=== ASSERTION FAILED ===
==> '!link->node' is not true
```

In #778, we call setClusterNodeToInboundClusterLink to attach the node to the link during MEET processing, so if we receive another MEET packet shortly afterwards, while the node is still in handshake state, we hit this assert and crash the server.

If the link is bound to a node, the node is in the handshake state, and we receive a MEET packet, it may be that the sender sent multiple MEET packets, so here we drop the MEET to avoid the assert in setClusterNodeToInboundClusterLink. The assert happens if the other node sends a MEET packet because it detects that there is no inbound link; this node creates a new node in HANDSHAKE state (with a random node name) and responds with a PONG. The other node receives the PONG and removes the CLUSTER_NODE_MEET flag. This node is supposed to open an outbound connection to the other node in the next cron cycle, but before this happens, the other node re-sends a MEET on the same link because it still detects no inbound connection.

Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name and is not truly "known", so we don't know the sender. Dropping the MEET packet prevents us from creating a random node, avoids incorrect link binding, and avoids a duplicate MEET packet eliminating the handshake state.

Signed-off-by: Binbin <binloveplay1314@qq.com>
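The guard described in this commit message could look roughly like the sketch below. The structures and `sketch_should_drop_meet` are hypothetical simplifications for illustration only; the actual change in #1436 may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real cluster bus types. */
typedef struct sketch_peer_node {
    bool in_handshake;
} sketch_peer_node;

typedef struct sketch_inbound_link {
    sketch_peer_node *node; /* NULL until the link is attached to a node */
} sketch_inbound_link;

/* A MEET that arrives on a link already bound to a node that is still in
 * handshake state is treated as a duplicate and dropped, so the link is
 * never bound a second time. */
bool sketch_should_drop_meet(const sketch_inbound_link *link) {
    assert(link != NULL);
    return link->node != NULL && link->node->in_handshake;
}
```

A MEET on an unbound link, or on a link whose node has already left the handshake state, is not dropped by this check.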

PingXie · 2024-12-16T07:34:33Z

If the handshake times out on both side of the meet, then, this fix won't do anything.

Makes sense.

We don't gossip a node if it's in HANDSHAKE state.

Right but in this case A is in the MEET state from a's perspective so it gets picked/gossiped.

I'll prepare a follow-up PR to address your comments

Thanks. Will TAL.

pieturin · 2024-12-16T22:30:08Z

Right but in this case A is in the MEET state from a's perspective so it gets picked/gossiped.

Ah yes, you're right, this is what should happen:

            # Node a -- MEET -> Node A [Node A has flags HANDSHAKE|MEET]
            # Node a <- PONG -- Node A [After receiving this packet, Node A has flag MEET]
            # Node a <- PING -- Node A [After receiving this packet, we clear the MEET flag for Node A]
            # Node a -- PONG -> Node A

now a thinks A is in the MEET state but b and c think A is normal and A sees no other node.

It is correct that a thinks A is in MEET state, and that b and c can get gossip information about it. But A might know about b and c from the gossip section of the MEET packet from a.
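The flag transitions in the diagram above, as seen from node a, can be modelled with a small sketch. The flag names, the `sketch_on_packet` function, and the choice to ignore a PING that arrives while HANDSHAKE is still set are assumptions made for illustration; `sketch_is_gossiped` mirrors only the HANDSHAKE part of the gossip filter quoted earlier.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for node a's view of node A. */
enum { SKETCH_HANDSHAKE = 1 << 0, SKETCH_MEET = 1 << 1 };

typedef enum { SKETCH_RECV_PONG, SKETCH_RECV_PING } sketch_packet;

/* Flags on A right after a sends the MEET packet. */
unsigned sketch_flags_after_sending_meet(void) {
    return (unsigned)(SKETCH_HANDSHAKE | SKETCH_MEET);
}

/* Apply one received packet to the flags. */
unsigned sketch_on_packet(unsigned flags, sketch_packet p) {
    assert(p == SKETCH_RECV_PONG || p == SKETCH_RECV_PING);
    if (p == SKETCH_RECV_PONG) return flags & ~(unsigned)SKETCH_HANDSHAKE;
    /* Assumption: a PING only clears MEET once HANDSHAKE is already gone. */
    if (!(flags & SKETCH_HANDSHAKE)) return flags & ~(unsigned)SKETCH_MEET;
    return flags;
}

/* Nodes in HANDSHAKE state are left out of the gossip section. */
bool sketch_is_gossiped(unsigned flags) {
    return !(flags & SKETCH_HANDSHAKE);
}
```

In this model, once the PONG clears HANDSHAKE, A carries only the MEET flag and is already eligible for gossip, which is the window discussed above.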

pieturin force-pushed the cluster-handshake-fix branch from 1920952 to d65423e Compare November 14, 2024 20:16

hpatro reviewed Nov 14, 2024

hpatro reviewed Nov 15, 2024

zuiderkwast requested a review from enjoy-binbin November 18, 2024 06:54

enjoy-binbin reviewed Nov 20, 2024

pieturin force-pushed the cluster-handshake-fix branch from 68324ec to f6eae88 Compare November 27, 2024 22:53

pieturin added 5 commits November 28, 2024 19:41

Fix typo in comment

ce34684

Signed-off-by: Pierre Turin <pieturin@amazon.com>

Addressed PR comments

4ee7546

Update test to check node IDs instead of relying on number of words. Rename nodeIsMeeting() to nodeInMeetState(). Introduce nodeInNormalState() macro. Signed-off-by: Pierre Turin <pieturin@amazon.com>

Fix formatting on comment

7ae3b05

Signed-off-by: Pierre Turin <pieturin@amazon.com>

pieturin force-pushed the cluster-handshake-fix branch from 2e8c6f1 to 7ae3b05 Compare November 28, 2024 19:48

enjoy-binbin requested a review from PingXie November 30, 2024 09:02

pieturin added 2 commits December 2, 2024 20:08

Free outbound link on MEET packet only after handshake timeout

85904dd

Signed-off-by: Pierre Turin <pieturin@amazon.com>

Deflake cluster-multiple-meets test

6a017b0

Sometimes the outbound link from node 0 to node 1 can be disconnected. Assert that node 0 know node 1 without expecting the node to be marked as connected. Signed-off-by: Pierre Turin <pieturin@amazon.com>

pieturin force-pushed the cluster-handshake-fix branch from 3b5aa41 to 6a017b0 Compare December 2, 2024 20:08

hpatro reviewed Dec 3, 2024

Update comments

cacd76c

Signed-off-by: Pierre Turin <pieturin@amazon.com>

hpatro approved these changes Dec 3, 2024

enjoy-binbin approved these changes Dec 5, 2024

enjoy-binbin changed the title from "[cluster-bus] Send a MEET packet to a node if there is no inbound link" to "Send MEET packet to node if there is no inbound link to fix inconsistency when handshake timedout" Dec 5, 2024

madolson reviewed Dec 9, 2024

madolson approved these changes Dec 10, 2024

Update debug log format when sending ping

8086a07

Signed-off-by: Pierre Turin <pieturin@amazon.com>

madolson approved these changes Dec 12, 2024

madolson linked an issue Dec 12, 2024 that may be closed by this pull request

[BUG] Clusters can become inconsistent if one side times out the handshake during cluster meets #1251

Closed

madolson merged commit 5f7fe9e into valkey-io:unstable Dec 12, 2024
48 checks passed

PingXie reviewed Dec 12, 2024

enjoy-binbin mentioned this pull request Dec 13, 2024

Drop the MEET packet if the link node is in handshake state #1436

Merged

pieturin deleted the cluster-handshake-fix branch December 13, 2024 17:56

pieturin mentioned this pull request Dec 13, 2024

Only (re-)send MEET packet once every handshake timeout period #1441

Open
