Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send MEET packet to node if there is no inbound link to fix inconsistency when handshake timedout #1307

Merged
merged 9 commits into from
Dec 12, 2024

Conversation

pieturin
Copy link
Contributor

@pieturin pieturin commented Nov 14, 2024

In some cases, when meeting a new node, if the handshake times out, we can end up with an inconsistent view of the cluster where the new node knows about all the nodes in the cluster, but the cluster does not know about this new node (or vice versa).
To detect this inconsistency, we now check if a node has an outbound link but no inbound link, in this case it probably means this node does not know us. In this case we (re-)send a MEET packet to this node to do a new handshake with it.
If we receive a MEET packet from a known node, we disconnect the outbound link to force a reconnect and sending of a PING packet so that the other node recognizes the link as belonging to us. This prevents cases where a node could send MEET packets in a loop because it thinks the other node does not have an inbound link.

This fixes the bug described in #1251.

@pieturin pieturin force-pushed the cluster-handshake-fix branch from 1920952 to d65423e Compare November 14, 2024 20:16
Copy link

codecov bot commented Nov 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.90%. Comparing base (fd58f8d) to head (8086a07).
Report is 35 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1307      +/-   ##
============================================
+ Coverage     70.62%   70.90%   +0.28%     
============================================
  Files           117      119       +2     
  Lines         63324    64631    +1307     
============================================
+ Hits          44722    45828    +1106     
- Misses        18602    18803     +201     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.83% <100.00%> (+0.14%) ⬆️

... and 63 files with indirect coverage changes

Copy link
Contributor

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we disconnect the outbound link if inbound link is not available? I think it will lead to the same reconnection flow. Would it help with having simpler code and one unified flow. I'm not sure if it will perform the MEET operation though.

src/cluster_legacy.c Outdated Show resolved Hide resolved
src/cluster_legacy.h Outdated Show resolved Hide resolved
src/cluster_legacy.c Outdated Show resolved Hide resolved
src/cluster_legacy.c Outdated Show resolved Hide resolved
tests/unit/cluster/cluster-reliable-meet.tcl Show resolved Hide resolved
tests/unit/cluster/cluster-reliable-meet.tcl Outdated Show resolved Hide resolved
tests/unit/cluster/cluster-reliable-meet.tcl Show resolved Hide resolved
@pieturin
Copy link
Contributor Author

pieturin commented Nov 14, 2024

What if we disconnect the outbound link if inbound link is not available?

In this case we would just re-open an outbound connection, which the other node will accept, but it won't force the other node to recognize us as being part of the cluster if it doesn't trust us yet. The only way to force the other node to add us to its cluster view is for us to send a MEET packet.

@hpatro
Copy link
Contributor

hpatro commented Nov 14, 2024

What if we disconnect the outbound link if inbound link is not available?

In this case we would just re-opened an outbound connection, which the other node will accept, but it won't force the other node to recognize us as being part of the cluster if it doesn't trust us yet. The only way to force the other node to add us to its cluster view is for us to send a MEET packet.

CLUSTER MEET is an admin operation but I guess we are fine with the case of reinitiating it if the operation wasn't successful in first place and retry it.

Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do something like #461? only clear the CLUSTER_NODE_MEET flag when myself receive a "ack" (not the plain PONG but something with a strong ack, ack that sender has already meet myself?) I haven't thought about it carefully, but i feel it is more reliable?

@madolson
Copy link
Member

only clear the CLUSTER_NODE_MEET flag when myself receive a "ack" (not the plain PONG but something with a strong ack, ack that sender has already meet myself?

Do you mean by like adding a new flag? I think the concern is we could still end up in the inverse state, where the the node that received the "strong" ack will put the other node online but then might go offline.

My original thought was that as long as one node believes the other is part of the cluster, is should try to have the other node join. It's sort of like an "enhanced" version of how we built up the mesh when two disjoin clusters meet each other.

@pieturin
Copy link
Contributor Author

can we do something like #461? only clear the CLUSTER_NODE_MEET flag when myself receive a "ack" (not the plain PONG but something with a strong ack, ack that sender has already meet myself?) I haven't thought about it carefully, but i feel it is more reliable?

We could do a 3-way handshake to strengthen the handshake reliability, instead of the current -> MEET/PONG, <- PING/PONG. We could add an extra back and forth between the two nodes before clearing the flags. But we can always end up in a situation where one node thinks the handshake is done and the other node times out the handshake because the last packet got delayed or dropped.

With this solution, if the handshake has succeeded on one side and not the other, we ensure both sides will eventually know each other.

@pieturin
Copy link
Contributor Author

With this fix, we can sometime get in a (potentially) infinite loop where a node keeps sending a MEET packet to the other node, but both nodes know each other. This sometimes (although rarely) happens in the second test Handshake eventually succeeds after node handshake timeout on one side with inconsistent view of the cluster.

The following sequence of events can trigger this issue:

  1. Nodes 1 & 2 know each other, but don't know node 0.
  2. We make node 0 meet node 1. But the handshake times-out on node 1's side, but succeed on node 0's side.
  3. When node 1 marks the handshake as timed out, it will close both connections with node 0.
  4. From node 0 perspective, both connections to node 1 are closed. But since it knows node 1, it will re-open an outbound connection to it.
  5. Node 1 will accept the inbound connection from node 0, but it doesn't know this node, so it doesn't register this connection as belonging to any known node (ie: link->node stays NULL).
  6. With the change from this PR, node 0 will detect that node 2 doesn't have any inbound connection to it. So node 0 will send a MEET packet to node 2. (for this bug to happen, node 2 should be met first)
  7. Handshake with node 0 and 2 succeeds.
  8. Now node 2 gossips about node 0 to node 1. So node 1 will add node 0 to its list of known nodes.
  9. Node 1 opens a connection to node 0. At this point node 0 has both inbound and outbound connections to node 1. But from node 1's perspective, it only has an outbound connection to node 0. The inbound connection is not attached to any node (still has link->node set to NULL).
  10. So node 1 sends a MEET packet to node 0, since it doesn't think it has an inbound connection for it.
  11. The handshake completes successfully since node 0 responds to the MEET packet. But still no inbound connection.
  12. So node 1 keeps sending MEET packets to node 0 until node 0 sends a PING packet to node 1. When node 1 receives a PING packet from the node 0, it will set the node's inbound connection (here).

Node 0 should eventually send a PING packet to node 1, but there is no guarantee as to when that can happen. When I reproduce the issue, node 0 never gets a chance to send a PING to node 1 because node 0 overrides node->pong_received for node 1 when node 2 gossips about node 1 with a higher pong_received value. And node 2 always has a lower pong_received value compared to node 1 when trying to select a node to send a PING to here:

if (min_pong_node == NULL || min_pong > this->pong_received) {

I think there are various ways to mitigate this issue:

  1. Do nothing, the MEET packet loop should eventually stop. (but I have to update the test so that it's not flaky)
  2. Change the pinging decision logic to force a PING to every nodes at least every X amount of time, even if we know that another node was able to ping it recently. X can be set to something like 2 * (server.cluster_ping_interval ? server.cluster_ping_interval : server.cluster_node_timeout / 2). This will incur an increase in cluster bus traffic on large clusters.
  3. If I receive a MEET packet and I already have an outbound link for that node, then I should free my existing outbound link to it, and re-open a new one.

I think option 3 should work best without making too many changes to the current logic. But I'm open to suggestions.

@pieturin pieturin force-pushed the cluster-handshake-fix branch from 68324ec to f6eae88 Compare November 27, 2024 22:53
@pieturin
Copy link
Contributor Author

tests/unit/expire.tcl is currently failing in unstable too. I'll rebase once there is a fix.

@enjoy-binbin
Copy link
Member

#1368 the expire test is fixed

In some cases, when meeting a new node, if the handshake times out, we
can end up with an inconsistent view of the cluster where the new node
knows about all the nodes in the cluster, but the cluster does not know
about this new node (or vice versa).
To detect this inconsistency, we now check if a node has an outbound
link but no inbound link, in this case it probably means this node does
not know us. In this case we (re-)send a MEET packet to this node to do
a new handshake with it.

Signed-off-by: Pierre Turin <pieturin@amazon.com>
Signed-off-by: Pierre Turin <pieturin@amazon.com>
Update test to check node IDs instead of relying on number of words.
Rename nodeIsMeeting() to nodeInMeetState().
Introduce nodeInNormalState() macro.

Signed-off-by: Pierre Turin <pieturin@amazon.com>
If we receive a MEET packet from a known node, we disconnect the
outbound link to force a reconnect and sending of a PING packet so that
the other node recognizes the link as belonging to us.

Also deflaked one of the tests. And improved testing code following PR
comment.

Signed-off-by: Pierre Turin <pieturin@amazon.com>
Signed-off-by: Pierre Turin <pieturin@amazon.com>
@pieturin pieturin force-pushed the cluster-handshake-fix branch from 2e8c6f1 to 7ae3b05 Compare November 28, 2024 19:48
@enjoy-binbin enjoy-binbin requested a review from PingXie November 30, 2024 09:02
Signed-off-by: Pierre Turin <pieturin@amazon.com>
Sometimes the outbound link from node 0 to node 1 can be disconnected.
Assert that node 0 know node 1 without expecting the node to be marked
as connected.

Signed-off-by: Pierre Turin <pieturin@amazon.com>
@pieturin pieturin force-pushed the cluster-handshake-fix branch from 3b5aa41 to 6a017b0 Compare December 2, 2024 20:08
src/cluster_legacy.c Outdated Show resolved Hide resolved
src/cluster_legacy.c Show resolved Hide resolved
@hpatro
Copy link
Contributor

hpatro commented Dec 3, 2024

@pieturin Could you update the top comment as well about the exact behavior change?

Signed-off-by: Pierre Turin <pieturin@amazon.com>
Copy link
Contributor

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. But I would like someone else as well to look at it before merging this in.

@pieturin
Copy link
Contributor Author

pieturin commented Dec 3, 2024

Updated the PR description.

@pieturin
Copy link
Contributor Author

pieturin commented Dec 3, 2024

@PingXie, @madolson, @enjoy-binbin, any comments on this PR?

Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Top comment and code LGTM, did not review the tests.

@enjoy-binbin enjoy-binbin changed the title [cluster-bus] Send a MEET packet to a node if there is no inbound link Send MEET packet to node if there is no inbound link to fix inconsistency when handshake timedout Dec 5, 2024
src/cluster_legacy.c Outdated Show resolved Hide resolved
Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the comment isn't really a blocker for me, it's just about documentation.

Signed-off-by: Pierre Turin <pieturin@amazon.com>
@madolson madolson merged commit 5f7fe9e into valkey-io:unstable Dec 12, 2024
48 checks passed
Copy link
Member

@PingXie PingXie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to send out my (partial) review :)

Also I don't think I fully understand the problem that we are trying to solve and why the fix works.

src/cluster_legacy.c Show resolved Hide resolved
src/cluster_legacy.c Show resolved Hide resolved
src/cluster_legacy.c Show resolved Hide resolved
src/cluster_legacy.c Outdated Show resolved Hide resolved
src/cluster_legacy.c Show resolved Hide resolved
@PingXie
Copy link
Member

PingXie commented Dec 12, 2024

Also I don't think I fully understand the problem that we are trying to solve

Do I understand correctly that we are trying to solve two problems here?

  1. inconsistent cluster topology within the existing node.

More specifically, a new node that hasn't gone through the full handshake process can still get gossiped around the cluster and then picked up by existing nodes that are not involved in the handshake process?

  • existing nodes in a cluster: a, b, c
  • new node: A
  • a is instructed to meet A.
  • the handshake process never completes but A's PONG response to a's MEET should make it to a
  • a then gossips A's id to b and c, which then add A to their cluster view - because the server doesn't check A's MEET state (CLUSTER_STATE_MEET) today.
  • now a thinks A is in the MEET state but b and c think A is normal and A sees no other node.
  1. cluster meet reliability

If A fails to reply PONG to a's MEET, both a and A should eventually remove each other from its cluster view respectively and A will never make to b and c. In this sense, the existing cluster, made up of a, b, and c, is still consistent.

and why the fix works.

The fix for the first problem makes sense to me: resending MEET after the handshake timeout.

I am not sure how/if the second problem is addressed by this PR? if both a and A remove each other from their cluster view respectively, how can the handshake process be restarted?

@pieturin
Copy link
Contributor Author

I am not sure how/if the second problem is addressed by this PR? if both a and A remove each other from their cluster view respectively, how can the handshake process be restarted?

If the handshake times out on both side of the meet, then, this fix won't do anything. This fix is only meant to make sure we never end up with an inconsistent view of the cluster (where one side knows the other nodes, but not the other side). This makes the meet handshake more reliable but not fool proof.
We could revisit the proposal to make MEET synchronous or sticky (see: redis/redis#11095), which would solve the second problem.

The inconsistent view of the cluster can happen in two different ways (that I could see):

  • The handshake succeeds on one node but times out on the other node. This is tested by the test Handshake eventually succeeds after node handshake timeout on one side with inconsistent view of the cluster.
            # Node 0 -- MEET -> Node 1
            # Node 0 <- PONG -- Node 1
            # Node 0 <- PING -- Node 1 [Node 0 will mark the handshake as successful]
            # Node 0 -- PONG -> Node 1 [we drop this message, so node 1 will eventually mark the handshake as timed out]
  • The handshake times out on both sides, but the new node learns about other nodes of the existing cluster through the gossip section in the MEET packet. In this case, even if the handshake times out, the new node will know some of the nodes from the existing cluster, but they will not know this new node. This is tested by the test Handshake eventually succeeds after node handshake timeout on both sides with inconsistent view of the cluster.
            # Node 1 -- MEET -> Node 0 [Node 0 might learn about Node 2 from the gossip section of the msg]
            # Node 1 <- PONG -- Node 0 [we drop this message, so Node 1 will eventually mark the handshake as timed out]
            # Node 1 <- PING -- Node 0 [we drop this message, so Node 1 will never send a PONG and Node 0 will eventually mark the handshake as timed out]
  • a then gossips A's id to b and c, which then add A to their cluster view - because the server doesn't check A's MEET state (CLUSTER_STATE_MEET) today.

We don't gossip a node if it's in HANDSHAKE state. See:

valkey/src/cluster_legacy.c

Lines 4039 to 4048 in 5f7fe9e

/* In the gossip section don't include:
* 1) Nodes in HANDSHAKE state.
* 3) Nodes with the NOADDR flag set.
* 4) Disconnected nodes if they don't have configured slots.
*/
if (this->flags & (CLUSTER_NODE_HANDSHAKE | CLUSTER_NODE_NOADDR) ||
(this->link == NULL && this->numslots == 0)) {
freshnodes--; /* Technically not correct, but saves CPU. */
continue;
}

I'll prepare a follow-up PR to address your comments @PingXie, thanks for the review!

enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Dec 13, 2024
After valkey-io#1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink:
```
=== ASSERTION FAILED ===
==> '!link->node' is not true
```

In valkey-io#778, we will call setClusterNodeToInboundClusterLink to attach the node to the link
during the MEET processing, so if we receive a another MEET packet in a short time, the
node is still in handshake state, we will meet this assert and crash the server.

If the link is bound to a node and the node is in the handshake state, and we receive
a MEET packet, it may be that the sender sent multiple MEET packets when reconnecting,
and in here we are dropping the MEET. Note that in getNodeFromLinkAndMsg, the node in
the handshake state has a random name and not truly "known", so we don't know the sender.
Dropping the MEET packet can prevent us from creating a random node, avoid incorrect
link binding, and avoid duplicate MEET packet eliminate the handshake state.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@pieturin pieturin deleted the cluster-handshake-fix branch December 13, 2024 17:56
vudiep411 pushed a commit to Autxmaton/valkey that referenced this pull request Dec 15, 2024
…ency when handshake timedout (valkey-io#1307)

In some cases, when meeting a new node, if the handshake times out, we
can end up with an inconsistent view of the cluster where the new node
knows about all the nodes in the cluster, but the cluster does not know
about this new node (or vice versa).
To detect this inconsistency, we now check if a node has an outbound
link but no inbound link, in this case it probably means this node does
not know us. In this case we (re-)send a MEET packet to this node to do
a new handshake with it.
If we receive a MEET packet from a known node, we disconnect the
outbound link to force a reconnect and sending of a PING packet so that
the other node recognizes the link as belonging to us. This prevents
cases where a node could send MEET packets in a loop because it thinks
the other node does not have an inbound link.

This fixes the bug described in valkey-io#1251.

---------

Signed-off-by: Pierre Turin <pieturin@amazon.com>
enjoy-binbin added a commit that referenced this pull request Dec 16, 2024
After #1307 got merged, we notice there is a assert happen in setClusterNodeToInboundClusterLink:
```
=== ASSERTION FAILED ===
==> '!link->node' is not true
```

In #778, we will call setClusterNodeToInboundClusterLink to attach the node to the link
during the MEET processing, so if we receive a another MEET packet in a short time, the
node is still in handshake state, we will meet this assert and crash the server.

If the link is bound to a node and the node is in the handshake state, and we receive
a MEET packet, it may be that the sender sent multiple MEET packets so in here we are
dropping the MEET to avoid the assert in setClusterNodeToInboundClusterLink. The assert
will happen if the other sends a MEET packet because it detects that there is no inbound
link, this node creates a new node in HANDSHAKE state (with a random node name), and
respond with a PONG. The other node receives the PONG and removes the CLUSTER_NODE_MEET
flag. This node is supposed to open an outbound connection to the other node in the next
cron cycle, but before this happens, the other node re-sends a MEET on the same link
because it still detects no inbound connection.

Note that in getNodeFromLinkAndMsg, the node in the handshake state has a random name
and not truly "known", so we don't know the sender. Dropping the MEET packet can prevent
us from creating a random node, avoid incorrect link binding, and avoid duplicate MEET
packet eliminate the handshake state.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@PingXie
Copy link
Member

PingXie commented Dec 16, 2024

If the handshake times out on both side of the meet, then, this fix won't do anything.

make sense

We don't gossip a node if it's in HANDSHAKE state.

Right but in this case A is in the MEET state from a's perspective so it gets picked/gossiped.

I'll prepare a follow-up PR to address your comments

Thanks. Will TAL.

@pieturin
Copy link
Contributor Author

Right but in this case A is in the MEET state from a's perspective so it gets picked/gossiped.

Ah yes, you're right, this is what should happen:

            # Node a -- MEET -> Node A [Node A has flags HANDSHAKE|MEET]
            # Node a <- PONG -- Node A [After receiving this packet, Node A has flag MEET]
            # Node a <- PING -- Node A [After receiving this packer, we clear the MEET flag for Node A]
            # Node a -- PONG -> Node A

now a thinks A is in the MEET state but b and c think A is normal and A sees no other node.

This is correct that a thinks A is in MEET state, and b and c can get gossip information about it. But A might know about b and c, from the gossip section of the MEET packet from a.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Clusters can become inconsistent if one side times out the handshake during cluster meets
5 participants