Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: protocol out of sync when mixing sunsubscribe requests and slot migrations #691

Merged
merged 1 commit into from
Dec 10, 2024

Conversation

rueian
Copy link
Collaborator

@rueian rueian commented Dec 9, 2024

Fixes #690

In the TCP dumps provided by the issuer, we found the following traces during slot migration:

1. client -> [SUNSUBSCRIBE channel1] -> server
2. server -> [sunsubscribe channel1] -> client
3. client -> [HGET key field] --------> server
4. server -> [MOVED slot address] ----> client
5. server -> [True HGET response] ----> client

In the above sequence, (2) was the proactive sunsubscribe message sent by the server due to slot migration, and the message was wrongly delivered to the (1) SUNSUBSCRIBE request sent by the user. The true response to the (1) should be (4), the MOVED.

The wrong delivery of (2) caused the pipeline out-of-sync. The (4) MOVED was also wrongly delivered to (3), the HGET request. Then finally the (5), the true response of the HGET, couldn't be delivered anywhere, thus the pipeline panicked.

This situation is the same as valkey-io/valkey#1066.

Workaround

Ideally, we need to differentiate whether the server sent those sunsubscribe messages due to slot migration, but it is impossible. Our only choice is to ignore all sunsubscribe messages and make a dummy response for the SUNSUBSCRIBE request. Besides SUNSUBSCRIBE, we also apply this method to all *UNSUBSCRIBE so that we don't need to keep track of how many channels and patterns the client has subscribed to.

  1. We ignore *unsubscribe messages while delivering messages to requests.
  2. We make up responses for *UNSUBSCRIBE requests by injecting a PING request after each *UNSUBSCRIBE.
  3. If a *UNSUBSCRIBE request succeeds, then the PONG response will be delivered to it.
  4. if a *UNSUBSCRIBE request fails, such as MOVED or CLUSTERDOWN, then the error will be delivered to it, and the following PONG response will be discarded.

If the injected PING failed with a NOPERM error, we treat it as a normal PONG response.

@rueian rueian force-pushed the fix-sunsubscribe-outofsync branch 8 times, most recently from 594e80a to 6c2bfd8 Compare December 10, 2024 02:18
@rueian rueian marked this pull request as ready for review December 10, 2024 02:48
…migrations

Signed-off-by: Rueian <rueiancsie@gmail.com>
@rueian rueian force-pushed the fix-sunsubscribe-outofsync branch from 6c2bfd8 to 620608f Compare December 10, 2024 19:50
@rueian rueian merged commit 723761f into main Dec 10, 2024
60 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panic: protocol bug, message handled out of order
1 participant