Support smooth switchover in standalone mode by using REDIRECT #319
Comments
I think we need to improve this scenario. I think you @soloestoy can be our Chief of Standalone Mode. It seems like many others in the core team are more focused on Cluster. The main concern is clients. Cluster clients already handle MOVED, but there the slot is in the range 0-16383. What happens if they see -1 here? Worst case, they crash. Can we make a small investigation to make sure some of the most popular clients don't crash? Then I think we can add it, but with the default configuration set to off. We need to update the docs for clients.
I like the general idea here. This could be a big help for applications that don't require clusters. Are there any dumb clients still in use? (Dumb meaning that they won't even support
Do we need to do this part? Does the client need to know that this is "redirect from replica to master" and not "redirect to another shard"? I can see that if we have a smart enough client, it can detect the -1, and redirect all future requests to the new master. That would be a good optimization and would apply to either standalone or cluster mode. But, is it anything more than a very good optimization for a rare event? Initially, none of the clients will understand -1. Even if they don't crash, they won't work. It would be surprising if this crashed This feels like a breaking change. Perhaps the protocol could be extended to allow a client that understands the -1 to turn this on just for the client's connection. All of this makes me think it would be nice for cluster V2 to be designed so that standalone was just a special case: a cluster with one node. Then standalone could be easily scaled up, and a cluster could be scaled down to standalone.
@daniel-house Only cluster clients recognize -MOVED redirects. Far from all clients are cluster clients. |
This is a very fine idea, but the effectiveness of this feature apparently depends on clients being collaborative. Given the tricky client ecosystem situation, IMO, until we have a mature valkey client ecosystem, which is a much bigger fish to fry, few users will be able to take advantage of this feature. Btw, I agree with @daniel-house that this will be a breaking change, so it shouldn't be enabled by default. @soloestoy @CharlesChen888 what are your thoughts on the client support? Alternatively, why wouldn't simply dropping the connection from the replica side work? It is not clear to me how the network is set up in this scenario. For instance, is there an ILB in front of the two nodes? Or are the two nodes exposed directly to the client? If the two nodes are exposed to the client directly, how does the client tell who the primary is in the first place? I would assume somewhere in your system there is a way to help the client figure out who the primary is. If so, and if the client has been set up to retry the request on errors, dropping the connections would be more compatible with the common error handling logic. PS, cluster V2 also depends on us having a mature valkey-native client ecosystem.
First, let's shift our focus away from -1, as it's not the main point. We can solely use -1 in standalone mode, while the cluster continues to maintain access to the slot : ) Indeed, this is an improvement that requires joint effort from both the client and server. In the typical pathway for applying new features, it is first implemented on the server side, followed by client-side support. Things are indeed moving in this direction, and in fact, it has already been accepted by the Redis core team. However, it is regrettable that Redis modified the protocol before the merge could happen. On the client side, I believe we need to have our own Valkey client to push the development of the ecosystem. Major languages should have one or more clients. I think we can start with Java, either by forking Jedis or writing a new one, we can name it Jackey (Java client for Valkey), @yangbodong22011 is very interested in being the maintainer for Jackey. We can initiate a vote to decide.
As far as I know, dropping the connection is not a smooth way. Imagine that, after sending a command, the connection breaks before the response is received: that is an error to clients. If the replica drops the connection upon receiving a request, the client will attempt to reconnect (it cannot redirect, since no MOVED was received), and this would be an infinite loop.
I can somewhat grasp what you are saying, but I don't fully understand it. Let me explain based on my understanding:
Although it is a breaking change, I believe it is mostly safe, because a client that previously got an error would now get a redirect, which is also a kind of error reply. Most clients just return this to the user code. The only risk I can see is if there are any clients that behave like this:
I don't know if any clients behave like this, but it is a possible behaviour and therefore a risk. We can completely avoid this by using a new, different redirect reply, such as @soloestoy I don't remember exactly, but I think this was already discussed before. But maybe we didn't discuss why it is more safe? |
I have heard of smart clients but never cluster clients. I am assuming that cluster=smart and smart clients are ones that understand ASK/ASKING and re-issue the query in the appropriate way to the shard indicated by the MOVED response. If that is not so, please correct me. Please forgive me, but I am incredibly picky when it comes to jargon and definitions and stuff like that. I could have just assumed
I have never heard the term "smart" clients. |
Hmmm, I find lots of references to "smart client" using Google, but none of them are on any Redis page. I also see terms like "cluster aware" and "caching". There is definitely a need for better jargon. |
I like using a totally new message such as REDIRECT. In the switchover there will be a period during which there is no server handling queries on the original master's port - what is a client supposed to do? Is there a risk of a network storm in this scenario? Would MOVED/REDIRECT increase or decrease this risk? Should the scenario be extended to include one last step: switching back to the original host:port? |
Yeah, I am onboard with forking the clients now. With the amount of the new features that we are planning, I see this becoming the necessary evil. I think I was the only one holding out on this forking idea among the core team members so we should be able to close on this here async.
Technically speaking, the new message is an error for clients not recognizing the proposed new error string too. But yes, there is a chance to make the experience better with an "enlightened" client.
Yeah this is what I was trying to get at in my questions. I think your explanation answered them. Thanks
@zuiderkwast, curious, are you aware of a client actually doing this or this is just a hypothesis? In any case, agreed that a new message is cleaner.
I don't think so. This is essentially the cluster behavior today.
This is the same scenario too. Whenever the primary gets demoted, the client will get the notification with this proposal. I am aligned with this proposal and the new message. |
I'm not sure whether there are really any clients using MOVED to detect which mode the server is in. I don't think it is a good approach, since a cluster would not return MOVED when clients access the master node. But there are some clients using the SELECT or CLUSTER command to detect the mode, so we cannot know all clients' behavior. I'm glad with
I'm not aware of any. As I mentioned earlier "I don't know if any clients behave like this, but it is a possible behavior and therefore a risk". Example:
To implement #319

1. replica is able to redirect read and write commands to its primary in standalone mode
   * reply with "-REDIRECT primary-ip:port"
2. add a subcommand `CLIENT CAPA redirect`, a client can announce the capability to handle redirection
   * if a client can handle redirection, the data access commands (read and write) will be redirected
3. allow `readonly` and `readwrite` command in standalone mode, may be a breaking change
   * a client with redirect capability cannot process read commands on a replica by default
   * use READONLY command can allow read commands on a replica

---------

Signed-off-by: zhaozhao.zz <zhaozhao.zz@alibaba-inc.com>
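For client authors, the redirect reply described in this commit message could be handled roughly as follows. This is only a sketch: the reply shape `-REDIRECT primary-ip:port` and the `CLIENT CAPA redirect` announcement come from the text above, while the names `CAPA_HANDSHAKE`, `parse_redirect` and `next_target` are hypothetical and do not belong to any existing client library.

```python
from typing import Optional, Tuple

# Command a client would send after connecting to announce redirect support
# (taken from the commit message; how it is sent depends on the client library).
CAPA_HANDSHAKE = ("CLIENT", "CAPA", "redirect")

Address = Tuple[str, int]


def parse_redirect(error_line: str) -> Optional[Address]:
    """Return (host, port) if error_line is a REDIRECT reply, else None."""
    parts = error_line.lstrip("-").split()
    if len(parts) != 2 or parts[0] != "REDIRECT":
        return None
    # rpartition splits on the last ':' so IPv6-style hosts keep their colons.
    host, sep, port = parts[1].rpartition(":")
    if not sep or not host or not port.isdigit():
        return None
    return host, int(port)


def next_target(current: Address, error_line: str) -> Address:
    """Pick the node to talk to next: the primary named in a REDIRECT reply,
    or the current node for any other error (which is left to the caller)."""
    target = parse_redirect(error_line)
    return target if target is not None else current
```

A client built this way would reconnect to the address returned by `next_target` and re-issue the command there; any reply that is not a REDIRECT is passed through to user code unchanged.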
Issue: redis/redis#12097
PR: redis/redis#12192
In many scenarios, we need to perform the action of switching. Take a rolling version upgrade as an example, suppose there are two nodes A and B, where A is the master and B is a replica of A. Under normal circumstances, users only access the master node A. Then, during a version upgrade, users would first upgrade the version of the replica node B, then switch to make B the master and turn A into B's replica. Simultaneously, user access also needs to switch from node A to node B, followed by an upgrade of node A to complete the entire version upgrade process:
However, during this process, the switch of user access is not entirely smooth. In fact, cluster mode and standalone mode react differently:

* Cluster mode: `-MOVED slot ip:port`. In cluster mode, this is a routine result. The client will not treat it as an error returned to the user. Instead, the client will automatically redirect to the new master node B, as indicated by the `ip:port` in the result, thus achieving a smooth switch.
* Standalone mode: `-READONLY You can't write against a read only replica.` The client will pass this error to the user, and a smooth switch cannot be achieved.

As we can see, the switching experience in cluster mode is clearly superior to that of standalone mode. Therefore, to improve the switching process in standalone mode, we can utilize the `MOVED` redirect mechanism. The first step would require the valkey server to use `MOVED` to replace `READONLY` in standalone mode, and then support for handling the `MOVED` return value needs to be implemented in the client ecosystem.

The latest discussion is that we want to introduce a new reply, `REDIRECT`, so that nothing breaks in cluster mode.

In more detail, the slot returned by the replica in the `MOVED` reply can be `-1`, and this can also be applied to cluster mode. It serves to clearly inform the client that it has accessed a replica and should be redirected to the master, rather than having accessed the wrong shard requiring redirection.
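To make the `-1` slot idea concrete, here is a minimal sketch of how a client might tell the two kinds of `MOVED` apart. It assumes the reply shape `-MOVED slot ip:port` quoted above and the normal slot range 0-16383 mentioned in the comments; the `Moved` type and `parse_moved` function are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class Moved(NamedTuple):
    slot: int
    host: str
    port: int

    @property
    def from_replica(self) -> bool:
        # Slot -1 is the proposed marker for "you hit a replica, go to the
        # master", as opposed to "this slot lives on another shard".
        return self.slot == -1


def parse_moved(error_line: str) -> Moved:
    """Parse a '-MOVED slot ip:port' error line; raise ValueError otherwise."""
    parts = error_line.lstrip("-").split()
    if len(parts) != 3 or parts[0] != "MOVED":
        raise ValueError(f"not a MOVED reply: {error_line!r}")
    slot = int(parts[1])
    if not -1 <= slot <= 16383:
        raise ValueError(f"slot out of range: {slot}")
    host, sep, port = parts[2].rpartition(":")
    if not sep or not host:
        raise ValueError(f"bad address: {parts[2]!r}")
    return Moved(slot, host, int(port))
```

A client that only accepts slots 0-16383 would reject the `-1` case, which is the compatibility concern raised in the comments and part of the motivation for a separate `REDIRECT` reply.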