[NEW] Cluster V2 Discussion #384
Comments
I understand that the issues of strong consistency and better management capability seem to be caused mostly by the large-scale decentralized architecture. One idea is to use Redis Sentinel to manage Redis clusters, keeping the control of metadata and high-availability operations within a small set of nodes. Regarding resilience, I noticed a previous issue, redis/redis#10878; perhaps @madolson has some related work underway. Regarding the design for higher scale, an excessively large cluster size can lead to an increase in the number of connections and a decrease in performance. |
There were a few more points that I feel should be addressed as part of the new design, which I brought up here: #58 (comment). Reposting my thoughts here (maybe we can merge them into the top-level comment).
|
I am not sure about cluster v2: is there a plan to remove the Sentinel node? |
My directional long-term ask would be to merge them together: Sentinel remains API compatible but becomes a special case rather than a separate deployment mode. |
The way I see it, a bit philosophically, the operational convenience of coupling cluster management with data access comes at the cost of complexity, reliability, and scalability. In that sense, a big part of the cluster V2 aspiration is to go back to the Sentinel architecture and decouple the two, which conceptually opens the door to merging Sentinel and cluster V2. However, I would also be interested in retaining the existing operational experience. |
The one thing I don't want to retain from Sentinel is the requirement for distinct "sentinel" nodes. The path other projects like Kafka have taken is to have internal control nodes, but organized within the same cluster, so they are more transparent to users. If you think about it from a Kubernetes deployment perspective, we want to deploy one cluster that is able to handle failovers internally. I think that was one of the things I disliked about the Redis Ltd. approach: they wanted to force users to understand the TD and FC concepts. |
Here was my original list (with the features that have already been implemented removed):

Improved use case support: This pillar focuses on providing improved functionality outside of the core cluster code that helps improve the usability of cluster mode.
Cluster management improvements: This pillar focuses on improving the ease of use of managing Redis clusters.
Cluster HA improvements: This pillar focuses on the high-availability aspects of Redis cluster, in particular improving failover and health checks.
|
@PingXie What do you want to get consensus on here? At its core, I think the next step is that we need someone to come up with a concrete design. Independently, I also want us to finish the module interface for cluster, so that we can develop it as a module that can be tested, folks can opt in to it during the Valkey 8 lifecycle, and we can GA it in Valkey 9. |
The value proposition, aka the "why" question. I consider this thread to be more of an open-ended discussion for the broader community.
Agreed. I think it is wise to avoid coupling whenever possible. "modularization of the cluster management logic" is a good thing on its own and practically speaking it is actually a half-done job already. I don't like where we are and I think we should just go ahead and finish it properly.
I am on board with that, and I can see it being a parallel thread to the "why" discussion (this thread). How about we break this topic into three issues/discussions?
|
I've spent the last few days working with Valkey clustering, and there are some definite gaps that need to be addressed. The biggest issue I'm seeing is inconsistencies between
So I've also seen nodes showing online in

Stepping back a bit, my general understanding is that cluster management is a Redis paid enterprise feature, so there's no official Sentinel parallel for clustering. Reading this thread, it looks like the Sentinel pattern is planned to be kept here as well. I think some of the pain I'm feeling is likely intentional grit in the gears to drive customers into paid plans; I'm hoping this grit can be cleaned out in Valkey.

A use case that I do think is under-supported in both clustering and HA is ephemeral environments where there is no persistent storage and ports are dynamic, such as container platforms. We run many instances in this fashion for caching. In this case the Redis/Valkey instance is not the sole store of the data; it's just a cache of data that can be regenerated if need be. As such, solving the persistent storage or NAT problems is more expensive than just dealing with the cache loss if the instance is dropped, so that's the trade-off we've made. We would like to have HA and clustering so that we can be more resilient.

To this end, I really need the ability to tell Valkey in immutable config that it's supposed to be in an HA cluster with other nodes, and have it join the cluster automatically without further interaction. Sentinel's rewriting of its own config file, and the fact that it remembers nodes forever, is a problem here: I can't manage the config with a template, and the config will cruft up with ip:port pairs over time.

Clustering works a lot better, though I am driving it all myself (a rough sketch of that scripting follows this comment). Aside from the command inconsistencies, the

But, to reiterate, the biggest thing I'd like to see in Cluster v2 is support for environments with dynamic ip:port combos and no persistent storage: when a node is gone, it's gone and never coming back. I'd like to be able to declare in config which nodes should be in the cluster and have it bring itself online, adding nodes as they join and balancing slots. I believe this is in line with the Sentinel patterns I've read about in the other comments.

Thanks for working on making this feature even better! |
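To make the "driving it all myself" point concrete, here is a rough sketch of the kind of bootstrap scripting this currently requires, using the existing CLUSTER MEET and CLUSTER ADDSLOTS commands through redis-py's generic execute_command(). The seed addresses and even slot split are placeholders, and a real script would also need retries and replica wiring; the point is that all of this orchestration has to live outside the server today, which is exactly what a declarative config could absorb.

```python
# Rough sketch of manual cluster bootstrap via existing CLUSTER commands.
# Addresses are placeholders; all seeds are assumed to become primaries.
import redis

SEEDS = [("10.0.0.1", 6379), ("10.0.0.2", 6379), ("10.0.0.3", 6379)]
TOTAL_SLOTS = 16384  # fixed hash-slot count in Redis/Valkey cluster

def bootstrap(seeds):
    conns = [redis.Redis(host=h, port=p) for h, p in seeds]

    # Introduce every node to the first one; gossip spreads the rest.
    first_host, first_port = seeds[0]
    for conn in conns[1:]:
        conn.execute_command("CLUSTER", "MEET", first_host, first_port)

    # Spread the 16384 hash slots evenly across the primaries.
    per_node = TOTAL_SLOTS // len(conns)
    for i, conn in enumerate(conns):
        start = i * per_node
        end = TOTAL_SLOTS if i == len(conns) - 1 else start + per_node
        conn.execute_command("CLUSTER", "ADDSLOTS", *range(start, end))

if __name__ == "__main__":
    bootstrap(SEEDS)
```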
It would also be nice if the HA solutions didn't require so much client-side support. Current clustering needs support for the MOVED redirection, which is hit-or-miss across clients. The Ruby Redis implementation spent years saying they wouldn't support it, and today it is somewhat supported but requires installing extra packages. Sentinel officially requires client support: connecting to the Sentinel to find the leader, then connecting to that node. These are all things the server admin should care about, not the application owner. App developers shouldn't need to know in code whether they're connecting to a single node, a Sentinel HA cluster, or a full cluster, and they definitely shouldn't need to deal with different install dependencies. The server admin should be able to deploy the topology that makes the most sense for the deployment, and the client shouldn't care. I personally recommend making |
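To spell out what that client-side MOVED support involves, here is a minimal sketch of the redirect-following loop every application otherwise has to carry. The `send_get` transport is a hypothetical stand-in, not a real library call; the only protocol fact assumed is the documented reply format "MOVED <slot> <host>:<port>". Real cluster-aware clients additionally cache the slot map, handle ASK, and retry on failures.

```python
# Minimal sketch of client-side MOVED handling. "send_get" is a stand-in for
# whatever raw command layer the client uses; it returns either the value or
# the error string from the server.
def follow_moved(send_get, key, node, max_redirects=5):
    for _ in range(max_redirects):
        reply = send_get(node, key)
        if not (isinstance(reply, str) and reply.startswith("MOVED")):
            return reply                      # normal reply, done
        # Error reply: "MOVED <slot> <host>:<port>" -> retry against the owner
        _, _slot, addr = reply.split()
        host, port = addr.rsplit(":", 1)
        node = (host, int(port))
    raise RuntimeError("too many MOVED redirects")

# Toy transport for demonstration: node ("a", 1) redirects to node ("b", 2).
def fake_send_get(node, key):
    if node == ("a", 1):
        return "MOVED 3999 b:2"
    return "value-of-" + key

print(follow_moved(fake_send_get, "k", ("a", 1)))  # -> value-of-k
```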
Everything mentioned addresses the problems we experience with Redis cluster. If it is then developed together with an officially maintained and stable Kubernetes operator, there would be nothing left to wish for. |
Here are the problems that I think we need to solve in the current cluster:
Cluster topology is concerned with which nodes own which slots and with primaryship. The current cluster implementation is not even eventually consistent by design, because there are places where node epochs are bumped without consensus (a trade-off). This leads to increased complexity on the client side (see the epoch sketch below).
This particular issue provides the exact context on this pain point
Today, both the cluster bus and the client workload run on the same main thread, so a demanding client workload has the potential to starve the cluster bus and lead to unnecessary failovers.
The V1 cluster is a full mesh, so the cluster gossip traffic grows proportionally to N^2, where N is the number of (data) nodes in the cluster. The practical limit of a V1 cluster is ~500 nodes (see the back-of-the-envelope numbers below).
Originally posted by @PingXie in #58 (comment)
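To make the epoch point concrete, here is a small illustration (my own sketch, not code from the server) of the rule effectively applied today when two nodes claim the same slot: the claim with the higher config epoch wins. If a node can bump its epoch without consensus, it can win that tie-break unilaterally, which is why clients have to cope with shifting slot maps.

```python
# Illustration only: the "highest config epoch wins" tie-break used to
# resolve conflicting slot-ownership claims.
from dataclasses import dataclass

@dataclass
class SlotClaim:
    node_id: str
    slot: int
    config_epoch: int

def resolve(claims):
    """Return the winning claim per slot: the one with the highest config epoch."""
    winners = {}
    for claim in claims:
        current = winners.get(claim.slot)
        if current is None or claim.config_epoch > current.config_epoch:
            winners[claim.slot] = claim
    return winners

claims = [
    SlotClaim("node-a", slot=100, config_epoch=5),
    SlotClaim("node-b", slot=100, config_epoch=7),  # epoch bumped without consensus
]
print(resolve(claims)[100].node_id)  # node-b wins slot 100
```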
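And to put a rough number on the N^2 gossip point: in a full mesh every node keeps a bus link to every other node, so the cluster-wide link count is N(N-1)/2 and each node gossips with N-1 peers. A quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope: how the full-mesh cluster bus scales with node count.
def mesh_links(n):
    """Number of bidirectional bus links in an n-node full mesh."""
    return n * (n - 1) // 2

for n in (10, 100, 500, 1000):
    print(f"{n:>5} nodes -> {mesh_links(n):>8} bus links, "
          f"each node talking to {n - 1} peers")
# 500 nodes already means 124,750 links cluster-wide and 499 peers per node,
# which is roughly where the ~500-node practical limit mentioned above comes from.
```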