[NEW] Cluster V2 Discussion #384

PingXie · 2024-04-25T23:19:38Z

Here are the problems that I think we need to solve in the current cluster:

strong consistency (for cluster topology)

cluster topology is concerned with which nodes own which slots and primaryship. The current cluster implementation is not even eventually consistent by design because there are places where node epochs are bumped without consensus (trade-offs). This leads to increased complexity on the client side.

better manageability (of global config/data)

This particular issue provides the exact context on this pain point

more resilience (to stressful client workload)

Today, both the cluster bus and the client workload run on the same main thread. So a demanding client workload has the potential to starve the cluster bus and leads to unnecessary failover.

higher scale

The V1 cluster is a mesh so the cluster gossip traffic is proportional to N^2, where N is the (data) nodes in the cluster. The practical limit of a V1 cluster is ~500 nodes.

Originally posted by @PingXie in #58 (comment)

artikell · 2024-04-26T05:27:48Z

I understand that the issue of strong consistency&better management capability seems to be more caused by large-scale decentralized architecture. One idea is to consider using Redis Sentinel to manage Redis clusters. Control meta information and highly available operations within a small range of nodes.

Regarding the issue of more resilience, I have noticed a previous issue: redis/redis#10878 Perhaps @madolson has some work to do

Regarding the design of a higher scale, an excessively large cluster size can lead to an increase in the number of connections and a decrease in performance.
This can consider whether the node supports larger specifications to enhance the capacity of the entire cluster. So, forkless is also a point that needs attention.

hpatro · 2024-04-29T23:11:21Z

There were few more points which I feel should be addressed as part of the new design which I brought up here, #58 (comment) Reposting my thoughts here (maybe we can merge things into the top level comment)

Support higher Scaling (Larger cluster size) - Currently Valkey cluster can't scale beyond 500 nodes beyond which the gossip protocol doesn't scale well. The new architecture should strive to handle much larger cluster size.
Support smaller cluster size - With the current design, only primary nodes can vote hence the ideal cluster would require atleast 3 primaries to form a quorum. The new design should take into consideration where each node (primary/replica) should be part of the quorum.
Centralized metadata store - Currently config, functions, acl needs to be applied on each node. This is painful and cumbersome during cluster setup as well as during scale out. Building a mechanism to be able to send those information once and being applied to all of the nodes would be ideal.
Decouple control and data transfer via cluster bus - Currently, the clusterbus carries both control data (health info, gossip info, etc) as well as pubsub data. With high pubsub data, the clusterbus can get overwhelmed and cause issue with the cluster health status and can delay cluster topology information consistency in the system.
Lower failover time - Due to the primary node only consensus mechanism in the current approach, failover in a shard can be delayed if there is a network partition and other primaries aren't reachable.

hwware · 2024-05-01T14:44:13Z

I am not sure if in cluster v2, is there plan to remove sentinel node?

madolson · 2024-05-01T16:37:33Z

I am not sure if in cluster v2, is there plan to remove sentinel node?

My directional long term ask would be to merge them together. It remains API compatible but becomes a special case instead of a different deployment mode.

PingXie · 2024-05-01T17:57:00Z

The way I see it, a bit philosophically, the operational convenience from the coupling of cluster management with data access comes at the cost of complexity, reliability and scalability. In this sense, a big part of the cluster V2 aspiration is to go back to the Sentinel architecture and decouple the two so there is the chance of merging the two (sentinel and cluster v2), conceptually speaking. However, I would also be interested in retaining the existing operational experience.

madolson · 2024-05-01T19:19:08Z

The one thing I don't want to retain with sentinel is that I don't want there to necessarily be a distinct "sentinel" nodes. The pathway for other projects like kafka is there is internal control nodes, but they are organized in the same cluster, so more transparent to users. If you think about it from kubernetes deployment, we want to deploy 1 cluster that is able to internally handle failovers. I think that was one of the things I disliked about the Redis ltd., which is they wanted to force users to understand the TD and FC concepts.

madolson · 2024-05-01T19:22:14Z

Here was my original list (removing the features that have been implemented)

Improved use case support

This pillar focuses on providing improved functionality outside of the core cluster code but helps improve the usability of cluster mode.

Clusterbus as HA for single shard:
Allows the clusterbus to replace sentinel as the HA mechanism for Redis. This will require voting replicas which is dicussed later.
Add specification for a new cluster implementation. redis/redis#10875
Custom hashing support:
Some applications want to have their own mechanism for determining slots, so we should extend the hashtag semantics to include information about what slot the request is intended for.
Hashtag scanning/atomic deletion:
A common ask has been for being able to use scan like commands to find elements in a hashtag without having to scan the entire keyspace. A proposal is to be able to create a group of keys that can be atomically deleted. A secondary index could also solve this issue.
(I'm sure there is an issue for this, I'll find it)

Cluster management improvements

This pillar focuses around improving the ease of use for managing Redis clusters.

Consensus based + Atomic slot migration:
Implement a server based slot migration command that migrates the data from one slot to another slot. (We have a solution we hopefully will someday post for this)
"CLUSTER MIGRATE slot node" command redis/redis#2807
Improved metrics for slot performance:
Add metrics for individual slot performance to make decisions about hot shards/keys. ** This makes it easier to identify slots that should be moved. Easy metrics to grab our key accesses, ideally memory would be better but that's hard.
Dynamic slot ownership
For all master clusters in caching based used cases, its data durability is not needed and nodes in a cluster can simply take over slots from other nodes when a node dies. Adding nodes can also mean that it will automatically takeover slot ownership from other nodes.
[Cluster] "Cache-Only" mode redis/redis#4160
Auto scaling
Support automatic rebalancing of clusters when adding nodes/removing nodes as well as during steady state when there is traffic load mismatch.
Redis Cluster Auto Scaling - Requirements redis/redis#3009
Moving cluster bus to a separate thread, improved reliability in case of busy server
Today if the main thread is busy it main not respond to a health check ping even though it is still up and healthy. Refactoring the clusterbus onto its own thread will make it more responsive.
Refactor abstractions in cluster.c:
Several abstractions in cluster.c are hard to follow and should be broken up including: Cluster bus and node handling, slot awareness, health monitoring.
Human readable names for nodes:
Today individual Redis nodes report their hexadecimal names, which are not human readable. Instead we should additionally assign them some more readable name that is either logical or corresponds to their primary.
Cluster human readable nodename feature redis/redis#9564
*Module support for different consensus algorithms *
Today Redis only supports the clusterbus as a consensus algorithm, but we could also support module hooks for other forms of consensus.

Cluster HA improvements

This pillar focuses on improving the high availability aspects of Redis cluster and focuses around improving failover and health checks.

Reduce messages sent for node health decisions:
The Redis clusterbus has an NxN full mesh of gossip health messages. This can cause performance degradation and instability in large clusters as health and voting authorization is slow. There are several ways to solve this such as having failovers be shard local or being smarter about propagation of information.
Redis Cluster: reduce gossip messages total traffic redis/redis#3929
Voting replicas: (group this with other conensus ones)
Today replicas don’t take part in leader election, this would be useful for smaller cluster sizes especially single shards.
Add specification for a new cluster implementation. redis/redis#10875
Avoiding cascading failovers leading to data loss:
It's possible that a replica without data can be promoted to be the master role and lost all data in the shard. This is typically the result of a cascading failover. Ideally we should add a stopgap here to prevent this last node from being demoted.
Placement awareness:
Today the individual nodes have no concept of how they are placed compared to each other, and will happily allow all the primaries to exist in the same zone. This also may include the notion of multi-region awareness.
RESPV3 topology updates
Today clusters come to learn about topology changes when they send a request to the wrong node. This can be limited by having nodes proactively notify clients when a topology change has occurred. This can be inefficient since today clients need to call CLUSTER SLOTS to re-learn the entire topology. A client can opt into topology changes, and from that point on it will receive information about just what topology has changed.
CLUSTER SUBSCRIBE SLOTS (topology changes) redis/redis#10150

madolson · 2024-05-01T19:24:50Z

@PingXie What do you want to get consensus on here? At it's core, I think the next step is we need to have someone come up with a concrete design. Independently, I also want us to finish the module interface for cluster so that we can work on that as a module that can be tested and folks can opt-in to it during the lifecycle of Valkey 8 and we can GA it in Valkey 9.

PingXie · 2024-05-01T21:00:22Z

@PingXie What do you want to get consensus on here?

The value proposition, aka the "why" question. I consider this thread to be more of an open-ended discussion for the broader community.

Independently, I also want us to finish the module interface for cluster so that we can work on that as a module that can be tested and folks can opt-in to it during the lifecycle of Valkey 8 and we can GA it in Valkey 9.

Agreed. I think it is wise to avoid coupling whenever possible. "modularization of the cluster management logic" is a good thing on its own and practically speaking it is actually a half-done job already. I don't like where we are and I think we should just go ahead and finish it properly.

At it's core, I think the next step is we need to have someone come up with a concrete design.

I am onboard with that and I can see it being a parallel thread to the "why" discussion (this thread).

How about we break this topic into three issues/discussoins?

the "value" discussion can stay on this thread
we can create a new/more concrete task to track the modularization work
we can also deep dive into a strawman design proposal for cluster v2.

ag-TJNII · 2024-05-30T13:50:03Z

I've spend the last few days working with ValKey clustering and there are some definite gaps that need addressed.

The biggest issue I'm seeing is inconsistencies between cluster nodes and cluster shards. Here's one I'm seeing right now:

10.185.19.186:22952> cluster shards
[Snip]
4) 1) "slots"
   2) (empty array)
   3) "nodes"
   4) 1)  1) "id"
          2) "a9e2294479e4ab6a5134fee4251854ebe35dd6a6"
          3) "tls-port"
          4) (integer) 20208
          5) "ip"
          6) "10.185.10.46"
          7) "endpoint"
          8) "10.185.10.46"
          9) "role"
         10) "master"
         11) "replication-offset"
         12) (integer) 4016
         13) "health"
         14) "fail"
      2)  1) "id"
          2) "b3294652fb0b2c65bdcbf00b52c8f9c3762854ea"
          3) "tls-port"
          4) (integer) 22952
          5) "ip"
          6) "10.185.19.186"
          7) "endpoint"
          8) "10.185.19.186"
          9) "role"
         10) "master"
         11) "replication-offset"
         12) (integer) 4016
         13) "health"
         14) "online"

10.185.19.186:22952> cluster nodes
[Snip]
a9e2294479e4ab6a5134fee4251854ebe35dd6a6 10.185.10.46:20208@26038 master,fail - 1717074501516 1717074497476 7 connected
b3294652fb0b2c65bdcbf00b52c8f9c3762854ea 10.185.19.186:22952@29926 myself,master - 0 1717075359000 9 connected 10923-16383

So cluster nodes shows node b3294652fb0b2c65bdcbf00b52c8f9c3762854ea being the primary for slots 10923-16383 whereas cluster slots shows b3294652fb0b2c65bdcbf00b52c8f9c3762854ea not having any slots.

I've also seen nodes showing online in cluster slots but disconnected in cluster nodes, connected with no IP, and online with no IP. This inconsistency makes it very hard to manage, I think that in v2 improving reporting on the state of the cluster needs to be improved.

Stepping back a bit my general understanding is that cluster management is a Redis paid enterprise feature, so there's no official Sentinel parallel for clustering. Reading this thread it looks like the Sentinel pattern is planned to be kept here as well. I think some of the pain I'm feeling is likely intentional grit in the gears to drive customers into paid plans, I'm hoping this grit can be cleaned out in ValKey.

A use case that I do think is under-supported in both clustering and HA is ephemeral environments where there is no persistent storage and ports are dynamic, such as container platforms. We run many instances in this fashion for caching. In this case the Redis/Valkey instance is not the sole store of the data, it's just a cache of data that can be regenerated if need be. As such solving the persistent storage or nat problems is more expensive than just dealing with the cache loss if the instance is dropped, so that's the tradeoff we've made. We would like to have HA and clustering so that we can be more resilient. To this end I really need the ability to tell Valkey in immutable config that it's supposed to be in a HA cluster with other nodes, and have it go join the cluster automatically without further interaction. Sentinel's rewriting of it's own config file and the fact that it remembers nodes forever is a problem here, as I can't manage the config with a template and the config will cruft up with ip:port pairs over time.

Clustering works a lot better, though I am driving it all myself. Aside from the command inconsistencies the shard ID concept feels underdeveloped. Determining which nodes should be in a shard by slots alone is painful. For bootstrapping I settled on making my own shard ID in config, and making my Sentinel parallel use that ID to find the cluster node IDs that should be joined. I had to resort to doing this completely externally, getting the Valkey cluster ID from the instance and exposing a external API that correlates the cluster ID with my external shard ID to do the bootstrapping. This pattern solved a lot of my problems and will allow for future slot migrations, so I think it was a good direction to go. Given that Valkey has the concept of a shard ID it would be nice if that could be exposed through the config in a way where the admin can use that to tell Valkey that certain nodes are meant to be grouped in the same shard.

But, to reiterate, the biggest thing I'd like to see in Cluster v2 is support for environments with dynamic ip:port combos and no persistent storage: When a node is gone it's gone and never coming back. I'd like to be able to declare in config which nodes should be in the cluster and have it bring itself online, adding nodes as they join and balancing slots. I believe this is inline with the Sentinel patterns I've read about in the other comments.

Thanks for working on making this feature even better!

ag-TJNII · 2024-06-11T18:49:02Z

It would also be nice if the HA solutions didn't require so much client side support. Current clustering needs support for the MOVED command, which seems to be hit-or-miss. In the Ruby Redis implementation they spent years saying they won't support it, and today it's somewhat supported but requires installing extra packages. Sentinel officially requires client support, connecting to the sentinel to find the leader then connecting to that node. These are all things the server admin should care about, not the application owner. App developers shouldn't need to know if they're connecting to a single node, a Sentinel HA cluster, or a full cluster in code. They definitely shouldn't need to deal with the install dependencies being different. The server admin should be able to deploy the topology that makes the most sense for the deployment, and the client doesn't care.

I personally recommend making MOVED command support be a requirement of all clients, as in my research of the use cases I'm concerned with being able to direct a client to the correct node as needed holds a lot of power. With MOVED the client can connect to any server, or a simple load balancer, and the server can redirect it to the proper node for the operation at hand. On the surface that seems to solve a lot of problems and eliminates client topology specific config. I'm far from a SME, though, and I do recognize that MOVED opens up client connection pooling/reuse concerns, but from my chair that's the direction I'd go.

fidoriel · 2024-09-04T21:15:28Z

Everything mentioned addresses the problems we experience with redis cluster. If then it is developed together with an official maintained and stable Kubernetes operator there would be nothing to be wished for.

madolson mentioned this issue May 1, 2024

Redis Cluster V2 project redis/redis#8948

Open

supercaracal mentioned this issue Aug 9, 2024

Client is raising Redis::Cluster::NodeMightBeDown when Redis Cluster is observed to be healthy redis-rb/redis-cluster-client#368

Closed

zuiderkwast mentioned this issue Aug 27, 2024

[NEW] Migrate non topology update cluster bus messages to light header type #932

Open

zuiderkwast mentioned this issue Jan 21, 2025

[NEW] Non-voting primaries, voting empty primaries, voting replicas #1600

Open

PingXie mentioned this issue Feb 18, 2025

[NEW] Multiple DB supports in cluster mode #1319

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NEW] Cluster V2 Discussion #384

[NEW] Cluster V2 Discussion #384

PingXie commented Apr 25, 2024

artikell commented Apr 26, 2024

hpatro commented Apr 29, 2024

hwware commented May 1, 2024

madolson commented May 1, 2024

PingXie commented May 1, 2024 •

edited

Loading

madolson commented May 1, 2024

madolson commented May 1, 2024

madolson commented May 1, 2024

PingXie commented May 1, 2024

ag-TJNII commented May 30, 2024 •

edited

Loading

ag-TJNII commented Jun 11, 2024

fidoriel commented Sep 4, 2024

[NEW] Cluster V2 Discussion #384

[NEW] Cluster V2 Discussion #384

Comments

PingXie commented Apr 25, 2024

artikell commented Apr 26, 2024

hpatro commented Apr 29, 2024

hwware commented May 1, 2024

madolson commented May 1, 2024

PingXie commented May 1, 2024 • edited Loading

madolson commented May 1, 2024

madolson commented May 1, 2024

Improved use case support

Cluster management improvements

Cluster HA improvements

madolson commented May 1, 2024

PingXie commented May 1, 2024

ag-TJNII commented May 30, 2024 • edited Loading

ag-TJNII commented Jun 11, 2024

fidoriel commented Sep 4, 2024

PingXie commented May 1, 2024 •

edited

Loading

ag-TJNII commented May 30, 2024 •

edited

Loading