Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

async redis cluster should use initial startup nodes during reinitialization in case of failover #2472

Open
artesby opened this issue Nov 26, 2022 · 5 comments · Fixed by valkey-io/valkey-py#167 · May be fixed by #3447
Open

Comments

@artesby
Copy link

artesby commented Nov 26, 2022

Version:
redis-py: 4.3.3
redis: 7.0.2

Platform:
python 3.7

Description:
Async implementation of RedisCluster overwrites initial startup nodes during first initialization. This is caused by this line.

When deployed in dynamic systems like Kubernetes your shards may sometimes change their IP addresses.
This likely leads to a situation when python client ends up with a pool of invalid (outdated) startup nodes and can not perform reinitialization, since NodeManager uses raw IP addresses.

For example, let's say we have cluster with 3 masters in k8s and use python client this way:

from redis.asyncio import RedisCluster

client = RedisCluster.from_url('redis://redis-cluster-headless:6379/0')
print(client.nodes_manager.startup_nodes)

# outputs {'redis-cluster-headless:6379': [host=redis-cluster-headless, port=6379, name=redis-cluster-headless:6379, server_type=None]}

await client # initialization
print(client.nodes_manager.startup_nodes)

# outputs something like this
# {'<ip1>:6379': [host=<ip1>, port=6379, name=<ip1>:6379, server_type=primary],
#  '<ip2>:6379': [host=<ip2>, port=6379, name=<ip2>:6379, server_type=primary],
#  '<ip3>:6379': [host=<ip3>, port=6379, name=<ip3>:6379, server_type=primary]}

now image your python client is idle for some time without requests, and during that period K8S have recreated each redis container due to whatever reason (crash, or moved to another node by autoscaler for example).

Then, next time python client will try to perform a redis call it will find out that all ClusterNodes are unavaliable, will try to reinitialize NodeManager and end up with no connectable node to get new cluster configuration from.
This would not be an issue if initial headless URL was still in the pool of startup nodes.

Sync implementation of RedisCluster preserves initial node address just as suggested above, so i believe async implementation should behave the same.

For now I use the following patch, but I can't call it an elegant way to get around the problem. It must be solved through changes to NodeManager.

class AsyncRedisCluster(RedisCluster):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._initial_startup_nodes = dict(self.nodes_manager.startup_nodes)

    async def initialize(self):
        rv = await super().initialize()
        self.nodes_manager.startup_nodes.update(self._initial_startup_nodes)
        return rv

P.S. sync implementation has the opposite problem, NodesManager doesn't pop unused addresses from startup_nodes storage => it may grow indefinitely as soon as your containers change IPs.

@utkarshgupta137
Copy link
Contributor

I'll be happy to review any PRs with a fix.

@jiyeonseo
Copy link

Hi! Is there any update on this? I got the same issue when I use AWS MemoryDB.
got below exception(the same one mentioned above) when all the node was down and recovered.

Redis Cluster cannot be connected. Please provide at least one reachable node: None

I thought doing some reinitialize would be working, but it was not.

@shuyingz
Copy link

shuyingz commented Feb 20, 2023

Hi! I got the same issue too when I use AWS redis. Exactly the same exception as Jiyeonseo posted:

"redis_error": "Redis errors out with exception: Redis Cluster cannot be connected. Please provide at least one reachable node: None"

Is there any updates on this issue? Thanks!

@thundergolfer
Copy link

Heads up that the "this line" link in the original code description pointed at main and has thus become stale.

Given that state of the repository at the time the issue was created, the line in question is

self.set_nodes(self.startup_nodes, self.nodes_cache, remove_old=True)

@andr-c
Copy link

andr-c commented May 21, 2024

Seems like adding dynamic_startup_nodes=False in cluster constructor solves the problem, at least for sync client.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants