Client is raising `Redis::Cluster::NodeMightBeDown` when Redis Cluster is observed to be healthy #368

Comments
I think the setting of redis-cluster-client/test/test_against_cluster_broken.rb Lines 84 to 93 in 628276b
I experienced that our CI was flaky when nodes replied with inconsistent responses. Was there a possibility that something was happening in your cluster bus even though the state was healthy? Or there may be a bug in our client. I'll look into it later. Unfortunately, since I use redis gem v4 at work, I have no experience using the redis-cluster-client gem with a long-running cluster in a production environment.
Thanks for sharing the links. I found the first one during the initial stages of investigation too, but it is not entirely relevant, as we deploy the Redis cluster on VMs with fixed DNS names (no chance of the service/pod IP confusion).
It could be possible. For context, the cluster's
I suspected what could have happened for us:
Note: we lost the metrics and logs for the affected VM, unfortunately, so there is some inference to be done here. In any case, I think the lesson here is to sample more master nodes, while being mindful of the trade-off, since users could see a spike in Redis command traffic during deployments.
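The sampling trade-off described above can be illustrated with a small sketch. This is not the gem's implementation (the majority-vote heuristic here is purely illustrative, and the node names and reply strings are made up): with `REDIS_CLIENT_MAX_STARTUP_SAMPLE=1`, a single node's `CLUSTER NODES` reply becomes the client's entire view of the topology, so one stale responder poisons the map; sampling more nodes dilutes that risk.

```ruby
# Illustrative only: simulates how many nodes' CLUSTER NODES replies
# are consulted when building the client's view of the cluster.
replies = {
  'node-a' => 'stale-view',   # hypothetical node returning outdated topology
  'node-b' => 'healthy-view',
  'node-c' => 'healthy-view'
}

# Pick the most common reply among the first `sample_size` nodes
# (a simple majority heuristic, not what redis-cluster-client does).
def sampled_view(replies, sample_size)
  sampled = replies.values.first(sample_size)
  sampled.group_by(&:itself).max_by { |_, group| group.size }.first
end

puts sampled_view(replies, 1) # the single sampled node's (stale) view wins
puts sampled_view(replies, 3) # the majority (healthy) view wins
```

The cost of the larger sample is exactly the trade-off mentioned: more `CLUSTER NODES` round-trips whenever clients (re)initialize, e.g. during a deployment.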
redis-cluster-client/lib/redis_client/cluster/router.rb Lines 200 to 208 in 1967399
I think this is the wrong place for this retry to happen; the retry actually needs to happen one level up, in redis-cluster-client/lib/redis_client/cluster/router.rb Lines 159 to 162 in 1967399
so that after refreshing the cluster topology with
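The idea of retrying one level up can be sketched roughly as follows. All class and method names here are illustrative stand-ins, not the gem's real API: the point is that after a topology refresh, the slot lookup itself is re-attempted (so a repaired `@slots` map is actually consulted), instead of only the reload being retried.

```ruby
# Minimal sketch, assuming a slot map that a refresh can repair.
class NodeMightBeDown < StandardError; end

class Router
  MAX_ATTEMPTS = 3

  def initialize(slots)
    @slots = slots # slot => node name; stand-in for the real slot map
  end

  # Stand-in for refreshing the topology via CLUSTER NODES.
  def renew_cluster_state!
    @slots[0] ||= 'node-a' # pretend the refresh repaired slot 0
  end

  def find_node(slot)
    @slots[slot] or raise NodeMightBeDown
  end

  # The retry wraps the lookup itself, one level up from the refresh.
  def assign_node(slot)
    attempts = MAX_ATTEMPTS
    begin
      find_node(slot)
    rescue NodeMightBeDown
      attempts -= 1
      raise if attempts <= 0

      renew_cluster_state!
      retry
    end
  end
end

router = Router.new({}) # empty map simulates the stale client state
puts router.assign_node(0) # succeeds after one refresh-and-retry cycle
```

With the retry placed here, a transiently stale map recovers after a refresh; if the refresh never repairs the slot, the retries are exhausted and `NodeMightBeDown` still propagates.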
Thank you for your report. That's definitely right. I think it's a bug. I'll fix it later.
@slai11 I've fixed the behavior related to this issue. One fix is a mitigation for the frequency of queries with the `CLUSTER NODES` command, and the other is an enhancement to the recoverability from a cluster-down state.
@supercaracal thank you for the improvements! I think this issue can be closed for now (I'm not sure what your workflow for that is). It is hard to reproduce the events of the incident separately to validate the fixes, but I'll update if we do encounter it again 👍
Feel free to reopen this issue if it happens again.
Issue
A small but non-trivial percentage of `Redis::Cluster::NodeMightBeDown` errors is seen on some of my Sidekiq jobs. I understand that this error is raised in the `find_node` method after 3 retries, where `@node.reload!` is called on each retry in an attempt to fix the `@topology.clients` and `@slots` hash.

Setup
For context, the affected Redis server is a 3-shard Redis Cluster. Looking at the observability metrics, we are fairly confident that the cluster state was healthy during the incident window (~2 hours). If the cluster state had been unhealthy, the impact would have been much more severe.
We also configure `REDIS_CLIENT_MAX_STARTUP_SAMPLE=1`.

Part of the stack trace:
I'm running on the following gem versions:

Investigation details
We observed an increase in incoming new TCP connections on 1 of the 3 VMs containing a master redis-server process. This would match the 3 `@node.reload!` retries, each of which would open a new connection to call `CLUSTER NODES` and close it thereafter.

I've ruled out server-side network issues, since redis-client would raise a `ConnectionError` when those happen. I verified this while attempting to reproduce the problem locally with a Redis Cluster setup configured with very low `maxclients` and `tcp-backlog` values; I ended up with `RedisClient::Cluster::InitialSetupError` when trying to reload the nodes.

I've been unable to reproduce this behaviour locally (will update when I do). The client suggests that the server is down, but the server seems fine. Could there be a form of server response that could lead to an incomplete client/slot map?
To the maintainers: in your experience developing this client library, is this behaviour (a small percentage of `NodeMightBeDown` with a seemingly healthy cluster) something that could have happened?

Linking issue for reference: https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/3715