Mutations are not being triggered properly in clusters consisting of nodes with different numbers of CPUs, if the default :pool_size setting is used #1342
Comments
@hubertlepicki this is a really interesting problem. I think you are 100% right that if you have different-sized pods you'll get different topic counts, and this will just result in missed messages between nodes. Specifying a fixed count is definitely one solution. M:N topic mapping definitely seems like a problem someone has solved before, but nothing jumps immediately to mind.
@benwilson512 since I wrote the above, I have also confirmed this is happening on Gigalixir. I guess they run a similar setup on a Kubernetes cluster. One of my pods on Gigalixir is reporting 32, and the other one 16. Maybe we should add a note to the setup instructions that if you're doing a cloud deployment and you can't control the number of CPUs, you should specify a fixed :pool_size.
@benwilson512 I have provided the documentation changes here: https://github.com/absinthe-graphql/absinthe/pull/1343/files. Please merge them if you feel this is beneficial to document.
@benwilson512 I think this can be closed now, which I'm going to do.
I don't actually know whether this should be considered a bug in Absinthe, or something that needs to be changed in some way; maybe it only needs to be documented, and maybe this issue will suffice for that purpose. I think it would be nice if this did not happen, but I'm not sure how to prevent it.
In short, I have an Elixir Phoenix application deployed on a Kubernetes GKE Autopilot cluster. I noticed that mutations were not always triggering their subscriptions reliably; it depended on which pod in the cluster received the web request handling the mutation.
On some pods, the mutation would execute just fine, and the subscription was triggered and propagated to JavaScript clients 100% reliably. On other pods, the mutation ran and succeeded, but no message was broadcast to clients at all to indicate that a subscription had been triggered.
I identified what was different about the pods that did not work: they had a lower number of CPU cores assigned to them by the Autopilot cluster. The issue would also fix itself once all pods received a fair amount of traffic; the Kubernetes cluster would then scale them all up to the maximum number of CPUs and the issue was gone. This made it quite difficult to debug.
I think the code here decides on a topic name, and one of the parameters taken into consideration is :pool_size:
absinthe/lib/absinthe/subscription.ex
Lines 214 to 225 in 3d0823b
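A simplified sketch of my understanding of what that code does (not a verbatim copy of lines 214 to 225): the mutation result is hashed into one of `pool_size` shards, and the proxy topic is built from the shard number.

```elixir
# Simplified sketch, not the library code verbatim: the publishing side hashes
# the mutation result into one of `pool_size` shards and derives the proxy
# topic from that shard number.
pool_size = 8
mutation_result = %{id: 123}

shard = :erlang.phash2(mutation_result, pool_size)
proxy_topic = "__absinthe__:proxy:#{shard}"
IO.puts(proxy_topic)
# The same result can land on "__absinthe__:proxy:6" when pool_size is 8
# but on "__absinthe__:proxy:2" when pool_size is 4.
```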
Now, the :pool_size defaults to System.schedulers() * 2:
absinthe/lib/absinthe/subscription.ex
Lines 60 to 61 in 3d0823b
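Assuming the BEAM gets one scheduler per CPU core assigned to the pod (the usual default), the arithmetic for my two pod sizes works out as follows:

```elixir
# Assumption: schedulers == CPU cores assigned to the pod (the BEAM default).
# Pods with different CPU counts then end up with different default pool sizes.
for cpus <- [2, 4] do
  IO.puts("#{cpus} CPUs -> default :pool_size #{cpus * 2}")
end
# 2 CPUs -> default :pool_size 4
# 4 CPUs -> default :pool_size 8
```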
My cluster consists of pods that have 2 CPU cores and pods that have 4 CPU cores. The generated topic names for my mutation were "__absinthe__:proxy:6" vs "__absinthe__:proxy:2", depending on which pod the mutation arrived at from the load balancer. I believe the problem is that the number of shards/proxies started on each pod in the cluster depends on the number of CPUs; the code here starts a different number of proxies on each pod, and some topics are simply not being listened on:
absinthe/lib/absinthe/subscription/proxy_supervisor.ex
Lines 15 to 16 in 3d0823b
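As I understand it (again a simplified sketch, not the exact supervisor code), each node starts one proxy per shard, so it only ever listens on topics 0 through pool_size - 1:

```elixir
# Simplified sketch: one proxy per shard means a node only subscribes to
# proxy topics 0..pool_size-1.
for pool_size <- [4, 8] do
  topics = for shard <- 0..(pool_size - 1), do: "__absinthe__:proxy:#{shard}"
  IO.inspect(topics, label: "pool_size #{pool_size} listens on")
end
# A pod running with pool_size 4 never listens on "__absinthe__:proxy:6",
# so a publish from a pool_size-8 pod that hashes to shard 6 is silently lost
# on that node.
```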
The fix was specifying a fixed :pool_size, so instead of relying on the default I passed an explicit value when starting Absinthe.Subscription.
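A minimal sketch of what that looks like in the application supervision tree. MyApp and MyAppWeb.Endpoint are placeholders for your own application and pubsub modules, and the keyword form of the child spec with a :pool_size option is an assumption; check your Absinthe version's documentation for the exact shape.

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      MyAppWeb.Endpoint,
      # Same fixed :pool_size on every node, regardless of the pod's CPU count,
      # so all nodes agree on the set of proxy topics.
      {Absinthe.Subscription, pubsub: MyAppWeb.Endpoint, pool_size: 8}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```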
Again, I don't know what can be done here, but maybe we should think of a mechanism to warn about this (if we can detect it?).