Mutations are not being triggered properly in clusters consisting of nodes with different numbers of CPUs, if the default :pool_size setting is used #1342
Comments
@hubertlepicki this is a really interesting problem. I think you are 100% right that if you have different-sized pods you'll get different topic counts, and this will just result in missed messages between nodes. Specifying a fixed count is definitely one solution. M:N topic mapping definitely seems like a problem someone has solved before, but nothing jumps immediately to mind.
@benwilson512 since I wrote the above, I have also confirmed this is happening on Gigalixir. I guess they run a similar setup on a Kubernetes cluster. One of my pods on Gigalixir is reporting 32, and the other one 16. Maybe we should add a note to the setup instructions that if you're doing a cloud deployment and you can't control the number of CPUs, you should specify a fixed :pool_size.
@benwilson512 I have provided the documentation changes here: https://github.com/absinthe-graphql/absinthe/pull/1343/files. Please merge them if you feel this is beneficial to document.
@benwilson512 I think this can be closed now, which I'm going to do.
I don't actually know whether this should be considered a bug in Absinthe, or something that needs to be changed in some way; maybe it only needs to be documented, and maybe this issue will suffice for that purpose. I think it would be nice if this did not happen, but I'm not sure how to prevent it.
In short, I have an Elixir Phoenix application deployed on a Kubernetes GKE Autopilot cluster. I noticed that mutations were not always triggering their subscriptions reliably; it depended on which pod in the cluster received the web request handling the mutation.
On some pods, the mutation would execute just fine, and the subscription was triggered and propagated to JavaScript clients 100% reliably. On other pods, the mutation ran and succeeded, but no message was broadcast to clients at all to indicate that a subscription had been triggered.
I identified what was different about the pods that did not work: they had a lower number of CPU cores assigned to them by the Autopilot cluster. The issue would also fix itself once all pods received a fair amount of traffic; the Kubernetes cluster would then scale them all up to the maximum number of CPUs and the issue was gone. This made it quite difficult to debug.
I think the code here decides on a topic name, and one of the parameters taken into consideration is :pool_size:
absinthe/lib/absinthe/subscription.ex
Lines 214 to 225 in 3d0823b
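A simplified sketch of my understanding of what that code does (not a verbatim copy of lines 214 to 225): the mutation result is hashed into one of `pool_size` shards, and the proxy topic is built from the shard number.

```elixir
# Simplified sketch, not the library code verbatim: the publishing side hashes
# the mutation result into one of `pool_size` shards and derives the proxy
# topic from that shard number.
pool_size = 8
mutation_result = %{id: 123}

shard = :erlang.phash2(mutation_result, pool_size)
proxy_topic = "__absinthe__:proxy:#{shard}"
IO.puts(proxy_topic)
# The same result can land on "__absinthe__:proxy:6" when pool_size is 8
# but on "__absinthe__:proxy:2" when pool_size is 4.
```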
Now, the :pool_size defaults to System.schedulers() * 2:
absinthe/lib/absinthe/subscription.ex
Lines 60 to 61 in 3d0823b
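Assuming the BEAM gets one scheduler per CPU core assigned to the pod (the usual default), the arithmetic for my two pod sizes works out as follows:

```elixir
# Assumption: schedulers == CPU cores assigned to the pod (the BEAM default).
# Pods with different CPU counts then end up with different default pool sizes.
for cpus <- [2, 4] do
  IO.puts("#{cpus} CPUs -> default :pool_size #{cpus * 2}")
end
# 2 CPUs -> default :pool_size 4
# 4 CPUs -> default :pool_size 8
```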
My cluster consists of pods that have 2 CPU cores and pods that have 4 CPU cores. The generated topic names for my mutation were "__absinthe__:proxy:6" vs "__absinthe__:proxy:2", depending on which pod the mutation arrived at from the load balancer. I believe the problem is that the number of shards/proxies started on each pod in the cluster depends on the number of CPUs; the code here starts a different number of proxies on each pod, and some topics are simply not being listened on:
absinthe/lib/absinthe/subscription/proxy_supervisor.ex
Lines 15 to 16 in 3d0823b
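As I understand it (again a simplified sketch, not the exact supervisor code), each node starts one proxy per shard, so it only ever listens on topics 0 through pool_size - 1:

```elixir
# Simplified sketch: one proxy per shard means a node only subscribes to
# proxy topics 0..pool_size-1.
for pool_size <- [4, 8] do
  topics = for shard <- 0..(pool_size - 1), do: "__absinthe__:proxy:#{shard}"
  IO.inspect(topics, label: "pool_size #{pool_size} listens on")
end
# A pod running with pool_size 4 never listens on "__absinthe__:proxy:6",
# so a publish from a pool_size-8 pod that hashes to shard 6 is silently lost
# on that node.
```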
The fix was specifying a fixed :pool_size, so instead of relying on the default I passed an explicit value when starting Absinthe.Subscription.
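A minimal sketch of what that looks like in the application supervision tree. MyApp and MyAppWeb.Endpoint are placeholders for your own application and pubsub modules, and the keyword form of the child spec with a :pool_size option is an assumption; check your Absinthe version's documentation for the exact shape.

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      MyAppWeb.Endpoint,
      # Same fixed :pool_size on every node, regardless of the pod's CPU count,
      # so all nodes agree on the set of proxy topics.
      {Absinthe.Subscription, pubsub: MyAppWeb.Endpoint, pool_size: 8}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```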
Again, I don't know what can be done here, but maybe we should think of a mechanism to warn about this (if we can detect it?).