
The ASG is not getting scaled when using the same node_class in 2 different Nomad datacenters #938

Open

sandi91 opened this issue Jul 29, 2024 · 1 comment

sandi91 commented Jul 29, 2024

We have a very strange issue and cannot figure out what we are doing wrong.
We have two datacenters in Nomad, aws-eun-1 and aws-euw-1, but the same node_class is set on a group of nodes in both of them.

Now, when we configure an autoscaling policy and filter by both settings, it takes into consideration all the servers matching the node_class. The calculation also does not change when we use only node_class, which shows that the datacenter setting is not taken into account while calculating the node count.

So here is the example configuration we are trying to apply:

scaling "nomad_worker_test_stage_policy" {
  enabled = true
  min     = 1
  max     = 15
  policy {
    cooldown            = "3m"
    evaluation_interval = "1m"

    check "memory_allocated_percentage" {
      source       = "nomad-apm"
      query        = "percentage-allocated_memory"
      query_window = "1m"
      strategy "target-value" {
        target = 80.0
      }
    }

    target "aws-asg-euw" {
      aws_asg_name                  = "test-stage"
      datacenter                    = "aws-euw-1"
      node_class                    = "test-stage"
      dry-run                       = true
      node_drain_deadline           = "5m"
      node_purge                    = "true"
      node_drain_ignore_system_jobs = "false"
    }
  }
}

The result of this would be:

myapp-1  | 2024-07-25T09:01:39.241Z [DEBUG] internal_plugin.nomad-apm: collected node pool resource data: allocated_cpu=45790 allocated_memory=53368 allocatable_cpu=70000 allocatable_memory=101465 
myapp-1  | 2024-07-25T09:01:39.241Z [TRACE] policy_eval.worker.check_handler: metric result: check=memory_allocated_percentage id=6fea318a-2c45-0540-1514-5c996cd179e3 policy_id=4caf85ae-eaea-9718-24ce-9bd26bbe911e queue=cluster source=nomad-apm strategy=target-value target=aws-asg-euw ts="2024-07-25 09:01:39.241342385 +0000 UTC m=+908.747117582" value=52.597447395653674 
myapp-1  | 2024-07-25T09:01:39.241Z [DEBUG] policy_eval.worker.check_handler: calculating new count: check=memory_allocated_percentage id=6fea318a-2c45-0540-1514-5c996cd179e3 policy_id=4caf85ae-eaea-9718-24ce-9bd26bbe911e queue=cluster source=nomad-apm strategy=target-value target=aws-asg-euw count=2 
myapp-1  | 2024-07-25T09:01:39.241Z [TRACE] internal_plugin.target-value: calculated scaling strategy results: check_name=memory_allocated_percentage current_count=2 new_count=2 metric_value=52.597447395653674 metric_time="2024-07-25 09:01:39.241342385 +0000 UTC m=+908.747117582" factor=0.657468092445671 direction=down max_scale_up="+Inf" max_scale_down=-Inf

If we delete the datacenter setting, the result is still "the same", as the allocatable_cpu and allocatable_memory values show:

myapp-1  | 2024-07-25T09:27:15.989Z [DEBUG] internal_plugin.nomad-apm: collected node pool resource data: allocated_cpu=45790 allocated_memory=53368 allocatable_cpu=70000 allocatable_memory=101465 
myapp-1  | 2024-07-25T09:27:15.989Z [TRACE] policy_eval.worker.check_handler: metric result: check=memory_allocated_percentage id=b7fc6c61-fce4-df96-f013-afc46a63da83 policy_id=9d72db2d-5383-363f-a613-004df4866ab9 queue=cluster source=nomad-apm strategy=target-value target=aws-asg-euw ts="2024-07-25 09:27:15.98928497 +0000 UTC m=+65.658597573" value=52.597447395653674 
myapp-1  | 2024-07-25T09:27:15.989Z [DEBUG] policy_eval.worker.check_handler: calculating new count: check=memory_allocated_percentage id=b7fc6c61-fce4-df96-f013-afc46a63da83 policy_id=9d72db2d-5383-363f-a613-004df4866ab9 queue=cluster source=nomad-apm strategy=target-value target=aws-asg-euw count=2 
myapp-1  | 2024-07-25T09:27:15.989Z [TRACE] internal_plugin.target-value: calculated scaling strategy results: check_name=memory_allocated_percentage current_count=2 new_count=2 metric_value=52.597447395653674 metric_time="2024-07-25 09:27:15.98928497 +0000 UTC m=+65.658597573" factor=0.657468092445671 direction=down max_scale_up="+Inf" max_scale_down=-Inf

But if we change the node_class to something unique between the two datacenters, it works properly, as is also visible in the allocatable_cpu and allocatable_memory values:

myapp-1  | 2024-07-25T09:21:55.781Z [DEBUG] internal_plugin.nomad-apm: collected node pool resource data: allocated_cpu=500 allocated_memory=2176 allocatable_cpu=20000 allocatable_memory=28990 
myapp-1  | 2024-07-25T09:21:55.781Z [TRACE] policy_eval.worker.check_handler: metric result: check=memory_allocated_percentage id=b0656bf6-fa07-f97f-bd5d-145cc79e4573 policy_id=15eadd6f-7491-d0d4-9353-2a7a6ad8ff06 queue=cluster source=nomad-apm strategy=target-value target=aws-asg-euw ts="2024-07-25 09:21:55.781273586 +0000 UTC m=+64.251790905" value=7.506036564332528 
myapp-1  | 2024-07-25T09:21:55.781Z [DEBUG] policy_eval.worker.check_handler: calculating new count: check=memory_allocated_percentage id=b0656bf6-fa07-f97f-bd5d-145cc79e4573 policy_id=15eadd6f-7491-d0d4-9353-2a7a6ad8ff06 queue=cluster source=nomad-apm strategy=target-value target=aws-asg-euw count=2 
myapp-1  | 2024-07-25T09:21:55.781Z [TRACE] internal_plugin.target-value: calculated scaling strategy results: check_name=memory_allocated_percentage current_count=2 new_count=1 metric_value=7.506036564332528 metric_time="2024-07-25 09:21:55.781273586 +0000 UTC m=+64.251790905" factor=0.0938254570541566 direction=down max_scale_up="+Inf" max_scale_down=-Inf

Is this a limitation of nomad-apm, meaning we have to use a different APM with more advanced queries that can filter by datacenter (for example, something like the sketch below), or is this something we are doing wrong?
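
What we have in mind is roughly the following: a check that queries Prometheus directly instead of nomad-apm and filters on datacenter and node_class labels. This is only a sketch and makes assumptions: that the prometheus APM plugin is enabled in the autoscaler agent, and that Nomad client telemetry is scraped into Prometheus as nomad_client_allocated_memory / nomad_client_unallocated_memory with datacenter and node_class labels; names may differ in a real setup.

check "memory_allocated_percentage" {
  # Sketch only: uses the prometheus APM plugin instead of nomad-apm.
  # Metric names and labels below are assumptions based on Nomad's client
  # telemetry and may need adjusting to the actual scrape configuration.
  source       = "prometheus"
  query_window = "1m"
  query        = <<-EOQ
    100 * sum(nomad_client_allocated_memory{datacenter="aws-euw-1", node_class="test-stage"})
      / (sum(nomad_client_allocated_memory{datacenter="aws-euw-1", node_class="test-stage"})
       + sum(nomad_client_unallocated_memory{datacenter="aws-euw-1", node_class="test-stage"}))
  EOQ

  strategy "target-value" {
    target = 80.0
  }
}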

Thank you in advance for any answer.

jrasell (Member) commented Sep 18, 2024

Hi @sandi91 and thanks for raising this issue.

Could you clarify your setup a little please? Specifically "We have 2 datacenters in nomad: aws-eun-1 and aws-euw-1 but we have the same node_class set for a set of nodes." When you mention two datacenters, are these Nomad datacenters that are within the same region?

It looks like this is currently a limitation of the Nomad APM, which can only discover and calculate resource utilisation based on "node_class", as that is the only setup option it has internally. This is something we should certainly try to enhance, so that we can support the use case you have described and bring this closer to how target node identification works. I'll therefore mark this for roadmapping.

For future readers, we do have a warning within our docs noting this, although it is slightly ambiguous and clearly at odds with other configuration options.
