[Autoscaler] starts extra workers than necessary #36926

Open
wjzhou-ep opened this issue Jun 28, 2023 · 28 comments · Fixed by #40369
Labels
bug, core, core-autoscaler, P2, size-large, stability
Comments

@wjzhou-ep

What happened + What you expected to happen

The autoscaler started extra workers while they were not needed.

From the following log, I believe there may be a race condition when reading:

  • pending tasks
  • usage

Our tasks each use 1 CPU and 30 GB memory; each worker has 120 GB memory.

At the beginning, the autoscaler correctly started 12 workers (48 tasks, 4 tasks per worker).

Then, the moment the workers started, there appears to be a race condition between the usage and the pending-task count.
For example, the usage shows 10.0/176.0 CPU (10 tasks are running), so there should be 38 (48 - 10) tasks pending.

However, the autoscaler thinks there are still 48 tasks pending and starts 3 extra workers for them:

======== Autoscaler status: 2023-06-28 14:34:29.072047 ========
Node status
---------------------------------------------------------------
Healthy:
 11 large-group
 1 head-group
Pending:
 192.168.26.101: large-group, waiting
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 10.0/176.0 CPU
 0.0/1100.0 large-group
 279.40GiB/1.22TiB memory
 0.0/1100.0 no_gpu
 9.52KiB/181.94GiB object_store_memory

Demands:
 {'CPU': 1.0, 'memory': 30000000000.0}: 48+ pending tasks/actors
2023-06-28 14:34:29,073	INFO autoscaler.py:1374 -- StandardAutoscaler: Queue 3 new nodes for launch
2023-06-28 14:34:29,073	INFO node_launcher.py:166 -- BaseNodeLauncher: Got 3 nodes to launch.
2023-06-28 14:34:29,073	INFO node_launcher.py:166 -- BaseNodeLauncher: Launching 3 nodes, type large-group.
2023-06-28 14:34:29,074	INFO node_provider.py:287 -- Autoscaler is submitting the following patch to RayCluster ray in namespace pm-wuy-research.
2023-06-28 14:34:29,074	INFO node_provider.py:290 -- [{'op': 'replace', 'path': '/spec/workerGroupSpecs/0/replicas', 'value': 3}]
2023-06-28 14:34:29,119	INFO autoscaler.py:471 -- The autoscaler took 0.112 seconds to complete the update iteration.
2023-06-28 14:34:29,120	INFO monitor.py:425 -- :event_summary:Resized to 176 CPUs.
2023-06-28 14:34:29,120	INFO monitor.py:425 -- :event_summary:Adding 3 node(s) of type large-group.
2023-06-28 14:34:34,184	INFO node_provider.py:240 -- Listing pods for RayCluster ray in namespace pm-wuy-research at pods resource version >= 42434271.
2023-06-28 14:34:34,210	INFO node_provider.py:258 -- Fetched pod data at resource version 42434271.
2023-06-28 14:34:34,210	INFO autoscaler.py:148 -- The autoscaler took 0.065 seconds to fetch the list of non-terminated nodes.
2023-06-28 14:34:34,211	INFO autoscaler.py:427 -- 
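
For illustration, here is a rough back-of-the-envelope calculation in Python of how double-counting the running tasks could produce exactly the 3 extra launches seen above. It assumes a simplified memory-only bin-packing model with round numbers (11 ready workers of 120 GB each, one node already pending); it is a sketch of the suspected accounting error, not the autoscaler's actual algorithm.

import math

WORKER_MEM_GB = 120      # memory per large-group worker
TASK_MEM_GB = 30         # memory requested per task
ready_workers = 11       # healthy workers in the status above
pending_workers = 1      # the node already launching (192.168.26.101)
running_tasks = 10       # 10.0/176.0 CPU in use
free_mem = ready_workers * WORKER_MEM_GB - running_tasks * TASK_MEM_GB  # 1020 GB

def extra_nodes(pending_tasks: int) -> int:
    # Nodes to launch for the reported pending demand, after subtracting free
    # memory and the capacity of the node that is already pending.
    demand = pending_tasks * TASK_MEM_GB
    deficit = max(0, demand - free_mem - pending_workers * WORKER_MEM_GB)
    return math.ceil(deficit / WORKER_MEM_GB)

print(extra_nodes(48))  # stale demand (running tasks double-counted) -> 3 extra nodes
print(extra_nodes(38))  # reconciled demand (48 - 10 already running) -> 0 extra nodes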

Versions / Dependencies

ray 2.5.1

Reproduction script

normal setup

Issue Severity

Low: It annoys or frustrates me.

@wjzhou-ep wjzhou-ep added the bug and triage labels Jun 28, 2023
@scv119 scv119 added the P1, Ray-2.7, and core labels and removed the triage label Jul 3, 2023
@dcarrion87

dcarrion87 commented Jul 5, 2023

@wjzhou-ep @scv119

We seem to be seeing this behaviour too, maybe in a slightly different form, but essentially the autoscaler is starting more nodes than necessary.

Initial request for 24x:

 {'CPU': 32.0, 'GPU': 1.0}: 24+ pending tasks/actors
2023-07-04 22:44:01,163	INFO autoscaler.py:1366 -- StandardAutoscaler: Queue 24 new nodes for launch
2023-07-04 22:44:01,164	INFO node_launcher.py:166 -- BaseNodeLauncher: Got 24 nodes to launch.
2023-07-04 22:44:01,164	INFO node_launcher.py:166 -- BaseNodeLauncher: Launching 24 nodes, type workergroup.
2023-07-04 22:44:01,164	INFO node_provider.py:286 -- Autoscaler is submitting the following patch to RayCluster coder-xavierholt-ray-1688535753 in namespace annalise-coder-prod.
2023-07-04 22:44:01,164	INFO node_provider.py:290 -- [{'op': 'replace', 'path': '/spec/workerGroupSpecs/0/replicas', 'value': 24}]
2023-07-04 22:44:01,193	INFO autoscaler.py:462 -- The autoscaler took 0.075 seconds to complete the update iteration.
2023-07-04 22:44:01,193	INFO monitor.py:428 -- :event_summary:Adding 24 node(s) of type workergroup.
2023-07-04 22:44:06,283	INFO node_provider.py:257 -- Fetched pod data at resource version 210040046.
2023-07-04 22:44:06,283	INFO autoscaler.py:143 -- The autoscaler took 0.056 seconds to fetch the list of non-terminated nodes.
2023-07-04 22:44:06,284	INFO autoscaler.py:419 -- 

A few loops later it starts queuing unnecessary nodes.

 {'CPU': 32.0, 'GPU': 1.0}: 24+ pending tasks/actors
2023-07-04 22:44:16,577	INFO autoscaler.py:1366 -- StandardAutoscaler: Queue 2 new nodes for launch
2023-07-04 22:44:16,577	INFO node_launcher.py:166 -- BaseNodeLauncher: Got 2 nodes to launch.
2023-07-04 22:44:16,577	INFO node_launcher.py:166 -- BaseNodeLauncher: Launching 2 nodes, type workergroup.
2023-07-04 22:44:16,577	INFO node_provider.py:286 -- Autoscaler is submitting the following patch to RayCluster coder-xavierholt-ray-1688535753 in namespace annalise-coder-prod.
2023-07-04 22:44:16,577	INFO node_provider.py:290 -- [{'op': 'replace', 'path': '/spec/workerGroupSpecs/0/replicas', 'value': 26}]
2023-07-04 22:44:16,608	INFO autoscaler.py:462 -- The autoscaler took 0.097 seconds to complete the update iteration.
2023-07-04 22:44:16,609	INFO monitor.py:428 -- :event_summary:Resized to 64 CPUs, 2 GPUs.
2023-07-04 22:44:16,609	INFO monitor.py:428 -- :event_summary:Adding 2 node(s) of type workergroup.
2023-07-04 22:44:21,724	INFO node_provider.py:257 -- Fetched pod data at resource version 210040402.
2023-07-04 22:44:21,724	INFO autoscaler.py:143 -- The autoscaler took 0.065 seconds to fetch the list of non-terminated nodes.
2023-07-04 22:44:21,725	INFO autoscaler.py:419 -- 

And then some more:

 {'CPU': 32.0, 'GPU': 1.0}: 16+ pending tasks/actors
2023-07-04 22:44:32,167	INFO autoscaler.py:1366 -- StandardAutoscaler: Queue 6 new nodes for launch
2023-07-04 22:44:32,167	INFO node_launcher.py:166 -- BaseNodeLauncher: Got 6 nodes to launch.
2023-07-04 22:44:32,167	INFO node_launcher.py:166 -- BaseNodeLauncher: Launching 6 nodes, type workergroup.
2023-07-04 22:44:32,168	INFO node_provider.py:286 -- Autoscaler is submitting the following patch to RayCluster coder-senorchang-ray-1688535753 in namespace anon-coder-prod.
2023-07-04 22:44:32,168	INFO node_provider.py:290 -- [{'op': 'replace', 'path': '/spec/workerGroupSpecs/0/replicas', 'value': 32}]
2023-07-04 22:44:32,203	INFO autoscaler.py:462 -- The autoscaler took 0.176 seconds to complete the update iteration.
2023-07-04 22:44:32,204	INFO monitor.py:428 -- :event_summary:Resized to 512 CPUs, 16 GPUs.
2023-07-04 22:44:32,204	INFO monitor.py:428 -- :event_summary:Adding 6 node(s) of type workergroup.
2023-07-04 22:44:37,303	INFO node_provider.py:257 -- Fetched pod data at resource version 210040797.
2023-07-04 22:44:37,304	INFO autoscaler.py:143 -- The autoscaler took 0.067 seconds to fetch the list of non-terminated nodes.
2023-07-04 22:44:37,305	INFO autoscaler.py:419 -- 

We don't understand why.

The issue for us is that these extra pods pend forever due to constraints around the pods, and they add to resource quota counts that aren't valid.

Attached logs: autoscaler.txt

@rkooo567 rkooo567 added and removed the core-autoscaler label Jul 17, 2023
@Naton1

Naton1 commented Jul 18, 2023

Seeing the same as well.

No jobs running...

======== Autoscaler status: 2023-07-18 02:18:09.914151 ========
Node status
---------------------------------------------------------------
Healthy:
 4 worker_node
 1 head_node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/24.0 CPU
 0B/29.45GiB memory
 0B/12.47GiB object_store_memory

Demands:
 (no resource demands)
2023-07-18 02:18:09,916	INFO autoscaler.py:470 -- The autoscaler took 0.065 seconds to complete the update iteration.
2023-07-18 02:18:14,984	INFO autoscaler.py:147 -- The autoscaler took 0.053 seconds to fetch the list of non-terminated nodes.
2023-07-18 02:18:14,985	INFO autoscaler.py:427 -- 

Then 5 jobs of 4 CPUs each started with 24 CPUs already available (only 20 CPUs needed), yet 4 more instances were launched unnecessarily:

======== Autoscaler status: 2023-07-18 02:18:14.985085 ========
Node status
---------------------------------------------------------------
Healthy:
 4 worker_node
 1 head_node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 20.0/24.0 CPU
 0B/29.45GiB memory
 0B/12.47GiB object_store_memory

Demands:
 {'CPU': 4.0}: 5+ pending tasks/actors
2023-07-18 02:18:14,987	INFO autoscaler.py:1374 -- StandardAutoscaler: Queue 4 new nodes for launch
2023-07-18 02:18:14,987	INFO autoscaler.py:470 -- The autoscaler took 0.056 seconds to complete the update iteration.
2023-07-18 02:18:14,988	INFO node_launcher.py:166 -- NodeLauncher0: Got 4 nodes to launch.
2023-07-18 02:18:16,976	INFO node_launcher.py:166 -- NodeLauncher0: Launching 4 nodes, type worker_node.
2023-07-18 02:18:20,149	INFO autoscaler.py:147 -- The autoscaler took 0.141 seconds to fetch the list of non-terminated nodes.
2023-07-18 02:18:20,150	INFO autoscaler.py:427 -- 
======== Autoscaler status: 2023-07-18 02:18:20.150147 ========
Node status
---------------------------------------------------------------
Healthy:
 4 worker_node
 1 head_node
Pending:
 172.31.3.147: worker_node, uninitialized
 172.31.14.20: worker_node, uninitialized
 172.31.5.8: worker_node, uninitialized
 172.31.13.158: worker_node, uninitialized
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 20.0/24.0 CPU
 0B/29.45GiB memory
 74.59MiB/12.47GiB object_store_memory

Demands:
 (no resource demands)
2023-07-18 02:18:20,152	INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0231fa18511c6f83f.
2023-07-18 02:18:20,152	INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0970cd9533d7fb8fe.
2023-07-18 02:18:20,153	INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0c5c4df3cac069b0e.
2023-07-18 02:18:20,154	INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0db0afd004f36a21a.
2023-07-18 02:18:20,155	INFO autoscaler.py:470 -- The autoscaler took 0.147 seconds to complete the update iteration.
2023-07-18 02:18:20,156	INFO monitor.py:423 -- :event_summary:Adding 4 node(s) of type worker_node.
2023-07-18 02:18:25,313	INFO autoscaler.py:147 -- The autoscaler took 0.136 seconds to fetch the list of non-terminated nodes.
2023-07-18 02:18:25,314	INFO autoscaler.py:427 -- 

@meastham

meastham commented Jul 20, 2023

I'm also seeing this in a cluster I started on AWS EC2 instances with ray up. There's an additional wrinkle in my case where the "extra" worker is also oversized for the work that's actually in the queue.

The config (I've put the whole thing at the end of this comment) includes multiple node types, two of which are ray.worker.ray-dev-r6a.2xlarge and ray.worker.ray-dev-r6a.4xlarge, which are configured in the autoscaler with 47.8 GiB and 89.6 GiB of memory respectively. I start with just the head node running, whose worker is too small for the workload I'm testing. Then I start a workload with a requirement of 35 GiB of memory. At first a 2xlarge worker node starts as expected, but right as it finishes starting, the autoscaler also starts a 4xlarge node, which is both unnecessary and too large. The job ends up scheduling on the 2xlarge as expected, and the 4xlarge node just ends up shutting down after being idle.

Autoscaler log:

======== Autoscaler status: 2023-07-20 07:29:54.064934 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.ray-dev
Pending:
 10.128.3.31: ray.worker.ray-dev-r6a.2xlarge, setting-up
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/2.0 CPU
 0B/4.36GiB memory
 0B/2.18GiB object_store_memory

Demands:
 {'CPU': 4.0, 'memory': 37580963840.0}: 1+ pending tasks/actors
2023-07-20 07:29:54,066	INFO autoscaler.py:470 -- The autoscaler took 0.12 seconds to complete the update iteration.
2023-07-20 07:29:59,145	INFO autoscaler.py:147 -- The autoscaler took 0.051 seconds to fetch the list of non-terminated nodes.
2023-07-20 07:29:59,146	INFO autoscaler.py:427 --
======== Autoscaler status: 2023-07-20 07:29:59.146404 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.ray-dev
 1 ray.worker.ray-dev-r6a.2xlarge
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 4.0/10.0 CPU
 35.00GiB/49.16GiB memory
 0B/20.60GiB object_store_memory

Demands:
 {'CPU': 4.0, 'memory': 37580963840.0}: 1+ pending tasks/actors
2023-07-20 07:29:59,147	INFO autoscaler.py:1374 -- StandardAutoscaler: Queue 1 new nodes for launch
2023-07-20 07:29:59,147	INFO autoscaler.py:470 -- The autoscaler took 0.053 seconds to complete the update iteration.
2023-07-20 07:29:59,148	INFO node_launcher.py:166 -- NodeLauncher1: Got 1 nodes to launch.
2023-07-20 07:29:59,148	INFO monitor.py:423 -- :event_summary:Resized to 10 CPUs.
2023-07-20 07:30:00,403	INFO node_launcher.py:166 -- NodeLauncher1: Launching 1 nodes, type ray.worker.ray-dev-r6a.4xlarge.
2023-07-20 07:30:04,291	INFO autoscaler.py:147 -- The autoscaler took 0.117 seconds to fetch the list of non-terminated nodes.
2023-07-20 07:30:04,291	INFO autoscaler.py:427 --

Full cluster config:

auth: {ssh_user: ubuntu}
available_node_types:
  ray.head.ray-dev:
    node_config:
      BlockDeviceMappings:
      - DeviceName: /dev/sda1
        Ebs: {VolumeSize: 140, VolumeType: gp3}
      ImageId: ami-0387d929287ab193e
      InstanceType: m5.large
    resources: {}
  ray.worker.ray-dev-r6a.12xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.12xlarge
  ray.worker.ray-dev-r6a.16xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.16xlarge
  ray.worker.ray-dev-r6a.24xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.24xlarge
  ray.worker.ray-dev-r6a.2xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.2xlarge
  ray.worker.ray-dev-r6a.32xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.32xlarge
  ray.worker.ray-dev-r6a.4xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.4xlarge
  ray.worker.ray-dev-r6a.8xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.8xlarge
  ray.worker.ray-dev-r6a.large:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.large
  ray.worker.ray-dev-r6a.xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.xlarge
cluster_name: default
cluster_synced_files: []
docker:
  container_name: ray_container
  image: rayproject/ray:2.5.1-py39-cpu
  pull_before_run: true
  run_options: ['--ulimit nofile=65536:65536']
file_mounts: {}
file_mounts_sync_continuously: false
head_node_type: ray.head.ray-dev
head_setup_commands: []
head_start_ray_commands: [ray stop, ray start --head --port=6379 --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0]
idle_timeout_minutes: 5
initialization_commands: []
max_workers: 5
provider: {availability_zone: 'us-west-2a,us-west-2b,us-west-2c,us-west-2d', cache_stopped_nodes: true,
  region: us-west-2, type: aws}
rsync_exclude: ['**/.git', '**/.git/**']
rsync_filter: [.gitignore]
setup_commands: []
upscaling_speed: 1.0
worker_setup_commands: []
worker_start_ray_commands: [ray stop, 'ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076']

@dcarrion87

dcarrion87 commented Aug 2, 2023

Just wondering if anyone has figured out what's happening here? It's still happening to us. Because we constrain nodes, it just leaves pending pods littered throughout the different clusters.

@wjzhou-ep
Author

As I said in my original post, I believe there is a race condition between the reading of pending tasks and usage:

The usage reflects the running tasks (after a node starts), but the pending-task count is still the old value, so the cluster scales up extra nodes for usage + pending. The running tasks are counted twice, once in usage and once in pending.

@rkooo567 rkooo567 added ray-2.8 and removed Ray-2.7 labels Aug 24, 2023
@dcarrion87

Do we know when this is likely to come in as a fix?

@anyscalesam anyscalesam added the P0 label and removed the P1 label Sep 26, 2023
@anyscalesam
Contributor

cc @rickyyx can you try to repro and see whether this is fixed in the v2 OSS autoscaler since the Ray 2.7 release?

@rickyyx
Contributor

rickyyx commented Sep 26, 2023

2.7 should have patches that mitigate this, but this is essentially the same issue as #38189.

The current plan is to fix it in 2.8.

@anyscalesam
Contributor

@vitsai > chasing the breadcrumbs > which is the most promising GitHub issue/PR that we think will resolve this issue?

@vitsai
Contributor

vitsai commented Nov 8, 2023

#40488

@vitsai
Contributor

vitsai commented Nov 8, 2023

The fix for autoscaler v2 is in 2.8; the linked PR is for autoscaler v1.

@rkooo567
Contributor

I will downgrade the priority since the v1 fix is less prioritized. Does that sound okay?

@anyscalesam anyscalesam added the P1 and ray-2.10 labels and removed the P0 and ray-2.9 labels Nov 13, 2023
@anyscalesam
Contributor

Reviewed with @rkooo567 @rynewang @vitsai > let's decide whether we should fix this in autoscaler v1.

This is fixed in v2, but we have an interim state in Ray 2.9 where the default may still be autoscaler v1, in which case this issue will still be present.

Next steps: let's decide whether we just skip this and push to autoscaler v2 in Ray 2.10, or fix this regression.

@llidev

llidev commented Jan 3, 2024

@vitsai @anyscalesam @rickyyx Could you please share more about how to use autoscaler v2 in Ray 2.9? I didn't find a related doc.

@llidev

llidev commented Jan 3, 2024

Is it just enabling RAY_enable_autoscaler_v2=1?

@rickyyx
Contributor

rickyyx commented Jan 3, 2024

@vitsai @anyscalesam @rickyyx Could you please share more about how to use autoscaler v2 in Ray 2.9? I didn't find a related doc.

Hey @llidev, we are still working on the fix for autoscaler v1; @vitsai has a PR here: #40488. While we do so, we are also working on the v2 autoscaler. That work has been delayed due to other priorities, so it's not available with Ray 2.9 yet.

@llidev

llidev commented Jan 4, 2024

Thank you @rickyyx. Do you think autoscaler v2 is something an end user can try right now, or is it still recommended to wait?

@rickyyx
Contributor

rickyyx commented Jan 4, 2024

Thank you @rickyyx. Do you think autoscaler v2 is something an end user can try right now, or is it still recommended to wait?

It's still under active development, so it's not yet ready.

@rkooo567
Contributor

We will close this after autoscaler v2 is enabled.

@DmitriGekhtman
Contributor

Any progress on this one?

@anyscalesam
Contributor

@DmitriGekhtman - not yet; v2 is still optional and not the default scaler for now.

@DmitriGekhtman
Contributor

DmitriGekhtman commented Sep 6, 2024

Hmm, it looks like we're in a state where autoscaler v1 functionality is gradually degrading while autoscaler v2 development is suspended (the last feature commit was in March).
(Totally understandable; I'm sure the maintainers have a lot on their hands.)
We might proceed cautiously with adopting Ray autoscaling and try to collaborate on stability fixes where possible.

@DmitriGekhtman
Contributor

#36926 (comment)

But essentially the autoscaler is starting more nodes than necessary.
autoscaler.txt

@dcarrion87
It's been a while, but do you still have context on what was going on? Based on the linked autoscaler logs, it looks like there were more than 24 tasks submitted. If that's not the case (i.e., the 24 submitted tasks were counted many times, leading to an explosion of node launches), that would be a catastrophic failure of autoscaling functionality.

@jjyao jjyao added the P2 label and removed the P1 label Oct 30, 2024
@DmitriGekhtman
Contributor

This is consistently reproducible on our infrastructure in the following way:

  • Set up autoscaler configuration with 60-CPU nodes, 0 min_workers, large upscaling speed.
  • Submit 1000 6-CPU tasks. Each task sleeps for 10 minutes (a minimal driver sketch follows the list).
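
A minimal driver along those lines might look like the following sketch (the 6-CPU, 10-minute sleeping task comes from the description above; cluster connection details are omitted and would depend on your setup):

import time
import ray

ray.init()  # connect to the autoscaling cluster

@ray.remote(num_cpus=6)
def sleep_task() -> None:
    # Each task holds 6 CPUs for 10 minutes, so 1000 tasks should pack onto
    # ceil(1000 * 6 / 60) = 100 of the 60-CPU worker nodes.
    time.sleep(600)

# Submit all 1000 tasks at once and wait for them to finish.
ray.get([sleep_task.remote() for _ in range(1000)])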

One would expect exactly 100 nodes to come up. The first time I tried this, I got 150 nodes -- that's quite severe over-provisioning.
Setting a high AUTOSCALER_UPDATE_INTERVAL_S somewhat mitigates the issue, though I still get 5 to 10 extra nodes with the poll period increased to 60 seconds, up from the default of 5.

One way I can get around this, sort of:
In addition to setting a high AUTOSCALER_UPDATE_INTERVAL_S, I modified our internal v1 node provider to reject the autoscaler's requests to scale up if there are any nodes still pending. In that case, I tend to get exactly the desired 100 nodes.
However, this solution is obviously problematic in that, if any given node is slow in coming up, upscaling is stranded.

Given the prioritization history for this issue, it looks like it's unlikely to be resolved for autoscaler v1. On the other hand, it looks like there's been a bit of a pick-up in activity for autoscaler v2, so maybe there's some hope that stable Ray autoscaling will be available in OSS in the not-too-distant future.

@DmitriGekhtman
Contributor

DmitriGekhtman commented Dec 8, 2024

One way I can get around this, sort of:
In addition to setting a high AUTOSCALER_UPDATE_INTERVAL_S, I modified our internal v1 node provider to reject the autoscaler's requests to scale up if there are any nodes still pending. In that case, I tend to get exactly the desired 100 nodes.
However, this solution is obviously problematic in that, if any given node is slow in coming up, upscaling is stranded.

I refined this by rejecting the autoscaler's upscaling attempts only up to a certain number of consecutive times; this way upscaling can eventually proceed even with a stuck node.

Summary: A potential workaround is to modify your node provider to upscale more slowly when nodes are pending.
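
For anyone attempting something similar, here is a rough sketch of the idea. It wraps whatever v1 NodeProvider you already use and drops create_node calls while nodes are still pending, but only a bounded number of consecutive times so a stuck node cannot block upscaling forever. The class name, the threshold, and the delegation via __getattr__ are all made up for illustration; this is not the internal provider described above.

from ray.autoscaler.node_provider import NodeProvider


class ThrottledNodeProvider:
    """Illustrative wrapper around an existing v1 NodeProvider (hypothetical)."""

    MAX_CONSECUTIVE_REJECTIONS = 3  # assumed tuning knob

    def __init__(self, inner: NodeProvider):
        self._inner = inner
        self._consecutive_rejections = 0

    def _has_pending_nodes(self) -> bool:
        # A node the provider knows about that is not yet running counts as pending.
        return any(
            not self._inner.is_running(node_id)
            for node_id in self._inner.non_terminated_nodes({})
        )

    def create_node(self, node_config, tags, count):
        if (
            self._has_pending_nodes()
            and self._consecutive_rejections < self.MAX_CONSECUTIVE_REJECTIONS
        ):
            # Drop this launch request; the autoscaler will ask again next loop.
            self._consecutive_rejections += 1
            return None
        self._consecutive_rejections = 0
        return self._inner.create_node(node_config, tags, count)

    def __getattr__(self, name):
        # Delegate every other NodeProvider method to the wrapped provider.
        return getattr(self._inner, name)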

@DmitriGekhtman
Contributor

DmitriGekhtman commented Dec 8, 2024

Or maybe the right heuristic here is to back off upscaling if any node has recently transitioned from pending to running, to give the stray resource bundles time to be removed from the pending list.

@glindstr

#46588 has minimal reproducible steps for the over-provisioning.
