[Autoscaler] starts extra workers than necessary #36926

Open
wjzhou-ep opened this issue Jun 28, 2023 · 28 comments · Fixed by #40369
Labels
bug, core, core-autoscaler, P2, size-large, stability
Comments

@wjzhou-ep

What happened + What you expected to happen

The autoscaler started extra workers while they were not needed.

From the following log, I believe there may be a race condition when reading:

  • pending tasks
  • usage

Our tasks each use 1 CPU and 30 GB memory; each worker has 120 GB memory.

At the beginning, the autoscaler correctly started 12 workers (48 tasks, 4 tasks per worker).

Then, the moment the workers started, there appears to be a race condition between the usage and the pending-task count.
For example, the usage shows 10.0/176.0 CPU (10 tasks are running), so there should be 38 (48 - 10) tasks pending.

However, the autoscaler thinks there are still 48 tasks pending and starts 3 extra workers for them:

======== Autoscaler status: 2023-06-28 14:34:29.072047 ========
Node status
---------------------------------------------------------------
Healthy:
 11 large-group
 1 head-group
Pending:
 192.168.26.101: large-group, waiting
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 10.0/176.0 CPU
 0.0/1100.0 large-group
 279.40GiB/1.22TiB memory
 0.0/1100.0 no_gpu
 9.52KiB/181.94GiB object_store_memory

Demands:
 {'CPU': 1.0, 'memory': 30000000000.0}: 48+ pending tasks/actors
2023-06-28 14:34:29,073	INFO autoscaler.py:1374 -- StandardAutoscaler: Queue 3 new nodes for launch
2023-06-28 14:34:29,073	INFO node_launcher.py:166 -- BaseNodeLauncher: Got 3 nodes to launch.
2023-06-28 14:34:29,073	INFO node_launcher.py:166 -- BaseNodeLauncher: Launching 3 nodes, type large-group.
2023-06-28 14:34:29,074	INFO node_provider.py:287 -- Autoscaler is submitting the following patch to RayCluster ray in namespace pm-wuy-research.
2023-06-28 14:34:29,074	INFO node_provider.py:290 -- [{'op': 'replace', 'path': '/spec/workerGroupSpecs/0/replicas', 'value': 3}]
2023-06-28 14:34:29,119	INFO autoscaler.py:471 -- The autoscaler took 0.112 seconds to complete the update iteration.
2023-06-28 14:34:29,120	INFO monitor.py:425 -- :event_summary:Resized to 176 CPUs.
2023-06-28 14:34:29,120	INFO monitor.py:425 -- :event_summary:Adding 3 node(s) of type large-group.
2023-06-28 14:34:34,184	INFO node_provider.py:240 -- Listing pods for RayCluster ray in namespace pm-wuy-research at pods resource version >= 42434271.
2023-06-28 14:34:34,210	INFO node_provider.py:258 -- Fetched pod data at resource version 42434271.
2023-06-28 14:34:34,210	INFO autoscaler.py:148 -- The autoscaler took 0.065 seconds to fetch the list of non-terminated nodes.
2023-06-28 14:34:34,211	INFO autoscaler.py:427 -- 
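
For illustration, here is a rough back-of-the-envelope calculation in Python of how double-counting the running tasks could produce exactly the 3 extra launches seen above. It assumes a simplified memory-only bin-packing model with round numbers (11 ready workers of 120 GB each, one node already pending); it is a sketch of the suspected accounting error, not the autoscaler's actual algorithm.

import math

WORKER_MEM_GB = 120      # memory per large-group worker
TASK_MEM_GB = 30         # memory requested per task
ready_workers = 11       # healthy workers in the status above
pending_workers = 1      # the node already launching (192.168.26.101)
running_tasks = 10       # 10.0/176.0 CPU in use
free_mem = ready_workers * WORKER_MEM_GB - running_tasks * TASK_MEM_GB  # 1020 GB

def extra_nodes(pending_tasks: int) -> int:
    # Nodes to launch for the reported pending demand, after subtracting free
    # memory and the capacity of the node that is already pending.
    demand = pending_tasks * TASK_MEM_GB
    deficit = max(0, demand - free_mem - pending_workers * WORKER_MEM_GB)
    return math.ceil(deficit / WORKER_MEM_GB)

print(extra_nodes(48))  # stale demand (running tasks double-counted) -> 3 extra nodes
print(extra_nodes(38))  # reconciled demand (48 - 10 already running) -> 0 extra nodes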

Versions / Dependencies

ray 2.5.1

Reproduction script

normal setup

Issue Severity

Low: It annoys or frustrates me.

@wjzhou-ep wjzhou-ep added the bug and triage labels Jun 28, 2023
@scv119 scv119 added the P1, Ray-2.7, and core labels and removed the triage label Jul 3, 2023
@dcarrion87

dcarrion87 commented Jul 5, 2023

@wjzhou-ep @scv119

We seem to be seeing this behaviour too, maybe in a slightly different form, but essentially the autoscaler is starting more nodes than necessary.

Initial request for 24x:

 {'CPU': 32.0, 'GPU': 1.0}: 24+ pending tasks/actors
2023-07-04 22:44:01,163	INFO autoscaler.py:1366 -- StandardAutoscaler: Queue 24 new nodes for launch
2023-07-04 22:44:01,164	INFO node_launcher.py:166 -- BaseNodeLauncher: Got 24 nodes to launch.
2023-07-04 22:44:01,164	INFO node_launcher.py:166 -- BaseNodeLauncher: Launching 24 nodes, type workergroup.
2023-07-04 22:44:01,164	INFO node_provider.py:286 -- Autoscaler is submitting the following patch to RayCluster coder-xavierholt-ray-1688535753 in namespace annalise-coder-prod.
2023-07-04 22:44:01,164	INFO node_provider.py:290 -- [{'op': 'replace', 'path': '/spec/workerGroupSpecs/0/replicas', 'value': 24}]
2023-07-04 22:44:01,193	INFO autoscaler.py:462 -- The autoscaler took 0.075 seconds to complete the update iteration.
2023-07-04 22:44:01,193	INFO monitor.py:428 -- :event_summary:Adding 24 node(s) of type workergroup.
2023-07-04 22:44:06,283	INFO node_provider.py:257 -- Fetched pod data at resource version 210040046.
2023-07-04 22:44:06,283	INFO autoscaler.py:143 -- The autoscaler took 0.056 seconds to fetch the list of non-terminated nodes.
2023-07-04 22:44:06,284	INFO autoscaler.py:419 -- 

A few loops later it starts queuing unnecessary nodes.

 {'CPU': 32.0, 'GPU': 1.0}: 24+ pending tasks/actors
2023-07-04 22:44:16,577	INFO autoscaler.py:1366 -- StandardAutoscaler: Queue 2 new nodes for launch
2023-07-04 22:44:16,577	INFO node_launcher.py:166 -- BaseNodeLauncher: Got 2 nodes to launch.
2023-07-04 22:44:16,577	INFO node_launcher.py:166 -- BaseNodeLauncher: Launching 2 nodes, type workergroup.
2023-07-04 22:44:16,577	INFO node_provider.py:286 -- Autoscaler is submitting the following patch to RayCluster coder-xavierholt-ray-1688535753 in namespace annalise-coder-prod.
2023-07-04 22:44:16,577	INFO node_provider.py:290 -- [{'op': 'replace', 'path': '/spec/workerGroupSpecs/0/replicas', 'value': 26}]
2023-07-04 22:44:16,608	INFO autoscaler.py:462 -- The autoscaler took 0.097 seconds to complete the update iteration.
2023-07-04 22:44:16,609	INFO monitor.py:428 -- :event_summary:Resized to 64 CPUs, 2 GPUs.
2023-07-04 22:44:16,609	INFO monitor.py:428 -- :event_summary:Adding 2 node(s) of type workergroup.
2023-07-04 22:44:21,724	INFO node_provider.py:257 -- Fetched pod data at resource version 210040402.
2023-07-04 22:44:21,724	INFO autoscaler.py:143 -- The autoscaler took 0.065 seconds to fetch the list of non-terminated nodes.
2023-07-04 22:44:21,725	INFO autoscaler.py:419 -- 

And then some more:

 {'CPU': 32.0, 'GPU': 1.0}: 16+ pending tasks/actors
2023-07-04 22:44:32,167	INFO autoscaler.py:1366 -- StandardAutoscaler: Queue 6 new nodes for launch
2023-07-04 22:44:32,167	INFO node_launcher.py:166 -- BaseNodeLauncher: Got 6 nodes to launch.
2023-07-04 22:44:32,167	INFO node_launcher.py:166 -- BaseNodeLauncher: Launching 6 nodes, type workergroup.
2023-07-04 22:44:32,168	INFO node_provider.py:286 -- Autoscaler is submitting the following patch to RayCluster coder-senorchang-ray-1688535753 in namespace anon-coder-prod.
2023-07-04 22:44:32,168	INFO node_provider.py:290 -- [{'op': 'replace', 'path': '/spec/workerGroupSpecs/0/replicas', 'value': 32}]
2023-07-04 22:44:32,203	INFO autoscaler.py:462 -- The autoscaler took 0.176 seconds to complete the update iteration.
2023-07-04 22:44:32,204	INFO monitor.py:428 -- :event_summary:Resized to 512 CPUs, 16 GPUs.
2023-07-04 22:44:32,204	INFO monitor.py:428 -- :event_summary:Adding 6 node(s) of type workergroup.
2023-07-04 22:44:37,303	INFO node_provider.py:257 -- Fetched pod data at resource version 210040797.
2023-07-04 22:44:37,304	INFO autoscaler.py:143 -- The autoscaler took 0.067 seconds to fetch the list of non-terminated nodes.
2023-07-04 22:44:37,305	INFO autoscaler.py:419 -- 

We don't understand why.

The issue for us is that these extra pods pend forever due to constraints around the pods, and they add to resource quota counts that aren't valid.

Attached logs: autoscaler.txt

@rkooo567 rkooo567 added and removed the core-autoscaler label Jul 17, 2023
@Naton1

Naton1 commented Jul 18, 2023

Seeing the same as well.

No jobs running...

======== Autoscaler status: 2023-07-18 02:18:09.914151 ========
Node status
---------------------------------------------------------------
Healthy:
 4 worker_node
 1 head_node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/24.0 CPU
 0B/29.45GiB memory
 0B/12.47GiB object_store_memory

Demands:
 (no resource demands)
2023-07-18 02:18:09,916	INFO autoscaler.py:470 -- The autoscaler took 0.065 seconds to complete the update iteration.
2023-07-18 02:18:14,984	INFO autoscaler.py:147 -- The autoscaler took 0.053 seconds to fetch the list of non-terminated nodes.
2023-07-18 02:18:14,985	INFO autoscaler.py:427 -- 

Then 5 jobs of 4 CPUs each started with 24 CPUs already available (only 20 CPUs needed), yet 4 more instances were launched unnecessarily:

======== Autoscaler status: 2023-07-18 02:18:14.985085 ========
Node status
---------------------------------------------------------------
Healthy:
 4 worker_node
 1 head_node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 20.0/24.0 CPU
 0B/29.45GiB memory
 0B/12.47GiB object_store_memory

Demands:
 {'CPU': 4.0}: 5+ pending tasks/actors
2023-07-18 02:18:14,987	INFO autoscaler.py:1374 -- StandardAutoscaler: Queue 4 new nodes for launch
2023-07-18 02:18:14,987	INFO autoscaler.py:470 -- The autoscaler took 0.056 seconds to complete the update iteration.
2023-07-18 02:18:14,988	INFO node_launcher.py:166 -- NodeLauncher0: Got 4 nodes to launch.
2023-07-18 02:18:16,976	INFO node_launcher.py:166 -- NodeLauncher0: Launching 4 nodes, type worker_node.
2023-07-18 02:18:20,149	INFO autoscaler.py:147 -- The autoscaler took 0.141 seconds to fetch the list of non-terminated nodes.
2023-07-18 02:18:20,150	INFO autoscaler.py:427 -- 
======== Autoscaler status: 2023-07-18 02:18:20.150147 ========
Node status
---------------------------------------------------------------
Healthy:
 4 worker_node
 1 head_node
Pending:
 172.31.3.147: worker_node, uninitialized
 172.31.14.20: worker_node, uninitialized
 172.31.5.8: worker_node, uninitialized
 172.31.13.158: worker_node, uninitialized
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 20.0/24.0 CPU
 0B/29.45GiB memory
 74.59MiB/12.47GiB object_store_memory

Demands:
 (no resource demands)
2023-07-18 02:18:20,152	INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0231fa18511c6f83f.
2023-07-18 02:18:20,152	INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0970cd9533d7fb8fe.
2023-07-18 02:18:20,153	INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0c5c4df3cac069b0e.
2023-07-18 02:18:20,154	INFO autoscaler.py:1322 -- Creating new (spawn_updater) updater thread for node i-0db0afd004f36a21a.
2023-07-18 02:18:20,155	INFO autoscaler.py:470 -- The autoscaler took 0.147 seconds to complete the update iteration.
2023-07-18 02:18:20,156	INFO monitor.py:423 -- :event_summary:Adding 4 node(s) of type worker_node.
2023-07-18 02:18:25,313	INFO autoscaler.py:147 -- The autoscaler took 0.136 seconds to fetch the list of non-terminated nodes.
2023-07-18 02:18:25,314	INFO autoscaler.py:427 -- 

@meastham

meastham commented Jul 20, 2023

I'm also seeing this in a cluster I started on AWS EC2 instances with ray up. There's an additional wrinkle in my case where the "extra" worker is also oversized for the work that's actually in the queue.

The config (I've put the whole thing at the end of this comment) includes multiple node types, two of which are ray.worker.ray-dev-r6a.2xlarge and ray.worker.ray-dev-r6a.4xlarge, which are configured in the autoscaler with 47.8 GiB and 89.6 GiB of memory respectively. I start with just the head node running, whose worker is too small for the workload I'm testing. Then I start a workload with a requirement of 35 GiB of memory. At first a 2xlarge worker node starts as expected, but right as it finishes starting, the autoscaler also starts a 4xlarge node, which is both unnecessary and too large. The job ends up scheduling on the 2xlarge as expected, and the 4xlarge node just ends up shutting down after being idle.

Autoscaler log:

======== Autoscaler status: 2023-07-20 07:29:54.064934 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.ray-dev
Pending:
 10.128.3.31: ray.worker.ray-dev-r6a.2xlarge, setting-up
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/2.0 CPU
 0B/4.36GiB memory
 0B/2.18GiB object_store_memory

Demands:
 {'CPU': 4.0, 'memory': 37580963840.0}: 1+ pending tasks/actors
2023-07-20 07:29:54,066	INFO autoscaler.py:470 -- The autoscaler took 0.12 seconds to complete the update iteration.
2023-07-20 07:29:59,145	INFO autoscaler.py:147 -- The autoscaler took 0.051 seconds to fetch the list of non-terminated nodes.
2023-07-20 07:29:59,146	INFO autoscaler.py:427 --
======== Autoscaler status: 2023-07-20 07:29:59.146404 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.ray-dev
 1 ray.worker.ray-dev-r6a.2xlarge
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 4.0/10.0 CPU
 35.00GiB/49.16GiB memory
 0B/20.60GiB object_store_memory

Demands:
 {'CPU': 4.0, 'memory': 37580963840.0}: 1+ pending tasks/actors
2023-07-20 07:29:59,147	INFO autoscaler.py:1374 -- StandardAutoscaler: Queue 1 new nodes for launch
2023-07-20 07:29:59,147	INFO autoscaler.py:470 -- The autoscaler took 0.053 seconds to complete the update iteration.
2023-07-20 07:29:59,148	INFO node_launcher.py:166 -- NodeLauncher1: Got 1 nodes to launch.
2023-07-20 07:29:59,148	INFO monitor.py:423 -- :event_summary:Resized to 10 CPUs.
2023-07-20 07:30:00,403	INFO node_launcher.py:166 -- NodeLauncher1: Launching 1 nodes, type ray.worker.ray-dev-r6a.4xlarge.
2023-07-20 07:30:04,291	INFO autoscaler.py:147 -- The autoscaler took 0.117 seconds to fetch the list of non-terminated nodes.
2023-07-20 07:30:04,291	INFO autoscaler.py:427 --

Full cluster config:

auth: {ssh_user: ubuntu}
available_node_types:
  ray.head.ray-dev:
    node_config:
      BlockDeviceMappings:
      - DeviceName: /dev/sda1
        Ebs: {VolumeSize: 140, VolumeType: gp3}
      ImageId: ami-0387d929287ab193e
      InstanceType: m5.large
    resources: {}
  ray.worker.ray-dev-r6a.12xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.12xlarge
  ray.worker.ray-dev-r6a.16xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.16xlarge
  ray.worker.ray-dev-r6a.24xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.24xlarge
  ray.worker.ray-dev-r6a.2xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.2xlarge
  ray.worker.ray-dev-r6a.32xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.32xlarge
  ray.worker.ray-dev-r6a.4xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.4xlarge
  ray.worker.ray-dev-r6a.8xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.8xlarge
  ray.worker.ray-dev-r6a.large:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.large
  ray.worker.ray-dev-r6a.xlarge:
    max_workers: 5
    min_workers: 0
    node_config:
      IamInstanceProfile: {Arn: 'arn:aws:iam::997204316583:instance-profile/ray-autoscaler-v1'}
      ImageId: ami-0387d929287ab193e
      InstanceMarketOptions: {MarketType: spot}
      InstanceType: r6a.xlarge
cluster_name: default
cluster_synced_files: []
docker:
  container_name: ray_container
  image: rayproject/ray:2.5.1-py39-cpu
  pull_before_run: true
  run_options: ['--ulimit nofile=65536:65536']
file_mounts: {}
file_mounts_sync_continuously: false
head_node_type: ray.head.ray-dev
head_setup_commands: []
head_start_ray_commands: [ray stop, ray start --head --port=6379 --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0]
idle_timeout_minutes: 5
initialization_commands: []
max_workers: 5
provider: {availability_zone: 'us-west-2a,us-west-2b,us-west-2c,us-west-2d', cache_stopped_nodes: true,
  region: us-west-2, type: aws}
rsync_exclude: ['**/.git', '**/.git/**']
rsync_filter: [.gitignore]
setup_commands: []
upscaling_speed: 1.0
worker_setup_commands: []
worker_start_ray_commands: [ray stop, 'ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076']

@dcarrion87

dcarrion87 commented Aug 2, 2023

Just wondering if anyone has figured out what's happening here? It's still happening to us. Because we constrain nodes, it just leaves pending pods littered throughout the different clusters.

@wjzhou-ep
Author

As I said in my original post, I believe there is a race condition between the reading of pending tasks and usage:

The usage reflects the running tasks (after a node starts), but the pending-task count is still the old value, so the cluster scales up extra nodes for usage + pending. The running tasks are counted twice, once in usage and once in pending.

@rkooo567 rkooo567 added ray-2.8 and removed Ray-2.7 labels Aug 24, 2023
@dcarrion87

Do we know when this is likely to come in as a fix?

@anyscalesam anyscalesam added the P0 label and removed the P1 label Sep 26, 2023
@anyscalesam
Contributor

cc @rickyyx can you try to repro and see whether this is fixed in the v2 OSS autoscaler since the Ray 2.7 release?

@rickyyx
Contributor

rickyyx commented Sep 26, 2023

2.7 should have patches that mitigate this, but this is essentially the same issue as #38189.

The current plan is to fix it in 2.8.

@anyscalesam
Contributor

@vitsai > chasing the breadcrumbs > which is the most promising GitHub issue/PR that we think will resolve this issue?

@vitsai
Contributor

vitsai commented Nov 8, 2023

#40488

@vitsai
Contributor

vitsai commented Nov 8, 2023

The fix for autoscaler v2 is in 2.8; the linked PR is for autoscaler v1.

@rkooo567
Contributor

I will downgrade the priority since the v1 fix is less prioritized. Does that sound okay?

@anyscalesam anyscalesam added the P1 and ray-2.10 labels and removed the P0 and ray-2.9 labels Nov 13, 2023
@anyscalesam
Contributor

Reviewed with @rkooo567 @rynewang @vitsai > let's decide whether we should fix this in autoscaler v1.

This is fixed in v2, but we have an interim state in Ray 2.9 where the default may still be autoscaler v1, in which case this issue will still be present.

Next steps: let's decide whether we just skip this and push to autoscaler v2 in Ray 2.10, or fix this regression.

@llidev

llidev commented Jan 3, 2024

@vitsai @anyscalesam @rickyyx Could you please share more about how to use autoscaler v2 in Ray 2.9? I didn't find a related doc.

@llidev

llidev commented Jan 3, 2024

Is it just enabling RAY_enable_autoscaler_v2=1?

@rickyyx
Contributor

rickyyx commented Jan 3, 2024

@vitsai @anyscalesam @rickyyx Could you please share more about how to use autoscaler v2 in Ray 2.9? I didn't find a related doc.

Hey @llidev, we are still working on the fix for autoscaler v1; @vitsai has a PR here: #40488. While we do so, we are also working on the v2 autoscaler. That work has been delayed due to other priorities, so it's not available with Ray 2.9 yet.

@llidev

llidev commented Jan 4, 2024

Thank you @rickyyx. Do you think autoscaler v2 is something an end user can try right now, or is it still recommended to wait?

@rickyyx
Contributor

rickyyx commented Jan 4, 2024

Thank you @rickyyx. Do you think autoscaler v2 is something an end user can try right now, or is it still recommended to wait?

It's still under active development, so it's not yet ready.

@rkooo567
Contributor

We will close this after autoscaler v2 is enabled.

@DmitriGekhtman
Contributor

Any progress on this one?

@anyscalesam
Contributor

@DmitriGekhtman - not yet; v2 is still optional and not the default scaler for now.

@DmitriGekhtman
Contributor

DmitriGekhtman commented Sep 6, 2024

Hmm, it looks like we're in a state where autoscaler v1 functionality is gradually degrading while autoscaler v2 development is suspended (the last feature commit was in March).
(Totally understandable; I'm sure the maintainers have a lot on their hands.)
We might proceed cautiously with adopting Ray autoscaling and try to collaborate on stability fixes where possible.

@DmitriGekhtman
Contributor

#36926 (comment)

But essentially the autoscaler is starting more nodes than necessary.
autoscaler.txt

@dcarrion87
It's been a while, but do you still have context on what was going on? Based on the linked autoscaler logs, it looks like there were more than 24 tasks submitted. If that's not the case (i.e., the 24 submitted tasks were counted many times, leading to an explosion of node launches), that would be a catastrophic failure of autoscaling functionality.

@jjyao jjyao added the P2 label and removed the P1 label Oct 30, 2024
@DmitriGekhtman
Contributor

This is consistently reproducible on our infrastructure in the following way:

  • Set up autoscaler configuration with 60-CPU nodes, 0 min_workers, large upscaling speed.
  • Submit 1000 6-CPU tasks. Each task sleeps for 10 minutes (a minimal driver sketch follows the list).
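
A minimal driver along those lines might look like the following sketch (the 6-CPU, 10-minute sleeping task comes from the description above; cluster connection details are omitted and would depend on your setup):

import time
import ray

ray.init()  # connect to the autoscaling cluster

@ray.remote(num_cpus=6)
def sleep_task() -> None:
    # Each task holds 6 CPUs for 10 minutes, so 1000 tasks should pack onto
    # ceil(1000 * 6 / 60) = 100 of the 60-CPU worker nodes.
    time.sleep(600)

# Submit all 1000 tasks at once and wait for them to finish.
ray.get([sleep_task.remote() for _ in range(1000)])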

One would expect exactly 100 nodes to come up. The first time I tried this, I got 150 nodes -- that's quite severe over-provisioning.
Setting a high AUTOSCALER_UPDATE_INTERVAL_S somewhat mitigates the issue, though I still get 5 to 10 extra nodes with the poll period increased to 60 seconds, up from the default of 5.

One way I can get around this, sort of:
In addition to setting a high AUTOSCALER_UPDATE_INTERVAL_S, I modified our internal v1 node provider to reject the autoscaler's requests to scale up if there are any nodes still pending. In that case, I tend to get exactly the desired 100 nodes.
However, this solution is obviously problematic in that, if any given node is slow in coming up, upscaling is stranded.

Given the prioritization history for this issue, it looks like it's unlikely to be resolved for autoscaler v1. On the other hand, it looks like there's been a bit of a pick-up in activity for autoscaler v2, so maybe there's some hope that stable Ray autoscaling will be available in OSS in the not-too-distant future.

@DmitriGekhtman
Contributor

DmitriGekhtman commented Dec 8, 2024

One way I can get around this, sort of:
In addition to setting a high AUTOSCALER_UPDATE_INTERVAL_S, I modified our internal v1 node provider to reject the autoscaler's requests to scale up if there are any nodes still pending. In that case, I tend to get exactly the desired 100 nodes.
However, this solution is obviously problematic in that, if any given node is slow in coming up, upscaling is stranded.

I refined this by rejecting the autoscaler's upscaling attempts only up to a certain number of consecutive times; this way upscaling can eventually proceed even with a stuck node.

Summary: A potential workaround is to modify your node provider to upscale more slowly when nodes are pending.
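
For anyone attempting something similar, here is a rough sketch of the idea. It wraps whatever v1 NodeProvider you already use and drops create_node calls while nodes are still pending, but only a bounded number of consecutive times so a stuck node cannot block upscaling forever. The class name, the threshold, and the delegation via __getattr__ are all made up for illustration; this is not the internal provider described above.

from ray.autoscaler.node_provider import NodeProvider


class ThrottledNodeProvider:
    """Illustrative wrapper around an existing v1 NodeProvider (hypothetical)."""

    MAX_CONSECUTIVE_REJECTIONS = 3  # assumed tuning knob

    def __init__(self, inner: NodeProvider):
        self._inner = inner
        self._consecutive_rejections = 0

    def _has_pending_nodes(self) -> bool:
        # A node the provider knows about that is not yet running counts as pending.
        return any(
            not self._inner.is_running(node_id)
            for node_id in self._inner.non_terminated_nodes({})
        )

    def create_node(self, node_config, tags, count):
        if (
            self._has_pending_nodes()
            and self._consecutive_rejections < self.MAX_CONSECUTIVE_REJECTIONS
        ):
            # Drop this launch request; the autoscaler will ask again next loop.
            self._consecutive_rejections += 1
            return None
        self._consecutive_rejections = 0
        return self._inner.create_node(node_config, tags, count)

    def __getattr__(self, name):
        # Delegate every other NodeProvider method to the wrapped provider.
        return getattr(self._inner, name)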

@DmitriGekhtman
Contributor

DmitriGekhtman commented Dec 8, 2024

Or maybe the right heuristic here is to back off upscaling if any node has recently transitioned from pending to running, to give the stray resource bundles time to be removed from the pending list.

@glindstr

#46588 has minimal reproducible steps for the over-provisioning.
