[CORE][CLUSTER] Ray Autoscaler Overprovisioning Resources on AWS #46588
Comments
Hi, can you share more info?
Hi @rynewang. Thanks for looking at this. 🙏
In almost every case I've seen it will spin up just 1 additional node -- the smallest node.
I found it is possible to trigger overprovisioning with fewer tasks than the cluster's capacity, though I'm unable to find the exact threshold. In my tests, 230 of 256 won't trigger it, but 240 of 256 can. As the margin grows, the reliability of triggering the bug decreases.
I have task runners. Each task runner needs to execute thousands of unique jobs, which often take 10s-1min each. To manage resources, I've created a task runner that limits the number of concurrent tasks using ray.wait, per the API guidance (a rough sketch of this pattern is included below). So with one task runner, it will queue up to 256 tasks concurrently, with the remainder waiting on the client or actor. As ray.wait reports finished tasks, it submits more, up to the 256 limit. However, the autoscaler then provisions an extra machine, say with 64 CPUs, which expands the cluster from 256 to 320 CPUs, while the task runner continues to limit concurrency to 256 tasks. So that newly provisioned machine effectively sits there idle.

Additional notes: The bug can be triggered without ray.wait, so the task runner code isn't causing the issue. In the example code I provided, I'm initializing with the runtime environment. At 512 CPUs, when presumably the task scheduling takes longer than 10s (two autoscaler periods), it still only schedules 1 additional machine (all 64-CPU machines).
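For reference, a minimal sketch of the ray.wait-based concurrency-limiting pattern described above; the task body, job count, and the 256 limit are illustrative, not the exact code from my cluster:

```python
import time
import ray

ray.init(address="auto")

MAX_IN_FLIGHT = 256  # cap concurrency at the cluster's CPU count


@ray.remote(num_cpus=1)
def run_job(job_id):
    time.sleep(30)  # stand-in for a 10s-1min job
    return job_id


def run_all(num_jobs):
    in_flight, results = [], []
    for job_id in range(num_jobs):
        if len(in_flight) >= MAX_IN_FLIGHT:
            # Block until at least one task finishes before submitting more.
            done, in_flight = ray.wait(in_flight, num_returns=1)
            results.extend(ray.get(done))
        in_flight.append(run_job.remote(job_id))
    results.extend(ray.get(in_flight))
    return results


if __name__ == "__main__":
    run_all(5000)
```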
This looks like a duplicate of #36926
@DmitriGekhtman agreed, this is a duplicate. This may be helpful to #36926, as I did provide code to reproduce the bug.
What happened + What you expected to happen
Scenario:
AWS EC2 Cluster using Ray Autoscaler
Expected:
For an idle 256-CPU cluster, when 256 {CPU: 1} tasks are submitted in quick succession, I expect the cluster to meet the demand with its current resources, without the Ray autoscaler provisioning additional nodes.
What happened:
Creating X {CPU: 1} tasks on an idle cluster of X CPUs will cause the autoscaler to provision more CPU resources when X is sufficiently large. When runtime_env is specified on ray.init(), the required X seems to be smaller, possibly due to the extra time it takes to schedule the tasks; however, I've found it isn't strictly necessary. I can reliably reproduce this problem when X = 256, and unreliably with lower values like 32 or 64.
Thank you for investigating.
Versions / Dependencies
Python 3.10.12
Ray 2.32.0
Boto3
Reproduction script
Set up the venv and a test directory.
Test directory contains:
setup.py
run.py
cluster.yaml
setup.py
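A plausible sketch of a setup script that pre-warms the cluster to its full 256 CPUs. It assumes ray.autoscaler.sdk.request_resources is used to request the capacity; the original script may have achieved this differently (for example, by submitting placeholder tasks), and the 256-CPU figure comes from the scenario described above:

```python
# setup.py (hypothetical sketch): pre-warm the cluster to full capacity
# so run.py starts from an idle, fully provisioned 256-CPU cluster.
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

# Ask the autoscaler to scale the cluster up to 256 CPUs.
request_resources(num_cpus=256)
```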
run.py
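A minimal sketch of the trigger described above: submit X {CPU: 1} tasks in quick succession on the already-provisioned X-CPU cluster, with a runtime_env passed to ray.init(). The runtime_env contents and the task body here are placeholders; the original script may differ:

```python
# run.py (hypothetical sketch): submit 256 {CPU: 1} tasks at once on an
# idle 256-CPU cluster and observe the autoscaler adding an extra node.
import time
import ray

ray.init(address="auto", runtime_env={"pip": ["requests"]})  # placeholder env

NUM_TASKS = 256  # equal to the cluster's total CPU count


@ray.remote(num_cpus=1)
def work(i):
    time.sleep(60)  # stand-in for a 10s-1min job
    return i


# Submit all tasks in quick succession.
refs = [work.remote(i) for i in range(NUM_TASKS)]
print(len(ray.get(refs)), "tasks finished")
```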
cluster.yaml
Activate the virtual environment and navigate to the test directory, then:
- ray up cluster.yaml (monitor using ray status --address=...)
- python setup.py
- Wait until all CPUs are provisioned
- python run.py
Issue Severity
Low: It annoys or frustrates me.