
[core] Prestart workers up to available CPU limit #18166

Merged
merged 7 commits into ray-project:master from the prestart-workers branch on Aug 29, 2021

Conversation

stephanie-wang (Contributor)

Why are these changes needed?

We prestart workers to improve task scheduling time, but the soft limit is static, based on the number of CPUs. This means prestart won't work as expected if some workers are holding less than 1 CPU each (such as actors with fractional CPU requests). This PR allows prestart to start workers up to the number of available CPUs, if that is more than what the soft limit would allow.
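
As a rough illustration of the sizing rule (a minimal sketch with placeholder names like `NumWorkersToPrestart` and `num_available_cpus`; not the actual raylet implementation):

```cpp
// Minimal sketch of the prestart sizing change described above.
// Placeholder names; illustrative only, not the actual raylet code.
#include <algorithm>
#include <cstdint>

int64_t NumWorkersToPrestart(int64_t soft_limit,          // typically == number of CPUs
                             int64_t num_total_workers,   // workers already started
                             int64_t num_available_cpus,  // CPUs not currently held by workers
                             int64_t backlog_size) {      // queued tasks/actors
  // Old behavior: only prestart while under the static soft limit.
  const int64_t by_soft_limit = soft_limit - num_total_workers;
  // New behavior: if fractional-CPU workers (e.g. 0.25-CPU actors) leave whole
  // CPUs unused, allow prestarting up to the number of available CPUs instead.
  const int64_t upper_bound = std::max(by_soft_limit, num_available_cpus);
  // Never prestart more workers than there are queued tasks.
  return std::max<int64_t>(0, std::min(upper_bound, backlog_size));
}
```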

Related issue number

This PR is needed to fix the distributed test_many_tasks.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl (Contributor) left a comment

IIUC, this is basically allowing prestart past the CPU limit. So suppose I have 4 actors using 0.25 CPUs each, 2 CPUs total, and 8 more actors queued, this allows prestart of up to 4 more workers, correct?

It would be great to also add some unit tests checking these scenarios.

@ericl added the @author-action-required label on Aug 27, 2021
@stephanie-wang (Contributor, Author)

IIUC, this is basically allowing prestart past the CPU limit. So suppose I have 4 actors using 0.25 CPUs each, 2 CPUs total, and 8 more actors queued, this allows prestart of up to 4 more workers, correct?

It would be great to also add some unit tests checking these scenarios.

Yes, but I think it would actually prestart up to 1 more worker in the scenario you described, since there is 1 CPU available.

@ericl (Contributor) commented Aug 27, 2021 via email

@stephanie-wang (Contributor, Author)

The previous logic only starts up to the soft limit, including existing workers.

So with 4 actors using 0.25 CPUs each, 2 CPUs total, 8 actors queued:

  • old: min(soft_limit - num_total_workers, backlog) = min(2 - 4, backlog) -> do not start any more workers because we are already over the soft limit
  • new: min(max(soft_limit - num_total_workers, num available CPUs), backlog) -> start 1 more worker

Actually I think we should just change it to min(num available CPUs, backlog), unless there's a reason we want to keep the previous calculation?
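
Plugging this scenario into the illustrative `NumWorkersToPrestart` sketch from the PR description (placeholder code, not the raylet implementation):

```cpp
// 2-CPU node, 4 actors holding 0.25 CPU each, 8 actors queued,
// so 1 CPU is unused (num_available_cpus == 1).
//
// old:     min(2 - 4, 8)          -> negative, clamped to 0 workers prestarted
// this PR: min(max(2 - 4, 1), 8)  -> 1 worker prestarted
// simplification above: min(num_available_cpus, backlog) = min(1, 8) -> also 1
const int64_t n = NumWorkersToPrestart(/*soft_limit=*/2,
                                       /*num_total_workers=*/4,
                                       /*num_available_cpus=*/1,
                                       /*backlog_size=*/8);
// n == 1
```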

@ericl (Contributor) commented Aug 27, 2021 via email

@stephanie-wang (Contributor, Author)

The new calculation doesn't seem right then. Suppose I have 16 CPUs, 15 actors using 1 CPU each, and 100 actors backlogged. Then you prestart 16 workers, which is clearly incorrect right?


I'm not sure how you got that...it should prestart 1 worker for the last available CPU, I think.

@ericl (Contributor) commented Aug 27, 2021

Oh I see, available==not used, not total. In this case then min(num available CPUs, backlog) sounds good to me.

@ericl (Contributor) commented Aug 27, 2021

Actually, isn't one issue that starting more workers than the soft limit will just result in the workers getting immediately idle killed (since the soft limit is exceeded)?

Another concern is the well-conditioning of prestart. Suppose I use min(num_avail_cpus, backlog) as the number to prestart, and then I call PrestartWorkers() 10 times in quick succession, during which time num_avail_cpus doesn't change. This starts 10x the number of workers, right?

So I think we have to take into account the number of current workers. I believe this formula would be better:
min(soft_limit - sum(current_worker_cpu_allocations), backlog). That accounts for the edge cases of fractional CPU allocations while preserving the well-conditioning of the current formula.
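
For comparison, here is a rough sketch of the two sizing rules under discussion (placeholder names, not the actual raylet code; whether idle prestarted workers count toward `current_worker_cpu_allocations` is exactly the detail that matters for well-conditioning):

```cpp
// Illustrative comparison only; placeholder names, not the actual raylet code.
#include <algorithm>
#include <cstdint>

// min(num_avail_cpus, backlog): if prestarted workers sit idle and hold no
// CPUs, num_available_cpus does not change, so ten quick calls to
// PrestartWorkers() would request roughly ten times as many workers.
int64_t NaivePrestartCount(int64_t num_available_cpus, int64_t backlog_size) {
  return std::min(num_available_cpus, backlog_size);
}

// The formula suggested above: bound the count by the soft limit minus the
// CPUs already allocated to workers, which handles fractional allocations
// (e.g. 0.25-CPU actors) while still accounting for existing workers.
int64_t SuggestedPrestartCount(int64_t soft_limit,
                               double current_worker_cpu_allocations,
                               int64_t backlog_size) {
  const auto room =
      static_cast<int64_t>(soft_limit - current_worker_cpu_allocations);
  return std::max<int64_t>(0, std::min(room, backlog_size));
}
```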

@stephanie-wang (Contributor, Author)

Actually, isn't one issue that starting more workers than the soft limit will just result in the workers getting immediately idle killed (since the soft limit is exceeded)?

My logic here is that we only start up to the task backlog. So as long as the owner submits the tasks in the backlog before the idle worker timeout, the worker should be usable. Long-term, it would probably be better to also adjust the soft limit dynamically, based on the available (unused) CPUs.

By the way, this issue isn't new, since before #17202, a queued task would end up triggering multiple worker starts.

@ericl (Contributor) commented Aug 27, 2021

OK, I buy that exceeding the soft limit temporarily is safe if the scheduler immediately grabs the worker. But I think the other concern stands.

@stephanie-wang stephanie-wang merged commit 7bc1ef0 into ray-project:master Aug 29, 2021
@stephanie-wang stephanie-wang deleted the prestart-workers branch August 29, 2021 21:11
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.