[core] Prestart workers up to available CPU limit #18166
Conversation
IIUC, this is basically allowing prestart past the CPU limit. So suppose I have 4 actors using 0.25 CPUs each, 2 CPUs total, and 8 more actors queued, this allows prestart of up to 4 more workers, correct?
It would be great to also add some unit tests checking these scenarios.
Yes, but I think it would actually prestart up to 1 more worker in the scenario you described, since there is 1 CPU available.
Doesn't the existing logic already prestart 1 more in that case? Can you give a before/after comparison on a few cases here? (Also would be good to have this as a comment in the code.)
The previous logic only starts up to the soft limit, including existing workers. So with 4 actors using 0.25 CPUs each, 2 CPUs total, 8 actors queued:
- old: min(soft_limit - num_total_workers, backlog) = min(2 - 4, backlog) -> do not start any more workers because we are already over the soft limit
- new: min(max(soft_limit - num_total_workers, num available CPUs), backlog) -> start 1 more worker

Actually I think we should just change it to min(num available CPUs, backlog), unless there's a reason we want to keep the previous calculation?
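A minimal sketch of the two calculations on this scenario (illustrative only; the variable names here are not taken from the Ray code):

```python
# Scenario from the comment above: 2 CPUs total, 4 actors using 0.25 CPUs each,
# 8 more actors queued.
soft_limit = 2                      # soft limit == number of CPUs
num_total_workers = 4               # workers already running (the 4 actors)
available_cpus = 2 - 4 * 0.25       # 1.0 CPU still unused
backlog = 8

old = min(soft_limit - num_total_workers, backlog)                       # min(-2, 8) -> no prestart
new = min(max(soft_limit - num_total_workers, available_cpus), backlog)  # min(max(-2, 1.0), 8) -> 1 worker
print(old, new)
```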
The new calculation doesn't seem right then. Suppose I have 16 CPUs, 15 actors using 1 CPU each, and 100 actors backlogged. Then you prestart 16 workers, which is clearly incorrect, right?
I'm not sure how you got that... it should prestart 1 worker for the last available CPU, I think.
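For concreteness, a worked check of min(num available CPUs, backlog) on the 16-CPU scenario (illustrative numbers, not code from the PR):

```python
total_cpus = 16
used_cpus = 15                            # 15 actors holding 1 CPU each
available_cpus = total_cpus - used_cpus   # "available" means unused CPUs, so 1, not 16
backlog = 100

print(min(available_cpus, backlog))       # 1 -> only 1 extra worker is prestarted
```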
Oh I see, available == not used, not total. In this case then…
Actually, isn't one issue that starting more workers than the soft limit will just result in the workers getting immediately idle-killed (since the soft limit is exceeded)?

Another concern is the well-conditioning of prestart. Suppose I use … So I think we have to take into account the number of current workers. I believe this formula would be better: …
My logic here is that we only start up to the task backlog. So as long as the owner submits the tasks in the backlog before the idle worker timeout, the worker should be usable. Long-term, it would probably be better to also adjust the soft limit dynamically, based on the available (unused) CPUs. By the way, this issue isn't new, since before #17202 a queued task would end up triggering multiple worker starts.
OK, I buy that exceeding the soft limit temporarily is safe if the scheduler immediately grabs the worker. But I think the other concern stands.
Why are these changes needed?
We prestart workers to improve task scheduling time, but the soft limit is a static limit based on the number of CPUs. This means prestart won't work as expected if some workers are holding less than 1 CPU (such as an actor). This PR allows prestart to start workers up to the number of available CPUs, if this is more than what the soft limit would allow.
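A rough sketch of the calculation this description implies (the real change lives in Ray's C++ worker pool; the function and parameter names here are hypothetical):

```python
def num_workers_to_prestart(soft_limit, num_total_workers, available_cpus, backlog):
    # Prestart up to the soft limit as before, but allow exceeding it when
    # there are unused CPUs (e.g. because actors hold fractional CPUs),
    # always capped by the task backlog and never negative.
    allowed = max(soft_limit - num_total_workers, available_cpus)
    return max(0, min(allowed, backlog))
```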
Related issue number
This PR is needed to fix the distributed test_many_tasks.
Checks
- I've run scripts/format.sh to lint the changes in this PR.