-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Assign tasks to the first available worker #18167
[core] Assign tasks to the first available worker #18167
Conversation
We should merge #18166 first. |
Probably need to update the C++ tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one simplification idea to unify dedicated/vs not paths by having them always go through the same hash tables, but with a unique key.
src/ray/raylet/worker_pool.h
Outdated
@@ -41,6 +41,9 @@ namespace raylet { | |||
using WorkerCommandMap = | |||
std::unordered_map<Language, std::vector<std::string>, std::hash<int>>; | |||
|
|||
// TODO(swang): Are we worried about hash collisions with int type? | |||
using RuntimeEnvHash = int; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I had a PR that changed it to int64 but it got lost. Maybe file a backlog issue / add to the tech debt cleanup ticket.
src/ray/raylet/worker_pool.cc
Outdated
} | ||
}); | ||
} else { | ||
if (!requires_dedicated_worker) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to unify dedicated/vs not paths (e.g., by giving dedicated workers some unique hash id?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good idea, I didn't like how this part turned out.
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
d37449a
to
a7c6af3
Compare
PR #19014 introduced the idea of a StartupToken to uniquely identify a worker via a counter. This PR: - returns the Process and the StartupToken from StartWorkerProcess (previously only Process was returned) - Change the starting_workers_to_tasks map to index via the StartupToken, which seems to fix the windows failures. - Unskip the windows tests in test_basic_2.py It seems once a fix to PR #18167 goes in, the starting_workers_to_tasks map will be removed, which should remove the need for the changes to StartWorkerProcess made in this PR.
Why are these changes needed?
#17202 assigns tasks to specific workers if there are none available and a new worker has to be started. This is not robust because it can block the task if another worker becomes available first.
This PR converts the worker pool to queue the tasks and starting workers instead. If a worker becomes available and there is a task in the queue, we assign the task to that worker. If workers fail to start, we trigger the callbacks for all queued tasks.
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.