Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Assign tasks to the first available worker #18167

Merged
merged 44 commits into from
Oct 5, 2021

Conversation

stephanie-wang
Copy link
Contributor

Why are these changes needed?

#17202 assigns tasks to specific workers if there are none available and a new worker has to be started. This is not robust because it can block the task if another worker becomes available first.

This PR converts the worker pool to queue the tasks and starting workers instead. If a worker becomes available and there is a task in the queue, we assign the task to that worker. If workers fail to start, we trigger the callbacks for all queued tasks.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@stephanie-wang stephanie-wang added the do-not-merge Do not merge this PR! label Aug 27, 2021
@stephanie-wang
Copy link
Contributor Author

We should merge #18166 first.

@stephanie-wang stephanie-wang removed the do-not-merge Do not merge this PR! label Aug 30, 2021
@stephanie-wang
Copy link
Contributor Author

Probably need to update the C++ tests.

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one simplification idea to unify dedicated/vs not paths by having them always go through the same hash tables, but with a unique key.

@@ -41,6 +41,9 @@ namespace raylet {
using WorkerCommandMap =
std::unordered_map<Language, std::vector<std::string>, std::hash<int>>;

// TODO(swang): Are we worried about hash collisions with int type?
using RuntimeEnvHash = int;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I had a PR that changed it to int64 but it got lost. Maybe file a backlog issue / add to the tech debt cleanup ticket.

}
});
} else {
if (!requires_dedicated_worker) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to unify dedicated/vs not paths (e.g., by giving dedicated workers some unique hash id?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good idea, I didn't like how this part turned out.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 31, 2021
stephanie-wang and others added 8 commits August 30, 2021 18:02
@rkooo567 rkooo567 self-assigned this Sep 13, 2021
@stephanie-wang stephanie-wang merged commit 545db13 into ray-project:master Oct 5, 2021
@stephanie-wang stephanie-wang deleted the test-many-tasks branch October 5, 2021 20:45
rkooo567 added a commit to rkooo567/ray that referenced this pull request Oct 7, 2021
ericl pushed a commit that referenced this pull request Oct 7, 2021
This was referenced Dec 9, 2021
pcmoritz pushed a commit that referenced this pull request Dec 13, 2021
PR #19014 introduced the idea of a StartupToken to uniquely identify a worker via a counter. This PR:
- returns the Process and the StartupToken from StartWorkerProcess (previously only Process was returned)
- Change the starting_workers_to_tasks map to index via the StartupToken, which seems to fix the windows failures.
- Unskip the windows tests in test_basic_2.py
It seems once a fix to PR #18167 goes in, the starting_workers_to_tasks map will be removed, which should remove the need for the changes to StartWorkerProcess made in this PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants