Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Simplify CoreWorker state machine in worker pool. #47994

Open
rynewang opened this issue Oct 11, 2024 · 1 comment
Open

[core] Simplify CoreWorker state machine in worker pool. #47994

rynewang opened this issue Oct 11, 2024 · 1 comment
Assignees
Labels
core Issues that should be addressed in Ray Core enhancement Request for new feature and/or capability P0 Issues that should be fixed in short order

Comments

@rynewang
Copy link
Contributor

Description

Now:

(None) -> RegisterWorker -> AnnouncePort -> (ready) -> PopWorker to assign task -> task done, PushWorker back -> (ready). At each step it can go to DisconnectClient -> (None).

So in order to be (ready), it needs 2 RPCs: RegisterWorker -> AnnouncePort. This is not needed, we can merge the 2.

  • RegisterWorker: (worker -> raylet), giving pids, etc, asking for a port.
  • AnnouncePort: (worker -> raylet), giving port, and indicating the worker is ready and prepared to receive tasks.

so RegisterWorker's only usage is to ask for a port. We can instead give the port in worker start up commandline, then it can be:

  • BikeshedRegisterWorkerAndAnnouncePort: (worker -> raylet), giving pids, port, etc, and indicating the worker is ready and prepared to receive tasks.

One caveat: Drivers. because it's not started by raylet, raylet can't give it a port by commandline. it needs a way to receive a port. Today, it's a separate RPC RegisterDriver anyway. let's make it

RegisterDriver -> BikeshedRegisterWorkerAndAnnouncePort -> (ready).

Use case

No response

@rynewang rynewang added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) P0 Issues that should be fixed in short order core Issues that should be addressed in Ray Core and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 11, 2024
@rynewang rynewang self-assigned this Oct 11, 2024
@rynewang
Copy link
Contributor Author

In CoreWorker ctor, local_raylet_client_ is first init'd (sending RegisterWorker). Then it's passed to a lot of classes, eg plasma_store_provider_, memory_store_, task_receiver_ etc. At end of ctor there's: a call to ConnectToRaylet which depends on connect_on_start.

// Tell the raylet the port that we are listening on.
// NOTE: This also marks the worker as available in Raylet. We do this at the
// very end in case there is a problem during construction.
if (options.connect_on_start) {
ConnectToRayletInternal();
}

We need to understand: if we are gonna merge the 2 rpc calls, shall it happen before or after it's passed to these many classes. and what's the deal with connect_on_start.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Issues that should be addressed in Ray Core enhancement Request for new feature and/or capability P0 Issues that should be fixed in short order
Projects
None yet
Development

No branches or pull requests

1 participant