
[Core][Bug] global-scoped actor handles/Ray objects prevent Ray workers from being destructed. #23677

Open
1 of 2 tasks
scv119 opened this issue Apr 4, 2022 · 6 comments
Labels: bug (Something that is supposed to be working; but isn't) · core (Issues that should be addressed in Ray Core) · P3 (Issue moderate in impact or severity)
Milestone: Core Backlog

@scv119
Contributor

scv119 commented Apr 4, 2022

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

Issue Severity

Medium: It contributes significant difficulty to completing my task, but I can work around it and get it resolved.

What happened + What you expected to happen

Specifically, we have a single-core (single-worker) cluster and three different map executions that alternate compute models:

ray.init(num_cpus=1)
ds = ray.data.range(10)

class StatefulFn:
    ...

# 1
ds.flat_map(StatefulFn, compute="tasks").take()
# 2
ds.flat_map(StatefulFn, compute="actors").take()
# 3
ds.flat_map(StatefulFn, compute="tasks").take()

where StatefulFn construction should be cached in the Python worker for the “tasks” compute model, and in the actor for the “actors” compute model. Before the change, #1, #2, and #3 would each create a new Python worker, so the StatefulFn wouldn’t be cached across those map executions, as expected. After the change, #1 and #3 somehow use the same Python worker and therefore reuse the cached StatefulFn construction, which is definitely unexpected. I’d expect the Python worker created in #1 to be destroyed once execution of #2 starts, so I’m not sure how the worker for #1 is getting reused for #3.
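
As a quick way to observe the reuse described above, one can log the worker PID from the StatefulFn constructor. A diagnostic sketch (not part of the original report), assuming the flat_map usage shown above:

import os

class StatefulFn:
    def __init__(self):
        # Constructed once per Python worker while the construction is cached,
        # so the same PID printing for map #1 and map #3 confirms that the
        # worker (and its cached state) was reused across executions.
        print(f"StatefulFn constructed in worker pid={os.getpid()}")

    def __call__(self, row):
        # flat_map expects an iterable of output rows per input row.
        return [row]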

For non-actor workers, global object references are never force-removed, since they're assumed to be held by other processes, so a global-scope named actor handle like this is expected to keep the worker alive forever.
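
For the object-reference variant of the same mechanism, here is a minimal sketch (hypothetical names, not from the issue) of a task stashing a borrowed ObjectRef at the worker's global scope, which would pin the worker the same way:

import ray

ray.init(num_cpus=1)

# Global scope of the worker process once the task has run.
cached_ref = None

@ray.remote
def stash(refs):
    global cached_ref
    # Keep a borrowed reference alive at the worker's global scope. Per the
    # above, such a reference is never force-removed, so the worker stays up.
    cached_ref = refs[0]

x_ref = ray.put(1)
# Pass the ref inside a list so the task receives the ObjectRef itself
# (a borrowed reference) rather than the resolved value.
ray.get(stash.remote([x_ref]))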

Versions / Dependencies

Latest Ray

Reproduction script

N/A

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@scv119 scv119 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 4, 2022
@scv119 scv119 added this to the Core Backlog milestone Apr 4, 2022
@scv119
Contributor Author

scv119 commented Apr 4, 2022

References that are captured in a remote function or class definition will be pinned permanently. For example:

import ray

@ray.remote
def foo():
    return 1  # minimal definition so the example runs

x_ref = foo.remote()

@ray.remote
def capture():
    ray.get(x_ref)  # x_ref is captured. It will be pinned as long as the driver lives.

One thing I didn't fully get is why we can't release the reference (since it's a borrower) after capture finishes execution. cc @stephanie-wang

@rkooo567
Contributor

rkooo567 commented Apr 4, 2022

@clarkzinzow

global-scope named actor handle

Can you show me an example of what this means?

@clarkzinzow
Contributor

@rkooo567 Sure!

import ray

# Actor handle lives at global scope.
handle = None

def _get_named_actor():
    global handle

    if handle is None:
        handle = ray.get_actor("adder")
    return handle


@ray.remote
class Adder:
    def __init__(self, x):
        self.x = x

    def add(self, y):
        return self.x + y


@ray.remote
def add(y):
    # This global actor handle will be set in a non-driver worker.
    adder = _get_named_actor()
    return ray.get(adder.add.remote(y))


adder = Adder.options(name="adder").remote(1)
results = [add.remote(y) for y in range(10)]
results = ray.get(results)
assert results == list(range(1, 11))

@stephanie-wang
Contributor

Yeah, it does seem like this is a bug. The reference should be a "borrower", and the worker is also supposed to force-remove its local ref count upon exit.

@clarkzinzow
Contributor

clarkzinzow commented Apr 5, 2022

@stephanie-wang Btw, I think it may actually be a borrower as expected; I have to double-check. It looks like we might always consider fetched named actor handles to be "detached" in the actor manager, so it wouldn't be an owned reference, and that isn't the underlying issue.

@scv119 scv119 added the core Issues that should be addressed in Ray Core label Apr 13, 2022
@ericl
Contributor

ericl commented Apr 15, 2022

This might be a fundamentally hard issue to fix, since global references are indistinguishable from references caught in transient GC cycles.

I don't see how you can solve this without adding GC collect calls after task executions, which would be prohibitively expensive.
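
To make the cost concrete, the ruled-out approach would amount to a per-task wrapper along these lines (an illustrative sketch, not Ray's actual executor code):

import gc

def run_task(task_fn, *args):
    # Hypothetical wrapper: run the task, then force a full collection so that
    # references trapped in transient GC cycles are released before the
    # worker's reference counts are inspected. Paying a full gc.collect()
    # after every task execution is what makes this prohibitively expensive.
    try:
        return task_fn(*args)
    finally:
        gc.collect()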

@rkooo567 rkooo567 added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Jul 8, 2022
@jjyao jjyao added P3 Issue moderate in impact or severity and removed P2 Important issue, but not time-critical labels Oct 30, 2024