
[Core][Bug] global-scoped actor handles/Ray objects prevent Ray workers from being destructed. #23677

Open
1 of 2 tasks
scv119 opened this issue Apr 4, 2022 · 6 comments
Labels: bug (Something that is supposed to be working; but isn't) · core (Issues that should be addressed in Ray Core) · P3 (Issue moderate in impact or severity)
Milestone: Core Backlog

@scv119
Contributor

scv119 commented Apr 4, 2022

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

Issue Severity

Medium: It contributes significant difficulty to completing my task, but I can work around it and get it resolved.

What happened + What you expected to happen

Specifically, we have a single-core (single-worker) cluster and three different map executions that alternate compute models:

ray.init(num_cpus=1)
ds = ray.data.range(10)

class StatefulFn:
    ...

# 1
ds.flat_map(StatefulFn, compute="tasks").take()
# 2
ds.flat_map(StatefulFn, compute="actors").take()
# 3
ds.flat_map(StatefulFn, compute="tasks").take()

where StatefulFn construction should be cached in the Python worker for the “tasks” compute model, and in the actor for the “actors” compute model. Before the change, #1, #2, and #3 would each create a new Python worker, so the StatefulFn wouldn’t be cached across those map executions, as expected. After the change, #1 and #3 somehow use the same Python worker and therefore reuse the cached StatefulFn construction, which is definitely unexpected. I’d expect the Python worker created in #1 to be destroyed once execution of #2 starts, so I’m not sure how the worker for #1 is getting reused for #3.
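
As a quick way to observe the reuse described above, one can log the worker PID from the StatefulFn constructor. A diagnostic sketch (not part of the original report), assuming the flat_map usage shown above:

import os

class StatefulFn:
    def __init__(self):
        # Constructed once per Python worker while the construction is cached,
        # so the same PID printing for map #1 and map #3 confirms that the
        # worker (and its cached state) was reused across executions.
        print(f"StatefulFn constructed in worker pid={os.getpid()}")

    def __call__(self, row):
        # flat_map expects an iterable of output rows per input row.
        return [row]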

For non-actor workers, global object references are never force-removed, since they're assumed to be held by other processes, so a global-scope named actor handle like this is expected to keep the worker alive forever.
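
For the object-reference variant of the same mechanism, here is a minimal sketch (hypothetical names, not from the issue) of a task stashing a borrowed ObjectRef at the worker's global scope, which would pin the worker the same way:

import ray

ray.init(num_cpus=1)

# Global scope of the worker process once the task has run.
cached_ref = None

@ray.remote
def stash(refs):
    global cached_ref
    # Keep a borrowed reference alive at the worker's global scope. Per the
    # above, such a reference is never force-removed, so the worker stays up.
    cached_ref = refs[0]

x_ref = ray.put(1)
# Pass the ref inside a list so the task receives the ObjectRef itself
# (a borrowed reference) rather than the resolved value.
ray.get(stash.remote([x_ref]))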

Versions / Dependencies

Latest Ray

Reproduction script

N/A

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@scv119 scv119 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 4, 2022
@scv119 scv119 added this to the Core Backlog milestone Apr 4, 2022
@scv119
Contributor Author

scv119 commented Apr 4, 2022

References that are captured in a remote function or class definition will be pinned permanently. For example:

import ray

@ray.remote
def foo():
    return 1  # minimal definition so the example runs

x_ref = foo.remote()

@ray.remote
def capture():
    ray.get(x_ref)  # x_ref is captured. It will be pinned as long as the driver lives.

One thing I didn't fully get is why we can't release the reference (since it's a borrower) after capture finishes execution. cc @stephanie-wang

@rkooo567
Contributor

rkooo567 commented Apr 4, 2022

@clarkzinzow

global-scope named actor handle

Can you show me an example of what this means?

@clarkzinzow
Contributor

@rkooo567 Sure!

import ray

# Actor handle lives at global scope.
handle = None

def _get_named_actor():
    global handle

    if handle is None:
        handle = ray.get_actor("adder")
    return handle


@ray.remote
class Adder:
    def __init__(self, x):
        self.x = x

    def add(self, y):
        return self.x + y


@ray.remote
def add(y):
    # This global actor handle will be set in a non-driver worker.
    adder = _get_named_actor()
    return ray.get(adder.add.remote(y))


adder = Adder.options(name="adder").remote(1)
results = [add.remote(y) for y in range(10)]
results = ray.get(results)
assert results == list(range(1, 11))

@stephanie-wang
Contributor

Yeah, it does seem like this is a bug. The reference should be a "borrower", and the worker is also supposed to force-remove its local ref count upon exit.

@clarkzinzow
Contributor

clarkzinzow commented Apr 5, 2022

@stephanie-wang Btw, I think it may actually be a borrower as expected; I have to double-check. It looks like we might always consider fetched named actor handles to be "detached" in the actor manager, so it wouldn't be an owned reference, and that isn't the underlying issue.

@scv119 scv119 added the core Issues that should be addressed in Ray Core label Apr 13, 2022
@ericl
Contributor

ericl commented Apr 15, 2022

This might be a fundamentally hard issue to fix, since global references are indistinguishable from references caught in transient GC cycles.

I don't see how you can solve this without adding GC collect calls after task executions, which would be prohibitively expensive.
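
To make the cost concrete, the ruled-out approach would amount to a per-task wrapper along these lines (an illustrative sketch, not Ray's actual executor code):

import gc

def run_task(task_fn, *args):
    # Hypothetical wrapper: run the task, then force a full collection so that
    # references trapped in transient GC cycles are released before the
    # worker's reference counts are inspected. Paying a full gc.collect()
    # after every task execution is what makes this prohibitively expensive.
    try:
        return task_fn(*args)
    finally:
        gc.collect()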

@rkooo567 rkooo567 added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Jul 8, 2022
@jjyao jjyao added P3 Issue moderate in impact or severity and removed P2 Important issue, but not time-critical labels Oct 30, 2024