[core][aDAG] Hang when using ray before using adag #47864
Simplified reproduction:

```python
import ray
import ray.dag
from ray.experimental.channel.torch_tensor_type import TorchTensorType
import torch


@ray.remote(num_gpus=1)
class GPUSender:
    def send(self, shape):
        return torch.rand(shape, device="cuda", dtype=torch.float32)


@ray.remote(num_gpus=1)
class GPUReceiver:
    def recv(self, tensor: torch.Tensor):
        assert tensor.device.type == "cuda"
        return tensor.shape


shape = (1000, 10000)


def test_basic():
    print("Basic start")
    sender = GPUSender.remote()
    receiver = GPUReceiver.remote()
    obj = sender.send.remote(shape)
    result = receiver.recv.remote(obj)
    assert ray.get(result) == shape
    print("Basic end")


def test_dag():
    print("DAG start")
    sender = GPUSender.remote()
    receiver = GPUReceiver.remote()
    with ray.dag.InputNode() as inp:
        dag = sender.send.bind(inp)
        dag = dag.with_type_hint(
            TorchTensorType(transport="nccl", _shape=shape, _dtype=torch.float32)
        )
        dag = receiver.recv.bind(dag)
    # Creates a NCCL group across the participating actors.
    # The group is destroyed during dag.teardown().
    adag = dag.experimental_compile()
    assert ray.get(adag.execute(shape)) == shape
    print("DAG end")


if __name__ == "__main__":
    ray.init()
    test_basic()
    test_dag()
    ray.shutdown()
```
All failures happened after "DAG end" was printed.

Updated: after syncing my branch with master, I now need tens of runs to reproduce the segmentation fault.
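Since the segfault is now flaky, one way to catch it is to re-run the script in a loop until it crashes. A minimal sketch, assuming the reproduction above is saved as `repro.py` (hypothetical filename) on a machine with at least two GPUs:

```shell
# Re-run the repro until the flaky segfault shows up (or 20 attempts pass).
# "repro.py" is a hypothetical filename for the reproduction script above.
for i in $(seq 1 20); do
    echo "attempt $i"
    python repro.py && continue
    echo "failed on attempt $i"
    break
done
```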
I used valgrind to profile it. The SIGSEGV comes from the location below (valgrind output screenshot not preserved in this export).

I also checked the core dump, and the segmentation fault originates from the frame shown in a second screenshot (also not preserved).
Found another crash with the same root cause as the one valgrind reported (screenshot not preserved).

@rynewang suggested using Py_IsFinalizing() to check whether the interpreter is in the process of being finalized before calling back into Python.
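For illustration, `sys.is_finalizing()` is the Python-level counterpart of the C API's `Py_IsFinalizing()`. A minimal sketch of the suggested guard (the `guarded_release` helper is hypothetical, not Ray code):

```python
import sys

def guarded_release(release_fn):
    # Hypothetical helper: skip native-resource cleanup when the
    # interpreter is shutting down, mirroring the Py_IsFinalizing()
    # guard suggested for the C++ side.
    if sys.is_finalizing():
        return False  # unsafe to call back into Python at this point
    release_fn()
    return True

released = []
print(guarded_release(lambda: released.append(True)))  # prints True during normal execution
```

During normal execution the guard is a no-op and cleanup proceeds; only once interpreter finalization has begun does it bail out instead of touching Python state.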
After giving it a second thought, I think I may have misunderstood the issue. I ran the script and didn't observe the "hanging" behavior described in #47864 (comment); I ran it on my GPU devbox (screenshot not preserved). The segmentation fault always occurs at ray.shutdown(), which may be resolved or alleviated by #48808. I will close this issue after double-checking with @rkooo567 tomorrow.
Chatted with @rkooo567 offline. Closing this issue.
cc @dayshah |
What happened + What you expected to happen
This hangs on an A100. If I swap the order of test_basic() and test_dag(), it works.
Versions / Dependencies
master
Reproduction script
n/a
Issue Severity
None