A SIGSEGV error occurs when running the following code with `PJRT_DEVICE=CUDA torchrun --nproc_per_node=2 test.py`:
```python
import os

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.utils.utils as xu

os.environ['PJRT_LOCAL_PROCESS_RANK'] = os.environ['LOCAL_RANK']

device = xm.xla_device()
xm.set_replication(device, [device])

train_loader = xu.SampleGenerator(
    data=torch.zeros(1, 12),
    sample_count=1024)
train_loader = pl.MpDeviceLoader(train_loader, device)

max_steps = 10
for step, inputs in enumerate(train_loader):
    xm.all_reduce('sum', [inputs], scale=1.0 / xm.xrt_world_size())
    if step > max_steps:
        break  # early exit: the loader's final xm.mark_step() never runs
```
This is due to the early exit from the dataloader, which prevents the `xm.mark_step()` in `pl.MpDeviceLoader` from being executed. As a result, `all_reduce_token` is never reset to `None`, so the token stored in the global variable `g_all_reduce_tokens` in `torch_xla/csrc/cross_replica_reduces.cpp` is only released when the program exits.
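Based on this analysis, one way to avoid the crash is to flush the pending computation manually before leaving the loop. The sketch below is an assumption derived from the report, not a confirmed upstream fix: it calls `xm.mark_step()` right before the early `break`, which should consume and reset the pending all-reduce token the same way `pl.MpDeviceLoader` does at the end of a full epoch.

```python
max_steps = 10
for step, inputs in enumerate(train_loader):
    xm.all_reduce('sum', [inputs], scale=1.0 / xm.xrt_world_size())
    if step > max_steps:
        # Workaround (assumption): materialize the pending graph so the
        # all-reduce token is consumed and reset before iteration stops,
        # mirroring the mark_step MpDeviceLoader runs after a full epoch.
        xm.mark_step()
        break
```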