SIGSEGV when exiting the dataloader in the middle of training #6246

Closed · yitongh opened this issue Jan 2, 2024 · 0 comments · Fixed by #6247
yitongh (Contributor) commented Jan 2, 2024

A SIGSEGV error occurs when running the following code with PJRT_DEVICE=CUDA torchrun --nproc_per_node=2 test.py.

import os

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.utils.utils as xu

os.environ['PJRT_LOCAL_PROCESS_RANK'] = os.environ['LOCAL_RANK']

device = xm.xla_device()
xm.set_replication(device, [device])

train_loader = xu.SampleGenerator(
    data=torch.zeros(1, 12),
    sample_count=1024)
train_loader = pl.MpDeviceLoader(train_loader, device)

max_steps = 10
for step, inputs in enumerate(train_loader):
  xm.all_reduce('sum', [inputs], scale=1.0 / xm.xrt_world_size())

  # Breaking out of the loop early means MpDeviceLoader never reaches the
  # xm.mark_step() it would otherwise issue, which leads to the SIGSEGV.
  if step > max_steps:
    break

This is caused by the early exit from the dataloader, which prevents the xm.mark_step() in pl.MpDeviceLoader from being executed. As a result, all_reduce_token is never reset to None, so its entry in the global variable g_all_reduce_tokens in torch_xla/csrc/cross_replica_reduces.cpp is only released when the program exits.
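A minimal workaround sketch, assuming the root cause described above: issuing an explicit xm.mark_step() when the loop is left early should flush the pending computation and reset the all-reduce token before process teardown. This is not the fix from #6247; it only illustrates the mechanism.

# Hypothetical workaround (not the actual fix in #6247): ensure a mark_step()
# runs even when training stops before the loader is exhausted.
try:
  for step, inputs in enumerate(train_loader):
    xm.all_reduce('sum', [inputs], scale=1.0 / xm.xrt_world_size())
    if step > max_steps:
      break
finally:
  # Per the explanation above, this resets all_reduce_token so it is not
  # released only at program exit.
  xm.mark_step()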
