SIGSEGV when exiting the dataloader in the middle of training #6246

Closed · yitongh opened this issue Jan 2, 2024 · 0 comments · Fixed by #6247
yitongh (Contributor) commented Jan 2, 2024

A SIGSEGV error occurs when running the following code with PJRT_DEVICE=CUDA torchrun --nproc_per_node=2 test.py.

import os

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.utils.utils as xu

os.environ['PJRT_LOCAL_PROCESS_RANK'] = os.environ['LOCAL_RANK']

device = xm.xla_device()
xm.set_replication(device, [device])

train_loader = xu.SampleGenerator(
    data=torch.zeros(1, 12),
    sample_count=1024)
train_loader = pl.MpDeviceLoader(train_loader, device)

max_steps = 10
for step, inputs in enumerate(train_loader):
  xm.all_reduce('sum', [inputs], scale=1.0 / xm.xrt_world_size())

  # Breaking out of the loop early means MpDeviceLoader never reaches the
  # xm.mark_step() it would otherwise issue, which leads to the SIGSEGV.
  if step > max_steps:
    break

This is caused by the early exit from the dataloader, which prevents the xm.mark_step() in pl.MpDeviceLoader from being executed. As a result, all_reduce_token is never reset to None, so its entry in the global variable g_all_reduce_tokens in torch_xla/csrc/cross_replica_reduces.cpp is only released when the program exits.
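A minimal workaround sketch, assuming the root cause described above: issuing an explicit xm.mark_step() when the loop is left early should flush the pending computation and reset the all-reduce token before process teardown. This is not the fix from #6247; it only illustrates the mechanism.

# Hypothetical workaround (not the actual fix in #6247): ensure a mark_step()
# runs even when training stops before the loader is exhausted.
try:
  for step, inputs in enumerate(train_loader):
    xm.all_reduce('sum', [inputs], scale=1.0 / xm.xrt_world_size())
    if step > max_steps:
      break
finally:
  # Per the explanation above, this resets all_reduce_token so it is not
  # released only at program exit.
  xm.mark_step()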
