SIGSEGV on TPUVM pod size >= v3-128 #8358
Comments
I've been testing around with this. If Lightning shows different behavior depending on the input image size, maybe this is a data_loader-related issue?
There is no behavioral difference on the Lightning side for training on different input image sizes.
Need to dig into it more to find out the cause. Assigning myself to the issue.
@zcain117 @kaushikb11 Here’s what I’ve found:
I'll attach the latest error log on v3-256. Any ideas or progress?
The action item here would be to test it with an XLA script and check whether it raises the same issue.
@kaushikb11 I've tested this script on v3-256 and it works with native torch-xla. Here are the steps I took:
Lightning seems to fail during `new_process`.
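For reference, a minimal native torch-xla check along these lines could look like the sketch below. The model, batch size, and sample count are placeholders (this is not the script that was actually run); it only shows per-process fake data via `SampleGenerator` and an `xmp.spawn` training loop.

```python
# Hypothetical minimal torch-xla check (not the actual script from this thread).
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.utils.utils as xu

IMAGE_SIZE = 256  # the same knob that triggers the crash under Lightning
BATCH_SIZE = 8


def _mp_fn(index):
    device = xm.xla_device()

    # Fake data is generated lazily and independently on every process,
    # mirroring the fake_data path of the torch-xla example scripts.
    train_loader = xu.SampleGenerator(
        data=(torch.zeros(BATCH_SIZE, 3, IMAGE_SIZE, IMAGE_SIZE),
              torch.zeros(BATCH_SIZE, dtype=torch.int64)),
        sample_count=1000)

    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(16, 10)).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step, (data, target) in enumerate(pl.MpDeviceLoader(train_loader, device)):
        optimizer.zero_grad()
        loss = loss_fn(model(data), target)
        loss.backward()
        xm.optimizer_step(optimizer)
        if step % 100 == 0:
            xm.master_print(f'step {step} loss {loss.item():.4f}')


if __name__ == '__main__':
    # With xla_dist, each pod worker runs this and spawns its 8 local processes.
    xmp.spawn(_mp_fn, nprocs=8, start_method='fork')
```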
@zcain117 @kaushikb11 Fixed. Replacing the xla SampleGenerator with a native PyTorch Dataset class and using a LightningDataModule solves the memory OOM that happens with a large pod + large input image size. Here are examples of proper DataModules with a fake_data generation option.
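(The linked examples aren't reproduced here. Below is only a minimal sketch of the pattern being described: a plain torch Dataset that generates fake images on the fly, wrapped in a LightningDataModule. Class names, sizes, and the `fake_data` flag are illustrative.)

```python
# Sketch of the described fix: a native PyTorch Dataset + LightningDataModule
# instead of the xla SampleGenerator. Names and sizes are illustrative only.
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class FakeImageDataset(Dataset):
    """Creates random images on demand, so no large buffer is materialized."""

    def __init__(self, length=10000, image_size=256, num_classes=10):
        self.length = length
        self.image_size = image_size
        self.num_classes = num_classes

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        image = torch.randn(3, self.image_size, self.image_size)
        label = torch.randint(self.num_classes, (1,)).item()
        return image, label


class FakeImageDataModule(pl.LightningDataModule):
    def __init__(self, batch_size=8, image_size=256, fake_data=True):
        super().__init__()
        self.batch_size = batch_size
        self.image_size = image_size
        self.fake_data = fake_data

    def setup(self, stage=None):
        if self.fake_data:
            self.train_set = FakeImageDataset(image_size=self.image_size)
        else:
            raise NotImplementedError('load the real dataset here')

    def train_dataloader(self):
        # Lightning takes care of sharding this across TPU processes.
        return DataLoader(self.train_set, batch_size=self.batch_size,
                          num_workers=4, drop_last=True)
```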
Great find! Glad it's working now. Do you think it was an OOM that was causing the SIGSEGV? I am curious why there is an OOM for larger TPU sizes. With native torch-xla, each process creates its own SampleGenerator. Is there some step in the PyTorch Lightning setup where all the SampleGenerators would end up on the same machine?
Maybe it's related to the DDP sampler? Lightning is supposed to automatically add a DDP sampler for multi-TPU training, and somehow it duplicates the data.
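For context, the sampler that gets injected for TPU training is roughly equivalent to the sketch below (not Lightning's actual implementation; it just shows how each process is meant to see only its own shard):

```python
# Rough sketch of what an auto-added distributed sampler amounts to on TPU
# (illustrative, not Lightning's actual code).
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
import torch_xla.core.xla_model as xm


def build_sharded_loader(dataset, batch_size):
    sampler = DistributedSampler(
        dataset,
        num_replicas=xm.xrt_world_size(),  # total processes across the pod
        rank=xm.get_ordinal(),             # this process's index
        shuffle=True)
    # Each process should only iterate over 1/num_replicas of the dataset;
    # if the sampler is missing or misconfigured, every process sees all of it.
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```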
@tgisaturday Could you add the recent updates to the GitHub issue?
@kaushikb11 I'm testing out a few more things. I'll leave an update when they are finished.
@tgisaturday We could close this issue, as we figured out it was a DataLoader issue. We could create a separate issue for the custom logging problem.
I'm also testing text data loading. I'll open another issue if it seems to be a Lightning problem.
🐛 Bug
Given a decently large input data size, PyTorch Lightning will crash on v3-128 (and presumably on any larger TPU pod too). The same code works fine on a v3-32. If I make the input image size smaller, the code also works on v3-128.
I am continuing to try to find more informative logs deep in the TPU stack about the segfault.
This crash does not happen with regular pytorch/xla. So I wanted to get a sense of:
Repro:
Below is a repro script. `IMAGE_SIZE=256` results in a SIGSEGV on v3-128, but `IMAGE_SIZE=28` trains successfully on v3-128:
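(The original repro script isn't reproduced here; the sketch below is only a hypothetical stand-in showing the shape of such a repro and the `IMAGE_SIZE` knob. The model, batch size, and dataset length are placeholders.)

```python
# Hypothetical stand-in for the repro script (the real one lives in the
# taming-transformers-tpu repo); only the IMAGE_SIZE knob matters here.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl

IMAGE_SIZE = 256  # 256 -> SIGSEGV on v3-128; 28 -> trains fine


class RandomImages(Dataset):
    def __len__(self):
        return 10000

    def __getitem__(self, idx):
        return torch.randn(3, IMAGE_SIZE, IMAGE_SIZE)


class TinyAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1))

    def training_step(self, batch, batch_idx):
        return F.mse_loss(self.net(batch), batch)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == '__main__':
    loader = DataLoader(RandomImages(), batch_size=8, num_workers=4)
    # tpu_cores=8 is per worker; xla_dist fans this out across the pod.
    trainer = pl.Trainer(tpu_cores=8, max_epochs=1)
    trainer.fit(TinyAutoEncoder(), loader)
```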
My repro steps:
SSH into the TPU VM pod:

```bash
gcloud alpha compute tpus tpu-vm ssh zcain-v3-128 --zone europe-west4-a --project=tpu-pytorch
gcloud compute config-ssh
```

Run the repro script across the pod:

```bash
python3 -m torch_xla.distributed.xla_dist --tpu=zcain-v3-128 -- python3 /usr/share/taming-transformers-tpu/repro.py
```