Running code on a large TPU VM Pod causes SIGSEGV #3028
Comments
Here's the full error log:
2021-07-08 08:28:12 10.164.0.33 [0] *** SIGSEGV (@0x7f97c0e8e528), see gl__________25#s15 received by PID 16710 (TID 16710) on cpu 32; stack trace: ***
I've looked into this a little this week and don't have a fix yet. I was able to confirm that the SIGSEGV is not really specific to your model: it seems like any PyTorch Lightning model will crash on a v3-128 if the images are moderately large (e.g. a 28x28x3 image is OK, but a 256x256x3 image like yours results in a crash).

I had a few clarification questions for the Lightning team in Lightning-AI/pytorch-lightning#8358. In particular, I am wondering whether PyTorch Lightning handles the init differently for v3-32 vs. v3-128, or whether there is some memory-management issue in the way Lightning sets up the workers that gets worse as the number of TPU cores increases.

I will keep trying to dig up a more informative error message behind the SIGSEGV, and hopefully I can get some hints from the Lightning team as well.
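To make the "any model crashes" observation concrete, here is a minimal, hypothetical repro sketch (not the reporter's actual code) that trains a trivial model on randomly generated 256x256x3 images. It assumes a 2021-era pytorch-lightning release that still accepts the tpu_cores Trainer argument, with torch_xla installed on the TPU VM; the model and batch sizes are illustrative only.

# Minimal, hypothetical repro sketch: a tiny LightningModule trained on random
# 256x256x3 images. Per the observation above, 28x28x3 inputs reportedly run
# fine while 256x256x3 inputs segfault on a v3-128 pod. Assumes a 2021-era
# pytorch-lightning (with the `tpu_cores` Trainer argument) and torch_xla.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # A single conv layer is enough; the crash appears to depend on
        # input size rather than model complexity.
        self.net = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.net(x).mean()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)

if __name__ == "__main__":
    # Fake "large" images; swap in torch.randn(64, 3, 28, 28) to compare.
    data = TensorDataset(torch.randn(64, 3, 256, 256))
    loader = DataLoader(data, batch_size=8)
    trainer = pl.Trainer(tpu_cores=8, max_epochs=1)
    trainer.fit(TinyModel(), loader)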
Found out the cause and fixed it. I'll leave the link for those who are having a similar problem with TPU + XLA + Lightning.
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
pip install -r requirements.txt
python3 -m torch_xla.distributed.xla_dist --tpu=POD_NAME --restart-tpuvm-pod-server -- python3 /home/taehoon.kim/taming-transformers-tpu/main.py --use_tpus --refresh_rate 1 --disc_start 1 --fake_data
2021-07-08 08:28:12 10.164.0.33 [0] File "/home/taehoon.kim/.local/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/tpu_spawn.py", line 265, in start_training
2021-07-08 08:28:12 10.164.0.33 [0] xmp.spawn(self.new_process, **self.xmp_spawn_kwargs)
2021-07-08 08:28:12 10.164.0.33 [0] File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 388, in spawn
2021-07-08 08:28:12 10.164.0.33 [0] return torch.multiprocessing.start_processes(
2021-07-08 08:28:12 10.164.0.33 [0] File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
2021-07-08 08:28:12 10.164.0.33 [0] while not context.join():
2021-07-08 08:28:12 10.164.0.33 [0] File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 130, in join
2021-07-08 08:28:12 10.164.0.33 [0] raise ProcessExitedException(
2021-07-08 08:28:12 10.164.0.33 [0] torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV
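The traceback shows the child process dying inside Lightning's TPUSpawn plugin, which ultimately delegates to torch_xla's xmp.spawn. One way to narrow down whether the crash lives below Lightning is to drive xmp.spawn directly with the same tensor sizes; the following is only a hedged sketch under that assumption, not code from this issue.

# Hypothetical sanity check: call torch_xla's xmp.spawn directly, bypassing
# Lightning's TPUSpawn plugin. If this also segfaults with 256x256x3 tensors
# on a v3-128 pod, the problem is below Lightning; if it does not, the
# suspicion about Lightning's worker setup is strengthened.
# Assumes torch and torch_xla are installed on each TPU VM worker.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    model = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)
    x = torch.randn(8, 3, 256, 256, device=device)  # the problematic image size
    loss = model(x).mean()
    loss.backward()
    xm.mark_step()  # force execution of the pending XLA graph
    print(f"worker {index} finished one step")

if __name__ == "__main__":
    # On a pod slice, this script would be launched on every host via
    # xla_dist, with 8 cores per host (illustrative launch setup).
    xmp.spawn(_mp_fn, args=(), nprocs=8, start_method="fork")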
Expected behavior
Training runs without any problems.
Environment
Additional context