Error During Multi Core TPU Training #6048
I think the code crashed in xla/torch_xla/csrc/tensor_methods.cpp, lines 339 to 367 (at commit a01de39).
It's hard for me to tell exactly where it crashes. Do you have a small repro?
I am working on a repo to share.
Hello @JackCaoG, here is the repo for debugging. Just update line 21 according to your file system and run run_train_1_1_TPU_multi.py. Please ignore my debug prints at the start. Best regards.
Hello @JackCaoG, is the repo sufficient for you? I can do anything that would help.
I probably only have bandwidth to handle one issue (#6002). @ManfeiBai, do you have cycles to repro this?
Thanks, will do.
Hello @ManfeiBai, I can make any modification to the repo that would help. Best regards.
Thanks, @mfatih7, I'm trying to repro this. I have run commands like:
and got this info now:
I have four devices locally, as shown below. How many devices are supposed to be used in
Hello @ManfeiBai, thank you for your answer. The error is similar to the one I get; only the Python version differs: I am using Python 3.8 and you are using Python 3.10. I can remove the "TPU:1 DEB PNT 0" debug prints and commit again. Is there anything else I can do to help?
Testing locally again. Debugging with pdb, the code seems to crash at
successful log:
The log shows 8 processes, so I modified the config of
So the next step would be to check where the code crashes in, since I tested
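As a reference for the process-count discussion above, here is a minimal sketch (not the repro itself; `_mp_fn` is a placeholder name) of how torch_xla's multiprocessing ties the number of spawned processes to the visible TPU devices:

```python
# Sketch (placeholder names, not the repro itself): how the number of
# spawned processes relates to the visible TPU devices in torch_xla.
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    # Each spawned process drives one XLA device (one core on a v2-8/v3-8).
    print(f"process {index}: ordinal={xm.get_ordinal()}, "
          f"world size={xm.xrt_world_size()}, device={xm.xla_device()}")

if __name__ == "__main__":
    # nprocs=None uses all available devices: 8 processes on a v2-8/v3-8
    # (matching the 8 processes in the log above), 4 on a v4-8.
    xmp.spawn(_mp_fn, args=(), nprocs=None)
```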
Hello @ManfeiBai, thank you for your answer. Here is more information. I appreciate any help in solving the problem.
Do you think that the replicas of the model on TPU cores are different somehow? Here, it is written that
I am trying to implement the second option:
Thanks, my local test crashed at this line.
Thanks, @mfatih7, my next step will be to change the loss function to
~Thanks, @mfatih7, do you want to try again with this modification? I tested locally with this modification and finished with the training and validation log on v4-8:~
This solution is not for v2-8/v3-8 with multiple cores; @mfatih7 gave the right solution below.
Hello @ManfeiBai, thank you very much for your effort, but I think something is wrong with disabling that. I think I have found the solution to the problem: before saving the initial checkpoint, the model must be on the device.
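A minimal sketch of that ordering, with a placeholder model and path (this is not the project's actual code): the replica is moved to the XLA device first, and only then is the initial checkpoint written.

```python
# Sketch (placeholder model and path, not the project's code): move the
# replica onto the XLA device first, then write the initial checkpoint.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    model = torch.nn.Linear(128, 10)

    # Put the replica on the device before saving ...
    model = model.to(device)

    # ... then save. xm.save moves the tensors to CPU, and only the master
    # ordinal actually writes the file.
    xm.save(model.state_dict(), "initial_model.pt")

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=())
```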
Thanks, @mfatih7, you are right, we should use
Thank you @ManfeiBai. Regarding this post, I want to ask a final question. On this line, I am loading the checkpoint and updating the model parameters on all cores. Is this the correct option? Because here it is written that
Should I use
Thanks, good question. I synced with @JackCaoG offline. For question 1, loading the checkpoint and updating the model parameters on all cores on that line is the correct option. For question 2, with the current code we don't need to run it for that situation.
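For the first question, a minimal sketch of what loading on all cores can look like (model and path are placeholders, not the actual code): each spawned process loads the same state dict to CPU, copies it into its replica, and moves the replica to its own device.

```python
# Sketch (placeholder model and path, not the actual code): every spawned
# process loads the same checkpoint and updates its own replica.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()
    model = torch.nn.Linear(128, 10)

    # Load the state dict to CPU in each process, copy it into the replica,
    # then move the replica to this core's device.
    state_dict = torch.load("initial_model.pt", map_location="cpu")
    model.load_state_dict(state_dict)
    model = model.to(device)

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=())
```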
Thank you for your effort, @ManfeiBai.
Hello
While trying to run my training loop on multiple cores of a TPU v2, I get the error below.
Is it related to an XLA error, or do I have errors in my script?
Best regards
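For context, a minimal sketch of the kind of multi-core TPU training loop discussed in this issue; everything here (model, data, file and function names) is a placeholder and is not taken from the reporter's repo.

```python
# Minimal multi-core TPU training sketch (all names are placeholders, not
# the reporter's code), following the usual torch_xla pattern.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp
from torch.utils.data import DataLoader, TensorDataset

def _mp_fn(index):
    device = xm.xla_device()
    model = nn.Linear(128, 10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # Dummy data; a real run would shard the dataset across processes.
    dataset = TensorDataset(torch.randn(1024, 128),
                            torch.randint(0, 10, (1024,)))
    loader = DataLoader(dataset, batch_size=32)
    device_loader = pl.MpDeviceLoader(loader, device)  # feeds this core

    model.train()
    for inputs, targets in device_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        xm.optimizer_step(optimizer)  # all-reduce grads across cores, then step

    xm.master_print(f"done, last loss {loss.item():.4f}")

if __name__ == "__main__":
    # On a v2-8 this spawns 8 processes, one per TPU core.
    xmp.spawn(_mp_fn, args=())
```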