Error While Starting 2nd Epoch #6002
Hello, I have more information about this. I would appreciate any help.
Hi again. Even if I run the procedure using .hdf5 files instead of .pickle files, I get the same error at the start of the 2nd chunk of the first epoch.
Hello, I am also using a customized version of BatchSampler. Best regards
Hello. Even if I remove the BatchSampler, I get the same error. After the training of the first chunk finishes, if I force only validation chunks to be processed, no error occurs. Best regards
Let me take a look.
Hello @JackCaoG, here is more information. Here is the relevant part of the training loop:
In the first chunk of the first epoch, loss = loss_1 is selected, and backpropagation and optimization work well. Best regards
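For illustration, a minimal sketch of a conditional-loss training step on a single XLA device; `loss_fn_1`, `loss_fn_2`, and the selection flag are placeholders, not the actual project code:

```python
import torch.nn as nn
import torch_xla.core.xla_model as xm

# Placeholder stand-ins for the two losses defined in loss_functions.py.
loss_fn_1 = nn.MSELoss()
loss_fn_2 = nn.L1Loss()

def train_step(model, optimizer, inputs, targets, use_first_loss):
    outputs = model(inputs)
    # In the 1st chunk the first loss is selected and backprop works fine;
    # the failure only appears once the 2nd chunk starts.
    loss = loss_fn_1(outputs, targets) if use_first_loss else loss_fn_2(outputs, targets)
    optimizer.zero_grad()
    loss.backward()
    # barrier=True forces graph execution on a single XLA device.
    xm.optimizer_step(optimizer, barrier=True)
    return loss
```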
I think it is one of the corner cases where we don't handle some of the view ops properly. Can you dump the IR and HLO using https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#common-debugging-environment-variables-combinations? You should also run it with the environment variables described there. The real error is about a transpose, so I want to check the HLO and see where this transpose is from and why it is being generated (which PyTorch op).
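For reference, a small sketch of the debug-variable combination described on that page; the output path below is only an example, and the variables need to be set before torch_xla is imported:

```python
import os

# Common debugging environment-variable combination from TROUBLESHOOTING.md;
# set these before torch_xla is imported so they take effect.
os.environ["XLA_IR_DEBUG"] = "1"            # annotate IR nodes with Python frame info
os.environ["XLA_HLO_DEBUG"] = "1"           # propagate those annotations into HLO metadata
os.environ["XLA_SAVE_TENSORS_FMT"] = "hlo"  # dump graphs in HLO text format
os.environ["XLA_SAVE_TENSORS_FILE"] = "/tmp/save.hlo"  # example output path

import torch_xla.core.xla_model as xm  # noqa: E402  (import after setting the variables)
```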
I am working on it now.
OK, I think it is from this line of IR:
Let me see if I can find someone to take a look soon. What version of pytorch/xla are you using?
Thank you. I have Torch 2.1.1 and Torch-XLA 2.1.0 on my TPU v2 and v3 VMs. Is this a big update? Best regards
2.1 is fine. Let me see if I have cycles to try to repro this week.
Hello @JackCaoG, is there anything I can do to help with this problem?
I think I understand what the problem is, but I don't fully understand why. The only place I find that will trigger it is xla/torch_xla/csrc/tensor_methods.cpp, lines 1061 to 1077 (at 694047e). The problem is that this update path is being reached with a negative index. Do you have an easy way for me to repro this? Ideally on a single device (no multiprocess), and something that I can just copy, paste, and run.
One way to unblock yourself is to figure out where this update is from and manually use the positive index (this is only possible if this logic is triggered by some Python code, not the C++ code). What I can tell is that it happens on a tensor with a specific size.
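To illustrate that workaround, a hedged sketch; the tensor shape and the indexing line are made up for the example, and the point is only replacing a negative index with its positive equivalent in an in-place update:

```python
import torch

t = torch.zeros(4, 8, 16)  # example shape only, not the tensor from the dump
src = torch.ones(4, 16)

# An in-place update written with a negative index on dim 1 -- the kind of
# pattern that ends up in the index-handling path mentioned above:
#   t[:, -1] = src

# Workaround: compute the positive index explicitly.
t[:, t.size(1) - 1] = src
```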
Hello @JackCaoG, here is the repo for a single-core TPU run. Please change the working-directory line in the config file, then run run_train_1_1_TPU_single.py for debugging. After a successful short train and val cycle, we get the error. The error occurs in operations placed in the functions in loss_functions.py during the backward pass. After your comments, I tried to find the error location in loss_functions.py by making modifications; you can observe the modifications in loss_functions_1_to_1.py. To run training using loss_functions_1_to_1.py you can activate the corresponding line, and you can compare the loss function files. But none of the changes helped. I am ready to make any changes that can help. Best regards.
I am able to repro it; let me look into it. BTW, I am using the nightly, so I enabled the
I saw
I think you want to just add a
OK, thank you.
Should be fixed by #6123. You can use tomorrow's nightly following https://github.com/pytorch/xla#python-packages.
Thank you very much @JackCaoG. Should I wait until the merge before using the .whl file? In terms of speed, do you suggest Python 3.10, or is Python 3.8 fine?
Yeah, you should wait for it to merge and then use the nightly the day after. I don't think the Python version makes that much of a difference; I would just pick whichever fits your current machine environment.
Hello @JackCaoG, I have tested my training loop with the nightly releases and get the following warning:

/home/mfatih/env3_8/lib/python3.8/site-packages/torch/autograd/__init__.py:266: UserWarning: aten::reshape: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)

I think there is an issue regarding the registration of autograd. Thank you.
You need to install nightly for both pytorch and pytorch/xla, so
Actually, I used the following commands:
Let me check your advice.
Either way should work. I saw that last night's nightly build succeeded, so the new nightly whl should have the fix. We just happen to also build the pytorch nightly whl with the torch_xla nightly whl.
When I install using
and run pip list I get
and get the following error at the beginning of the execution:
/usr/bin/env /home/mfatih/env3_8/bin/python /home/mfatih/.vscode-server/extensions/ms-python.python-2023.22.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 48037 -- /home/mfatih/17_featureMatching/run_train_1_1_TPU_single.py
When I install using
and run pip list I get
and get the aten::reshape warning stated above.
What should I do? Thank you.
Does this warning block the training, or does it hang the training?
It does not stop the training.
It doesn't seem like a torch_xla issue when I look into it; I tried your command and I am able to run ResNet on XLA. You can post an issue on PyTorch if you have concerns about this error message. It seems like something new on the pytorch nightly.
What about the installation
Why does this installation not work?
Hello @JackCaoG, I have opened an issue in the PyTorch forum as you suggested.
Hello @JackCaoG, what can we say about the health of backpropagation now? Best regards
Hello
In one of my experiments, training and validation operations on both TPUv2 (single core) and TPUv3 (single core) devices finished successfully for the 1st chunk of the first epoch.
At the start of the 2nd chunk, I got the error below:
The dataloader does not do anything different for the 2nd chunk of the 1st epoch.
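As an illustration of the chunked structure being described, a rough sketch; the chunk counts, batch shapes, and the loader helper are placeholders, not the project's actual dataloader:

```python
import torch

NUM_EPOCHS, NUM_CHUNKS, BATCHES_PER_CHUNK = 2, 3, 4  # placeholder sizes

def chunk_batches(chunk_idx, split):
    # Placeholder: the real project loads a different .hdf5/.pickle chunk here.
    for _ in range(BATCHES_PER_CHUNK):
        yield torch.randn(8, 16), torch.randn(8, 16)

for epoch in range(NUM_EPOCHS):
    for chunk_idx in range(NUM_CHUNKS):
        for inputs, targets in chunk_batches(chunk_idx, "train"):
            pass  # training step runs here; the error appears when chunk 2 starts
        for inputs, targets in chunk_batches(chunk_idx, "val"):
            pass  # validation runs here and completes without error
```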
I appreciate any help.
Best regards