DDP TensorMetric and NumpyMetric exception #3507
Comments
@Vozf mind sharing an example? |
I can't really share the whole repo, but the error starts reproducing with this small change. If you have specific questions, I'll happily answer.
The error mentioned above reproduces (the only change is that the superclass is switched to TensorMetric):
Basically all that's done is computation of this metric
And inside test_step
MAE is just an example; this would reproduce for any metric with a TensorMetric or NumpyMetric superclass. |
@Vozf mind testing it on the current master? We have fixed several issues with metrics recently... |
Well, the error got worse. DDP NumpyMetric.
On a single GPU everything works as in 0.9.0. |
The error above is for NumpyMetric (edited comment).
This occurs even with a single GPU, so it seems like a major bug in current master (0.9.1rc3). |
@Vozf could you move the initialization of the metric to |
It's inside init. Sorry for misunderstanding, I wanted to clarify what |
Could you provide a colab notebook where I can reproduce the errors you are getting? |
Does Colab have multi-GPU support to reproduce DDP? |
@Vozf unfortunately just one GPU or TPU |
So it doesn't seem like an option unless you are planning to download the notebook and execute it on multiple GPUs. |
I can execute on multiple GPUs; I just need some (minimal) code that can reproduce your error. |
Ok, I'll try to build an example |
Tested with 0.9.0. The error reproduces. The key part is the superclass of MAE: in one case it works, in the other it raises a DDP exception. The errors I mentioned with 0.9.1 should reproduce too, but I didn't test with it. |
If changed to single gpu( |
I took a look at this and cannot reproduce it on master, which means that your problem was most likely solved by PR #3245. Thus, if you install the latest version of master, you should be fine. |
Tried to reproduce on master, and wasn't able to. |
If that helps |
🐛 Bug
When trying to train with metrics that inherit from TensorMetric or NumpyMetric, an exception occurs.
pytorch 1.6.0
pytorch-lightning == 0.9.0