
Bugfix/fix gan example #2019

Merged: 3 commits merged into Lightning-AI:master from the bugfix/fix-gan-example branch on May 31, 2020
Conversation

@lobantseff (Contributor) commented May 30, 2020

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)

What does this PR do?

Fixes #1223

The GAN example in the repo was not working. Fixed the typos so it now works in dp and ddp mode.
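For context, here is a simplified sketch of the alternating-optimizer structure the example follows. It is illustrative only: the actual generative_adversarial_net.py differs in the network architectures, logging, and other details, and the exact Lightning API has changed across versions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl


class GAN(pl.LightningModule):
    """Simplified GAN skeleton with two optimizers (not the repo example verbatim)."""

    def __init__(self, latent_dim: int = 100, img_dim: int = 28 * 28, lr: float = 2e-4):
        super().__init__()
        self.latent_dim = latent_dim
        self.lr = lr
        self.generator = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh()
        )
        self.discriminator = nn.Sequential(
            nn.Linear(img_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid()
        )

    def forward(self, z):
        return self.generator(z)

    def training_step(self, batch, batch_idx, optimizer_idx):
        imgs, _ = batch
        imgs = imgs.view(imgs.size(0), -1)
        # type_as() keeps the sampled noise on the same device as the data,
        # which is what lets the step run under dp/ddp
        z = torch.randn(imgs.size(0), self.latent_dim).type_as(imgs)
        valid = torch.ones(imgs.size(0), 1).type_as(imgs)
        fake_label = torch.zeros(imgs.size(0), 1).type_as(imgs)

        if optimizer_idx == 0:  # generator step
            g_loss = F.binary_cross_entropy(self.discriminator(self(z)), valid)
            return {"loss": g_loss}

        # discriminator step: note the second self(z) call, which is what the
        # caching discussion below is about
        real_loss = F.binary_cross_entropy(self.discriminator(imgs), valid)
        fake_loss = F.binary_cross_entropy(self.discriminator(self(z).detach()), fake_label)
        return {"loss": (real_loss + fake_loss) / 2}

    def configure_optimizers(self):
        opt_g = torch.optim.Adam(self.generator.parameters(), lr=self.lr)
        opt_d = torch.optim.Adam(self.discriminator.parameters(), lr=self.lr)
        return [opt_g, opt_d], []
```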

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

@mergify bot requested a review from a team May 30, 2020 15:45
@Borda added the "bug: Something isn't working" label May 30, 2020
@Borda added this to the 0.8.0 milestone May 30, 2020
@ternaus commented May 30, 2020

Is there a way to implement caching of the generated images?

Without it, the code runs twice as slowly, since the images have to be generated in both the generator and the discriminator phases.

@williamFalcon (Contributor)

@armavox we moved away from hparams.
I made the changes; can you verify?

@ternaus commented May 31, 2020

@williamFalcon is there a way to avoid calling self(z) in the discriminator phase and instead reuse the images generated in the generator phase?

@lobantseff (Contributor, Author)

@williamFalcon ok. Good.

@lobantseff (Contributor, Author) commented May 31, 2020

@ternaus you’re right. However, the problem is that under data-parallel training the model is replicated on every forward() call, and training_step() is then executed on each replica. So the only apparent way to store buffered variables is to save them on the original module before replication, e.g. initialize self.generated_images = None in __init__ and then write the value produced by one of the replicas into it. But each replica generates its own generated_images, so sharing a common buffer seems incorrect.

PyTorch developers warn about that: https://pytorch.org/docs/stable/nn.html#dataparallel-layers-multi-gpu-distributed
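To make that concrete, here is a hypothetical sketch of the buffering idea, reusing the GAN skeleton from the PR description above (the class name and details are illustrative, not code from the example):

```python
class CachedGAN(GAN):
    """Hypothetical caching attempt: works per-process under ddp, but NOT under dp."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.generated_imgs = None  # buffer created on the base module

    def training_step(self, batch, batch_idx, optimizer_idx):
        imgs, _ = batch
        imgs = imgs.view(imgs.size(0), -1)
        z = torch.randn(imgs.size(0), self.latent_dim).type_as(imgs)
        valid = torch.ones(imgs.size(0), 1).type_as(imgs)
        fake_label = torch.zeros(imgs.size(0), 1).type_as(imgs)

        if optimizer_idx == 0:  # generator step: fill the cache
            # Under dp this assignment lands on a throwaway replica, not on the
            # base module, so the cached tensor is lost after the step.
            self.generated_imgs = self(z)
            g_loss = F.binary_cross_entropy(self.discriminator(self.generated_imgs), valid)
            return {"loss": g_loss}

        # discriminator step: reuse the cache instead of calling self(z) again.
        # Under dp self.generated_imgs is still None here, because the replica
        # that set it was discarded; under ddp each process keeps its own module,
        # so the cached tensor is available.
        fake = self.generated_imgs.detach()
        real_loss = F.binary_cross_entropy(self.discriminator(imgs), valid)
        fake_loss = F.binary_cross_entropy(self.discriminator(fake), fake_label)
        return {"loss": (real_loss + fake_loss) / 2}
```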

Maybe someone can propose a good way to overcome this and give each replica its own buffer.

Best regards, Artem.

@williamFalcon (Contributor)

@ternaus yes, the easiest option is to use ddp and NOT dp (dp is not recommended anyhow).

However, we're working on a fix to maintain state in dp. @ananyahjha93
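As an illustration of that suggestion: under ddp every GPU runs its own process with its own module instance, so an attribute cache like the hypothetical CachedGAN sketched above survives between the generator and discriminator steps. Roughly, using the 0.8-era Trainer flags (treat the exact arguments as an assumption):

```python
import pytorch_lightning as pl

# CachedGAN is the hypothetical class from the sketch above; a train
# dataloader is assumed to be defined on the module or passed to fit().
model = CachedGAN()
trainer = pl.Trainer(gpus=2, distributed_backend="ddp", max_epochs=5)
trainer.fit(model)
```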

@williamFalcon merged commit 55fdfe3 into Lightning-AI:master May 31, 2020
@lobantseff deleted the bugfix/fix-gan-example branch May 31, 2020 12:55
@rohitgr7 mentioned this pull request May 31, 2020
@ananyahjha93 (Contributor)

@williamFalcon thanks for pointing me to this updated GAN example.

@lobantseff (Contributor, Author)

@williamFalcon @ananyahjha93 may I see the branch where you are working on that problem?

@ananyahjha93 (Contributor)

@armavox This is a [wip] commit, but the current solution looks quite different from it, so it does not represent what we are working on right now.

#1895

justusschock pushed a commit that referenced this pull request Jun 29, 2020
* 🐛 fixed fake example type assigning and hparams arg

* fixed GAN example to work with dp, ddp, ddp_cpu

* Update generative_adversarial_net.py

Co-authored-by: William Falcon <waf2107@columbia.edu>
Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

gan.py multi-gpu running problems