Call prepare_data once per node in DDP (torchelastic) #2163
Conversation
Hello @ananthsub! Thanks for updating this PR.
Comment last updated at 2020-06-13 07:37:59 UTC
Mind adding a test for this use case?
@Borda do you have examples of tests for
Not in a template, but you can mock the class (inherit the template) and override prepare_data to add an extra call count, so in the multi-GPU test you can check how many times it was called... similar to what was done for some loggers.
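A rough sketch of that kind of test, kept single-process for brevity; the model class, counter attribute, and test name here are illustrative, not the actual test added in this PR, and a multi-GPU variant would launch DDP and assert the per-node count instead:

import torch
from torch.utils.data import DataLoader, TensorDataset

from pytorch_lightning import LightningModule, Trainer


class CountingModel(LightningModule):
    """Minimal model that records how many times prepare_data is called."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)
        self.prepare_data_calls = 0

    def prepare_data(self):
        self.prepare_data_calls += 1

    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(8, 4), torch.randn(8, 1)), batch_size=4)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {"loss": torch.nn.functional.mse_loss(self.layer(x), y)}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def test_prepare_data_called_once(tmpdir):
    model = CountingModel()
    trainer = Trainer(default_root_dir=tmpdir, fast_dev_run=True)
    trainer.fit(model)
    # in a single process, prepare_data should run exactly once
    assert model.prepare_data_calls == 1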
if (
    not (self.use_ddp or self.use_ddp2)
    or (self.is_slurm_managing_tasks and int(os.environ["SLURM_LOCALID"]) == 0)
    # torchelastic or general non_slurm ddp
    or (
        "WORLD_SIZE" in os.environ
        and ("GROUP_RANK" in os.environ or "NODE_RANK" in os.environ)
        and int(os.environ["LOCAL_RANK"]) == 0
    )
):
I have two concerns in terms of code quality:
- these extra 10 lines of code make the fit method even bigger than it already is.
- it is hard to read (and to debug). Suggestion: break it down like this:
condition1 = ...
condition2 = ...
...
if condition1 and condition2 and not condition3 ...
and choose good names for these variables. How does this sound?
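For reference, a rough sketch of what that refactor could look like inside fit; the variable names are illustrative, not necessarily the ones used in the follow-up PR:

import os

no_ddp = not (self.use_ddp or self.use_ddp2)
slurm_local_zero = self.is_slurm_managing_tasks and int(os.environ["SLURM_LOCALID"]) == 0
# torchelastic or general non-SLURM ddp
elastic_local_zero = (
    "WORLD_SIZE" in os.environ
    and ("GROUP_RANK" in os.environ or "NODE_RANK" in os.environ)
    and int(os.environ["LOCAL_RANK"]) == 0
)

if no_ddp or slurm_local_zero or elastic_local_zero:
    model.prepare_data()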
See #2166; finishing this PR there.
Before submitting
What does this PR do?
SLURM and elastic training create the training processes per node outside of the Lightning context. This means that when fit calls prepare_data, the assumption that it is only called on process 0 no longer holds, and it gets called once per process. This fixes PyTorchLightning#1878 by calling prepare_data once per node, gated on the local rank.
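For context, a rough picture of the environment torchelastic exposes on a 2-node, 2-process-per-node job (values illustrative; under SLURM the analogous variable is SLURM_LOCALID):

# node 0, process 0: WORLD_SIZE=4  GROUP_RANK=0  LOCAL_RANK=0  -> runs prepare_data
# node 0, process 1: WORLD_SIZE=4  GROUP_RANK=0  LOCAL_RANK=1
# node 1, process 0: WORLD_SIZE=4  GROUP_RANK=1  LOCAL_RANK=0  -> runs prepare_data
# node 1, process 1: WORLD_SIZE=4  GROUP_RANK=1  LOCAL_RANK=1

Gating on LOCAL_RANK == 0 therefore runs prepare_data exactly once per node rather than once per process.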
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃