Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call prepare_data once per node in DDP (torchelastic) #2163

Closed
wants to merge 1 commit into from
Closed

Call prepare_data once per node in DDP (torchelastic) #2163

wants to merge 1 commit into from

Conversation

ananthsub
Copy link
Contributor

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Slurm and elastic training create the training processes per node outside of the lightning context. This means that when the fit function calls prepare_data, the assumption that it's only being called on proc 0 is broken and it gets called for each process.. This fixes PyTorchLightning#1878 by calling prepare_data once per node, depending on the local rank.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks
Copy link

pep8speaks commented Jun 12, 2020

Hello @ananthsub! Thanks for updating this PR.

Line 782:13: W503 line break before binary operator
Line 784:13: W503 line break before binary operator
Line 786:17: W503 line break before binary operator
Line 787:17: W503 line break before binary operator
Line 792:1: W293 blank line contains whitespace

Comment last updated at 2020-06-13 07:37:59 UTC

@mergify mergify bot requested a review from a team June 12, 2020 20:36
@Borda Borda added bug Something isn't working priority: 0 High priority task labels Jun 12, 2020
@Borda Borda added this to the 0.8.0 milestone Jun 12, 2020
@Borda Borda changed the title [trainer] Call prepare_data once per node in DDP/DDP2 training Call prepare_data once per node in DDP training Jun 12, 2020
@Borda
Copy link
Member

Borda commented Jun 12, 2020

Mind add test for this use case?

@ananthsub
Copy link
Contributor Author

ananthsub commented Jun 12, 2020

@Borda do you have examples of tests for prepare_data that I can look at as a template? I didn't see any under https://github.com/PyTorchLightning/pytorch-lightning/blob/master/tests/trainer/test_trainer.py

@williamFalcon williamFalcon changed the title Call prepare_data once per node in DDP training Call prepare_data once per node in DDP training (+ torchelastic) Jun 12, 2020
@williamFalcon williamFalcon changed the title Call prepare_data once per node in DDP training (+ torchelastic) Call prepare_data once per node in DDP (torchelastic) Jun 12, 2020
@Borda
Copy link
Member

Borda commented Jun 12, 2020

Not in a template, but you can mock the class (inherit template) and overview prepare data to do extra call count so in test for multi GPU you can test how many times it was used... Similar as did for some loggers

Comment on lines +780 to +789
if (
not (self.use_ddp or self.use_ddp2)
or (self.is_slurm_managing_tasks and int(os.environ["SLURM_LOCALID"]) == 0)
# torchelastic or general non_slurm ddp
or (
"WORLD_SIZE" in os.environ
and ("GROUP_RANK" in os.environ or "NODE_RANK" in os.environ)
and int(os.environ["LOCAL_RANK"]) == 0
)
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two concerns in terms of code quality:

  1. these extra 10 lines of code make the fit method even bigger than it already should be.
  2. it is hard to read (and to debug). suggestion: break it down like this:
condition1 = ...
condition2 = ... 
... 

if condition1 and condition2 and not condition3 ...

and choose good names for these variables. how does this sound?

@williamFalcon
Copy link
Contributor

See #2166 finishing this PR there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority: 0 High priority task
Projects
None yet
Development

Successfully merging this pull request may close these issues.

prepare_data called multiple times per node for slurm and elastic training
5 participants