Call prepare_data once per node in DDP (torchelastic) #2163
Conversation
Hello @ananthsub! Thanks for updating this PR.
Comment last updated at 2020-06-13 07:37:59 UTC
Mind adding a test for this use case?
@Borda do you have examples of tests for
Not in a template, but you can mock the class (inherit the template) and override prepare_data to add an extra call count, so in the multi-GPU test you can check how many times it was called... similar to what was done for some loggers.
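A rough sketch of that kind of test, kept single-process for brevity; the model class, counter attribute, and test name here are illustrative, not the actual test added in this PR, and a multi-GPU variant would launch DDP and assert the per-node count instead:

import torch
from torch.utils.data import DataLoader, TensorDataset

from pytorch_lightning import LightningModule, Trainer


class CountingModel(LightningModule):
    """Minimal model that records how many times prepare_data is called."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)
        self.prepare_data_calls = 0

    def prepare_data(self):
        self.prepare_data_calls += 1

    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(8, 4), torch.randn(8, 1)), batch_size=4)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {"loss": torch.nn.functional.mse_loss(self.layer(x), y)}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def test_prepare_data_called_once(tmpdir):
    model = CountingModel()
    trainer = Trainer(default_root_dir=tmpdir, fast_dev_run=True)
    trainer.fit(model)
    # in a single process, prepare_data should run exactly once
    assert model.prepare_data_calls == 1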
if (
    not (self.use_ddp or self.use_ddp2)
    or (self.is_slurm_managing_tasks and int(os.environ["SLURM_LOCALID"]) == 0)
    # torchelastic or general non_slurm ddp
    or (
        "WORLD_SIZE" in os.environ
        and ("GROUP_RANK" in os.environ or "NODE_RANK" in os.environ)
        and int(os.environ["LOCAL_RANK"]) == 0
    )
):
I have two concerns in terms of code quality:
- these extra 10 lines of code make the fit method even bigger than it already is.
- it is hard to read (and to debug). Suggestion: break it down like this:
condition1 = ...
condition2 = ...
...
if condition1 and condition2 and not condition3 ...
and choose good names for these variables. How does this sound?
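For reference, a rough sketch of what that refactor could look like inside fit; the variable names are illustrative, not necessarily the ones used in the follow-up PR:

import os

no_ddp = not (self.use_ddp or self.use_ddp2)
slurm_local_zero = self.is_slurm_managing_tasks and int(os.environ["SLURM_LOCALID"]) == 0
# torchelastic or general non-SLURM ddp
elastic_local_zero = (
    "WORLD_SIZE" in os.environ
    and ("GROUP_RANK" in os.environ or "NODE_RANK" in os.environ)
    and int(os.environ["LOCAL_RANK"]) == 0
)

if no_ddp or slurm_local_zero or elastic_local_zero:
    model.prepare_data()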
See #2166; finishing this PR there.
Before submitting
What does this PR do?
SLURM and elastic training create the training processes per node outside of the Lightning context. This means that when fit calls prepare_data, the assumption that it is only called on process 0 no longer holds, and it gets called once per process. This fixes PyTorchLightning#1878 by calling prepare_data once per node, gated on the local rank.
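For context, a rough picture of the environment torchelastic exposes on a 2-node, 2-process-per-node job (values illustrative; under SLURM the analogous variable is SLURM_LOCALID):

# node 0, process 0: WORLD_SIZE=4  GROUP_RANK=0  LOCAL_RANK=0  -> runs prepare_data
# node 0, process 1: WORLD_SIZE=4  GROUP_RANK=0  LOCAL_RANK=1
# node 1, process 0: WORLD_SIZE=4  GROUP_RANK=1  LOCAL_RANK=0  -> runs prepare_data
# node 1, process 1: WORLD_SIZE=4  GROUP_RANK=1  LOCAL_RANK=1

Gating on LOCAL_RANK == 0 therefore runs prepare_data exactly once per node rather than once per process.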
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃