Disable train dataloader shuffle when overfit_batches is active. #3501

PhilJd · 2020-09-15T07:54:21Z

What does this PR do?

This PR disables training data shuffling when overfitting batches. This was reported but not discussed in #2600.
Still, I've created this PR proactively as I think fixing this is quite important for several reasons:

The default for training is to shuffle the data, so overfit_batches is very likely not to work out of the box for the majority of users.
PL issues a warning stating "We are turning training shuffle off for you" while actually turning shuffling off only for val/test. This is misleading and encourages users to look for a bug elsewhere (i.e., in the model code) when overfitting does not work.
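
To illustrate the first point, here is a minimal sketch in plain Python. It is not Lightning's implementation: `first_batches` and `samples_seen` are hypothetical helpers, the dataset size, batch size, and batch limit are made-up numbers, and it assumes that overfitting on a few batches amounts to consuming only the first few batches of every epoch.

```python
import random


def first_batches(num_samples, batch_size, limit_batches, shuffle, rng):
    """Return the index batches consumed in one epoch when only the
    first `limit_batches` batches are used (hypothetical helper)."""
    indices = list(range(num_samples))
    if shuffle:
        # A fresh permutation every epoch, as a shuffling sampler would do.
        rng.shuffle(indices)
    batches = [indices[i:i + batch_size]
               for i in range(0, num_samples, batch_size)]
    return batches[:limit_batches]


def samples_seen(epochs, shuffle, seed=0):
    """Collect every distinct sample index touched over `epochs` epochs."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(epochs):
        # Assumed setup: 1000 samples, batch size 10, 2 batches per epoch.
        for batch in first_batches(1000, 10, 2, shuffle, rng):
            seen.update(batch)
    return seen


fixed = samples_seen(epochs=20, shuffle=False)
shuffled = samples_seen(epochs=20, shuffle=True)

print(len(fixed))     # 20: the same two batches in every epoch
print(len(shuffled))  # far more than 20: different samples each epoch
```

Without shuffling, the model sees the same 20 samples in every epoch and can memorize them. With shuffling, the "first two batches" change each epoch, so the run is effectively ordinary training on a tiny, changing subset.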

As this is not a new feature, I hope this fix might still make it into 1.0 ;)

Fixes #2600.

Before submitting

Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?
Did you verify new and existing tests pass locally with your changes?
If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

pep8speaks · 2020-09-15T07:54:25Z

Hello @PhilJd! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-15 09:07:19 UTC

codecov · 2020-09-15T08:22:42Z

Codecov Report

Merging #3501 into master will increase coverage by 1%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #3501    +/-   ##
=======================================
+ Coverage      90%     91%    +1%     
=======================================
  Files         107     107            
  Lines        8149    8025   -124     
=======================================
- Hits         7332    7291    -41     
+ Misses        817     734    -83

williamFalcon · 2020-09-15T09:05:21Z

merging. failures are unrelated.

@tgaddair can you take a look at this in a follow up PR?

Not sure what's happening with this script?

Tue Sep 15 08:10:00 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/models/data/horovod/train_default_model.py", line 36, in <module>
Tue Sep 15 08:10:00 2020[1]<stderr>:    from tests.base import EvalModelTemplate  # noqa: E402
Tue Sep 15 08:10:00 2020[1]<stderr>:ModuleNotFoundError: No module named 'tests.base'
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/lightning/bin/horovodrun", line 8, in <module>
    sys.exit(run_commandline())
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/launch.py", line 722, in run_commandline
    _run(args)
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/launch.py", line 712, in _run
    return _run_static(args)
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/launch.py", line 573, in _run_static
    _launch_job(args, settings, nics, command)
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/launch.py", line 688, in _launch_job
    args.verbose)
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/launch.py", line 661, in run_controller
    gloo_run()
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/launch.py", line 677, in gloo_run_fn
    gloo_run(settings, nics, env, driver_ip, command)
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/gloo_run.py", line 271, in gloo_run
    launch_gloo(command, exec_command, settings, nics, env, server_ip)
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/gloo_run.py", line 258, in launch_gloo
    .format(name=name, code=exit_code))
RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

tgaddair · 2020-09-15T20:41:40Z

Hey @williamFalcon, I can't repro this issue locally. What was the environment when the error occurred? It seems that in that test the script was unable to add the tests package to sys.path.
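
For context, a standalone script can only import `tests.base` if the repository root is on `sys.path`. The sketch below shows one common way to arrange that. It is an assumption about the general pattern, not the contents of `train_default_model.py` and not a confirmed fix for this failure; `add_repo_root_to_syspath` is a hypothetical helper, and `/repo` is a placeholder root.

```python
import os
import sys


def add_repo_root_to_syspath(script_path, levels_up):
    """Put the directory `levels_up` levels above the script's own
    directory at the front of sys.path (hypothetical helper)."""
    root = os.path.abspath(os.path.dirname(script_path))
    for _ in range(levels_up):
        root = os.path.dirname(root)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root


# Placeholder layout mirroring the relative path in the traceback:
# <root>/tests/models/data/horovod/train_default_model.py
# The script's directory is four levels below the repository root.
root = add_repo_root_to_syspath(
    "/repo/tests/models/data/horovod/train_default_model.py",
    levels_up=4,
)
print(root)
```

If the process launched by horovodrun starts with a different working directory or a script path that resolves differently, a computation like this can point at the wrong directory, which would produce the same `ModuleNotFoundError`.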

mergify bot requested a review from a team September 15, 2020 07:55

PhilJd force-pushed the no_train_shuffle_in_overfit_batches branch from 4028a4b to 319cdeb Compare September 15, 2020 07:56

Disable train dataloader shuffle when overfit_batches is active.

8cf5ad5

PhilJd force-pushed the no_train_shuffle_in_overfit_batches branch from 319cdeb to 8cf5ad5 Compare September 15, 2020 07:57

pep8

bd252f6

williamFalcon merged commit b5dc699 into Lightning-AI:master Sep 15, 2020

Borda added the bug Something isn't working label Sep 15, 2020

Borda added this to the 0.9.x milestone Sep 15, 2020

This was referenced Sep 17, 2020

overfit_batches doesn't work #2311

Closed

Fix overfit_batches > 0 on distributed_backend = "ddp" #3534

Merged
