Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable train dataloader shuffle when overfit_batches is active. #3501

Merged

Conversation

PhilJd
Copy link
Contributor

@PhilJd PhilJd commented Sep 15, 2020

What does this PR do?

This PR disables training data shuffling when overfitting batches. This was reported but not discussed in #2600.
Still, I've created this PR proactively as I think fixing this is quite important for several reasons:

  • The default for training is to use data shuffling, so overfit_batches is very likely to not work out of the box for the majority of users.
  • PL issues a warning and states that "We are turning training shuffle off for you" while actually turning shuffling off for val/test, which is misleading and encourages to look for a bug elsewhere when overfitting does not work (i.e., in the model code).

As this is not a new feature, I hope this fix might still make it into 1.0 ;)

Fixes #2600.

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks
Copy link

pep8speaks commented Sep 15, 2020

Hello @PhilJd! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-15 09:07:19 UTC

@mergify mergify bot requested a review from a team September 15, 2020 07:55
@PhilJd PhilJd force-pushed the no_train_shuffle_in_overfit_batches branch from 4028a4b to 319cdeb Compare September 15, 2020 07:56
@PhilJd PhilJd force-pushed the no_train_shuffle_in_overfit_batches branch from 319cdeb to 8cf5ad5 Compare September 15, 2020 07:57
@codecov
Copy link

codecov bot commented Sep 15, 2020

Codecov Report

Merging #3501 into master will increase coverage by 1%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #3501    +/-   ##
=======================================
+ Coverage      90%     91%    +1%     
=======================================
  Files         107     107            
  Lines        8149    8025   -124     
=======================================
- Hits         7332    7291    -41     
+ Misses        817     734    -83     

@williamFalcon
Copy link
Contributor

williamFalcon commented Sep 15, 2020

merging. failures are unrelated.

@tgaddair can you take a look at this in a follow up PR?

Not sure what's happening with this script?

Tue Sep 15 08:10:00 2020[1]<stderr>:  File "/home/runner/work/pytorch-lightning/pytorch-lightning/tests/models/data/horovod/train_default_model.py", line 36, in <module>
Tue Sep 15 08:10:00 2020[1]<stderr>:    from tests.base import EvalModelTemplate  # noqa: E402
Tue Sep 15 08:10:00 2020[1]<stderr>:ModuleNotFoundError: No module named 'tests.base'
Traceback (most recent call last):
  File "/usr/share/miniconda/envs/lightning/bin/horovodrun", line 8, in <module>
    sys.exit(run_commandline())
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/launch.py", line 722, in run_commandline
    _run(args)
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/launch.py", line 712, in _run
    return _run_static(args)
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/launch.py", line 573, in _run_static
    _launch_job(args, settings, nics, command)
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/launch.py", line 688, in _launch_job
    args.verbose)
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/launch.py", line 661, in run_controller
    gloo_run()
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/launch.py", line 677, in gloo_run_fn
    gloo_run(settings, nics, env, driver_ip, command)
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/gloo_run.py", line 271, in gloo_run
    launch_gloo(command, exec_command, settings, nics, env, server_ip)
  File "/usr/share/miniconda/envs/lightning/lib/python3.7/site-packages/horovod/runner/gloo_run.py", line 258, in launch_gloo
    .format(name=name, code=exit_code))
RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

@williamFalcon williamFalcon merged commit b5dc699 into Lightning-AI:master Sep 15, 2020
@tgaddair
Copy link
Contributor

Hey @williamFalcon, I can't repro this issue locally. What was the environment when the error occurred? Seems in that test it was unable to add the tests to the syspath.

@Borda Borda added the bug Something isn't working label Sep 15, 2020
@Borda Borda added this to the 0.9.x milestone Sep 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Trainer flag overfit_batches does not overwrite train dataloaders shuffle flag
5 participants