Auto-set DataLoader.worker_init_fn with seed_everything #6960

Conversation
Hello @awaelchli! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-04-16 18:14:00 UTC
Codecov Report

```
@@            Coverage Diff            @@
##           master    #6960    +/-   ##
=========================================
- Coverage      92%      89%      -3%
=========================================
  Files         194      196       +2
  Lines       12386    13219     +833
=========================================
+ Hits        11414    11824     +410
- Misses        972     1395     +423
```
Love the description!
Can you try to give this functionality more visibility throughout the docs? Maybe also set it where seed_everything is used.
@rkern thanks again for your feedback. For torch < 1.6, manual_seed only accepts 32-bit integers, so I had to make a slight modification. I also noticed that on macOS and Windows our custom worker init function for numpy is not necessary; there seems to be no duplication of random state there. I couldn't find a clear answer why, but I think the reason is that on these platforms the worker processes get created with a different method.
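Purely as an illustration of the 32-bit constraint mentioned here (an assumption about the workaround, not the actual change in the PR):

```python
import torch

def seed_torch_compat(seed: int) -> None:
    # torch < 1.6 rejects seeds wider than 32 bits, so reduce the seed first;
    # modular reduction is just one possible choice here.
    torch.manual_seed(seed % (2 ** 32))

seed_torch_compat((1 << 40) + 123)  # a seed that would not fit in 32 bits
```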
This reverts commit 708581a.
@rkern One light bulb after the other turns on. You are of course right, I should have seen this in my testing. I was so focused on avoiding duplicate states that I was completely blind to the non-deterministic state. Thanks a lot for your comment. Appreciate your contribution!
LGTM!
Thanks everyone for your constructive feedback. It was fun to implement this :)
Summary: Pull Request resolved: #56797

After adding a default seeding strategy for the NumPy random module within each DataLoader worker (#56488), two concerns were raised:
- We dropped support for NumPy < 1.17 due to `SeedSequence`.
- In order to support seeding for NumPy < 1.17, how can we provide a seed for `numpy.random`?
  - The first option is to set the same seed as `random`. The problem is that `numpy.random` and `random` share the same algorithm, so with the same seed they produce exactly the same state sequence. Thanks to rkern, we noticed these so-called [bad things](Lightning-AI/pytorch-lightning#6960 (comment)).
  - Considering that most users are not aware of this problem, we can provide a better seed by default for `numpy.random` using the same `SeedSequence` algorithm as NumPy. This is just a workaround, with a hard-coded function that generates an array of four int32 values as the seed.

To better cope with this problem, given that many 3rd-party libraries besides NumPy have a random module, we may eventually need to implement a `SeedSequence` within the `torch.random` module, so that users can `spawn` a new `SeedSequence` for each library.

Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D28000619
Pulled By: ejguan
fbshipit-source-id: 5701c8124a38ea5ded69eb8eee70f9680877ffa6
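For context, on NumPy >= 1.17 the derivation described above can be written directly with `SeedSequence`; the hard-coded workaround for older NumPy reproduces the same idea of generating four 32-bit words from a base seed (a sketch of the idea, not the PyTorch implementation):

```python
import numpy as np

def numpy_seed_words(base_seed: int, worker_id: int) -> np.ndarray:
    # Derive four uint32 words suitable for seeding numpy's legacy global RNG;
    # the PR hard-codes an equivalent derivation so that NumPy < 1.17
    # (which has no SeedSequence) keeps working.
    ss = np.random.SeedSequence([base_seed, worker_id])
    return ss.generate_state(4)

np.random.seed(numpy_seed_words(12345, worker_id=0))
```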
Why was the default not actually changed? However, if I look in ...
It was decided that we have to keep it False to preserve parity between versions (it could influence reproducibility from one version to the next).
What does this PR do?
Proposal:
Make Lightning add a default worker init function for seeding numpy when the user also calls `seed_everything`. This prevents nasty bugs where data augmentations are the same across workers (when using e.g. numpy and not torch as the random number generator).
As suggested by Ethan in #6957 (reply in thread), this feature can be turned on or off with `seed_everything(workers=True|False)`.
Default is currently off (`workers=False`). It doesn't affect parity, and is therefore a safe feature to introduce.
However, the user has to remember to turn this on, something that likely gets forgotten even though we offer this convenient feature. We may want to consider turning it on by default in the future.
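For illustration, opting in and out would look like this (a minimal sketch; only the `workers` flag is the behaviour added by this PR):

```python
from pytorch_lightning import seed_everything

# Opt in: also seeds numpy per DataLoader worker via an auto-set worker_init_fn
seed_everything(42, workers=True)

# Opt out (current default): only the usual global seeding, no worker_init_fn is installed
seed_everything(42, workers=False)
```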
Vote:
- `seed_everything(workers=True)`
- `seed_everything(workers=False)`
Addresses: #6957
Explanation
To understand the seeds and randomness in the workers, consider the following example:
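A minimal sketch of the kind of setup meant here, assuming a dataset that draws from numpy's global RNG in `__getitem__` (class and parameter names are illustrative, not the exact snippet from this PR):

```python
import numpy as np
from torch.utils.data import Dataset, DataLoader

class RandomDataset(Dataset):
    """Returns a fresh numpy random draw per item, standing in for a random augmentation."""

    def __len__(self):
        return 8

    def __getitem__(self, index):
        return np.random.randint(0, 1000, size=3)

if __name__ == "__main__":
    # two workers: without a proper worker_init_fn they inherit the same numpy state
    loader = DataLoader(RandomDataset(), batch_size=2, num_workers=2)
    for batch in loader:
        print(batch)
```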
On the master branch, running the above code, the batches within the rank are repeated, meaning both workers produce the same batch/augmentation!
As suggested in the PyTorch docs and this blog post [1], we can fix this by providing a custom worker init function:
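A minimal version of such a worker init function, along the lines of what the PyTorch docs and the blog post below suggest (a sketch, not the exact code added by this PR):

```python
import numpy as np
import torch

def worker_init_fn(worker_id: int) -> None:
    # torch.initial_seed() inside a worker is already base_seed + worker_id,
    # so deriving the numpy seed from it gives every worker a distinct state.
    # The modulo keeps the value in the 32-bit range numpy's legacy seed() accepts.
    np.random.seed(torch.initial_seed() % (2 ** 32))

# Passed to the loader from the earlier sketch:
# loader = DataLoader(RandomDataset(), batch_size=2, num_workers=2, worker_init_fn=worker_init_fn)
```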
This fixes the problem for training on a single GPU: within the rank, each worker now produces different samples. Good!
However, we still have a problem with multi-GPU! We need to add the global rank to the seed as well. The (hopefully) correct init is:
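A sketch of what a rank-aware init could look like, mixing the per-worker torch seed, the worker id, and the global rank through a `SeedSequence` (how the rank is obtained here is an assumption; the merged Lightning helper may differ in detail):

```python
import os
import random

import numpy as np
import torch

def worker_init_fn(worker_id: int) -> None:
    global_rank = int(os.environ.get("GLOBAL_RANK", "0"))  # assumption: rank provided via env var
    base_seed = torch.initial_seed() - worker_id            # undo the +worker_id the DataLoader already applied
    ss = np.random.SeedSequence([base_seed, worker_id, global_rank])
    np_ss, stdlib_ss = ss.spawn(2)                          # independent child sequences per library
    np.random.seed(np_ss.generate_state(4))                 # four 32-bit words for numpy's global RNG
    random.seed(int(stdlib_ss.generate_state(1)[0]))        # reseed the stdlib RNG as well
```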
With this, our script produces random samples without duplicates across all ranks and workers.
Note: To reproduce this you need to run on Linux, where you will see the duplicated samples. On the other platforms (darwin, win) you won't get duplicates, but the seeds in the worker processes are not deterministically chosen.
References:
[1] https://tanelp.github.io/posts/a-bug-that-plagues-thousands-of-open-source-ml-projects/
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃