Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-set DataLoader.worker_init_fn with seed_everything #6960

Merged
merged 41 commits into from
Apr 19, 2021

Conversation

awaelchli
Copy link
Contributor

@awaelchli awaelchli commented Apr 11, 2021

What does this PR do?

Proposal:
Make Lightning add a default worker init function for seeding numpy when the user also calls seed_everyting.
This prevents nasty bugs where data augmentations are the same across workers (when using e.g. numpy and not torch as random number generator).

As suggested by Ethan #6957 (reply in thread) , this feature can be turned on or off with seed_everything(workers=True|False)
Default is currently on (False). It doesn't affect parity, and is therefore a safe feature to introduce.
However, the user has to remember to turn this on, something that likely gets forgotten even if we offer this convenient feature. We may want to consider turning it on in the future.

Vote:

  • 🚀 Default seed_everything(workers=True)
  • 🎉 Default seed_everything(workers=False)

Addresses: #6957

Explanation

To understand the seeds and randomness in the workers, consider the following example:

import os
import torch
from torch.utils.data import Dataset

from pytorch_lightning import LightningModule, Trainer, seed_everything

import numpy as np
from torch.utils.data import Dataset


class RandomDataset(Dataset):
    def __getitem__(self, index):
        # numpy random state is the same in all workers!
        return np.random.randint(0, 10, 3)

    def __len__(self):
        return 16


class BoringModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        print("rank:", self.global_rank, "batch:", batch.view(-1).tolist())
        return None

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = torch.utils.data.DataLoader(RandomDataset(), batch_size=2, num_workers=2)
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=os.getcwd(),
        gpus=2,
        accelerator="ddp",
        limit_train_batches=4,
        max_epochs=2,
        weights_summary=None,
        progress_bar_refresh_rate=0,
    )
    trainer.fit(model, train_data)


if __name__ == '__main__':
    # ---------------------------------------------------
    # comment the line below to get duplicated batches!
    # ---------------------------------------------------
    seed_everything(1)
    run()

On the master branch, running the above code we get an output like this:

rank: 0 batch: [5, 4, 5, 4, 9, 9]
rank: 0 batch: [5, 4, 5, 4, 9, 9] <---- repeated 
rank: 0 batch: [7, 5, 2, 2, 6, 5]
rank: 0 batch: [7, 5, 2, 2, 6, 5] <---- repeated 

rank: 1 batch: [8, 9, 4, 6, 8, 8]
rank: 1 batch: [8, 9, 4, 6, 8, 8] <---- repeated 
rank: 1 batch: [0, 0, 8, 6, 9, 7]
rank: 1 batch: [0, 0, 8, 6, 9, 7] <---- repeated 

As you can see, the batches within the rank are repeated, meaning both workers produce the same batch/augmentation!
As suggested in the PyTorch docs and this blog post, we can fix this by providing custom worker function:

def worker_init_fn(worker_id):
    worker_seed = (torch.initial_seed() + worker_id) % (2**32)
    torch.manual_seed(worker_seed)
    numpy.random.seed(worker_seed)
    random.seed(worker_seed)

dataloader.worker_init_fn = worker_init_fn

This fixes the problem for training on a single gpu. See below the output, in each rank we get different samples. Good!

rank: 0 batch: [2, 2, 3, 0, 8, 9]
rank: 0 batch: [7, 2, 0, 0, 2, 2]
rank: 0 batch: [9, 6, 5, 0, 3, 1]

rank: 1 batch: [2, 2, 3, 0, 8, 9] <---- repeated from rank 0
rank: 1 batch: [7, 2, 0, 0, 2, 2] <---- repeated from rank 0
rank: 1 batch: [9, 6, 5, 0, 3, 1] <---- repeated from rank 0

However we still have problem with multi-gpu!! We need to add the global rank to the seed as well! The (hopefully) correct init is:

def worker_init_fn(worker_id):
    global_rank = rank_zero_only.rank
    worker_seed = (torch.initial_seed() + worker_id + global_rank) % (2**32)
    torch.manual_seed(worker_seed)
    numpy.random.seed(worker_seed)
    random.seed(worker_seed)

dataloader.worker_init_fn = worker_init_fn

And our script produces random samples without duplicates across all ranks and workers:

rank: 0 batch: [2, 2, 3, 0, 8, 9]
rank: 0 batch: [7, 2, 0, 0, 2, 2]
rank: 0 batch: [9, 6, 5, 0, 3, 1]
rank: 0 batch: [0, 8, 2, 6, 3, 2]

rank: 1 batch: [1, 6, 2, 5, 5, 7]
rank: 1 batch: [9, 0, 4, 8, 1, 4]
rank: 1 batch: [7, 2, 2, 1, 6, 6]
rank: 1 batch: [5, 3, 6, 5, 6, 2]

Note: To reproduce this you need to run on linux, where you will see the duplicated samples. On the other platfors (darwin, win) you won't get duplicates, but the seeds in the worker processes are not deterministically chosen.

References:
[1] https://tanelp.github.io/posts/a-bug-that-plagues-thousands-of-open-source-ml-projects/

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

Make sure you had fun coding 🙃

@awaelchli awaelchli added the feature Is an improvement or enhancement label Apr 11, 2021
@pep8speaks
Copy link

pep8speaks commented Apr 11, 2021

Hello @awaelchli! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-16 18:14:00 UTC

@codecov
Copy link

codecov bot commented Apr 11, 2021

Codecov Report

Merging #6960 (8514821) into master (f645df5) will decrease coverage by 3%.
The diff coverage is 55%.

@@           Coverage Diff           @@
##           master   #6960    +/-   ##
=======================================
- Coverage      92%     89%    -3%     
=======================================
  Files         194     196     +2     
  Lines       12386   13219   +833     
=======================================
+ Hits        11414   11824   +410     
- Misses        972    1395   +423     

@awaelchli awaelchli added the data handling Generic data-related topic label Apr 11, 2021
@awaelchli awaelchli marked this pull request as ready for review April 12, 2021 08:12
Copy link
Contributor

@carmocca carmocca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the description!

Can you try and give this functionality more visibility throughout docs? Maybe also setting it where seed_everything is used

tests/trainer/test_dataloaders.py Outdated Show resolved Hide resolved
pytorch_lightning/utilities/seed.py Outdated Show resolved Hide resolved
pytorch_lightning/trainer/data_loading.py Outdated Show resolved Hide resolved
@Borda Borda added this to the 1.3 milestone Apr 12, 2021
requirements.txt Outdated Show resolved Hide resolved
@awaelchli
Copy link
Contributor Author

@rkern thanks again for your feedback. For torch < 1.6 manual_seed only accepts 32-bit integers so I had to do a slight modification. I also noticed that on macos and win our custom worker init function for numpy is not necessary. There seems to be no duplication of random state there. I couldn't find a clear answer why but I think the reason is that on these platforms the worker processes get created with a different method.

requirements.txt Outdated Show resolved Hide resolved
@rkern
Copy link

rkern commented Apr 16, 2021

multiprocessing uses spawn instead of fork on Windows (because there is no fork()) and on macOS in recent Pythons (I think because fork() is a little wonky when interacting with other macOS facilities). This means that instead of copying the state of the process, a new Python process is started up from scratch. That means that the global PRNG underneath np.random gets initialized with fresh OS entropy in each child process. Thus, you will get distinct states, but not deterministic states. You still need the worker_init_fn to give them appropriate seeds if you want determinism.

@awaelchli
Copy link
Contributor Author

@rkern One light bulb after the other turns on. You are of course right, I should have seen this in my testing. I was too much focused on avoiding duplicate states that I was completely blind to see the non-deterministic state. Thanks a lot for your comment. Appreciate your contribution!

@carmocca carmocca added the ready PRs ready to be merged label Apr 19, 2021
Copy link
Contributor

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !

pytorch_lightning/trainer/data_loading.py Show resolved Hide resolved
@awaelchli awaelchli merged commit 60c1c8f into master Apr 19, 2021
@awaelchli awaelchli deleted the feature/worker_init_fn branch April 19, 2021 14:28
@awaelchli
Copy link
Contributor Author

Thanks everyone for your constructive feedback. It was fun to implement this :)

ejguan added a commit to pytorch/pytorch that referenced this pull request Apr 26, 2021
After adding default seeding strategy for NumPy random module within each worker of DataLoader #56488, two concerns are raised:
- We dropped the support for NumPy < 1.17 due to `SeedSequence`
- In order to support seeding for NumPy < 1.17, how can we provide seed for `numpy.random`?
  - First option is set the same seed as `random`. But, the problem is a same algorithm is shared between `numpy.random` and `random`. With the same seed, they will have exact same state sequence. Thanks to @rkern, we noticed this so-called [bad things](Lightning-AI/pytorch-lightning#6960 (comment)).
  - Considering most of users do not aware this problem, we can provide a better seed by default for `numpy.random` using same `SeedSequence` algorithm as numpy. This is just a workaround with hard-coded function to generate an array of four int32 as the seed.

To better coping with this problem since there are amount of 3rd party libraries not just `NumPy` having random module. We may at the end need to implement a `SeedSequence` within `torch.random` module, then users can `spawn` a new `SeedSequence` for each library.

[ghstack-poisoned]
ejguan added a commit to pytorch/pytorch that referenced this pull request Apr 26, 2021
After adding default seeding strategy for NumPy random module within each worker of DataLoader #56488, two concerns are raised:
- We dropped the support for NumPy < 1.17 due to `SeedSequence`
- In order to support seeding for NumPy < 1.17, how can we provide seed for `numpy.random`?
  - First option is set the same seed as `random`. But, the problem is a same algorithm is shared between `numpy.random` and `random`. With the same seed, they will have exact same state sequence. Thanks to @rkern, we noticed this so-called [bad things](Lightning-AI/pytorch-lightning#6960 (comment)).
  - Considering most of users do not aware this problem, we can provide a better seed by default for `numpy.random` using same `SeedSequence` algorithm as numpy. This is just a workaround with hard-coded function to generate an array of four int32 as the seed.

To better coping with this problem since there are amount of 3rd party libraries not just `NumPy` having random module. We may at the end need to implement a `SeedSequence` within `torch.random` module, then users can `spawn` a new `SeedSequence` for each library.

Differential Revision: [D28000619](https://our.internmc.facebook.com/intern/diff/D28000619)

[ghstack-poisoned]
ejguan added a commit to pytorch/pytorch that referenced this pull request Apr 26, 2021
After adding default seeding strategy for NumPy random module within each worker of DataLoader #56488, two concerns are raised:
- We dropped the support for NumPy < 1.17 due to `SeedSequence`
- In order to support seeding for NumPy < 1.17, how can we provide seed for `numpy.random`?
  - First option is set the same seed as `random`. But, the problem is a same algorithm is shared between `numpy.random` and `random`. With the same seed, they will have exact same state sequence. Thanks to @rkern, we noticed this so-called [bad things](Lightning-AI/pytorch-lightning#6960 (comment)).
  - Considering most of users do not aware this problem, we can provide a better seed by default for `numpy.random` using same `SeedSequence` algorithm as numpy. This is just a workaround with hard-coded function to generate an array of four int32 as the seed.

To better coping with this problem since there are amount of 3rd party libraries not just `NumPy` having random module. We may at the end need to implement a `SeedSequence` within `torch.random` module, then users can `spawn` a new `SeedSequence` for each library.

Differential Revision: [D28000619](https://our.internmc.facebook.com/intern/diff/D28000619)

[ghstack-poisoned]
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Apr 27, 2021
Summary:
Pull Request resolved: #56797

After adding default seeding strategy for NumPy random module within each worker of DataLoader #56488, two concerns are raised:
- We dropped the support for NumPy < 1.17 due to `SeedSequence`
- In order to support seeding for NumPy < 1.17, how can we provide seed for `numpy.random`?
  - First option is set the same seed as `random`. But, the problem is a same algorithm is shared between `numpy.random` and `random`. With the same seed, they will have exact same state sequence. Thanks to rkern, we noticed this so-called [bad things](Lightning-AI/pytorch-lightning#6960 (comment)).
  - Considering most of users do not aware this problem, we can provide a better seed by default for `numpy.random` using same `SeedSequence` algorithm as numpy. This is just a workaround with hard-coded function to generate an array of four int32 as the seed.

To better coping with this problem since there are amount of 3rd party libraries not just `NumPy` having random module. We may at the end need to implement a `SeedSequence` within `torch.random` module, then users can `spawn` a new `SeedSequence` for each library.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D28000619

Pulled By: ejguan

fbshipit-source-id: 5701c8124a38ea5ded69eb8eee70f9680877ffa6
crcrpar pushed a commit to crcrpar/pytorch that referenced this pull request May 7, 2021
Summary:
Pull Request resolved: pytorch#56797

After adding default seeding strategy for NumPy random module within each worker of DataLoader pytorch#56488, two concerns are raised:
- We dropped the support for NumPy < 1.17 due to `SeedSequence`
- In order to support seeding for NumPy < 1.17, how can we provide seed for `numpy.random`?
  - First option is set the same seed as `random`. But, the problem is a same algorithm is shared between `numpy.random` and `random`. With the same seed, they will have exact same state sequence. Thanks to rkern, we noticed this so-called [bad things](Lightning-AI/pytorch-lightning#6960 (comment)).
  - Considering most of users do not aware this problem, we can provide a better seed by default for `numpy.random` using same `SeedSequence` algorithm as numpy. This is just a workaround with hard-coded function to generate an array of four int32 as the seed.

To better coping with this problem since there are amount of 3rd party libraries not just `NumPy` having random module. We may at the end need to implement a `SeedSequence` within `torch.random` module, then users can `spawn` a new `SeedSequence` for each library.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D28000619

Pulled By: ejguan

fbshipit-source-id: 5701c8124a38ea5ded69eb8eee70f9680877ffa6
@robogast
Copy link

robogast commented May 11, 2021

Why was the default not actually changed?
The votes indicates that the default for workers should be True, and the 1.3 release notes mention this PR.

However, if I look in master::seed_everything, the default is workers=False.

@awaelchli
Copy link
Contributor Author

It was decided that we have to keep it False to keep parity between versions (it could influence reproducibility from one version to the next).

krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Pull Request resolved: pytorch#56797

After adding default seeding strategy for NumPy random module within each worker of DataLoader pytorch#56488, two concerns are raised:
- We dropped the support for NumPy < 1.17 due to `SeedSequence`
- In order to support seeding for NumPy < 1.17, how can we provide seed for `numpy.random`?
  - First option is set the same seed as `random`. But, the problem is a same algorithm is shared between `numpy.random` and `random`. With the same seed, they will have exact same state sequence. Thanks to rkern, we noticed this so-called [bad things](Lightning-AI/pytorch-lightning#6960 (comment)).
  - Considering most of users do not aware this problem, we can provide a better seed by default for `numpy.random` using same `SeedSequence` algorithm as numpy. This is just a workaround with hard-coded function to generate an array of four int32 as the seed.

To better coping with this problem since there are amount of 3rd party libraries not just `NumPy` having random module. We may at the end need to implement a `SeedSequence` within `torch.random` module, then users can `spawn` a new `SeedSequence` for each library.

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D28000619

Pulled By: ejguan

fbshipit-source-id: 5701c8124a38ea5ded69eb8eee70f9680877ffa6
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data handling Generic data-related topic feature Is an improvement or enhancement ready PRs ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants