
Unet: disable deterministic on training #2204

Closed · wants to merge 4 commits
Conversation

@bhack (Contributor) commented Mar 21, 2024

It seems that the backward of some upsample CUDA kernels is not deterministic:
pytorch/pytorch#121324 (comment)

/cc @ezyang
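For context, a minimal sketch of the kind of two-run check that surfaces this; the shapes and upsample mode below are illustrative assumptions, not the exact torchbench configuration, and whether the gradients actually differ depends on size, dtype, and hardware:

```python
import torch
import torch.nn.functional as F

def upsample_backward_grad() -> torch.Tensor:
    # Fixed seed/input so any difference between runs comes from the kernel itself.
    torch.manual_seed(0)
    x = torch.randn(1, 3, 64, 64, device="cuda", requires_grad=True)
    y = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
    y.sum().backward()
    return x.grad.clone()

# Two identical eager runs; a non-deterministic backward kernel can make these differ.
print(torch.equal(upsample_backward_grad(), upsample_backward_grad()))
```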

@bhack (Contributor Author) commented Mar 21, 2024

It seems the CI doesn't like train_deterministic: false together with:

torch.backends.cudnn.deterministic = True

Is there a way to enable it only at inference?
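One possible shape for that, sketched under the assumption that the benchmark exposes separate train() and eval() entry points; the Model class and the restore step are illustrative, not the actual torchbench code:

```python
import torch

class Model:
    def train(self):
        # Sketch only: training leaves torch.backends.cudnn.deterministic at its
        # default (False), so the non-deterministic upsample backward is allowed.
        ...

    def eval(self):
        # Request deterministic cuDNN kernels for inference only, then restore
        # the previous value so later training runs are not affected.
        prev = torch.backends.cudnn.deterministic
        torch.backends.cudnn.deterministic = True
        try:
            with torch.no_grad():
                ...  # forward pass
        finally:
            torch.backends.cudnn.deterministic = prev
```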

@ezyang (Contributor) commented Mar 22, 2024

This is the wrong place to put it, I think; you want to put it in the benchmark runner in pytorch/pytorch.

@bhack (Contributor Author) commented Mar 22, 2024

What do you mean?

@bhack (Contributor Author) commented Mar 22, 2024

Also, the CI failures are not related to this PR.

bhack mentioned this pull request on Mar 22, 2024
@ezyang (Contributor) commented Mar 24, 2024

There's a separate benchmark runner in pytorch/pytorch, and you should be able to set this on unet only there. It will be easier to deploy too, since making the change here means you also have to then bump the torchbench hash on pytorch/pytorch lol
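Roughly the kind of per-model override that would mean, sketched here with hypothetical names; the real benchmarks/dynamo runner has its own configuration mechanism, and this is not its actual API:

```python
import torch

# Hypothetical set: models whose eager backward is known to be non-deterministic,
# so determinism would only be enforced for their inference runs.
TRAIN_NOT_DETERMINISTIC = {"pytorch_unet"}

def configure_determinism(model_name: str, is_training: bool) -> None:
    # Sketch of a runner-side hook; the name and placement are assumptions.
    torch.backends.cudnn.deterministic = not (
        is_training and model_name in TRAIN_NOT_DETERMINISTIC
    )
```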

@bhack (Contributor Author) commented Mar 24, 2024

The problem is that the pytorch/pytorch runner does not seem to work the same way.

It seems we don't have the metadata/machinery there to disable determinism only for training/backward, as we have here.

Also, pytorch/pytorch#121324 was reverted because it was failing here/on the HUD and not on pytorch/pytorch.

Is this interpretation correct?
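For reference, the per-phase "meta" being referred to is the model's metadata.yaml in this repository, which splits the flag between train and eval; with this PR the pytorch_unet entry ends up roughly as in the diff further down, i.e.:

```yaml
# torchbenchmark/models/pytorch_unet/metadata.yaml (abridged)
eval_benchmark: false
eval_deterministic: true
eval_nograd: true
train_benchmark: false
train_deterministic: false
```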

@xuzhao9 (Contributor) commented Mar 24, 2024

> The problem is that the pytorch/pytorch runner does not seem to work the same way.
>
> It seems we don't have the metadata/machinery there to disable determinism only for training/backward.
>
> Also, the PR was reverted because it was failing here/on the HUD and not on pytorch/pytorch.
>
> Is this interpretation correct?

The failing CI in this PR is due to a separate pytorch/pytorch issue: pytorch/pytorch#122575. There is a CUDA memory-leak bug caused by another pytorch PR.

@bhack (Contributor Author) commented Mar 24, 2024

> The failing CI in this PR is due to a separate pytorch/pytorch issue: pytorch/pytorch#122575. There is a CUDA memory-leak bug caused by another pytorch PR.

Thanks, but that is unrelated to the comment I mentioned, which was more about why we need this PR here instead of in pytorch/pytorch.

@bhack (Contributor Author) commented Mar 28, 2024

@lezcano
@lezcano (Contributor) commented Mar 28, 2024

Can you give more context? In what way is the eager implementation non-deterministic?

@bhack (Contributor Author) commented Mar 28, 2024

We had an eager-mode reproducibility issue (two runs differing) emerge at:
pytorch/pytorch#121324 (comment)

@bhack (Contributor Author) commented Mar 29, 2024

Green light here from the CI

ezyang requested a review from xuzhao9 on March 29, 2024 at 20:31
@xuzhao9 (Contributor) left a comment

LGTM

@facebook-github-bot

@xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -89,6 +88,7 @@ def jit_callback(self):
         self.model = torch.jit.script(self.model)

     def eval(self) -> Tuple[torch.Tensor]:
+        torch.backends.cudnn.deterministic = True
Contributor:
This function is not being used by the downstream benchmarks/dynamo/torchbench.py. So I am wondering how this change will fix the CI issue in pytorch/pytorch? @bhack

Contributor Author:

The problem, more generally, is that the pytorch/pytorch benchmark runner does not take the deterministic metadata field in this repository into account, so it is hard to understand what to do there.

@xuzhao9 (Contributor) commented Apr 1, 2024

@bhack I am fine with changing it here, so long as the eval() and train() code is consistent with the metadata.yaml.

However, I suggest:

  1. In the upstream code (https://github.com/milesial/Pytorch-UNet/blob/master/train.py), there is no specific change on torch.backends.cudnn.deterministic, so it seems we should avoid changing the default value at all, to keep it consistent with upstream.

  2. Please note that the CI in pytorch/pytorch does not call eval(). Instead, it only calls get_module() and uses its own benchmark function in benchmarks/dynamo/torchbench.py. I am wondering how changing the eval() function will fix the CI in pytorch/pytorch.

@bhack (Contributor Author) commented Apr 1, 2024

I don't understand point 1.
Are we still manipulating it in train?

About point 2, I will reply inline.

@xuzhao9 (Contributor) commented Apr 1, 2024

My explanation of point 1:
The goal of this repo is to provide a quick test of popular PyTorch model code on a single GPU device, so we would like to keep behavior consistent with the upstream model code. The upstream model code of pytorch_unet is https://github.com/milesial/Pytorch-UNet. In that upstream repo, the train code simply does not touch the config option torch.backends.cudnn.deterministic, i.e., it keeps the default value. Therefore, it seems to me that we should not tweak this option in our code either. @bhack

@bhack (Contributor Author) commented Apr 1, 2024

> In that upstream repo, the train code simply does not touch the config option torch.backends.cudnn.deterministic, i.e., it keeps the default value. Therefore, it seems to me that we should not tweak this option in our code either.

I still don't understand. Are we tweaking it in train? It seems to me we are not, since it was set in the init and this PR moves it to eval only.

@xuzhao9 (Contributor) commented Apr 1, 2024

> > In that upstream repo, the train code simply does not touch the config option torch.backends.cudnn.deterministic, i.e., it keeps the default value. Therefore, it seems to me that we should not tweak this option in our code either.
>
> I still don't understand. Are we tweaking it in train? It seems to me we are not, since it was set in the init and this PR moves it to eval only.

Previously, on lines 9-10 (https://github.com/pytorch/benchmark/blob/main/torchbenchmark/models/pytorch_unet/__init__.py#L9-L10), these options were set at import time, so we were changing the default behavior in both train and eval. To me it seems we should remove these two lines and always use the default value of PyTorch (since the upstream code uses the default value, too).

By default:

$ python -c "import torch; print(torch.backends.cudnn.deterministic)"
False

So I think that will be sufficient for your request?

@@ -6,7 +6,6 @@
 from torch import optim
 from typing import Tuple

-torch.backends.cudnn.deterministic = True
 torch.backends.cudnn.benchmark = False
Contributor:

I think we should also remove this line, to keep behavior consistent with upstream.

@@ -89,6 +88,7 @@ def jit_callback(self):
         self.model = torch.jit.script(self.model)

     def eval(self) -> Tuple[torch.Tensor]:
+        torch.backends.cudnn.deterministic = True
Contributor:

What about removing line 91 (torch.backends.cudnn.deterministic = True) AND lines 9-10 (torch.backends.cudnn.benchmark = False and torch.backends.cudnn.deterministic = True) altogether?
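Under that proposal the model file would carry no cudnn overrides at all; a minimal sketch of the resulting shape (the Model class body here is illustrative, not the actual pytorch_unet benchmark code):

```python
from typing import Tuple

import torch

# No module-level torch.backends.cudnn.benchmark / .deterministic assignments:
# PyTorch's defaults (both False) apply to train and eval alike.

class Model:
    def __init__(self) -> None:
        self.example_output = torch.zeros(1)

    def eval(self) -> Tuple[torch.Tensor]:
        # No per-phase override here either.
        with torch.no_grad():
            return (self.example_output,)
```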

@@ -5,4 +5,4 @@ eval_benchmark: false
 eval_deterministic: true
 eval_nograd: true
 train_benchmark: false
-train_deterministic: true
+train_deterministic: false
Contributor Author:

With all the removals you are proposing, what will the scope of this metadata be?
Also, who controls eager_two_runs_differ? See pytorch/pytorch#121324 (comment)

Contributor:

@bhack The metadata would be the default value of PyTorch, which is False for both train and eval.


Contributor:

> Is it in this repo too? github.com/search?q=repo%3Apytorch%2Fbenchmark%20eager_two&type=code

The version in pytorch/pytorch is the ground truth, and the pytorch/benchmark version will be auto-synced from pytorch/pytorch.

Contributor Author:

But who is going to define if two unet eager runs need to be the same or not?

Contributor Author:

It seems that it could be deterministic when compiled if I have interpreted this correctly:
pytorch/pytorch#121769 (comment)

What do you think?
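A quick way to check that reading, sketched with illustrative shapes (not the benchmark's actual inputs): compare two eager runs against two compiled runs of the same backward.

```python
import torch
import torch.nn.functional as F

def grad_of(fn) -> torch.Tensor:
    torch.manual_seed(0)
    x = torch.randn(1, 3, 64, 64, device="cuda", requires_grad=True)
    fn(x).sum().backward()
    return x.grad.clone()

upsample = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
compiled = torch.compile(upsample)

# Eager backward may hit a non-deterministic CUDA kernel...
print("eager equal:   ", torch.equal(grad_of(upsample), grad_of(upsample)))
# ...while the compiled backward may be deterministic, per the linked comment.
print("compiled equal:", torch.equal(grad_of(compiled), grad_of(compiled)))
```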

Contributor:

> But who is going to define if two unet eager runs need to be the same or not?

Sorry about the confusion. Yes, it is defined by pytorch/benchmark: https://github.com/pytorch/benchmark/blob/main/test.py#L72. If two eager runs differ, the pytorch/benchmark CI will fail.
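Conceptually, that check does something like the following; this is a paraphrased sketch with hypothetical helper names, not the actual test.py code:

```python
import torch

def eager_two_runs_differ(make_model, make_input) -> bool:
    # Sketch: run the same eager forward+backward twice from identical seeds
    # and report whether any parameter gradient differs.
    runs = []
    for _ in range(2):
        torch.manual_seed(0)
        model = make_model().cuda()
        x = make_input().cuda()
        model(x).sum().backward()
        runs.append([p.grad.clone() for p in model.parameters() if p.grad is not None])
    return any(not torch.equal(a, b) for a, b in zip(*runs))
```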

Contributor Author:

So, since it seems to be deterministic only when compiled, at least in training, what do you want to do here and in the pytorch repo?

Contributor:

In our previous tests it is deterministic in eager mode. If there is a PR that changes this behavior and makes it deterministic only when compiled and non-deterministic in eager mode, it is up to the PyTorch team to decide whether to accept it. If the PyTorch Dev Infra team accepts it, we can skip the eager-mode deterministic test for this model.

@facebook-github-bot

@xuzhao9 merged this pull request in 1d2550c.
