Unet: disable deterministic on training #2204
Conversation
It seems training doesn't like deterministic mode.
Is there a way to enable it only at inference?
This is the wrong place to put it, I think; you want to put it in the benchmark runner in pytorch/pytorch.
What do you mean?
Also, the CI failures are not related to this PR.
There's a separate benchmark runner in pytorch/pytorch, and you should be able to set this for unet only there. It will be easier to deploy too, since making the change here means you also have to bump the torchbench hash on pytorch/pytorch.
The problem is that it does not seem to be the same in the pytorch runner: there we don't have the metadata/machinery to disable determinism only in training/backward that we have here. Also, pytorch/pytorch#121324 was reverted because it was failing here/on the HUD and not on pytorch/pytorch. Is this interpretation correct?
The failing CI in this PR is caused by a separate pytorch/pytorch issue: pytorch/pytorch#122575. There is a CUDA memleak bug caused by another pytorch PR.
Thanks, but that was unrelated to the mentioned comment; it was more about why we need to use this PR here instead.
Can you give more context? In what way is the eager implementation non-deterministic?
We had an eager two-runs reproducibility issue that emerged at:
Green light here from the CI |
LGTM
@xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
```diff
@@ -89,6 +88,7 @@ def jit_callback(self):
         self.model = torch.jit.script(self.model)

     def eval(self) -> Tuple[torch.Tensor]:
+        torch.backends.cudnn.deterministic = True
```
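A hedged sketch of the pattern this hunk aims at: enabling deterministic kernels only around inference and restoring the previous setting afterwards. `CudnnBackendStub` is a hypothetical stand-in so the snippet runs without PyTorch installed; with real PyTorch you would toggle `torch.backends.cudnn.deterministic` the same way.

```python
from contextlib import contextmanager

class CudnnBackendStub:
    # Hypothetical stand-in for torch.backends.cudnn, used only so this
    # sketch is self-contained; both flags default to False as in PyTorch.
    deterministic = False
    benchmark = False

cudnn = CudnnBackendStub()

@contextmanager
def deterministic_inference(backend):
    """Enable deterministic kernels only for the enclosed eval block."""
    prev = backend.deterministic
    backend.deterministic = True
    try:
        yield
    finally:
        # Restore the previous setting so training is unaffected.
        backend.deterministic = prev

def eval_step(backend):
    # Model inference would run here; return the flag to show its state.
    return backend.deterministic

with deterministic_inference(cudnn):
    ran_deterministic = eval_step(cudnn)

print(ran_deterministic, cudnn.deterministic)
```

The context manager keeps the flag change scoped, so a later `train()` call sees the backend defaults untouched.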
This function is not being used by the downstream benchmarks/dynamo/torchbench.py, so I am wondering how this change will fix the CI issue in pytorch/pytorch. @bhack
The problem is that, more generally, the pytorch repo benchmark runner does not consider the deterministic metadata field in this repository, so it is hard to understand what to do there.
@bhack I am fine with changing it here, so long as […]. However, I suggest:
I don't understand point 1. About 2, I will reply inline.
My explanation on point 1:
I still don't understand. Are we tweaking it in train? It seems to me we are not, since it was in the init and we moved it only into eval.
Previously, on lines 9-10: https://github.com/pytorch/benchmark/blob/main/torchbenchmark/models/pytorch_unet/__init__.py#L9-L10. So we were changing the default behavior in both train and eval. To me it seems we should remove these two lines and always use the default value of pytorch (since the upstream code uses the default value, too). By default:
So I think that will be sufficient for your request?
```diff
@@ -6,7 +6,6 @@
 from torch import optim
 from typing import Tuple

-torch.backends.cudnn.deterministic = True
 torch.backends.cudnn.benchmark = False
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also remove this line, to keep behavior consistent with upstream.
```diff
@@ -89,6 +88,7 @@ def jit_callback(self):
         self.model = torch.jit.script(self.model)

     def eval(self) -> Tuple[torch.Tensor]:
+        torch.backends.cudnn.deterministic = True
```
What about we remove line 91 (`torch.backends.cudnn.deterministic = True`) AND lines 9-10 (`torch.backends.cudnn.benchmark = False` and `torch.backends.cudnn.deterministic = True`) altogether?
```diff
@@ -5,4 +5,4 @@ eval_benchmark: false
 eval_deterministic: true
 eval_nograd: true
 train_benchmark: false
-train_deterministic: true
+train_deterministic: false
```
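To illustrate what this metadata controls, here is a hedged sketch of how a runner could consume it and set the backend flag per phase. The YAML is parsed by hand to stay dependency-free, `CudnnStub` is a hypothetical stand-in for `torch.backends.cudnn`, and `configure_backend` is an invented helper; only the field names come from the diff above.

```python
# Minimal hand-rolled parse of the metadata shown above; a real
# runner would use a YAML library such as PyYAML.
metadata_text = """\
eval_benchmark: false
eval_deterministic: true
eval_nograd: true
train_benchmark: false
train_deterministic: false
"""

def parse_metadata(text):
    meta = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip() == "true"
    return meta

class CudnnStub:
    # Hypothetical stand-in for torch.backends.cudnn.
    deterministic = False

def configure_backend(meta, phase, backend):
    """Apply the phase-specific deterministic flag from the metadata."""
    backend.deterministic = meta[f"{phase}_deterministic"]

meta = parse_metadata(metadata_text)
cudnn = CudnnStub()
configure_backend(meta, "train", cudnn)
print("train:", cudnn.deterministic)
configure_backend(meta, "eval", cudnn)
print("eval:", cudnn.deterministic)
```

Under this metadata, training runs with the PyTorch default (non-deterministic) while eval still requests deterministic kernels, which is the split the PR proposes.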
With all the removals you are proposing, what will be the scope of this metadata?
Also, who is controlling `eager_two_runs_differ`? See pytorch/pytorch#121324 (comment)
@bhack The metadata would be the default value of pytorch, which is `False` for both `train` and `eval`.
`torchbench.py` is controlling `eager_two_runs_differ`: https://github.com/pytorch/pytorch/blob/76a87e33a0fdd0639842218ce26943e7b4f3838b/benchmarks/dynamo/common.py#L2478
Is it on this repo too? https://github.com/search?q=repo%3Apytorch%2Fbenchmark%20eager_two&type=code
> Is it on this repo too? github.com/search?q=repo%3Apytorch%2Fbenchmark%20eager_two&type=code

The version in `pytorch/pytorch` is the ground truth, and the pytorch/benchmark version will be auto-synced from `pytorch/pytorch`.
But who is going to decide whether two unet eager runs need to be the same or not?
It seems that it could be deterministic when compiled, if I have interpreted this correctly: pytorch/pytorch#121769 (comment). What do you think?
> But who is going to define if two unet eager runs need to be the same or not?

Sorry about the confusion. Yes, it is defined by `pytorch/benchmark`: https://github.com/pytorch/benchmark/blob/main/test.py#L72. If two eager runs differ, the `pytorch/benchmark` CI will fail.
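The determinism check referenced here compares the outputs of two eager runs of the same model. A hedged sketch of that kind of check, with a seeded pseudo-model standing in for a real torchbench model (all names hypothetical, not the actual `test.py` code):

```python
import random

def pseudo_model_output(seed):
    # Stand-in for one eager run of a model: seeding the RNG makes
    # the run reproducible, so repeated runs should match exactly.
    rng = random.Random(seed)
    return [rng.random() for _ in range(4)]

def eager_two_runs_differ(seed):
    """Return True if two identically-configured runs disagree,
    mirroring the spirit of the pytorch/benchmark determinism test."""
    first = pseudo_model_output(seed)
    second = pseudo_model_output(seed)
    return first != second

print(eager_two_runs_differ(0))
```

A model with a non-deterministic backward kernel would make the two runs (and their gradients) diverge, which is exactly how unet was tripping this CI check.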
So, given that it seems to be deterministic only when compiled, at least in training, what do you want to do here and in the pytorch repo?
In our previous tests it is deterministic in eager mode. If a PR changes this behavior and makes it deterministic only when compiled and non-deterministic in eager mode, it is up to the PyTorch team to decide whether to accept it. If the PyTorch Dev Infra team accepts it, we can skip the eager-mode deterministic test for this model.
It seems that the backward pass of some upsample CUDA kernels is not deterministic:
pytorch/pytorch#121324 (comment)
/cc @ezyang