Unet: disable deterministic on training #2204

Closed · wants to merge 4 commits
torchbenchmark/models/pytorch_unet/__init__.py (2 changes: 1 addition & 1 deletion)

@@ -6,7 +6,6 @@
 from torch import optim
 from typing import Tuple

-torch.backends.cudnn.deterministic = True
 torch.backends.cudnn.benchmark = False
Contributor:

I think we should also remove this line, to keep behavior consistent with upstream.


 from .pytorch_unet.unet import UNet

@@ -89,6 +88,7 @@ def jit_callback(self):
         self.model = torch.jit.script(self.model)

     def eval(self) -> Tuple[torch.Tensor]:
+        torch.backends.cudnn.deterministic = True
Contributor:

This function is not used by the downstream benchmarks/dynamo/torchbench.py, so I am wondering how this change will fix the CI issue in pytorch/pytorch. @bhack

Contributor Author:

The broader problem is that the pytorch/pytorch benchmark code does not take the deterministic metadata field in this repository into account, so it is hard to understand what to do there.

Contributor:

What if we remove line 91 (`torch.backends.cudnn.deterministic = True`) AND lines 9-10 (`torch.backends.cudnn.benchmark = False` and `torch.backends.cudnn.deterministic = True`) altogether?

         self.model.eval()
         with torch.no_grad():
             with torch.cuda.amp.autocast(enabled=self.args.amp):
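For readers unfamiliar with the cudnn flags changed in this file, here is a standalone, hedged sketch (not part of this PR or repository) of how `torch.backends.cudnn.deterministic` affects run-to-run reproducibility. It uses a toy convolution rather than the UNet model and assumes a CUDA device is available.

```python
# Standalone sketch: with torch.backends.cudnn.deterministic = True, cuDNN is
# restricted to deterministic algorithms, so repeated identical runs should
# produce bit-identical gradients; with the default (False), some algorithms
# may use non-deterministic reductions. Requires a CUDA device.
import torch


def conv_grad(deterministic: bool) -> torch.Tensor:
    torch.backends.cudnn.deterministic = deterministic
    torch.backends.cudnn.benchmark = False
    torch.manual_seed(0)  # identical weights and inputs on every call
    conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
    x = torch.randn(4, 3, 64, 64, device="cuda", requires_grad=True)
    conv(x).sum().backward()
    return x.grad.detach().clone()


if torch.cuda.is_available():
    # Two deterministic runs are expected to match exactly.
    print(torch.equal(conv_grad(True), conv_grad(True)))
```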
torchbenchmark/models/pytorch_unet/metadata.yaml (2 changes: 1 addition & 1 deletion)

@@ -5,4 +5,4 @@ eval_benchmark: false
 eval_deterministic: true
 eval_nograd: true
 train_benchmark: false
-train_deterministic: true
+train_deterministic: false
Contributor Author:

With all the removals you are proposing, what will the scope of this metadata be?
Also, who is controlling eager_two_runs_differ? See pytorch/pytorch#121324 (comment)

Contributor:

@bhack The metadata would fall back to the PyTorch default value, which is False for both train and eval.
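As an illustration of that fallback, a hypothetical loader could read the metadata like this. This is not the actual torchbenchmark code; the function name and default handling are assumptions, only the field names come from the metadata.yaml in this diff.

```python
# Hypothetical sketch, not the real torchbenchmark loader: read a model's
# metadata.yaml and fall back to False (the PyTorch default mentioned above)
# when a determinism field is absent.
import yaml


def load_determinism_flags(path: str) -> dict:
    with open(path) as f:
        meta = yaml.safe_load(f) or {}
    return {
        "train_deterministic": meta.get("train_deterministic", False),
        "eval_deterministic": meta.get("eval_deterministic", False),
    }


# e.g. load_determinism_flags("torchbenchmark/models/pytorch_unet/metadata.yaml")
```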

Contributor:

> Is it on this repo too? github.com/search?q=repo%3Apytorch%2Fbenchmark%20eager_two&type=code

The version in pytorch/pytorch is the ground truth, and the pytorch/benchmark version will be auto-synced from pytorch/pytorch.

Contributor Author:

But who is going to define if two unet eager runs need to be the same or not?

Contributor Author:

It seems that it could be deterministic when compiled if I have interpreted this correctly:
pytorch/pytorch#121769 (comment)

What do you think?
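As an illustration of how one could check that interpretation locally, here is a hypothetical sketch; `model` and `example_input` are placeholders, not names from this repository, and the real behavior is whatever pytorch/pytorch#121769 describes.

```python
# Hypothetical check of the "deterministic when compiled" interpretation above:
# compile the model, run an identical forward/backward pass twice, and compare
# the resulting parameter gradients.
import torch


def compiled_grads_match(model: torch.nn.Module, example_input: torch.Tensor) -> bool:
    compiled = torch.compile(model)
    grads = []
    for _ in range(2):
        model.zero_grad(set_to_none=True)
        compiled(example_input).sum().backward()
        grads.append([p.grad.detach().clone()
                      for p in model.parameters() if p.grad is not None])
    return all(torch.equal(a, b) for a, b in zip(*grads))
```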

Contributor:

> But who is going to define if two unet eager runs need to be the same or not?

Sorry about the confusion. Yes, it is defined by pytorch/benchmark: https://github.com/pytorch/benchmark/blob/main/test.py#L72. If two eager runs differ, the pytorch/benchmark CI will fail.
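For reference, the check at that link boils down to something like the following sketch. The real logic lives in pytorch/benchmark's test.py; `run_eval` here is a hypothetical stand-in for however the harness produces the model's outputs.

```python
# Rough sketch of an "eager two runs must match" check, as described above.
# run_eval is a hypothetical callable returning the model's eval outputs as a
# tuple of tensors; if the two runs differ, the CI test described above fails.
import torch
from typing import Callable, Tuple


def eager_two_runs_differ(run_eval: Callable[[], Tuple[torch.Tensor, ...]]) -> bool:
    first = run_eval()
    second = run_eval()
    return any(not torch.equal(a, b) for a, b in zip(first, second))
```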

Contributor Author:

So, since it seems to be deterministic only when compiled, at least in training, what do you want to do here and in the pytorch repo?

Contributor:

In our previous tests it is deterministic in eager mode. If a PR changes this behavior and makes it deterministic only when compiled and non-deterministic in eager mode, it is up to the PyTorch team to decide whether to accept it. If the PyTorch Dev Infra team accepts it, we can skip the eager-mode deterministic test for this model.
