
Unet: disable deterministic on training #2204

Closed · wants to merge 4 commits
Conversation

@bhack (Contributor) commented Mar 21, 2024

It seems that the backward of some upsample CUDA kernels is not deterministic:
pytorch/pytorch#121324 (comment)

/cc @ezyang
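For context, a minimal sketch of the kind of two-run check that surfaces this; the shapes and upsample mode below are illustrative assumptions, not the exact torchbench configuration, and whether the gradients actually differ depends on size, dtype, and hardware:

```python
import torch
import torch.nn.functional as F

def upsample_backward_grad() -> torch.Tensor:
    # Fixed seed/input so any difference between runs comes from the kernel itself.
    torch.manual_seed(0)
    x = torch.randn(1, 3, 64, 64, device="cuda", requires_grad=True)
    y = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
    y.sum().backward()
    return x.grad.clone()

# Two identical eager runs; a non-deterministic backward kernel can make these differ.
print(torch.equal(upsample_backward_grad(), upsample_backward_grad()))
```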

@bhack (Contributor Author) commented Mar 21, 2024

It seems the CI doesn't like train_deterministic: false together with:

torch.backends.cudnn.deterministic = True

Is there a way to enable it only at inference?
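One possible shape for that, sketched under the assumption that the benchmark exposes separate train() and eval() entry points; the Model class and the restore step are illustrative, not the actual torchbench code:

```python
import torch

class Model:
    def train(self):
        # Sketch only: training leaves torch.backends.cudnn.deterministic at its
        # default (False), so the non-deterministic upsample backward is allowed.
        ...

    def eval(self):
        # Request deterministic cuDNN kernels for inference only, then restore
        # the previous value so later training runs are not affected.
        prev = torch.backends.cudnn.deterministic
        torch.backends.cudnn.deterministic = True
        try:
            with torch.no_grad():
                ...  # forward pass
        finally:
            torch.backends.cudnn.deterministic = prev
```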

@ezyang (Contributor) commented Mar 22, 2024

This is the wrong place to put it, I think; you want to put it in the benchmark runner in pytorch/pytorch.

@bhack (Contributor Author) commented Mar 22, 2024

What do you mean?

@bhack (Contributor Author) commented Mar 22, 2024

Also, the CI failures are not related to this PR.

bhack mentioned this pull request on Mar 22, 2024
@ezyang (Contributor) commented Mar 24, 2024

There's a separate benchmark runner in pytorch/pytorch, and you should be able to set this on unet only there. It will be easier to deploy too, since making the change here means you also have to then bump the torchbench hash on pytorch/pytorch lol
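Roughly the kind of per-model override that would mean, sketched here with hypothetical names; the real benchmarks/dynamo runner has its own configuration mechanism, and this is not its actual API:

```python
import torch

# Hypothetical set: models whose eager backward is known to be non-deterministic,
# so determinism would only be enforced for their inference runs.
TRAIN_NOT_DETERMINISTIC = {"pytorch_unet"}

def configure_determinism(model_name: str, is_training: bool) -> None:
    # Sketch of a runner-side hook; the name and placement are assumptions.
    torch.backends.cudnn.deterministic = not (
        is_training and model_name in TRAIN_NOT_DETERMINISTIC
    )
```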

@bhack (Contributor Author) commented Mar 24, 2024

The problem is that the pytorch/pytorch runner does not seem to work the same way.

It seems we don't have the metadata/machinery there to disable determinism only for training/backward, as we have here.

Also, pytorch/pytorch#121324 was reverted because it was failing here/on the HUD and not on pytorch/pytorch.

Is this interpretation correct?
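For reference, the per-phase "meta" being referred to is the model's metadata.yaml in this repository, which splits the flag between train and eval; with this PR the pytorch_unet entry ends up roughly as in the diff further down, i.e.:

```yaml
# torchbenchmark/models/pytorch_unet/metadata.yaml (abridged)
eval_benchmark: false
eval_deterministic: true
eval_nograd: true
train_benchmark: false
train_deterministic: false
```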

@xuzhao9 (Contributor) commented Mar 24, 2024

> The problem is that the pytorch/pytorch runner does not seem to work the same way.
>
> It seems we don't have the metadata/machinery there to disable determinism only for training/backward.
>
> Also, the PR was reverted because it was failing here/on the HUD and not on pytorch/pytorch.
>
> Is this interpretation correct?

The failing CI in this PR is due to a separate pytorch/pytorch issue: pytorch/pytorch#122575. There is a CUDA memory-leak bug caused by another pytorch PR.

@bhack (Contributor Author) commented Mar 24, 2024

> The failing CI in this PR is due to a separate pytorch/pytorch issue: pytorch/pytorch#122575. There is a CUDA memory-leak bug caused by another pytorch PR.

Thanks, but that is unrelated to the comment I mentioned, which was more about why we need this PR here instead of in pytorch/pytorch.

@bhack (Contributor Author) commented Mar 28, 2024

@lezcano
@lezcano (Contributor) commented Mar 28, 2024

Can you give more context? In what way is the eager implementation non-deterministic?

@bhack (Contributor Author) commented Mar 28, 2024

We had an eager-mode reproducibility issue (two runs differing) emerge at:
pytorch/pytorch#121324 (comment)

@bhack (Contributor Author) commented Mar 29, 2024

Green light here from the CI

ezyang requested a review from xuzhao9 on March 29, 2024 at 20:31
@xuzhao9 (Contributor) left a comment

LGTM

@facebook-github-bot

@xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -89,6 +88,7 @@ def jit_callback(self):
         self.model = torch.jit.script(self.model)

     def eval(self) -> Tuple[torch.Tensor]:
+        torch.backends.cudnn.deterministic = True
Contributor:
This function is not being used by the downstream benchmarks/dynamo/torchbench.py. So I am wondering how this change will fix the CI issue in pytorch/pytorch? @bhack

Contributor Author:

The problem, more generally, is that the pytorch/pytorch benchmark runner does not take the deterministic metadata field in this repository into account, so it is hard to understand what to do there.

@xuzhao9 (Contributor) commented Apr 1, 2024

@bhack I am fine with changing it here, so long as the eval() and train() code is consistent with the metadata.yaml.

However, I suggest:

  1. In the upstream code (https://github.com/milesial/Pytorch-UNet/blob/master/train.py), there is no specific change on torch.backends.cudnn.deterministic, so it seems we should avoid changing the default value at all, to keep it consistent with upstream.

  2. Please note that the CI in pytorch/pytorch does not call eval(). Instead, it only calls get_module() and uses its own benchmark function in benchmarks/dynamo/torchbench.py. I am wondering how changing the eval() function will fix the CI in pytorch/pytorch.

@bhack (Contributor Author) commented Apr 1, 2024

I don't understand point 1.
Are we still manipulating it in train?

About point 2, I will reply inline.

@xuzhao9 (Contributor) commented Apr 1, 2024

My explanation of point 1:
The goal of this repo is to provide a quick test of popular PyTorch model code on a single GPU device, so we would like to keep behavior consistent with the upstream model code. The upstream model code of pytorch_unet is https://github.com/milesial/Pytorch-UNet. In that upstream repo, the train code simply does not touch the config option torch.backends.cudnn.deterministic, i.e., it keeps the default value. Therefore, it seems to me that we should not tweak this option in our code either. @bhack

@bhack (Contributor Author) commented Apr 1, 2024

> In that upstream repo, the train code simply does not touch the config option torch.backends.cudnn.deterministic, i.e., it keeps the default value. Therefore, it seems to me that we should not tweak this option in our code either.

I still don't understand. Are we tweaking it in train? It seems to me we are not, since it was set in the init and this PR moves it to eval only.

@xuzhao9 (Contributor) commented Apr 1, 2024

> > In that upstream repo, the train code simply does not touch the config option torch.backends.cudnn.deterministic, i.e., it keeps the default value. Therefore, it seems to me that we should not tweak this option in our code either.
>
> I still don't understand. Are we tweaking it in train? It seems to me we are not, since it was set in the init and this PR moves it to eval only.

Previously, on lines 9-10 (https://github.com/pytorch/benchmark/blob/main/torchbenchmark/models/pytorch_unet/__init__.py#L9-L10), these options were set at import time, so we were changing the default behavior in both train and eval. To me it seems we should remove these two lines and always use the default value of PyTorch (since the upstream code uses the default value, too).

By default:

$ python -c "import torch; print(torch.backends.cudnn.deterministic)"
False

So I think that will be sufficient for your request?

@@ -6,7 +6,6 @@
 from torch import optim
 from typing import Tuple

-torch.backends.cudnn.deterministic = True
 torch.backends.cudnn.benchmark = False
Contributor:

I think we should also remove this line, to keep behavior consistent with upstream.

@@ -89,6 +88,7 @@ def jit_callback(self):
         self.model = torch.jit.script(self.model)

     def eval(self) -> Tuple[torch.Tensor]:
+        torch.backends.cudnn.deterministic = True
Contributor:

What about removing line 91 (torch.backends.cudnn.deterministic = True) AND lines 9-10 (torch.backends.cudnn.benchmark = False and torch.backends.cudnn.deterministic = True) altogether?
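Under that proposal the model file would carry no cudnn overrides at all; a minimal sketch of the resulting shape (the Model class body here is illustrative, not the actual pytorch_unet benchmark code):

```python
from typing import Tuple

import torch

# No module-level torch.backends.cudnn.benchmark / .deterministic assignments:
# PyTorch's defaults (both False) apply to train and eval alike.

class Model:
    def __init__(self) -> None:
        self.example_output = torch.zeros(1)

    def eval(self) -> Tuple[torch.Tensor]:
        # No per-phase override here either.
        with torch.no_grad():
            return (self.example_output,)
```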

@@ -5,4 +5,4 @@ eval_benchmark: false
 eval_deterministic: true
 eval_nograd: true
 train_benchmark: false
-train_deterministic: true
+train_deterministic: false
Contributor Author:

With all the removals you are proposing, what will the scope of this metadata be?
Also, who controls eager_two_runs_differ? See pytorch/pytorch#121324 (comment)

Contributor:

@bhack The metadata would be the default value of PyTorch, which is False for both train and eval.


Contributor:

> Is it in this repo too? github.com/search?q=repo%3Apytorch%2Fbenchmark%20eager_two&type=code

The version in pytorch/pytorch is the ground truth, and the pytorch/benchmark version will be auto-synced from pytorch/pytorch.

Contributor Author:

But who is going to define if two unet eager runs need to be the same or not?

Contributor Author:

It seems that it could be deterministic when compiled if I have interpreted this correctly:
pytorch/pytorch#121769 (comment)

What do you think?
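A quick way to check that reading, sketched with illustrative shapes (not the benchmark's actual inputs): compare two eager runs against two compiled runs of the same backward.

```python
import torch
import torch.nn.functional as F

def grad_of(fn) -> torch.Tensor:
    torch.manual_seed(0)
    x = torch.randn(1, 3, 64, 64, device="cuda", requires_grad=True)
    fn(x).sum().backward()
    return x.grad.clone()

upsample = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
compiled = torch.compile(upsample)

# Eager backward may hit a non-deterministic CUDA kernel...
print("eager equal:   ", torch.equal(grad_of(upsample), grad_of(upsample)))
# ...while the compiled backward may be deterministic, per the linked comment.
print("compiled equal:", torch.equal(grad_of(compiled), grad_of(compiled)))
```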

Contributor:

> But who is going to define if two unet eager runs need to be the same or not?

Sorry about the confusion. Yes, it is defined by pytorch/benchmark: https://github.com/pytorch/benchmark/blob/main/test.py#L72. If two eager runs differ, the pytorch/benchmark CI will fail.
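Conceptually, that check does something like the following; this is a paraphrased sketch with hypothetical helper names, not the actual test.py code:

```python
import torch

def eager_two_runs_differ(make_model, make_input) -> bool:
    # Sketch: run the same eager forward+backward twice from identical seeds
    # and report whether any parameter gradient differs.
    runs = []
    for _ in range(2):
        torch.manual_seed(0)
        model = make_model().cuda()
        x = make_input().cuda()
        model(x).sum().backward()
        runs.append([p.grad.clone() for p in model.parameters() if p.grad is not None])
    return any(not torch.equal(a, b) for a, b in zip(*runs))
```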

Contributor Author:

So, since it seems to be deterministic only when compiled, at least in training, what do you want to do here and in the pytorch repo?

Contributor:

In our previous tests it is deterministic in eager mode. If there is a PR that changes this behavior and makes it deterministic only when compiled and non-deterministic in eager mode, it is up to the PyTorch team to decide whether to accept it. If the PyTorch Dev Infra team accepts it, we can skip the eager-mode deterministic test for this model.

@facebook-github-bot

@xuzhao9 merged this pull request in 1d2550c.
