Tf32 warnings #6816

qingpeng9802 · 2023-08-03T09:51:43Z

Related to #6754.

Description

Show a warning when any setting that may enable TF32 is detected.
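
As a rough illustration of what such a detection could look like, the sketch below checks two environment variables that influence TF32 behaviour: `TORCH_ALLOW_TF32_CUBLAS_OVERRIDE` (read by PyTorch) and `NVIDIA_TF32_OVERRIDE` (read by NVIDIA libraries). The helper name `find_tf32_env_overrides` and the warning text are hypothetical and are not taken from this PR, and treating `"1"` as the enabling value is an assumption.

```python
import os
import warnings


def find_tf32_env_overrides(environ=None):
    """Return messages for environment variables that may enable TF32.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    env = os.environ if environ is None else environ
    # NVIDIA libraries never use TF32 when this variable is set to "0".
    if env.get("NVIDIA_TF32_OVERRIDE") == "0":
        return []
    messages = []
    # Assumption: "1" is the value that switches TF32 matmul on in PyTorch.
    if env.get("TORCH_ALLOW_TF32_CUBLAS_OVERRIDE") == "1":
        messages.append("TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 may enable TF32 for matmul.")
    return messages


# Emit one warning per finding in the current process environment.
for message in find_tf32_env_overrides():
    warnings.warn(message)
```

Passing a plain dict instead of `os.environ` keeps the check easy to exercise without touching the real environment.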

Types of changes

Non-breaking change (fix or new feature that would not break existing functionality).
Breaking change (fix or new feature that would cause existing functionality to change).
New tests added to cover the changes.
Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
In-line docstrings updated.
Documentation updated, tested make html command in the docs/ folder.

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

qingpeng9802 · 2023-08-03T10:15:18Z

/black

wyli

thanks! please see some minor inline comments

monai/losses/__init__.py

monai/utils/tf32.py

tests/utils.py

wyli · 2023-08-04T22:06:27Z

/black

wyli

Thanks, it looks good to me, testing with more environments

qingpeng9802 · 2023-08-05T12:10:47Z

The failed test is caused by #2161 (also see kornia/kornia#1951). PyTorch initializes torch.cuda only once, the first time torch.cuda is called, and there is no way to revert the initialized state (pytorch/pytorch#28829). The only solution is to let users adjust where os.environ["CUDA_VISIBLE_DEVICES"] is set, which is what a PyTorch maintainer recommends.
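
The ordering constraint described above can be sketched as follows. This is a minimal illustration, assuming a fresh Python process in which torch.cuda has not been touched yet; the helper name `count_visible_gpus` is hypothetical.

```python
import os


def count_visible_gpus(visible_devices: str) -> int:
    """Set CUDA_VISIBLE_DEVICES, then ask PyTorch how many GPUs it sees."""
    # The variable has to be set before the first torch.cuda call in the
    # process; once CUDA is initialized, later changes have no effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices
    try:
        import torch
    except ImportError:
        return 0  # PyTorch is not installed, so it cannot see any GPU.
    return torch.cuda.device_count()


# Hiding every device: expected to report zero GPUs in a fresh process.
print(count_visible_gpus(""))
```

If any library has already called into torch.cuda at import time, the assignment arrives too late, which is why an early CUDA query inside `import monai` is problematic.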

qingpeng9802 · 2023-08-05T16:53:02Z

/black

wyli · 2023-08-05T17:07:06Z

> The failed test is caused by #2161 (also see kornia/kornia#1951). PyTorch initializes torch.cuda only once, the first time torch.cuda is called, and there is no way to revert the initialized state (pytorch/pytorch#28829). The only solution is to let users adjust where os.environ["CUDA_VISIBLE_DEVICES"] is set, which is what a PyTorch maintainer recommends.

OK... in this case, how about calling detect_default_tf32 only in the config printing?

MONAI/monai/config/deviceconfig.py

Line 195 in 65cf5fe

def get_gpu_info() -> OrderedDict:

Currently, import monai is already a bit slow.

qingpeng9802 · 2023-08-05T17:35:54Z

Per my test, the function reset_torch_cuda_after_run in commit 348f089 resolves the torch.cuda issue. @wyli Could you trigger the GPU CIs to confirm?
Also, the newly added warning increases the import time by less than 0.005%, so it should be fine.

wyli · 2023-08-05T17:49:16Z

/build

qingpeng9802 · 2023-08-06T07:57:46Z

This failed case is kind of interesting 🤔
Since pytorch/pytorch#80876 was resolved in PyTorch 1.12.1, @wyli could you trigger the GPU CI jobs PT113 and PT210?

My test environment is PyTorch 2.0.0+cu117, NVIDIA-SMI 515.65.01, driver version 516.94, CUDA version 11.7, and it passes tests.test_set_visible_devices.

wyli · 2023-08-06T09:46:20Z

Sure, I’ll try to rerun them. I think the utility function in the PR looks good, but adding it and the workaround to monai/__init__.py is a bit risky. (I read the relevant GitHub discussions; most of them try to avoid unnecessary early calls to torch.cuda.) @ericspod @Nic-Ma

qingpeng9802 · 2023-08-06T13:32:28Z

The key to the torch.cuda issue is the call to https://github.com/pytorch/pytorch/blob/v2.0.1/torch/cuda/__init__.py#L219. IMO, importlib.reload should be relatively safe on the PyTorch (Python) side. The main risk is that it is unclear whether PyTorch's CUDA initialization is idempotent, that is, whether initializing twice leaves CUDA in the same state as initializing once.

Actually, I have an ugly but safe solution: subprocess.check_output("nvidia-smi") as an alternative 😅.
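
A minimal sketch of the nvidia-smi route, under several assumptions: the installed driver's nvidia-smi supports the `compute_cap` query field (older drivers may not), compute capability 8.0 or higher is taken as "TF32-capable", and both helper names are hypothetical.

```python
import shutil
import subprocess


def parse_compute_caps(text: str):
    """Parse lines such as "8.0" into (major, minor) integer tuples."""
    caps = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        major, _, minor = line.partition(".")
        caps.append((int(major), int(minor or 0)))
    return caps


def has_ampere_or_newer_via_smi() -> bool:
    """Query GPUs through nvidia-smi, so torch.cuda is never initialized."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            text=True,
            timeout=10,
        )
        return any(major >= 8 for major, _ in parse_compute_caps(out))
    except (subprocess.SubprocessError, OSError, ValueError):
        # Unsupported query field, driver error, or unparsable output.
        return False


print(has_ampere_or_newer_via_smi())
```

The cost of this approach is one extra process launch at the point where the check runs.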

monai/utils/tf32.py

qingpeng9802 · 2023-08-07T07:13:17Z

As discussed above, we prefer alt 2. Feel free to comment on it. If alt 2 is okay, I can push a commit for it.

PyNVML looks like a safe choice. There are similar usages in PyTorch, such as https://github.com/pytorch/pytorch/blob/v2.0.1/torch/cuda/__init__.py#L771-L794
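
For illustration, a PyNVML-based check could look roughly like the sketch below. It assumes the optional `pynvml` package is installed and treats compute capability 8.0 or higher as "TF32-capable"; the function name `has_tf32_capable_gpu` is hypothetical and this is not the code merged in the PR.

```python
def has_tf32_capable_gpu() -> bool:
    """Check GPU compute capability through NVML instead of torch.cuda."""
    try:
        import pynvml  # optional dependency
    except ImportError:
        return False  # this sketch simply reports False without pynvml
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return False  # no NVIDIA driver or NVML library available
    try:
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            major, _minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
            if major >= 8:
                return True
        return False
    except pynvml.NVMLError:
        return False
    finally:
        pynvml.nvmlShutdown()


print(has_tf32_capable_gpu())
```

Because NVML is queried directly, the CUDA runtime state inside PyTorch stays untouched, so a later change to CUDA_VISIBLE_DEVICES can still take effect.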

wyli · 2023-08-07T09:09:54Z

Sure, I think we can go for option 2 if it works fine across the test platforms and the import time doesn't increase too much (python -X importtime -c "import monai").
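
The import-time check mentioned above can be scripted, for example to compare two branches. The sketch below runs `python -X importtime` in a fresh interpreter and extracts the cumulative time of one module from stderr. The helper names are hypothetical, and `json` is only a stand-in for `monai` so that the snippet runs anywhere.

```python
import subprocess
import sys


def parse_importtime(stderr_text: str, module: str):
    """Return the cumulative import time in microseconds for `module`, or None."""
    prefix = "import time:"
    for line in stderr_text.splitlines():
        if not line.startswith(prefix):
            continue
        fields = line[len(prefix):].split("|")
        if len(fields) != 3:
            continue
        # Nested imports are indented, so strip before comparing names.
        if fields[2].strip() == module:
            try:
                return int(fields[1].strip())
            except ValueError:
                continue  # not a numeric row
    return None


def measure_import_us(module: str):
    """Run `python -X importtime -c "import <module>"` and parse the result."""
    result = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_importtime(result.stderr, module)


# Replace "json" with "monai" in an environment where MONAI is installed.
print(measure_import_us("json"))
```

Single runs are noisy, so repeating the measurement a few times gives a more reliable comparison.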

qingpeng9802 · 2023-08-07T10:19:00Z

/black

wyli · 2023-08-07T11:20:17Z

/build

requirements.txt

wyli · 2023-08-07T16:18:54Z

/build

Following #6816

### Description

make `is_tf32_env()` safer. check `cuda` to prevent fallthrough case when `pynvml` is not found

### Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u --net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick --unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/` folder.

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

qingpeng9802 and others added 8 commits August 3, 2023 16:52

rename precision doc

5f23835

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

add version_geq

6014216

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

detect default tf32 settings

e271ae2

when a function/class is imported Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

refactor is_tf32_env()

5d286aa

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

3b2345a

for more information, see https://pre-commit.ci

fix style E402

35438b8

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

Merge branch 'tf32-warnings' of https://github.com/qingpeng9802/MONAI …

51f3270

…into tf32-warnings

fix style E722

5697a31

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

[MONAI] code formatting

948ccc3

Signed-off-by: monai-bot <monai.miccai2019@gmail.com>

wyli reviewed Aug 3, 2023

monai/losses/__init__.py (outdated, resolved)

monai/utils/tf32.py (resolved)

refactor the usage of detect_default_tf32()

96ca146

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

qingpeng9802 commented Aug 4, 2023

tests/utils.py (resolved)

improve is_tf32_env()

d8a65bb

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

qingpeng9802 marked this pull request as ready for review August 4, 2023 18:48

monai-bot and others added 2 commits August 4, 2023 22:00

[MONAI] code formatting

93ff777

Signed-off-by: monai-bot <monai.miccai2019@gmail.com>

Merge branch 'dev' into tf32-warnings

0ad3a73

wyli approved these changes Aug 4, 2023

wyli enabled auto-merge (squash) August 4, 2023 22:22

wyli disabled auto-merge August 4, 2023 22:37

qingpeng9802 added 2 commits August 6, 2023 00:51

resolve torch.cuda initialization order issue

348f089

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

Merge branch 'tf32-warnings' of https://github.com/qingpeng9802/MONAI …

04b71c3

…into tf32-warnings

pre-commit-ci bot and others added 2 commits August 5, 2023 16:53

[pre-commit.ci] auto fixes from pre-commit.com hooks

9a4310f

for more information, see https://pre-commit.ci

[MONAI] code formatting

a783158

Signed-off-by: monai-bot <monai.miccai2019@gmail.com>

qingpeng9802 commented Aug 6, 2023

monai/utils/tf32.py (outdated, resolved)

qingpeng9802 added 2 commits August 7, 2023 18:07

use pynvml to avoid torch.cuda call

13cd277

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

minor fix

fddc87f

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

pre-commit-ci bot and others added 3 commits August 7, 2023 10:19

[pre-commit.ci] auto fixes from pre-commit.com hooks

2a6ff88

for more information, see https://pre-commit.ci

[MONAI] code formatting

e1b2075

Signed-off-by: monai-bot <monai.miccai2019@gmail.com>

Merge branch 'dev' into tf32-warnings

73cb104

wyli reviewed Aug 7, 2023

requirements.txt (outdated, resolved)

qingpeng9802 and others added 5 commits August 7, 2023 22:03

fix import pynvml

72c8a72

Signed-off-by: Qingpeng Li <qingpeng9802@gmail.com>

Merge branch 'tf32-warnings' of https://github.com/qingpeng9802/MONAI …

d1ed95b

…into tf32-warnings

[pre-commit.ci] auto fixes from pre-commit.com hooks

055ed46

for more information, see https://pre-commit.ci

[MONAI] code formatting

f4571f1

Signed-off-by: monai-bot <monai.miccai2019@gmail.com>

Merge branch 'dev' into tf32-warnings

147224c

wyli enabled auto-merge (squash) August 7, 2023 16:19

wyli merged commit ca96867 into Project-MONAI:dev Aug 7, 2023
28 of 32 checks passed

qingpeng9802 mentioned this pull request Aug 8, 2023

make is_tf32_env() safer #6839

Merged

7 tasks

qingpeng9802 deleted the tf32-warnings branch August 15, 2023 14:15
