
fix deprecated torch.cuda.amp.GradScaler FutureWarning for pytorch 2.4+ #3132

Merged
merged 9 commits into huggingface:main on Oct 7, 2024

Conversation

Mon-ius
Contributor

@Mon-ius Mon-ius commented Sep 29, 2024

With the release of PyTorch 2.4, the torch.cuda.amp.GradScaler API has been officially marked as deprecated.

This pull request (PR) addresses the following warning:

accelerate/accelerator.py:494: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
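
In plain terms, the warning asks for the rename sketched below; this is a minimal illustration of the change, not the PR's exact diff (the `kwargs` dict is a placeholder for whatever GradScaler options the caller passes):

```python
import torch

kwargs = {}  # placeholder for GradScaler options, e.g. {"init_scale": 2.0**16}

# Deprecated since PyTorch 2.4:
#   scaler = torch.cuda.amp.GradScaler(**kwargs)

# Replacement suggested by the warning: the device type becomes the first argument
scaler = torch.amp.GradScaler("cuda", **kwargs)
```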

@Mon-ius Mon-ius changed the title from "fix deprecated FutureWarning for pytorch 2.4+" to "fix deprecated torch.cuda.amp.GradScaler FutureWarning for pytorch 2.4+" on Sep 29, 2024
@BenjaminBossan
Member

Thanks for working on this deprecation. Could you please run make quality and make style on your PR?

@Mon-ius
Contributor Author

Mon-ius commented Oct 1, 2024

@BenjaminBossan I have addressed your requests; would you please trigger the workflows? 🤗

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Mon-ius
Contributor Author

Mon-ius commented Oct 1, 2024

Strange, I ran make quality without issues on my side:

make quality
ruff check .
All checks passed!
ruff format --check .
150 files already formatted
doc-builder style src/accelerate docs/source --max_len 119 --check_only

@BenjaminBossan I have checked the e583d2e results; it seems this workflow has a bug that needs fixing.

It indicates the earlier checks passed, but the job fails at the end:

make quality
  shell: /usr/bin/bash -e {0}
  env:
    pythonLocation: /opt/hostedtoolcache/Python/3.8.18/x64
    LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.8.18/x64/lib
ruff check .
All checks passed!
ruff format --check .
150 files already formatted
doc-builder style src/accelerate docs/source --max_len 119 --check_only
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.18/x64/bin/doc-builder", line 8, in <module>
    sys.exit(main())
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/doc_builder/commands/doc_builder_cli.py", line 47, in main
    args.func(args)
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/doc_builder/commands/style.py", line 28, in style_command
    raise ValueError(f"{len(changed)} files should be restyled!")
ValueError: 1 files should be restyled!
make: *** [Makefile:17: quality] Error 1
Error: Process completed with exit code 2.

@BenjaminBossan
Copy link
Member

It's the doc-builder that is complaining. This is the diff I get from running doc-builder style src/accelerate docs/source --max_len 119:

@@ -127,10 +127,11 @@ def find_executable_batch_size(function: callable = None, starting_batch_size: i
 
 
     >>> @find_executable_batch_size(starting_batch_size=128)
-    ... def train(batch_size, model, optimizer): ...
+    ... def train(batch_size, model, optimizer):
+    ...     ...
 
 
-    ... train(model, optimizer)
+    >>> train(model, optimizer)
     ```
     """
     if function is None
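
For reference, the docstring example being restyled corresponds to usage roughly like the following sketch (the training-loop body is a placeholder; the decorator retries with a halved batch size on CUDA out-of-memory errors):

```python
from accelerate.utils import find_executable_batch_size

@find_executable_batch_size(starting_batch_size=128)
def train(batch_size, model, optimizer):
    # placeholder body: run training at the given batch size;
    # on OOM, the decorator halves batch_size and calls this function again
    ...

# called without batch_size -- the decorator injects and adjusts it:
# train(model, optimizer)
```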

@Mon-ius
Contributor Author

Mon-ius commented Oct 1, 2024

@find_executable_batch_size(starting_batch_size=128)

Interesting, but I didn't touch the file under src/accelerate/utils/memory.py. I ran your command on my side: doc-builder style src/accelerate docs/source --max_len 119, and there were no issues.

I just updated the code based on your message. I'm aware that the doc-builder on python3.8 is slightly different from the one running on python3.12 on my side.

@BenjaminBossan, could you please trigger the CI again?

@BenjaminBossan
Member

Thanks @Mon-ius. I think one of your previous commits changed memory.py, probably via ruff. So let's completely undo any changes to that file and it should pass; that means restoring the one line that's currently removed.

@Mon-ius
Contributor Author

Mon-ius commented Oct 1, 2024

@BenjaminBossan I just deleted the file and downloaded the original one; it seems a single blank line made the CI fail.

Could you please trigger it again? 🤗

@Mon-ius
Contributor Author

Mon-ius commented Oct 1, 2024

Hi, @BenjaminBossan

It seems all checks have passed now; could you please merge it? 🤗

Member

@BenjaminBossan BenjaminBossan left a comment


I checked this change with older PyTorch versions and unfortunately, it is only supported starting from 2.3. Since we want to support older versions as well, we need to add a torch version check.

Probably a similar argument applies to the other grad scalers, like torch.npu.amp.GradScaler, but I haven't tested.
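
A minimal sketch of the kind of guard being requested (not the PR's exact diff; `>=` is used here since the unified API is available from 2.3 on, and `kwargs` is a placeholder for the handler's scaler options):

```python
import torch
from packaging import version

kwargs = {}  # placeholder for GradScaler options

if version.parse(torch.__version__) >= version.parse("2.3"):
    # unified API introduced in PyTorch 2.3: the device type is the first argument
    scaler = torch.amp.GradScaler("cuda", **kwargs)
else:
    # older releases only expose the CUDA-specific class
    scaler = torch.cuda.amp.GradScaler(**kwargs)
```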

@Mon-ius
Contributor Author

Mon-ius commented Oct 2, 2024

@BenjaminBossan Thanks for your review. I applied a simple PyTorch version check for 2.3 on the GPU device; could you please trigger the CI?

@Mon-ius
Contributor Author

Mon-ius commented Oct 2, 2024

For the NPU device, I don't currently have one to test with, but I also applied the same strategy in the last, separate commit 🤗

Member

@BenjaminBossan BenjaminBossan left a comment


Thanks for the version fixes. I have a few smaller comments.

It would probably make sense to factor out the grad scaler instantiation to a utility function to avoid code duplication, but I'll let Zach be the judge of that once he's back in office. For now, it can stay as is.

-    scaler (`torch.cuda.amp.GradScaler`, *optional*):
-        An optional gradient scaler instance to save
+    scaler (`torch.amp.GradScaler`, *optional*) for pytorch>2.3:
+        An optional gradient scaler instance to save; for lower version, check `torch.cuda.amp.GradScaler`
Member


I think we don't need this extra explanation in the docstring; it should be quite clear as is what this refers to.

@@ -209,8 +209,9 @@ def register_comm_hook(self, model):
 class GradScalerKwargs(KwargsHandler):
     """
     Use this object in your [`Accelerator`] to customize the behavior of mixed precision, specifically how the
-    `torch.cuda.amp.GradScaler` used is created. Please refer to the documentation of this
-    [scaler](https://pytorch.org/docs/stable/amp.html?highlight=gradscaler) for more information on each argument.
+    `torch.amp.GradScaler` used is created for pytoch>2.3 or `torch.cuda.amp.GradScaler` for lower version. Please
Member


Same, let's not overexplain.

@@ -494,11 +495,17 @@ def __init__(
elif is_musa_available():
self.scalar = torch.musa.amp.GradScaler(**kwargs)
elif is_npu_available():
self.scaler = torch.npu.amp.GradScaler(**kwargs)
if version.parse(torch.__version__) > version.parse("2.3"):
Member


Since neither of us can test this, let's not touch NPU for now.

@Mon-ius
Contributor Author

Mon-ius commented Oct 2, 2024

@BenjaminBossan Thanks again for the comments. I have reverted the changes related to the documentation and NPU. 🤗

@Mon-ius
Contributor Author

Mon-ius commented Oct 2, 2024

It seems there is an issue with the CI itself: the test environment lacks pillow, which makes these tests fail:

=========================== short test summary info ============================
FAILED tests/trainer/test_trainer.py::TrainerIntegrationTest::test_trainer_saves_image_processor - ImportError: 
CLIPImageProcessor requires the PIL library but it was not found in your environment. You can install it with pip:
`pip install pillow`. Please note that you may need to restart your runtime after installation.
FAILED tests/trainer/test_trainer.py::TrainerIntegrationTest::test_trainer_saves_processor - ImportError: 
CLIPImageProcessor requires the PIL library but it was not found in your environment. You can install it with pip:
`pip install pillow`. Please note that you may need to restart your runtime after installation.
=========== 2 failed, 172 passed, 81 skipped, 42 warnings in 28.79s ============

@BenjaminBossan
Member

Yes, appears to be unrelated.

Member

@BenjaminBossan BenjaminBossan left a comment


LGTM, thanks for your continued work on this PR. I'll leave this open for Zach to give the final review.

@Mon-ius
Contributor Author

Mon-ius commented Oct 6, 2024

> Thanks for the version fixes. I have a few smaller comments.
>
> It would probably make sense to factor out the grad scaler instantiation to a utility function to avoid code duplication, but I'll let Zach be the judge of that once he's back in office. For now, it can stay as is.

@BenjaminBossan, any update on Zach's review of this PR? 🤗

Collaborator

@muellerzr muellerzr left a comment


Thanks, this looks okay to me. This hints that we should probably have a get_grad_scaler util func in Accelerate which can handle the backends automatically as part of the ops util package, but for the scope of this PR this works great.

Thanks!
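
A rough sketch of what such a get_grad_scaler util might look like (hypothetical and for illustration only; it is not what this PR adds, and a real Accelerate helper would cover the other backends as well):

```python
# hypothetical helper, not accelerate's actual API
import torch
from packaging import version


def get_grad_scaler(device_type: str = "cuda", **kwargs):
    """Return a GradScaler for `device_type`, handling the PyTorch 2.3+ API rename."""
    if version.parse(torch.__version__) >= version.parse("2.3"):
        # unified API: the device type is passed as the first argument
        return torch.amp.GradScaler(device_type, **kwargs)
    if device_type == "cuda":
        # older releases only expose the CUDA-specific class
        return torch.cuda.amp.GradScaler(**kwargs)
    raise NotImplementedError(f"No GradScaler fallback for {device_type!r} on torch < 2.3")


scaler = get_grad_scaler("cuda", init_scale=2.0**16)
```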

@muellerzr muellerzr merged commit e93b056 into huggingface:main Oct 7, 2024
24 of 25 checks passed