Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nail in edge case of torch dtype being overriden permantly in the case of an error #35845

Merged
merged 8 commits into from
Feb 6, 2025

Conversation

muellerzr
Copy link
Contributor

@muellerzr muellerzr commented Jan 22, 2025

What does this PR do?

@BenjaminBossan and I have been trying to hunt down an issue causing some headache over the last few days, as to why some tests were failing if done in specific instances.

What we cornered it to, is if a user:

  1. Inits a model (either from_pretrained or _from_config), with a dtype other than the torch default dtype
  2. And then inside of a very specific window of time the model loading fails before we have a chance to restore the old torch dtype

What will occur is the torch dtype will permantly change, which is why some users may "discover" this issue and it goes away after a full restart (a likely scenario that could happen). What we currently do doesn't catch if we have an exception, this does, and is used as an additive

This PR introduces a safety decorator called restore_default_torch_dtype which will restore whatever torch's original dtype was prior to calling the function (since we restore it anyways at the end).

This PR also adds a new test_injection function. The aim is for this function to be mocked at critical points where we need testing, and we cannot easily get to it (e.g. in this case, there was only one function we could mock in from_pretrained and zero functions we could mock in _from_config to make sure this fix worked!)

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@SunMarc @BenjaminBossan @ArthurZucker

Copy link
Member

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this bug. I had a few points of discussion, please LMK what you think.

@@ -245,6 +246,25 @@ def set_zero3_state():
_is_ds_init_called = False


def restore_default_torch_dtype(func):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this approach should generally work. It would fail if there was ever a case where calling from_pretrained should purposefully change the default dtype for the rest of the process. I'm not sure if such a use case exist, just wanted to highlight it.

muellerzr and others added 2 commits January 23, 2025 15:03
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Copy link
Member

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates. I noticed two more potential issues, please check.

@muellerzr muellerzr merged commit 1ce0e29 into main Feb 6, 2025
26 checks passed
@muellerzr muellerzr deleted the muellerzr-restorable-default branch February 6, 2025 14:05
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

MekkCyber pushed a commit that referenced this pull request Feb 7, 2025
…e of an error (#35845)

* Nail in edge case of torch dtype

* Rm unused func

* Apply suggestions from code review

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>

* Refactor tests to only mock what we need, don't introduce injection functions

* SetUp/TearDown

* Do super

---------

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
elvircrn pushed a commit to elvircrn/transformers that referenced this pull request Feb 13, 2025
…e of an error (huggingface#35845)

* Nail in edge case of torch dtype

* Rm unused func

* Apply suggestions from code review

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>

* Refactor tests to only mock what we need, don't introduce injection functions

* SetUp/TearDown

* Do super

---------

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
sbucaille pushed a commit to sbucaille/transformers that referenced this pull request Feb 16, 2025
…e of an error (huggingface#35845)

* Nail in edge case of torch dtype

* Rm unused func

* Apply suggestions from code review

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>

* Refactor tests to only mock what we need, don't introduce injection functions

* SetUp/TearDown

* Do super

---------

Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants