running float8 tests fails on main due to unrelated error #991

Closed
vkuzo opened this issue Oct 2, 2024 · 7 comments

@vkuzo
Contributor

vkuzo commented Oct 2, 2024

I just rebased on latest main branch, running float8 tests fails with an unrelated error:

(pytorch) [vasiliy@devgpu006.vll6 ~/local/ao (main)]$ pytest test/float8/test_base.py -s -x
======================================================================================================= test session starts =======================================================================================================
platform linux -- Python 3.11.0, pytest-8.3.3, pluggy-1.5.0
rootdir: /data/users/vasiliy/ao
plugins: hypothesis-6.111.2, typeguard-4.3.0
collected 0 items / 1 error                                                                                                                                                                                                       

============================================================================================================= ERRORS ==============================================================================================================
____________________________________________________________________________________________ ERROR collecting test/float8/test_base.py ____________________________________________________________________________________________
test/float8/test_base.py:19: in <module>
    from torchao.utils import TORCH_VERSION_AT_LEAST_2_5
torchao/__init__.py:31: in <module>
    from torchao.quantization import (
torchao/quantization/__init__.py:8: in <module>
    from .quant_api import *  # noqa: F403
torchao/quantization/quant_api.py:26: in <module>
    from torchao.dtypes.uintx.uintx import UintxLayoutType
torchao/dtypes/__init__.py:4: in <module>
    from .affine_quantized_tensor import (
torchao/dtypes/affine_quantized_tensor.py:3: in <module>
    import torchao.ops
torchao/ops.py:8: in <module>
    lib.define("quant_llm_linear(int EXPONENT, int MANTISSA, Tensor _in_feats, Tensor _weights, Tensor _scales, int splitK) -> Tensor")
../pytorch/torch/library.py:146: in define
    result = self.m.define(schema, alias_analysis, tuple(tags))
E   RuntimeError: Tried to register an operator (torchao::quant_llm_linear(int EXPONENT, int MANTISSA, Tensor _in_feats, Tensor _weights, Tensor _scales, int splitK) -> Tensor) with the same name and overload name multiple times. Each overload's schema should only be registered with a single call to def(). Duplicate registration: registered at /dev/null:241. Original registration: registered at /data/users/vasiliy/ao/torchao/csrc/fp6_llm.cpp:5
===================================================================================================== short test summary info =====================================================================================================
ERROR test/float8/test_base.py - RuntimeError: Tried to register an operator (torchao::quant_llm_linear(int EXPONENT, int MANTISSA, Tensor _in_feats, Tensor _weights, Tensor _scales, int splitK) -> Tensor) with the same name and overload name multiple tim...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
======================================================================================================== 1 error in 1.85s =========================================================================================================
(pytorch) [vasiliy@devgpu006.vll6 ~/local/ao (main)]$ git log -n 1
commit 9229df9acd912bcf00e8faf138a33382d94e23b2 (HEAD -> main, origin/main, origin/HEAD)
Author: Apurva Jain <apurvajain.kota@gmail.com>
Date:   Tue Oct 1 17:55:26 2024 -0700

    Float8 dynamic autoquant (#946)
(pytorch) [vasiliy@devgpu006.vll6 ~/local/ao (main)]$ git status
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean

@vkuzo
Contributor Author

vkuzo commented Oct 2, 2024

assign to oncall

@vkuzo
Contributor Author

vkuzo commented Oct 2, 2024

#949 looks related

@vkuzo
Contributor Author

vkuzo commented Oct 2, 2024

Actually, I think #914 was the root cause. When I'm on top of main after that PR, I can't import torchao (hit the same error as the issue summary). When I go to the commit before that PR, things work.

proof: https://gist.github.com/vkuzo/bd283c85b9ddd72b300e92cff2373589

cc @HDCharles

@jainapurva
Contributor

Try checking the CUDA version and the torch version. Install torch nightly and the corresponding CUDA. I was having the same issue.
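
A quick way to gather the two versions mentioned above, as a minimal sketch. `torch.__version__` and `torch.version.cuda` are standard PyTorch attributes; the helper name `report_versions` is made up for illustration and is not part of torchao or PyTorch.

```python
import importlib


def report_versions():
    """Collect the torch and CUDA versions visible to this interpreter."""
    info = {"torch": None, "cuda": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is not installed in this environment
        return info
    info["torch"] = torch.__version__
    # torch.version.cuda is None for CPU-only builds
    info["cuda"] = torch.version.cuda
    return info


if __name__ == "__main__":
    print(report_versions())
```

Comparing this output between a working and a failing environment may help rule out a version mismatch before looking at build artifacts.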

@vkuzo
Contributor Author

vkuzo commented Oct 2, 2024

The fix was to rebuild torchao with USE_CPP=1: `with-proxy USE_CPP=1 pip install -e .` Note that rebuilding with USE_CPP=0 didn't fix things, presumably because the outdated compiled artifacts were not rebuilt and so conflicted with the new op registration logic.
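
To illustrate why a stale compiled artifact can clash with a newer Python-side definition, here is a toy model of an operator registry. It is not PyTorch's actual implementation: `ToyOpRegistry` and its `define` method are hypothetical, and only the operator name and the two source files are taken from the traceback in the issue summary.

```python
class ToyOpRegistry:
    """Toy stand-in for an operator registry that rejects duplicate names."""

    def __init__(self):
        self._ops = {}  # operator name -> where it was first registered

    def define(self, name, source):
        if name in self._ops:
            raise RuntimeError(
                f"Tried to register {name} multiple times. "
                f"Duplicate registration: {source}. "
                f"Original registration: {self._ops[name]}"
            )
        self._ops[name] = source


registry = ToyOpRegistry()
# A stale compiled extension registers the op first when it is loaded...
registry.define("torchao::quant_llm_linear", "torchao/csrc/fp6_llm.cpp")
# ...then the newer Python module tries to define the same op again.
try:
    registry.define("torchao::quant_llm_linear", "torchao/ops.py")
    duplicate_error = None
except RuntimeError as err:
    duplicate_error = str(err)
print(duplicate_error)
```

In this model the second `define` call fails no matter which side is "right", which is consistent with the error only going away once the old artifact is rebuilt or removed.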

@vkuzo vkuzo closed this as completed Oct 2, 2024
@gau-nernst
Collaborator

@vkuzo Have you tried uninstalling torchao (and removing the built CUDA extension in torchao/_C.cpython-xxx.so) and reinstalling it? I think the problem is that you still have the old CUDA extension from before #949, which defines custom ops in C++. Post-#949, these ops are defined in Python instead.

Edit: just saw you solved the issue in your latest comment 😄
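
For anyone hitting the same thing, a small sketch for spotting a leftover built extension in a source checkout. The `_C*.so` pattern is inferred from the `torchao/_C.cpython-xxx.so` name above; `find_built_extensions` is a hypothetical helper, and the `torchao` path assumes the script runs from the repository root.

```python
from pathlib import Path


def find_built_extensions(package_dir):
    """Return compiled extension files matching _C*.so directly inside package_dir."""
    return sorted(Path(package_dir).glob("_C*.so"))


if __name__ == "__main__":
    # Assumes the current directory is the repository root; adjust as needed.
    for path in find_built_extensions("torchao"):
        print(f"possible stale extension: {path}")
```

The sketch only lists candidates; whether to delete them or rebuild is left to the reader.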

@vkuzo
Contributor Author

vkuzo commented Oct 2, 2024

yeah I forgot I was building with CPP=0 but have previously built with CPP=1, pretty edge casey

yanbing-j pushed a commit to yanbing-j/ao that referenced this issue Dec 9, 2024
clarify why directory is here