running float8 tests fails on main due to unrelated error #991

vkuzo · 2024-10-02T14:50:16Z

I just rebased on the latest main branch, and running the float8 tests fails with an unrelated error:

(pytorch) [vasiliy@devgpu006.vll6 ~/local/ao (main)]$ pytest test/float8/test_base.py -s -x
======================================================================================================= test session starts =======================================================================================================
platform linux -- Python 3.11.0, pytest-8.3.3, pluggy-1.5.0
rootdir: /data/users/vasiliy/ao
plugins: hypothesis-6.111.2, typeguard-4.3.0
collected 0 items / 1 error                                                                                                                                                                                                       

============================================================================================================= ERRORS ==============================================================================================================
____________________________________________________________________________________________ ERROR collecting test/float8/test_base.py ____________________________________________________________________________________________
test/float8/test_base.py:19: in <module>
    from torchao.utils import TORCH_VERSION_AT_LEAST_2_5
torchao/__init__.py:31: in <module>
    from torchao.quantization import (
torchao/quantization/__init__.py:8: in <module>
    from .quant_api import *  # noqa: F403
torchao/quantization/quant_api.py:26: in <module>
    from torchao.dtypes.uintx.uintx import UintxLayoutType
torchao/dtypes/__init__.py:4: in <module>
    from .affine_quantized_tensor import (
torchao/dtypes/affine_quantized_tensor.py:3: in <module>
    import torchao.ops
torchao/ops.py:8: in <module>
    lib.define("quant_llm_linear(int EXPONENT, int MANTISSA, Tensor _in_feats, Tensor _weights, Tensor _scales, int splitK) -> Tensor")
../pytorch/torch/library.py:146: in define
    result = self.m.define(schema, alias_analysis, tuple(tags))
E   RuntimeError: Tried to register an operator (torchao::quant_llm_linear(int EXPONENT, int MANTISSA, Tensor _in_feats, Tensor _weights, Tensor _scales, int splitK) -> Tensor) with the same name and overload name multiple times. Each overload's schema should only be registered with a single call to def(). Duplicate registration: registered at /dev/null:241. Original registration: registered at /data/users/vasiliy/ao/torchao/csrc/fp6_llm.cpp:5
===================================================================================================== short test summary info =====================================================================================================
ERROR test/float8/test_base.py - RuntimeError: Tried to register an operator (torchao::quant_llm_linear(int EXPONENT, int MANTISSA, Tensor _in_feats, Tensor _weights, Tensor _scales, int splitK) -> Tensor) with the same name and overload name multiple tim...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
======================================================================================================== 1 error in 1.85s =========================================================================================================
(pytorch) [vasiliy@devgpu006.vll6 ~/local/ao (main)]$ git log -n 1
commit 9229df9acd912bcf00e8faf138a33382d94e23b2 (HEAD -> main, origin/main, origin/HEAD)
Author: Apurva Jain <apurvajain.kota@gmail.com>
Date:   Tue Oct 1 17:55:26 2024 -0700

    Float8 dynamic autoquant (#946)
(pytorch) [vasiliy@devgpu006.vll6 ~/local/ao (main)]$ git status
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean

vkuzo · 2024-10-02T14:50:38Z

assign to oncall

vkuzo · 2024-10-02T14:55:25Z

#949 looks related

vkuzo · 2024-10-02T15:02:25Z

Actually, I think #914 was the root cause. When I'm on top of main after that PR, I can't import torchao (hit the same error as the issue summary). When I go to the commit before that PR, things work.

proof: https://gist.github.com/vkuzo/bd283c85b9ddd72b300e92cff2373589

cc @HDCharles

jainapurva · 2024-10-02T19:48:30Z

Try checking your CUDA and torch versions. Install the torch nightly and the corresponding CUDA version. I was having the same issue.

vkuzo · 2024-10-02T23:05:29Z

The fix was to rebuild torchao with USE_CPP=1:

with-proxy USE_CPP=1 pip install -e .

Note that rebuilding with USE_CPP=0 didn't fix things, presumably because the outdated artifacts were not rebuilt and they conflicted with the new op registration logic.

gau-nernst · 2024-10-02T23:06:29Z

@vkuzo Have you tried uninstalling torchao (and removing the built CUDA extension at torchao/_C.cpython-xxx.so) and then reinstalling it? I think the problem is that you still have the old CUDA extension from before #949, which defines the custom ops in C++. Post-#949, these ops are defined in Python instead, so the same op gets registered twice.

Edit: just saw you solved the issue in your latest comment 😄
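
For anyone hitting the same error, here is a minimal sketch of the cleanup step described above. It assumes the stale extension sits directly inside the torchao package directory of your checkout and matches the pattern _C*.so; the helper names find_stale_extensions and remove_stale_extensions are hypothetical and not part of torchao.

```python
from pathlib import Path


def find_stale_extensions(package_dir):
    """Return compiled extension files matching _C*.so directly inside package_dir."""
    # Assumption: the old build left a file such as _C.cpython-xxx.so in this directory.
    return sorted(Path(package_dir).glob("_C*.so"))


def remove_stale_extensions(package_dir, dry_run=True):
    """Return the matching files, deleting them only when dry_run is False."""
    stale = find_stale_extensions(package_dir)
    for path in stale:
        if not dry_run:
            path.unlink()
    return stale
```

Calling remove_stale_extensions("torchao") from the root of the checkout only lists the candidates; pass dry_run=False to delete them, then reinstall torchao.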

vkuzo · 2024-10-02T23:18:48Z

yeah I forgot I was building with USE_CPP=0 but had previously built with USE_CPP=1, pretty edge casey

vkuzo assigned jerryzh168 Oct 2, 2024

vkuzo closed this as completed Oct 2, 2024
