Named Symbol not found (torchchat #1298) #1110

Open
mikekgfb opened this issue Oct 17, 2024 · 11 comments
Labels
good first issue (Good for newcomers)

Comments

@mikekgfb

A quantized model gets a CUDA error: "named symbol not found".
See pytorch/torchchat#1298.

@mikekgfb
Author

CUDA version:
NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2

This ran on Google Colab. Detailed traceback/repro: https://colab.research.google.com/drive/1PRneJBaS5TlJaIgc4Lwv2muiePp6T9Ss?usp=sharing

$ nvidia-smi
Thu Oct 17 19:52:30 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

$ python torchchat.py generate stories110M --quant torchchat/quant_config/cuda.json --prompt "It was a dark and stormy night, and"
2024-10-17 19:52:38.809413: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-17 19:52:39.031641: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-17 19:52:39.096690: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-17 19:52:39.466437: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-17 19:52:41.795530: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Downloading builder script: 100% 5.67k/5.67k [00:00<00:00, 18.9MB/s]
Downloading https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt...
Downloading https://github.com/karpathy/llama2.c/raw/master/tokenizer.model...
Moving model to /root/.torchchat/model-cache/stories110M.
Using device=cuda Tesla T4
Loading model...
Time to load model: 0.40 seconds
Quantizing the model with: {'executor': {'accelerator': 'cuda'}, 'precision': {'dtype': 'bf16'}, 'linear:int4': {'groupsize': 256}}
Time to quantize model: 0.13 seconds
Traceback (most recent call last):
  File "/content/torchchat-1/torchchat.py", line 88, in <module>
    generate_main(args)
  File "/content/torchchat-1/torchchat/generate.py", line 1215, in main
    gen = Generator(
  File "/content/torchchat-1/torchchat/generate.py", line 290, in __init__
    self.model = _initialize_model(self.builder_args, self.quantize, self.tokenizer)
  File "/content/torchchat-1/torchchat/cli/builder.py", line 574, in _initialize_model
    quantize_model(
  File "/content/torchchat-1/torchchat/utils/quantize.py", line 114, in quantize_model
    quantize_(model, int4_weight_only(q_kwargs["groupsize"]))
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 462, in quantize_
    _replace_with_custom_fn_if_matches_filter(
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 202, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 202, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 202, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 198, in _replace_with_custom_fn_if_matches_filter
    model = replacement_fn(model)
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 392, in insert_subclass
    lin.weight = torch.nn.Parameter(constructor(lin.weight, **kwargs), requires_grad=requires_grad)
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 553, in apply_int4_weight_only_quant
    return to_affine_quantized_intx(weight, mapping_type, block_size, target_dtype, quant_min, quant_max, eps, zero_point_dtype=zero_point_dtype, preserve_zero=preserve_zero, zero_point_domain=zero_point_domain, layout_type=layout_type, use_hqq=use_hqq)
  File "/usr/local/lib/python3.10/dist-packages/torchao/dtypes/affine_quantized_tensor.py", line 286, in from_hp_to_intx
    layout_tensor = layout_tensor_ctr(data, scale, zero_point, layout_type)
  File "/usr/local/lib/python3.10/dist-packages/torchao/dtypes/affine_quantized_tensor.py", line 1033, in from_plain
    scale_and_zero = pack_tinygemm_scales_and_zeros(scale, zero_point)
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/utils.py", line 322, in pack_tinygemm_scales_and_zeros
    torch.cat(
RuntimeError: CUDA error: named symbol not found
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@gau-nernst
Collaborator

I believe this is because tinygemm does not support sm75, i.e. the T4.
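
For reference, a quick way to confirm what the running GPU reports (a minimal sketch; the sm80 threshold is the assumption discussed in this thread, not something the error itself states):

import torch

# On a Colab T4 this prints something like "Tesla T4" and (7, 5), i.e. sm75,
# below the sm80+ that tinygemm's BF16 int4 kernels are said to target.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))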

@msaroufim
Member

We could throw a better error message
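
For example, a guard along these lines could run before int4 weight-only quantization is attempted (a minimal sketch, not torchao's actual API; the sm80 requirement is taken from the discussion above):

import torch

def check_int4_tinygemm_support():
    # Hypothetical pre-check: tinygemm's int4 kernels are reported above to need
    # native BF16 (sm80+); the T4 is sm75, hence the opaque CUDA error.
    if not torch.cuda.is_available():
        raise RuntimeError("int4 weight-only quantization requires a CUDA device.")
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (8, 0):
        raise RuntimeError(
            f"int4 weight-only quantization (tinygemm) needs compute capability >= 8.0, "
            f"but {torch.cuda.get_device_name()} is sm{major}{minor}."
        )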

@msaroufim
Member

msaroufim commented Oct 21, 2024

I couldn't repro this on a fresh Google Colab T4 GPU. It might be something more environment-specific in the linked notebook, and it's likely just an issue with needing specific CUDA versions installed.

In particular, please note that you can get torchao linked against a specific CUDA version by installing it from the PyTorch index (https://github.com/pytorch/ao#installation); otherwise, installing from source is generally less finicky.

# -*- coding: utf-8 -*-
"""Untitled103.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1AgFr2Ofz4aEc3s_KFX-1LL4NaCpzCjv9
"""

# ! USE_CPP=0 pip install git+https://github.com/pytorch/ao.git@msaroufim/better-tinygemmwarning-for-google-colab --force-reinstall
! USE_CPP=0 pip install git+https://github.com/pytorch/ao.git --force-reinstall
! pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly

# prompt: toy pytorch model with a single linear layer

import torch
import torch.nn as nn

class ToyModel(nn.Module):
  def __init__(self, input_size, output_size):
    super(ToyModel, self).__init__()
    self.linear = nn.Linear(input_size, output_size)

  def forward(self, x):
    return self.linear(x)

# Example usage
input_size = 10
output_size = 1
model = ToyModel(input_size, output_size).cuda()

# Create some sample input data
input_data = torch.randn(1, input_size).cuda()

# Perform a forward pass
output = model(input_data)

print(output)

import torch
import torchao
from torchao.quantization.quant_api import (
    quantize_,
    int8_dynamic_activation_int8_weight,
    int4_weight_only,
    int8_weight_only
)


quantize_(model, int4_weight_only())

model(input_data)

@gau-nernst
Collaborator

The error comes from PyTorch's torch.ops.aten._convert_weight_to_int4pack(), not torchao.

I can reproduce it with the following snippet:

import torch

x = torch.randint(0, 255, size=(1024, 1024), dtype=torch.uint8).cuda()
x = x.view(torch.int32)  # I think 2.4 expects int32, 2.5 expects uint8
torch.ops.aten._convert_weight_to_int4pack(x, innerKTiles=8)

@msaroufim Your example does not reproduce the error because the weight is too small. When I changed it to the following, the error is reproduced:

# Example usage
input_size = 1024
output_size = 1024
model = ToyModel(input_size, output_size).cuda().bfloat16()  # also require BF16

# Create some sample input data
input_data = torch.randn(1, input_size).cuda().bfloat16()
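
Putting the two pieces together, a self-contained version of the repro would look roughly like this (assembled from the snippets above, not copied from a verified run):

import torch
import torch.nn as nn
from torchao.quantization.quant_api import quantize_, int4_weight_only

class ToyModel(nn.Module):
  def __init__(self, input_size, output_size):
    super().__init__()
    self.linear = nn.Linear(input_size, output_size)

  def forward(self, x):
    return self.linear(x)

# 1024x1024 is large enough to hit the int4 packing path; BF16 is needed because
# the tinygemm packing code expects bfloat16 scales/zeros.
model = ToyModel(1024, 1024).cuda().bfloat16()
input_data = torch.randn(1, 1024).cuda().bfloat16()

quantize_(model, int4_weight_only())  # fails with "named symbol not found" on sm75
model(input_data)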

@mikekgfb
Author

mikekgfb commented Nov 5, 2024

This might just be the leading edge of "bfloat16 doesn't work on the T4 in PyTorch because it's not supported by the hardware". Directionally, would we look at adding this support, or do we just put up a better error message?

So, it's definitely bfloat16 causing this, as can be ascertained with this command:

! python torchchat.py generate stories110M --quant '{"executor": {"accelerator": "cuda"}, "precision": {"dtype": "bf16"}, "linear:int4": {"groupsize": 256}}' --prompt "It was a dark and stormy night, and"

Sadly, switching to float16 does not work either:

! python torchchat.py generate stories110M --quant '{"executor": {"accelerator": "cuda"}, "precision": {"dtype": "float16"}, "linear:int4": {"groupsize": 256}}' --prompt "It was a dark and stormy night, and"

Quantizing the model with: {'executor': {'accelerator': 'cuda'}, 'precision': {'dtype': 'float16'}, 'linear:int4': {'groupsize': 256}}
Time to quantize model: 0.19 seconds
Traceback (most recent call last):
  File "/content/torchchat/torchchat.py", line 88, in <module>
    generate_main(args)
  File "/content/torchchat/torchchat/generate.py", line 1228, in main
    gen = Generator(
  File "/content/torchchat/torchchat/generate.py", line 293, in __init__
    self.model = _initialize_model(self.builder_args, self.quantize, self.tokenizer)
  File "/content/torchchat/torchchat/cli/builder.py", line 661, in _initialize_model
    quantize_model(
  File "/content/torchchat/torchchat/utils/quantize.py", line 115, in quantize_model
    quantize_(model, int4_weight_only(q_kwargs["groupsize"]))
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 462, in quantize_
    _replace_with_custom_fn_if_matches_filter(
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 202, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 202, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 202, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 198, in _replace_with_custom_fn_if_matches_filter
    model = replacement_fn(model)
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 392, in insert_subclass
    lin.weight = torch.nn.Parameter(constructor(lin.weight, **kwargs), requires_grad=requires_grad)
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 553, in apply_int4_weight_only_quant
    return to_affine_quantized_intx(weight, mapping_type, block_size, target_dtype, quant_min, quant_max, eps, zero_point_dtype=zero_point_dtype, preserve_zero=preserve_zero, zero_point_domain=zero_point_domain, layout_type=layout_type, use_hqq=use_hqq)
  File "/usr/local/lib/python3.10/dist-packages/torchao/dtypes/affine_quantized_tensor.py", line 286, in from_hp_to_intx
    layout_tensor = layout_tensor_ctr(data, scale, zero_point, layout_type)
  File "/usr/local/lib/python3.10/dist-packages/torchao/dtypes/affine_quantized_tensor.py", line 1033, in from_plain
    scale_and_zero = pack_tinygemm_scales_and_zeros(scale, zero_point)
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/utils.py", line 319, in pack_tinygemm_scales_and_zeros
    guard_dtype_size(scales, "scales", dtype=dtype, size=zeros.size())
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/utils.py", line 128, in guard_dtype_size
    raise ValueError(f"Expected Tensor argument {arg_name} to have dtype {dtype}, but got {tensor_arg.dtype} instead.")
ValueError: Expected Tensor argument scales to have dtype torch.bfloat16, but got torch.float16 instead.

Ditto for torch.float32, same error. This would suggest that int4 linear quantization isn't available unless the hardware has support for BF16? (This would be much less of an issue if the bread and butter of Google Colab were not the T4, since Colab is a super convenient place to let users try quick experiments and ramp up.)

@mikekgfb
Author

mikekgfb commented Nov 5, 2024

pytorch/torchchat#1344 recognizes whether the target GPU supports bfloat16 and either avoids using it (for the fast* alias dtypes, by falling back to fp16) or issues an error (if bf16 is explicitly specified).
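
The general shape of that check might look something like the sketch below (my approximation, not the actual code from pytorch/torchchat#1344; the alias names and the sm80 test are assumptions):

import torch

def resolve_precision(requested: str) -> torch.dtype:
    # Hypothetical helper: treat native BF16 as available on sm80+ only.
    bf16_ok = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 0)
    if requested.startswith("fast"):  # "fast"-style alias dtypes
        return torch.bfloat16 if bf16_ok else torch.float16
    if requested in ("bf16", "bfloat16"):
        if not bf16_ok:
            raise RuntimeError("bf16 was requested explicitly, but this GPU lacks native BF16 support.")
        return torch.bfloat16
    return torch.float16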

@gau-nernst
Collaborator

tinygemm (the INT4 weight-only kernel for CUDA) only supports BF16 https://github.com/pytorch/pytorch/blob/86d7d39bffd3b7b099310fb351b2b36f99981d6f/aten/src/ATen/native/cuda/int4mm.cu

Though in theory it should be possible to make it work with FP16, similar to #1147

@jerryzh168
Contributor

tinygemm (the INT4 weight-only kernel for CUDA) only supports BF16 https://github.com/pytorch/pytorch/blob/86d7d39bffd3b7b099310fb351b2b36f99981d6f/aten/src/ATen/native/cuda/int4mm.cu

Though in theory it should be possible to make it work with FP16, similar to #1147

Yeah, the Meta-internal tinygemm version actually already supports fp16; we just need to upstream the changes. cc @yanboliang, who plans to do this.

@pawarmanasi07

Hi! I'm interested in contributing to this issue.
I'm new to open source, so before implementing better error handling for BF16 support detection, I wanted to check if anyone is already working on this. As discussed above, I'm considering adding explicit checks for hardware BF16 support with clear error messages.

@mikekgfb
Author

Hi! I'm interested in contributing to this issue. I'm new to open source, so before implementing better error handling for BF16 support detection, I wanted to check if anyone is already working on this. As discussed above, I'm considering adding explicit checks for hardware BF16 support with clear error messages.

At a minimum we'd want a better error message (for this kernel only... PyTorch overall simply emulates bf16 elsewhere).

It would be cool if we could enable this kernel to work with BF16 even on hardware without native support; PyTorch emulates it elsewhere. Basically, if tinygemm supports fp32 computation, we can block as bf16 and compute as fp32 (bf16 is the most significant 16 bits of fp32, so it's easy to convert back and forth), especially if we already accumulate in fp32 (which some GEMMs do by default to avoid rounding artifacts).
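
As a quick sanity check on the "bf16 is the top 16 bits of fp32" point, this little experiment (my own illustration, not from the thread) shows that zeroing the low 16 bits of an fp32 value lands on a representable bf16 value:

import torch

x = torch.randn(4, dtype=torch.float32)
# Reinterpret the bits as int32, drop the low 16 mantissa bits, reinterpret back.
truncated = (x.view(torch.int32) & ~0xFFFF).view(torch.float32)
print(truncated)
# PyTorch's bf16 cast rounds to nearest-even rather than truncating, so the two
# can differ in the last bf16 mantissa bit, but they agree up to rounding.
print(x.to(torch.bfloat16).to(torch.float32))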

This is mostly an issue because the T4 is so widespread as the Google Colab GPU accelerator, and Colab is a super convenient place to run quick experiments. Otherwise, I can't think of a platform that would be this long-lasting and still be in broad use.
