Named Symbol not found (torchchat #1298) #1110

Open
mikekgfb opened this issue Oct 17, 2024 · 11 comments
Labels
good first issue (Good for newcomers)

Comments

@mikekgfb

A quantized model gets a CUDA error: "named symbol not found".
See pytorch/torchchat#1298.

@mikekgfb
Author

CUDA version:
NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2

This ran on Google Colab. Detailed traceback/repro: https://colab.research.google.com/drive/1PRneJBaS5TlJaIgc4Lwv2muiePp6T9Ss?usp=sharing

$ nvidia-smi
Thu Oct 17 19:52:30 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

$ python torchchat.py generate stories110M --quant torchchat/quant_config/cuda.json --prompt "It was a dark and stormy night, and"
2024-10-17 19:52:38.809413: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-17 19:52:39.031641: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-17 19:52:39.096690: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-17 19:52:39.466437: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-17 19:52:41.795530: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Downloading builder script: 100% 5.67k/5.67k [00:00<00:00, 18.9MB/s]
Downloading https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt...
Downloading https://github.com/karpathy/llama2.c/raw/master/tokenizer.model...
Moving model to /root/.torchchat/model-cache/stories110M.
Using device=cuda Tesla T4
Loading model...
Time to load model: 0.40 seconds
Quantizing the model with: {'executor': {'accelerator': 'cuda'}, 'precision': {'dtype': 'bf16'}, 'linear:int4': {'groupsize': 256}}
Time to quantize model: 0.13 seconds
Traceback (most recent call last):
  File "/content/torchchat-1/torchchat.py", line 88, in <module>
    generate_main(args)
  File "/content/torchchat-1/torchchat/generate.py", line 1215, in main
    gen = Generator(
  File "/content/torchchat-1/torchchat/generate.py", line 290, in __init__
    self.model = _initialize_model(self.builder_args, self.quantize, self.tokenizer)
  File "/content/torchchat-1/torchchat/cli/builder.py", line 574, in _initialize_model
    quantize_model(
  File "/content/torchchat-1/torchchat/utils/quantize.py", line 114, in quantize_model
    quantize_(model, int4_weight_only(q_kwargs["groupsize"]))
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 462, in quantize_
    _replace_with_custom_fn_if_matches_filter(
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 202, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 202, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 202, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 198, in _replace_with_custom_fn_if_matches_filter
    model = replacement_fn(model)
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 392, in insert_subclass
    lin.weight = torch.nn.Parameter(constructor(lin.weight, **kwargs), requires_grad=requires_grad)
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 553, in apply_int4_weight_only_quant
    return to_affine_quantized_intx(weight, mapping_type, block_size, target_dtype, quant_min, quant_max, eps, zero_point_dtype=zero_point_dtype, preserve_zero=preserve_zero, zero_point_domain=zero_point_domain, layout_type=layout_type, use_hqq=use_hqq)
  File "/usr/local/lib/python3.10/dist-packages/torchao/dtypes/affine_quantized_tensor.py", line 286, in from_hp_to_intx
    layout_tensor = layout_tensor_ctr(data, scale, zero_point, layout_type)
  File "/usr/local/lib/python3.10/dist-packages/torchao/dtypes/affine_quantized_tensor.py", line 1033, in from_plain
    scale_and_zero = pack_tinygemm_scales_and_zeros(scale, zero_point)
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/utils.py", line 322, in pack_tinygemm_scales_and_zeros
    torch.cat(
RuntimeError: CUDA error: named symbol not found
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@gau-nernst
Collaborator

I believe this is because tinygemm does not support sm75, i.e. the T4.
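
For reference, a quick way to confirm what the running GPU reports (a minimal sketch; the sm80 threshold is the assumption discussed in this thread, not something the error itself states):

import torch

# On a Colab T4 this prints something like "Tesla T4" and (7, 5), i.e. sm75,
# below the sm80+ that tinygemm's BF16 int4 kernels are said to target.
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))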

@msaroufim
Member

We could throw a better error message
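
For example, a guard along these lines could run before int4 weight-only quantization is attempted (a minimal sketch, not torchao's actual API; the sm80 requirement is taken from the discussion above):

import torch

def check_int4_tinygemm_support():
    # Hypothetical pre-check: tinygemm's int4 kernels are reported above to need
    # native BF16 (sm80+); the T4 is sm75, hence the opaque CUDA error.
    if not torch.cuda.is_available():
        raise RuntimeError("int4 weight-only quantization requires a CUDA device.")
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) < (8, 0):
        raise RuntimeError(
            f"int4 weight-only quantization (tinygemm) needs compute capability >= 8.0, "
            f"but {torch.cuda.get_device_name()} is sm{major}{minor}."
        )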

@msaroufim
Member

msaroufim commented Oct 21, 2024

I couldn't repro this on a fresh Google Colab T4 GPU. It might be something more environment-specific in the linked notebook, and it's likely just an issue with needing specific CUDA versions installed.

In particular, please note that you can get torchao linked against a specific CUDA version by installing it from the PyTorch index (https://github.com/pytorch/ao#installation); otherwise, installing from source is generally less finicky.

# -*- coding: utf-8 -*-
"""Untitled103.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1AgFr2Ofz4aEc3s_KFX-1LL4NaCpzCjv9
"""

# ! USE_CPP=0 pip install git+https://github.com/pytorch/ao.git@msaroufim/better-tinygemmwarning-for-google-colab --force-reinstall
! USE_CPP=0 pip install git+https://github.com/pytorch/ao.git --force-reinstall
! pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly

# prompt: toy pytorch model with a single linear layer

import torch
import torch.nn as nn

class ToyModel(nn.Module):
  def __init__(self, input_size, output_size):
    super(ToyModel, self).__init__()
    self.linear = nn.Linear(input_size, output_size)

  def forward(self, x):
    return self.linear(x)

# Example usage
input_size = 10
output_size = 1
model = ToyModel(input_size, output_size).cuda()

# Create some sample input data
input_data = torch.randn(1, input_size).cuda()

# Perform a forward pass
output = model(input_data)

print(output)

import torch
import torchao
from torchao.quantization.quant_api import (
    quantize_,
    int8_dynamic_activation_int8_weight,
    int4_weight_only,
    int8_weight_only
)


quantize_(model, int4_weight_only())

model(input_data)

@gau-nernst
Collaborator

The error comes from PyTorch's torch.ops.aten._convert_weight_to_int4pack(), not torchao.

I can reproduce it with the following snippet:

import torch

x = torch.randint(0, 255, size=(1024, 1024), dtype=torch.uint8).cuda()
x = x.view(torch.int32)  # I think 2.4 expects int32, 2.5 expects uint8
torch.ops.aten._convert_weight_to_int4pack(x, innerKTiles=8)

@msaroufim Your example does not reproduce the error because the weight is too small. When I changed it to the following, the error is reproduced:

# Example usage
input_size = 1024
output_size = 1024
model = ToyModel(input_size, output_size).cuda().bfloat16()  # also require BF16

# Create some sample input data
input_data = torch.randn(1, input_size).cuda().bfloat16()
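
Putting the two pieces together, a self-contained version of the repro would look roughly like this (assembled from the snippets above, not copied from a verified run):

import torch
import torch.nn as nn
from torchao.quantization.quant_api import quantize_, int4_weight_only

class ToyModel(nn.Module):
  def __init__(self, input_size, output_size):
    super().__init__()
    self.linear = nn.Linear(input_size, output_size)

  def forward(self, x):
    return self.linear(x)

# 1024x1024 is large enough to hit the int4 packing path; BF16 is needed because
# the tinygemm packing code expects bfloat16 scales/zeros.
model = ToyModel(1024, 1024).cuda().bfloat16()
input_data = torch.randn(1, 1024).cuda().bfloat16()

quantize_(model, int4_weight_only())  # fails with "named symbol not found" on sm75
model(input_data)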

@mikekgfb
Author

mikekgfb commented Nov 5, 2024

This might just be the leading edge of "bfloat16 doesn't work on the T4 in PyTorch because it's not supported by the hardware". Directionally, would we look at adding this support, or do we just put up a better error message?

So, it's definitely bfloat16 causing this, as can be ascertained with this command:

! python torchchat.py generate stories110M --quant '{"executor": {"accelerator": "cuda"}, "precision": {"dtype": "bf16"}, "linear:int4": {"groupsize": 256}}' --prompt "It was a dark and stormy night, and"

Sadly, switching to float16 does not work either:

! python torchchat.py generate stories110M --quant '{"executor": {"accelerator": "cuda"}, "precision": {"dtype": "float16"}, "linear:int4": {"groupsize": 256}}' --prompt "It was a dark and stormy night, and"

Quantizing the model with: {'executor': {'accelerator': 'cuda'}, 'precision': {'dtype': 'float16'}, 'linear:int4': {'groupsize': 256}}
Time to quantize model: 0.19 seconds
Traceback (most recent call last):
  File "/content/torchchat/torchchat.py", line 88, in <module>
    generate_main(args)
  File "/content/torchchat/torchchat/generate.py", line 1228, in main
    gen = Generator(
  File "/content/torchchat/torchchat/generate.py", line 293, in __init__
    self.model = _initialize_model(self.builder_args, self.quantize, self.tokenizer)
  File "/content/torchchat/torchchat/cli/builder.py", line 661, in _initialize_model
    quantize_model(
  File "/content/torchchat/torchchat/utils/quantize.py", line 115, in quantize_model
    quantize_(model, int4_weight_only(q_kwargs["groupsize"]))
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 462, in quantize_
    _replace_with_custom_fn_if_matches_filter(
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 202, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 202, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 202, in _replace_with_custom_fn_if_matches_filter
    new_child = _replace_with_custom_fn_if_matches_filter(
  [Previous line repeated 2 more times]
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 198, in _replace_with_custom_fn_if_matches_filter
    model = replacement_fn(model)
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 392, in insert_subclass
    lin.weight = torch.nn.Parameter(constructor(lin.weight, **kwargs), requires_grad=requires_grad)
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/quant_api.py", line 553, in apply_int4_weight_only_quant
    return to_affine_quantized_intx(weight, mapping_type, block_size, target_dtype, quant_min, quant_max, eps, zero_point_dtype=zero_point_dtype, preserve_zero=preserve_zero, zero_point_domain=zero_point_domain, layout_type=layout_type, use_hqq=use_hqq)
  File "/usr/local/lib/python3.10/dist-packages/torchao/dtypes/affine_quantized_tensor.py", line 286, in from_hp_to_intx
    layout_tensor = layout_tensor_ctr(data, scale, zero_point, layout_type)
  File "/usr/local/lib/python3.10/dist-packages/torchao/dtypes/affine_quantized_tensor.py", line 1033, in from_plain
    scale_and_zero = pack_tinygemm_scales_and_zeros(scale, zero_point)
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/utils.py", line 319, in pack_tinygemm_scales_and_zeros
    guard_dtype_size(scales, "scales", dtype=dtype, size=zeros.size())
  File "/usr/local/lib/python3.10/dist-packages/torchao/quantization/utils.py", line 128, in guard_dtype_size
    raise ValueError(f"Expected Tensor argument {arg_name} to have dtype {dtype}, but got {tensor_arg.dtype} instead.")
ValueError: Expected Tensor argument scales to have dtype torch.bfloat16, but got torch.float16 instead.

Ditto for torch.float32, same error. This would suggest that int4 linear quantization isn't available unless the hardware has support for BF16? (This would be much less of an issue if the bread and butter of Google Colab were not the T4, since Colab is a super convenient place to let users try quick experiments and ramp up.)

@mikekgfb
Author

mikekgfb commented Nov 5, 2024

pytorch/torchchat#1344 recognizes whether the target GPU supports bfloat16 and either avoids using it (for the fast* alias dtypes, by falling back to fp16) or issues an error (if bf16 is explicitly specified).
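
The general shape of that check might look something like the sketch below (my approximation, not the actual code from pytorch/torchchat#1344; the alias names and the sm80 test are assumptions):

import torch

def resolve_precision(requested: str) -> torch.dtype:
    # Hypothetical helper: treat native BF16 as available on sm80+ only.
    bf16_ok = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 0)
    if requested.startswith("fast"):  # "fast"-style alias dtypes
        return torch.bfloat16 if bf16_ok else torch.float16
    if requested in ("bf16", "bfloat16"):
        if not bf16_ok:
            raise RuntimeError("bf16 was requested explicitly, but this GPU lacks native BF16 support.")
        return torch.bfloat16
    return torch.float16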

@gau-nernst
Collaborator

tinygemm (the INT4 weight-only kernel for CUDA) only supports BF16 https://github.com/pytorch/pytorch/blob/86d7d39bffd3b7b099310fb351b2b36f99981d6f/aten/src/ATen/native/cuda/int4mm.cu

Though in theory it should be possible to make it work with FP16, similar to #1147

@jerryzh168
Contributor

tinygemm (the INT4 weight-only kernel for CUDA) only supports BF16 https://github.com/pytorch/pytorch/blob/86d7d39bffd3b7b099310fb351b2b36f99981d6f/aten/src/ATen/native/cuda/int4mm.cu

Though in theory it should be possible to make it work with FP16, similar to #1147

Yeah, the Meta-internal tinygemm version actually already supports fp16; we just need to upstream the changes. cc @yanboliang, who plans to do this.

@pawarmanasi07

Hi! I'm interested in contributing to this issue.
I'm new to open source, so before implementing better error handling for BF16 support detection, I wanted to check if anyone is already working on this. As discussed above, I'm considering adding explicit checks for hardware BF16 support with clear error messages.

@mikekgfb
Author

Hi! I'm interested in contributing to this issue. I'm new to open source, so before implementing better error handling for BF16 support detection, I wanted to check if anyone is already working on this. As discussed above, I'm considering adding explicit checks for hardware BF16 support with clear error messages.

At a minimum we'd want a better error message (for this kernel only... PyTorch overall simply emulates bf16 elsewhere).

It would be cool if we could enable this kernel to work with BF16 even on hardware without native support; PyTorch emulates it elsewhere. Basically, if tinygemm supports fp32 computation, we can block as bf16 and compute as fp32 (bf16 is the most significant 16 bits of fp32, so it's easy to convert back and forth), especially if we already accumulate in fp32 (which some GEMMs do by default to avoid rounding artifacts).
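
As a quick sanity check on the "bf16 is the top 16 bits of fp32" point, this little experiment (my own illustration, not from the thread) shows that zeroing the low 16 bits of an fp32 value lands on a representable bf16 value:

import torch

x = torch.randn(4, dtype=torch.float32)
# Reinterpret the bits as int32, drop the low 16 mantissa bits, reinterpret back.
truncated = (x.view(torch.int32) & ~0xFFFF).view(torch.float32)
print(truncated)
# PyTorch's bf16 cast rounds to nearest-even rather than truncating, so the two
# can differ in the last bf16 mantissa bit, but they agree up to rounding.
print(x.to(torch.bfloat16).to(torch.float32))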

This is mostly an issue because the T4 is so widespread as the Google Colab GPU accelerator, and Colab is a super convenient place to run quick experiments. Otherwise, I can't think of a platform that would be this long-lasting and still be in broad use.
