
Support int8 KVCacheQuant and W8A8 inference in vllm #1112

Closed
wants to merge 52 commits

Conversation


@AniZpZ AniZpZ commented Sep 20, 2023

We have recently implemented and tested int8 KV-Cache quantization and W8A8 inference in vLLM. We found that our quantization implementation can increase throughput by over 20% and reduce first-token latency under heavy load. In contrast, the W4A16 quant methods (e.g., the AWQ-based method) provided in vLLM cannot improve throughput, according to PR #1032, because they cannot benefit from the int8 tensor cores. So we propose this PR as an alternative quantization method.

Updates!!!
We have made some more progress in #1112 (comment)

More Updates!!!
If you want to properly evaluate the MMLU dataset with vLLM, the sampler must be modified slightly. The code can be found in our mmlu_eval branch.

Important message!!!
We split the PR into two parts for easier review and use. The W8A8 inference part is in #1508 and the KV-cache quant part is in #1507.

What we have right now:

  1. int8 KV-Cache quantization related work:
    a. Quant/Dequant helper functions adapted from FasterTransformer
    b. Quantized versions of the CUDA kernels
    c. Unit tests for the added kernels
  2. W8A8 inference related work:
    a. Int8 GEMM kernels adapted from torch-int
    b. W8A8 linear layer modules
    c. Support for W8A8 inference on the LLaMA model
  3. Test results based on our own dataset

What we plan to do:

  • 1. Further kernel fusion
  • 2. Code refactoring and cleaning
  • 3. Optimize the int8 GEMM kernel
  • 4. Release SmoothQuant for LLaMA
  • 5. Add code for generating KV-Cache quantization parameters (scales and zero points)
  • 6. Experiments on more datasets

How to test throughput
A. How to enable W8A8 inference
0. Install CUTLASS, because we currently use the CUTLASS GEMM kernel. We plan to replace it with a cuBLAS GEMM kernel soon.
Update: we now support the cuBLAS GEMM kernel, so you can remove the CUTLASS GEMM kernel in setup.py.

  1. Install smoothquant and torch-int for LLaMA. Use "examples/generate_act_scales.py" to generate the activation scales, and then use "examples/export_int8_llama.py" to export the int8 model. Please remember to check and change the 'architectures' field in the model's config.json from 'Int8LlamaForCausalLM' to 'LlamaForCausalLM' (a sketch of this edit follows after this list).
  2. update vllm and execute
python ./vllm/entrypoints/api_server.py --model=/path/to/quantized/model --tokenizer=/path/to/tokenizer --max-num-batched-tokens=70000 --block-size=16 --swap-space=20 --quantization smoothquant
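
For step 1, the config.json edit can also be scripted; a minimal sketch (the path below is a placeholder):

import json
from pathlib import Path

cfg_path = Path("/path/to/quantized/model/config.json")  # placeholder path
cfg = json.loads(cfg_path.read_text())

# The exporter writes 'Int8LlamaForCausalLM'; vLLM expects the stock class name.
if cfg.get("architectures") == ["Int8LlamaForCausalLM"]:
    cfg["architectures"] = ["LlamaForCausalLM"]
    cfg_path.write_text(json.dumps(cfg, indent=2))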

B. how to enable kv cache quant

  1. Use vllm/kv_quant/calibrate.py to generate scales and vllm/kv_quant/export_kv_params.py to export the KV-cache quantization parameters (a sketch of the scale/zero-point math follows after this list).
  2. Execute
python ./vllm/entrypoints/api_server.py --model=/path/to/quantized/model --tokenizer=/path/to/tokenizer --max-num-batched-tokens=70000 --block-size=16 --swap-space=20 --kv-cache-dtype=int8 --kv-quant-params-path=/path/to/kv_params_dir
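
For intuition, a minimal sketch of how per-tensor int8 scales and zero points can be derived from calibration statistics. This is illustrative only, not the exact logic of calibrate.py, and the example min/max values are made up:

import torch

def int8_asym_params(stats_min: torch.Tensor, stats_max: torch.Tensor):
    # Asymmetric int8 quantization: map the observed [min, max] range onto [-128, 127].
    qmin, qmax = -128, 127
    scale = (stats_max - stats_min).clamp(min=1e-8) / (qmax - qmin)
    zero_point = qmin - torch.round(stats_min / scale)
    return scale, zero_point

# Example: min/max of key activations collected during calibration.
k_min, k_max = torch.tensor(-3.2), torch.tensor(4.1)
k_scale, k_zp = int8_asym_params(k_min, k_max)
# quantized = clamp(round(x / k_scale) + k_zp, -128, 127)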

KV-cache quantization and W8A8 inference can also be used together.

Experiment Results
Current test results on our datasets on an A100 80G (updated with quant & RMSNorm fusion and a GEMM D2H bug fix)

Throughput of FP16 LLaMA-13B:

Throughput:  4.9945 requests/s, 543.0660 token/s
Average latency: 31.7689 s

Throughput of Int8 LLaMA-13B with int8 KVCacheQuant:

Throughput: 6.1147 requests/s, 664.8646 token/s, 
Average latency: 27.4222 s

Throughput of Int8 LLaMA-13B with int8 KVCacheQuant, using cublas gemm kernel:

Throughput: 6.4723 requests/s, 703.7514 token/s, 
Average latency: 25.9912 s

How to evaluate model performance
We add an evaluation method for quantized models; currently the MMLU dataset is supported.
You can find details in benchmarks/benchmark_evaluation.py

python benchmark_evaluation.py --model=/path/to/quantized/model --tokenizer=/path/to/tokenizer --dev-data-path=/path/to/mmlu/dev/ --test-data-path=/path/to/mmlu/test/ --kv-cache-dtype=int8 --kv-quant-params-path=/path/to/kv_params_dir --quantization=smoothquant

Updates
We have released SmoothQuant for LLaMA in
https://github.com/AniZpZ/smoothquant/tree/llama-dev
https://github.com/AniZpZ/torch-int/tree/llama-dev

The code for generating KV-Cache quantization parameters is ready; check the vllm/kv_quant folder.

We replaced the int8 GEMM with the cuBLAS version, and the throughput improvement is now around 30%.

@casper-hansen
Contributor

casper-hansen commented Sep 20, 2023

This is interesting work! I was going to implement int8 in AutoAWQ over time, as the authors of SmoothQuant (this PR) and AWQ are the same. My best guess is that single_query_cached_kv_attention_quantized_kernel is doing the heavy lifting for throughput here, as it comes from FasterTransformer, which is well optimized.

@AniZpZ AniZpZ changed the title [Enhancement] Support int8 KVCacheQuant and W8A8 inference in vllm [WIP] Support int8 KVCacheQuant and W8A8 inference in vllm Sep 21, 2023
@viktor-ferenczi
Contributor

viktor-ferenczi commented Sep 21, 2023

I fully support this, since the 4-bit AWQ model proved to have inferior quality for my use cases. Having 8-bit weights with an 8-bit activation cache would be the best of both worlds, allowing for almost no loss of quality (perplexity) while running inference more efficiently. I would also keep a W8A16 mode as an option, in case the precision of the activations and the KV cache makes a difference in specific use cases.

@viktor-ferenczi viktor-ferenczi mentioned this pull request Sep 21, 2023
@zhyncs
Contributor

zhyncs commented Sep 21, 2023

Hi vLLM genius @WoosukKwon @zhuohan123

This is the latest development from our team regarding quantization support for vLLM; we had done something similar to #1032 before. At that time, we didn't open a PR after the benchmark results showed a drop in throughput, but we later found that #1032 was merged, which is very encouraging. Therefore, we are continuing performance optimization on this basis and sending out the PR in WIP state in advance, hoping to get some comments and suggestions and eventually merge it into the vLLM codebase smoothly. Cheers!

@casper-hansen
Contributor

@AniZpZ @zhyncs This is great work! My understanding is that this PR converts FP16 -> INT8 dynamically without computing a loss function to optimize perplexity. Have you evaluated perplexity on this approach?

@AniZpZ
Author

AniZpZ commented Sep 21, 2023

We implement quantization with the SmoothQuant method for W8A8; I will release the code later. The perplexity is identical to a standard SmoothQuant method if you do W8A8 inference without int8 KVCacheQuant.

Quantization details are discussed in this paper (Xiao et al.).
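
For readers unfamiliar with the method, a minimal sketch of the SmoothQuant idea from the paper: a per-input-channel smoothing factor migrates activation outliers into the weights before int8 quantization. This is illustrative only, not the code in this PR:

import torch

def smooth_linear(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    # act_absmax: per-input-channel max |activation| collected on calibration data.
    # weight: [out_features, in_features] FP16/FP32 weight of a linear layer.
    w_absmax = weight.abs().amax(dim=0)
    s = act_absmax.pow(alpha) / w_absmax.pow(1 - alpha).clamp(min=1e-8)
    s = s.clamp(min=1e-8)
    # At runtime activations are divided by s, and the weight is multiplied by s offline,
    # so X @ W.T is unchanged while activation outliers shrink before int8 quantization.
    return s, weight * s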

@casper-hansen
Contributor

SmoothQuant only supports OPT models. How can we test this PR when the SmoothQuant repository does not support LLaMa models? If you implement this PR without the quantization code, you will inevitably end up with a bad perplexity if you naively use W8A8 as you have no calibration dataset.

See this image: accuracy ends up being worse than INT4 if you naively convert weights to W8A8. You need the SmoothQuant or AWQ method to convert if you want to preserve accuracy. You need a framework for this, which is why I created AutoAWQ. I will look to implement INT8 quantization using the torch-int modules and would love your help with this so we can support all models in vLLM (LLaMa, MPT, Falcon, etc.) without accuracy degradation.

[image: accuracy comparison showing naive W8A8 ending up worse than INT4]

@AniZpZ
Author

AniZpZ commented Sep 21, 2023

We implemented SmoothQuant for LLaMA ourselves; you can find the code here: https://github.com/AniZpZ/smoothquant/tree/llama-dev and easily quantize and export a model with export_int8_llama.py.
It should work with https://github.com/AniZpZ/torch-int/tree/llama-dev

@casper-hansen
Contributor

Hi @AniZpZ @zhyncs, thank you for your great work with this PR.

I have now had more time to explore your fast implementation and found that NVIDIA only has high-throughput support for INT8, which lets this PR achieve higher throughput than INT4 due to software capabilities.

Is your proposal to run W8A16? Your code does not have A8 implemented in the llama.py model definition.

SmoothQuant implements W8A8, but it seems silly to run A8 as there should be little benefit speed-wise. Therefore, I see this as a natural choice. I want to confirm this with you for my implementation in AutoAWQ as I want to push INT8 models out using your initial LLaMa implementation, just using the AWQ method for minimum perplexity loss.

@AniZpZ
Author

AniZpZ commented Sep 23, 2023

Our proposal is to run W8A8. If you enable smoothquant, we replace RMSNorm and the linear layers with our custom int8 RMSNorm and W8A8 linear modules, which quantize the activations and implement int8 GEMM. You can find the details in w8a8linear.py; a rough sketch of the math is shown below.
If you want to enable the tensor cores for int8 computation, both the weights and the activations must be int8.
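
To illustrate the idea only (not the actual w8a8linear.py implementation), a minimal numerical sketch of a W8A8 linear: per-tensor int8 quantization of the activations, an integer matmul with int32 accumulation, and a dequant by the product of the two scales. The real kernel runs the matmul in int8 on tensor cores; here the integer product is simulated in float for portability.

import torch

def quantize_per_tensor_int8(x: torch.Tensor):
    scale = x.abs().amax().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, q_weight: torch.Tensor, w_scale: torch.Tensor):
    # x: FP16/FP32 activations [tokens, in]; q_weight: int8 [out, in]; w_scale: per-tensor weight scale.
    q_x, x_scale = quantize_per_tensor_int8(x)
    acc = q_x.float() @ q_weight.float().t()  # stands in for the int8 GEMM
    return acc * (x_scale * w_scale)          # dequantize back to floating point

# Usage: quantize a weight offline, then run the int8 path on new activations.
w = torch.randn(16, 32)
q_w, w_s = quantize_per_tensor_int8(w)
y = w8a8_linear(torch.randn(4, 32), q_w, w_s)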

@ChristineSeven

@HandH1998
When compiling vLLM from the kv_quant branch, another issue:

Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF /app/vllm/build/temp.linux-x86_64-3.8/csrc/cache.o.d -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c -c /app/vllm/csrc/cache.cpp -o /app/vllm/build/temp.linux-x86_64-3.8/csrc/cache.o -g -O2 -std=c++17 -D_GLIBCXX_USE_CXX11_ABI=0 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' '-DPYBIND11_STDLIB="libstdcpp"' '-DPYBIND11_BUILD_ABI="cxxabi1011"' -DTORCH_EXTENSION_NAME=cache_ops -D_GLIBCXX_USE_CXX11_ABI=0
[2/2] /usr/local/cuda/bin/nvcc -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c -c /app/vllm/csrc/cache_kernels.cu -o /app/vllm/build/temp.linux-x86_64-3.8/csrc/cache_kernels.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -std=c++17 -D_GLIBCXX_USE_CXX11_ABI=0 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 --threads 8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' '-DPYBIND11_STDLIB="libstdcpp"' '-DPYBIND11_BUILD_ABI="cxxabi1011"' -DTORCH_EXTENSION_NAME=cache_ops -D_GLIBCXX_USE_CXX11_ABI=0
FAILED: /app/vllm/build/temp.linux-x86_64-3.8/csrc/cache_kernels.o
/usr/local/cuda/bin/nvcc -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c -c /app/vllm/csrc/cache_kernels.cu -o /app/vllm/build/temp.linux-x86_64-3.8/csrc/cache_kernels.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -std=c++17 -D_GLIBCXX_USE_CXX11_ABI=0 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 --threads 8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=cache_ops -D_GLIBCXX_USE_CXX11_ABI=0
/app/vllm/csrc/cache_kernels.cu(44): warning #1143-D: arithmetic on pointer to void or function type

/app/vllm/csrc/cache_kernels.cu(45): warning #1143-D: arithmetic on pointer to void or function type

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/c10/core/TensorImpl.h(77): here

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/ATen/core/qualified_name.h(73): here

/app/vllm/csrc/cache_kernels.cu(399): warning #550-D: variable "src_key_indices" was set but never used

/app/vllm/csrc/cache_kernels.cu(400): warning #550-D: variable "src_value_indices" was set but never used

/app/vllm/csrc/quant_utils.cuh(217): error: identifier "__float22bfloat162_rn" is undefined

/app/vllm/csrc/cache_kernels.cu(44): warning #1143-D: arithmetic on pointer to void or function type

/app/vllm/csrc/cache_kernels.cu(45): warning #1143-D: arithmetic on pointer to void or function type

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/c10/core/TensorImpl.h(77): here

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/ATen/core/qualified_name.h(73): here

/app/vllm/csrc/cache_kernels.cu(399): warning #550-D: variable "src_key_indices" was set but never used

/app/vllm/csrc/cache_kernels.cu(400): warning #550-D: variable "src_value_indices" was set but never used

1 error detected in the compilation of "/app/vllm/csrc/cache_kernels.cu".
/app/vllm/csrc/quant_utils.cuh(217): error: identifier "__float22bfloat162_rn" is undefined

/app/vllm/csrc/cache_kernels.cu(44): warning #1143-D: arithmetic on pointer to void or function type

/app/vllm/csrc/cache_kernels.cu(45): warning #1143-D: arithmetic on pointer to void or function type

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/c10/core/TensorImpl.h(77): here

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/ATen/core/qualified_name.h(73): here

/app/vllm/csrc/cache_kernels.cu(399): warning #550-D: variable "src_key_indices" was set but never used

/app/vllm/csrc/cache_kernels.cu(400): warning #550-D: variable "src_value_indices" was set but never used

1 error detected in the compilation of "/app/vllm/csrc/cache_kernels.cu".
/app/vllm/csrc/cache_kernels.cu(44): warning #1143-D: arithmetic on pointer to void or function type

/app/vllm/csrc/cache_kernels.cu(45): warning #1143-D: arithmetic on pointer to void or function type

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/c10/core/TensorImpl.h(77): here

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/ATen/core/qualified_name.h(73): here

/app/vllm/csrc/cache_kernels.cu(399): warning #550-D: variable "src_key_indices" was set but never used

/app/vllm/csrc/cache_kernels.cu(400): warning #550-D: variable "src_value_indices" was set but never used

/app/vllm/csrc/cache_kernels.cu(44): warning #1143-D: arithmetic on pointer to void or function type

/app/vllm/csrc/cache_kernels.cu(45): warning #1143-D: arithmetic on pointer to void or function type

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/c10/core/TensorImpl.h(77): here

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/ATen/core/qualified_name.h(73): here

/app/vllm/csrc/cache_kernels.cu(399): warning #550-D: variable "src_key_indices" was set but never used

/app/vllm/csrc/cache_kernels.cu(400): warning #550-D: variable "src_value_indices" was set but never used

ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "setup.py", line 257, in
setuptools.setup(
File "/usr/local/lib/python3.8/dist-packages/setuptools/init.py", line 153, in setup
return distutils.core.setup(**attrs)
File "/usr/lib/python3.8/distutils/core.py", line 148, in setup
dist.run_commands()
File "/usr/lib/python3.8/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/install.py", line 74, in run
self.do_egg_install()
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/install.py", line 116, in do_egg_install
self.run_command('bdist_egg')
File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/bdist_egg.py", line 164, in run
cmd = self.call_command('install_lib', warn_dir=0)
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/bdist_egg.py", line 150, in call_command
self.run_command(cmdname)
File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/install_lib.py", line 11, in run
self.build()
File "/usr/lib/python3.8/distutils/command/install_lib.py", line 109, in build
self.run_command('build_ext')
File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/build_ext.py", line 79, in run
_build_ext.run(self)
File "/usr/local/lib/python3.8/dist-packages/Cython/Distutils/old_build_ext.py", line 186, in run
_build_ext.build_ext.run(self)
File "/usr/lib/python3.8/distutils/command/build_ext.py", line 340, in run
self.build_extensions()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
build_ext.build_extensions(self)
File "/usr/local/lib/python3.8/dist-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
_build_ext.build_ext.build_extensions(self)
File "/usr/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
self._build_extensions_serial()
File "/usr/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
self.build_extension(ext)
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/build_ext.py", line 202, in build_extension
_build_ext.build_extension(self, ext)
File "/usr/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
objects = self.compiler.compile(sources,
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1574, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

@AniZpZ
Author

AniZpZ commented Nov 27, 2023

(quoting the build log above)

__float22bfloat162_rn is a CUDA function; could you please check your CUDA version? The code has been tested with CUDA 11.8.

@ChristineSeven

ChristineSeven commented Nov 27, 2023

@AniZpZ
Exactly CUDA 11.8, and it compiled successfully on the other branch, w8a8.
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

@HandH1998
Contributor

Could you please check your environment variable TORCH_CUDA_ARCH_LIST? The CUDA arch should be >= 8.0.

@AniZpZ
Author

AniZpZ commented Dec 1, 2023

It looks like this problem occurs when the GPU arch does not match the requirement. Please check your GPU arch; a quick way to check is sketched below.
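
For example, a generic way to check the compute capability of the visible GPUs (per the comment above, anything below 8.0 lacks the bf16 intrinsics such as __float22bfloat162_rn). This is just an illustration, not part of the PR:

import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}, compute capability {major}.{minor}")
    if (major, minor) < (8, 0):
        print("  -> below sm_80, so this branch will not build for this GPU")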

@ChristineSeven

@AniZpZ @HandH1998
['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
Here is the CUDA arch support list. So is there a problem? Thanks!

@HandH1998
Contributor

export TORCH_CUDA_ARCH_LIST="8.0;8.6"

@shatealaboxiaowang

shatealaboxiaowang commented Dec 8, 2023

(quoting the full PR description above)

Which branch of vllm does KV-cache int8 run on?
(1) I ran it on kv_quant, but an error occurs as follows:
ImportError: /home/vllm/vllm-kv_quant/vllm/cuda_utils.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr10_M_releaseEv
(2) I ran it on the kv-quant-merge branch, but an error occurs as follows:
KeyError: 'model.layers.0.self_attn.qkv_proj.dequant_scale'

My command is: python -m vllm.entrypoints.api_server --model=/home/models/quant/smoothquant/Codellama-13b-int8-02/CodeLlama-13b-hf-smoothquant/ --tokenizer /home/models/CodeLlama-13b-hf/ --max-num-batched-tokens=70000 --block-size=16 --swap-space=20 --kv-cache-dtype=int8 --kv-quant-params-path=/home/models/quant/kv-cache-turbomind/
Have you encountered this error?

@ChristineSeven

ChristineSeven commented Dec 8, 2023

@HandH1998
I used two ways to quantize and score my fine-tuned LLaMA 70B model, without enabling KV-cache quant.

  1. model = AutoModelForCausalLM.from_pretrained('xxx', device_map='auto', load_in_8bit=True); this one got a score close to that of the unquantized model.
  2. Using the kv-quant-merged branch to compile, the score is bad.
    So why could these be so different? I used the original file "dataset/val.jsonl.zst" as the validation file to export the act scales and the model; could this be the problem?

@AniZpZ
Author

AniZpZ commented Dec 8, 2023

Method 1 is weight-only quantization. Please use our new branch (#1508) to test W8A8 inference.

@AniZpZ
Author

AniZpZ commented Dec 8, 2023

dequant_scale

This branch mixes KV-cache quant with W8A8 model quant. Please try KV-cache quantization with our new branch (#1507).

@ChristineSeven

With #1507, there is also this issue. I think that if KV-cache quant is not enabled, the result should not be worse.
I used two ways to quantize and score my fine-tuned LLaMA 70B model, without enabling KV-cache quant.

  1. model = AutoModelForCausalLM.from_pretrained('xxx', device_map='auto', load_in_8bit=True); this one got a score close to that of the unquantized model.
  2. Using the kv_quant branch to compile, the score is bad.
    So why could these be so different? I used the original file "dataset/val.jsonl.zst" as the validation file to export the act scales and the model; could this be the problem?

@warlock135

@AniZpZ
When I tried to calibrate, it results in an error:

Traceback (most recent call last):
File "/work/vllm_merge/vllm/kv_quant/calibrate.py", line 124, in
fire.Fire(calibrate)
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/work/vllm_merge/vllm/kv_quant/calibrate.py", line 115, in calibrate
calib_ctx.calibrate(all_data)
File "/work/vllm_merge/vllm/kv_quant/calibration.py", line 283, in calibrate
_ = model(data.to(self.device))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1068, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/work/vllm_merge/vllm/kv_quant/calibration.py", line 167, in _forward
key, value = out.pop(-1)
TypeError: cannot unpack non-iterable NoneType object

My calibration command:
python3 vllm/kv_quant/calibrate.py /work/llama2-7b-hf c4 128 2048 /work/act-scales

Additional info:
key, value = out.pop(-1)
out is a list [tensor, None]

@HandH1998
Contributor

HandH1998 commented Jan 4, 2024

@warlock135 , hi, you can try transformers==4.34.0. The latest transformers modified the forward code, which makes the calibration fail to get key and value.

@hikmet-demir

Any hope for 8-bit quantization to be merged any time soon? 4-bit models don't provide great results for complicated cases, unfortunately.

@HandH1998
Contributor

We hope this feature can be merged soon, too. But it doesn't seem to be getting a lot of attention from the main reviewers. If possible, we hope you can raise an issue. Thanks.

@xyfZzz

xyfZzz commented Jan 18, 2024

@warlock135 , hi, you can try transformers==4.34.0. The latest transformers modified the forward code, which makes the calibration fail to get key and value.

@HandH1998 Can the latest transformers version currently be used for quantization with the latest AutoSmoothQuant (https://github.com/AniZpZ/AutoSmoothQuant.git)?

@xyfZzz

xyfZzz commented Jan 18, 2024

@HandH1998 When I used transformers 4.36.2 to perform quantization in the AutoSmoothQuant project, the following error message appeared:
ImportError: cannot import name 'SiLUActivation' from 'transformers.activations'

@HandH1998
Contributor

You need to use transformers==4.34.0.

@wDevil

wDevil commented Feb 13, 2024

What do you think about this KV-cache quant technique? https://github.com/jy-yuan/KIVI

@AniZpZ
Author

AniZpZ commented Feb 19, 2024

Thank you. We will look into the technique.

@andakai

andakai commented Apr 1, 2024

Hi @AniZpZ, is there any progress on the lower-bit quantization?

@Opdoop

Opdoop commented Apr 14, 2024

@AniZpZ Can you provide a Docker image or environment requirements for both smoothquant and the vllm kv-quant-merge branch?

@AniZpZ
Author

AniZpZ commented Apr 15, 2024

Hi @AniZpZ, is there any progress on the lower-bit quantization?

We are working on lower bit quantization.

@AniZpZ
Author

AniZpZ commented Apr 15, 2024

@AniZpZ Can you provide a Docker image or environment requirements for both smoothquant and the vllm kv-quant-merge branch?

The environment requirements should be similar to the original vLLM.

@Opdoop

Opdoop commented Apr 15, 2024

@AniZpZ When I build smoothquant from source, it requires CUDA 11.8, and building the kv-quant-merge branch of vllm from source requires CUDA 12.0. Is that expected? Could you kindly provide a list of the CUDA/torch/transformers/ray dependencies for using smoothquant with your vllm branch? Thanks in advance.

@simon-mo
Collaborator

simon-mo commented Oct 1, 2024

Closing as we do have W8A8 and Int8 support nowadays. 🙏

@simon-mo simon-mo closed this Oct 1, 2024
@SherrySwift

Hi, is there any plan to support int4 KVCacheQuant?
