
Support int8 KVCacheQuant and W8A8 inference in vllm #1112

Closed
wants to merge 52 commits

Conversation


@AniZpZ AniZpZ commented Sep 20, 2023

We have recently implemented and tested int8 KV-Cache quantization and W8A8 inference in vLLM. We found that our quantization implementation can increase throughput by over 20% and reduce first-token latency under heavy load. In contrast, the W4A16 quant methods (e.g., the AWQ-based method) provided in vLLM cannot improve throughput, according to PR #1032, because they cannot benefit from the int8 tensor cores. So we propose this PR as an alternative quantization method.

Updates!!!
We have made some more progress in #1112 (comment)

More Updates!!!
If you want to properly evaluate the MMLU dataset with vLLM, the sampler must be modified slightly. The code can be found in our mmlu_eval branch.

Important message!!!
We split the PR into two parts for easier review and use. The W8A8 inference part is in #1508 and the KV-cache quant part is in #1507.

What we have right now:

  1. int8 KV-Cache quantization related work:
    a. Quant/Dequant helper functions adapted from FasterTransformer
    b. Quantized versions of the CUDA kernels
    c. Unit tests for the added kernels
  2. W8A8 inference related work:
    a. Int8 GEMM kernels adapted from torch-int
    b. W8A8 linear layer modules
    c. Support for W8A8 inference on the LLaMA model
  3. Test results based on our own dataset

What we plan to do:

  • 1. Further kernel fusion
  • 2. Code refactoring and cleaning
  • 3. Optimize the int8 GEMM kernel
  • 4. Release SmoothQuant for LLaMA
  • 5. Add code for generating KV-Cache quantization parameters (scales and zero points)
  • 6. Experiments on more datasets

How to test throughput
A. How to enable W8A8 inference
0. Install CUTLASS, because we currently use the CUTLASS GEMM kernel. We plan to replace it with a cuBLAS GEMM kernel soon.
Update: we now support the cuBLAS GEMM kernel, so you can remove the CUTLASS GEMM kernel in setup.py.

  1. Install smoothquant and torch-int for LLaMA. Use "examples/generate_act_scales.py" to generate the activation scales, and then use "examples/export_int8_llama.py" to export the int8 model. Please remember to check and change the 'architectures' field in the model's config.json from 'Int8LlamaForCausalLM' to 'LlamaForCausalLM' (a sketch of this edit follows after this list).
  2. update vllm and execute
python ./vllm/entrypoints/api_server.py --model=/path/to/quantized/model --tokenizer=/path/to/tokenizer --max-num-batched-tokens=70000 --block-size=16 --swap-space=20 --quantization smoothquant
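
For step 1, the config.json edit can also be scripted; a minimal sketch (the path below is a placeholder):

import json
from pathlib import Path

cfg_path = Path("/path/to/quantized/model/config.json")  # placeholder path
cfg = json.loads(cfg_path.read_text())

# The exporter writes 'Int8LlamaForCausalLM'; vLLM expects the stock class name.
if cfg.get("architectures") == ["Int8LlamaForCausalLM"]:
    cfg["architectures"] = ["LlamaForCausalLM"]
    cfg_path.write_text(json.dumps(cfg, indent=2))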

B. how to enable kv cache quant

  1. Use vllm/kv_quant/calibrate.py to generate scales and vllm/kv_quant/export_kv_params.py to export the KV-cache quantization parameters (a sketch of the scale/zero-point math follows after this list).
  2. Execute
python ./vllm/entrypoints/api_server.py --model=/path/to/quantized/model --tokenizer=/path/to/tokenizer --max-num-batched-tokens=70000 --block-size=16 --swap-space=20 --kv-cache-dtype=int8 --kv-quant-params-path=/path/to/kv_params_dir
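
For intuition, a minimal sketch of how per-tensor int8 scales and zero points can be derived from calibration statistics. This is illustrative only, not the exact logic of calibrate.py, and the example min/max values are made up:

import torch

def int8_asym_params(stats_min: torch.Tensor, stats_max: torch.Tensor):
    # Asymmetric int8 quantization: map the observed [min, max] range onto [-128, 127].
    qmin, qmax = -128, 127
    scale = (stats_max - stats_min).clamp(min=1e-8) / (qmax - qmin)
    zero_point = qmin - torch.round(stats_min / scale)
    return scale, zero_point

# Example: min/max of key activations collected during calibration.
k_min, k_max = torch.tensor(-3.2), torch.tensor(4.1)
k_scale, k_zp = int8_asym_params(k_min, k_max)
# quantized = clamp(round(x / k_scale) + k_zp, -128, 127)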

KV-cache quantization and W8A8 inference can also be used together.

Experiment Results
Current test results on our datasets on an A100 80G (updated with quant & RMSNorm fusion and a GEMM D2H bug fix)

Throughput of FP16 LLaMA-13B:

Throughput:  4.9945 requests/s, 543.0660 token/s
Average latency: 31.7689 s

Throughput of Int8 LLaMA-13B with int8 KVCacheQuant:

Throughput: 6.1147 requests/s, 664.8646 token/s, 
Average latency: 27.4222 s

Throughput of Int8 LLaMA-13B with int8 KVCacheQuant, using cublas gemm kernel:

Throughput: 6.4723 requests/s, 703.7514 token/s, 
Average latency: 25.9912 s

How to evaluate model performance
We add an evaluation method for quantized models; currently the MMLU dataset is supported.
You can find details in benchmarks/benchmark_evaluation.py

python benchmark_evaluation.py --model=/path/to/quantized/model --tokenizer=/path/to/tokenizer --dev-data-path=/path/to/mmlu/dev/ --test-data-path=/path/to/mmlu/test/ --kv-cache-dtype=int8 --kv-quant-params-path=/path/to/kv_params_dir --quantization=smoothquant

Updates
We have released SmoothQuant for LLaMA in
https://github.com/AniZpZ/smoothquant/tree/llama-dev
https://github.com/AniZpZ/torch-int/tree/llama-dev

The code for generating KV-Cache quantization parameters is ready; check the vllm/kv_quant folder.

We replaced the int8 GEMM with the cuBLAS version, and the throughput improvement is now around 30%.

@casper-hansen
Contributor

casper-hansen commented Sep 20, 2023

This is interesting work! I was going to implement int8 in AutoAWQ over time, as the authors of SmoothQuant (this PR) and AWQ are the same. My best guess is that single_query_cached_kv_attention_quantized_kernel is doing the heavy lifting for throughput here, as it comes from FasterTransformer, which is well optimized.

@AniZpZ AniZpZ changed the title [Enhancement] Support int8 KVCacheQuant and W8A8 inference in vllm [WIP] Support int8 KVCacheQuant and W8A8 inference in vllm Sep 21, 2023
@viktor-ferenczi
Contributor

viktor-ferenczi commented Sep 21, 2023

I fully support this, since the 4-bit AWQ model proved to have inferior quality for my use cases. Having 8-bit weights with an 8-bit activation cache would be the best of both worlds, allowing for almost no loss of quality (perplexity) while running inference more efficiently. I would also keep a W8A16 mode as an option, in case the precision of the activations and the KV cache makes a difference in specific use cases.

@viktor-ferenczi viktor-ferenczi mentioned this pull request Sep 21, 2023
@zhyncs
Contributor

zhyncs commented Sep 21, 2023

Hi vLLM genius @WoosukKwon @zhuohan123

This is the latest development from our team regarding quantization support for vLLM; we had done something similar to #1032 before. At that time, we didn't open a PR after the benchmark results showed a drop in throughput, but we later found that #1032 was merged, which is very encouraging. Therefore, we are continuing performance optimization on this basis and sending out the PR in WIP state in advance, hoping to get some comments and suggestions and eventually merge it into the vLLM codebase smoothly. Cheers!

@casper-hansen
Contributor

@AniZpZ @zhyncs This is great work! My understanding is that this PR converts FP16 -> INT8 dynamically without computing a loss function to optimize perplexity. Have you evaluated perplexity on this approach?

@AniZpZ
Author

AniZpZ commented Sep 21, 2023

We implement quantization with the SmoothQuant method for W8A8; I will release the code later. The perplexity is identical to a standard SmoothQuant method if you do W8A8 inference without int8 KVCacheQuant.

Quantization details are discussed in this paper (Xiao et al.).
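
For readers unfamiliar with the method, a minimal sketch of the SmoothQuant idea from the paper: a per-input-channel smoothing factor migrates activation outliers into the weights before int8 quantization. This is illustrative only, not the code in this PR:

import torch

def smooth_linear(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    # act_absmax: per-input-channel max |activation| collected on calibration data.
    # weight: [out_features, in_features] FP16/FP32 weight of a linear layer.
    w_absmax = weight.abs().amax(dim=0)
    s = act_absmax.pow(alpha) / w_absmax.pow(1 - alpha).clamp(min=1e-8)
    s = s.clamp(min=1e-8)
    # At runtime activations are divided by s, and the weight is multiplied by s offline,
    # so X @ W.T is unchanged while activation outliers shrink before int8 quantization.
    return s, weight * s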

@casper-hansen
Contributor

SmoothQuant only supports OPT models. How can we test this PR when the SmoothQuant repository does not support LLaMa models? If you implement this PR without the quantization code, you will inevitably end up with a bad perplexity if you naively use W8A8 as you have no calibration dataset.

See this image: accuracy ends up being worse than INT4 if you naively convert weights to W8A8. You need the SmoothQuant or AWQ method to convert if you want to preserve accuracy. You need a framework for this, which is why I created AutoAWQ. I will look to implement INT8 quantization using the torch-int modules and would love your help with this so we can support all models in vLLM (LLaMa, MPT, Falcon, etc.) without accuracy degradation.

[image: accuracy comparison showing naive W8A8 ending up worse than INT4]

@AniZpZ
Author

AniZpZ commented Sep 21, 2023

We implemented SmoothQuant for LLaMA ourselves; you can find the code here: https://github.com/AniZpZ/smoothquant/tree/llama-dev and easily quantize and export a model with export_int8_llama.py.
It should work with https://github.com/AniZpZ/torch-int/tree/llama-dev

@casper-hansen
Contributor

Hi @AniZpZ @zhyncs, thank you for your great work with this PR.

I have now had more time to explore your fast implementation and found that NVIDIA only has high-throughput support for INT8, which lets this PR achieve higher throughput than INT4 due to software capabilities.

Is your proposal to run W8A16? Your code does not have A8 implemented in the llama.py model definition.

SmoothQuant implements W8A8, but it seems silly to run A8 as there should be little benefit speed-wise. Therefore, I see this as a natural choice. I want to confirm this with you for my implementation in AutoAWQ as I want to push INT8 models out using your initial LLaMa implementation, just using the AWQ method for minimum perplexity loss.

@AniZpZ
Author

AniZpZ commented Sep 23, 2023

Our proposal is to run W8A8. If you enable smoothquant, we replace RMSNorm and the linear layers with our custom int8 RMSNorm and W8A8 linear modules, which quantize the activations and implement int8 GEMM. You can find the details in w8a8linear.py; a rough sketch of the math is shown below.
If you want to enable the tensor cores for int8 computation, both the weights and the activations must be int8.
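
To illustrate the idea only (not the actual w8a8linear.py implementation), a minimal numerical sketch of a W8A8 linear: per-tensor int8 quantization of the activations, an integer matmul with int32 accumulation, and a dequant by the product of the two scales. The real kernel runs the matmul in int8 on tensor cores; here the integer product is simulated in float for portability.

import torch

def quantize_per_tensor_int8(x: torch.Tensor):
    scale = x.abs().amax().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, q_weight: torch.Tensor, w_scale: torch.Tensor):
    # x: FP16/FP32 activations [tokens, in]; q_weight: int8 [out, in]; w_scale: per-tensor weight scale.
    q_x, x_scale = quantize_per_tensor_int8(x)
    acc = q_x.float() @ q_weight.float().t()  # stands in for the int8 GEMM
    return acc * (x_scale * w_scale)          # dequantize back to floating point

# Usage: quantize a weight offline, then run the int8 path on new activations.
w = torch.randn(16, 32)
q_w, w_s = quantize_per_tensor_int8(w)
y = w8a8_linear(torch.randn(4, 32), q_w, w_s)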

@ChristineSeven

@HandH1998
When compiling vLLM from the kv_quant branch, another issue:

Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF /app/vllm/build/temp.linux-x86_64-3.8/csrc/cache.o.d -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c -c /app/vllm/csrc/cache.cpp -o /app/vllm/build/temp.linux-x86_64-3.8/csrc/cache.o -g -O2 -std=c++17 -D_GLIBCXX_USE_CXX11_ABI=0 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' '-DPYBIND11_STDLIB="libstdcpp"' '-DPYBIND11_BUILD_ABI="cxxabi1011"' -DTORCH_EXTENSION_NAME=cache_ops -D_GLIBCXX_USE_CXX11_ABI=0
[2/2] /usr/local/cuda/bin/nvcc -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c -c /app/vllm/csrc/cache_kernels.cu -o /app/vllm/build/temp.linux-x86_64-3.8/csrc/cache_kernels.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -std=c++17 -D_GLIBCXX_USE_CXX11_ABI=0 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 --threads 8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="gcc"' '-DPYBIND11_STDLIB="libstdcpp"' '-DPYBIND11_BUILD_ABI="cxxabi1011"' -DTORCH_EXTENSION_NAME=cache_ops -D_GLIBCXX_USE_CXX11_ABI=0
FAILED: /app/vllm/build/temp.linux-x86_64-3.8/csrc/cache_kernels.o
/usr/local/cuda/bin/nvcc -I/usr/local/lib/python3.8/dist-packages/torch/include -I/usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.8/dist-packages/torch/include/TH -I/usr/local/lib/python3.8/dist-packages/torch/include/THC -I/usr/local/cuda/include -I/usr/include/python3.8 -c -c /app/vllm/csrc/cache_kernels.cu -o /app/vllm/build/temp.linux-x86_64-3.8/csrc/cache_kernels.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -O2 -std=c++17 -D_GLIBCXX_USE_CXX11_ABI=0 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 --threads 8 -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=cache_ops -D_GLIBCXX_USE_CXX11_ABI=0
/app/vllm/csrc/cache_kernels.cu(44): warning #1143-D: arithmetic on pointer to void or function type

/app/vllm/csrc/cache_kernels.cu(45): warning #1143-D: arithmetic on pointer to void or function type

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/c10/core/TensorImpl.h(77): here

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/ATen/core/qualified_name.h(73): here

/app/vllm/csrc/cache_kernels.cu(399): warning #550-D: variable "src_key_indices" was set but never used

/app/vllm/csrc/cache_kernels.cu(400): warning #550-D: variable "src_value_indices" was set but never used

/app/vllm/csrc/quant_utils.cuh(217): error: identifier "__float22bfloat162_rn" is undefined

/app/vllm/csrc/cache_kernels.cu(44): warning #1143-D: arithmetic on pointer to void or function type

/app/vllm/csrc/cache_kernels.cu(45): warning #1143-D: arithmetic on pointer to void or function type

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/c10/core/TensorImpl.h(77): here

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/ATen/core/qualified_name.h(73): here

/app/vllm/csrc/cache_kernels.cu(399): warning #550-D: variable "src_key_indices" was set but never used

/app/vllm/csrc/cache_kernels.cu(400): warning #550-D: variable "src_value_indices" was set but never used

1 error detected in the compilation of "/app/vllm/csrc/cache_kernels.cu".
/app/vllm/csrc/quant_utils.cuh(217): error: identifier "__float22bfloat162_rn" is undefined

/app/vllm/csrc/cache_kernels.cu(44): warning #1143-D: arithmetic on pointer to void or function type

/app/vllm/csrc/cache_kernels.cu(45): warning #1143-D: arithmetic on pointer to void or function type

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/c10/core/TensorImpl.h(77): here

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/ATen/core/qualified_name.h(73): here

/app/vllm/csrc/cache_kernels.cu(399): warning #550-D: variable "src_key_indices" was set but never used

/app/vllm/csrc/cache_kernels.cu(400): warning #550-D: variable "src_value_indices" was set but never used

1 error detected in the compilation of "/app/vllm/csrc/cache_kernels.cu".
/app/vllm/csrc/cache_kernels.cu(44): warning #1143-D: arithmetic on pointer to void or function type

/app/vllm/csrc/cache_kernels.cu(45): warning #1143-D: arithmetic on pointer to void or function type

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/c10/core/TensorImpl.h(77): here

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/ATen/core/qualified_name.h(73): here

/app/vllm/csrc/cache_kernels.cu(399): warning #550-D: variable "src_key_indices" was set but never used

/app/vllm/csrc/cache_kernels.cu(400): warning #550-D: variable "src_value_indices" was set but never used

/app/vllm/csrc/cache_kernels.cu(44): warning #1143-D: arithmetic on pointer to void or function type

/app/vllm/csrc/cache_kernels.cu(45): warning #1143-D: arithmetic on pointer to void or function type

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=size_t, one_sided=false, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/c10/core/TensorImpl.h(77): here

/usr/local/lib/python3.8/dist-packages/torch/include/c10/util/irange.h(54): warning #186-D: pointless comparison of unsigned integer with zero
detected during:
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator==(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
(61): here
instantiation of "__nv_bool c10::detail::integer_iterator<I, one_sided, >::operator!=(const c10::detail::integer_iterator<I, one_sided, > &) const [with I=std::size_t, one_sided=true, =0]"
/usr/local/lib/python3.8/dist-packages/torch/include/ATen/core/qualified_name.h(73): here

/app/vllm/csrc/cache_kernels.cu(399): warning #550-D: variable "src_key_indices" was set but never used

/app/vllm/csrc/cache_kernels.cu(400): warning #550-D: variable "src_value_indices" was set but never used

ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "setup.py", line 257, in
setuptools.setup(
File "/usr/local/lib/python3.8/dist-packages/setuptools/init.py", line 153, in setup
return distutils.core.setup(**attrs)
File "/usr/lib/python3.8/distutils/core.py", line 148, in setup
dist.run_commands()
File "/usr/lib/python3.8/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/install.py", line 74, in run
self.do_egg_install()
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/install.py", line 116, in do_egg_install
self.run_command('bdist_egg')
File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/bdist_egg.py", line 164, in run
cmd = self.call_command('install_lib', warn_dir=0)
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/bdist_egg.py", line 150, in call_command
self.run_command(cmdname)
File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/install_lib.py", line 11, in run
self.build()
File "/usr/lib/python3.8/distutils/command/install_lib.py", line 109, in build
self.run_command('build_ext')
File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/build_ext.py", line 79, in run
_build_ext.run(self)
File "/usr/local/lib/python3.8/dist-packages/Cython/Distutils/old_build_ext.py", line 186, in run
_build_ext.build_ext.run(self)
File "/usr/lib/python3.8/distutils/command/build_ext.py", line 340, in run
self.build_extensions()
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
build_ext.build_extensions(self)
File "/usr/local/lib/python3.8/dist-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
_build_ext.build_ext.build_extensions(self)
File "/usr/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
self._build_extensions_serial()
File "/usr/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
self.build_extension(ext)
File "/usr/local/lib/python3.8/dist-packages/setuptools/command/build_ext.py", line 202, in build_extension
_build_ext.build_extension(self, ext)
File "/usr/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
objects = self.compiler.compile(sources,
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 658, in unix_wrap_ninja_compile
_write_ninja_file_and_compile_objects(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1574, in _write_ninja_file_and_compile_objects
_run_ninja_build(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

@AniZpZ
Author

AniZpZ commented Nov 27, 2023

(quoting the build log above)

__float22bfloat162_rn is a CUDA function; could you please check your CUDA version? The code has been tested with CUDA 11.8.

@ChristineSeven

ChristineSeven commented Nov 27, 2023

@AniZpZ
Exactly CUDA 11.8, and it compiled successfully on the other branch, w8a8.
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

@HandH1998
Contributor

Could you please check your environment variable TORCH_CUDA_ARCH_LIST? The CUDA arch should be >= 8.0.

@AniZpZ
Author

AniZpZ commented Dec 1, 2023

It looks like this problem occurs when the GPU arch does not match the requirement. Please check your GPU arch; a quick way to check is sketched below.
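
For example, a generic way to check the compute capability of the visible GPUs (per the comment above, anything below 8.0 lacks the bf16 intrinsics such as __float22bfloat162_rn). This is just an illustration, not part of the PR:

import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}, compute capability {major}.{minor}")
    if (major, minor) < (8, 0):
        print("  -> below sm_80, so this branch will not build for this GPU")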

@ChristineSeven

@AniZpZ @HandH1998
['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
Here is the CUDA arch support list. So is there a problem? Thanks!

@HandH1998
Contributor

export TORCH_CUDA_ARCH_LIST="8.0;8.6"

@shatealaboxiaowang

shatealaboxiaowang commented Dec 8, 2023

(quoting the full PR description above)

Which branch of vllm does KV-cache int8 run on?
(1) I ran it on kv_quant, but an error occurs as follows:
ImportError: /home/vllm/vllm-kv_quant/vllm/cuda_utils.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZNSt15__exception_ptr13exception_ptr10_M_releaseEv
(2) I ran it on the kv-quant-merge branch, but an error occurs as follows:
KeyError: 'model.layers.0.self_attn.qkv_proj.dequant_scale'

My command is: python -m vllm.entrypoints.api_server --model=/home/models/quant/smoothquant/Codellama-13b-int8-02/CodeLlama-13b-hf-smoothquant/ --tokenizer /home/models/CodeLlama-13b-hf/ --max-num-batched-tokens=70000 --block-size=16 --swap-space=20 --kv-cache-dtype=int8 --kv-quant-params-path=/home/models/quant/kv-cache-turbomind/
Have you encountered this error?

@ChristineSeven

ChristineSeven commented Dec 8, 2023

@HandH1998
I used two ways to quantize and score my fine-tuned LLaMA 70B model, without enabling KV-cache quant.

  1. model = AutoModelForCausalLM.from_pretrained('xxx', device_map='auto', load_in_8bit=True); this one got a score close to that of the unquantized model.
  2. Using the kv-quant-merged branch to compile, the score is bad.
    So why could these be so different? I used the original file "dataset/val.jsonl.zst" as the validation file to export the act scales and the model; could this be the problem?

@AniZpZ
Author

AniZpZ commented Dec 8, 2023

Method 1 is weight-only quantization. Please use our new branch (#1508) to test W8A8 inference.

@AniZpZ
Author

AniZpZ commented Dec 8, 2023

dequant_scale

This branch mixes KV-cache quant with W8A8 model quant. Please try KV-cache quantization with our new branch (#1507).

@ChristineSeven

With #1507, there is also this issue. I think that if KV-cache quant is not enabled, the result should not be worse.
I used two ways to quantize and score my fine-tuned LLaMA 70B model, without enabling KV-cache quant.

  1. model = AutoModelForCausalLM.from_pretrained('xxx', device_map='auto', load_in_8bit=True); this one got a score close to that of the unquantized model.
  2. Using the kv_quant branch to compile, the score is bad.
    So why could these be so different? I used the original file "dataset/val.jsonl.zst" as the validation file to export the act scales and the model; could this be the problem?

@warlock135

@AniZpZ
When I tried to calibrate, it results in an error:

Traceback (most recent call last):
File "/work/vllm_merge/vllm/kv_quant/calibrate.py", line 124, in
fire.Fire(calibrate)
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/work/vllm_merge/vllm/kv_quant/calibrate.py", line 115, in calibrate
calib_ctx.calibrate(all_data)
File "/work/vllm_merge/vllm/kv_quant/calibration.py", line 283, in calibrate
_ = model(data.to(self.device))
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1068, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/work/vllm_merge/vllm/kv_quant/calibration.py", line 167, in _forward
key, value = out.pop(-1)
TypeError: cannot unpack non-iterable NoneType object

My calibration command:
python3 vllm/kv_quant/calibrate.py /work/llama2-7b-hf c4 128 2048 /work/act-scales

Additional info:
key, value = out.pop(-1)
out is a list [tensor, None]

@HandH1998
Contributor

HandH1998 commented Jan 4, 2024

@warlock135 , hi, you can try transformers==4.34.0. The latest transformers modified the forward code, which makes the calibration fail to get key and value.

@hikmet-demir

Any hope for 8-bit quantization to be merged any time soon? 4-bit models don't provide great results for complicated cases, unfortunately.

@HandH1998
Contributor

We hope this feature can be merged soon, too. But it doesn't seem to be getting a lot of attention from the main reviewers. If possible, we hope you can raise an issue. Thanks.

@xyfZzz

xyfZzz commented Jan 18, 2024

@warlock135 , hi, you can try transformers==4.34.0. The latest transformers modified the forward code, which makes the calibration fail to get key and value.

@HandH1998 Can the latest transformers version currently be used for quantization with the latest AutoSmoothQuant (https://github.com/AniZpZ/AutoSmoothQuant.git)?

@xyfZzz

xyfZzz commented Jan 18, 2024

@HandH1998 When I used transformers 4.36.2 to perform quantization in the AutoSmoothQuant project, the following error message appeared:
ImportError: cannot import name 'SiLUActivation' from 'transformers.activations'

@HandH1998
Contributor

You need to use transformers==4.34.0.

@wDevil

wDevil commented Feb 13, 2024

What do you think about this KV-cache quant technique? https://github.com/jy-yuan/KIVI

@AniZpZ
Author

AniZpZ commented Feb 19, 2024

Thank you. We will look into the technique.

@andakai

andakai commented Apr 1, 2024

Hi @AniZpZ, is there any progress on the lower-bit quantization?

@Opdoop

Opdoop commented Apr 14, 2024

@AniZpZ Can you provide a Docker image or environment requirements for both smoothquant and the vllm kv-quant-merge branch?

@AniZpZ
Author

AniZpZ commented Apr 15, 2024

Hi @AniZpZ, is there any progress on the lower-bit quantization?

We are working on lower bit quantization.

@AniZpZ
Author

AniZpZ commented Apr 15, 2024

@AniZpZ Can you provide a Docker image or environment requirements for both smoothquant and the vllm kv-quant-merge branch?

The environment requirements should be similar to the original vLLM.

@Opdoop

Opdoop commented Apr 15, 2024

@AniZpZ When I build smoothquant from source, it requires CUDA 11.8, and building the kv-quant-merge branch of vllm from source requires CUDA 12.0. Is that expected? Could you kindly provide a list of the CUDA/torch/transformers/ray dependencies for using smoothquant with your vllm branch? Thanks in advance.

@simon-mo
Collaborator

simon-mo commented Oct 1, 2024

Closing as we do have W8A8 and Int8 support nowadays. 🙏

@simon-mo simon-mo closed this Oct 1, 2024
@SherrySwift

Hi, is there any plan to support int4 KVCacheQuant?
