[QST] Scale factors and benchmarks #2
@jeromeku
Many thanks for the response! Do you have the script used to test against other methods? I'm especially interested in reproducing the results against QoQ. Also, I can't seem to find the …
You can reproduce the QQQ results following the …
@HandH1998
@brisker dynamic quantization …
@HandH1998
@brisker As QServe doesn't offer a w4a8f16 precision, we directly compare QQQ with QServe using w4a8kv4. On the other hand, QServe employs various techniques to mitigate the impact of kv4. According to their paper, …
@HandH1998
We think the speedup of QQQ w4a8g128 is limited by the high dtype-conversion overhead between FP16 and INT8, as shown in the following picture. QQQ focuses only on weight quantization, and we don't plan to develop a w4a8g128-kv8. Replacing kv-fp16 with kv8 can increase computing throughput at large batch sizes, but it is not effective at small batch sizes. If you want to try QQQ with a low-bit kv cache, we recommend our vllm PR, which provides an fp8 kv cache.
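For anyone who wants to try the fp8 kv cache route, here is a minimal sketch using vLLM's offline API. The model path is a placeholder and `kv_cache_dtype="fp8"` follows upstream vLLM, so the exact usage in the referenced PR may differ:

```python
# Minimal sketch: running a model with an fp8 kv cache through vLLM's offline API.
# The model path is a placeholder; kv_cache_dtype follows upstream vLLM and may
# differ from the referenced PR.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/your-quantized-model",  # placeholder checkpoint path
    kv_cache_dtype="fp8",                  # use an fp8 kv cache instead of fp16
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```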
Thank you for your advice!
@AniZpZ
@brisker We developed a new version based on this PR to support dynamic activation per-token quantization. We think the online activation quantization introduces additional overhead, resulting in slower inference than FP16 at smaller batch sizes. However, as the batch size increases, the scenario becomes compute-bound, and w8a8 is likely to outperform other quantization methods.
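For readers unfamiliar with the term, the following is a generic PyTorch sketch of what dynamic (online) per-token INT8 activation quantization computes on every forward pass; it is an illustration, not the fused kernel from the PR:

```python
# Generic sketch of dynamic (online) per-token INT8 activation quantization.
# The scales are computed from the live activations on every forward pass,
# which is the extra work that can make small batches slower than plain FP16.
import torch

def dynamic_per_token_int8(x: torch.Tensor):
    """x: [num_tokens, hidden] FP16 activations -> (int8 tensor, per-token scales)."""
    scale = x.abs().amax(dim=-1, keepdim=True).float().clamp(min=1e-8) / 127.0
    x_q = (x.float() / scale).round().clamp(-128, 127).to(torch.int8)
    return x_q, scale

x = torch.randn(16, 4096, dtype=torch.float16)   # 16 tokens, hidden size 4096
x_q, scale = dynamic_per_token_int8(x)
x_dq = x_q.float() * scale                        # dequantize to check the error
print((x.float() - x_dq).abs().max())
```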
@HandH1998
Yes.
But the w4a8 inference time is nearly double that of fp16. Is there any bug in this repo? (The NaN loss during w4a8 quantization is also weird.)
@brisker
(QQQ) root@train-nndf-vllm-2-0:/data1/QQQ-main# python examples/quant_model.py --model_path /dataset/LM-public/LLM/Llama-2-7b --tokenizer_path /dataset/LM-public/LLM/Llama-2-7b --batch_size 8 --dtype float16 --quant_config quant_config/llama/w4a8.yaml --save_path ./debug
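One generic way to narrow down the NaN loss reported above is to scan the saved checkpoint for tensors containing NaN/Inf. This is a general PyTorch check, not a QQQ utility, and the file name under ./debug is an assumption:

```python
# Generic sanity check (not part of QQQ): scan a saved checkpoint for NaN/Inf tensors.
import torch

# Hypothetical file name under the --save_path directory used above.
state_dict = torch.load("./debug/pytorch_model.bin", map_location="cpu")
for name, tensor in state_dict.items():
    if torch.is_tensor(tensor) and tensor.is_floating_point():
        if torch.isnan(tensor).any() or torch.isinf(tensor).any():
            print(f"{name}: contains NaN/Inf")
```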
@HandH1998
Currently, I cannot access github.com to download files online, so I tried to build cutlass myself. After building cutlass from source successfully and running the vllm install, the same error occurs. Any advice on this? Thanks in advance.
@brisker I have never encountered this problem...
@HandH1998
The cutlass version can be found at …
@HandH1998 (the following speed results are summarized directly by vLLM on the command line)
Great paper and thanks for open sourcing the code.
A couple questions:
1. Are there scripts for reproducing the benchmarks for the kernels (GEMM, FastFP16toInt8)?
2. For the group-wise W4A8 kernel, why is there a need for an additional channel-wise scale factor in FusedDequantQuant? I.e., the Int4 weights are dequantized to FP16 using group-wise scale factors, then quantized to Int8 using an additional channel-wise scale, and then fed to the Int8 GEMM. In contrast, in the channel-wise W4A8 kernel, the Int4 weights are directly converted to Int8 and then fed to the Int8 GEMM.
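To make question 2 concrete, here is a small PyTorch sketch of the two weight paths it describes; it is an illustration with made-up shapes and symmetric quantization, not the actual QQQ CUDA kernels:

```python
# Illustration of the two W4A8 weight paths in question 2 (not the real kernels).
import torch

K, N, group_size = 256, 128, 64
w_fp16 = torch.randn(K, N, dtype=torch.float16)

# Group-wise path: Int4 weights carry one scale per (group, output channel).
w_groups = w_fp16.float().reshape(K // group_size, group_size, N)
s_group = w_groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0  # Int4 range [-8, 7]
w_int4 = (w_groups / s_group).round().clamp(-8, 7)

# At runtime the kernel dequantizes Int4 -> FP16 with the group-wise scales ...
w_dq_fp16 = (w_int4 * s_group).reshape(K, N)

# ... and then re-quantizes FP16 -> Int8 with an extra channel-wise scale so the
# weights can enter the Int8 GEMM. This is the additional scale factor asked about.
s_channel = w_dq_fp16.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127.0
w_int8_grouped = (w_dq_fp16 / s_channel).round().clamp(-128, 127).to(torch.int8)

# Channel-wise path: a single per-channel scale covers both the Int4 storage and
# the Int8 GEMM, so the Int4 values are simply widened to Int8 directly, with no
# FP16 round trip and no second scale.
s_channel_only = w_fp16.float().abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 7.0
w_int4_channel = (w_fp16.float() / s_channel_only).round().clamp(-8, 7)
w_int8_channel = w_int4_channel.to(torch.int8)   # direct Int4 -> Int8 conversion
```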