[Feature] Blazing fast W4A16 inference #202

lzhangzz · 2023-08-07T05:28:58Z

Preliminary benchmark of Llama-2-7B on an A100 GPU (context length = 1, generated tokens = 512)

throughput (token/s)

batch size	w4	w16
1	238.61	97.61
4	903.89	379.22
8	1662.58	735.44
16	2799.18	1374.98
32	3779.73	2408.16
64	4761.80	3878.41
128	5638.81	4918.97

Only 512 tokens (instead of 2048) are generated so that the benchmark focuses on the GEMM workload rather than on attention, which consists mostly of GEMVs.
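For a quick read of the table, the w4/w16 throughput ratio per batch size can be computed as in the sketch below. The numbers are copied from the table above; the names `THROUGHPUT` and `speedups` are illustrative only and not part of lmdeploy.

```python
# Throughput figures (token/s) copied from the benchmark table above.
# Keys are batch sizes; values are (w4, w16) throughput.
THROUGHPUT = {
    1: (238.61, 97.61),
    4: (903.89, 379.22),
    8: (1662.58, 735.44),
    16: (2799.18, 1374.98),
    32: (3779.73, 2408.16),
    64: (4761.80, 3878.41),
    128: (5638.81, 4918.97),
}


def speedups(table):
    """Return {batch_size: w4 throughput / w16 throughput}."""
    return {bs: w4 / w16 for bs, (w4, w16) in table.items()}


if __name__ == "__main__":
    for bs, ratio in speedups(THROUGHPUT).items():
        print(f"batch {bs:>3}: w4 is {ratio:.2f}x w16")
```

On these figures the ratio falls steadily as the batch size grows, from about 2.4x at batch size 1 to about 1.15x at batch size 128.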

docs/en/serving.md

lvhan028 · 2023-08-07T06:32:11Z

Please fix the lint errors.

lvhan028 · 2023-08-07T06:32:27Z

Consider adding a news entry to the README.

lvhan028 · 2023-08-07T07:31:40Z

@irexyc Please help check whether it works on Windows with a 3080 GPU.

lvhan028 · 2023-08-07T10:40:39Z

FileNotFoundError: [Errno 2] No such file or directory: '/nvme/shared_data/llama2/huggingface/llama-2-7b-chat/../llm-awq/llama-2-7b-chat-w4-g128-awq.pt'

The command is:

python lmdeploy/serve/turbomind/deploy.py llama2 /nvme/shared_data/llama2/huggingface/llama-2-7b-chat/ --model_format awq --group_size 128 --quant_path ../llm-awq/llama-2-7b-chat-w4-g128-awq.pt --dst-path workspace/llama-2-7b-chat-w4
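The path in the error message is what you get when the relative `--quant_path` is joined onto the model directory instead of being resolved against the current working directory. Whether `deploy.py` does exactly this is an assumption inferred from the message, not something confirmed by the source. A minimal sketch reproducing the path:

```python
import posixpath

# Values copied from the command above.
model_path = "/nvme/shared_data/llama2/huggingface/llama-2-7b-chat/"
quant_path = "../llm-awq/llama-2-7b-chat-w4-g128-awq.pt"

# Assumption: the relative quant path is joined onto the model directory.
joined = posixpath.join(model_path, quant_path)

# Collapsing the ".." shows which file would actually be opened.
resolved = posixpath.normpath(joined)

if __name__ == "__main__":
    print(joined)
    print(resolved)
```

Under that assumption the file is looked up next to the model directory, so passing an absolute path for `--quant_path` would avoid the ambiguity.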

lmdeploy/serve/turbomind/deploy.py

lvhan028 · 2023-08-07T11:15:50Z

When compiling turbomind, the following errors were raised:
ptxas /tmp/tmpxft_000163cd_00000000-9_gemm_s4_f16.compute_70.ptx, line 80871; error : Feature 'movmatrix' requires .target sm_75 or higher
ptxas /tmp/tmpxft_000163cd_00000000-9_gemm_s4_f16.compute_70.ptx, line 80902; error : Feature 'movmatrix' requires .target sm_75 or higher
ptxas /tmp/tmpxft_000163cd_00000000-9_gemm_s4_f16.compute_70.ptx, line 80927; error : Feature 'movmatrix' requires .target sm_75 or higher
ptxas /tmp/tmpxft_000163cd_00000000-9_gemm_s4_f16.compute_70.ptx, line 80948; error : Feature 'movmatrix' requires .target sm_75 or higher
ptxas /tmp/tmpxft_000163cd_00000000-9_gemm_s4_f16.compute_70.ptx, line 80969; error : Feature 'movmatrix' requires .target sm_75 or higher

lvhan028 · 2023-08-07T11:16:47Z

After adding -DSM=80 and -DCMAKE_CUDA_ARCHITECTURES=80, it builds successfully.
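The ptxas message states the rule directly: `movmatrix` needs a target of sm_75 or higher, so the `compute_70` PTX fails while an architecture of 80 passes. A small sketch of that check follows; the helper names are hypothetical, and suffixed targets such as `sm_90a` are deliberately not handled.

```python
import re

# Minimum target taken from the ptxas message:
# "Feature 'movmatrix' requires .target sm_75 or higher".
MOVMATRIX_MIN_SM = 75


def sm_version(target: str) -> int:
    """Extract the numeric version from a target such as 'sm_80' or 'compute_70'."""
    match = re.fullmatch(r"(?:sm|compute)_(\d+)", target)
    if match is None:
        raise ValueError(f"unrecognized target: {target!r}")
    return int(match.group(1))


def supports_movmatrix(target: str) -> bool:
    """True if the target meets the minimum version required for movmatrix."""
    return sm_version(target) >= MOVMATRIX_MIN_SM


if __name__ == "__main__":
    for target in ("compute_70", "sm_75", "sm_80"):
        print(target, supports_movmatrix(target))
```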

lvhan028 · 2023-08-07T11:17:17Z

When using turbomind/chat.py to chat with the model, the process aborts.

lvhan028 · 2023-08-07T11:18:27Z

When using turbomind/client.py to communicate via TIS (Triton Inference Server), it raised an error:
UNAVAILABLE: Not found: unable to load shared library: /opt/tritonserver/backends/turbomind/libtransformer-shared.so: undefined symbol: _ZN9turbomind9GemmS4F16D1Ev
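To see which definition the loader could not find, the mangled name can be decoded. The sketch below is a deliberately tiny decoder for simple Itanium C++ ABI nested names (`_ZN...E`) with an optional constructor or destructor tag. It is not a general demangler (a tool such as `c++filt` is the usual choice), and the function name `demangle_nested` is illustrative.

```python
import re


def demangle_nested(symbol: str) -> str:
    """Decode a simple `_ZN<len><name>...[C1|C2|D0|D1|D2]E<params>` symbol."""
    if not symbol.startswith("_ZN"):
        raise ValueError("only _ZN... nested names are handled")
    pos, parts = 3, []
    # Each component is a decimal length followed by that many characters.
    while symbol[pos].isdigit():
        digits = re.match(r"\d+", symbol[pos:]).group()
        pos += len(digits)
        parts.append(symbol[pos:pos + int(digits)])
        pos += int(digits)
    if not parts:
        raise ValueError("no name components found")
    # Constructor (C1/C2) and destructor (D0/D1/D2) tags reuse the class name.
    tag = symbol[pos:pos + 2]
    if tag in ("C1", "C2"):
        parts.append(parts[-1])
        pos += 2
    elif tag in ("D0", "D1", "D2"):
        parts.append("~" + parts[-1])
        pos += 2
    if symbol[pos:pos + 1] != "E":
        raise ValueError("unsupported mangled name")
    # "v" means an empty parameter list; anything else is left undecoded.
    params = "" if symbol[pos + 1:] == "v" else "..."
    return "::".join(parts) + f"({params})"


if __name__ == "__main__":
    print(demangle_nested("_ZN9turbomind9GemmS4F16D1Ev"))
```

For the symbol in the error this yields `turbomind::GemmS4F16::~GemmS4F16()`, i.e. the shared library references the `GemmS4F16` destructor but does not contain its definition.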

src/turbomind/models/llama/LlamaFfnLayer.cc

lmdeploy/serve/turbomind/deploy.py

src/turbomind/kernels/gemm_s_f16/metric.h

src/turbomind/kernels/gemm_s_f16/common.h

grimoire

LGTM

lzhangzz added 4 commits August 7, 2023 04:16

add w4a16

bf2f02b

fix deploy.py

370d9de

add doc

a6ad64a

add w4a16 kernels

dae0849

lvhan028 requested review from grimoire, pppppM and lvhan028 August 7, 2023 06:16

lvhan028 reviewed Aug 7, 2023

docs/en/serving.md (resolved)

lvhan028 requested a review from irexyc August 7, 2023 07:31

lvhan028 reviewed Aug 7, 2023

lmdeploy/serve/turbomind/deploy.py (resolved)

lvhan028 added the enhancement label Aug 7, 2023

lzhangzz added 3 commits August 7, 2023 11:27

fuse w1/w3 & bugfixes

0cc8cfe

fix typo

f9dcbfc

python

d699d58

lvhan028 reviewed Aug 7, 2023

src/turbomind/models/llama/LlamaFfnLayer.cc (resolved)

lvhan028 reviewed Aug 8, 2023

lmdeploy/serve/turbomind/deploy.py (resolved)

guard sm75/80 features

65486ac

lzhangzz mentioned this pull request Aug 8, 2023

[Bug] performance in A10 for LLAMA2 -7B is slow #210

Closed

2 tasks

irexyc reviewed Aug 9, 2023

src/turbomind/kernels/gemm_s_f16/metric.h (resolved)

lzhangzz added 2 commits August 9, 2023 04:01

add missing header

b2394f9

refactor

119c562

grimoire reviewed Aug 9, 2023

src/turbomind/kernels/gemm_s_f16/common.h (outdated, resolved)

lzhangzz added 4 commits August 9, 2023 14:13

qkvo bias

d0c196e

update cost model

0da4a94

fix lint

25a4e52

update deploy.py

266666f

lvhan028 approved these changes Aug 11, 2023

grimoire approved these changes Aug 14, 2023

irexyc approved these changes Aug 14, 2023

lvhan028 merged commit c3290ca into InternLM:main Aug 14, 2023

lvhan028 mentioned this pull request Aug 18, 2023

[Bug] Abnormal Triton server requests with the int4 model #266

Closed

2 tasks

lzhangzz mentioned this pull request Oct 9, 2023

[Feature] Can INT4 inference support SM70? #530

Closed

lvhan028 mentioned this pull request Nov 22, 2023

About the inference implementation of 4-bit quantization in lmdeploy #737

Closed

lzhangzz mentioned this pull request Mar 19, 2024

[Feature] Add new AWQ kernels #1301

Closed

zhyncs mentioned this pull request Apr 24, 2024

[Feature] TurboMind support W8A8 or FP8 KV Cache #1463

Closed

zhyncs mentioned this pull request Jul 9, 2024

[Benchmark] TurboMind benchmark with GLM-4-9B-Chat and Qwen2-72B-Instruct vs vLLM #1974

Closed

zhyncs mentioned this pull request Jul 18, 2024

64% faster awq sgl-project/sglang#229

Closed
