[WIP] AWQ Faster Kernels #3289

Draft · wants to merge 15 commits into main

Conversation

@casper-hansen (Contributor) commented Mar 8, 2024

New AWQ kernels have been introduced by the AWQ authors:

  • new weight packing format
  • uses semaphores during execution
  • uses a mix of GEMV and GEMM kernels for optimal speed (see the dispatch sketch after this list)
  • decoding speed scales much better
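
For intuition, here is a minimal sketch of the GEMV/GEMM dispatch idea, assuming a simple token-count threshold. `choose_awq_kernel` and the threshold of 8 are illustrative placeholders, not the actual kernel entry points or heuristics:

```python
import torch

def choose_awq_kernel(x: torch.Tensor, gemv_token_limit: int = 8) -> str:
    # Collapse any leading batch/sequence dims into a single token count.
    num_tokens = x.numel() // x.shape[-1]
    # Few tokens (decode) -> memory-bound, a GEMV-style kernel wins.
    # Many tokens (prefill) -> compute-bound, a GEMM-style kernel wins.
    return "gemv" if num_tokens <= gemv_token_limit else "gemm"

print(choose_awq_kernel(torch.zeros(1, 4096)))    # single decode step -> "gemv"
print(choose_awq_kernel(torch.zeros(100, 4096)))  # 100-token prefill  -> "gemm"
```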

Testing Model: casperhansen/mistral-instruct-v0.2-gemvfast-awq

This PR is currently a draft:

  • Include new kernels and build them
  • Implement new weight loading for packed + interleaved weights (see the packing sketch after this list).
  • Implement forward pass
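
As a rough illustration of what "packed + interleaved" means, the sketch below packs 4-bit values into int16 along the input dimension and interleaves groups of output rows. The pack factor of 4 and interleave factor of 4 are assumptions for illustration, not necessarily the exact GEMVFast layout:

```python
import torch

def pack_and_interleave(w_int4: torch.Tensor, pack_factor: int = 4, interleave: int = 4):
    # w_int4: (out_features, in_features) tensor of values in [0, 16).
    out_features, in_features = w_int4.shape
    # Pack `pack_factor` 4-bit values along the input dim into one int16.
    w = w_int4.reshape(out_features, in_features // pack_factor, pack_factor)
    shifts = torch.arange(pack_factor, dtype=torch.int32) * 4
    packed = (w.to(torch.int32) << shifts).sum(dim=-1).to(torch.int16)
    # Interleave groups of `interleave` output rows into the input dim.
    packed = packed.reshape(out_features // interleave, interleave, -1)
    packed = packed.transpose(1, 2).reshape(out_features // interleave, -1)
    # Result: (out_features // interleave, in_features // pack_factor * interleave)
    return packed

print(pack_and_interleave(torch.randint(0, 16, (8, 16))).shape)  # torch.Size([2, 16])
```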

Benchmark (1x A100)

Planning some benchmarks:

python benchmarks/benchmark_throughput.py --input-len 100 --output-len 1000 --quantization awq --num-prompts 100 --max-model-len 1100 --dtype half --model casperhansen/mistral-7b-instruct-v0.1-awq

@casper-hansen casper-hansen marked this pull request as draft March 8, 2024 23:11
@casper-hansen (Contributor, Author)

@WoosukKwon I have used the same shapes as in the original implementation, yet the weights do not load in vLLM, and I am unsure how to fix it. If I add interleaving to the packed shards, nothing changes because the interleaving and pack factor cancel each other out. See WQLinear_GEMVFast in AutoAWQ for reference.

How should we proceed to implement weight loading for this new format?

@shiqingzhangCSU

Hello, is there any progress?

@casper-hansen (Contributor, Author)

@shiqingzhangCSU Currently there is no progress. If you have suggestions or fixes, please open a PR against my fork. I am hoping to get this feature into vLLM soon, but the weight loading is a blocker.

@itsuncheng

@casper-hansen Hi, I'm hitting the same issue. To unblock myself, would you mind sharing which previous version of AutoAWQ works with vLLM?

qweight, {
"input_dim": 1,
"output_dim": 0,
"packed_dim": 1,

Changing this to "packed_dim": 0 allows the weights to load.

@robertgshaw2-neuralmagic (Collaborator) commented Mar 30, 2024

I have identified the source of the issue.

There is faulty logic in MergedColumnParallelLinear and QKVParallelLinear for the case where output_dim=1 AND packed_dim=1. awq_gemv_fast is the first quantization kernel with this case.

I am working on a fix that avoids breaking GPTQ.
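
To make the failure mode concrete, here is a simplified sketch of merged-column shard loading (not the exact vLLM code; `load_merged_shard` and its arguments are illustrative). The point is that shard sizes and offsets along the output dimension must be rescaled by `pack_factor` whenever that dimension is also the packed dimension, which is where the dim combination used by these kernels trips the current logic:

```python
def load_merged_shard(param, loaded_weight, shard_id, output_sizes, tp_rank, tp_size):
    # Attributes set via set_weight_attrs on the quantized parameter.
    output_dim = getattr(param, "output_dim", None)
    packed_dim = getattr(param, "packed_dim", None)
    pack_factor = getattr(param, "pack_factor", 1)

    # Offset/size of this logical shard (e.g. gate vs. up proj) along the output dim.
    shard_offset = sum(output_sizes[:shard_id]) // tp_size
    shard_size = output_sizes[shard_id] // tp_size
    if packed_dim == output_dim:
        # The output dim is stored in packed units, so rescale by the pack factor.
        shard_offset //= pack_factor
        shard_size //= pack_factor

    # Copy this rank's slice of the checkpoint shard into the parameter.
    param_slice = param.data.narrow(output_dim, shard_offset, shard_size)
    start = tp_rank * shard_size
    param_slice.copy_(loaded_weight.narrow(output_dim, start, shard_size))
```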

@bratao commented Apr 6, 2024

@robertgshaw2-neuralmagic any luck with this patch? I benchmarked the kernels, and they are really something. Great boost in my internal tests!

@casper-hansen (Contributor, Author)

@bratao I believe Rob has a branch over in the Neural Magic fork. We discussed how to solve the issues, and it seems there is a path forward for loading the weights correctly. The forward pass in the referenced branch also needs a modification from its current state, similar to the PR I recently created in AutoAWQ.

https://github.com/neuralmagic/nm-vllm/tree/awq_faster_kernel

@casper-hansen (Contributor, Author)

I merged @chu-tianxiang's PR and made some more modifications to catch up to the main branch. I will abandon this PR for now and leave it as a draft for someone else to finish.

Here is my list of issues that I was facing:

  1. The batch dimension is missing from the input tensor being passed around. This breaks the heuristics that AWQ relies on for choosing between kernels, and it is no longer entirely clear what the input contains (illustrated in the sketch after this list).
  2. The forward pass runs but generates no output. I am not sure what is causing this; it is probably the obscure weight loading.
  3. The speed is much slower than benchmarked, indicating an issue either in the vLLM integration or with the heuristics not being triggered correctly. For reference, these kernels are not just a little faster but a lot faster than the previous generation of kernels.
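
A small illustration of issue 1, assuming a hidden size of 4096: once hidden states are flattened to (num_tokens, hidden_size), a decode batch and a prefill prompt with the same total token count become indistinguishable, so a batch-aware kernel heuristic has nothing to work with:

```python
import torch

hidden_size = 4096
decode_batch = torch.zeros(32, 1, hidden_size)   # 32 sequences, 1 new token each -> GEMV territory
prefill_batch = torch.zeros(1, 32, hidden_size)  # 1 sequence, 32 prompt tokens   -> GEMM territory

# vLLM passes a flat (num_tokens, hidden_size) tensor, so both look identical:
print(decode_batch.reshape(-1, hidden_size).shape)   # torch.Size([32, 4096])
print(prefill_batch.reshape(-1, hidden_size).shape)  # torch.Size([32, 4096])
```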
