[WIP] AWQ Faster Kernels #3289

Draft · wants to merge 15 commits into main

Conversation

@casper-hansen (Contributor) commented Mar 8, 2024

New AWQ kernels have been introduced by the AWQ authors:

  • new weight packing format
  • uses semaphores during execution
  • uses a mix of GEMV and GEMM kernels for optimal speed (see the dispatch sketch after this list)
  • decoding speed scales much better
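
For intuition, here is a minimal sketch of the GEMV/GEMM dispatch idea, assuming a simple token-count threshold. `choose_awq_kernel` and the threshold of 8 are illustrative placeholders, not the actual kernel entry points or heuristics:

```python
import torch

def choose_awq_kernel(x: torch.Tensor, gemv_token_limit: int = 8) -> str:
    # Collapse any leading batch/sequence dims into a single token count.
    num_tokens = x.numel() // x.shape[-1]
    # Few tokens (decode) -> memory-bound, a GEMV-style kernel wins.
    # Many tokens (prefill) -> compute-bound, a GEMM-style kernel wins.
    return "gemv" if num_tokens <= gemv_token_limit else "gemm"

print(choose_awq_kernel(torch.zeros(1, 4096)))    # single decode step -> "gemv"
print(choose_awq_kernel(torch.zeros(100, 4096)))  # 100-token prefill  -> "gemm"
```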

Testing Model: casperhansen/mistral-instruct-v0.2-gemvfast-awq

This PR is currently a draft:

  • Include new kernels and build them
  • Implement new weight loading for packed + interleaved weights (see the packing sketch after this list).
  • Implement forward pass
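
As a rough illustration of what "packed + interleaved" means, the sketch below packs 4-bit values into int16 along the input dimension and interleaves groups of output rows. The pack factor of 4 and interleave factor of 4 are assumptions for illustration, not necessarily the exact GEMVFast layout:

```python
import torch

def pack_and_interleave(w_int4: torch.Tensor, pack_factor: int = 4, interleave: int = 4):
    # w_int4: (out_features, in_features) tensor of values in [0, 16).
    out_features, in_features = w_int4.shape
    # Pack `pack_factor` 4-bit values along the input dim into one int16.
    w = w_int4.reshape(out_features, in_features // pack_factor, pack_factor)
    shifts = torch.arange(pack_factor, dtype=torch.int32) * 4
    packed = (w.to(torch.int32) << shifts).sum(dim=-1).to(torch.int16)
    # Interleave groups of `interleave` output rows into the input dim.
    packed = packed.reshape(out_features // interleave, interleave, -1)
    packed = packed.transpose(1, 2).reshape(out_features // interleave, -1)
    # Result: (out_features // interleave, in_features // pack_factor * interleave)
    return packed

print(pack_and_interleave(torch.randint(0, 16, (8, 16))).shape)  # torch.Size([2, 16])
```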

Benchmark (1x A100)

Planning some benchmarks:

python benchmarks/benchmark_throughput.py --input-len 100 --output-len 1000 --quantization awq --num-prompts 100 --max-model-len 1100 --dtype half --model casperhansen/mistral-7b-instruct-v0.1-awq

@casper-hansen casper-hansen marked this pull request as draft March 8, 2024 23:11
@casper-hansen (Contributor, Author)

@WoosukKwon I have used the same shapes as in the original implementation, yet the weights do not load in vLLM, and I am unsure how to fix it. If I add interleaving to the packed shards, nothing changes because the interleaving and pack factor cancel each other out. See WQLinear_GEMVFast in AutoAWQ for reference.

How should we proceed to implement weight loading for this new format?

@shiqingzhangCSU

Hello, is there any progress?

@casper-hansen (Contributor, Author)

@shiqingzhangCSU Currently there is no progress. If you have suggestions or fixes, please open a PR against my fork. I am hoping to get this feature into vLLM soon, but the weight loading is a blocker.

@itsuncheng

@casper-hansen Hi, I'm hitting the same issue. To unblock myself, would you mind sharing which previous version of AutoAWQ works with vLLM?

qweight, {
"input_dim": 1,
"output_dim": 0,
"packed_dim": 1,

Changing this to "packed_dim": 0 allows the weights to load.

@robertgshaw2-neuralmagic (Collaborator) commented Mar 30, 2024

I have identified the source of the issue.

There is faulty logic in MergedColumnParallelLinear and QKVParallelLinear for the case where output_dim=1 AND packed_dim=1. awq_gemv_fast is the first quantization kernel with this case.

I am working on a fix that avoids breaking GPTQ.
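
To make the failure mode concrete, here is a simplified sketch of merged-column shard loading (not the exact vLLM code; `load_merged_shard` and its arguments are illustrative). The point is that shard sizes and offsets along the output dimension must be rescaled by `pack_factor` whenever that dimension is also the packed dimension, which is where the dim combination used by these kernels trips the current logic:

```python
def load_merged_shard(param, loaded_weight, shard_id, output_sizes, tp_rank, tp_size):
    # Attributes set via set_weight_attrs on the quantized parameter.
    output_dim = getattr(param, "output_dim", None)
    packed_dim = getattr(param, "packed_dim", None)
    pack_factor = getattr(param, "pack_factor", 1)

    # Offset/size of this logical shard (e.g. gate vs. up proj) along the output dim.
    shard_offset = sum(output_sizes[:shard_id]) // tp_size
    shard_size = output_sizes[shard_id] // tp_size
    if packed_dim == output_dim:
        # The output dim is stored in packed units, so rescale by the pack factor.
        shard_offset //= pack_factor
        shard_size //= pack_factor

    # Copy this rank's slice of the checkpoint shard into the parameter.
    param_slice = param.data.narrow(output_dim, shard_offset, shard_size)
    start = tp_rank * shard_size
    param_slice.copy_(loaded_weight.narrow(output_dim, start, shard_size))
```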

@bratao commented Apr 6, 2024

@robertgshaw2-neuralmagic any luck with this patch? I benchmarked the kernels, and they are really something. Great boost in my internal tests!

@casper-hansen (Contributor, Author)

@bratao I believe Rob has a branch over in the Neural Magic fork. We discussed how to solve the issues, and it seems there is a path forward for loading the weights correctly. The forward pass in the referenced branch also needs a modification from its current state, similar to the PR I recently created in AutoAWQ.

https://github.com/neuralmagic/nm-vllm/tree/awq_faster_kernel

@casper-hansen (Contributor, Author)

I merged @chu-tianxiang's PR and made some more modifications to catch up to the main branch. I will abandon this PR for now and leave it as a draft for someone else to finish.

Here is my list of issues that I was facing:

  1. The batch dimension is missing from the input tensor being passed around. This breaks the heuristics that AWQ relies on for choosing between kernels, and it is no longer entirely clear what the input contains (illustrated in the sketch after this list).
  2. The forward pass runs but generates no output. I am not sure what is causing this; it is probably the obscure weight loading.
  3. The speed is much slower than benchmarked, indicating an issue either in the vLLM integration or with the heuristics not being triggered correctly. For reference, these kernels are not just a little faster but a lot faster than the previous generation of kernels.
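
A small illustration of issue 1, assuming a hidden size of 4096: once hidden states are flattened to (num_tokens, hidden_size), a decode batch and a prefill prompt with the same total token count become indistinguishable, so a batch-aware kernel heuristic has nothing to work with:

```python
import torch

hidden_size = 4096
decode_batch = torch.zeros(32, 1, hidden_size)   # 32 sequences, 1 new token each -> GEMV territory
prefill_batch = torch.zeros(1, 32, hidden_size)  # 1 sequence, 32 prompt tokens   -> GEMM territory

# vLLM passes a flat (num_tokens, hidden_size) tensor, so both look identical:
print(decode_batch.reshape(-1, hidden_size).shape)   # torch.Size([32, 4096])
print(prefill_batch.reshape(-1, hidden_size).shape)  # torch.Size([32, 4096])
```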
