
Fix gemv_fast model loading #1

Conversation

chu-tianxiang

GEMV fast shares many similarities with Marlin, both in algorithm and in packing. Additional measures are required to manage the tiling/interleaving process so that the weights load properly. Currently I'm following the practice in Marlin. It may be worth some refactoring in the future, for example separating pack_factor into output_pack_factor and input_pack_factor; a rough sketch of that split follows.
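For illustration, a minimal sketch of what that split could look like. The class name, the `interleave` field, and the 4-way default are assumptions for this example, not code from this PR:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PackFactors:
    bits: int = 4        # quantized bit-width of the weights
    interleave: int = 4  # assumed interleaving factor along the output axis

    @property
    def input_pack_factor(self) -> int:
        # elements packed into one 32-bit word along the input-feature axis
        return 32 // self.bits

    @property
    def output_pack_factor(self) -> int:
        # rows grouped along the output-feature axis by the tiling/interleaving step
        return self.interleave
```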

The code works for Llama now, but more shape checks are needed, especially because of the calculate_zeros_width padding; something along the lines of the sketch below.
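As a rough illustration of the kind of check that could be added (the helper name and signature are hypothetical; the actual padding rule lives in calculate_zeros_width):

```python
import torch

def check_qzeros_width(qzeros: torch.Tensor, in_features: int, group_size: int) -> None:
    """Validate that a loaded qzeros tensor covers all quantization groups."""
    groups = in_features // group_size
    if qzeros.shape[-1] < groups:
        raise ValueError(
            f"qzeros last dim {qzeros.shape[-1]} < {groups} groups "
            f"(in_features={in_features}, group_size={group_size})"
        )
    # Columns beyond `groups` are padding introduced by calculate_zeros_width
    # and should be accepted rather than flagged as a shape mismatch.
```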

@garycaokai

Great job! Have you tested the performance of awq_fast? I tried the mistral-instruct-v0.2-gemvfast-awq model, and the performance is not so fast.

@chu-tianxiang
Author

> Great job! Have you tested the performance of awq_fast? I tried the mistral-instruct-v0.2-gemvfast-awq model, and the performance is not so fast.

I only ran a few tests comparing vllm.LLM.generate (old GEMM and new GEMV fast) and AutoAWQForCausalLM.generate. GEMV fast is faster than the old GEMM across all batch sizes. vLLM is comparable to or better than AutoAWQForCausalLM at small batch sizes, while slightly worse at a large batch size (64). The vLLM side of the setup is sketched below.
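For reference, the vLLM side of such a comparison could look roughly like this; the checkpoint path and sampling settings are placeholders, not the exact benchmark used here:

```python
import time
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"] * 64          # vary the multiplier to sweep batch size
params = SamplingParams(temperature=0.0, max_tokens=128)

# Placeholder AWQ checkpoint; swap in the gemv_fast-packed model under test.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"batch={len(prompts)}: {generated / elapsed:.1f} generated tok/s")
```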

@casper-hansen merged commit d157f96 into casper-hansen:awq_faster_kernels on Apr 19, 2024