
Fix gemv_fast model loading #1

Conversation

chu-tianxiang

GEMV fast shares many similarities with Marlin, both in algorithm and in packing. Additional measures are required to manage the tiling/interleaving process so that the weights load properly. Currently I'm following the practice in Marlin. It may be worth some refactoring in the future, for example separating pack_factor into output_pack_factor and input_pack_factor; a rough sketch of that split follows.
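For illustration, a minimal sketch of what that split could look like. The class name, the `interleave` field, and the 4-way default are assumptions for this example, not code from this PR:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PackFactors:
    bits: int = 4        # quantized bit-width of the weights
    interleave: int = 4  # assumed interleaving factor along the output axis

    @property
    def input_pack_factor(self) -> int:
        # elements packed into one 32-bit word along the input-feature axis
        return 32 // self.bits

    @property
    def output_pack_factor(self) -> int:
        # rows grouped along the output-feature axis by the tiling/interleaving step
        return self.interleave
```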

The code works for Llama now, but more shape checks are needed, especially because of the calculate_zeros_width padding; something along the lines of the sketch below.
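As a rough illustration of the kind of check that could be added (the helper name and signature are hypothetical; the actual padding rule lives in calculate_zeros_width):

```python
import torch

def check_qzeros_width(qzeros: torch.Tensor, in_features: int, group_size: int) -> None:
    """Validate that a loaded qzeros tensor covers all quantization groups."""
    groups = in_features // group_size
    if qzeros.shape[-1] < groups:
        raise ValueError(
            f"qzeros last dim {qzeros.shape[-1]} < {groups} groups "
            f"(in_features={in_features}, group_size={group_size})"
        )
    # Columns beyond `groups` are padding introduced by calculate_zeros_width
    # and should be accepted rather than flagged as a shape mismatch.
```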

@garycaokai

Great job! Have you tested the performance of awq_fast? I tried the mistral-instruct-v0.2-gemvfast-awq model, and the performance is not so fast.

@chu-tianxiang
Author

> Great job! Have you tested the performance of awq_fast? I tried the mistral-instruct-v0.2-gemvfast-awq model, and the performance is not so fast.

I only ran a few tests comparing vllm.LLM.generate (old GEMM and new GEMV fast) and AutoAWQForCausalLM.generate. GEMV fast is faster than the old GEMM across all batch sizes. vLLM is comparable to or better than AutoAWQForCausalLM at small batch sizes, while slightly worse at a large batch size (64). The vLLM side of the setup is sketched below.
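For reference, the vLLM side of such a comparison could look roughly like this; the checkpoint path and sampling settings are placeholders, not the exact benchmark used here:

```python
import time
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"] * 64          # vary the multiplier to sweep batch size
params = SamplingParams(temperature=0.0, max_tokens=128)

# Placeholder AWQ checkpoint; swap in the gemv_fast-packed model under test.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"batch={len(prompts)}: {generated / elapsed:.1f} generated tok/s")
```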

@casper-hansen merged commit d157f96 into casper-hansen:awq_faster_kernels on Apr 19, 2024