
LLaMA-3 issues when used with vLLM #452

Open
catid opened this issue Apr 20, 2024 · 5 comments

Comments

@catid

catid commented Apr 20, 2024

I tried these two quantization approaches:

# Attempt 1: gemv_fast kernel
model_path = '/home/catid/models/Meta-Llama-3-70B-Instruct'
quant_path = 'cat-llama-3-70b-q128-w4-gemvfast'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv_fast" }

# Attempt 2: gemv kernel
model_path = '/home/catid/models/Meta-Llama-3-70B-Instruct'
quant_path = 'cat-llama-3-70b-q128-w4-gemvfast'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv" }
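For context, these configs plug into the standard AutoAWQ quantize/save flow. A minimal sketch, assuming the usual AutoAWQForCausalLM API rather than my exact script:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = '/home/catid/models/Meta-Llama-3-70B-Instruct'
quant_path = 'cat-llama-3-70b-q128-w4-gemvfast'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv_fast" }

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration/quantization with the config above
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized weights and tokenizer for serving
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```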

Both result in the same error in vLLM:

  File "/home/catid/sources/vllm/vllm/model_executor/layers/linear.py", line 558, in weight_loader
    loaded_weight = loaded_weight.narrow(input_dim, start_idx,
RuntimeError: start (0) + length (14336) exceeds dimension size (8192).
(RayWorkerWrapper pid=45548) ERROR 04-20 03:14:37 worker_base.py:153] Error executing method load_model. This might cause deadlock in distributed execution.

The gemm version works fine, though.
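For reference, the gemm checkpoint loads in vLLM through the usual AWQ path. A rough sketch; the output directory name and tensor_parallel_size below are placeholders, not values taken from the logs above:

```python
from vllm import LLM, SamplingParams

# Placeholder path to a gemm-version AWQ checkpoint
llm = LLM(
    model="cat-llama-3-70b-q128-w4-gemm",
    quantization="awq",
    tensor_parallel_size=2,  # placeholder; set to your GPU count
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```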

@casper-hansen
Owner

GEMVFast is not implemented in vLLM yet

@casper-hansen
Owner

I'm planning a PR to implement this functionality in vLLM

vllm-project/vllm#3289

@SinanAkkoyun

> I'm planning a PR to implement this functionality in vLLM

Is there an alternative way to get continuous batching with GEMVFast? I'd really like to start generating a new, separate request while the old batch is still running, without waiting for it to finish.

@casper-hansen
Owner

Currently, there is no option for it. You will have to wait until other software packages support it.

@danielstankw

@catid
How much RAM do you use for that? My 31 GB gets overfilled when quantizing the model.
