
LLaMA-3 issues when used with vLLM #452

Open
catid opened this issue Apr 20, 2024 · 5 comments

Comments

@catid

catid commented Apr 20, 2024

I tried these two quantization approaches:

# Attempt 1: gemv_fast kernel
model_path = '/home/catid/models/Meta-Llama-3-70B-Instruct'
quant_path = 'cat-llama-3-70b-q128-w4-gemvfast'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv_fast" }

# Attempt 2: gemv kernel
model_path = '/home/catid/models/Meta-Llama-3-70B-Instruct'
quant_path = 'cat-llama-3-70b-q128-w4-gemvfast'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv" }
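For context, these configs plug into the standard AutoAWQ quantize/save flow. A minimal sketch, assuming the usual AutoAWQForCausalLM API rather than my exact script:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = '/home/catid/models/Meta-Llama-3-70B-Instruct'
quant_path = 'cat-llama-3-70b-q128-w4-gemvfast'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv_fast" }

# Load the FP16 model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Run AWQ calibration/quantization with the config above
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized weights and tokenizer for serving
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```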

Both result in the same error in vLLM:

  File "/home/catid/sources/vllm/vllm/model_executor/layers/linear.py", line 558, in weight_loader
    loaded_weight = loaded_weight.narrow(input_dim, start_idx,
RuntimeError: start (0) + length (14336) exceeds dimension size (8192).
(RayWorkerWrapper pid=45548) ERROR 04-20 03:14:37 worker_base.py:153] Error executing method load_model. This might cause deadlock in distributed execution.

The gemm version works fine, though.
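For reference, the gemm checkpoint loads in vLLM through the usual AWQ path. A rough sketch; the output directory name and tensor_parallel_size below are placeholders, not values taken from the logs above:

```python
from vllm import LLM, SamplingParams

# Placeholder path to a gemm-version AWQ checkpoint
llm = LLM(
    model="cat-llama-3-70b-q128-w4-gemm",
    quantization="awq",
    tensor_parallel_size=2,  # placeholder; set to your GPU count
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```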

@casper-hansen
Owner

GEMVFast is not implemented in vLLM yet

@casper-hansen
Owner

I'm planning a PR to implement this functionality in vLLM

vllm-project/vllm#3289

@SinanAkkoyun

> I'm planning a PR to implement this functionality in vLLM

Is there an alternative way to get continuous batching with GEMVFast? I'd really like to start generating a new, separate request while the old batch is still running, without waiting for it to finish.

@casper-hansen
Owner

Currently, there is no option for it. You will have to wait until other software packages support it.

@danielstankw

@catid
How much RAM do you use for that? My 31 GB gets overfilled when quantizing the model.
