
ValueError: The quantization method awq is not supported for the current GPU. Minimum capability: 80. Current capability: 75. #1282

Closed
wasertech opened this issue Oct 7, 2023 · 2 comments

@wasertech

AutoAWQ states that in order to use AWQ, you need a GPU with:

Compute Capability 7.5 (sm75). Turing and later architectures are supported.
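
For context, my card really does report compute capability 7.5. A quick check with PyTorch (my own snippet, assuming a CUDA build of torch is installed; this is not from AutoAWQ):

import torch

# Returns (major, minor) for the selected device, e.g. (7, 5) on Turing.
major, minor = torch.cuda.get_device_capability(0)
print(f"sm{major}{minor}")  # prints sm75 on a Turing GPU such as a T4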

But when I try to use vLLM to serve my AWQ LLM:

+ python app.py --host 0.0.0.0 --port 5085 --model wasertech/assistant-llama2-7b-chat-awq --tokenizer hf-internal-testing/llama-tokenizer --dtype half --tensor-parallel-size 1 --gpu-memory-utilization 0.65 --quantization awq
Downloading (…)lve/main/config.json: 100%|███████| 677/677 [00:00<00:00, 118kB/s]
INFO 10-07 06:41:25 llm_engine.py:72] Initializing an LLM engine with config: model='wasertech/assistant-llama2-7b-chat-awq', tokenizer='hf-internal-testing/llama-tokenizer', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
Downloading (…)cial_tokens_map.json: 100%|████| 72.0/72.0 [00:00<00:00, 14.2kB/s]
Downloading (…)e6/added_tokens.json: 100%|████| 42.0/42.0 [00:00<00:00, 8.29kB/s]
Downloading (…)okenizer_config.json: 100%|██████| 825/825 [00:00<00:00, 82.4kB/s]
Downloading (…)e6/quant_config.json: 100%|████| 90.0/90.0 [00:00<00:00, 15.4kB/s]
Downloading (…)neration_config.json: 100%|██████| 132/132 [00:00<00:00, 22.2kB/s]
Downloading (…)44be6/tokenizer.json: 100%|██| 1.84M/1.84M [00:00<00:00, 4.09MB/s]
Traceback (most recent call last):
  File "app.py", line 86, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 486, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 270, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 306, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 108, in __init__
    self._init_workers(distributed_init_method)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 140, in _init_workers
    self._run_workers(
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 692, in _run_workers
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py", line 68, in init_model
    self.model = get_model(self.model_config)
  File "/usr/local/lib/python3.8/dist-packages/vllm/model_executor/model_loader.py", line 75, in get_model
    raise ValueError(
ValueError: The quantization method awq is not supported for the current GPU. Minimum capability: 80. Current capability: 75.

Please lower the requirements accordingly.
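
For reference, the gate that raises this lives in vllm/model_executor/model_loader.py (see the traceback above). A rough sketch of that kind of check, assuming it compares major * 10 + minor against a per-method minimum of 80 as reported in the error; this is not vLLM's exact code:

import torch

AWQ_MIN_CAPABILITY = 80  # the minimum this vLLM build enforces for awq, per the error above

major, minor = torch.cuda.get_device_capability()
capability = major * 10 + minor  # 75 on a Turing GPU (sm75)

if capability < AWQ_MIN_CAPABILITY:
    raise ValueError(
        "The quantization method awq is not supported for the current GPU. "
        f"Minimum capability: {AWQ_MIN_CAPABILITY}. Current capability: {capability}.")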

@casper-hansen (Contributor)

#1252 needs to be merged to resolve this; I added the support separately, based on that PR.

@WoosukKwon (Collaborator)

This issue was fixed by #1252.
