ValueError
AutoAWQ states that in order to use AWQ, you need a GPU with:
Compute Capability 7.5 (sm75). Turing and later architectures are supported.
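For reference, a quick way to confirm what a card actually reports (a minimal sketch, assuming a CUDA-enabled PyTorch install; a Turing card such as a T4 or RTX 20xx should print 7.5):

```python
import torch

# Query the compute capability of the first visible GPU.
# Turing cards report (7, 5), i.e. sm75.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor} (sm{major}{minor})")
```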
But when I try to use vLLM to serve my AWQ LLM:
+ python app.py --host 0.0.0.0 --port 5085 --model wasertech/assistant-llama2-7b-chat-awq --tokenizer hf-internal-testing/llama-tokenizer --dtype half --tensor-parallel-size 1 --gpu-memory-utilization 0.65 --quantization awq
Downloading (…)lve/main/config.json: 100%|███████| 677/677 [00:00<00:00, 118kB/s]
INFO 10-07 06:41:25 llm_engine.py:72] Initializing an LLM engine with config: model='wasertech/assistant-llama2-7b-chat-awq', tokenizer='hf-internal-testing/llama-tokenizer', tokenizer_mode=auto, revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=awq, seed=0)
Downloading (…)cial_tokens_map.json: 100%|████| 72.0/72.0 [00:00<00:00, 14.2kB/s]
Downloading (…)e6/added_tokens.json: 100%|████| 42.0/42.0 [00:00<00:00, 8.29kB/s]
Downloading (…)okenizer_config.json: 100%|██████| 825/825 [00:00<00:00, 82.4kB/s]
Downloading (…)e6/quant_config.json: 100%|████| 90.0/90.0 [00:00<00:00, 15.4kB/s]
Downloading (…)neration_config.json: 100%|██████| 132/132 [00:00<00:00, 22.2kB/s]
Downloading (…)44be6/tokenizer.json: 100%|██| 1.84M/1.84M [00:00<00:00, 4.09MB/s]
Traceback (most recent call last):
  File "app.py", line 86, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 486, in from_engine_args
    engine = cls(engine_args.worker_use_ray,
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 270, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/async_llm_engine.py", line 306, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 108, in __init__
    self._init_workers(distributed_init_method)
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 140, in _init_workers
    self._run_workers(
  File "/usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py", line 692, in _run_workers
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py", line 68, in init_model
    self.model = get_model(self.model_config)
  File "/usr/local/lib/python3.8/dist-packages/vllm/model_executor/model_loader.py", line 75, in get_model
    raise ValueError(
ValueError: The quantization method awq is not supported for the current GPU. Minimum capability: 80. Current capability: 75.
Please lower the requirements accordingly.
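For context, the traceback shows the error is raised by a capability gate in vllm/model_executor/model_loader.py. The sketch below only mirrors the behaviour visible in the log; the function name and structure are hypothetical, not vLLM's actual code:

```python
import torch

def check_quant_capability(quant_method: str, min_capability: int, device: int = 0) -> None:
    # Pack (major, minor) into a single integer, e.g. (7, 5) -> 75.
    major, minor = torch.cuda.get_device_capability(device)
    capability = major * 10 + minor
    if capability < min_capability:
        raise ValueError(
            f"The quantization method {quant_method} is not supported for the "
            f"current GPU. Minimum capability: {min_capability}. "
            f"Current capability: {capability}.")

# With the AWQ minimum set to 80, a Turing (sm75) card trips this check,
# even though AutoAWQ itself only requires sm75.
check_quant_capability("awq", 80)
```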
#1252 needs to be merged to resolve this. I added support separately, based on that PR.
This issue was fixed by #1252