Name and Version
$ ./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA L40S, compute capability 8.9, VMM: yes
version: 4625 (5598f47)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
Intel(R) Xeon(R) w5-3425 + NVIDIA L40S
Models
unsloth/DeepSeek-R1-GGUF
Problem description & steps to reproduce
When running inference with llama-cli, generation is CPU-bound and painfully slow (less than one token per second). nvtop shows 0% GPU utilization while the CPU is fully loaded, even though 14 layers and 44 GB have been offloaded to VRAM. I'm following the instructions on Unsloth's blog and running the following command:
!build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 64 \
    --prio 2 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --n-gpu-layers 16 \
    -no-cnv \
    --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
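The nvtop observation can also be recorded as text while the command above is generating, which makes it easier to attach to a report. The following is a minimal sketch, not part of llama.cpp: it assumes `nvidia-smi` is on `PATH`, and the helper names `parse_sample`, `sample_gpus`, and `summarize` are hypothetical. It reads GPU utilization and used memory through `nvidia-smi --query-gpu`.

```python
import subprocess

# Standard nvidia-smi query properties: GPU utilization (%) and used memory (MiB).
QUERY = "utilization.gpu,memory.used"


def parse_sample(line: str) -> tuple[int, int]:
    """Parse one CSV line of the form '<util>, <mem>' into (util_percent, mem_mib)."""
    util, mem = (field.strip() for field in line.split(","))
    return int(util), int(mem)


def sample_gpus() -> list[tuple[int, int]]:
    """Return one (util_percent, mem_mib) pair per GPU. Assumes nvidia-smi is installed."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_sample(line) for line in out.splitlines() if line.strip()]


def summarize(samples: list[tuple[int, int]]) -> tuple[float, int]:
    """Return (mean utilization in percent, peak used memory in MiB) over the samples."""
    utils = [util for util, _ in samples]
    mems = [mem for _, mem in samples]
    return sum(utils) / len(utils), max(mems)
```

Calling `sample_gpus()` once per second from a second terminal during generation and passing the collected pairs for device 0 to `summarize` gives a mean utilization and peak memory figure. A mean near zero alongside tens of GiB of used memory would match the behavior described above.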
First Bad Commit
No response
Relevant log output