
Speculative Decoding slow: Doesn't prioritize offloading Draft model to GPU #458

deftdawg commented Feb 23, 2025

Which version of LM Studio?
LM Studio 0.3.10

System Details

  • NixOS (Linux)
  • w/ 128GB of RAM (CPU)
  • w/ 16GB of VRAM (RX 6900XT)

What is the bug?
The llama.cpp server supports an argument `-ngld` / `--n-gpu-layers-draft` that specifies how many layers of the Draft model to offload to GPU VRAM. As far as I can tell from my limited benchmarking, LM Studio isn't making use of it.
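For reference, a `llama-server` invocation exercising this flag might look like the sketch below (binary path, model filenames, and layer counts are illustrative, not taken from this report):

```sh
# Keep the large target model on CPU (-ngl 0) but fully offload the
# small draft model to VRAM via -ngld / --n-gpu-layers-draft.
./llama-server \
  -m  deepseek-r1-distill-qwen-32b-bf16.gguf \
  -md deepseek-r1-distill-qwen-7b-q4.gguf \
  -ngl 0 \
  -ngld 99
```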

Based on the test runs I've done (see Logs), I would expect Speculative Decoding to run at roughly 3.4× the current speed: with a baseline of 1.5 tok/s for model + draft both on CPU, the ~60% of tokens the draft fails to predict would still arrive at 1.5 tok/s, while the ~40% that are accepted (observation 1) could be produced ~7× faster with the draft fully offloaded to GPU (observation 2): 0.6 × 1.5 + 0.4 × 1.5 × 7 = 0.9 + 4.2 = 5.1 tok/s.
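Written out, this is the same back-of-envelope model, with acceptance rate $a \approx 0.4$ (observation 1), draft GPU speedup $s \approx 7$ (observation 2), and all-CPU baseline $t \approx 1.5$ tok/s:

$$
(1 - a)\,t + a\,t\,s = 0.6 \times 1.5 + 0.4 \times 1.5 \times 7 = 0.9 + 4.2 = 5.1\ \text{tok/s}
$$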

Observations:

  1. 7B Draft model scores >40% token acceptance
  2. 7B Draft model runs ~7× faster fully offloaded to GPU than on CPU
  3. Offloading 10 layers of the Target model with no offloading of the Draft model is only minimally (~10%) faster than not offloading either

Logs

| Model | Device | Speed (tok/sec) | Draft Model | Draft Tokens Accepted |
|---|---|---|---|---|
| deepseek-r1-distill-qwen-7b - Q4 | GPU (28/28 layers offloaded) | 77.77 | N/A | N/A |
| deepseek-r1-distill-qwen-7b - Q4 | CPU | 11.91 | N/A | N/A |
| deepseek-r1-distill-qwen-32b - BF16 | CPU, no Speculative Decoding | 0.97 | N/A | N/A |
| deepseek-r1-distill-qwen-32b - BF16 | CPU w/ Speculative Decoding | 1.59 | deepseek-r1-distill-qwen-7b - Q4 | 403/933 (43.2%) |
| deepseek-r1-distill-qwen-32b - BF16 | CPU w/ Speculative Decoding | 1.29 | deepseek-r1-distill-qwen-1.5b - Q4 | 253/927 (27.3%) |
| deepseek-r1-distill-qwen-32b - BF16 | CPU/GPU (10 layers offloaded) w/ Speculative Decoding | 1.75 | deepseek-r1-distill-qwen-7b - Q4 | 482/940 (51.3%) |