What is the bug?
The llama.cpp server supports an argument -ngld / --n-gpu-layers-draft that specifies how many layers of the Draft model to offload to GPU VRAM. As far as I can tell, based on my limited benchmarking, LM Studio isn't making use of it.
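For reference, this is roughly how that argument is passed when driving llama.cpp's server directly. The flag names (-m, -md, -ngl, -ngld) are the llama-server arguments referenced above; the model paths and layer counts are made up, and wrapping the call in Python is only to keep the sketch self-contained. Whether LM Studio launches the server this way or uses llama.cpp as a library is not something this sketch assumes.

```python
# Hypothetical llama-server launch with the draft model offloaded to GPU.
# Paths and layer counts are illustrative, not from the logs.
import subprocess

cmd = [
    "./llama-server",
    "-m",    "models/target-72b.gguf",   # Target model (hypothetical path)
    "-md",   "models/draft-7b.gguf",     # Draft model (hypothetical path)
    "-ngl",  "10",                       # Target layers offloaded to GPU
    "-ngld", "99",                       # Draft layers offloaded to GPU -- the flag LM Studio appears not to set
]
subprocess.run(cmd, check=True)
```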
Based on the test runs I've done (see Logs), I would expect speculative decoding with the Draft model offloaded to GPU to run at roughly 3.4× the baseline (~340%): starting from 1.5 tok/s with both the Target and Draft models on CPU, 60% × 1.5 tok/s for rejected drafts + 40% × 1.5 tok/s × 7 for accepted drafts (the 40% acceptance is observation 1, the 7× draft speedup is observation 2) = 0.9 + 4.2 = 5.1 tok/s. The arithmetic is sketched after the observations below.
Observations:
1. 7B Draft model scores >40% token acceptance
2. 7B Draft model is ~7 times faster running fully offloaded to GPU than running on CPU
3. Offloading 10 layers of the Target model with no offloading of the Draft model is only marginally faster (~10%) than offloading neither
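A minimal sketch of the arithmetic behind that estimate, using the numbers above. The acceptance rate and the 7× draft speedup come from observations 1 and 2, not from an end-to-end measurement, so this is a back-of-the-envelope expectation rather than a predicted benchmark result.

```python
# Back-of-the-envelope estimate of speculative-decoding throughput with the
# Draft model on GPU, using the observations above.
baseline_tps  = 1.5   # measured: Target + Draft both on CPU, tok/s
acceptance    = 0.40  # observation 1: fraction of drafted tokens accepted
draft_speedup = 7.0   # observation 2: Draft on GPU vs. Draft on CPU

# Rejected drafts come out at the baseline rate; accepted drafts come out
# roughly draft_speedup times faster.
expected_tps = (1 - acceptance) * baseline_tps + acceptance * baseline_tps * draft_speedup
print(f"expected ~{expected_tps:.1f} tok/s vs. {baseline_tps} tok/s baseline "
      f"(~{expected_tps / baseline_tps:.1f}x)")   # ~5.1 tok/s, ~3.4x
```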
Which version of LM Studio?
LM Studio 0.3.10
System Details