What is the bug?
The llama.cpp server supports an argument -ngld / --n-gpu-layers-draft that specifies how many layers of the Draft model to offload to GPU VRAM. As far as I can tell, based on my limited benchmarking, LM Studio isn't making use of it.
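For reference, this is roughly how that argument is passed when driving llama.cpp's server directly. The flag names (-m, -md, -ngl, -ngld) are the llama-server arguments referenced above; the model paths and layer counts are made up, and wrapping the call in Python is only to keep the sketch self-contained. Whether LM Studio launches the server this way or uses llama.cpp as a library is not something this sketch assumes.

```python
# Hypothetical llama-server launch with the draft model offloaded to GPU.
# Paths and layer counts are illustrative, not from the logs.
import subprocess

cmd = [
    "./llama-server",
    "-m",    "models/target-72b.gguf",   # Target model (hypothetical path)
    "-md",   "models/draft-7b.gguf",     # Draft model (hypothetical path)
    "-ngl",  "10",                       # Target layers offloaded to GPU
    "-ngld", "99",                       # Draft layers offloaded to GPU -- the flag LM Studio appears not to set
]
subprocess.run(cmd, check=True)
```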
Based on the test runs I've done (see Logs), I would expect speculative decoding with the Draft model offloaded to GPU to run at roughly 3.4× the baseline (~340%): starting from 1.5 tok/s with both the Target and Draft models on CPU, 60% × 1.5 tok/s for rejected drafts + 40% × 1.5 tok/s × 7 for accepted drafts (the 40% acceptance is observation 1, the 7× draft speedup is observation 2) = 0.9 + 4.2 = 5.1 tok/s. The arithmetic is sketched after the observations below.
Observations:
1. 7B Draft model scores >40% token acceptance
2. 7B Draft model is ~7 times faster running fully offloaded to GPU than running on CPU
3. Offloading 10 layers of the Target model with no offloading of the Draft model is only marginally faster (~10%) than offloading neither
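A minimal sketch of the arithmetic behind that estimate, using the numbers above. The acceptance rate and the 7× draft speedup come from observations 1 and 2, not from an end-to-end measurement, so this is a back-of-the-envelope expectation rather than a predicted benchmark result.

```python
# Back-of-the-envelope estimate of speculative-decoding throughput with the
# Draft model on GPU, using the observations above.
baseline_tps  = 1.5   # measured: Target + Draft both on CPU, tok/s
acceptance    = 0.40  # observation 1: fraction of drafted tokens accepted
draft_speedup = 7.0   # observation 2: Draft on GPU vs. Draft on CPU

# Rejected drafts come out at the baseline rate; accepted drafts come out
# roughly draft_speedup times faster.
expected_tps = (1 - acceptance) * baseline_tps + acceptance * baseline_tps * draft_speedup
print(f"expected ~{expected_tps:.1f} tok/s vs. {baseline_tps} tok/s baseline "
      f"(~{expected_tps / baseline_tps:.1f}x)")   # ~5.1 tok/s, ~3.4x
```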
Which version of LM Studio?
LM Studio 0.3.10
System Details