How to run DeepSeek-R1 IQ1_S 1.58bit at 140 Token/Sec #1591
Reported to llama.cpp: ggerganov/llama.cpp#11474
Same here.
Hardware / performance tables (contents not preserved).
Hey! Whoops, apologies guys - just found out it should be 10 to 14 tokens / s for generation speed and not 140 (140 tok / s is the prompt eval speed) on 2x H100. 😢 Sorry, I didn't get any sleep over the past week since I was too excited to pump out the 1.58bit and release it to everyone. 😢 I mentioned most people should expect to get 1 to 3 tokens / s on most local GPUs, so I'm unsure how I missed the 140 tokens / s. The 140 tokens / s is the prompt eval speed - the generation / decode speed is in fact 10 to 14 tokens / s - so I must have reported the wrong line. E.g. 137.66 tok / s for prompt processing and 10.69 tok / s for decoding.
I've changed the blog post, docs and everywhere else to reflect this issue. I also uploaded a screen recording GIF showing 140 tok / s for prompt eval and 10 tok / s for generation over the first and last minute as an example. So 140 tok / s is the prompt processing / eval speed, and I reported the wrong line - the decoding speed is 10 to 14 tok / s. On further analysis, I can see via OpenRouter https://openrouter.ai/deepseek/deepseek-r1 that the API runs at around 3 or 4 tokens / s for R1. Throughput, though, is a different measure - https://artificialanalysis.ai/models/deepseek-r1/providers reports 60 tok / s for DeepSeek's official API. Assuming around 6 tok / s per single user, that 60 tok / s would correspond to roughly 10 concurrent users, i.e. throughput should be attainable at about 10 * single-user tokens / s.
Thanks again @loretoparisi for reporting the issue! I extremely appreciate the testing and checks - also thanks @ikergarcia1996 for verifying! That said, I hope the 1.58bit model at least functions well as reported! Again, thanks for trying it out!
@danielhanchen further benchmarks
So apparently the inference pipeline is not scaling with the device count.
Still looking to further improve the inference.
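One way to check how much the second GPU actually contributes is to pin the visible devices and compare split modes. A minimal sketch, reusing the model path and flags from the command later in this thread; the reduced layer count for the single-GPU run and the choice of split mode are assumptions, not settings tested here:

```bash
# Compare generation speed with one visible GPU vs two, and layer vs row tensor split.
# CUDA_VISIBLE_DEVICES limits which devices llama.cpp can see.

# Single-GPU baseline (fewer layers offloaded, since 131GB will not fit on one 80GB card):
CUDA_VISIBLE_DEVICES=0 ./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 --threads 12 -no-cnv --n-gpu-layers 30 \
    --ctx-size 8192 --prompt "<|User|>What is the capital of Italy?<|Assistant|>"

# Both GPUs, splitting tensors by rows instead of the default per-layer split:
CUDA_VISIBLE_DEVICES=0,1 ./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 --threads 12 -no-cnv --n-gpu-layers 61 --split-mode row \
    --ctx-size 8192 --prompt "<|User|>What is the capital of Italy?<|Assistant|>"
```

Comparing the timing summary printed at the end of each run shows whether adding the second device (or changing the split) moves the generation speed at all.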
Thanks for posting @loretoparisi. I was scratching my head at the same thing after trying out a few different run parameters. I can also confirm that I tend to get around 12 tokens/second on two H100 80GBs. Note that even though the node I was running on had 8 GPUs, I used only two of them.
Going to look into batching to see if the throughput increases. I see the blog post has been updated, but it still prominently displays "140 tokens/second" as the throughput number (I assume that's with batched inputs?). Was that experimentally proven @danielhanchen? Update: testing with batched inputs, I ran the following command (results here: batch_results.md):
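The command itself and batch_results.md are not preserved above, so as a rough sketch only, a batched throughput measurement with llama.cpp's llama-batched-bench could look like the following; the prompt/generation lengths and parallel levels here are placeholders, not the values actually used:

```bash
# Benchmark throughput at several numbers of parallel sequences (-npl).
# -npp: prompt tokens per sequence, -ntg: generated tokens per sequence,
# -ngl: layers offloaded to GPU, -c: total KV-cache size shared by all sequences.
./llama.cpp/build/bin/llama-batched-bench \
    -m DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    -c 8192 -ngl 61 \
    -npp 128 -ntg 128 \
    -npl 1,2,4,8
```

The tool reports generation speed at each parallel level, which is the throughput number to compare against the single-request 10 to 14 tokens / s.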
@tenzinhl Nice results! Yes, so it's still throughput, i.e. (tokens/s * # of batches), and yes, 140 is batched. The tokens/s can be improved to approx 20 to 30 tokens / s via llama.cpp optimizations - for now it's around 15 tokens / s max. Prompt processing, i.e. the prefill step, can attain much higher tokens / s, but sadly because Flash Attention is not yet enabled, batch processing with Flash Attention won't help yet (it should help reduce latency). DeepSeek themselves use, I think, 3 (or was it 4?) lm_heads for multi-token prediction, so they essentially added speculative decoding - if this were enabled, we could achieve 40 to 50 tokens / s. Currently I don't think llama.cpp supports draft models with different vocabs, but if that were enabled, one would assume using a distilled Llama 8B or Qwen 3B as the draft model would help a lot.
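For reference, llama.cpp exposes Flash Attention through the --flash-attn (-fa) flag. Below is a hedged sketch of requesting it on the command from this thread; as noted above it was not yet usable for this setup at the time, so the flag may simply be ignored or refused:

```bash
# Same invocation as the original report, with Flash Attention requested.
# If unsupported for this model/quantization, llama.cpp typically warns and runs without it.
./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --n-gpu-layers 61 --prio 2 \
    --temp 0.6 --ctx-size 8192 --seed 3407 \
    --flash-attn \
    --prompt "<|User|>What is the capital of Italy?<|Assistant|>"
```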
@danielhanchen Good points! Assuming that speculative decoding can be considered an external component, multi-token prediction could be an option. Regarding the attention, do you mean FlashAttention in llama.cpp?
Following the blog post Run DeepSeek R1 Dynamic 1.58-bit, I tried to reproduce the 140 tokens/second figure when running DeepSeek-R1-UD-IQ1_S, i.e. 1.58-bit / 131GB / IQ1_S.
My setup was to offload all layers to the GPU:
```bash
./llama.cpp/build/bin/llama-cli \
    --model DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    --cache-type-k q4_0 \
    --threads 12 -no-cnv --n-gpu-layers 61 --prio 2 \
    --temp 0.6 \
    --ctx-size 8192 \
    --seed 3407 \
    --prompt "<|User|>What is the capital of Italy?<|Assistant|>"
```
With this config and 2x H100 80GB hardware, the resulting performance was:
The whole Llama.cpp output with model details:
So my top speed was 9-10 tokens per second when offloading 61 layers with 12 threads.
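As a diagnostic for that 9-10 tokens/s ceiling, a sweep over thread counts and offloaded layer counts with llama.cpp's llama-bench can show which knob matters; a minimal sketch with placeholder values, not results from this thread:

```bash
# llama-bench prints prompt-processing (pp) and text-generation (tg) speeds
# for every combination of the comma-separated parameter values.
# -p: prompt length to benchmark, -n: tokens to generate,
# -t: CPU threads, -ngl: layers offloaded to GPU.
./llama.cpp/build/bin/llama-bench \
    -m DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    -p 512 -n 128 \
    -t 8,12,16 \
    -ngl 40,61
```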
How to achieve 140 tokens / second?