Research: Benchmarking DeepSeek-R1 IQ1_S 1.58bit #11474
Comments
Thanks for the replication! Here's a silly nitpick: shouldn't it be 14 tokens per second?
Original:
Hey! Whoops, apologies everyone - just found out it should be 10 to 14 tokens/s for generation speed and not 140 (140 tok/s is the prompt eval speed) on 2x H100. 😢 Sorry, I didn't get any sleep over the past week since I was too excited to pump out the 1.58bit quant and release it to everyone. 😢 I mentioned most people should expect 1 to 3 tokens/s on most local GPUs, so I'm unsure how I missed the 140 tokens/s. The 140 tokens/s is the prompt eval speed - the generation/decode speed is in fact 10 to 14 tokens/s - so I must have reported the wrong line. E.g. 137.66 tok/s for prompt processing and 10.69 tok/s for decoding:
I've changed the blog post, the docs and everywhere else to reflect this. I also uploaded a screen-recording GIF showing 140 tok/s for prompt eval and 10 tok/s for generation, covering the first minute and the last minute as an example. So 140 tok/s is the prompt eval speed and I reported the wrong line - the decoding speed is 10 to 14 tok/s. On further analysis, I can see via OpenRouter https://openrouter.ai/deepseek/deepseek-r1 that the API runs at around 3 or 4 tokens/s for R1. Throughput, though, is a different measure - https://artificialanalysis.ai/models/deepseek-r1/providers reports 60 tok/s for DeepSeek's official API. Assuming 6 tok/s for DeepSeek per single user, that throughput should be attainable at roughly 10 * the single-user tokens/s.
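A quick sanity check on that estimate, taking the numbers above at face value (the 6 tok/s per user and the ~10x batching factor are assumptions from the comment, not measurements):

$$6\ \text{tok/s per user} \times 10\ \text{concurrent users} \approx 60\ \text{tok/s aggregate}$$

which lines up with the ~60 tok/s throughput that artificialanalysis.ai reports for DeepSeek's official API.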
Also @loretoparisi, I extremely appreciate the testing, so thanks again! Hope the 1.58bit model functions well!
You can try to also offload the non-repeating tensors by using
Btw, here is a data point for r1.mp4
The prompt processing reported by
@ggerganov Super cool! Glad it worked well on Mac! I'm a Linux user so I had to ask someone else to test it, but good thing you verified it works smoothly :) Good work again on llama.cpp!
@ggerganov why did you strike out the FA? Is it working now?
Sorry for the confusion. At first I thought that enabling FA only requires support for
@ggerganov this would enable V quantization, right? And maybe some speedups?
It will:
The Metal changes for FA should be relatively simple, I think, if someone wants to take a stab at it.
@ggerganov thanks! I did
with some small improvement
how do I change pipeline parallelism?
It's enabled by default. The
These are the updated results with a larger prompt. It scored
@ggerganov these are test results from
Further tests on 2x H100 / 80GB, matching 12 tokens per second
and benchmarks:
while with 4x H100/80GB @ 214 TFLOPS we have
or
So I didn't see any significant improvement from adding GPUs or increasing threads > 12 right now. Apparently from
Surprised the generation speed (i.e. non-prompt processing) is so similar for the H100, A100 and M2 Ultra? Isn't the memory bandwidth approximately: H100 = 2x A100 = 4x M2 Ultra?
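For reference, the approximate spec-sheet peak memory bandwidths behind that comparison (rough single-GPU figures; splitting a model across several GPUs does not simply add them up for single-stream decoding):

$$\text{H100 (SXM)} \approx 3.3\ \text{TB/s}, \qquad \text{A100 80GB} \approx 2.0\ \text{TB/s}, \qquad \text{M2 Ultra} \approx 0.8\ \text{TB/s}$$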
I'm even more confused now, as people seem to be getting ~1.5 tokens per second using SSDs: https://huggingface.co/unsloth/DeepSeek-R1-GGUF/discussions/13 https://old.reddit.com/r/LocalLLaMA/comments/1iczucy/running_deepseek_r1_iq2xxs_200gb_from_ssd/ At best these have 1/100th the memory bandwidth of even the M2 Ultra?
There's a Reddit thread on this now: https://www.reddit.com/r/LocalLLaMA/comments/1idivqe/the_mac_m2_ultra_is_faster_than_2xh100s_in/
It would be interesting to see the results for K quants (and _0 if anyone can run them).
@ggerganov This is vLLM for comparison, started as
It seems that when there is a
In fact, when you have
Worth noting that PyTorch eager execution mode was enforced.
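For anyone trying to reproduce the comparison, a vLLM launch with eager mode enforced looks roughly like this; the model name and tensor-parallel size are assumptions (based on the Qwen2.5-32B distill mentioned below), not the exact command used:

```bash
# Sketch only: --enforce-eager disables CUDA graph capture (PyTorch eager mode),
# which is what the comment above refers to. Model and TP size are placeholders.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
  --tensor-parallel-size 2 \
  --enforce-eager
```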
@loretoparisi That 10.5 t/s is just the average between 31 T/s and 0 T/s after your prompt ended, so not actually a meaningful number, right?
@loretoparisi
It's the official R1 distillation based on Qwen2.5-32B
But the whole point of this thread is to benchmark the DeepSeek-V3 architecture? :)
Reporting 8 tok/s on 2x A100 (PCIe).
1.58bit R1
Here are our results for DeepSeek-R1 IQ1_S 1.58bit:
AMD EPYC 9654 96-Core, 768GB RAM, 1x Nvidia RTX 3090 (24GB VRAM). Results:
AMD EPYC 7713 64-Core, 952GB RAM, 8x Nvidia L40 (45GB VRAM each, 360GB total VRAM). Results:
AMD EPYC 7V12 64-Core, 1820GB RAM, 8x A100 SXM4 (80GB VRAM each, 640GB total VRAM). Results:
6-11 t/s on 8x 3090. Bigger contexts sit at the lower end, and vice versa.
Speed is plenty good for generation 👍
Out of curiosity, I went searching for a dense ("fat") model with a total parameter count similar to R1's active parameters. I'm running 10x P40s, so both models fit in VRAM.
Falcon 40B - IQ1_S
DeepSeek R1 - IQ1_S
So I had some issues with CUDA running out of memory during prompt processing at 10k+ context, even though it would let me load the model etc. I also picked up another 3090 today, so I have 9x 3090 now. I loaded the
It's worth mentioning that with the MoE architecture each GPU pulls only ~130-150W during inference, but I believe peak utilisation spikes too much, so I limited them to 280W. I'm still playing around with
Without FA, there's a lot of VRAM usage for context. I'm also using -ub at 128 to fit a bigger context. It's also unbalanced when splitting layers; here's how it looks. It's hard to get the balance right with tensor split.
Prompt-wise, here you go:
Prompt processing is quite slow with -ub 128. Token generation also got quite a bit slower. I would say that's a combination of the bigger quant, -ub 128, and the GPUs being limited to 280W. GPU utilisation during inference sits around 10%, so I believe there is huge potential for optimisation here.
Here are some more examples of inference:
EDIT: Was able to get -ub 256 working with an OK context size: `-ub 256 --ctx-size 9216 -m DeepSeek-R1-UD-IQ1_M`
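For context, the kind of invocation being discussed here looks roughly like the following; the model path, layer count, split ratios and thread count are illustrative placeholders, not the exact command used:

```bash
# Sketch of a multi-GPU llama.cpp run: a small -ub keeps per-GPU compute buffers
# modest so a larger --ctx-size fits, and --tensor-split gives the first GPU a
# smaller share since it also tends to accumulate extra buffers during inference.
./llama-cli \
  -m DeepSeek-R1-UD-IQ1_M.gguf \
  -ngl 99 \
  --ctx-size 9216 \
  -ub 256 \
  --tensor-split 3,4,4,4,4,4,4,4,4 \
  -t 16 \
  -p "Hello"
```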
I am the author of the PR mentioned above. To enable FA, the main issue I ran into was that ggml does not yet support padding FP16 tensors. I added a CUDA kernel to do FP16 padding, and now I have generation working, but there is a bug that I need to track down. I will work on this some more tomorrow, but I wanted to get an OK from the code owners that this seems like a reasonable approach. cc @loretoparisi
Really appreciate your work, looking forward to trying your PR when you update it!
Hi, just wanted to add my results here for the community: the GPUs are only running at 25%, so I presume we are data-transfer-rate/memory-bandwidth constrained. Tweaking tensor-split and gpu-layers helps max out memory usage on the GPUs. Interestingly enough, during inference GPU0's memory usage ticks up, hence the 19/20 layer split. Increasing threads does not meaningfully increase the tokens/s rate.
With --ctx 16384 (this needs to be reduced as more memory is used; system memory usage jumps from ~10GB to 120GB): slower at 2.43 tok/s, but it also appears to hallucinate/get confused in its thinking:
Just for comparison, using 2x A6000 via local RPC yielded:
I've been messing around with DeepSeek-R1 IQ1_S 1.58bit on my M1 Ultra 128GB Studio, as well as experimenting with running it over RPC. With just the M1 Ultra, with 36 GPU layers and 1024 tokens of context (after a fresh restart and setting iogpu.wired_limit_mb),
results look like this
and then if I use the M1 Ultra plus RPC (through a wired gigabit router) to 2 machines (a 3090 and a 4090), with 61 GPU layers and 8k context,
results look like this:
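For anyone else on Apple Silicon, the iogpu.wired_limit_mb knob mentioned above is usually raised like this; the value is an example, not the one used here, and the setting resets on reboot:

```bash
# Allow the GPU to wire up to ~118 GB of unified memory so more of the model
# stays resident; pick a value comfortably below total RAM. Example value only.
sudo sysctl iogpu.wired_limit_mb=118000
```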
Just to add an RPC data point: I was experimenting with my 2x A6000 setup with RPC to 2x 3090 over a slowish 100Mb network. Even though I could load the entire model in memory, the network latency cancels out any gains. At startup it can take up to 30 minutes to transfer the model layers to the remote servers (3090s). With a very small ctx size (1024), I get just over 1.2 tok/s:
The initial loading time is also painfully long, taking up to 30 minutes to get the 2x 3090 loaded with model data. I am investigating whether it might be possible to have the remote servers load the model locally if it is available on disk, instead of over the network.
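For reference, the RPC setup being described is along these lines; hostnames, ports and the model path are placeholders, and this is a sketch rather than the exact commands used:

```bash
# On each remote GPU machine: expose its backend over the network
# (requires a llama.cpp build with RPC support, e.g. -DGGML_RPC=ON).
./rpc-server -p 50052

# On the main machine: list the remote backends so layers get distributed
# across local and remote GPUs; everything else mirrors a normal local run.
./llama-cli -m DeepSeek-R1-UD-IQ1_S.gguf -ngl 99 --ctx-size 1024 \
  --rpc 192.168.1.10:50052,192.168.1.11:50052 \
  -p "Hello"
```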
Epyc 7402 24c/48t
UD-IQ1_S (1.58bit) on 4x A100-SXM-40G.
one-batch:
batched-bench:
memory use:
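For anyone unfamiliar with the tool, a batched-bench run of the kind summarised above looks roughly like this; the sizes are illustrative placeholders, not the settings used:

```bash
# llama-batched-bench sweeps prompt length (-npp), generated tokens (-ntg) and
# the number of parallel sequences (-npl) to measure batched throughput.
./llama-batched-bench -m DeepSeek-R1-UD-IQ1_S.gguf -ngl 99 \
  -c 16384 -b 2048 -ub 512 \
  -npp 128,256 -ntg 128 -npl 1,2,4,8
```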
I tried to do the same thing on our server, but I used the model
CPU: EPYC 9654 * 2
Here are DeepSeek-R1-UD-IQ2_XXS's results:
DeepSeek-R1-UD-IQ1_M's results:
llama_perf_sampler_print: sampling time = 0.84 ms / 21 runs (0.04 ms per token, 25000.00 tokens per second)
It gets 5 tokens/second. I'm also trying to open up the server APIs for UI calls - could you advise me which number can be used for the concurrent-usage settings?
@CHN-STUDENT Unlike the batched-bench test cases, llama.cpp or ollama can't batch multiple requests in a real server UI. It seems the requests run sequentially no matter how many '--parallel' you set, so the speed is very slow. If you want batched inference, you can use vLLM or SGLang.
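For completeness, the server-side knobs in question look like the sketch below; whether they give real concurrency for this model is exactly what's being discussed above, and note that each parallel slot only gets a share of the total context:

```bash
# Hypothetical llama-server launch with 4 slots; with -c 16384 and -np 4 each
# request gets a 4096-token context. Values are illustrative only.
./llama-server -m DeepSeek-R1-UD-IQ1_M.gguf -ngl 99 \
  -c 16384 -np 4 \
  --host 0.0.0.0 --port 8080
```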
Research Stage
Previous existing literature and research
Command
Model
DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S
1.58Bit, 131GB
Hardware
Hypothesis
Reported performance is 140 tokens/second
Implementation
No response
Analysis
Llama.cpp Performance Analysis
Raw Benchmarks
Detailed Analysis
1. Token Sampling Performance
2. Model Loading
3. Prompt Evaluation
4. Generation Evaluation
5. Total Processing Time
Key Insights
Performance Bottlenecks:
Processing Stages:
Overall Performance:
Relevant log output